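For the full code, see Knowsmore.
The list page here refers to the PC (desktop) site's entry pages, such as the Movies category.
After scraping, each record looks like this: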
{
    "link": "//v.youku.com/v_show/id_XMzMyMzE2MTMxNg==.html",
    "thumb_img": "http://r1.ykimg.com/051600005AD944F0859B5E040E03BD62",
    "title": "大毛狗",
    "tag": ["VIP"],
    "actors": ["何明翰", "张璇"],
    "play_times": " 历史 2,236万次播放 "
}
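The record above is stored as scraped: the link is protocol-relative and play_times is a display string with padding and a 万 (ten thousand) suffix. The following is a minimal sketch of how such a record might be cleaned up afterwards. The helper names (parse_play_times, normalize_link, normalize_item) are hypothetical and are not part of Knowsmore; the handling of 亿 is an assumption about what other records may contain.

```python
import re

# Multipliers for Chinese magnitude suffixes (亿 is assumed; only 万 appears in the sample).
UNIT_MULTIPLIERS = {"万": 10_000, "亿": 100_000_000}


def parse_play_times(raw):
    """Turn a display string such as ' 历史 2,236万次播放 ' into an integer count, or None."""
    if not raw:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([万亿]?)", raw)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    multiplier = UNIT_MULTIPLIERS.get(match.group(2), 1)
    return int(round(number * multiplier))


def normalize_link(link, scheme="https"):
    """Prefix a protocol-relative URL ('//host/path') with a scheme; leave others unchanged."""
    if link and link.startswith("//"):
        return f"{scheme}:{link}"
    return link


def normalize_item(item):
    """Return a cleaned copy of a scraped record without mutating the original dict."""
    cleaned = dict(item)
    cleaned["link"] = normalize_link(item.get("link"))
    cleaned["play_times"] = parse_play_times(item.get("play_times"))
    return cleaned


if __name__ == "__main__":
    sample = {
        "link": "//v.youku.com/v_show/id_XMzMyMzE2MTMxNg==.html",
        "title": "大毛狗",
        "play_times": " 历史 2,236万次播放 ",
    }
    print(normalize_item(sample))
```

A step like this could live in an item pipeline, so the spider itself stays limited to extraction.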
# -*- coding: utf-8 -*-
import scrapy
import re
import json
from scrapy import Selector, Request

from knowsmore.items import YoukuListItem
from ..common import *
from ..model.mongodb import *


class YoukuListSpider(scrapy.Spider):
    name = "youku_list"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {}
    }
    start_urls = [
        "https://list.youku.com/category/show/c_96_s_1_d_4_p_29.html"
    ]

    def parse(self, response):
        # CSS selectors, relative to a single grid cell
        GRID_SELECTOR = ".panel .mr1"
        THUMB_IMG_SELECTOR = ".p-thumb img::attr(_src)"
        LINK_SELECTOR = ".info-list .title a::attr(href)"
        TITLE_SELECTOR = ".info-list .title a::text"
        ACTORS_SELECTOR = ".info-list .actor a::text"
        TAG_SELECTOR = ".p-thumb .p-thumb-tagrt span::text"
        PLAY_TIMES_SELECTOR = ".info-list li:nth-child(3)::text"

        for grid in response.css(GRID_SELECTOR):
            # Single-valued fields use extract_first(); multi-valued fields use extract()
            item_thumb_img = grid.css(THUMB_IMG_SELECTOR).extract_first()
            item_link = grid.css(LINK_SELECTOR).extract_first()
            item_title = grid.css(TITLE_SELECTOR).extract_first()
            item_actors = grid.css(ACTORS_SELECTOR).extract()
            item_tag = grid.css(TAG_SELECTOR).extract()
            item_play_times = grid.css(PLAY_TIMES_SELECTOR).extract_first()

            # Build Scrapy Item
            youku_item = YoukuListItem(
                thumb_img=item_thumb_img,
                link=item_link,
                title=item_title,
                actors=item_actors,
                play_times=item_play_times,
                tag=item_tag
            )

            # Send to Pipelines
            yield youku_item

        # Follow the "next page" link, if any; parse() handles the next page too
        NEXT_PAGE_SELECTOR = ".yk-pages .next a::attr(href)"
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page is not None:
            self.logger.info("Next page: %s", next_page)
            yield response.follow(next_page, callback=self.parse)
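The spider passes the raw href of the "next" link to response.follow, which resolves it against the URL of the current page before requesting it. The sketch below illustrates that resolution step with the standard library's urljoin. The href values are hypothetical examples, not values captured from Youku, and resolve_next_page is an illustrative helper rather than part of Scrapy or Knowsmore.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Resolve a possibly relative or protocol-relative href against the current page URL."""
    return urljoin(current_url, href)


if __name__ == "__main__":
    current = "https://list.youku.com/category/show/c_96_s_1_d_4_p_29.html"

    # Hypothetical protocol-relative href: the scheme is taken from the current page.
    print(resolve_next_page(current, "//list.youku.com/category/show/c_96_s_1_d_4_p_30.html"))

    # Hypothetical relative href: only the last path segment is replaced.
    print(resolve_next_page(current, "c_96_s_1_d_4_p_30.html"))
```

Because the resolution happens inside response.follow, the spider does not need to build absolute URLs by hand.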
When reposting, please cite this article's address: https://www.ucloud.cn/yun/42980.html