Scraping data displayed on an embedded map with Python Scrapy

Bryan, published 2019-07-31 10:20 / 2056 reads

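Summary: On this page, for example, I need the distribution of logistics parks across China's major cities, along with the detail information for each park. The page embeds a map, and each city's logistics information appears only when you click that city on the map individually.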

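I recently needed to scrape the official website of a logistics company. At first glance the site was mostly static pages, which are easy to scrape, unlike news or e-commerce sites with their complex structure, strict anti-scraping measures, and heavy use of AJAX. I was quietly relieved, until a closer look showed that not every page was an ordinary static page.
Take this URL, for example: I need the distribution of logistics parks across China's major cities, along with the detail information for each park.
The page embeds a map, and each city's logistics information is shown only after you click that city on the map individually.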
https://www.glprop.com.cn/our...

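My first guess was that the data came from an AJAX request, but capturing the traffic in Chrome showed nothing of the kind. I then viewed the page source
and found that all the city information sits inside a single script block.
[Figure: page source showing the city data inside the script block]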

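The details of each park are stored in a variable assigned as parks = {xx}.

So everything is right there in the page: fetch the source, match it with regular expressions, and get to work.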
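
To make the idea concrete, here is a minimal sketch of pulling the two embedded variables out of the page source with the standard library only. The sample HTML, the field values, and the helper name extract_js_var are assumptions made up for illustration; the real page is not reproduced here. The sketch also assumes that each JavaScript literal is valid JSON and sits on a single line, which is what the regular-expression approach relies on.

```python
import json
import re

# Hypothetical, shortened stand-in for the real page source.
SAMPLE_HTML = """
<script>
var cities = [{"id": "1", "name": "Shanghai"}, {"id": "2", "name": "Beijing"}];
var parks = {"1": {"10": {"city_id": "1", "name": "Park A"}}, "2": {"20": {"city_id": "2", "name": "Park B"}}};
</script>
"""


def extract_js_var(html, name):
    """Return the JSON value assigned on one line as `var <name> = ...;`."""
    # `.` does not match newlines, so the capture stops at the end of the line;
    # the trailing `;` of the statement is left out of the captured group.
    match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*(.*);", html)
    if match is None:
        raise ValueError("variable %r not found" % name)
    return json.loads(match.group(1))


cities = extract_js_var(SAMPLE_HTML, "cities")
parks = extract_js_var(SAMPLE_HTML, "parks")
print(cities[1]["name"])
print(parks["1"]["10"]["name"])
```

If the literal spans several lines or is not strict JSON (single quotes, trailing commas), this approach breaks and a JavaScript-aware parser would be the safer choice.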
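items:

# GLP (普洛斯)
import scrapy


class PuluosiNewsItem(scrapy.Item):
    newstitle = scrapy.Field()  # news headline
    newtiems = scrapy.Field()   # publication time (field name kept as in the original)
    newslink = scrapy.Field()   # absolute URL of the news article


class PuluosiItem(scrapy.Item):
    assetstitle = scrapy.Field()    # park (asset) name
    assetaddress = scrapy.Field()   # city the park belongs to
    assetgaikuang = scrapy.Field()  # park overview
    assetpeople = scrapy.Field()    # other information, taken from the description
    asseturl = scrapy.Field()       # URL of the city detail page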

pipelines:

from openpyxl import Workbook

from news.items import PuluosiNewsItem, PuluosiItem


class PuluosiNewsPipeline(object):
    def __init__(self):
        # Workbook for the news items
        self.wb = Workbook()
        self.ws = self.wb.active
        # Header row
        self.ws.append(["GLP news title", "Publication time", "News URL"])
        # Workbook for the logistics park (asset) items
        self.wb2 = Workbook()
        self.ws2 = self.wb2.active
        self.ws2.append(["Asset title", "Asset address", "Asset overview", "Other information", "URL"])

    def process_item(self, item, spider):
        if isinstance(item, PuluosiNewsItem):
            # Collect the fields of the item into one row
            line = [item["newstitle"], item["newtiems"], item["newslink"]]
            self.ws.append(line)
            self.wb.save("PuluosiNews.xlsx")  # the xlsx file is rewritten after every item
        elif isinstance(item, PuluosiItem):
            line = [item["assetstitle"], item["assetaddress"], item["assetgaikuang"], item["assetpeople"], item["asseturl"]]
            self.ws2.append(line)
            self.wb2.save("PuluosiAsset.xlsx")  # the xlsx file is rewritten after every item
        return item
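
A pipeline only runs once it is enabled in the project settings, which the listing above does not show. A minimal sketch of that entry follows; the module path news.pipelines is an assumption inferred from the `from news.items import ...` line in the spider, and the priority value 300 is an arbitrary choice within Scrapy's usual 0 to 1000 range.

```python
# settings.py of the Scrapy project.
# Assumption: the pipeline class lives in news/pipelines.py, next to news/items.py.
ITEM_PIPELINES = {
    # Lower numbers run earlier when several pipelines are enabled.
    "news.pipelines.PuluosiNewsPipeline": 300,
}
```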

spider:

# -*- coding: utf-8 -*-
import json
import re

import scrapy
from scrapy.linkextractors import LinkExtractor

from news.items import PuluosiNewsItem, PuluosiItem

class PuluosiSpider(scrapy.Spider):
    name = "puluosi"
    allowed_domains = ["glprop.com.cn"]
    # start_urls = ["https://www.glprop.com.cn/press-releases.html"]

    def start_requests(self):
        yield scrapy.Request("https://www.glprop.com.cn/press-releases.html", self.parse1)
        yield scrapy.Request("https://www.glprop.com.cn/in-the-news.html", self.parse2)
        yield scrapy.Request("https://www.glprop.com.cn/proposed-privatization.html", self.parse3)
        yield scrapy.Request("https://www.glprop.com.cn/our-network/network-detail.html", self.parse4)

    def parse1(self, response):
        print("Spider now running: puluosi")
        rows = response.xpath("//tbody/tr")[1:]  # skip the first (header) row
        for node in rows:
            item = PuluosiNewsItem()  # a fresh item per row, so yielded items are not overwritten
            item["newstitle"] = node.xpath(".//a/text()").extract()[0].strip()
            print(item["newstitle"])
            item["newtiems"] = node.xpath(".//td/text()").extract()[0].strip()
            print(item["newtiems"])
            # urljoin builds an absolute link; needed when the href in the page is a relative path
            item["newslink"] = response.urljoin(node.xpath(".//a/@href").extract()[0])
            # print(item["newslink"])
            yield item
        # Check whether the news list for the current year has a "next page" (下一页) link
        next_url_tmp = response.xpath('//div[@class="page"]/a[contains(text(),"下一页")]/@href').extract_first()
        if next_url_tmp:
            yield scrapy.Request(response.urljoin(next_url_tmp), callback=self.parse1)
        else:
            print("No next page on the current page")
        # Follow every link in the timeList block
        for url1 in response.xpath('//ul[@class="timeList"]/li/a/@href').extract():
            if url1:
                yield scrapy.Request(response.urljoin(url1), callback=self.parse1)

    def parse2(self, response):
        rows = response.xpath("//tbody/tr")[1:]  # skip the first (header) row
        for node in rows:
            item = PuluosiNewsItem()  # a fresh item per row, so yielded items are not overwritten
            item["newstitle"] = node.xpath(".//a/text()").extract()[0].strip()
            print(item["newstitle"])
            item["newtiems"] = node.xpath(".//td/text()").extract()[0].strip()
            print(item["newtiems"])
            # urljoin builds an absolute link; needed when the href in the page is a relative path
            item["newslink"] = response.urljoin(node.xpath(".//a/@href").extract()[0])
            print(item["newslink"])
            yield item
        # Check whether the news list for the current year has a "next page" (下一页) link
        next_url_tmp = response.xpath('//div[@class="page"]/a[contains(text(),"下一页")]/@href').extract_first()
        if next_url_tmp:
            yield scrapy.Request(response.urljoin(next_url_tmp), callback=self.parse2)
        else:
            print("No next page on the current page")
        # Follow every link in the timeList block
        for url1 in response.xpath('//ul[@class="timeList"]/li/a/@href').extract():
            if url1:
                yield scrapy.Request(response.urljoin(url1), callback=self.parse2)

    def parse3(self, response):
        # Drops the last row, as the original pop() did (parse1 and parse2 drop the first row)
        rows = response.xpath("//tbody/tr")[:-1]
        for node in rows:
            item = PuluosiNewsItem()  # a fresh item per row, so yielded items are not overwritten
            item["newstitle"] = node.xpath(".//a/text()").extract()[0].strip()
            print(item["newstitle"])
            item["newtiems"] = node.xpath(".//td/text()").extract()[0].strip()
            print(item["newtiems"])
            # urljoin builds an absolute link; needed when the href in the page is a relative path
            item["newslink"] = response.urljoin(node.xpath(".//a/@href").extract()[0])
            print(item["newslink"])
            yield item

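    # Link-following variant. The original listing also named this method parse4,
    # so the later parse4 definition replaced it and it never ran. It is renamed
    # here to keep both; it is not wired into start_requests.
    def parse4_city_links(self, response):
        link = LinkExtractor(restrict_xpaths='//div[@class="net_pop1"]//div[@class="city"]')
        links = link.extract_links(response)
        # Collect the links of all cities
        for i in links:
            detailurl = i.url
            # parse5 is referenced by the original listing but not defined in it
            yield scrapy.Request(url=detailurl, callback=self.parse5)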

    def parse4(self, response):
        # The page embeds its data in a script block as `var cities = ...;` and `var parks = ...;`
        citycode = re.findall("var cities =(.*);", response.text)
        citycodejson = json.loads("".join(citycode))
        # Put the id and name of every city into a dict
        dictcity = {}
        for i in citycodejson:
            dictcity[i["id"]] = i["name"]
        detail = re.findall("var parks =(.*);", response.text)
        jsonBody = json.loads("".join(detail))
        # Flatten the two-level parks mapping into a list of park records
        # (named parks rather than list, to avoid shadowing the builtin)
        parks = []
        for key1 in jsonBody:
            for key2 in jsonBody[key1]:
                parks.append(jsonBody[key1][key2])
        for node in parks:
            item = PuluosiItem()  # a fresh item per park, so yielded items are not overwritten
            item["assetaddress"] = dictcity[node["city_id"]]
            item["assetstitle"] = node["name"]
            item["assetgaikuang"] = node["detail_single"].strip().replace(" ", "")
            # Strip the HTML tags from the description
            item["assetpeople"] = re.sub(r"<.*?>", "", node["description"].strip()).replace(" ", "")
            item["asseturl"] = "https://www.glprop.com.cn/network-city-detail.html?city=" + item["assetaddress"]
            yield item
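
The second half of parse4 (flatten the nested parks mapping, look up the city name, strip tags and spaces) can be tried without Scrapy. The sketch below runs the same steps on made-up data; the sample values and the helper name strip_tags are assumptions for illustration only.

```python
import re

# Hypothetical sample data shaped like the parsed `cities` and `parks` values.
cities = [{"id": "1", "name": "Shanghai"}]
parks = {
    "1": {
        "10": {
            "city_id": "1",
            "name": "Park A",
            "detail_single": "  Area: 100 000 sqm ",
            "description": "<p>Contact: <b>Sales</b> team</p>",
        }
    }
}

# Map each city id to its name.
city_names = {city["id"]: city["name"] for city in cities}


def strip_tags(html):
    """Remove anything that looks like an HTML tag (non-greedy match)."""
    return re.sub(r"<.*?>", "", html.strip())


rows = []
for group in parks.values():       # outer level of the mapping
    for park in group.values():    # inner level: one record per park
        rows.append({
            "assetstitle": park["name"],
            "assetaddress": city_names[park["city_id"]],
            "assetgaikuang": park["detail_single"].strip().replace(" ", ""),
            "assetpeople": strip_tags(park["description"]).replace(" ", ""),
        })

print(rows[0])
```

Removing every space suits Chinese text, where spaces are mostly layout noise; on English text it glues words together, as the sample output shows.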

While I was at it, I also scraped the news pages of the site.
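
For the news pages, the spider resolves each href with response.urljoin. The effect can be seen with the standard library's urljoin, which follows the same resolution rules. The href values below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://www.glprop.com.cn/press-releases.html"

# Hypothetical href values as they might appear in the page.
root_relative = urljoin(page_url, "/press-releases/2019.html")
page_relative = urljoin(page_url, "detail.html?id=1")
already_absolute = urljoin(page_url, "https://example.com/a.html")

print(root_relative)
print(page_relative)
print(already_absolute)
```

Plain string concatenation with the domain works only for the root-relative form; urljoin handles all three cases.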


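Copyright belongs to the author; please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite the address of this article: https://www.ucloud.cn/yun/43670.html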
