
Scrapy Crawler - Fetching Zhihu User Data

Miyang / 1803 reads


2016-04-10

Installing the Scrapy framework

Installing Python and the Scrapy framework is not covered here; please look up the instructions online.

Initialization

Once Scrapy is installed, run scrapy startproject myspider
This creates a myspider folder with the following directory structure:

scrapy.cfg
myspider/
    items.py
    pipelines.py
    settings.py
    __init__.py
    spiders/
        __init__.py

Writing the spider

Create users.py in the spiders directory:

# -*- coding: utf-8 -*-
import scrapy
import os
import time
from myspider.items import UserItem
from myspider.myconfig import UsersConfig # crawler settings

class UsersSpider(scrapy.Spider):
    name = "users"
    domain = "https://www.zhihu.com"
    login_url = "https://www.zhihu.com/login/email"
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

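    def __init__(self, url = None, *args, **kwargs):
        super(UsersSpider, self).__init__(*args, **kwargs)
        # Seed profile URL, passed on the command line with -a url=...
        self.user_url = url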

    def start_requests(self):
        yield scrapy.Request(
            url = self.domain,
            headers = self.headers,
            meta = {
                "proxy": UsersConfig["proxy"],
                "cookiejar": 1
            },
            callback = self.request_captcha
        )

    def request_captcha(self, response):
        # Read the _xsrf token from the login form
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        # Build the captcha URL
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        # Request the captcha image
        yield scrapy.Request(
            url = captcha_url,
            headers = self.headers,
            meta = {
                "proxy": UsersConfig["proxy"],
                "cookiejar": response.meta["cookiejar"],
                "_xsrf": _xsrf
            },
            callback = self.download_captcha
        )

    def download_captcha(self, response):
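        # Save the captcha image
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        # Open the image in the default viewer (`start` is a Windows command)
        os.system("start captcha.gif")
        # Read the captcha typed by the user
        print("Please enter captcha: ")
        try:
            read_line = raw_input  # Python 2
        except NameError:
            read_line = input  # Python 3
        captcha = read_line()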

        yield scrapy.FormRequest(
            url = self.login_url,
            headers = self.headers,
            formdata = {
                "email": UsersConfig["email"],
                "password": UsersConfig["password"],
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            meta = {
                "proxy": UsersConfig["proxy"],
                "cookiejar": response.meta["cookiejar"]
            },
            callback = self.request_zhihu
        )

    def request_zhihu(self, response):
        yield scrapy.Request(
            url = self.user_url + "/about",
            headers = self.headers,
            meta = {
                "proxy": UsersConfig["proxy"],
                "cookiejar": response.meta["cookiejar"],
                "from": {
                    "sign": "else",
                    "data": {}
                }
            },
            callback = self.user_item,
            dont_filter = True
        )

        yield scrapy.Request(
            url = self.user_url + "/followees",
            headers = self.headers,
            meta = {
                "proxy": UsersConfig["proxy"],
                "cookiejar": response.meta["cookiejar"],
                "from": {
                    "sign": "else",
                    "data": {}
                }
            },
            callback = self.user_start,
            dont_filter = True
        )

        yield scrapy.Request(
            url = self.user_url + "/followers",
            headers = self.headers,
            meta = {
                "proxy": UsersConfig["proxy"],
                "cookiejar": response.meta["cookiejar"],
                "from": {
                    "sign": "else",
                    "data": {}
                }
            },
            callback = self.user_start,
            dont_filter = True
        )

    def user_start(self, response):
        sel_root = response.xpath('//h2[@class="zm-list-content-title"]')
        # Only continue if the follow list is not empty
        if len(sel_root):
            for sel in sel_root:
                people_url = sel.xpath("a/@href").extract()[0]

                yield scrapy.Request(
                    url = people_url + "/about",
                    headers = self.headers,
                    meta = {
                        "proxy": UsersConfig["proxy"],
                        "cookiejar": response.meta["cookiejar"],
                        "from": {
                            "sign": "else",
                            "data": {}
                        }
                    },
                    callback = self.user_item,
                    dont_filter = True
                )

                yield scrapy.Request(
                    url = people_url + "/followees",
                    headers = self.headers,
                    meta = {
                        "proxy": UsersConfig["proxy"],
                        "cookiejar": response.meta["cookiejar"],
                        "from": {
                            "sign": "else",
                            "data": {}
                        }
                    },
                    callback = self.user_start,
                    dont_filter = True
                )

                yield scrapy.Request(
                    url = people_url + "/followers",
                    headers = self.headers,
                    meta = {
                        "proxy": UsersConfig["proxy"],
                        "cookiejar": response.meta["cookiejar"],
                        "from": {
                            "sign": "else",
                            "data": {}
                        }
                    },
                    callback = self.user_start,
                    dont_filter = True
                )

    def user_item(self, response):
        def value(list):
            return list[0] if len(list) else ""

        sel = response.xpath('//div[@class="zm-profile-header ProfileCard"]')

        item = UserItem()
        # Drop the trailing "/about" (6 characters) to get the profile URL
        item["url"] = response.url[:-6]
        item["name"] = sel.xpath('//a[@class="name"]/text()').extract()[0].encode("utf-8")
        item["bio"] = value(sel.xpath('//span[@class="bio"]/@title').extract()).encode("utf-8")
        item["location"] = value(sel.xpath('//span[contains(@class, "location")]/@title').extract()).encode("utf-8")
        item["business"] = value(sel.xpath('//span[contains(@class, "business")]/@title').extract()).encode("utf-8")
        # 0 = female, 1 = anything else
        item["gender"] = 0 if sel.xpath('//i[contains(@class, "icon-profile-female")]') else 1
        item["avatar"] = value(sel.xpath('//img[@class="Avatar Avatar--l"]/@src').extract())
        item["education"] = value(sel.xpath('//span[contains(@class, "education")]/@title').extract()).encode("utf-8")
        item["major"] = value(sel.xpath('//span[contains(@class, "education-extra")]/@title').extract()).encode("utf-8")
        item["employment"] = value(sel.xpath('//span[contains(@class, "employment")]/@title').extract()).encode("utf-8")
        item["position"] = value(sel.xpath('//span[contains(@class, "position")]/@title').extract()).encode("utf-8")
        item["content"] = value(sel.xpath('//span[@class="content"]/text()').extract()).strip().encode("utf-8")
        item["ask"] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[2]/span[@class="num"]/text()').extract()[0])
        item["answer"] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[3]/span[@class="num"]/text()').extract()[0])
        item["agree"] = int(sel.xpath('//span[@class="zm-profile-header-user-agree"]/strong/text()').extract()[0])
        item["thanks"] = int(sel.xpath('//span[@class="zm-profile-header-user-thanks"]/strong/text()').extract()[0])

        yield item
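
The spider builds the same three requests (/about, /followees, /followers) for every profile, and recovers the profile URL by cutting "/about" off the response URL. A minimal sketch of how that URL handling could be factored out, in plain Python with no Scrapy dependency. The names PROFILE_PAGES, profile_requests and strip_about are hypothetical and not part of the original project.

```python
# Hypothetical helpers: these names are illustrative, not from the original project.
PROFILE_PAGES = (
    ("/about", "user_item"),       # profile details, parsed into a UserItem
    ("/followees", "user_start"),  # people this user follows
    ("/followers", "user_start"),  # people following this user
)


def profile_requests(user_url):
    """Return (url, callback_name) pairs for one profile URL."""
    base = user_url.rstrip("/")
    return [(base + suffix, callback) for suffix, callback in PROFILE_PAGES]


def strip_about(url):
    """Recover the profile URL from an /about URL, like response.url[:-6]."""
    suffix = "/about"
    return url[:-len(suffix)] if url.endswith(suffix) else url
```

Inside the spider, each pair could then be turned into a scrapy.Request whose callback is looked up with getattr(self, callback_name), which would replace the six hand-written request blocks with two short loops.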

Adding the crawler config file

Create myconfig.py in the myspider directory with the following content, filling in your own settings:

# -*- coding: utf-8 -*-
UsersConfig = {
    # Proxy
    "proxy": "",

    # Zhihu account email and password
    "email": "your email",
    "password": "your password",
}

DbConfig = {
    # db config
    "user": "db user",
    "passwd": "db password",
    "db": "db name",
    "host": "db host",
}
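
If you would rather not keep credentials in a source file, the same dictionary could be filled from environment variables. This is only a sketch of one alternative: the variable names ZHIHU_PROXY, ZHIHU_EMAIL and ZHIHU_PASSWORD and the function load_users_config are assumptions made for illustration, not something the original project defines.

```python
import os


def load_users_config(env=os.environ):
    """Build a UsersConfig-style dict from environment variables.

    The variable names are hypothetical; missing ones fall back to "".
    """
    return {
        "proxy": env.get("ZHIHU_PROXY", ""),
        "email": env.get("ZHIHU_EMAIL", ""),
        "password": env.get("ZHIHU_PASSWORD", ""),
    }


# Same shape as the hand-written UsersConfig above
UsersConfig = load_users_config()
```

The spider code would not need to change, since it only reads UsersConfig["proxy"], UsersConfig["email"] and UsersConfig["password"].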

Modifying items.py

# -*- coding: utf-8 -*-
import scrapy

class UserItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    name = scrapy.Field()
    bio = scrapy.Field()
    location = scrapy.Field()
    business = scrapy.Field()
    gender = scrapy.Field()
    avatar = scrapy.Field()
    education = scrapy.Field()
    major = scrapy.Field()
    employment = scrapy.Field()
    position = scrapy.Field()
    content = scrapy.Field()
    ask = scrapy.Field()
    answer = scrapy.Field()
    agree = scrapy.Field()
    thanks = scrapy.Field()

Storing user data in a MySQL database

Modify pipelines.py:

# -*- coding: utf-8 -*-
import MySQLdb
import datetime
from myspider.myconfig import DbConfig

class UserPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user = DbConfig["user"], passwd = DbConfig["passwd"], db = DbConfig["db"], host = DbConfig["host"], charset = "utf8", use_unicode = True)
        self.cursor = self.conn.cursor()
        # Empty the table (disabled)
        # self.cursor.execute("truncate table weather;")
        # self.conn.commit()

    def process_item(self, item, spider):
        curTime = datetime.datetime.now()
        try:
            self.cursor.execute(
                """INSERT IGNORE INTO users (url, name, bio, location, business, gender, avatar, education, major, employment, position, content, ask, answer, agree, thanks, create_at)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                (
                    item["url"],
                    item["name"],
                    item["bio"],
                    item["location"],
                    item["business"],
                    item["gender"],
                    item["avatar"],
                    item["education"],
                    item["major"],
                    item["employment"],
                    item["position"],
                    item["content"],
                    item["ask"],
                    item["answer"],
                    item["agree"],
                    item["thanks"],
                    curTime
                )
            )
            self.conn.commit()
        except MySQLdb.Error as e:
            print("Error %d %s" % (e.args[0], e.args[1]))

        return item
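
The pipeline relies on INSERT IGNORE to skip users that are already stored, but the article does not show the users table itself. The sketch below illustrates the idea with Python's built-in sqlite3 module instead of MySQL, so it runs without a database server. Note the swapped-in syntax: SQLite spells it INSERT OR IGNORE and uses ? placeholders rather than %s. The schema, including url as the unique key and every column type, is an assumption.

```python
import datetime
import sqlite3

COLUMNS = ("url", "name", "bio", "location", "business", "gender", "avatar",
           "education", "major", "employment", "position", "content",
           "ask", "answer", "agree", "thanks", "create_at")


def create_users_table(conn):
    # Assumed schema: the original article does not show it.
    # url is the key that makes duplicate inserts get ignored.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS users (
               url TEXT PRIMARY KEY, name TEXT, bio TEXT, location TEXT,
               business TEXT, gender INTEGER, avatar TEXT, education TEXT,
               major TEXT, employment TEXT, position TEXT, content TEXT,
               ask INTEGER, answer INTEGER, agree INTEGER, thanks INTEGER,
               create_at TEXT)"""
    )


def save_user(conn, item):
    """Insert one user; a row with an existing url is silently skipped."""
    row = dict(item, create_at=datetime.datetime.now().isoformat())
    sql = "INSERT OR IGNORE INTO users (%s) VALUES (%s)" % (
        ", ".join(COLUMNS),
        ", ".join("?" for _ in COLUMNS),
    )
    # Values are passed separately from the SQL text, never formatted into it
    conn.execute(sql, [row.get(column) for column in COLUMNS])
    conn.commit()


conn = sqlite3.connect(":memory:")
create_users_table(conn)
user = {"url": "https://www.zhihu.com/people/example", "name": "Example", "ask": 1}
save_user(conn, user)
save_user(conn, user)  # same url again: ignored, no error raised
```

In MySQL the same behaviour would need a PRIMARY KEY or UNIQUE index on url; without one, INSERT IGNORE has nothing to collide with and duplicates would be stored.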

Modifying settings.py

Find ITEM_PIPELINES and change it to:

ITEM_PIPELINES = {
   "myspider.pipelines.UserPipeline": 300,
}

Append the following at the end of the file to set the crawl depth:

DEPTH_LIMIT=10
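
DEPTH_LIMIT caps how many request hops Scrapy follows from the start request. The toy model below shows the effect of such a cap on a small follow graph. It is not Scrapy's scheduler: the graph, the function crawl_order and the breadth-first order are all assumptions made for illustration, and the seen set is something the spider above does not have, because it sends every request with dont_filter = True.

```python
from collections import deque


def crawl_order(graph, seed, depth_limit):
    """Breadth-first walk of a follow graph, stopping at depth_limit hops."""
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        user, depth = queue.popleft()
        order.append((user, depth))
        if depth == depth_limit:
            continue  # do not follow links beyond the limit
        for neighbour in graph.get(user, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return order


# Hypothetical graph: each key follows the users in its list
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(crawl_order(graph, "a", 2))
# prints [('a', 0), ('b', 1), ('c', 1), ('d', 2)]
```

User "e" is three hops from the seed, so a limit of 2 never reaches it. In the real spider the homepage, captcha and login requests are hops as well, so they probably count toward the limit and fewer than ten levels of users would be reached.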

Crawling Zhihu user data

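Make sure MySQL is running, then open a terminal in the project root and
run scrapy crawl users -a url=https://www.zhihu.com/people/<user>
where <user> is the first user to crawl. The spider then continues through the people that user follows and the people who follow that user.
The captcha image is downloaded next. If it does not open automatically, open captcha.gif in the project root, then type the captcha into the terminal.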
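
Because the spider appends "/about", "/followees" and "/followers" directly to the url argument, a trailing slash or a missing user name produces broken requests. A small check along these lines could catch that before the crawl starts. It is a sketch only: normalize_seed is a hypothetical helper, and the expected URL shape is taken from the command above.

```python
from urllib.parse import urlparse


def normalize_seed(url):
    """Validate a seed profile URL and drop any trailing slash."""
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.netloc != "www.zhihu.com" or len(parts) != 2 or parts[0] != "people":
        raise ValueError("expected https://www.zhihu.com/people/<user>, got %r" % url)
    return "https://www.zhihu.com/people/" + parts[1]
```

For example, normalize_seed("https://www.zhihu.com/people/example/") returns the URL without the trailing slash, while a URL that stops at /people/ raises ValueError.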

Source code

The source code is available on GitHub.
