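Scraping every answer under Zhihu's "Versailles quotes" question: I know you look handsome for clicking in, but still not as handsome as me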

fevin / 2,619 reads

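Versailles literature (凡尔赛文学) has gone viral. This peculiar online writing style shows up mostly on WeChat Moments and Weibo: in an unruffled tone, the author pretends to casually flaunt wealth or a perfect relationship.
Ordinary showing off is nothing more than posting sports-car photos on social media or letting a designer-bag logo slip into the frame, but Versailles literature is less direct than that. One Weibo blogger even produced a tutorial video on Versailles literature that explains its three essential elements.

Douban also has a group called the Versailles Studies Research Group (凡尔赛学研习小组), whose members define Versailles as a spirit of performing a high-class life. With that, on to the main topic: today we quickly scrape the Zhihu answers related to Versailles quotes. Let's begin.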

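1. The site to scrape

Search Zhihu for 凡尔赛语录 (Versailles quotes); the second result fits best, so we use that one.

Opening it shows that this question has 393 answers in total.

URL: https://www.zhihu.com/question/429548386/answer/1575062220

Dropping answer and everything after it gives the URL of the question to scrape: https://www.zhihu.com/question/429548386. The trailing number, 429548386, is the question id, which uniquely identifies a Zhihu question.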
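The URL trimming described above can be sketched in a few lines. This is a minimal illustration, not part of the original script; the helper names extract_question_id and question_url are hypothetical.

```python
import re
from urllib.parse import urlparse


def extract_question_id(url: str) -> str:
    """Return the numeric question id from a Zhihu question or answer URL."""
    path = urlparse(url).path  # e.g. "/question/429548386/answer/1575062220"
    match = re.match(r"^/question/(\d+)", path)
    if match is None:
        raise ValueError("not a Zhihu question URL: " + url)
    return match.group(1)


def question_url(url: str) -> str:
    """Drop the /answer/... suffix and keep only the question page URL."""
    return "https://www.zhihu.com/question/" + extract_question_id(url)


answer_url = "https://www.zhihu.com/question/429548386/answer/1575062220"
print(extract_question_id(answer_url))
print(question_url(answer_url))
```

Anchoring the pattern at /question/ means the answer id that follows is never mistaken for the question id.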

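2. Scraping the answers to the question

Looking at the URL above, there are two parts of data to scrape:

  1. The question details, including creation time, follower count, view count, and the question description
  2. The answers, including each author's user name, follower count, and so on, plus each answer's content, publication time, comment count, and upvote count

The question details can be scraped directly from the URL above by parsing the page with bs4. The answers have to be fetched through the link below, where each page is selected by its starting index (offset) and page size (limit), much like scraping any paginated content.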
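To make the second part concrete, here is a minimal sketch of turning one page of the answers response into rows. The sample payload is invented for illustration; its field names (data, author, voteup_count, comment_count, created_time, content) follow the ones the full script reads, and the real response may contain more fields or a different shape.

```python
import json
from datetime import datetime, timezone

# Hypothetical one-answer payload, shaped after the fields the article's script reads.
SAMPLE_RESPONSE = """
{
  "data": [
    {
      "id": 1,
      "created_time": 1605000000,
      "voteup_count": 12,
      "comment_count": 3,
      "content": "<p>sample answer</p>",
      "author": {"name": "user_a", "follower_count": 42}
    }
  ]
}
"""


def parse_answers(text):
    """Convert one JSON page of answers into a list of flat dictionaries."""
    payload = json.loads(text)
    rows = []
    for item in payload.get("data", []):
        author = item.get("author", {})
        created = datetime.fromtimestamp(item["created_time"], tz=timezone.utc)
        rows.append({
            "answer_id": item.get("id"),
            "author_name": author.get("name"),
            "author_follower_count": author.get("follower_count"),
            "voteup_count": item.get("voteup_count"),
            "comment_count": item.get("comment_count"),
            "created_time": created.strftime("%Y-%m-%d %H:%M:%S"),
            "content": item.get("content"),
        })
    return rows


for row in parse_answers(SAMPLE_RESPONSE):
    print(row["author_name"], row["voteup_count"], row["created_time"])
```

The sketch converts timestamps in UTC so the result does not depend on the machine's time zone; the full script uses local time instead.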

def init_url(question_id, limit, offset):
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = (
        "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed"
        "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by"
        "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count"
        "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info"
        "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting"
        "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B"
        "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics"
        "&limit={0}&offset={1}".format(limit, offset)
    )
    return base_url_start + question_id + base_url_end
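The percent-encoded include string is hard to read and edit by hand. As an alternative sketch, the same kind of URL can be assembled with urllib.parse.urlencode from a readable field list. The shortened INCLUDE_FIELDS value and the build_answers_url name are assumptions for illustration; whether the API accepts a shortened field list has not been verified here.

```python
from urllib.parse import urlencode

API_BASE = "https://www.zhihu.com/api/v4/questions/"

# Assumed, shortened field list; the article's function sends a much longer one.
INCLUDE_FIELDS = (
    "data[*].comment_count,content,voteup_count,created_time,updated_time;"
    "data[*].author.follower_count"
)


def build_answers_url(question_id, limit=20, offset=0, include=INCLUDE_FIELDS):
    """Build the answers URL, letting urlencode handle the percent-encoding."""
    query = urlencode({"include": include, "limit": limit, "offset": offset})
    return "{0}{1}/answers?{2}".format(API_BASE, question_id, query)


print(build_answers_url("429548386", limit=20, offset=40))
```

urlencode escapes the brackets, asterisks, commas, and semicolons (as %5B, %5D, %2A, %2C, %3B), which matches the escapes visible in the hand-written string.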

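With limit=20 answers per page, offset takes the values 0, 20, 40, and so on, and question_id is the number at the end of the URL mentioned above, here 429548386. Once the logic is clear, all that remains is to write the crawler and fetch the data. The complete crawler code follows; to run it, you only need to change the question id.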
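The offsets for a given answer count can be worked out as follows. This is a small sketch with a hypothetical helper name, page_offsets; it rounds the page count up so that a partially filled last page is still fetched and no empty page is requested.

```python
import math


def page_offsets(answer_count, limit=20):
    """Return the offset of every page needed to cover answer_count answers."""
    page_count = math.ceil(answer_count / limit)
    return [page * limit for page in range(page_count)]


offsets = page_offsets(393, limit=20)
print(len(offsets), offsets[:3], offsets[-1])
```

For the 393 answers of this question that gives 20 pages, with the last page starting at offset 380 and holding the remaining 13 answers.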

3. Complete code

# Import the required libraries
import json
import random
import re
import time
import warnings
from datetime import datetime
from time import sleep

import pandas as pd
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore")


def get_ua():
    """
    Pick a random User-Agent from the UA list
    :return: a random UA string from the list
    """
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]
    return random.choice(ua_list)


def filter_emoij(text):
    """
    Strip emoji characters
    @param text:
    @return:
    """
    try:
        co = re.compile(u"[\U00010000-\U0010ffff]")
    except re.error:
        co = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
    text = co.sub("", text)
    return text


def get_question_base_info(url):
    """
    Fetch the details of the question
    @param url:
    @return:
    """
    response = requests.get(url=url, headers={"User-Agent": get_ua()}, timeout=10)
    # Fetch the page and parse it
    soup = BeautifulSoup(response.text, "lxml")
    # Question title
    title = soup.find("h1", {"class": "QuestionHeader-title"}).text
    # Question description
    question = ""
    try:
        question = soup.find("div", {"class": "QuestionRichText--collapsed"}).text.replace("\u200b", "")
    except Exception as e:
        print(e)
    # Followers
    follower = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[0].text.strip().replace(",", ""))
    # Views
    watched = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[1].text.strip().replace(",", ""))
    # Number of answers
    answer_str = soup.find_all("h4", {"class": "List-headerText"})[0].span.text.strip()
    # Extract the number from "xxx 个回答": one digit followed by digits or thousands separators
    answer_count = int(re.search(r"\d[\d,]*", answer_str).group().replace(",", ""))
    # Question tags
    tag_list = []
    tags = soup.find_all("div", {"class": "QuestionTopic"})
    for tag in tags:
        tag_list.append(tag.text)
    return title, question, follower, watched, answer_count, tag_list


def init_url(question_id, limit, offset):
    """
    Build the URL for each page of answers
    @param question_id:
    @param limit:
    @param offset:
    @return:
    """
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = (
        "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed"
        "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by"
        "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count"
        "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info"
        "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting"
        "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B"
        "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics"
        "&limit={0}&offset={1}".format(limit, offset)
    )
    return base_url_start + question_id + base_url_end


def get_time_str(timestamp):
    """
    Convert a timestamp to a standard date string
    @param timestamp:
    @return:
    """
    datetime_str = ""
    try:
        # Convert the timestamp to a datetime object
        datetime_time = datetime.fromtimestamp(timestamp)
        # Convert the datetime object to a date string
        datetime_str = datetime_time.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(e)
        print("Date conversion error")
    return datetime_str


def get_answer_info(url, index):
    """
    Parse one page of answers
    @param url:
    @param index:
    @return:
    """
    response = requests.get(url=url, headers={"User-Agent": get_ua()}, timeout=10)
    text = response.text.replace("\u200b", "")
    per_answer_list = []
    try:
        question_json = json.loads(text)
        # Answer data for the current page
        print("Scraping page {0} of the answer list, got {1} answers on this page".format(index + 1, len(question_json["data"])))
        for data in question_json["data"]:
            # Question info: type, id, question type, creation time, update time
            question_type = data["question"]["type"]
            question_id = data["question"]["id"]
            question_question_type = data["question"]["question_type"]
            question_created = get_time_str(data["question"]["created"])
            question_updated_time = get_time_str(data["question"]["updated_time"])
            # Author info: user name, headline, gender, follower count
            author_name = data["author"]["name"]
            author_headline = data["author"]["headline"]
            author_gender = data["author"]["gender"]
            author_follower_count = data["author"]["follower_count"]
            # Answer info: id, creation time, update time, upvotes, comments, content
            answer_id = data["id"]
            created_time = get_time_str(data["created_time"])
            updated_time = get_time_str(data["updated_time"])
            voteup_count = data["voteup_count"]
            comment_count = data["comment_count"]
            content = data["content"]
            per_answer_list.append([question_type, question_id, question_question_type, question_created,
                                    question_updated_time, author_name, author_headline, author_gender,
                                    author_follower_count, answer_id, created_time, updated_time, voteup_count,
                                    comment_count, content])
    except (ValueError, KeyError, TypeError) as e:
        # json.JSONDecodeError is a subclass of ValueError
        print("JSON validation error:", e)
    answer_column = ["question type", "question id", "question question type", "question created time",
                     "question updated time", "author name", "author headline", "author gender",
                     "author follower count", "answer id", "answer created time", "answer updated time",
                     "answer upvotes", "answer comments", "answer content"]
    per_answer_data = pd.DataFrame(per_answer_list, columns=answer_column)
    return per_answer_data


if __name__ == "__main__":
    # question_id = "424516487"
    question_id = "429548386"
    url = "https://www.zhihu.com/question/" + question_id
    # Fetch the details of the question
    title, question, follower, watched, answer_count, tag_list = get_question_base_info(url)
    print("Question url: " + url)
    print("Question title: " + title)
    print("Question description: " + question)
    print("Tags assigned to this question: " + ", ".join(tag_list))
    print("Followers: {0}, viewed by {1} people".format(follower, watched))
    print("As of {}, this question has {} answers".format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), answer_count))
    # Fetch the answers
    limit, offset = 20, 0
    # Round up so the last partial page is fetched without requesting an empty extra page
    page_cnt = (answer_count + limit - 1) // limit
    pages = []
    for page_index in range(page_cnt):
        # Build the url
        answer_url = init_url(question_id, limit, offset + page_index * limit)
        # Fetch the data
        data_per_page = get_answer_info(answer_url, page_index)
        pages.append(data_per_page)
        sleep(3)
    # DataFrame.append was removed in pandas 2.0, so concatenate the pages once at the end
    answer_data = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()
    answer_data.to_csv("versailles_quotes_{0}.csv".format(question_id), encoding="utf-8", index=False)
    print("\nScraping finished, data saved!!")
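One detail of the script worth isolating is pulling the answer count out of the page header, which reads like "393 个回答". The sketch below is an illustration with a hypothetical helper name, parse_answer_count; the assumption that larger counts may carry thousands separators (for example "1,234 个回答") is not confirmed by the article.

```python
import re


def parse_answer_count(header_text):
    """Extract the integer answer count from a header such as '393 个回答'."""
    # Require at least one digit, then allow digits and thousands separators.
    match = re.search(r"\d[\d,]*", header_text)
    if match is None:
        raise ValueError("no number found in: " + repr(header_text))
    return int(match.group().replace(",", ""))


print(parse_answer_count("393 个回答"))
print(parse_answer_count("1,234 个回答"))
```

Requiring at least one digit matters: a pattern that also matches the empty string, such as \d*, returns an empty first match whenever the header does not start with a digit, and int("") then fails.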

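4. Results

A total of 393 answers were scraped. Note that the output file is saved as UTF-8; if you see garbled text when reading it, first check that the encoding you read it with matches.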
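The encoding point can be checked with a small round trip. This sketch uses only the standard library and a made-up two-row sample, not the scraped data. It writes with utf-8-sig, a variant of UTF-8 that adds a byte order mark, which spreadsheet programs commonly rely on to detect the encoding; that swap is a suggestion here, not what the script does.

```python
import csv
import os
import tempfile


def write_rows(path, rows, encoding="utf-8-sig"):
    """Write rows to a CSV file using an explicit encoding."""
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path, encoding="utf-8-sig"):
    """Read a CSV file back with the same explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


sample_rows = [["author", "content"], ["user_a", "凡尔赛语录"]]
sample_path = os.path.join(tempfile.mkdtemp(), "answers_sample.csv")
write_rows(sample_path, sample_rows)
print(read_rows(sample_path))
```

Reading with utf-8-sig also works for files written as plain UTF-8, because the byte order mark is skipped only when it is present.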

[Screenshot: part of the scraped results]



When reposting, please cite this article's address: https://www.ucloud.cn/yun/123098.html
