Abstract: Ordinary bragging means posting sports-car photos or letting a designer logo slip into frame; Versailles literature is never that direct. This article quickly scrapes the Zhihu answers about Versailles quotes, driven by the string of digits at the end of the question URL, the question id that uniquely identifies a Zhihu question.
Versailles literature (凡尔赛文学) has gone viral. This peculiar internet writing style shows up in WeChat Moments and on Weibo: in a deliberately calm tone, the author "accidentally" shows off wealth or a perfect relationship.
Ordinary bragging is just posting sports-car photos on social media or casually letting a designer-bag logo into frame, but Versailles literature is never that direct. Weibo bloggers have even produced tutorial videos on Versailles literature, explaining its three essential elements.
On Douban there is a group called the Versailles Studies Research Group (凡尔赛学研习小组), whose members define Versailles as the spirit of performing a high-class life. With that background covered, let's get to the point: today we will quickly scrape the Zhihu answers about Versailles quotes.
Searching Zhihu for 凡尔赛语录 (Versailles quotes), the second result fits best, so that is the question we will use.
Opening it, we can see the question has 393 answers in total.
URL: https://www.zhihu.com/question/429548386/answer/1575062220
Dropping the answer segment and everything after it gives the question URL we want to scrape: https://www.zhihu.com/question/429548386. Note the trailing string of digits: it is the question id, Zhihu's unique identifier for the question.
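If you would rather not copy the id by hand, it can be pulled out of any question or answer link with a regex; a minimal sketch (the helper name `extract_question_id` is mine):

```python
import re

def extract_question_id(url):
    """Pull the numeric question id out of a Zhihu question or answer URL."""
    match = re.search(r"/question/(\d+)", url)
    return match.group(1) if match else None

print(extract_question_id("https://www.zhihu.com/question/429548386/answer/1575062220"))
# 429548386 (the same id comes back with or without the /answer/... suffix)
```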
Studying the URL above, we find there are two parts of data to scrape: the question details and the answers themselves.
The question details can be scraped straight from that URL, parsing the page content with bs4. The answers, however, have to come from the API link below, where each page is selected by a start offset and a per-page limit, much like ordinary paginated scraping.
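The paging arithmetic can be sketched on its own before touching the network; assuming 393 answers and 20 per page (the function name `page_offsets` is mine):

```python
def page_offsets(total_answers, limit=20):
    """Start offsets for each API page: 0, limit, 2*limit, ... until every answer is covered."""
    return list(range(0, total_answers, limit))

offsets = page_offsets(393, 20)
print(len(offsets), offsets[0], offsets[-1])
# 20 pages, with offsets running 0, 20, ..., 380
```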
def init_url(question_id, limit, offset):
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = ("/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed"
                    "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by"
                    "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count"
                    "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info"
                    "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting"
                    "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content"
                    "%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count"
                    "%2Cbadge%5B%2A%5D.topics&limit={0}&offset={1}").format(limit, offset)
    return base_url_start + question_id + base_url_end
Set the answers per page to limit=20; offset then takes the values 0, 20, 40, and so on, while question_id is the string of digits from the URL above, here 429548386. Once the logic is clear, the rest is just writing the crawler to fetch the data. The complete code follows; to run it against another question you only need to change the question id.
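As a sanity check on how limit and offset slot into the request, here is a trimmed-down version of the url builder without the long include parameter (the name `init_url_sketch` is made up for illustration; the real builder above adds the field list):

```python
def init_url_sketch(question_id, limit, offset):
    """Minimal answers-API url: just the question id plus paging parameters."""
    base = "https://www.zhihu.com/api/v4/questions/{qid}/answers?limit={limit}&offset={offset}"
    return base.format(qid=question_id, limit=limit, offset=offset)

print(init_url_sketch("429548386", 20, 40))
# https://www.zhihu.com/api/v4/questions/429548386/answers?limit=20&offset=40
```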
# Import the required libraries
import json
import random
import re
import time
import warnings
from datetime import datetime
from time import sleep

import pandas as pd
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore")


def get_ua():
    """Return a random User-Agent from the pool."""
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    return random.choice(ua_list)


def filter_emoij(text):
    """Strip emoji (astral-plane characters) from a string.

    Defined for optionally cleaning answer content; not called in the main flow below.
    """
    try:
        co = re.compile(u"[\U00010000-\U0010ffff]")
    except re.error:
        # Narrow Python builds store astral characters as surrogate pairs
        co = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
    return co.sub("", text)


def get_question_base_info(url):
    """Fetch the question page and parse out the question details."""
    response = requests.get(url=url, headers={"User-Agent": get_ua()}, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")
    # Question title
    title = soup.find("h1", {"class": "QuestionHeader-title"}).text
    # Question description
    question = ""
    try:
        question = soup.find("div", {"class": "QuestionRichText--collapsed"}).text.replace("\u200b", "")
    except Exception as e:
        print(e)
    # Follower count
    follower = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[0].text.strip().replace(",", ""))
    # View count
    watched = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[1].text.strip().replace(",", ""))
    # Answer count: pull the number out of the "xxx answers" header with a regex
    answer_str = soup.find_all("h4", {"class": "List-headerText"})[0].span.text.strip()
    answer_count = int(re.findall(r"\d+", answer_str)[0])
    # Question tags
    tag_list = [tag.text for tag in soup.find_all("div", {"class": "QuestionTopic"})]
    return title, question, follower, watched, answer_count, tag_list


def init_url(question_id, limit, offset):
    """Build the answers-API url for one page."""
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = ("/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed"
                    "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by"
                    "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count"
                    "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info"
                    "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting"
                    "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B"
                    "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics"
                    "&limit={0}&offset={1}").format(limit, offset)
    return base_url_start + question_id + base_url_end


def get_time_str(timestamp):
    """Convert a Unix timestamp into a "%Y-%m-%d %H:%M:%S" string."""
    datetime_str = ""
    try:
        datetime_str = datetime.fromtimestamp(timestamp).strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(e)
        print("Date conversion failed")
    return datetime_str


def get_answer_info(url, index):
    """Parse one page of answers from the API response."""
    response = requests.get(url=url, headers={"User-Agent": get_ua()}, timeout=10)
    text = response.text.replace("\u200b", "")
    per_answer_list = []
    try:
        question_json = json.loads(text)
        print("Fetching answer page {0}: got {1} answers".format(index + 1, len(question_json["data"])))
        for data in question_json["data"]:
            # Question info: type, id, question type, created/updated time
            question_type = data["question"]["type"]
            question_id = data["question"]["id"]
            question_question_type = data["question"]["question_type"]
            question_created = get_time_str(data["question"]["created"])
            question_updated_time = get_time_str(data["question"]["updated_time"])
            # Author info: name, headline, gender, follower count
            author_name = data["author"]["name"]
            author_headline = data["author"]["headline"]
            author_gender = data["author"]["gender"]
            author_follower_count = data["author"]["follower_count"]
            # Answer info: id, created/updated time, upvotes, comments, content
            answer_id = data["id"]
            created_time = get_time_str(data["created_time"])
            updated_time = get_time_str(data["updated_time"])
            voteup_count = data["voteup_count"]
            comment_count = data["comment_count"]
            content = data["content"]
            per_answer_list.append([question_type, question_id, question_question_type,
                                    question_created, question_updated_time,
                                    author_name, author_headline, author_gender, author_follower_count,
                                    answer_id, created_time, updated_time,
                                    voteup_count, comment_count, content])
    except json.JSONDecodeError:
        print("JSON decode error")
    answer_column = ["question_type", "question_id", "question_question_type",
                     "question_created", "question_updated",
                     "author_name", "author_headline", "author_gender", "author_followers",
                     "answer_id", "answer_created", "answer_updated",
                     "answer_upvotes", "answer_comments", "answer_content"]
    return pd.DataFrame(per_answer_list, columns=answer_column)


if __name__ == "__main__":
    # question_id = "424516487"
    question_id = "429548386"
    url = "https://www.zhihu.com/question/" + question_id
    # The question details
    title, question, follower, watched, answer_count, tag_list = get_question_base_info(url)
    print("Question url: " + url)
    print("Question title: " + title)
    print("Question description: " + question)
    print("Question tags: " + ", ".join(tag_list))
    print("Followers: {0}, views: {1}".format(follower, watched))
    print("As of {}, the question has {} answers".format(
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), answer_count))
    # The answers, page by page
    limit, offset = 20, 0
    page_cnt = int(answer_count / limit) + 1
    answer_data = pd.DataFrame()
    for page_index in range(page_cnt):
        answer_url = init_url(question_id, limit, offset + page_index * limit)
        data_per_page = get_answer_info(answer_url, page_index)
        # Accumulate pages (pd.concat replaces the deprecated DataFrame.append)
        answer_data = pd.concat([answer_data, data_per_page], ignore_index=True)
        sleep(3)
    print("\nDone, data saved!")
    answer_data.to_csv("凡尔赛沙雕语录_{0}.csv".format(question_id), encoding="utf-8", index=False)
We scraped all 393 answers. Note that the output file is saved as UTF-8; if it looks garbled when you open it, first check that you are reading it with the same encoding.
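A quick round trip shows what a matching encoding looks like in practice (the column name and sample row here are made up for the demo):

```python
import io

import pandas as pd

# Bytes exactly as the crawler writes them: utf-8 encoded Chinese text
csv_bytes = "answer_content\n凡尔赛语录\n".encode("utf-8")

# Reading back with the matching encoding recovers the text intact;
# reading with a mismatched encoding (e.g. gbk) is what produces mojibake.
df = pd.read_csv(io.BytesIO(csv_bytes), encoding="utf-8")
print(df["answer_content"].iloc[0])
```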
A partial screenshot of the scraped results is shown below.
Thanks for reading this far. For more Python content, check my profile; your likes, bookmarks, and comments are what keep me updating. Thank you.
Copyright of this article belongs to the author; do not reproduce it without permission. If this article violates any rules, you may contact the administrator to have it removed.
When reprinting, please credit the original address: https://www.ucloud.cn/yun/123098.html