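Preface
I had long heard how good the asynchronous model of Node.js is and how strong its I/O is... praise of every kind. Today I want to compare Node.js and Python. I think nothing shows off an asynchronous model and I/O strengths better than a crawler, so let's settle it with a crawler project.
The crawler project
Zhongchou (众筹网), projects currently crowdfunding: http://www.zhongchou.com/brow... We use this site as the example: we crawl every project that is currently crowdfunding, get the URL of each project's detail page, and store the URLs in a txt file.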
Hands-on comparison
Python, original version
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

# get the list of project URLs
def getItems(allpage):
    no = 0
    items = open("pystandard.txt","a")
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a["href"]
            items.write(href+"\n")
            no +=1
    items.close()
    return no

if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
Results of 5 runs:
it takes 48.1727159614 Seconds to get 720 items
it takes 45.3397999415 Seconds to get 720 items
it takes 44.4811429862 Seconds to get 720 items
it takes 44.4619293082 Seconds to get 720 items
it takes 46.669706593 Seconds to get 720 items
Python, multithreaded version
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

items = open("pymulti.txt","a")
no = 0
lock = threading.Lock()

# get the list of project URLs
def getItems(urllist):
    # print urllist #①
    global items,no,lock
    for url in urllist:
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a["href"]
            # the file and the counter are shared, so guard them with the lock
            lock.acquire()
            items.write(href+"\n")
            no +=1
            # print no
            lock.release()

if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    allthread = 30
    per = (int)(allpage/allthread)
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
Results of 5 runs:
it takes 45.5222291114 Seconds to get 720 items
it takes 46.7097831417 Seconds to get 720 items
it takes 45.5334646156 Seconds to get 720 items
it takes 48.0242797553 Seconds to get 720 items
it takes 44.804855018 Seconds to get 720 items
This multithreaded version shows no advantage. Uncommenting the print at #① shows that this so-called multithreading actually runs like a single thread.
Python improvements
Single thread
First, let's improve the HTML parsing step. Analysis shows that
lists = soup.findAll("a",attrs={"class":"siteCardICH3"})
is better than
lists = soup.findAll(attrs={"class":"ssCardItem"})
because it looks for the a element directly, instead of first finding the div and then the a under that div.
Results of 5 runs after the improvement, showing some progress:
it takes 41.0018861912 Seconds to get 720 items
it takes 42.0260390497 Seconds to get 720 items
it takes 42.249635988 Seconds to get 720 items
it takes 41.295524133 Seconds to get 720 items
it takes 42.9022894154 Seconds to get 720 items
Multithreaded
Change getItems(urllist) to getItems(urllist,thno), and add print thno," begin at",time.clock() at the start of the function and print thno," end at",time.clock() at its end. Result:
0 begin at 0.00100631078628
0 end at 1.28625832936
1 begin at 1.28703230691
1 end at 2.61739476075
2 begin at 2.61801291642
2 end at 3.92514717937
3 begin at 3.9255829208
3 end at 5.38870235361
4 begin at 5.38921134066
4 end at 6.670658786
5 begin at 6.67125734731
5 end at 8.01520989534
6 begin at 8.01566383155
6 end at 9.42006780585
7 begin at 9.42053340537
7 end at 11.0386755513
8 begin at 11.0391565464
8 end at 12.421359168
9 begin at 12.4218294329
9 end at 13.9932716671
10 begin at 13.9939957256
10 end at 15.3535799145
11 begin at 15.3540870354
11 end at 16.6968289314
12 begin at 16.6972665389
12 end at 17.9798803157
13 begin at 17.9804714125
13 end at 19.326706238
14 begin at 19.3271438455
14 end at 20.8744308886
15 begin at 20.8751017624
15 end at 22.5306500245
16 begin at 22.5311450156
16 end at 23.7781693541
17 begin at 23.7787245279
17 end at 25.1775114499
18 begin at 25.178350742
18 end at 26.5497330734
19 begin at 26.5501776789
19 end at 27.970799259
20 begin at 27.9712727895
20 end at 29.4595075375
21 begin at 29.4599959972
21 end at 30.9507299602
22 begin at 30.9513989679
22 end at 32.2762763982
23 begin at 32.2767182045
23 end at 33.6476256057
24 begin at 33.648137392
24 end at 35.1100517711
25 begin at 35.1104907783
25 end at 36.462657099
26 begin at 36.4632234696
26 end at 37.7908515759
27 begin at 37.7912845182
27 end at 39.4359928956
28 begin at 39.436448698
28 end at 40.9955021593
29 begin at 40.9960871912
29 end at 42.6425665264
it takes 42.6435882327 Seconds to get 720 items
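Clearly these threads did not run concurrently at all. They ran one after another, so the point of multithreading was lost. Where is the problem? It turns out that in my loop the two lines
th.start()
th.join()
come one right after the other. th.join() blocks the main thread until that thread has finished, so each new thread is only started after the previous one is done. Change the loop to:
for i in range(allthread):
    # print urllist[i*(per):(i+1)*(per)]
    th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],i))
    ths.append(th)
for th in ths:
    th.start()
for th in ths:
    th.join()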
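The effect of where join() is called can be seen in isolation with a small sketch. It is written for Python 3 (the article's listings are Python 2), and fake_download with its time.sleep call is a made-up stand-in for requests.get, not the author's code; the thread count and delay are arbitrary.

```python
import threading
import time

def fake_download(seconds):
    # Hypothetical stand-in for requests.get(): sleeping waits without
    # holding the interpreter lock, much like waiting on the network.
    time.sleep(seconds)

def run_serialized(n, seconds):
    # join() right after start(): each thread finishes before the next begins.
    begin = time.perf_counter()
    for _ in range(n):
        th = threading.Thread(target=fake_download, args=(seconds,))
        th.start()
        th.join()
    return time.perf_counter() - begin

def run_concurrent(n, seconds):
    # Start every thread first, then wait for all of them.
    begin = time.perf_counter()
    ths = [threading.Thread(target=fake_download, args=(seconds,))
           for _ in range(n)]
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    return time.perf_counter() - begin

if __name__ == "__main__":
    print("serialized: %.2f s" % run_serialized(5, 0.2))
    print("concurrent: %.2f s" % run_concurrent(5, 0.2))
```

With pure waiting as the workload, the serialized loop should take roughly n times the delay, while the second form should take roughly one delay.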
Result of the corrected multithreaded crawler:
0 begin at 0.0010814225325
1 begin at 0.00135201143191
2 begin at 0.00191744892518
3 begin at 0.0021311208492
4 begin at 0.00247495536449
5 begin at 0.0027334144167
6 begin at 0.00320601192551
7 begin at 0.00379011072218
8 begin at 0.00425431064445
9 begin at 0.00511692939449
10 begin at 0.0132038052264
11 begin at 0.0165926979253
12 begin at 0.0170886220634
13 begin at 0.0174665134574
14 begin at 0.018348726576
15 begin at 0.0189780790334
16 begin at 0.0201896641572
17 begin at 0.0220576606283
18 begin at 0.0231484138125
19 begin at 0.0238804034387
20 begin at 0.0273901280772
21 begin at 0.0300363009005
22 begin at 0.0362878375422
23 begin at 0.0395512329756
24 begin at 0.0431556637289
25 begin at 0.0459581249682
26 begin at 0.0482254733323
27 begin at 0.0535430117384
28 begin at 0.0584971212607
29 begin at 0.0598136762161
16 end at 65.2657542222
24 end at 66.2951247811
21 end at 66.3849747583
4 end at 66.6230160119
5 end at 67.5501632164
29 end at 67.7516992283
23 end at 68.6985322418
7 end at 69.1060433231
22 end at 69.2743398214
2 end at 69.5523713152
14 end at 69.6454986837
15 end at 69.8333400981
12 end at 69.9508018062
10 end at 70.2860348602
26 end at 70.3670659719
13 end at 70.3847232972
27 end at 70.3941635841
11 end at 70.5132838156
1 end at 70.7272351926
0 end at 70.9115253609
6 end at 71.0876563409
8 end at 71.1124805398
25 end at 71.1145248855
3 end at 71.4606034226
19 end at 71.6103622486
18 end at 71.6674453096
20 end at 71.725601862
17 end at 71.7778992318
9 end at 71.7847479301
28 end at 71.7921004837
it takes 71.7931912368 Seconds to get 720 items
Reflection
The threads above do run concurrently now, but the run takes far longer than the single-threaded version (about 72 s versus about 42 s)... I have not found the cause yet. My guess: does BeautifulSoup not support multithreading? Advice is welcome. To test this idea, I will drop BeautifulSoup and use plain string search instead. First, the single-threaded version:
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

# get the list of project URLs
def getItems(allpage):
    no = 0
    data = set()
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # print "http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+" "
            # time.sleep(100)
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index + 1000
    items = open("pystandard.txt","a")
    items.write("".join(data))
    items.close()
    return len(data)

if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
Results of 3 runs:
it takes 11.6800132309 Seconds to get 720 items
it takes 11.3621804427 Seconds to get 720 items
it takes 11.6811991567 Seconds to get 720 items
Then the same change for the multithreaded version:
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML

# request headers
header = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

data = set()
no = 0
lock = threading.Lock()

# get the list of project URLs
def getItems(urllist,thno):
    # print urllist
    # print thno," begin at",time.clock()
    global no,lock,data
    for url in urllist:
        r1 = requests.get(url,headers=header)
        html = r1.text.encode("utf8")
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # the result set is shared between threads, so guard it with the lock
            lock.acquire()
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index + 1000
            lock.release()
    # print thno," end at",time.clock()

if __name__ == "__main__":
    start = time.clock()
    allpage = 30      # number of pages
    allthread = 10    # number of threads
    per = (int)(allpage/allthread)
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        low = i*allpage/allthread    # note how this is written
        high = (i+1)*allpage/allthread
        # print low," ",high
        th = threading.Thread(target = getItems,args= (urllist[low:high],i))
        ths.append(th)
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    items = open("pymulti.txt","a")
    items.write("".join(data))
    items.close()
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,len(data)))
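The "note how this is written" comment on low and high is about how the page list is split between threads. A minimal sketch of that arithmetic, written for Python 3, where the floor-division operator // is needed because / on two integers yields a float that cannot be used as a slice index. The helper name split_bounds and the placeholder page names are illustrative assumptions, not part of the author's code.

```python
def split_bounds(total, parts):
    # Chunk i covers the half-open range [i*total//parts, (i+1)*total//parts).
    # Multiplying before dividing spreads any remainder over the chunks,
    # so every item lands in exactly one chunk even when total % parts != 0.
    return [(i * total // parts, (i + 1) * total // parts)
            for i in range(parts)]

if __name__ == "__main__":
    urllist = ["page-%d" % (p + 1) for p in range(30)]  # placeholder names
    for low, high in split_bounds(len(urllist), 10):
        print(low, high, urllist[low:high])
```

By contrast, slicing with a fixed per = total // parts, as in urllist[i*per:(i+1)*per], silently drops the trailing items whenever the page count is not a multiple of the thread count: with 10 pages and 3 threads, per is 3 and the tenth page is never assigned.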
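Results of 3 runs of the multithreaded string-search version:
it takes 1.4781525123 Seconds to get 720 items
it takes 1.44905954029 Seconds to get 720 items
it takes 1.49297891786 Seconds to get 720 items
Here multithreading really is many times faster than a single thread (about 1.5 s versus about 11.5 s). And for a simple crawling task, the built-in string methods are much faster than parsing the HTML with BeautifulSoup.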
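The string-search idea can be tried without touching the network. The sketch below, in Python 3, scans a made-up HTML snippet with str.find and collects the text after each "deal-show/" marker into a set, so repeated links count once. The function name extract_ids, the sample markup, and the id values are assumptions for illustration; the listings above additionally rely on page-specific offsets (start = 5000, index + 1000) that are not reproduced here.

```python
def extract_ids(html, marker="deal-show/", id_len=9):
    # Scan forward with str.find; id_len mirrors the 9-character slice
    # (index+10 to index+19) used in the article's listings.
    found = set()
    start = 0
    while True:
        index = html.find(marker, start)
        if index == -1:
            break
        begin = index + len(marker)
        found.add(html[begin:begin + id_len])
        start = begin  # continue searching after this match
    return found

if __name__ == "__main__":
    # Hypothetical markup; the ids are invented for the example.
    sample = (
        '<a href="http://www.zhongchou.com/deal-show/id-000001">A</a>'
        '<a href="http://www.zhongchou.com/deal-show/id-000001">A again</a>'
        '<a href="http://www.zhongchou.com/deal-show/id-000002">B</a>'
    )
    print(sorted(extract_ids(sample)))
```

The trade-off is the one the article points at: this is fast because it does no parsing, but it depends on the exact shape of the links and breaks as soon as the markup changes.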
NodeJs
// npm install request -g   # this did not seem to work; run it in the code's directory instead: npm install --save request
// npm install cheerio -g   # npm install --save cheerio
var request = require("request");
var cheerio = require("cheerio");
var fs = require("fs");
var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array()
var urldata = "";
var mark = 0;
var no = 0;
for (var i=0; i< ...
// ... (the listing is cut off here in the source: the rest of this loop and the start of the parsing callback are missing) ...
    ... >= 0; i--) {
        // console.log(href[i].attribs["href"]);
        urldata += (href[i].attribs["href"]+"\n");
        no += 1;
    }
    mark += 1;
    if (mark==allpage) {
        // console.log(urldata);
        fs.writeFile("./nodestandard.txt",urldata,function(err){
            if(err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2-t1)/1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}
Results of 5 runs:
it takes 3.949 Seconds to get 720 items
it takes 3.642 Seconds to get 720 items
it takes 3.641 Seconds to get 720 items
it takes 3.938 Seconds to get 720 items
it takes 3.783 Seconds to get 720 items
So when both sides parse the HTML, Node.js (about 3.6 to 3.9 s) is far faster than Python (41 s or more). What about string search?
var request = require("request");
var cheerio = require("cheerio");
var fs = require("fs");
var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array() ;
var urldata = new Array();
var mark = 0;
var no = 0;
for (var i=0; i< ...
// ... (the rest of this listing is missing from the source) ...
Results of 5 runs:
it takes 3.695 Seconds to get 720 items
it takes 3.781 Seconds to get 720 items
it takes 3.94 Seconds to get 720 items
it takes 3.705 Seconds to get 720 items
it takes 3.601 Seconds to get 720 items
So in Node.js, string search takes about the same time as parsing.
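In summary, based on what I know and on this experiment, my conclusion is: with multithreading, Python's download speed can beat Node.js, but Python is not as fast as Node.js at parsing web pages. After all, JS was born for writing web pages, and a complex crawler cannot do everything with string search.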
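2016.9.13 addendum
On the time.time() mentioned in the comments: thanks to the veteran who pointed out my mistake. I switched to it in the Python multithreaded string-search version, and over 3 runs it averaged 2.3 s, which is still faster than the Node.js version. Could the difference be that your network environment is not the same as mine? I plan to try again from another classroom... As for whether this misleads anyone, I think readers will try it for themselves and draw their own conclusions.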
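Why the choice of clock matters can be shown with a short sketch. In Python 2, time.clock() returned processor time on Unix and wall-clock time on Windows, and it was removed in Python 3.8; the article does not say which platform behaviour applied to its runs. The Python 3 sketch below compares the two kinds of clock on a task that only waits; the helper name measure and the sleep duration are illustrative assumptions.

```python
import time

def measure(func):
    # Return (wall-clock seconds, CPU seconds) spent inside func().
    wall0 = time.perf_counter()   # elapsed real time, includes waiting
    cpu0 = time.process_time()    # CPU time of this process, excludes sleep
    func()
    return time.perf_counter() - wall0, time.process_time() - cpu0

if __name__ == "__main__":
    # Sleeping stands in for waiting on a network response.
    wall, cpu = measure(lambda: time.sleep(0.5))
    print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```

For a crawler, most of the elapsed time is spent waiting on the network, so a CPU-time clock can report far less than the time a user actually waits; a wall-clock timer such as time.time() or time.perf_counter() is the comparable figure to the Node.js Date-based timing.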
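Python does indeed have asynchronous frameworks (twisted), and Node.js does indeed have multiple processes (child_process). A comparison of ultimate performance would need a much deeper study of both languages. I only half understand this area at the moment; I will spend time on it as soon as I can and try to make that comparison. (The goal here is not to compare programming styles, but simply to compare the best each language can do on the same machine and the same network. The road is long.)
As for the parsing method: I used the parser that ships with Python. The official site says lxml is indeed faster than the built-in one, but after I switched, multithreading still showed no advantage, so I am still puzzled: does BeautifulSoup not support multithreading? I could not find any relevant documentation on the official site, so advice is welcome. Also, from BeautifulSoup import BeautifulSoup is indeed much slower than from bs4 import BeautifulSoup; this comes down to the BeautifulSoup version. Thanks to the commenter who pointed it out.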
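One possible explanation, which the experiments above do not confirm: in a standard CPython build, the global interpreter lock lets only one thread execute Python bytecode at a time. Waiting on the network can overlap across threads, but CPU-heavy work written in Python, such as parsing, cannot run in parallel. The Python 3 sketch below uses a made-up cpu_bound loop as a stand-in for parsing and times the same jobs sequentially and in threads; it does not use BeautifulSoup, and the job sizes are arbitrary.

```python
import threading
import time

def cpu_bound(n):
    # Hypothetical stand-in for HTML parsing: pure-Python arithmetic
    # that keeps the interpreter busy the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_sequential(jobs, n):
    begin = time.perf_counter()
    results = [cpu_bound(n) for _ in range(jobs)]
    return results, time.perf_counter() - begin

def timed_threaded(jobs, n):
    results = [None] * jobs

    def worker(slot):
        # Each thread writes to its own slot, so no lock is needed.
        results[slot] = cpu_bound(n)

    begin = time.perf_counter()
    ths = [threading.Thread(target=worker, args=(k,)) for k in range(jobs)]
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    return results, time.perf_counter() - begin

if __name__ == "__main__":
    _, t_seq = timed_sequential(4, 2000000)
    _, t_thr = timed_threaded(4, 2000000)
    print("sequential: %.2f s, threaded: %.2f s" % (t_seq, t_thr))
```

If the threaded time comes out no better than the sequential one on your machine, that matches the pattern seen with the BeautifulSoup version, although it does not prove that the same mechanism caused the 72-second run.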
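Copyright of this article belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.
When reposting, please cite the address of this article: https://www.ucloud.cn/yun/38147.html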