selenium: https://github.com/SeleniumHQ...
Current version: 3.0.1
A browser automation framework and ecosystem
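phantomjs: http://phantomjs.org/
PhantomJS is a server-side WebKit with a JavaScript API; in other words, a headless browser with no user interface. It supports a range of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.
Most web scraping can be done with urllib, but as soon as a page relies on JavaScript or Ajax rendering, urlopen is of no use: it returns the raw HTML and never runs the scripts. You then have to drive a real browser. There are many ways to do that; this article uses selenium2 + phantomjs.
selenium2 supports all mainstream browsers as well as headless browsers such as phantomjs.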
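To make the urlopen limitation concrete, here is a minimal sketch that parses a small, made-up HTML page with the standard library only. The page text, the `STATIC_HTML` constant and the `static_content_text` helper are assumptions for illustration and are not taken from any real site. A static parser sees the placeholder text in the `content` div, because the script that would replace it is never executed.

```python
from html.parser import HTMLParser

# Hypothetical page: a browser would run the script and replace the div text.
STATIC_HTML = """
<html><body>
<div id="content">placeholder before JavaScript runs</div>
<script>
  document.getElementById("content").innerHTML = "filled in by JavaScript";
</script>
</body></html>
"""


class ContentDivText(HTMLParser):
    """Collect the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._inside = False

    def handle_data(self, data):
        # Script source arrives here too, but only while outside the div.
        if self._inside:
            self.text += data


def static_content_text(html):
    """Return the div text as a non-JavaScript client would see it."""
    parser = ContentDivText()
    parser.feed(html)
    return parser.text.strip()


print(static_content_text(STATIC_HTML))
```

This is the same view of the page that urlopen gives you, which is why a browser engine is needed for script-driven content.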
Installation:
pip install selenium
phantomjs is not a Python program; download the build for your operating system from the official site and install it.
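After installing both pieces, a quick check along the following lines can confirm that Python can see them. This is only a sketch: the `check_setup` helper is a hypothetical name, and it assumes the PhantomJS executable was placed on the PATH under the name `phantomjs`. If you keep the executable elsewhere, pass its location through `executable_path` as the examples in this article do.

```python
import importlib.util
import shutil


def check_setup(binary_name="phantomjs"):
    """Report whether the selenium package and a PhantomJS binary can be found."""
    return {
        # find_spec returns None when the package is not importable.
        "selenium_installed": importlib.util.find_spec("selenium") is not None,
        # shutil.which returns the full path, or None when nothing is on the PATH.
        "phantomjs_path": shutil.which(binary_name),
    }


print(check_setup())
```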
    # Example 1: read content that is filled in by Ajax
    from selenium import webdriver
    import time

    # Path to the PhantomJS executable; adjust it for your machine.
    driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    # Crude fixed pause that gives the Ajax request time to finish.
    time.sleep(3)
    print(driver.find_element_by_id("content").text)
    driver.close()

    # Example 2: fill in a search form and submit it
    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
    driver.set_window_size(1120, 550)
    driver.get("http://duckduckgo.com/")
    driver.find_element_by_id("search_form_input_homepage").send_keys("Nirvana")
    driver.find_element_by_id("search_button_homepage").click()
    print(driver.current_url)
    driver.close()
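The get method blocks until the page has finished loading and only then lets the program continue, but it can do nothing about content that arrives later through Ajax.
send_keys types text into an input field of a form.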
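Because get returns before Ajax content has arrived, something has to keep checking until the content shows up. The sketch below shows that polling idea in plain Python, independent of Selenium. The `wait_until` name and its parameters are hypothetical; Selenium's own WebDriverWait follows the same general pattern of calling a condition repeatedly until it succeeds or a timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)
```

Compared with a fixed `time.sleep(3)`, polling returns as soon as the content is ready and fails loudly when it never arrives.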
    # Wait for the page to finish rendering
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    try:
        # Wait up to 10 seconds for the element with id "loadedButton" to appear in the DOM.
        element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
    finally:
        print(driver.find_element_by_id("content").text)
        driver.close()

Handling JavaScript redirects
    # Handle JavaScript redirects
    from selenium import webdriver
    import time
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.common.exceptions import StaleElementReferenceException

    def waitForLoad(driver):
        elem = driver.find_element_by_tag_name("html")
        count = 0
        while True:
            count += 1
            # 20 checks at 0.5 seconds each gives a 10 second limit.
            if count > 20:
                print("Timing out after 10 seconds and returning")
                return
            time.sleep(.5)
            try:
                elem == driver.find_element_by_tag_name("html")
            # A StaleElementReferenceException means elem no longer exists,
            # which in turn means the page has redirected.
            except StaleElementReferenceException:
                return

    driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
    waitForLoad(driver)
    print(driver.page_source)

Setting the PhantomJS User-Agent
Some web servers restrict access by User-Agent and may refuse requests from a User-Agent they do not recognise.
To set the PhantomJS user agent, set the desired capability "phantomjs.page.settings.userAgent".
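The capability is an ordinary dictionary key, so it is worth copying the default capabilities before editing them; otherwise the change would leak into every later driver that reuses the same defaults. A minimal sketch of that step, with no Selenium import: `with_user_agent` is a hypothetical helper, and `BASE_CAPS` is an assumed stand-in for `DesiredCapabilities.PHANTOMJS`, not its exact contents.

```python
# Assumed stand-in for DesiredCapabilities.PHANTOMJS, for illustration only.
BASE_CAPS = {"browserName": "phantomjs", "javascriptEnabled": True}


def with_user_agent(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS user agent capability set."""
    caps = dict(base_caps)  # shallow copy, the shared defaults stay untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


caps = with_user_agent(BASE_CAPS, "Mozilla/5.0 (Windows NT 10.0; WOW64)")
print(caps["phantomjs.page.settings.userAgent"])
```

The resulting dictionary is what would be passed as `desired_capabilities` when the driver is created.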
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Copy the defaults so the shared DesiredCapabilities.PHANTOMJS dict is not modified.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
    driver.get("http://dianping.com/")
    cap_dict = driver.desired_capabilities
    # List every available desired_capabilities entry.
    for key in cap_dict:
        print("%s: %s" % (key, cap_dict[key]))
    print(driver.current_url)
    driver.quit()

Demo
github
    # pip install selenium
    # install phantomjs
    from selenium import webdriver
    import time
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    time.sleep(3)
    print(driver.find_element_by_id("content").text)
    driver.close()

    # Set the PhantomJS User-Agent
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
    driver.get("http://dianping.com/")
    cap_dict = driver.desired_capabilities
    # List every available desired_capabilities entry.
    for key in cap_dict:
        print("%s: %s" % (key, cap_dict[key]))
    print(driver.current_url)
    driver.quit()

    # Wait for the page to finish rendering
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    try:
        element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
    finally:
        print(driver.find_element_by_id("content").text)
        driver.close()

    # Handle JavaScript redirects
    from selenium import webdriver
    import time
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.common.exceptions import StaleElementReferenceException

    def waitForLoad(driver):
        elem = driver.find_element_by_tag_name("html")
        count = 0
        while True:
            count += 1
            if count > 20:
                print("Timing out after 10 seconds and returning")
                return
            time.sleep(.5)
            try:
                elem == driver.find_element_by_tag_name("html")
            except StaleElementReferenceException:
                return

    driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
    waitForLoad(driver)
    print(driver.page_source)

    ##################################################################################
    # Simulate drag and drop
    from selenium import webdriver
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.webdriver import ActionChains

    driver = webdriver.PhantomJS(executable_path="phantomjs/bin/phantomjs")
    driver.get("http://pythonscraping.com/pages/javascript/draggableDemo.html")
    print(driver.find_element_by_id("message").text)
    element = driver.find_element_by_id("draggable")
    target = driver.find_element_by_id("div2")
    actions = ActionChains(driver)
    actions.drag_and_drop(element, target).perform()
    print(driver.find_element_by_id("message").text)

    ##################################################################################
    # Take a screenshot
    driver.get_screenshot_as_file("tmp/pythonscraping.png")

    ##################################################################################
    # Log in to Zhihu, then automatically click "more" at the bottom of the page
    # to load additional content
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver import ActionChains
    import time
    import sys

    driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
    driver.get("http://www.zhihu.com/#signin")
    #driver.find_element_by_name("email").send_keys("your email")
    # Inner quotes are single quotes so the XPath fits inside a Python string.
    driver.find_element_by_xpath("//input[@name='password']").send_keys("your password")
    #driver.find_element_by_xpath("//input[@name='password']").send_keys(Keys.RETURN)
    time.sleep(2)
    driver.get_screenshot_as_file("show.png")
    #driver.find_element_by_xpath("//button[@class='sign-button']").click()
    driver.find_element_by_xpath("//form[@class='zu-side-login-box']").submit()

    try:
        # Wait for the page to finish loading
        dr = WebDriverWait(driver, 5)
        dr.until(lambda the_driver: the_driver.find_element_by_xpath("//a[@class='zu-top-nav-userinfo ']").is_displayed())
    except Exception:
        print("Login failed")
        sys.exit(0)
    driver.get_screenshot_as_file("show.png")
    #user = driver.find_element_by_class_name("zu-top-nav-userinfo ")
    #webdriver.ActionChains(driver).move_to_element(user).perform()  # move the mouse to my user name
    loadmore = driver.find_element_by_xpath("//a[@id='zh-load-more']")
    actions = ActionChains(driver)
    actions.move_to_element(loadmore)
    actions.click(loadmore)
    actions.perform()
    time.sleep(2)
    driver.get_screenshot_as_file("show.png")
    print(driver.current_url)
    print(driver.page_source)
    driver.quit()
    ##################################################################################
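One practical detail about the screenshot step in the demo: it writes to `tmp/pythonscraping.png`, and the write cannot succeed if the `tmp` directory is missing. As far as I know, `get_screenshot_as_file` reports that case by returning False rather than raising, so the failure is easy to miss. A small sketch that creates the target directory first; `ensure_parent_dir` is a hypothetical helper name.

```python
import os


def ensure_parent_dir(path):
    """Create the directory that will hold `path` if it is missing; return it."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return parent


# Usage with the demo's screenshot path (driver is assumed to exist already):
# ensure_parent_dir("tmp/pythonscraping.png")
# saved = driver.get_screenshot_as_file("tmp/pythonscraping.png")
```

Checking the returned value after the call tells you whether the file was actually written.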
When reposting, please cite this article's address: https://www.ucloud.cn/yun/44221.html