Dynamic scraping with Python, selenium, and phantomjs

zacklee, published 2019-07-31 10:51 / 3616 reads

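Summary: The current selenium version is 3.0.1. phantomjs is a server-side WebKit with a JavaScript API, in other words a headless browser. phantomjs is not a Python program; download and install the build for your operating system from the official site. The get method waits until the page has fully loaded before the program continues, but it cannot wait for Ajax. The section on setting the PhantomJS User-Agent also lists all available desired_capabilities.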

selenium:https://github.com/SeleniumHQ...
Current version: 3.0.1
A browser automation framework and ecosystem

phantomjs:http://phantomjs.org/
phantomjs is a server-side WebKit with a JavaScript API, in other words a headless browser. It supports a range of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

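Most web scraping can be handled with urllib, but once a page relies on JavaScript or Ajax rendering, urlopen is useless, so the scraper has to drive a browser engine instead. There are several ways to do that; this article uses selenium2 + phantomjs.
selenium2 supports all mainstream browsers as well as headless browsers such as phantomjs.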
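
To see why a plain HTTP fetch falls short, here is a minimal sketch that parses a hypothetical page source (the PAGE_SOURCE string and the ContentDivText class are made up for illustration, no network access involved). The text that JavaScript would insert is not in the HTML that urlopen would return:

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the HTML source and is only
# filled in later by the script, the way an Ajax-driven page behaves.
PAGE_SOURCE = """
<html><body>
<div id="content"></div>
<script>
  document.getElementById("content").innerHTML = "Loaded by JavaScript";
</script>
</body></html>
"""

class ContentDivText(HTMLParser):
    """Collect the text inside <div id="content"> without running any script."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data

parser = ContentDivText()
parser.feed(PAGE_SOURCE)
print(repr(parser.text))  # prints '' because the script never ran
```

A browser engine such as phantomjs executes the script first, so the same element would contain the loaded text by the time it is read.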
Installation:

pip install selenium

phantomjs is not a Python program; download and install the build for your operating system from the official site.
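
Before passing a path as executable_path, it can help to check that the binary is actually there. A small sketch, assuming either an explicit path or a phantomjs binary on PATH; the helper name resolve_phantomjs and the example location are hypothetical:

```python
import os
import shutil

def resolve_phantomjs(explicit_path=None, name="phantomjs"):
    """Return a usable path to the PhantomJS binary, or None if none is found."""
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Fall back to searching the directories listed in PATH.
    return shutil.which(name)

# On Windows, write the path as a raw string (r"...") so that sequences
# such as "\t" or "\b" inside it are not interpreted as escape characters.
path = resolve_phantomjs(r"C:\path\to\phantomjs.exe")  # hypothetical location
if path is None:
    print("phantomjs not found; download it from phantomjs.org")
else:
    print("using", path)
```

The returned value can then be handed to webdriver.PhantomJS(executable_path=path).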

from selenium import webdriver
import time
 
# Raw string, so the backslashes in the Windows path are kept literally
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # crude fixed wait for the Ajax content to arrive
print(driver.find_element_by_id("content").text)
driver.close()

from selenium import webdriver
 
driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.set_window_size(1120, 550)
driver.get("http://duckduckgo.com/")
# Type the query into the search box, then click the search button
driver.find_element_by_id("search_form_input_homepage").send_keys("Nirvana")
driver.find_element_by_id("search_button_homepage").click()
print(driver.current_url)
driver.close()

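The get method blocks until the page has fully loaded before the program continues, but it cannot wait for content that Ajax fetches afterwards.
send_keys fills in an input field of a form.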

Waiting for the page to finish rendering

# Wait for the page to finish rendering
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
 
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Wait up to 10 seconds for the element with id "loadedButton" to appear
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()
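
Conceptually, WebDriverWait(driver, 10).until(...) keeps re-checking a condition until it succeeds or the timeout expires. The following is a simplified model of that polling loop, not Selenium's actual code; wait_until and button_loaded are hypothetical names, and the stand-in condition simply becomes true on the third check:

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)

# Stand-in for "the element has appeared": becomes true on the third check.
checks = {"count": 0}

def button_loaded():
    checks["count"] += 1
    return "loadedButton" if checks["count"] >= 3 else None

print(wait_until(button_loaded, timeout=2.0, poll_interval=0.01))  # prints loadedButton
```

Compared with a fixed time.sleep(3), this returns as soon as the condition holds and fails loudly when it never does.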

Handling JavaScript redirects

# Handle JavaScript redirects
from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException
 
def waitForLoad(driver):
    # Remember the root element of the page we start on
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # A different <html> element means a new page has been loaded
            if not (elem == driver.find_element_by_tag_name("html")):
                return
        # A StaleElementReferenceException means elem no longer exists, i.e. the page has redirected.
        except StaleElementReferenceException:
            return
 
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)
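
The idea behind waitForLoad is that a redirect replaces the page's root html element, so the element fetched before the redirect no longer matches the one fetched after it. A minimal simulation of that logic without a browser; FakeElement, FakeDriver and wait_for_redirect are hypothetical stand-ins, not Selenium classes:

```python
import time

class FakeElement:
    """Hypothetical stand-in for a web element, identified by an id."""

    def __init__(self, element_id):
        self.id = element_id

    def __eq__(self, other):
        return isinstance(other, FakeElement) and self.id == other.id

class FakeDriver:
    """Hypothetical driver whose page redirects after a number of lookups."""

    def __init__(self, lookups_before_redirect):
        self.lookups_left = lookups_before_redirect

    def find_html(self):
        if self.lookups_left > 0:
            self.lookups_left -= 1
            return FakeElement("html-of-old-page")
        return FakeElement("html-of-new-page")

def wait_for_redirect(driver, max_polls=20, interval=0.5):
    """Return True once the root element changes, False after max_polls tries."""
    original = driver.find_html()
    for _ in range(max_polls):
        time.sleep(interval)
        if not (driver.find_html() == original):
            return True   # root element was replaced: a new page loaded
    return False          # gave up waiting

print(wait_for_redirect(FakeDriver(lookups_before_redirect=3), interval=0.01))  # prints True
```

With a real driver the old element may also raise StaleElementReferenceException when it is used, which is the signal the original function catches.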

Setting the PhantomJS User-Agent

Some web servers restrict the User-Agent and may reject requests from an unfamiliar one.
To set the PhantomJS user-agent, set the desired capability named "phantomjs.page.settings.userAgent".

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
 
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
 
driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")
cap_dict = driver.desired_capabilities  # list all available desired_capabilities
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()
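
The examples wrap DesiredCapabilities.PHANTOMJS in dict(...) before changing it. A short sketch of why the copy matters, using a hypothetical stand-in class (FakeCapabilities, with made-up default values) instead of Selenium's own:

```python
class FakeCapabilities:
    """Hypothetical stand-in for a class that holds shared default capabilities."""
    PHANTOMJS = {"browserName": "phantomjs", "javascriptEnabled": True}

# Copy first, then customise: the shared defaults stay untouched.
dcap = dict(FakeCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (example UA)"

print("phantomjs.page.settings.userAgent" in dcap)                        # True
print("phantomjs.page.settings.userAgent" in FakeCapabilities.PHANTOMJS)  # False
```

Without the copy, every driver created later in the same process would inherit the modified user-agent.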

Demo

github

# Simulate drag and drop
from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path="phantomjs/bin/phantomjs")
driver.get("http://pythonscraping.com/pages/javascript/draggableDemo.html")

print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)
##################################################################################
# Take a screenshot
driver.get_screenshot_as_file("tmp/pythonscraping.png")

##################################################################################
# Log in to Zhihu, then automatically click "More" at the bottom of the page to load more content
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import time
import sys

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.get("http://www.zhihu.com/#signin")
#driver.find_element_by_name("email").send_keys("your email")
driver.find_element_by_xpath('//input[@name="password"]').send_keys("your password")
#driver.find_element_by_xpath('//input[@name="password"]').send_keys(Keys.RETURN)
time.sleep(2)
driver.get_screenshot_as_file("show.png")
#driver.find_element_by_xpath('//button[@class="sign-button"]').click()
driver.find_element_by_xpath('//form[@class="zu-side-login-box"]').submit()

try:
    # Wait for the page to finish loading
    dr=WebDriverWait(driver,5)
    dr.until(lambda the_driver:the_driver.find_element_by_xpath('//a[@class="zu-top-nav-userinfo "]').is_displayed())
except Exception:
    print("Login failed")
    driver.quit()  # shut down the PhantomJS process before exiting
    sys.exit(1)
driver.get_screenshot_as_file("show.png")
#user=driver.find_element_by_class_name("zu-top-nav-userinfo ")
#webdriver.ActionChains(driver).move_to_element(user).perform() # move the mouse over my user name
loadmore=driver.find_element_by_xpath('//a[@id="zh-load-more"]')
# Queue the mouse move and the click, then run them together
actions = ActionChains(driver)
actions.move_to_element(loadmore)
actions.click(loadmore)
actions.perform()
time.sleep(2)
driver.get_screenshot_as_file("show.png")
print(driver.current_url)
print(driver.page_source)
driver.quit()
##################################################################################
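
The demo saves a screenshot to a relative path such as tmp/pythonscraping.png. As far as the Selenium 3 Python bindings go, the file is written with a plain open(), so the call is likely to fail when the tmp directory does not exist yet. A small sketch that creates the parent directory first; the helper name prepare_screenshot_path is hypothetical and the demo below writes into a temporary directory rather than a real project:

```python
import os
import tempfile

def prepare_screenshot_path(path):
    """Create the parent directory of `path` if needed and return `path`."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path

# Demonstration inside a throwaway directory.
base = tempfile.mkdtemp()
target = prepare_screenshot_path(os.path.join(base, "tmp", "pythonscraping.png"))
print(os.path.isdir(os.path.dirname(target)))  # prints True

# With a real driver the call would then look like:
# driver.get_screenshot_as_file(prepare_screenshot_path("tmp/pythonscraping.png"))
```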
