
WebMagic + Selenium + PhantomJS in Practice

zhangxiangliang / 2184 reads


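It is more practical to post the code and explain it directly.
The webmagic-selenium module feels rather underpowered, but it still has ideas worth borrowing. Drawing on it, I wrote my own SeleniumDownloader, as follows: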

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;

import java.util.Map;

/**
 * @author taojw
 *
 */
public class SeleniumDownloader  implements Downloader{
    private static final Logger log=LoggerFactory.getLogger(SeleniumDownloader.class);
    private int sleepTime=3000;//3s
    private SeleniumAction action=null;
    private WebDriverPool webDriverPool=new WebDriverPool();
    public SeleniumDownloader(){
    }
    public SeleniumDownloader(int sleepTime,WebDriverPool pool){
        this(sleepTime,pool,null);
    }
    public SeleniumDownloader(int sleepTime,WebDriverPool pool,SeleniumAction action){
        this.sleepTime=sleepTime;
        this.action=action;
        if(pool!=null){
            webDriverPool=pool;
        }
    }
    public SeleniumDownloader setSleepTime(int sleepTime) {
        this.sleepTime = sleepTime;
        return this;
    }
    public void setOperator(SeleniumAction action){
        this.action=action;
    }
    @Override
    public Page download(Request request, Task task) {
        WebDriver webDriver;
        try {
            webDriver = webDriverPool.get();
        } catch (InterruptedException e) {
            log.warn("interrupted", e);
            return null;
        }
        log.info("downloading page " + request.getUrl());
        Page page = new Page();
        try {
            webDriver.get(request.getUrl());
            Thread.sleep(sleepTime);
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (Exception e) {
            webDriverPool.close(webDriver);
            page.setSkip(true);
            return page;
        }
//        WindowUtil.changeWindow(webDriver);
        WebDriver.Options manage = webDriver.manage();
        Site site = task.getSite();
        if (site.getCookies() != null) {
            for (Map.Entry<String, String> cookieEntry : site.getCookies()
                    .entrySet()) {
                Cookie cookie = new Cookie(cookieEntry.getKey(),
                        cookieEntry.getValue());
                manage.addCookie(cookie);
            }
        }
        manage.window().maximize();
        if(action!=null){
            action.execute(webDriver);
        }
        SeleniumAction reqAction=(SeleniumAction) request.getExtra("action");
        if(reqAction!=null){
            reqAction.execute(webDriver);
        }

        WebElement webElement = webDriver.findElement(By.xpath("/html"));
        String content = webElement.getAttribute("outerHTML");
        
        page.setRawText(content);
        page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content,
                webDriver.getCurrentUrl())));
        page.setUrl(new PlainText(webDriver.getCurrentUrl()));
        page.setRequest(request);
        webDriverPool.returnToPool(webDriver);
        return page;
    }

    @Override
    public void setThread(int thread) {
        
    }

}

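Features:
A SeleniumAction hook can be supplied when calling Spider.setDownloader, to implement Selenium operations shared by every request. This adds flexibility.
Each request can also carry an action parameter whose value is a SeleniumAction object, so custom Selenium operations can be defined per request. This adds further flexibility.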

import org.openqa.selenium.WebDriver;

/**
 * @author taojw
 *
 */
public interface SeleniumAction {
    void execute(WebDriver driver);
}
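
The order in which the downloader runs the two kinds of hooks can be sketched without Selenium on the classpath. This is only an illustration: `Consumer<D>` stands in for SeleniumAction and `D` for WebDriver, and the names `HookDispatchSketch` and `runHooks` are hypothetical, not part of WebMagic or of the code above.

```java
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for the hook dispatch in SeleniumDownloader.download:
// D plays the role of WebDriver, Consumer<D> the role of SeleniumAction.
class HookDispatchSketch<D> {
    private final Consumer<D> globalAction; // downloader-level hook, may be null

    HookDispatchSketch(Consumer<D> globalAction) {
        this.globalAction = globalAction;
    }

    // Runs the downloader-level hook first, then the per-request hook
    // stored under the "action" key of the request extras, if present.
    void runHooks(D driver, Map<String, Object> requestExtras) {
        if (globalAction != null) {
            globalAction.accept(driver);
        }
        Object extra = requestExtras == null ? null : requestExtras.get("action");
        if (extra instanceof Consumer) {
            @SuppressWarnings("unchecked")
            Consumer<D> perRequest = (Consumer<D>) extra;
            perRequest.accept(driver);
        }
    }
}
```

Unlike the direct cast in the downloader, this sketch silently skips an extra of the wrong type instead of throwing a ClassCastException. Which behaviour is preferable depends on how strictly you want misconfigured requests to fail.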

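WebDriverPool implementation: note that WebDriver instances are pooled to keep performance acceptable.
It is also adapted from webmagic-selenium, with some modifications.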

import com.fh.util.FileUtil;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * @author taojw
 */
public class WebDriverPool {
    private Logger logger = LoggerFactory.getLogger(getClass());

    private int CAPACITY = 5;
    private AtomicInteger refCount = new AtomicInteger(0);
    private static final String DRIVER_PHANTOMJS = "phantomjs";

    /**
     * store webDrivers available
     */
    private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>(
            CAPACITY);

    private static String PHANTOMJS_PATH;
    private static DesiredCapabilities caps = DesiredCapabilities.phantomjs();
    static {
        PHANTOMJS_PATH = FileUtil.getCommonProp("phantomjs.path");
        caps.setJavascriptEnabled(true);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                PHANTOMJS_PATH);
        caps.setCapability("takesScreenshot", true);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_PAGE_CUSTOMHEADERS_PREFIX
                        + "User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        caps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
                "--load-images=no");

    }

    public WebDriverPool() {
    }

    public WebDriverPool(int poolsize) {
        this.CAPACITY = poolsize;
        innerQueue = new LinkedBlockingDeque<WebDriver>(poolsize);
    }

    public WebDriver get() throws InterruptedException {
        WebDriver poll = innerQueue.poll();
        if (poll != null) {
            return poll;
        }
        if (refCount.get() < CAPACITY) {
            synchronized (innerQueue) {
                if (refCount.get() < CAPACITY) {

                    WebDriver mDriver = new PhantomJSDriver(caps);
                    // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
                    mDriver.manage().timeouts()
                            .pageLoadTimeout(60, TimeUnit.SECONDS);
                    // mDriver.manage().window().setSize(new Dimension(1366,
                    // 768));
                    innerQueue.add(mDriver);
                    refCount.incrementAndGet();
                }
            }
        }
        return innerQueue.take();
    }

    public void returnToPool(WebDriver webDriver) {
        // webDriver.quit();
        // webDriver=null;
        innerQueue.add(webDriver);
    }

    public void close(WebDriver webDriver) {
        refCount.decrementAndGet();
        webDriver.close();
        webDriver.quit();
        webDriver = null;
    }

    public void shutdown() {
        try {
            for (WebDriver driver : innerQueue) {
                close(driver);
            }
            innerQueue.clear();
        } catch (Exception e) {
//            e.printStackTrace();
            logger.warn("failed to shut down WebDriverPool",e);
        }
    }
}

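After the modifications:
Only PhantomJS is supported as the browser driver.
PhantomJS-related configuration has been added.
The queue-size control logic has been changed.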
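
The size-control idea (create instances lazily until a cap is reached, then block until one is returned) can be sketched generically. This is a minimal sketch under assumptions, not the pool above: `BoundedPoolSketch`, `giveBack` and `createdCount` are hypothetical names, and it reserves a slot with compare-and-set rather than the synchronized double check used in WebDriverPool.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical generic pool: T would be WebDriver in the article's setting.
class BoundedPoolSketch<T> {
    private final int capacity;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger(0);
    private final BlockingDeque<T> idle;

    BoundedPoolSketch(int capacity, Supplier<T> factory) {
        this.capacity = capacity;
        this.factory = factory;
        this.idle = new LinkedBlockingDeque<T>(capacity);
    }

    T get() {
        // 1. Reuse an idle instance when one is available.
        T free = idle.poll();
        if (free != null) {
            return free;
        }
        // 2. Otherwise try to reserve a slot and create a new instance.
        while (true) {
            int n = created.get();
            if (n >= capacity) {
                break;
            }
            if (created.compareAndSet(n, n + 1)) {
                try {
                    return factory.get();
                } catch (RuntimeException e) {
                    created.decrementAndGet(); // release the reserved slot
                    throw e;
                }
            }
        }
        // 3. At capacity: block until another caller returns an instance.
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled item", e);
        }
    }

    void giveBack(T item) {
        idle.add(item);
    }

    int createdCount() {
        return created.get();
    }
}
```

Reserving the slot before calling the factory keeps the number of live instances at or below the cap even when several threads race, at the cost of having to release the slot if creation fails.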

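WindowUtil
Note that the loadAll method is implemented quite cleverly. On pages that load content as you scroll, jumping straight to the bottom can leave the middle section unloaded, so you would otherwise have to scroll slowly through every page. loadAll instead reads the page's scrollable height, resizes the browser window to that height, and refreshes, after which all the content is loaded.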

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.*;

import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/**
 * @author taojw
 *
 */
public class WindowUtil {
    
    /**
     * Scrolls the window.
     * @param driver
     * @param height
     */
    public static void scroll(WebDriver driver,int height){
        ((JavascriptExecutor)driver).executeScript("window.scrollTo(0,"+height+" );");    
    }
    /**
     * Resizes the window to fit the whole page. This takes some time, so wait a reasonable interval afterwards.
     * @param driver
     */
    public static void loadAll(WebDriver driver){
        Dimension od=driver.manage().window().getSize();
        int width=driver.manage().window().getSize().width;
        // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
        driver.manage().timeouts().pageLoadTimeout(60, TimeUnit.SECONDS); 
        long height=(Long)((JavascriptExecutor)driver).executeScript("return document.body.scrollHeight;");
        driver.manage().window().setSize(new Dimension(width, (int)height));
        driver.navigate().refresh();
    }
    public static void taskScreenShot(WebDriver driver,File saveFile){
        File src=((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
        try {
            FileUtils.copyFile(src, saveFile);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    public static void changeWindow(WebDriver driver){
        // Get the handle of the current window
        String handle = driver.getWindowHandle();
        // Iterate over all window handles and switchTo() any handle that is not the current one
        for (String handles : driver.getWindowHandles()) {
            if (handles.equals(handle))
                continue;
            driver.switchTo().window(handles);
        }
    }
}
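
For contrast, the step-by-step scrolling that loadAll is meant to avoid would need a list of scroll positions like the one below. This is a hedged sketch: `ScrollStepsSketch` and `scrollOffsets` are hypothetical names, and it assumes the page height stays fixed while scrolling, which lazily loaded pages often violate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the y offsets to visit when scrolling a page
// one viewport at a time, ending exactly at the bottom.
class ScrollStepsSketch {
    static List<Integer> scrollOffsets(int pageHeight, int viewportHeight) {
        if (viewportHeight <= 0) {
            throw new IllegalArgumentException("viewportHeight must be positive");
        }
        // The largest offset at which the viewport still lies inside the page.
        int maxScroll = Math.max(0, pageHeight - viewportHeight);
        List<Integer> offsets = new ArrayList<Integer>();
        for (int y = 0; y < maxScroll; y += viewportHeight) {
            offsets.add(y);
        }
        offsets.add(maxScroll);
        return offsets;
    }
}
```

Each offset could be passed to WindowUtil.scroll with a pause in between. Every step costs one wait, which is the overhead the resize-and-refresh approach trades for a single longer load.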

That wraps up the extensions to the crawler framework.

Hands-on: scraping Taobao shop information
/**
 * Shop sales information
 *
 * @author taojw
 */
@Scope("prototype")
@Component
public class TaoBaoShopInfoProcessor implements PageProcessor {
    private static final Logger log = LoggerFactory
            .getLogger(TaoBaoShopInfoProcessor.class);

    @Autowired
    private TaoBaoShopInfoService service;

    private Site site = Site
            .me()
            .setCharset("UTF-8")
            .setCycleRetryTimes(3)
            .setSleepTime(3 * 1000)
            .addHeader("Connection", "keep-alive")
            .addHeader("Cache-Control", "max-age=0")
            .addHeader("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0");

    private AtomicBoolean isPageAdd = new AtomicBoolean(false);
    private static AtomicBoolean running = new AtomicBoolean(false);
    private WebDriverPool pool=new WebDriverPool();
    @Override
    public Site getSite() {
        return this.site;
    }

    @Override
    public void process(Page page) {
        if (islistPage(page)) {
            List<String> urls = page.getHtml()
                    .$("dl.item a.J_TGoldData", "href").all();
            List<String> targetUrls = new ArrayList<String>();
            for (String url : urls) {
                targetUrls.add(url.trim());
            }
            page.addTargetRequests(targetUrls);
            if (isPageAdd.compareAndSet(false, true)) {
                // Pagination
                String pageinfo = page.getHtml()
                        .$(".pagination .page-info", "text").get();
                int pageCount = Integer.valueOf(pageinfo.split("/")[1]);
                String cururl = page.getUrl().get();
                // Only crawl the first 5 pages
                if(pageCount>5){
                    pageCount=5;
                }
                for (int i = 1; i < pageCount; i++) {
                    String tmp = cururl + "&pageNo=" + (i + 1);
                    page.addTargetRequest(tmp);
                }
            }
            return;
        }

        // Product page
        String curUrl = page.getUrl().get();
        boolean isTaoBao=curUrl.startsWith("https://item.taobao.com");
        boolean isTmall=curUrl.startsWith("https://detail.tmall.com");
        
        // String.split takes a regex, so the literal '?' must be escaped
        String tmpspm = curUrl.split("\\?")[1].split("&")[0];
        // spm code
        String spm = tmpspm.split("=")[1];
        // Shop URL
        String shopUrl="";
        // Product name
        String name="";
        // Price
        double price =0; 
        // Number of transactions in the last 30 days
        int sellCount=0;
        // Total transaction amount
        double allPrice=0;
        if(isTaoBao){
            shopUrl= page.getHtml()
                    .xpath("//div[@class='tb-shop-name']/dl/dd/strong/a/@href")
                    .get();
            shopUrl = shopUrl.split("\\?")[0];
            
            name = page.getHtml().xpath("//*[@id='J_Title']/h3/text()")
                    .get();
            try{
                price=Double.valueOf(page.getHtml()
                        .$("#J_PromoPriceNum", "text").get().split("-")[0].trim());
                }catch(Exception e){
                    
                    price=Double.valueOf(page.getHtml()
                            .$("#J_StrPrice .tb-rmb-num", "text").get().split("-")[0].trim());
                }
            sellCount = Integer.valueOf(page.getHtml()
                    .$("#J_SellCounter", "text").get());
            allPrice = Double.valueOf(price) * Double.valueOf(sellCount);
        }else if(isTmall){
            shopUrl= page.getHtml()
                    .xpath("//*[@id='side-shop-info']/div/h3/div/a/@href")
                    .get();
            shopUrl = shopUrl.split("\\?")[0];
            
            name = page.getHtml().$(".tb-detail-hd h1","text")
                    .get().trim();
        
            price=Double.valueOf(page.getHtml()
                        .$(".tm-price", "text").get().split("-")[0].trim());
                
            sellCount = Integer.valueOf(page.getHtml()
                    .$(".tm-count", "text").get().trim());
            allPrice = Double.valueOf(price) * Double.valueOf(sellCount);
        }

        // Collection date
        // Timestamp recordDate=new Timestamp(new Date().getTime());
        String recordDate = DateUtil.formatDate(new Date(), "yyyy-MM-dd");

        log.debug(shopUrl + ":" + spm + ":" + name + ":" + price + ":"
                + sellCount + ":" + allPrice + ":" + recordDate);

        PageData pd = new PageData();
        pd.put("id", UUID.randomUUID().toString());
        pd.put("shopUrl", shopUrl);
        pd.put("spm", spm);
        pd.put("name", name);
        pd.put("price", price);
        pd.put("sellCount", sellCount);
        pd.put("allPrice", allPrice);
        pd.put("recordDate", recordDate);
        service.saveData(pd);
    }

    private boolean islistPage(Page page) {
        String tmp = page.getHtml().$("#J_PromoPrice").get();
        if (StringUtils.isBlank(tmp)) {
            return true;
        }
        return false;
    }

    public void start() {
        if (running.compareAndSet(false, true)) {
            try {
                service.emptyTable();
                List urls = service.getShopUrl();
                if (urls == null) {
                    log.error("failed to fetch shop URLs, aborting crawl");
                }
                String[] urlStrs=null;
                int size=50;
//                int size=urls.size();
                if(urls.size()
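
The two pieces of string handling in the processor, building the follow-up page URLs and pulling the first query value (the spm code) out of a product URL, can be tried in isolation. This is a minimal sketch under assumptions: `TaobaoUrlSketch` and its method names are hypothetical, the page info text is assumed to look like "1/12", and the URLs in the usage lines are made-up examples. String.split takes a regular expression, so the `?` has to be escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers mirroring the pagination and spm extraction steps.
class TaobaoUrlSketch {
    // pageInfo is assumed to look like "1/12" (current page / total pages).
    // Returns URLs for pages 2..min(total, maxPages); page 1 is the URL itself.
    static List<String> followUpPageUrls(String firstPageUrl, String pageInfo, int maxPages) {
        int pageCount = Integer.parseInt(pageInfo.split("/")[1].trim());
        int limit = Math.min(pageCount, maxPages);
        List<String> urls = new ArrayList<String>();
        for (int pageNo = 2; pageNo <= limit; pageNo++) {
            urls.add(firstPageUrl + "&pageNo=" + pageNo);
        }
        return urls;
    }

    // Returns the value of the first query parameter, or null if there is none.
    static String firstQueryValue(String url) {
        String[] parts = url.split("\\?", 2); // '?' escaped: split expects a regex
        if (parts.length < 2) {
            return null;
        }
        String firstPair = parts[1].split("&")[0];
        String[] keyValue = firstPair.split("=", 2);
        return keyValue.length < 2 ? null : keyValue[1];
    }
}
```

For example, `followUpPageUrls("https://shop.example.com/search.htm?search=y", "1/12", 5)` yields four URLs ending in `&pageNo=2` through `&pageNo=5`. Returning null for a URL without a query string avoids the ArrayIndexOutOfBoundsException that an unguarded `[1]` would raise.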
Scraping Maoyan box-office data

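Maoyan renders its box-office figures with an obfuscated icon font, and the code mapped to each digit changes on every load. So this time Selenium is used to load the page, take a screenshot, and crop out an image for each number. Because the Maoyan figures are laid out regularly, a model trained with Google's Tesseract-OCR is then used to recognize the cropped digit images.

ImageUtil handles the cropping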

import java.io.IOException;

import net.coobird.thumbnailator.Thumbnails;
import net.coobird.thumbnailator.geometry.Position;
import net.coobird.thumbnailator.geometry.Size;

/**
 * @author taojw
 *
 */
public class ImageUtil {
    public static void crop(String srcfile,String destfile,ImageRegion region){
        // Crop at the given coordinates
        try {
            Thumbnails.of(srcfile)  
                    .sourceRegion(region.x, region.y, region.width, region.height)  
                    .size(region.width, region.height).outputQuality(1.0) 
                    //.keepAspectRatio(false)  // do not keep the aspect ratio
                    .toFile(destfile);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }  
    }
    public static void main(String[] args) {
        crop("D:data111.png","D:data1112.png",new ImageRegion(66, 264, 422, 426));
    }
}
/**
 * @author taojw
 *
 */
public class ImageRegion {
    public int x;
    public int y;
    public int width;
    public int height;
    public ImageRegion(int x,int y,int width,int height){
        this.x=x;
        this.y=y;
        this.width=width;
        this.height=height;
    }
}
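
If Thumbnailator is not on the classpath, roughly the same crop can be done with the JDK alone. This is a sketch under assumptions, not a drop-in replacement: `JdkCropSketch` is a hypothetical name, `java.awt.Rectangle` stands in for ImageRegion, and the region is clipped to the image bounds, which the Thumbnailator call above may handle differently.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical JDK-only crop; Rectangle plays the role of ImageRegion.
class JdkCropSketch {
    static BufferedImage crop(BufferedImage source, Rectangle region) {
        // Clip the requested region to the image so getRGB cannot go out of bounds.
        Rectangle bounds = new Rectangle(0, 0, source.getWidth(), source.getHeight());
        Rectangle clipped = region.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("region lies outside the image");
        }
        // Copy the pixels into a new image so the result does not share data with the source.
        int[] pixels = source.getRGB(clipped.x, clipped.y, clipped.width, clipped.height,
                null, 0, clipped.width);
        BufferedImage copy = new BufferedImage(clipped.width, clipped.height,
                BufferedImage.TYPE_INT_RGB);
        copy.setRGB(0, 0, clipped.width, clipped.height, pixels, 0, clipped.width);
        return copy;
    }
}
```

Reading the screenshot and writing the result would go through `javax.imageio.ImageIO.read` and `ImageIO.write`. Saving the crop as PNG keeps it lossless, which generally helps OCR.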

TesseractOcrUtil invokes the tesseract process and returns the recognition result.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.UUID;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.fh.util.FileUtil;

/**
 * @author taojw
 *
 */
public class TesseractOcrUtil {
    private static final Logger log = LoggerFactory
            .getLogger(TesseractOcrUtil.class);
    private static final String tessPath;
    private static final String basePath;
    static {
        tessPath = FileUtil.getCommonProp("tesseract.path");
        basePath = new File(tessPath).getParentFile().getAbsolutePath();
    }

    public static String getByLangNum(String imagePath) {
        return get(imagePath, "num");
    }

    public static String getByLangChi(String imagePath) {
        return get(imagePath, "chi_sim");
    }

    public static String getByLangEng(String imagePath) {
        return get(imagePath, "eng");
    }

    public static String get(String imagePath, String lang) {
        String outName = UUID.randomUUID().toString();
        String outPath = basePath + File.separator
                + outName + ".txt";
//        String cmd = tessPath + " " + imagePath + " " + outName + " -l " + lang;
        ProcessBuilder pb = new ProcessBuilder();
        pb.directory(new File(basePath));
        
        pb.command(tessPath,imagePath,outName,"-l",lang);
        
        pb.redirectErrorStream(true);
        
        Process process=null;
        String errormsg = "";
        String res = null;
        try {
            process = pb.start();
            // tesseract.exe 1.jpg 1 -l chi_sim
            int excode = process.waitFor();
            
            if (excode == 0) {
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        new FileInputStream(outPath), "UTF-8"));
                res = in.readLine();
                IOUtils.closeQuietly(in);
            } else {
                switch (excode) {
                case 1:
                    errormsg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    errormsg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    errormsg = "Unsupported image format.";
                    break;
                default:
                    errormsg = "Errors occurred.";
                }
                log.error("when ocr picture " + imagePath
                        + " an error occurred. " + errormsg);
            }

        } catch (IOException e) {
            e.printStackTrace();
            log.warn("ocr process hit an io error",e);
        } catch (InterruptedException e) {
            e.printStackTrace();
            log.warn("ocr process was interrupted unexpectedly",e);
        }finally{
            FileUtils.deleteQuietly(new File(imagePath));
            FileUtils.deleteQuietly(new File(outPath));
        }
        if(res!=null){
            res=res.trim();
        }
        return res;
    }
}
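
Two details of the utility above are easy to check on their own: the argument list handed to ProcessBuilder, and reading the first line of the result. This is a hedged sketch: `OcrCommandSketch` and its methods are hypothetical names, and the argument order simply follows the `tesseract.exe 1.jpg 1 -l chi_sim` comment in the code rather than any further knowledge of the CLI.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers extracted from the OCR utility for illustration.
class OcrCommandSketch {
    // Each argument is a separate list element, so ProcessBuilder passes it
    // to the process as-is and paths containing spaces need no quoting.
    static List<String> buildCommand(String tessPath, String imagePath,
                                     String outBase, String lang) {
        return Arrays.asList(tessPath, imagePath, outBase, "-l", lang);
    }

    // Reads and trims the first line; try-with-resources closes the reader
    // even if readLine throws. Returns null when the source is empty.
    static String firstLineTrimmed(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line = in.readLine();
            return line == null ? null : line.trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The list from `buildCommand` can be passed straight to `new ProcessBuilder(command)`. For a result file, a reader opened with `java.nio.file.Files.newBufferedReader(path, StandardCharsets.UTF_8)` could be handed to `firstLineTrimmed`.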
/**
 * @author taojw
 *
 */
public class MaoyanTest implements PageProcessor{
    private static Site site=Site.me().setCharset("UTF-8").setUserAgent(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {

    }
    public  void start() {
        
        Spider cnSpider = Spider.create(this).setDownloader(new SeleniumDownloader(5000,null,new TestAction()))
//                .addUrl("https://shop34068488.taobao.com/?spm=a230r.7195193.1997079397.2.JLFlPa")
//                .addUrl("http://piaofang.maoyan.com/company/cinema?date=2017-01-18&webCityId=288&cityTier=0&page=1&cityName=%E6%8F%AD%E9%98%B3");
                .addUrl("http://piaofang.maoyan.com/company/cinema?date=2017-01-18&webCityId=84&cityTier=0&page=1&cityName=%E4%BF%9D%E5%AE%9A");
//                .addPipeline(new JsonFilePipeline("D:datawebmagicfile.json"))
        
        //SpiderMonitor.instance().register(cnSpider);
        cnSpider.run();
    }
    public static void main(String[] args) {
        new MaoyanTest().start();
    }
    
    private class TestAction implements SeleniumAction{

        @Override
        public void execute(WebDriver driver) {
            WindowUtil.loadAll(driver);
            try {
                Thread.sleep(5000);
                //WebDriverWait wait = new WebDriverWait(driver, 10);
                //wait.until(ExpectedConditions.presenceOfElementLocated(By.id("J_PromoPriceNum")));
                
                File src=((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
                String srcfile="D:data"+UUID.randomUUID().toString()+".png";
                FileUtils.copyFile(src, new File(srcfile));
                List<WebElement> movielist=driver.findElements(By.xpath("//*[@id='cinema-tbody']/tr"));
//                movielist.remove(0);
                for(int i=1;i

Links for reference:
Selenium article series: http://www.cnblogs.com/TankXi...
Selenium API: http://seleniumhq.github.io/s...
Tesseract-OCR sample training: http://blog.csdn.net/firehood...
Selenium multi-window switching: http://blog.csdn.net/meyoung0...

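Copyright belongs to the author. Please do not reproduce without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reproducing, please cite this article's address: https://www.ucloud.cn/yun/66543.html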
