Page Downloader (My Java Crawler, Part 1)

wfc_666, published 2019-08-15 12:27 / 2934 reads


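An aside

Maven packaging

Predefined packaging descriptors

Packaging is done with the maven-assembly-plugin. The plugin is configured inside the build element of pom.xml, in the following form.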


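    <project>
      [...]
      <build>
        [...]
        <plugins>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
            </configuration>
            [...]
          </plugin>
          [...]
        </plugins>
      </build>
    </project>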

The executions element binds a goal to a phase of the Maven build lifecycle. Here the single goal is bound to the package phase.


  
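    <executions>
      <execution>
        <id>make-assembly</id>
        <phase>package</phase>
        <goals>
          <goal>single</goal>
        </goals>
      </execution>
    </executions>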

Creating an executable JAR


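    <project>
      [...]
      <build>
        [...]
        <plugins>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
              [...]
              <archive>
                <manifest>
                  <mainClass>org.sample.App</mainClass>
                </manifest>
              </archive>
            </configuration>
            [...]
          </plugin>
          [...]
        </plugins>
      </build>
    </project>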

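Custom packaging

As shown above, a predefined descriptor is selected with the descriptorRefs element. To use a descriptor of your own, point to it with the descriptors element instead.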


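    <project>
      [...]
      <build>
        [...]
        <plugins>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
              <descriptors>
                <descriptor>src/assembly/src.xml</descriptor>
              </descriptors>
            </configuration>
            [...]
          </plugin>
          [...]
        </plugins>
      </build>
    </project>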

src.xml looks roughly like this:


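    <assembly>
        <id>snapshot</id>
        <formats>
            <format>jar</format>
        </formats>
        <dependencySets>
            <dependencySet>
                <outputDirectory>/lib</outputDirectory>
            </dependencySet>
        </dependencySets>
        [...]
    </assembly>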

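The fileSets element lets you control packaging at the granularity of files or directories. A common setup is a bin directory that holds the launch scripts. How do you run a package built this way?
There are two ways:

List every dependency on the classpath with -cp, then run: java -cp lib/dependency1:lib/dependency2 fully.qualified.ClassName

java -Djava.ext.dirs=lib fully.qualified.ClassName. This relies on the extension mechanism, which was removed in Java 9, so it only works on Java 8 and earlier.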
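Typing out every jar for -cp gets tedious once lib holds more than a few files. The sketch below is one way to generate that string. ClasspathBuilder is a hypothetical helper name, and it assumes all dependency jars sit directly under a single lib directory. The application's own jar still has to be added to the result.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: builds the value to pass to `java -cp` from the jars in a lib directory.
final class ClasspathBuilder {

  /** Joins every *.jar directly under libDir with the platform path separator, sorted for stable output. */
  static String build(Path libDir) throws IOException {
    try (Stream<Path> entries = Files.list(libDir)) {
      return entries
          .filter(p -> p.getFileName().toString().endsWith(".jar"))
          .sorted()
          .map(Path::toString)
          .collect(Collectors.joining(File.pathSeparator));
    }
  }

  public static void main(String[] args) throws IOException {
    // Assumption: the directory defaults to "lib", matching the outputDirectory used above.
    Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
    System.out.println(build(libDir));
  }
}
```

The java launcher also accepts a classpath wildcard, so java -cp "lib/*" fully.qualified.ClassName picks up every jar in lib without listing them one by one.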

Page downloader

Preparation

Add the Maven dependencies


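    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>fluent-hc</artifactId>
        <version>4.5.3</version>
    </dependency>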

Downloader, version 1

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.Charset;

public void testGet1() {
  CloseableHttpClient clients = HttpClients.createDefault();
  RequestBuilder builder = RequestBuilder.get("http://www.qq.com");
  HttpGet httpGet = new HttpGet(builder.build().getURI());
  CloseableHttpResponse execute = null;
  try {
    execute = clients.execute(httpGet);
    HttpEntity entity = execute.getEntity();
    // You could plug in your own charset-detection method here
    String page = EntityUtils.toString(entity);
    System.out.println(page);
  } catch (Exception e) {
    e.printStackTrace();
  } finally {
    if (execute != null) {
      try {
        execute.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}
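The comment in version 1 leaves charset detection as an exercise. As far as I know, EntityUtils.toString(entity) in HttpClient 4.x falls back to ISO-8859-1 when the response does not declare a charset, which garbles Chinese pages. Below is a minimal sketch of one possible detection strategy: check the Content-Type header first, then look for a charset declaration in the first bytes of the HTML. CharsetSniffer and its method names are hypothetical and not part of HttpClient, and the 4096-byte sniffing window is an arbitrary assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient: picks a charset for decoding a downloaded page.
final class CharsetSniffer {

  // Matches charset=xxx in a Content-Type header value or in an HTML <meta> tag.
  private static final Pattern CHARSET =
      Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

  /** Returns the charset named in text, or null if none is named or the name is not supported. */
  static Charset find(String text) {
    if (text == null) {
      return null;
    }
    Matcher m = CHARSET.matcher(text);
    if (!m.find()) {
      return null;
    }
    try {
      return Charset.forName(m.group(1));
    } catch (IllegalArgumentException e) { // covers both illegal and unsupported charset names
      return null;
    }
  }

  /** Tries the header first, then the start of the body, then the caller's fallback. */
  static Charset detect(String contentType, byte[] body, Charset fallback) {
    Charset fromHeader = find(contentType);
    if (fromHeader != null) {
      return fromHeader;
    }
    // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup survives for sniffing.
    int len = Math.min(body.length, 4096);
    String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
    Charset fromMeta = find(head);
    return fromMeta != null ? fromMeta : fallback;
  }

  public static void main(String[] args) {
    byte[] body = "<html><head><meta charset=\"gb2312\"></head></html>"
        .getBytes(StandardCharsets.US_ASCII);
    System.out.println(detect("text/html", body, StandardCharsets.UTF_8));
  }
}
```

To use it in testGet1, one option is to read the body with EntityUtils.toByteArray(entity), take the header value from entity.getContentType(), and decode with new String(bytes, charset).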

Version 2

Anonymous inner class version

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;

public void testGet2() {
  CloseableHttpClient clients = HttpClients.createDefault();
  RequestBuilder builder = RequestBuilder.get("http://www.qq.com");
  HttpGet httpGet = new HttpGet(builder.build().getURI());
  try {
    // The handler converts the response to a String; the client releases the connection afterwards
    String page = clients.execute(httpGet, new ResponseHandler<String>() {
      @Override
      public String handleResponse(HttpResponse httpResponse) throws ClientProtocolException, IOException {
        HttpEntity entity = httpResponse.getEntity();
        return EntityUtils.toString(entity);
      }
    });
    System.out.println(page);
  } catch (Exception e) {
    e.printStackTrace();
  }
}

The anonymous inner class can be replaced with a lambda expression, written as:

String page = clients.execute(httpGet, (HttpResponse httpResponse) -> {
    HttpEntity entity = httpResponse.getEntity();
    return EntityUtils.toString(entity);
  });
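The replacement works because ResponseHandler declares a single abstract method, which makes it usable as a functional interface. The sketch below shows the same equivalence without any HttpClient dependency. Handler and LambdaDemo are hypothetical stand-ins that only mirror the shape of ResponseHandler and clients.execute.

```java
import java.io.IOException;

// Hypothetical stand-in for ResponseHandler<T>: one abstract method, so a lambda can implement it.
interface Handler<T> {
  T handle(String rawResponse) throws IOException;
}

final class LambdaDemo {

  // Plays the role of clients.execute(request, handler): hands a canned "response" to the handler.
  static <T> T execute(String rawResponse, Handler<? extends T> handler) throws IOException {
    return handler.handle(rawResponse);
  }

  public static void main(String[] args) throws IOException {
    // Anonymous inner class form.
    Integer viaClass = execute("hello", new Handler<Integer>() {
      @Override
      public Integer handle(String rawResponse) {
        return rawResponse.length();
      }
    });

    // Equivalent lambda form; the compiler infers Handler<Integer> from the assignment target.
    Integer viaLambda = execute("hello", raw -> raw.length());

    System.out.println(viaClass + " " + viaLambda); // prints "5 5"
  }
}
```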

Version 3

Uses the API in the org.apache.http.client.fluent package

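import org.apache.http.client.fluent.Request;
import org.apache.http.client.fluent.Response;

// execute() and returnContent() both throw IOException, so the method has to declare it
public void testGet3() throws IOException {
  Response response = Request.Get("http://www.qq.com").execute();
  // Decode the body explicitly as GB2312
  String s = response.returnContent().asString(Charset.forName("gb2312"));
  System.out.println(s);
}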


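Copyright belongs to the author; do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: https://www.ucloud.cn/yun/67968.html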
