Implementing a Book Crawler in Java with Jsoup
Initial Setup
The project will be published on Git later and updated there.
1. Target site: https://www.qb5.tw/
The crawler works from this page.
2. Database tables to create (they can also be auto-generated by JPA, as noted below):
1. novel — basic information about each novel
2. novel_chapter — chapter titles for each novel
3. novel_detail — the text of each chapter
3. The project is built on Spring Boot and uses Jsoup for the crawling. Create a Spring Boot project and add the following dependencies:
<dependencies>
    <!-- Spring MVC -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Spring Data JPA -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <!-- MySQL connector -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.11</version>
    </dependency>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
    </dependency>
    <!-- Jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <!-- utility library -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
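The starters above do not declare versions; those are managed by the Spring Boot parent POM. If your pom.xml does not already inherit from it, add something along these lines (the exact version is an assumption here; use whatever 2.x release your project targets):

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.3.12.RELEASE</version>
</parent>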
The contents of application.yml:
spring:
  # MySQL 8 datasource configuration
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/crawler?useSSL=false&useUnicode=true&characterEncoding=utf-8&serverTimezone=Asia/Shanghai
    username: root
    password: 123456
  # JPA configuration
  jpa:
    database: MySQL
    show-sql: true
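The three tables can be created by hand, or you can let JPA generate them from the entity classes below. One common option (assumed here, not part of the config above; adjust to taste) is to add this under the jpa section:

    hibernate:
      ddl-auto: update  # let Hibernate create/update the tables from the entity mappings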
Building the Project
1. First, create the POJO entity classes that map to the database tables.
@Entity
@Table(name = "novel")
public class Novel {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // novel title
    private String novel_name;
    // author
    private String author;
    // cover image
    private String img;
    // genre
    private String type;
    // status (ongoing, finished, ...)
    private String status;
    // popularity
    private String pop;
    // synopsis
    private String brief;
    // number of chapters
    private Long chapter_num;
    // generate getters/setters (and a toString(), since the crawler prints these objects)
}
@Entity
@Table(name = "novel_chapter")
public class NovelChapter {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // id of the novel this chapter belongs to
    private Long novel_id;
    // chapter title
    private String chapter_name;
    public NovelChapter() {
    }
    // the crawler below creates chapters through this two-argument constructor
    public NovelChapter(Long novel_id, String chapter_name) {
        this.novel_id = novel_id;
        this.chapter_name = chapter_name;
    }
}
@Entity
@Table(name = "novel_detail")
public class NovelDetail {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // chapter id
    private Long chapter_id;
    // chapter text; chapters are long, so map this to a TEXT/CLOB column
    // (without @Lob, JPA defaults to VARCHAR(255), which chapter text will overflow)
    @Lob
    private String chapter_content;
    public NovelDetail() {}
    public NovelDetail(Long chapter_id, String chapter_content) {
        this.chapter_id = chapter_id;
        this.chapter_content = chapter_content;
    }
}
2. Create the corresponding DAO, service, and impl classes.
The DAOs extend JpaRepository, which provides the database operations.
dao:
public interface NovelDao extends JpaRepository<Novel,Long> {
}
public interface NovelDetailDao extends JpaRepository<NovelDetail,Long> {
}
public interface NovelChapterDao extends JpaRepository<NovelChapter,Long> {
}
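Nothing more is needed in the DAO layer: the save and findAll(Example) calls used below are inherited from JpaRepository (the latter via QueryByExampleExecutor, which JpaRepository extends).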
The service interfaces expose the JpaRepository operations the crawler needs: save persists an entity, and findAll checks whether a matching record already exists.
public interface NovelService {
    public void save(Novel item);
    public List<Novel> findAll(Novel item);
}
public interface NovelDetailService {
    public void save(NovelDetail item);
    public List<NovelDetail> findAll(NovelDetail item);
}
public interface NovelChapterService {
    public void save(NovelChapter item);
    public List<NovelChapter> findAll(NovelChapter item);
}
Each impl class implements its service interface; the remaining implementations follow the same pattern (a sketch of one is shown after this example).
@Service
public class NovelChapterServiceImpl implements NovelChapterService {
    @Autowired
    private NovelChapterDao itemDao;

    @Override
    @Transactional
    public void save(NovelChapter item) {
        this.itemDao.save(item);
    }

    @Override
    public List<NovelChapter> findAll(NovelChapter item) {
        // build a query-by-example probe from the given entity
        Example<NovelChapter> example = Example.of(item);
        // query using the probe's non-null fields as conditions
        List<NovelChapter> list = this.itemDao.findAll(example);
        return list;
    }
}
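For completeness, here is what the analogous NovelServiceImpl looks like (a sketch following the same pattern; NovelDetailServiceImpl is identical apart from the types):

@Service
public class NovelServiceImpl implements NovelService {
    @Autowired
    private NovelDao itemDao;

    @Override
    @Transactional
    public void save(Novel item) {
        this.itemDao.save(item);
    }

    @Override
    public List<Novel> findAll(Novel item) {
        // query by example: the probe's non-null fields become WHERE conditions
        return this.itemDao.findAll(Example.of(item));
    }
}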
The Crawler
package com.itcrawler.itcrawler.jd.controller;

import com.itcrawler.itcrawler.jd.pojo.Novel;
import com.itcrawler.itcrawler.jd.pojo.NovelChapter;
import com.itcrawler.itcrawler.jd.pojo.NovelDetail;
import com.itcrawler.itcrawler.jd.service.NovelChapterService;
import com.itcrawler.itcrawler.jd.service.NovelDetailService;
import com.itcrawler.itcrawler.jd.service.NovelService;
import com.itcrawler.itcrawler.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import java.util.List;

@Component
public class ItemTask {
    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private NovelService itemService;
    @Autowired
    private NovelChapterService novelChapterService;
    @Autowired
    private NovelDetailService novelDetailService;

    // crawl the home page
    @Scheduled(fixedDelay = 100 * 1000) // how long to wait after a run finishes before the next one starts
    public void first() {
        String url = "https://www.qb5.tw/";
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // In a Jsoup selector, # matches an id and . matches a class. The selector below
        // starts at the div with id "main", descends to div#mainleft, then div.titletop,
        // and so on until it reaches the cover links we want. See the Jsoup docs if this
        // syntax is unfamiliar.
        Elements ele = doc.select("div#main div#mainleft div.titletop li.top div.pic a");
        for (int i = 0; i < ele.size(); i++) {
            String href = ele.get(i).attr("href");
            this.parse(href);
        }
    }
    // parse a book page, extract its data, and store it
    private void parse(String url) {
        // download the book page to parse
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        Novel novel = new Novel();
        // from the page we extract: title, cover image, author, synopsis, genre,
        // status (ongoing, finished, ...), popularity, and the chapter count
        Elements ele = doc.select("div#main div#bookdetail div.nav-mbx a[target=_blank]");
        // genre
        novel.setType(ele.get(1).text());
        String tit = doc.select("div#main div#bookdetail div#info h1").text();
        String[] split = tit.split("/");
        novel.setNovel_name(split[0]);
        novel.setAuthor(split[1]);
        novel.setImg(doc.select("div#main div#bookdetail div#picbox div.img_in img").attr("src"));
        Elements select = doc.select("div#main div#bookdetail div#info p.booktag span");
        novel.setPop(select.get(0).text());
        novel.setStatus(select.get(1).text());
        // cap the synopsis at 200 characters (guarding against shorter intros)
        String intro = doc.select("div#main div#bookdetail div#info div#intro").text();
        String brief = intro.substring(0, Math.min(200, intro.length()));
        brief = brief.replace("<br>", "").replace(" ", "");
        novel.setBrief(brief);
        System.out.println(novel);
        List<Novel> list = this.itemService.findAll(novel);
        if (list.size() == 0) { // not crawled before
            // save the novel
            this.itemService.save(novel);
            // fetch the chapter list
            Elements as = doc.select("div.zjbox dl.zjlist dd a");
            // storage is limited, so only crawl the first 10 chapters of each book
            for (int i = 0; i < as.size() && i < 10; i++) {
                Element a = as.get(i);
                String href = a.attr("href"); // relative URL of the chapter text
                String title = a.text(); // chapter title
                // look the saved novel up again to get its generated id
                List<Novel> all = this.itemService.findAll(novel);
                long artid = all.get(0).getId();
                // record the chapter
                NovelChapter novelChapter = new NovelChapter(artid, title);
                if (this.novelChapterService.findAll(novelChapter).size() == 0) {
                    this.novelChapterService.save(novelChapter);
                    // we now have the chapter URL and title
                    System.out.println("href:" + href + " title:" + title);
                    this.addToDb(url, novelChapter, href);
                }
            }
        }
    }
    private void addToDb(String url, NovelChapter novelChapter, String href) {
        System.out.println(novelChapter);
        if (novelChapter.getId() == null) return;
        Long chapterid = novelChapter.getId();
        // chapter links are relative, so join them onto the book page URL
        url = url + href;
        System.out.println("url:" + url);
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // extract the chapter body
        String content = doc.select("div#main div#readbox div#content").html();
        // clean up the raw HTML: drop the site boilerplate at the start of the body
        // (fragile: assumes a fixed 90-character prefix)
        content = content.substring(90);
        content = content.replace("<br>", " ");
        content = content.replace("\n", "");
        // remove non-breaking-space entities and their leftover fragments (e.g. "nbsp;")
        content = content.replace("&nbsp;", "").replace("nbsp;", "");
        NovelDetail novelDetail = new NovelDetail(chapterid, content);
        System.out.println(novelDetail);
        if (this.novelDetailService.findAll(novelDetail).size() == 0) {
            this.novelDetailService.save(novelDetail);
        }
    }
}
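One note on URL handling: concatenating url + href only works when the chapter links happen to be relative to the book page. A more robust approach (a sketch, not part of the code above) is to give Jsoup the page's base URI and let it resolve the links:

// parse with a base URI so Jsoup can resolve relative links
Document doc = Jsoup.parse(html, url);
for (Element a : doc.select("div.zjbox dl.zjlist dd a")) {
    String chapterUrl = a.absUrl("href"); // absolute URL, resolved against the base
    System.out.println(chapterUrl);
}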
HttpUtils wraps the HTTP requests:
package com.itcrawler.itcrawler.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;
import java.io.IOException;

@Component
public class HttpUtils {
    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // maximum number of connections in the pool
        this.cm.setMaxTotal(100);
        // maximum number of connections per route (host)
        this.cm.setDefaultMaxPerRoute(10);
    }

    // download the page at the given URL
    public String doGetHtml(String url) {
        // get an httpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // create the GET request for the URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(this.getConfig());
        // set request headers; without these the site may refuse the request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpGet.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
        httpGet.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpGet.setHeader("Connection", "keep-alive");
        CloseableHttpResponse response = null;
        try {
            // execute the request and get the response
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                // make sure the response has a body
                if (response.getEntity() != null) {
                    // the site is GBK-encoded
                    String content = EntityUtils.toString(response.getEntity(), "gbk");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // on failure, return an empty string
        return "";
    }

    // request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000) // max time to establish a connection
                .setConnectionRequestTimeout(500) // max time to get a connection from the pool
                .setSocketTimeout(10000) // max time for data transfer
                .build();
        return config;
    }
}
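A quick way to sanity-check the utility (a minimal sketch, instantiating the class directly instead of injecting it):

HttpUtils utils = new HttpUtils();
String html = utils.doGetHtml("https://www.qb5.tw/");
// a non-empty result means the request and GBK decoding both worked
System.out.println(html.isEmpty() ? "request failed" : "fetched " + html.length() + " chars");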
Finally, don't forget the application entry point:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling // enables the scheduled crawl task; easy to forget, but required
public class ItcrawlerJdApplication {
    public static void main(String[] args) {
        SpringApplication.run(ItcrawlerJdApplication.class, args);
    }
}
And that's it. When the application starts, Spring's scheduler invokes first() automatically and the crawled data lands in the three tables. You can adapt the select() conditions to extract whatever fields your own use case needs.
Source: https://blog.csdn.net/qq_43505820/article/details/123635864