
A Book Crawler in Java with Jsoup


Initial Setup

The project will be published to a Git repository and updated there.

1. The site to crawl is: https://www.qb5.tw/

The crawler starts from this page.
2. The database tables are:
1. novel: basic information about each novel
2. novel_chapter: the chapter titles of each novel
3. novel_detail: the text of each chapter
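(The original table screenshots are omitted here. The DDL below is a minimal sketch reconstructed from the entity classes defined later in the article; the column types are assumptions, and chapter_content in particular should be TEXT rather than the JPA default VARCHAR(255) so that full chapter text fits.)

CREATE TABLE novel (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    novel_name  VARCHAR(255),
    author      VARCHAR(255),
    img         VARCHAR(255),
    type        VARCHAR(255),
    status      VARCHAR(255),
    pop         VARCHAR(255),
    brief       VARCHAR(255),
    chapter_num BIGINT
);

CREATE TABLE novel_chapter (
    id           BIGINT AUTO_INCREMENT PRIMARY KEY,
    novel_id     BIGINT,
    chapter_name VARCHAR(255)
);

CREATE TABLE novel_detail (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    chapter_id      BIGINT,
    chapter_content TEXT
);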
3. The project is built on Spring Boot, with Jsoup doing the crawling. Create a Spring Boot project and add the following dependencies:

 <dependencies>
        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>

        <!--MySQL connector-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.11</version>
        </dependency>

        <!-- HttpClient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>

        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

        <!--Utility libraries-->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

The contents of application.yml:

spring:
  # datasource config (MySQL 8)
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/crawler?useSSL=false&useUnicode=true&characterEncoding=utf-8&serverTimezone=Asia/Shanghai
    username: root
    password: 123456
  # JPA config
  jpa:
    database: MySQL
    show-sql: true


Building the Project

1. First, create the POJO entity classes that map to the database tables:

@Entity
@Table(name = "novel")
public class Novel {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // novel title
    private String novel_name;
    // author
    private String author;
    // cover image URL
    private String img;
    // genre
    private String type;
    // status (ongoing, completed, ...)
    private String status;
    // popularity
    private String pop;
    // synopsis
    private String brief;
    // number of chapters
    private Long chapter_num;

    // generate getters/setters
}

@Entity
@Table(name = "novel_chapter")
public class NovelChapter {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // id of the novel this chapter belongs to
    private Long novel_id;
    // chapter title
    private String chapter_name;

    public NovelChapter() {
    }

    // this constructor is needed by the crawler code below
    public NovelChapter(Long novel_id, String chapter_name) {
        this.novel_id = novel_id;
        this.chapter_name = chapter_name;
    }

    // generate getters/setters
}

@Entity
@Table(name = "novel_detail")
public class NovelDetail {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // chapter id
    private Long chapter_id;
    // chapter text
    private String chapter_content;

    public NovelDetail() {}

    public NovelDetail(Long chapter_id, String chapter_content) {
        this.chapter_id = chapter_id;
        this.chapter_content = chapter_content;
    }

    // generate getters/setters
}

2. Create the corresponding DAO, service, and impl classes.
Each DAO extends JpaRepository, which supplies the database operations.
DAO:

public interface NovelDao extends JpaRepository<Novel, Long> {
}

public interface NovelDetailDao extends JpaRepository<NovelDetail,Long> {
}

public interface NovelChapterDao extends JpaRepository<NovelChapter,Long> {
}

The services expose the JpaRepository methods we need: save writes an entity to the database, and findAll checks whether matching data already exists.

public interface NovelService {
    public void save(Novel item);

    public List<Novel> findAll(Novel item);
}

public interface NovelDetailService {
    public void save(NovelDetail item);

    public List<NovelDetail> findAll(NovelDetail item);

}

public interface NovelChapterService {
    public void save(NovelChapter item);

    public List<NovelChapter> findAll(NovelChapter item);

}

The impl classes implement the service methods; the remaining implementations are analogous:

@Service
public class NovelChapterServiceImpl implements NovelChapterService {
    @Autowired
    private NovelChapterDao itemDao;
    @Override
    @Transactional
    public void save(NovelChapter item) {
        this.itemDao.save(item);
    }
    @Override
    public List<NovelChapter> findAll(NovelChapter item) {
        // build the query-by-example criteria
        Example<NovelChapter> example = Example.of(item);
        // run the query against those criteria
        List<NovelChapter> list = this.itemDao.findAll(example);
        return list;
    }


}
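One detail worth spelling out: Example.of builds query-by-example criteria, and only the non-null fields of the probe object become conditions, so an empty probe matches every row. A minimal sketch (the chaptersOf helper and the setter are illustrative, and assume the entities' getters/setters have been generated as noted above):

import org.springframework.data.domain.Example;

import java.util.List;

// Illustrative helper, not part of the original project: the probe's null
// fields are ignored, so the only criterion here is novel_id.
public List<NovelChapter> chaptersOf(NovelChapterDao dao, Long novelId) {
    NovelChapter probe = new NovelChapter();
    probe.setNovel_id(novelId);            // assumed generated setter
    return dao.findAll(Example.of(probe));
}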

The Crawler

package com.itcrawler.itcrawler.jd.controller;
import com.itcrawler.itcrawler.jd.pojo.Novel;
import com.itcrawler.itcrawler.jd.pojo.NovelChapter;
import com.itcrawler.itcrawler.jd.pojo.NovelDetail;
import com.itcrawler.itcrawler.jd.service.NovelChapterService;
import com.itcrawler.itcrawler.jd.service.NovelDetailService;
import com.itcrawler.itcrawler.jd.service.NovelService;
import com.itcrawler.itcrawler.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ItemTask {
    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private NovelService itemService;
    @Autowired
    private NovelChapterService novelChapterService;
    @Autowired
    private NovelDetailService novelDetailService;



    // crawl the home page
    @Scheduled(fixedDelay = 100 * 1000)  // delay between the end of one run and the start of the next
    public String first() {
        String url="https://www.qb5.tw/";
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // Selector syntax: "#" matches an id, "." matches a class. The selector
        // below reads: inside the <div> with id "main", find the <div> with id
        // "mainleft", then the <div> with class "titletop", and so on down to
        // the links we need. (See the Jsoup docs, or the standalone demo after
        // this class, if the syntax is unfamiliar.)
        Elements ele = doc.select("div#main div#mainleft div.titletop li.top div.pic a");
        for (int i = 0; i < ele.size(); i++) {
            String href = ele.get(i).attr("href");
      //      System.out.println(href);
            this.parse(href);
        }
      return "first";
    }


    // parse one book page, extract the book's data, and store it
    private void parse(String url) {
        // crawl a single book page
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        Novel novel = new Novel();
        // extract the title, cover image, author, synopsis, genre,
        // status (ongoing etc.), popularity, and chapter count
        Elements ele = doc.select("div#main div#bookdetail div.nav-mbx a[target=_blank]");
        // genre
        novel.setType(ele.get(1).text());
        String tit = doc.select("div#main div#bookdetail div#info h1").text();
        // the <h1> text has the form "title/author"
        String[] split = tit.split("/");
        novel.setNovel_name(split[0]);
        novel.setAuthor(split[1]);
        novel.setImg(doc.select("div#main div#bookdetail div#picbox div.img_in img").attr("src"));
        Elements select = doc.select("div#main div#bookdetail div#info p.booktag span");
        novel.setPop(select.get(0).text());
        novel.setStatus(select.get(1).text());
        // cap the length of the synopsis (guard against intros shorter than 200 chars)
        String brief = doc.select("div#main div#bookdetail div#info div#intro").text();
        brief = brief.substring(0, Math.min(200, brief.length()));
        brief = brief.replace("<br>", "").replace("&nbsp;", "");
        novel.setBrief(brief);

        System.out.println(novel);
        List<Novel> list = this.itemService.findAll(novel);
        if (list.size() == 0) {  // not stored yet
            // save the novel
            this.itemService.save(novel);
            // the id assigned on save is the novel's primary key
            long artid = this.itemService.findAll(novel).get(0).getId();
            // fetch the chapter list and then each chapter's content
            Elements as = doc.select("div.zjbox dl.zjlist dd a");
            // memory is limited, so only the first 10 chapters are crawled
            for (int i = 0; i < as.size() && i < 10; i++) {
                Element a = as.get(i);
                String href = a.attr("href"); // link to the chapter text
                String title = a.text();      // chapter title
                // record the chapter
                NovelChapter novelChapter = new NovelChapter(artid, title);
                if (this.novelChapterService.findAll(novelChapter).size() == 0) {
                    this.novelChapterService.save(novelChapter);
                    System.out.println("href:" + href + " title:" + title);
                    this.addToDb(url, novelChapter, href);
                }
            }
        }
    }

    private void addToDb(String url, NovelChapter novelChapter, String href) {
        System.out.println(novelChapter);
        if (novelChapter.getId() == null) return;
        Long chapterid = novelChapter.getId();
        url = url + href;
        System.out.println("url:" + url);
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // grab the chapter's body text
        String content = doc.select("div#main div#readbox div#content").html();
        // clean up the raw HTML:
        // drop the site boilerplate at the start of the text
        content = content.substring(Math.min(90, content.length()));
        content = content.replace("<br>", " ");
        content = content.replace("\n", "");
        // strip progressively shorter tails of "<br>&nbsp;" so leftover
        // fragments such as "bsp;" do not survive in the text
        String test = "<br>&nbsp;";
        for (int i = 0; i < test.length(); i++) {
            test = test.substring(i);
            content = content.replace(test, "");
        }
        NovelDetail novelDetail = new NovelDetail(chapterid, content);
        System.out.println(novelDetail);
        if (this.novelDetailService.findAll(novelDetail).size() == 0) {
            this.novelDetailService.save(novelDetail);
        }
    }
}
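If the chained selectors above look opaque, here is a small standalone sketch (SelectorDemo and its HTML fragment are made up for illustration) showing how "#" (id), "." (class), and space-separated descendant selectors combine in Jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    public static void main(String[] args) {
        // a made-up fragment mirroring the structure of the crawled page
        String html = "<div id=\"main\"><li class=\"top\">"
                + "<div class=\"pic\"><a href=\"/book_1/\">Book One</a></div>"
                + "</li></div>";
        Document doc = Jsoup.parse(html);
        // div#main  -> the <div> whose id is "main"
        // li.top    -> an <li> with class "top" anywhere below it
        // div.pic a -> an <a> inside a <div> with class "pic"
        Elements links = doc.select("div#main li.top div.pic a");
        for (Element a : links) {
            System.out.println(a.attr("href") + " -> " + a.text()); // /book_1/ -> Book One
        }
    }
}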

HttpUtils wraps the HTTP requests:

package com.itcrawler.itcrawler.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.IOException;

@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // maximum total connections in the pool
        this.cm.setMaxTotal(100);
        // maximum connections per host
        this.cm.setDefaultMaxPerRoute(10);
    }
    // download the page at the given URL
    public String doGetHtml(String url) {
        // build an httpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // create the GET request and set the URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(this.getConfig());
        // set browser-like request headers, otherwise the site rejects the request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpGet.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
        httpGet.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpGet.setHeader("Connection", "keep-alive");
        CloseableHttpResponse response = null;
        try {
            // send the request and read the response
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                // make sure the response has a body
                if (response.getEntity() != null) {
                    // the site serves GBK-encoded pages
                    String content = EntityUtils.toString(response.getEntity(), "gbk");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // fall through with an empty string on failure
        return "";
    }

    // request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // max time to establish the connection
                .setConnectionRequestTimeout(500) // max time to obtain a connection from the pool
                .setSocketTimeout(10000)          // max time for the data transfer
                .build();
        return config;
    }

}
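A minimal usage sketch (FetchDemo is illustrative and not part of the original project; in the project itself the class is injected with @Autowired):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchDemo {
    public static void main(String[] args) {
        // instantiate directly for a quick test; Spring normally manages the bean
        HttpUtils httpUtils = new HttpUtils();
        String html = httpUtils.doGetHtml("https://www.qb5.tw/");
        if (!html.isEmpty()) {
            Document doc = Jsoup.parse(html);
            System.out.println(doc.title()); // the page's <title>
        }
    }
}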

Finally, don't forget the annotations on the application's main class:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling  // enables the scheduled crawl task; do not leave this out
public class ItcrawlerJdApplication {

    public static void main(String[] args) {
        SpringApplication.run(ItcrawlerJdApplication.class, args);
    }

}

That's the whole crawler. The crawled data ends up in the three tables described above (the result screenshots are omitted here); adjust the select conditions to suit your own needs.
Source: https://blog.csdn.net/qq_43505820/article/details/123635864