Implementing a Book Crawler in Java with Jsoup
Initial Setup
The project will be published on Git later and updated there.
1. Target site: https://www.qb5.tw/
The crawler works from this page.
2. Database tables to create (they can also be auto-generated by JPA, as noted below):
1. novel — basic information about each novel
2. novel_chapter — chapter titles for each novel
3. novel_detail — the text of each chapter
3. The project is built on Spring Boot and uses Jsoup for the crawling. Create a Spring Boot project and add the following dependencies:
<dependencies>
    <!-- Spring MVC -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Spring Data JPA -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <!-- MySQL connector -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.11</version>
    </dependency>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
    </dependency>
    <!-- Jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <!-- utility library -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
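The starters above do not declare versions; those are managed by the Spring Boot parent POM. If your pom.xml does not already inherit from it, add something along these lines (the exact version is an assumption here; use whatever 2.x release your project targets):

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.3.12.RELEASE</version>
</parent>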
The contents of application.yml:
spring:
  # MySQL 8 datasource configuration
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/crawler?useSSL=false&useUnicode=true&characterEncoding=utf-8&serverTimezone=Asia/Shanghai
    username: root
    password: 123456
  # JPA configuration
  jpa:
    database: MySQL
    show-sql: true
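The three tables can be created by hand, or you can let JPA generate them from the entity classes below. One common option (assumed here, not part of the config above; adjust to taste) is to add this under the jpa section:

    hibernate:
      ddl-auto: update  # let Hibernate create/update the tables from the entity mappings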
Building the Project
1. First, create the POJO entity classes that map to the database tables.
@Entity
@Table(name = "novel")
public class Novel {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // novel title
    private String novel_name;
    // author
    private String author;
    // cover image
    private String img;
    // genre
    private String type;
    // status (ongoing, finished, ...)
    private String status;
    // popularity
    private String pop;
    // synopsis
    private String brief;
    // number of chapters
    private Long chapter_num;
    // generate getters/setters (and a toString(), since the crawler prints these objects)
}
@Entity
@Table(name = "novel_chapter")
public class NovelChapter {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // id of the novel this chapter belongs to
    private Long novel_id;
    // chapter title
    private String chapter_name;
    public NovelChapter() {
    }
    // the crawler below creates chapters through this two-argument constructor
    public NovelChapter(Long novel_id, String chapter_name) {
        this.novel_id = novel_id;
        this.chapter_name = chapter_name;
    }
}
@Entity
@Table(name = "novel_detail")
public class NovelDetail {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // chapter id
    private Long chapter_id;
    // chapter text; chapters are long, so map this to a TEXT/CLOB column
    // (without @Lob, JPA defaults to VARCHAR(255), which chapter text will overflow)
    @Lob
    private String chapter_content;
    public NovelDetail() {}
    public NovelDetail(Long chapter_id, String chapter_content) {
        this.chapter_id = chapter_id;
        this.chapter_content = chapter_content;
    }
}
2. Create the corresponding DAO, service, and impl classes.
The DAOs extend JpaRepository, which provides the database operations.
dao:
public interface NovelDao extends JpaRepository<Novel,Long> {
}
public interface NovelDetailDao extends JpaRepository<NovelDetail,Long> {
}
public interface NovelChapterDao extends JpaRepository<NovelChapter,Long> {
}
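Nothing more is needed in the DAO layer: the save and findAll(Example) calls used below are inherited from JpaRepository (the latter via QueryByExampleExecutor, which JpaRepository extends).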
The service interfaces expose the JpaRepository operations the crawler needs: save persists an entity, and findAll checks whether a matching record already exists.
public interface NovelService {
    public void save(Novel item);
    public List<Novel> findAll(Novel item);
}
public interface NovelDetailService {
    public void save(NovelDetail item);
    public List<NovelDetail> findAll(NovelDetail item);
}
public interface NovelChapterService {
    public void save(NovelChapter item);
    public List<NovelChapter> findAll(NovelChapter item);
}
Each impl class implements its service interface; the remaining implementations follow the same pattern (a sketch of one is shown after this example).
@Service
public class NovelChapterServiceImpl implements NovelChapterService {
    @Autowired
    private NovelChapterDao itemDao;

    @Override
    @Transactional
    public void save(NovelChapter item) {
        this.itemDao.save(item);
    }

    @Override
    public List<NovelChapter> findAll(NovelChapter item) {
        // build a query-by-example probe from the given entity
        Example<NovelChapter> example = Example.of(item);
        // query using the probe's non-null fields as conditions
        List<NovelChapter> list = this.itemDao.findAll(example);
        return list;
    }
}
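For completeness, here is what the analogous NovelServiceImpl looks like (a sketch following the same pattern; NovelDetailServiceImpl is identical apart from the types):

@Service
public class NovelServiceImpl implements NovelService {
    @Autowired
    private NovelDao itemDao;

    @Override
    @Transactional
    public void save(Novel item) {
        this.itemDao.save(item);
    }

    @Override
    public List<Novel> findAll(Novel item) {
        // query by example: the probe's non-null fields become WHERE conditions
        return this.itemDao.findAll(Example.of(item));
    }
}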
The Crawler
package com.itcrawler.itcrawler.jd.controller;

import com.itcrawler.itcrawler.jd.pojo.Novel;
import com.itcrawler.itcrawler.jd.pojo.NovelChapter;
import com.itcrawler.itcrawler.jd.pojo.NovelDetail;
import com.itcrawler.itcrawler.jd.service.NovelChapterService;
import com.itcrawler.itcrawler.jd.service.NovelDetailService;
import com.itcrawler.itcrawler.jd.service.NovelService;
import com.itcrawler.itcrawler.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import java.util.List;

@Component
public class ItemTask {
    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private NovelService itemService;
    @Autowired
    private NovelChapterService novelChapterService;
    @Autowired
    private NovelDetailService novelDetailService;

    // crawl the home page
    @Scheduled(fixedDelay = 100 * 1000) // how long to wait after a run finishes before the next one starts
    public void first() {
        String url = "https://www.qb5.tw/";
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // In a Jsoup selector, # matches an id and . matches a class. The selector below
        // starts at the div with id "main", descends to div#mainleft, then div.titletop,
        // and so on until it reaches the cover links we want. See the Jsoup docs if this
        // syntax is unfamiliar.
        Elements ele = doc.select("div#main div#mainleft div.titletop li.top div.pic a");
        for (int i = 0; i < ele.size(); i++) {
            String href = ele.get(i).attr("href");
            this.parse(href);
        }
    }
    // parse a book page, extract its data, and store it
    private void parse(String url) {
        // download the book page to parse
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        Novel novel = new Novel();
        // from the page we extract: title, cover image, author, synopsis, genre,
        // status (ongoing, finished, ...), popularity, and the chapter count
        Elements ele = doc.select("div#main div#bookdetail div.nav-mbx a[target=_blank]");
        // genre
        novel.setType(ele.get(1).text());
        String tit = doc.select("div#main div#bookdetail div#info h1").text();
        String[] split = tit.split("/");
        novel.setNovel_name(split[0]);
        novel.setAuthor(split[1]);
        novel.setImg(doc.select("div#main div#bookdetail div#picbox div.img_in img").attr("src"));
        Elements select = doc.select("div#main div#bookdetail div#info p.booktag span");
        novel.setPop(select.get(0).text());
        novel.setStatus(select.get(1).text());
        // cap the synopsis at 200 characters (guarding against shorter intros)
        String intro = doc.select("div#main div#bookdetail div#info div#intro").text();
        String brief = intro.substring(0, Math.min(200, intro.length()));
        brief = brief.replace("<br>", "").replace(" ", "");
        novel.setBrief(brief);
        System.out.println(novel);
        List<Novel> list = this.itemService.findAll(novel);
        if (list.size() == 0) { // not crawled before
            // save the novel
            this.itemService.save(novel);
            // fetch the chapter list
            Elements as = doc.select("div.zjbox dl.zjlist dd a");
            // storage is limited, so only crawl the first 10 chapters of each book
            for (int i = 0; i < as.size() && i < 10; i++) {
                Element a = as.get(i);
                String href = a.attr("href"); // relative URL of the chapter text
                String title = a.text(); // chapter title
                // look the saved novel up again to get its generated id
                List<Novel> all = this.itemService.findAll(novel);
                long artid = all.get(0).getId();
                // record the chapter
                NovelChapter novelChapter = new NovelChapter(artid, title);
                if (this.novelChapterService.findAll(novelChapter).size() == 0) {
                    this.novelChapterService.save(novelChapter);
                    // we now have the chapter URL and title
                    System.out.println("href:" + href + " title:" + title);
                    this.addToDb(url, novelChapter, href);
                }
            }
        }
    }
    private void addToDb(String url, NovelChapter novelChapter, String href) {
        System.out.println(novelChapter);
        if (novelChapter.getId() == null) return;
        Long chapterid = novelChapter.getId();
        // chapter links are relative, so join them onto the book page URL
        url = url + href;
        System.out.println("url:" + url);
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // extract the chapter body
        String content = doc.select("div#main div#readbox div#content").html();
        // clean up the raw HTML: drop the site boilerplate at the start of the body
        // (fragile: assumes a fixed 90-character prefix)
        content = content.substring(90);
        content = content.replace("<br>", " ");
        content = content.replace("\n", "");
        // remove non-breaking-space entities and their leftover fragments (e.g. "nbsp;")
        content = content.replace("&nbsp;", "").replace("nbsp;", "");
        NovelDetail novelDetail = new NovelDetail(chapterid, content);
        System.out.println(novelDetail);
        if (this.novelDetailService.findAll(novelDetail).size() == 0) {
            this.novelDetailService.save(novelDetail);
        }
    }
}
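One note on URL handling: concatenating url + href only works when the chapter links happen to be relative to the book page. A more robust approach (a sketch, not part of the code above) is to give Jsoup the page's base URI and let it resolve the links:

// parse with a base URI so Jsoup can resolve relative links
Document doc = Jsoup.parse(html, url);
for (Element a : doc.select("div.zjbox dl.zjlist dd a")) {
    String chapterUrl = a.absUrl("href"); // absolute URL, resolved against the base
    System.out.println(chapterUrl);
}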
HttpUtils wraps the HTTP requests:
package com.itcrawler.itcrawler.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;
import java.io.IOException;

@Component
public class HttpUtils {
    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // maximum number of connections in the pool
        this.cm.setMaxTotal(100);
        // maximum number of connections per route (host)
        this.cm.setDefaultMaxPerRoute(10);
    }

    // download the page at the given URL
    public String doGetHtml(String url) {
        // get an httpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // create the GET request for the URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(this.getConfig());
        // set request headers; without these the site may refuse the request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpGet.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
        httpGet.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpGet.setHeader("Connection", "keep-alive");
        CloseableHttpResponse response = null;
        try {
            // execute the request and get the response
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                // make sure the response has a body
                if (response.getEntity() != null) {
                    // the site is GBK-encoded
                    String content = EntityUtils.toString(response.getEntity(), "gbk");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // on failure, return an empty string
        return "";
    }

    // request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000) // max time to establish a connection
                .setConnectionRequestTimeout(500) // max time to get a connection from the pool
                .setSocketTimeout(10000) // max time for data transfer
                .build();
        return config;
    }
}
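A quick way to sanity-check the utility (a minimal sketch, instantiating the class directly instead of injecting it):

HttpUtils utils = new HttpUtils();
String html = utils.doGetHtml("https://www.qb5.tw/");
// a non-empty result means the request and GBK decoding both worked
System.out.println(html.isEmpty() ? "request failed" : "fetched " + html.length() + " chars");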
Finally, don't forget the application entry point:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling // enables the scheduled crawl task; easy to forget, but required
public class ItcrawlerJdApplication {
    public static void main(String[] args) {
        SpringApplication.run(ItcrawlerJdApplication.class, args);
    }
}
And that's it. When the application starts, Spring's scheduler invokes first() automatically and the crawled data lands in the three tables. You can adapt the select() conditions to extract whatever fields your own use case needs.
Source: https://blog.csdn.net/qq_43505820/article/details/123635864