Automating office work with Python: crawling every novel on 小说天堂 (TXT Novel Paradise)
Author: 互联网
Abstract
Scraping is, at bottom, a tug-of-war between users and back-end developers. A user wants to pull data off a server; the developers who painstakingly collected that data are not about to let a few lines of code walk away with it, so they bolt on anti-scraping measures. Round after round, crawlers have flourished and the defenses have multiplied, and for ordinary folks who just want to grab a few films or novels, it keeps getting harder.
Not every back-end team is in the anti-scraping camp, though. Today's target is the home page of the TXT novel site 小说天堂 (xstt5.com), and the goal is to capture every novel on it. The full source code appears below, but be warned: if your disk is small, it may start smoking!
Disclaimer
Use the program in this article with caution. If running it brings the site down, the consequences are on you.
It is shared only to discuss crawling techniques; do not use it commercially!
I later plan to build a GUI version of this crawler and give it to readers for free.
1 A tribute to youth
While tidying my room I dug up the mp3 player I had shelved for over a decade, to my great surprise, and scenes of reading novels under the covers came flooding back. I was about ten then, sneaking off to back-alley internet cafés to download fantasy and action novels. Xie Wendong from 《坏蛋是怎样炼成的》 was my idol at the time; to download that book I stole five yuan from home, took the beating that followed, and ran to the café anyway. More than a decade later I am no longer that boy, and I no longer need to sneak anywhere: today I can crawl most novel sites. But I can never get back that innocent age. Here's to youth!
2 Site analysis
I fetched this site repeatedly without any headers and never triggered a countermeasure, so I am fairly confident it has no anti-scraping defenses, and readers can crawl it freely. The site's architecture is simple, with no client-side rendering, so the crawl itself is easy; every novel can be packed up and carried home. One more observation: the site only supports online reading. Clicking "立即下载" ("download now") produces no download dialog at all.
3 Crawl flow
To sweep a resource site clean, you have to reach the lowest-level URLs. How? This site divides its novels into 21 broad categories; each category contains many novels, each novel is split into chapters, and only the chapter pages carry actual text. So the crawl proceeds from category URLs to novel URLs to chapter URLs, forming a multi-level list.
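The three-level descent can be sketched with small stand-in link tables. Everything below except the base URL is made up for illustration; on the real site these dicts would be filled by fetching and parsing the pages:

```python
from urllib.parse import urljoin

# Hypothetical link tables standing in for the three page levels:
# category page -> novel pages, novel page -> chapter pages.
SITE = "https://www.xstt5.com/"
category_links = {"/cat1/": ["/book/1.html", "/book/2.html"]}
chapter_links = {
    "/book/1.html": ["/book/1/ch1.html", "/book/1/ch2.html"],
    "/book/2.html": ["/book/2/ch1.html"],
}

def crawl_plan():
    """Walk category -> novel -> chapter and yield absolute chapter URLs."""
    for category, books in category_links.items():
        for book in books:
            for chapter in chapter_links[book]:
                yield urljoin(SITE, chapter)

urls = list(crawl_plan())
```

Only the innermost URLs are ever fetched for content; the two outer levels exist purely to discover links.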
4 Step-by-step crawling
Following the flow in Section 3, we proceed as follows:
Step 1: get the category URLs
# Home page: 'https://www.xstt5.com/'
start_url = 'https://www.xstt5.com/'
# Fetch the home page
response_all = session.get(start_url, headers=headers).content
response_all = etree.HTML(response_all)
# Collect the URLs of all top-level categories
name_list_all = response_all.xpath('//ul/li/a/@href')
Step 2: get the novel URLs
# Fetch each category page to reach the novels beneath it
start_url = i
response = session.get(start_url, headers=headers).content
response = etree.HTML(response)
# Novel URL list
txt_url_list = response.xpath('//div[@class="w990"]/div[2]/div/div/h3/a/@href')
# Novel names
txt_name_list = response.xpath('//div[@class="w990"]/div[2]/div/div/h3/a/text()')
Step 3: get the chapter URLs
response_index = session.get('https://www.xstt5.com' + u, headers=headers).content
response_index = etree.HTML(response_index)
# Chapter URL list
txt_url_chapter_list = response_index.xpath('//td/ul/li/a/@href')
5 Distributed crawling
Large crawlers nowadays do run in distributed mode; what we build here is a modest version of the same idea, descending the multi-level URL lists step by step, much like a nested Python list.
For the chapter section the site's designer used a table tag, and the chapter links come back out of order. They must be sorted first, or the saved chapters will be scrambled and the novel unreadable.
# Sort the chapter URLs
txt_url_chapter_list.sort()
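One caveat about `sort()`: it compares URL strings lexicographically, so if a site names chapter files with bare numbers, `/10.html` sorts before `/2.html`. Whether that bites depends on the site's actual URL scheme (the paths below are made up), but a natural-sort key sidesteps it:

```python
import re

def chapter_key(url):
    """Sort key that compares the numbers embedded in a URL numerically."""
    return [int(s) for s in re.findall(r"\d+", url)]

# Hypothetical chapter paths with bare numeric file names
chapters = ["/book/7/10.html", "/book/7/2.html", "/book/7/1.html"]
chapters.sort()                  # lexicographic: 1, 10, 2 -- wrong order
lexicographic = list(chapters)
chapters.sort(key=chapter_key)   # numeric: 1, 2, 10
```

If the site zero-pads chapter numbers or uses ordered IDs, plain `sort()` is already enough.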
Now we can chain the three lists from Section 4 together and run the crawler. The full source code follows:
"""User-agent大列表"""
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
from requests_html import HTMLSession
import os, random
import threading
from lxml import etree

session = HTMLSession()

class DFSpider(object):
    def __init__(self, start_url, txt_name):
        # Initialize with the chapter's request URL and the novel's name
        self.start_url = start_url
        self.txt_name = txt_name

    def parse_start_url(self):
        """
        Send the request and fetch the response.
        :return:
        """
        # Request headers
        headers = {
            # Pick a random User-Agent from the pool
            'User-Agent': random.choice(USER_AGENT_LIST)
        }
        # Send the request; .content gives the raw response bytes
        response = session.get(self.start_url, headers=headers).content
        # Parse the bytes into a tree that supports XPath queries
        response = etree.HTML(response)
        # Hand the parsed tree to the extraction step
        self.parse_response_data(response)

    def parse_response_data(self, response):
        """
        Parse the response and extract the chapter data.
        :return:
        """
        # Chapter body paragraphs
        name_list_1 = response.xpath('//div[@class="zw"]/p/text()')
        # Chapter title
        name_list_2 = response.xpath('//div[@class="atitle"]/h1/text()')
        # Save the chapter to disk
        self.save_txt_file(name_list_1, name_list_2)

    def save_txt_file(self, name_list, name_list2):
        # Make sure the output directory ('数据' means 'data') exists;
        # open() does not create missing directories
        os.makedirs('./数据', exist_ok=True)
        with open(f'./数据/{self.txt_name}.txt', 'a+', encoding='utf-8') as f:
            print(f"Crawling {name_list2[0]}")
            f.write(str(name_list2[0]) + '\n')
            for i in name_list:
                f.write(i + '\n')

headers = {
    # Pick a random User-Agent from the pool
    'User-Agent': random.choice(USER_AGENT_LIST)
}
if __name__ == '__main__':
    # Home page: 'https://www.xstt5.com/'
    start_url = 'https://www.xstt5.com/'
    # Fetch the home page
    response_all = session.get(start_url, headers=headers).content
    response_all = etree.HTML(response_all)
    # Collect the URLs of all top-level categories
    name_list_all = response_all.xpath('//ul/li/a/@href')
    # Walk each category
    for i in name_list_all:
        try:
            # Fetch the category page to reach the novels beneath it
            start_url = i
            response = session.get(start_url, headers=headers).content
            response = etree.HTML(response)
            # Novel URL list
            txt_url_list = response.xpath('//div[@class="w990"]/div[2]/div/div/h3/a/@href')
            # Novel names
            txt_name_list = response.xpath('//div[@class="w990"]/div[2]/div/div/h3/a/text()')
            # Pair each novel URL with its name
            for u, n in zip(txt_url_list, txt_name_list):
                print('_' * 80)
                print(f"Downloading {n}")
                response_index = session.get('https://www.xstt5.com' + u, headers=headers).content
                response_index = etree.HTML(response_index)
                # Chapter URLs for this novel
                txt_url_chapter_list = response_index.xpath('//td/ul/li/a/@href')
                # Record the novel's URL and name
                with open("xiaoshuoliebiao.txt", 'a+', encoding='utf-8') as f:
                    f.write('https://www.xstt5.com' + u + ',' + n + '\n')
                # Sort the chapter URLs
                txt_url_chapter_list.sort()
                # Crawl and save chapter by chapter
                # Option 1 (single-threaded)
                # for c in txt_url_chapter_list:
                #     url = 'https://www.xstt5.com' + c
                #     spider = DFSpider(url, n)
                #     spider.parse_start_url()
                # print(f" {n} downloaded")
                #
                # Option 2 (threaded). Note that calling join() right after
                # start() waits for each thread before launching the next,
                # so chapters are still fetched one at a time, in order.
                for c in txt_url_chapter_list:
                    url = 'https://www.xstt5.com' + c
                    spider = DFSpider(url, n)
                    worker = threading.Thread(target=spider.parse_start_url)
                    worker.start()
                    worker.join()
        except Exception as e:
            print(f"Error: {e}")
Of course, for a distributed workload you can carry out the crawl with multiple threads, multiple processes, or coroutines for even more speed. Here is the threaded loop again:
for c in txt_url_chapter_list:
    url = 'https://www.xstt5.com' + c
    spider = DFSpider(url, n)
    worker = threading.Thread(target=spider.parse_start_url)
    worker.start()
    worker.join()
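Because the loop above joins each thread immediately after starting it, the chapters are in practice downloaded one at a time. A thread pool keeps real concurrency while still preserving chapter order, since `Executor.map` yields results in input order. Here `fetch_chapter` is a hypothetical stand-in for the real per-chapter download:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chapter(url):
    """Stand-in for the real per-chapter request-and-parse step."""
    return f"text of {url}"

# Hypothetical sorted chapter paths for one novel
chapter_urls = [f"/book/1/{n}.html" for n in range(1, 6)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs fetches concurrently but yields results in input order,
    # so the chapters can be appended to the novel file sequentially.
    texts = list(pool.map(fetch_chapter, chapter_urls))

for text in texts:
    pass  # write each chapter to the output file here, already in order
```

Fetching concurrently and writing sequentially gives the speed of threads without the scrambled-chapter problem.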
Here is a screenshot of the crawler in action:
That's all for today's share. Thanks for reading, and don't forget to give it a like!
If you enjoyed this article, please leave a like; and if there is Python material you would like to learn or discuss, you can add me on WeChat: cuiliang1666457052
Source: https://blog.csdn.net/weixin_52134263/article/details/118808263