[Learning Python] 01 - My First Novel Scraper
Back when I was building my website, I wrote a C# endpoint to scrape pictures from an image site, and it took a big pile of code. Recently I watched a friend write a crawler and was struck by how little code it needed, so I decided to look into Python and put together the simplest possible novel scraper. There's nothing fancy in it, no multithreading or the like, just a very basic crawler. Since I had never studied Python before and jumped straight in armed only with the Runoob beginner tutorial, I still ran into a few small problems along the way, so this post is a quick record of the process.
Analysis
The first step is picking a victim... er, target site. Going by my years of experience reading pirated novels, the choice was quick: a certain "-quge" site (xquge).
Pick any book and click through to its catalog page. Their layout is actually pretty tidy: the catalog has two parts, "latest updates" and "full catalog", and the full catalog isn't paginated at all, everything is dumped onto a single page, so there's no pagination to deal with. Almost too easy.
After the catalog, click into a chapter. Hitting F12 to inspect shows that every sentence is its own <p> tag. Well... okay, this really is simple.
With the catalog and the content pages figured out, it's time to get to work; we know what we want, it's just a question of how to fetch it. Since this is Python, the basic syntax comes first: no curly braces, no semicolons, blocks are marked purely by indentation... Fine! A couple of lines in it already feels natural, and honestly, once you're used to it, it really is clean.
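For example, a block in Python is just whatever sits indented under the header line; a throwaway illustration, not part of the scraper itself:

# No braces, no semicolons: indentation alone marks the body of the for/if blocks
for i in range(3):
    if i % 2 == 0:
        print(i, 'is even')
    else:
        print(i, 'is odd')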
Thinking through the steps, it breaks down into roughly five:
- Pick a book and grab its catalog URL
- Fake a request header so we don't get banned outright (this site doesn't actually check it, though maybe I just didn't hammer it hard enough)
- Scrape the chapter links and titles from the catalog page
- Walk the catalog and scrape each chapter's content
- Stitch each chapter's content together and save it to a file
That's the plan. Now let's get the post going.
Coding
- Fake the request headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36','Referer':'https://www.xquge.com'}
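As a side note, when a script fires off many requests it can be tidier to hang these headers on a requests.Session once instead of passing headers= to every call; a small sketch, not what the script below actually does:

import requests

session = requests.Session()
session.headers.update(headers)  # every request from this session now carries the faked headers
resp = session.get('https://www.xquge.com/book/1771.html', verify=False)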
- Fetch the catalog page
catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
The decode() call is there because .content gives back raw bytes, which look like gibberish if printed directly, so they need decoding into a string first. At this point I hit a problem:
Traceback (most recent call last):
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
    self._prepare_proxy(conn)
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
    conn.connect()
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connection.py", line 359, in connect
    conn = self._connect_tls_proxy(hostname, conn)
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connection.py", line 496, in _connect_tls_proxy
    return ssl_wrap_socket(
  File "D:\Program Files\Python39\lib\site-packages\urllib3\util\ssl_.py", line 432, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "D:\Program Files\Python39\lib\site-packages\urllib3\util\ssl_.py", line 474, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "D:\Program Files\Python39\lib\ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "D:\Program Files\Python39\lib\ssl.py", line 1040, in _create
    self.do_handshake()
  File "D:\Program Files\Python39\lib\ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python39\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "D:\Program Files\Python39\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.xquge.com', port=443): Max retries exceeded with url: /book/1771.html (Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "f:\NewOneDrive\OneDrive\Python\BookReptile.py", line 20, in <module>
    catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
  File "D:\Program Files\Python39\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\adapters.py", line 510, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.xquge.com', port=443): Max retries exceeded with url: /book/1771.html (Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))
The word Proxy in there pretty much gives the cause away: once I shut down my little "ladder" proxy client, the error disappeared.
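If you would rather not shut the proxy client down every time, requests can also be told to ignore the system proxy for these calls. A hedged sketch of two options (assuming the target site is reachable without any proxy):

import requests

# Option 1: explicitly send this request with no proxy
resp = requests.get('https://www.xquge.com/book/1771.html',
                    headers=headers,
                    proxies={'http': None, 'https': None},
                    verify=False)

# Option 2: use a Session that ignores HTTP_PROXY/HTTPS_PROXY and the system proxy settings
session = requests.Session()
session.trust_env = False
resp = session.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False)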
But right on its heels came a new issue; not an error this time, just a warning:
D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.xquge.com'. Adding certificate verification is strongly advised.
A quick search shows this warning appears because the request is HTTPS with certificate verification turned off. We aren't carrying any certificates and don't need them here, so it's enough to simply silence the warning with a couple of lines:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
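While we're at it, a note on the earlier decode() call: instead of decoding the raw bytes by hand, requests can sniff the page encoding itself. A hedged alternative to the .content.decode() line:

resp = requests.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False)
resp.encoding = resp.apparent_encoding  # let requests guess the charset from the response body
catalog = resp.text  # decoded text instead of raw bytes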
With the proxy off and the warning silenced, fetching the page no longer complains. Once we have the HTML, we turn it into an XPath-able object, which makes grabbing nodes very convenient; the browser devtools can even copy the XPath of any node for you, which is great. Then we grab every entry node inside the full catalog:
from lxml import etree
# Parse into an XPath-able object
html = etree.HTML(catalog)
# Grab all the catalog entry nodes
chapters = html.xpath('/html/body/div[1]/div[6]/div[5]/div[2]/ul/li/a')
Printing chapters gives the array below. Not exactly human-readable, but at least it proves we got something:
[<Element a at 0x26c538cca40>, <Element a at 0x26c5387dd00>, <Element a at 0x26c539c1280>, <Element a at 0x26c539c19c0>, <Element a at 0x26c539c1a00>, <Element a at 0x26c539c1440>, <Element a at 0x26c539c1b40>, <Element a at 0x26c539c18c0>, <Element a at 0x26c539c17c0>, <Element a at 0x26c539c1800>, <Element a at 0x26c539c1780>, <Element a at 0x26c539c1740>, <Element a at 0x26c539c1640>, <Element a at 0x26c539c16c0>, <Element a at 0x26c539c1dc0>, <Element a at 0x26c539c1e40>, <Element a at 0x26c539e1e80>, <Element a at 0x26c539e1e40>, <Element a at 0x26c539ccec0>, <Element a at 0x26c539ccdc0>, <Element a at 0x26c539ccf40>, <Element a at 0x26c539ccf80>, <Element a at 0x26c52dd10c0>, <Element a at 0x26c539f5b40>, <Element a at 0x26c539f5d80>, <Element a at 0x26c539f5cc0>, <Element a at 0x26c539f5e00>]
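The Element objects aren't meant to be read directly; to sanity-check that the XPath hit the right nodes, you can peek at the text and href of a few of them, roughly like this:

# Print the title and link of the first few catalog entries as a sanity check
for chapter in chapters[:5]:
    print(chapter.xpath('./text()')[0], chapter.xpath('./@href')[0])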
Now we can process each chapter page. I like wrapping this kind of thing in a function, and Python function definitions turn out to be very simple too, so here's the handler:
amount = len(chapters)
nowIndex = 0
# Function to process a chapter page
def processingChapter(url, title):
    content = requests.get(url, headers=headers, verify=False).content.decode()
    html = etree.HTML(content)  # parse into an XPath object
    lines = html.xpath('//*[@id="content"]/p[@class="bodytext"]/text()')  # list of the chapter's text lines
    finalStr = '\r\n'.join(lines)  # join the list into one string with line breaks
    fileName = 'files/' + title + '.txt'  # build the file name
    fileWriter = codecs.open(fileName, 'w', 'utf-8')  # open the file for writing
    fileWriter.writelines(finalStr)  # write the string
    fileWriter.flush()
    fileWriter.close()
    global nowIndex  # a variable defined outside the function needs the global keyword before being assigned inside it, otherwise Python raises an unbound-local error
    nowIndex += 1
    print(fileName + ' 已保存' + str(nowIndex) + '/' + str(amount))
    pass
The code saves every chapter as its own file. This example was written to learn, not to actually rip the book, so I didn't bother writing everything into a single file.
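If you did want the whole book in one file, a minimal variant is to open a single output file in append mode and write each chapter into it; a sketch only (the combined.txt name is just an assumption for illustration):

# Variant: append every chapter to one file instead of writing one file per chapter
def appendChapter(url, title):
    content = requests.get(url, headers=headers, verify=False).content.decode()
    lines = etree.HTML(content).xpath('//*[@id="content"]/p[@class="bodytext"]/text()')
    with open('files/combined.txt', 'a', encoding='utf-8') as f:
        f.write(title + '\r\n' + '\r\n'.join(lines) + '\r\n\r\n')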
With the chapter handler done, all that's left is to walk the catalog and call it, so one more function:
# Function to walk the catalog and fire off the requests
def processingDirectory(chapters):
    for chapter in chapters:
        url = chapter.xpath('./@href')[0]  # the chapter link
        title = chapter.xpath('./text()')[0]  # the chapter title
        processingChapter(url, title)  # process the chapter content
        time.sleep(0.8)  # requesting too fast risks a ban and also caused skipped chapters
    pass
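One thing worth guarding against: if the catalog href ever turns out to be relative rather than a full URL, it can be absolutized with urllib.parse.urljoin before requesting it; a defensive tweak, not something this particular site seems to need:

from urllib.parse import urljoin

base = 'https://www.xquge.com/book/1771.html'
url = urljoin(base, chapter.xpath('./@href')[0])  # works whether href is absolute or relative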
Finally, just call it:
# Kick it off
processingDirectory(chapters)
Execution Results
files/001 划重点?.txt 已保存1/1304
files/002 既治病,也要命!.txt 已保存2/1304
files/003 我都要.txt 已保存3/1304
files/004 我祝福你.txt 已保存4/1304
files/005 我再祝福你.txt 已保存5/1304
files/006 一年有三百六十五个日出.txt 已保存6/1304
files/007 减肥.txt 已保存7/1304
files/008 你会恨我的.txt 已保存8/1304
files/009 皮.txt 已保存9/1304
files/010 闯祸?.txt 已保存10/1304
files/011 出发!皮卡皮!.txt 已保存11/1304
And that's it: the simplest possible little crawler, nothing technically hard about it. I only built it to get acquainted with Python, which I had never touched before, so my understanding of some of the constructs may well be off; corrections are very welcome.
Full Source Code
# My first crawler as a newbie
# Scrapes a novel from a "biquge"-style site (xquge)
# 0. Fake a request header first
# 1. Take the URL of a book: https://www.xquge.com/book/1771.html
# 2. Scrape the chapter links and titles from the catalog
# 3. Walk the catalog and request each chapter's content
# 4. Process each chapter and write the content to a file
# 5. Done, go brag about it
import requests
import urllib3
from lxml import etree
import time
import codecs

# Silence the HTTPS certificate warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Fake a request header to avoid an outright ban (this site doesn't actually block header-less requests, though)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36', 'Referer': 'https://www.xquge.com'}

# Fetch the catalog page
catalog = requests.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False).content.decode()

# Parse into an XPath-able object
html = etree.HTML(catalog)

# Grab all the catalog entry nodes
chapters = html.xpath('/html/body/div[1]/div[6]/div[5]/div[2]/ul/li/a')

amount = len(chapters)
nowIndex = 0

# Function to process a chapter page
def processingChapter(url, title):
    content = requests.get(url, headers=headers, verify=False).content.decode()
    html = etree.HTML(content)  # parse into an XPath object
    lines = html.xpath('//*[@id="content"]/p[@class="bodytext"]/text()')  # list of the chapter's text lines
    finalStr = '\r\n'.join(lines)  # join the list into one string with line breaks
    fileName = 'files/' + title + '.txt'  # build the file name
    fileWriter = codecs.open(fileName, 'w', 'utf-8')  # open the file for writing
    fileWriter.writelines(finalStr)  # write the string
    fileWriter.flush()
    fileWriter.close()
    global nowIndex  # a variable defined outside the function needs the global keyword before being assigned inside it, otherwise Python raises an unbound-local error
    nowIndex += 1
    print(fileName + ' 已保存' + str(nowIndex) + '/' + str(amount))
    pass

# Function to walk the catalog and fire off the requests
def processingDirectory(chapters):
    for chapter in chapters:
        url = chapter.xpath('./@href')[0]  # the chapter link
        title = chapter.xpath('./text()')[0]  # the chapter title
        processingChapter(url, title)  # process the chapter content
        time.sleep(0.8)  # requesting too fast risks a ban and also caused skipped chapters
    pass

# Kick it off
processingDirectory(chapters)
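One small prerequisite: the script writes into a files/ folder and codecs.open will fail if that folder doesn't exist, so it's worth creating it before the first run; a tiny addition that isn't in the original script:

import os

os.makedirs('files', exist_ok=True)  # make sure the output folder exists before saving chapters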