[Learning Python] 01 - My First Novel Scraper
Back when I was building my website, I wrote a C# endpoint to scrape pictures from an image site, and it took a big pile of code. Recently I watched a friend write a crawler and was struck by how little code it needed, so I decided to look into Python and put together the simplest possible novel scraper. There's nothing fancy in it, no multithreading or the like, just a very basic crawler. Since I had never studied Python before and jumped straight in armed only with the Runoob beginner tutorial, I still ran into a few small problems along the way, so this post is a quick record of the process.
Analysis
The first step is picking a victim... er, target site. Going by my years of experience reading pirated novels, the choice was quick: a certain "-quge" site (xquge).
Pick any book and click through to its catalog page. Their layout is actually pretty tidy: the catalog has two parts, "latest updates" and "full catalog", and the full catalog isn't paginated at all, everything is dumped onto a single page, so there's no pagination to deal with. Almost too easy.
After the catalog, click into a chapter. Hitting F12 to inspect shows that every sentence is its own <p> tag. Well... okay, this really is simple.
With the catalog and the content pages figured out, it's time to get to work; we know what we want, it's just a question of how to fetch it. Since this is Python, the basic syntax comes first: no curly braces, no semicolons, blocks are marked purely by indentation... Fine! A couple of lines in it already feels natural, and honestly, once you're used to it, it really is clean.
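For example, a block in Python is just whatever sits indented under the header line; a throwaway illustration, not part of the scraper itself:

# No braces, no semicolons: indentation alone marks the body of the for/if blocks
for i in range(3):
    if i % 2 == 0:
        print(i, 'is even')
    else:
        print(i, 'is odd')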
Thinking through the steps, it breaks down into roughly five:
- Pick a book and grab its catalog URL
- Fake a request header so we don't get banned outright (this site doesn't actually check it, though maybe I just didn't hammer it hard enough)
- Scrape the chapter links and titles from the catalog page
- Walk the catalog and scrape each chapter's content
- Stitch each chapter's content together and save it to a file
That's the plan. Now let's get the post going.
Coding
- Fake the request headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36','Referer':'https://www.xquge.com'}
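As a side note, when a script fires off many requests it can be tidier to hang these headers on a requests.Session once instead of passing headers= to every call; a small sketch, not what the script below actually does:

import requests

session = requests.Session()
session.headers.update(headers)  # every request from this session now carries the faked headers
resp = session.get('https://www.xquge.com/book/1771.html', verify=False)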
- Fetch the catalog page
catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
The decode() call is there because .content gives back raw bytes, which look like gibberish if printed directly, so they need decoding into a string first. At this point I hit a problem:
Traceback (most recent call last):
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
    self._prepare_proxy(conn)
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
    conn.connect()
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connection.py", line 359, in connect
    conn = self._connect_tls_proxy(hostname, conn)
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connection.py", line 496, in _connect_tls_proxy
    return ssl_wrap_socket(
  File "D:\Program Files\Python39\lib\site-packages\urllib3\util\ssl_.py", line 432, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "D:\Program Files\Python39\lib\site-packages\urllib3\util\ssl_.py", line 474, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "D:\Program Files\Python39\lib\ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "D:\Program Files\Python39\lib\ssl.py", line 1040, in _create
    self.do_handshake()
  File "D:\Program Files\Python39\lib\ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python39\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "D:\Program Files\Python39\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.xquge.com', port=443): Max retries exceeded with url: /book/1771.html (Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "f:\NewOneDrive\OneDrive\Python\BookReptile.py", line 20, in <module>
    catalog=requests.get('https://www.xquge.com/book/1771.html',headers=headers, verify=False).content.decode()
  File "D:\Program Files\Python39\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "D:\Program Files\Python39\lib\site-packages\requests\adapters.py", line 510, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.xquge.com', port=443): Max retries exceeded with url: /book/1771.html (Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))
The word Proxy in there pretty much gives the cause away: once I shut down my little "ladder" proxy client, the error disappeared.
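If you would rather not shut the proxy client down every time, requests can also be told to ignore the system proxy for these calls. A hedged sketch of two options (assuming the target site is reachable without any proxy):

import requests

# Option 1: explicitly send this request with no proxy
resp = requests.get('https://www.xquge.com/book/1771.html',
                    headers=headers,
                    proxies={'http': None, 'https': None},
                    verify=False)

# Option 2: use a Session that ignores HTTP_PROXY/HTTPS_PROXY and the system proxy settings
session = requests.Session()
session.trust_env = False
resp = session.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False)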
But right on its heels came a new issue; not an error this time, just a warning:
D:\Program Files\Python39\lib\site-packages\urllib3\connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.xquge.com'. Adding certificate verification is strongly advised.
A quick search shows this warning appears because the request is HTTPS with certificate verification turned off. We aren't carrying any certificates and don't need them here, so it's enough to simply silence the warning with a couple of lines:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
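While we're at it, a note on the earlier decode() call: instead of decoding the raw bytes by hand, requests can sniff the page encoding itself. A hedged alternative to the .content.decode() line:

resp = requests.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False)
resp.encoding = resp.apparent_encoding  # let requests guess the charset from the response body
catalog = resp.text  # decoded text instead of raw bytes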
With the proxy off and the warning silenced, fetching the page no longer complains. Once we have the HTML, we turn it into an XPath-able object, which makes grabbing nodes very convenient; the browser devtools can even copy the XPath of any node for you, which is great. Then we grab every entry node inside the full catalog:
from lxml import etree
# Parse into an XPath-able object
html = etree.HTML(catalog)
# Grab all the catalog entry nodes
chapters = html.xpath('/html/body/div[1]/div[6]/div[5]/div[2]/ul/li/a')
Printing chapters gives the array below. Not exactly human-readable, but at least it proves we got something:
[<Element a at 0x26c538cca40>, <Element a at 0x26c5387dd00>, <Element a at 0x26c539c1280>, <Element a at 0x26c539c19c0>, <Element a at 0x26c539c1a00>, <Element a at 0x26c539c1440>, <Element a at 0x26c539c1b40>, <Element a at 0x26c539c18c0>, <Element a at 0x26c539c17c0>, <Element a at 0x26c539c1800>, <Element a at 0x26c539c1780>, <Element a at 0x26c539c1740>, <Element a at 0x26c539c1640>, <Element a at 0x26c539c16c0>, <Element a at 0x26c539c1dc0>, <Element a at 0x26c539c1e40>, <Element a at 0x26c539e1e80>, <Element a at 0x26c539e1e40>, <Element a at 0x26c539ccec0>, <Element a at 0x26c539ccdc0>, <Element a at 0x26c539ccf40>, <Element a at 0x26c539ccf80>, <Element a at 0x26c52dd10c0>, <Element a at 0x26c539f5b40>, <Element a at 0x26c539f5d80>, <Element a at 0x26c539f5cc0>, <Element a at 0x26c539f5e00>]
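The Element objects aren't meant to be read directly; to sanity-check that the XPath hit the right nodes, you can peek at the text and href of a few of them, roughly like this:

# Print the title and link of the first few catalog entries as a sanity check
for chapter in chapters[:5]:
    print(chapter.xpath('./text()')[0], chapter.xpath('./@href')[0])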
Now we can process each chapter page. I like wrapping this kind of thing in a function, and Python function definitions turn out to be very simple too, so here's the handler:
amount = len(chapters)
nowIndex = 0
# Function to process a chapter page
def processingChapter(url, title):
    content = requests.get(url, headers=headers, verify=False).content.decode()
    html = etree.HTML(content)  # parse into an XPath object
    lines = html.xpath('//*[@id="content"]/p[@class="bodytext"]/text()')  # list of the chapter's text lines
    finalStr = '\r\n'.join(lines)  # join the list into one string with line breaks
    fileName = 'files/' + title + '.txt'  # build the file name
    fileWriter = codecs.open(fileName, 'w', 'utf-8')  # open the file for writing
    fileWriter.writelines(finalStr)  # write the string
    fileWriter.flush()
    fileWriter.close()
    global nowIndex  # a variable defined outside the function needs the global keyword before being assigned inside it, otherwise Python raises an unbound-local error
    nowIndex += 1
    print(fileName + ' 已保存' + str(nowIndex) + '/' + str(amount))
    pass
The code saves every chapter as its own file. This example was written to learn, not to actually rip the book, so I didn't bother writing everything into a single file.
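If you did want the whole book in one file, a minimal variant is to open a single output file in append mode and write each chapter into it; a sketch only (the combined.txt name is just an assumption for illustration):

# Variant: append every chapter to one file instead of writing one file per chapter
def appendChapter(url, title):
    content = requests.get(url, headers=headers, verify=False).content.decode()
    lines = etree.HTML(content).xpath('//*[@id="content"]/p[@class="bodytext"]/text()')
    with open('files/combined.txt', 'a', encoding='utf-8') as f:
        f.write(title + '\r\n' + '\r\n'.join(lines) + '\r\n\r\n')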
With the chapter handler done, all that's left is to walk the catalog and call it, so one more function:
# Function to walk the catalog and fire off the requests
def processingDirectory(chapters):
    for chapter in chapters:
        url = chapter.xpath('./@href')[0]  # the chapter link
        title = chapter.xpath('./text()')[0]  # the chapter title
        processingChapter(url, title)  # process the chapter content
        time.sleep(0.8)  # requesting too fast risks a ban and also caused skipped chapters
    pass
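One thing worth guarding against: if the catalog href ever turns out to be relative rather than a full URL, it can be absolutized with urllib.parse.urljoin before requesting it; a defensive tweak, not something this particular site seems to need:

from urllib.parse import urljoin

base = 'https://www.xquge.com/book/1771.html'
url = urljoin(base, chapter.xpath('./@href')[0])  # works whether href is absolute or relative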
Finally, just call it:
# Kick it off
processingDirectory(chapters)
Execution Results
files/001 划重点?.txt 已保存1/1304
files/002 既治病,也要命!.txt 已保存2/1304
files/003 我都要.txt 已保存3/1304
files/004 我祝福你.txt 已保存4/1304
files/005 我再祝福你.txt 已保存5/1304
files/006 一年有三百六十五个日出.txt 已保存6/1304
files/007 减肥.txt 已保存7/1304
files/008 你会恨我的.txt 已保存8/1304
files/009 皮.txt 已保存9/1304
files/010 闯祸?.txt 已保存10/1304
files/011 出发!皮卡皮!.txt 已保存11/1304
And that's it: the simplest possible little crawler, nothing technically hard about it. I only built it to get acquainted with Python, which I had never touched before, so my understanding of some of the constructs may well be off; corrections are very welcome.
Full Source Code
# My first crawler as a newbie
# Scrapes a novel from a "biquge"-style site (xquge)
# 0. Fake a request header first
# 1. Take the URL of a book: https://www.xquge.com/book/1771.html
# 2. Scrape the chapter links and titles from the catalog
# 3. Walk the catalog and request each chapter's content
# 4. Process each chapter and write the content to a file
# 5. Done, go brag about it
import requests
import urllib3
from lxml import etree
import time
import codecs

# Silence the HTTPS certificate warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Fake a request header to avoid an outright ban (this site doesn't actually block header-less requests, though)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36', 'Referer': 'https://www.xquge.com'}

# Fetch the catalog page
catalog = requests.get('https://www.xquge.com/book/1771.html', headers=headers, verify=False).content.decode()

# Parse into an XPath-able object
html = etree.HTML(catalog)

# Grab all the catalog entry nodes
chapters = html.xpath('/html/body/div[1]/div[6]/div[5]/div[2]/ul/li/a')

amount = len(chapters)
nowIndex = 0

# Function to process a chapter page
def processingChapter(url, title):
    content = requests.get(url, headers=headers, verify=False).content.decode()
    html = etree.HTML(content)  # parse into an XPath object
    lines = html.xpath('//*[@id="content"]/p[@class="bodytext"]/text()')  # list of the chapter's text lines
    finalStr = '\r\n'.join(lines)  # join the list into one string with line breaks
    fileName = 'files/' + title + '.txt'  # build the file name
    fileWriter = codecs.open(fileName, 'w', 'utf-8')  # open the file for writing
    fileWriter.writelines(finalStr)  # write the string
    fileWriter.flush()
    fileWriter.close()
    global nowIndex  # a variable defined outside the function needs the global keyword before being assigned inside it, otherwise Python raises an unbound-local error
    nowIndex += 1
    print(fileName + ' 已保存' + str(nowIndex) + '/' + str(amount))
    pass

# Function to walk the catalog and fire off the requests
def processingDirectory(chapters):
    for chapter in chapters:
        url = chapter.xpath('./@href')[0]  # the chapter link
        title = chapter.xpath('./text()')[0]  # the chapter title
        processingChapter(url, title)  # process the chapter content
        time.sleep(0.8)  # requesting too fast risks a ban and also caused skipped chapters
    pass

# Kick it off
processingDirectory(chapters)
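One small prerequisite: the script writes into a files/ folder and codecs.open will fail if that folder doesn't exist, so it's worth creating it before the first run; a tiny addition that isn't in the original script:

import os

os.makedirs('files', exist_ok=True)  # make sure the output folder exists before saving chapters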