
Scraping novels from 新笔趣阁: a small case study for beginners!

2020-08-28 20:35:05 Author: 互联网

Scraping novels from 笔趣阁 (search + scrape)

First, a look at the final result (shown as a GIF in the original post).

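Steps:
1. Explore the site "http://www.xbiquge.la/" to see how it works.

2. Write the search function (get the table-of-contents URL of each book).

3. Write the writing function (write files chapter by chapter).

4. Polish the code (fix bugs, create a folder).

P.S. Required modules: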

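import requests
import bs4          # the two essential modules for scraping, no explanation needed
import os           # used to create the folder
import sys          # not really needed, only for nicer progress output
import time
import random       # random numbers for the delay between requests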

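I. How the site's search works, implemented in Python.

I assumed this site worked like most others, where a search is made by changing the URL, but it does not.

You can see that the URL does not change with the search term.
That leaves another possibility: the page is updated through a POST request. Let's open the Network panel to verify.

My guess was right. Now let's simulate the request.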

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Cookie": "_abcde_qweasd=0; Hm_lvt_169609146ffe5972484b0957bd1b46d6=1583122664; bdshare_firstime=1583122664212; Hm_lpvt_169609146ffe5972484b0957bd1b46d6=1583145548",
    "Host": "www.xbiquge.la"}      # send a fairly complete set of headers, just in case
x = input("Enter a book title or author name: ")   # this variable holds the search term
data = {'searchkey': x}
url = 'http://www.xbiquge.la/modules/article/waps.php'
r = requests.post(url, data=data, headers=headers)
soup = bs4.BeautifulSoup(r.text.encode('utf-8'), "html.parser")  # BeautifulSoup makes it easy to extract page content

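But if I print(soup) at this point, all the Chinese text comes out garbled!

The encoding is clearly wrong. Checking r.encoding shows which encoding requests assumed for the response.

After correcting the encoding (the later code re-encodes r.text as ISO-8859-1 before handing it to BeautifulSoup), the content can be extracted normally and matches what the browser shows: our search results.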
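
To see why the text was garbled and why re-encoding helps, here is a minimal offline sketch. It assumes the page bytes are really UTF-8 and that they were decoded as ISO-8859-1 by mistake; the sample string is made up for illustration.

```python
# Bytes as the server might send them (assumed to be UTF-8 here).
raw = "笔趣阁小说".encode("utf-8")

# What the text looks like if those bytes are decoded as ISO-8859-1 by mistake.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so encoding again
# gives back the original bytes, which can then be decoded properly.
recovered_bytes = garbled.encode("iso-8859-1")
recovered = recovered_bytes.decode("utf-8")

print(repr(garbled))
print(recovered)
```

With a real response r, r.content holds the raw bytes directly, so passing r.content to BeautifulSoup is an alternative worth trying.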

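II. Next, find what we want in this pile of markup (book title, author, table-of-contents URL)

Inspecting the elements makes it easy to locate where they are.

The link and the book title are in the <a> tag inside a <td class="even">, and the author is in a <td class="even"> too.

What! The same tag and class for both! What now? Never mind, print every <td class="even"> first and take a look.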

book_author = soup.find_all("td", class_="even")
for each in book_author:
    print(each)

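You can see that the cells come in two kinds that alternate: one with the title and link, then one with the author.

So we can loop with an odd/even toggle and handle the two kinds separately. (Otherwise the call needed for the first kind, each.a.get("href"), raises an error on the second kind, which has no <a> tag. A try would probably handle that error too, but I haven't tested it.)

We also create three lists to store the three values.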

books = []          # book titles
authors = []        # author names
directory = []      # table-of-contents links
tem = 1
for each in book_author:
    if tem == 1:
        books.append(each.text)
        tem -= 1
        directory.append(each.a.get("href"))
    else:
        authors.append(each.text)
        tem += 1

Success! The three lists line up with each other!
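
Because the cells strictly alternate, the toggle variable could also be replaced by slicing. A minimal sketch, with plain tuples standing in for the parsed <td> cells (the titles, authors, and paths are invented):

```python
# Each stand-in cell is (text, href); author cells have no link.
cells = [
    ("Book A", "/1/100/"), ("Author A", None),
    ("Book B", "/2/200/"), ("Author B", None),
]

title_cells = cells[0::2]    # 1st, 3rd, 5th ... cells: title + link
author_cells = cells[1::2]   # 2nd, 4th, 6th ... cells: author

books = [text for text, _ in title_cells]
directory = [href for _, href in title_cells]
authors = [text for text, _ in author_cells]

print(books, authors, directory)
```

With real BeautifulSoup cells, the text would come from each.text and the link from each.a.get("href"), as in the loop above.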
So how do we let the user pick a number and have Python return the matching table-of-contents link?
We can do this:

print('Search results:')
for num, book, author in zip(range(1, len(books)+1), books, authors):
    print((str(num)+": ").ljust(4)+(book+"\t").ljust(25) + ("\tAuthor: " + author).ljust(20))
search = dict(zip(books, directory))

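Neat, isn't it? "search" is a dict built from the book titles and the table-of-contents URLs. With i being the number the user entered, all we need is
return search[books[i-1]]
and the next function gets that book's table-of-contents URL.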
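
One caveat with a dict keyed by title: two results that share a title overwrite each other. A small sketch of picking by list position instead, with a range check; pick_directory is a hypothetical helper and the data is invented:

```python
def pick_directory(directory, choice):
    """Return the URL for a 1-based menu choice, or None if out of range."""
    if 1 <= choice <= len(directory):
        return directory[choice - 1]
    return None

books = ["Same Title", "Same Title", "Other"]
directory = ["/a/", "/b/", "/c/"]

# The dict keeps only one entry per title, so "/a/" is lost.
search = dict(zip(books, directory))
print(len(search))  # 2

print(pick_directory(directory, 1))
print(pick_directory(directory, 9))
```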

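III. Get the chapter URLs, get the text, write it to files.

Once we have the table-of-contents URL, we can get the URL of every chapter the same way (I won't repeat the details).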

def get_text_url(title_url):
    r = requests.get(title_url, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    titles = soup.find_all("dd")    # one <dd> per chapter link
    texts = []          # chapter URLs
    names = []          # chapter names
    texts_names = []
    for each in titles:
        texts.append("http://www.xbiquge.la"+each.a["href"])
        names.append(each.a.text)
    texts_names.append(texts)
    texts_names.append(names)
    return texts_names          # note: the return value is a list containing two lists!

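Note that the return value is a list containing two lists!
texts_names[0] holds the URL of each chapter and texts_names[1] holds the chapter names,
which makes them easy to use in the next function, the one that writes the content.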
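
As an alternative to two parallel lists, the URLs and names could be zipped into pairs, and urllib.parse.urljoin from the standard library could build the absolute URLs. A sketch with invented hrefs and chapter names, assuming the links on the page are site-relative paths:

```python
from urllib.parse import urljoin

BASE = "http://www.xbiquge.la"

# Invented sample data standing in for each.a["href"] and each.a.text.
hrefs = ["/1/100/1001.html", "/1/100/1002.html"]
names = ["Chapter 1", "Chapter 2"]

# One (url, name) pair per chapter keeps the two values together.
chapters = [(urljoin(BASE, href), name) for href, name in zip(hrefs, names)]

for url, name in chapters:
    print(name, url)
```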
Next comes writing the files!

url = texts_url[0][n]
name = texts_url[1][n]
req = requests.get(url=url, headers=headers)
time.sleep(random.uniform(0, 0.5))  # even with a delay the site may still return 503 (it's a small site)
req.encoding = 'UTF-8'  # the encoding here is set to UTF-8, unlike the table-of-contents page; take note!
html = req.text
soup = bs4.BeautifulSoup(html, features="html.parser")
texts = soup.find_all("div", id="content")
while len(texts) == 0:  # on a 503 nothing is found, so request again until the content is there
    req = requests.get(url=url, headers=headers)
    time.sleep(random.uniform(0, 0.5))
    req.encoding = 'UTF-8'
    html = req.text
    soup = bs4.BeautifulSoup(html, features="html.parser")
    texts = soup.find_all("div", id="content")
# .text extracts the text and drops the <br> tags; replace() then turns each run of
# eight non-breaking spaces into a blank line and strips the footer found on every page
content = texts[0].text.replace('\xa0' * 8, '\n\n')
content = content.replace(
    "亲,点击进去,给个好评呗,分数越高更新越快,据说给新笔趣阁打满分的最后都找到了漂亮的老婆哦!手机站全新改版升级地址：http://m.xbiquge.la，数据和书签与电脑站同步，无广告清新阅读！", "\n")  # must stay in Chinese to match the page
with open(name + '.txt', "w", encoding='utf-8') as f:
    f.write(content)
sys.stdout.write("\rDownloaded {} chapters, {} remaining".format(count, max - count))  # the only use of sys: no newline
count += 1

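n is the chapter index, so a plain for loop over it writes every chapter to a file.
This way of handling 503 errors is brute force, but it is what works best here!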
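
An endless retry loop never gives up if a chapter page is permanently broken. Here is a hedged sketch of a bounded retry with growing delays. fetch_with_retry and flaky_fetch are hypothetical names, and the fake fetcher stands in for the requests + BeautifulSoup step so the example runs offline:

```python
import time

def fetch_with_retry(fetch, url, attempts=5, base_delay=0.5):
    """Call fetch(url) until it returns something non-empty or attempts run out."""
    for attempt in range(attempts):
        result = fetch(url)
        if result:
            return result
        # Wait a little longer after each failure: 0.5 s, 1 s, 1.5 s ...
        time.sleep(base_delay * (attempt + 1))
    return None

# Fake fetcher: fails twice (like a 503), then succeeds.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    return "chapter text" if calls["n"] >= 3 else ""

text = fetch_with_retry(flaky_fetch, "http://example.invalid/1.html", base_delay=0.01)
print(text, calls["n"])
```

A caller can then skip or log a chapter when None comes back instead of hanging forever.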

IV. Tidy up the code and fix bugs.

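Package the ideas above into three or four functions.
Then test it and see what bugs turn up; fix what you can, and wrap whatever you can't fix in a try. (Ha ha.)
If you want a folder per book, look into the os module. It's simple, so I won't go into it here.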
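
For the folder part, a minimal sketch using os.makedirs and os.path.join, which avoids changing the working directory. safe_name and save_chapter are hypothetical helpers, and the set of characters replaced is an assumption based on what Windows rejects in file names:

```python
import os
import re
import tempfile

def safe_name(name):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()

def save_chapter(folder, chapter_name, content):
    """Write one chapter into folder, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)   # no error if it already exists
    path = os.path.join(folder, safe_name(chapter_name) + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_chapter(os.path.join(tmp, "Book A"), "Chapter 1: Why?", "hello")
    saved_name = os.path.basename(path)
    with open(path, encoding="utf-8") as f:
        saved_text = f.read()
    print(saved_name)
```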
Finally, here is the complete code!

import requests
import bs4          # the two essential modules for scraping, no explanation needed
import os           # used to create the folder
import sys          # not really needed, only for nicer progress output
import time
import random       # random numbers for the delay between requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Cookie": "_abcde_qweasd=0; Hm_lvt_169609146ffe5972484b0957bd1b46d6=1583122664; bdshare_firstime=1583122664212; Hm_lpvt_169609146ffe5972484b0957bd1b46d6=1583145548",
    "Host": "www.xbiquge.la"}      # send a fairly complete set of headers, just in case
b_n = ""            # name of the selected book, also used as the folder name
def get_title_url():
    global b_n
    x = input("Enter a book title or author name: ")
    data = {'searchkey': x}
    url = 'http://www.xbiquge.la/modules/article/waps.php'
    r = requests.post(url, data=data, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    book_author = soup.find_all("td", class_="even")
    books = []          # book titles
    authors = []        # author names
    directory = []      # table-of-contents links
    tem = 1
    for each in book_author:    # cells alternate: title + link, then author
        if tem == 1:
            books.append(each.text)
            tem -= 1
            directory.append(each.a.get("href"))
        else:
            authors.append(each.text)
            tem += 1
    if books == []:
        print("No books found, please search again!")
        return get_title_url()      # hand back the result of the new search
    print('Search results:')
    for num, book, author in zip(range(1, len(books)+1), books, authors):
        print((str(num)+": ").ljust(4)+(book+"\t").ljust(25) + ("\tAuthor: " + author).ljust(20))
    search = dict(zip(books, directory))
    while True:         # keep asking until the input is a number in range
        try:
            i = int(input("Enter the number to download (enter '0' to search again): "))
        except ValueError:
            print("Invalid input, try again:")
            continue
        if 0 <= i <= len(books):
            break
        print("Invalid input, try again:")
    if i == 0:
        return get_title_url()
    b_n = books[i-1]
    os.makedirs(b_n, exist_ok=True)     # create the book folder if it does not exist yet
    os.chdir(b_n)
    return search[b_n]

def get_text_url(title_url):
    r = requests.get(title_url, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    titles = soup.find_all("dd")    # one <dd> per chapter link
    texts = []          # chapter URLs
    names = []          # chapter names
    texts_names = []
    for each in titles:
        texts.append("http://www.xbiquge.la"+each.a["href"])
        names.append(each.a.text)
    texts_names.append(texts)
    texts_names.append(names)
    return texts_names          # note: the return value is a list containing two lists!


def readnovel(texts_url):
    count = 1
    total = len(texts_url[1])       # named total to avoid shadowing the built-in max()
    print("Estimated time: {} minutes".format((total // 60)+1))
    tishi = input(str(b_n)+" has {} chapters; enter 'y' to download, anything else to cancel: ".format(total))
    if tishi == "y" or tishi == "Y":
        for n in range(total):
            url = texts_url[0][n]
            name = texts_url[1][n]
            req = requests.get(url=url, headers=headers)
            time.sleep(random.uniform(0, 0.5))          # even with a delay the site may still return 503 (it's a small site)
            req.encoding = 'UTF-8'                      # the encoding here is set to UTF-8, unlike the table-of-contents page; take note!
            html = req.text
            soup = bs4.BeautifulSoup(html, features="html.parser")
            texts = soup.find_all("div", id="content")
            while len(texts) == 0:                      # on a 503 nothing is found, so request again until the content is there
                req = requests.get(url=url, headers=headers)
                time.sleep(random.uniform(0, 0.5))
                req.encoding = 'UTF-8'
                html = req.text
                soup = bs4.BeautifulSoup(html, features="html.parser")
                texts = soup.find_all("div", id="content")
            # .text extracts the text and drops the <br> tags; replace() then turns each run of
            # eight non-breaking spaces into a blank line and strips the footer found on every page
            content = texts[0].text.replace('\xa0' * 8, '\n\n')
            content = content.replace("亲,点击进去,给个好评呗,分数越高更新越快,据说给新笔趣阁打满分的最后都找到了漂亮的老婆哦!手机站全新改版升级地址：http://m.xbiquge.la，数据和书签与电脑站同步，无广告清新阅读！", "\n")  # must stay in Chinese to match the page
            with open(name+'.txt', "w", encoding='utf-8') as f:
                f.write(content)
            sys.stdout.write("\rDownloaded {} chapters, {} remaining".format(count, total-count))     # the only use of sys: no newline
            count += 1
        print("\nAll chapters downloaded")
    else:
        print("Cancelled!")
        os.chdir('..')
        try:
            os.rmdir(b_n)       # only succeeds if the folder is empty
        except OSError:
            pass
        main()

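def main():
    title_url = get_title_url()
    texts_url = get_text_url(title_url)
    readnovel(texts_url)
    input("Press Enter to exit")


if __name__ == '__main__':
    print("All novels come from '新笔趣阁' ---> http://www.xbiquge.la\nso if a search finds nothing, there is nothing I can do..........@晓轩\nTo help downloads complete, each chapter has a random delay of 0 to 0.5 seconds!")
    main()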


Source: https://www.cnblogs.com/A3535/p/13579737.html