
1121 A simple crawler, spaghetti-code version

Author: Internet

My first crawler, scraping a novel site

http://www.minixiaoshuo.com/

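Unresolved issues:

  1. When crawling novels from the home page, each novel's chapter index page starts with a "latest chapters" block, so every downloaded book begins with the newest dozen or so chapters, and I have not found a way to strip them out.
  2. Writing is too slow: two books, about 10 MB in total, took 13 minutes to crawl.
  3. The code is redundant; the crawl has not yet been split into functions.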
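
One possible way to handle the first issue is sketched here. It assumes (this is not verified against the site) that every chapter shown in the "latest chapters" block appears again later in the full chapter list, so keeping only the last occurrence of each URL drops the leading block. The function name `drop_latest_block` and the sample data are made up for illustration; in the script it would be applied to the list of (URL, title) pairs stored in `section_url_title`.

```python
def drop_latest_block(chapters):
    """Keep only the last occurrence of each chapter URL, in page order."""
    seen = set()
    kept = []
    # Walk backwards so the copy in the full list wins over the
    # earlier copy in the "latest chapters" block.
    for url, title in reversed(chapters):
        if url not in seen:
            seen.add(url)
            kept.append((url, title))
    kept.reverse()
    return kept


# Made-up sample: two "latest" entries followed by the full list.
sample = [
    ('/11781/4.html', 'Chapter 3'),
    ('/11781/3.html', 'Chapter 2'),
    ('/11781/2.html', 'Chapter 1'),
    ('/11781/3.html', 'Chapter 2'),
    ('/11781/4.html', 'Chapter 3'),
]
print(drop_latest_block(sample))
# [('/11781/2.html', 'Chapter 1'), ('/11781/3.html', 'Chapter 2'), ('/11781/4.html', 'Chapter 3')]
```

If the assumption does not hold for some book, chapters that appear only in the latest block would stay at the front of the result, so the output is worth checking by eye.
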
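import requests
import re

# Fetch a URL and return the response text
def get_text(url):
    # A timeout keeps one stalled request from hanging the whole crawl
    response = requests.get(url, timeout=10)
    response.encoding = 'utf-8'
    return response.text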

# Home page
z_url = 'http://www.minixiaoshuo.com/'
z_txt = get_text(z_url)
# print(z_txt)

# <h2><a href="/book/19237.html">凡人修仙之仙界篇</a></h2>
# <li><a href="/book/10158.html">[言情总裁] 极品全能狂医</a></li>
# Get the list of books on the home page as (link, title) pairs
z_book_url = re.findall(r'<h2><a href="(.*?)">(.*?)</a></h2>', z_txt)
# print(z_book_url)

# book_url = re.findall('<li><a href="(.*?)">(.*?)</a></li>',z_txt)
# print(book_url)

for z_url, z_title in z_book_url:
    # The href already starts with '/', so do not add another slash
    z_url = 'http://www.minixiaoshuo.com%s' % z_url
    # print(z_url,z_title)
    # <a href="/11781/" id="read">画演天地最新章节</a>
    # Fetch the novel's landing page
    book_z = get_text(z_url)
    # print(book_z)
    # Take the first "read" link; [0] raises IndexError if the page layout differs
    page_url = re.findall(r'<p class="read_link"><a href="(.*?)" id="read">', book_z)[0]
    # print(page_url)
    book_page_url = 'http://www.minixiaoshuo.com%s' % page_url
    # URL of the novel's chapter index page
    # print(book_page_url)
    # Fetch the text of this book's chapter index page
    page_txt = get_text(book_page_url)
    # print(page_txt)
    # Get the chapter links and chapter titles
    # <dd><a href="/20799/279.html">第268章 试一试</a></dd>
    section_url_title = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', page_txt)
    with open(f'{z_title}.txt', 'w', encoding='utf-8') as f:
        for section_url, section_title in section_url_title:
            # Detect the starting chapter, e.g. /11781/2.html 第1章 善战骁勇
            # start_section = re.findall(r'.*?/(.*?).html',section_url)[0]
            # lis = start_section.split('/')
            # print(lis[1]) gives the 2 in /11781/2.html
            section_ur = f'http://www.minixiaoshuo.com{section_url}'
            # print(section_ur, section_title)
            # print(section_ur )      # http://www.minixiaoshuo.com/20799/356.html
            # Fetch the chapter page and cut out the body text
            section_wen = get_text(section_ur)
            pagetet = re.findall(r'</div><p>(.*?)<div class=', section_wen)[0]
            # re.findall(r'',section_text)
            # print(pagetet)
            # Strip the ad placeholder, paragraph tags, spaces and the site watermark
            page = pagetet.replace('<p class="middleshow"><script>show(pc_middle);</script></p>','').replace('<p>','').replace(' ','').replace('</p>','').replace('skbwznaitoaip','')

            # Write the chapter title and text to the file.
            # Use '\n' only: in text mode Windows already expands it to '\r\n',
            # so writing '\r\n' would produce '\r\r\n'.
            f.write(section_title)
            f.write('\n')
            f.write(page)
            f.write('\n')
    print(f'{z_title}.txt download complete')
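
On the second issue: the 13 minutes are probably spent mostly waiting on one HTTP request after another, not on writing to disk. A minimal sketch of one way to overlap those requests with a small thread pool follows. The helper name `fetch_in_order` and the `fake_fetch` stand-in are hypothetical; in the script above, `get_text` would be passed as `fetch`, together with the list of full chapter URLs.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_in_order(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool.

    Results come back in the same order as urls, so the chapters can
    still be written to the file one after another.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == '__main__':
    # Stand-in for get_text so the sketch runs without network access.
    def fake_fetch(url):
        return f'<html>{url}</html>'

    pages = fetch_in_order(['/1.html', '/2.html', '/3.html'], fake_fetch)
    print(pages)
```

`pool.map` returns results in input order even when the requests finish out of order, which is why the write loop would not need to change. Keeping `max_workers` small avoids putting too much load on the site.
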

Tags: spaghetti code, title, url, 1121, section, crawler, book, print, page
Source: https://www.cnblogs.com/fwzzz/p/11907845.html