1121 A Simple Spaghetti-Code Web Crawler
Author: Internet
My first crawler, which scrapes a novel site:
http://www.minixiaoshuo.com/
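Unresolved problems:
- When crawling novels from the home page, each book's chapter index page also contains a "latest chapters" block, so every saved novel starts with a dozen or so of the newest chapters, and I have not found a way to remove them.
- Writing is too slow: two books, about 10 MB in total, took 13 minutes to crawl.
- The code is redundant; for now the crawl is not split into functions.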
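One possible way to approach the first problem is to de-duplicate the chapter list before downloading. The sketch here assumes (unverified against the site) that the "latest chapters" block links to the same chapter URLs that appear again in the full chapter list further down the page. `drop_latest_block` is a hypothetical helper name, not part of the original script; it keeps only the last occurrence of each URL, so the early duplicates disappear and the full list keeps its order.

```python
def drop_latest_block(chapters):
    """Keep only the last occurrence of each chapter URL, preserving order.

    `chapters` is a list of (url, title) tuples, as returned by re.findall.
    Assumption: the "latest chapters" entries reappear later in the full list.
    """
    # Remember the index of the last time each URL is seen.
    last_index = {url: i for i, (url, _title) in enumerate(chapters)}
    # Keep an entry only if it is that last occurrence.
    return [item for i, item in enumerate(chapters) if last_index[item[0]] == i]


if __name__ == '__main__':
    sample = [
        ('/1/3.html', 'Chapter 3'),  # "latest chapters" block
        ('/1/2.html', 'Chapter 2'),
        ('/1/1.html', 'Chapter 1'),  # full list starts here
        ('/1/2.html', 'Chapter 2'),
        ('/1/3.html', 'Chapter 3'),
    ]
    for url, title in drop_latest_block(sample):
        print(url, title)
```

In the script this could be applied to `section_url_title` right after the `re.findall` call that builds it. If the site's "latest chapters" links differ from the links in the full list, this approach will not help and the block would have to be cut out of the HTML instead.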
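import requests
import re

# Fetch a URL and return the response body as text
def get_text(url):
    # timeout keeps a stalled connection from hanging the whole crawl
    response = requests.get(url, timeout=10)
    response.encoding = 'utf-8'
    return response.text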
# Home page
z_url = 'http://www.minixiaoshuo.com/'
z_txt = get_text(z_url)
# print(z_txt)
# Sample links found on the home page:
# <h2><a href="/book/19237.html">凡人修仙之仙界篇</a></h2>
# <li><a href="/book/10158.html">[言情总裁] 极品全能狂医</a></li>
# Get the list of books on the home page as (relative URL, title) pairs
z_book_url = re.findall('<h2><a href="(.*?)">(.*?)</a></h2>', z_txt)
# print(z_book_url)
# book_url = re.findall('<li><a href="(.*?)">(.*?)</a></li>',z_txt)
# print(book_url)
for book_path, z_title in z_book_url:
    # The hrefs already start with '/', so append them directly to the host
    book_url = f'http://www.minixiaoshuo.com{book_path}'
    # print(book_url, z_title)
    # <a href="/11781/" id="read">画演天地最新章节</a>
    # Fetch the book's landing page
    book_z = get_text(book_url)
    # print(book_z)
    page_url = re.findall(r'<p class="read_link"><a href="(.*?)" id="read">', book_z)[0]
    # print(page_url)
    # URL of the book's chapter index page
    book_page_url = f'http://www.minixiaoshuo.com{page_url}'
    # print(book_page_url)
    # Fetch the text of the chapter index page
    page_txt = get_text(book_page_url)
    # print(page_txt)
    # Extract chapter links and chapter titles
    # <dd><a href="/20799/279.html">第268章 试一试</a></dd>
    section_url_title = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', page_txt)
    with open(f'{z_title}.txt', 'w', encoding='utf-8') as f:
        for section_url, section_title in section_url_title:
            # Detect the starting chapter, e.g. /11781/2.html 第1章 善战骁勇
            # start_section = re.findall(r'.*?/(.*?).html',section_url)[0]
            # lis = start_section.split('/')
            # print(lis[1])  # gives the chapter number, i.e. the 2 in /11781/2.html
            section_ur = f'http://www.minixiaoshuo.com{section_url}'
            # print(section_ur, section_title)
            # print(section_ur)  # http://www.minixiaoshuo.com/20799/356.html
            # Fetch the chapter page and extract the chapter body
            section_wen = get_text(section_ur)
            pagetet = re.findall(r'</div><p>(.*?)<div class=', section_wen)[0]
            # re.findall(r'',section_text)
            # print(pagetet)
            # Strip the ad placeholder, paragraph tags, spaces and a junk token
            page = pagetet.replace('<p class="middleshow"><script>show(pc_middle);</script></p>', '').replace('<p>', '').replace(' ', '').replace('</p>', '').replace('skbwznaitoaip', '')
            # Write the chapter title and body; in text mode '\n' is already
            # translated to the platform line ending, so '\r\n' is not needed
            f.write(section_title)
            f.write('\n')
            f.write(page)
            f.write('\n')
    print(f'{z_title}.txt download complete')
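On the speed problem: the 13 minutes are probably spent mostly waiting on one HTTP request after another rather than on writing the file, although I have not measured this. A minimal sketch of one way to overlap the requests uses a thread pool from the standard library. `fetch_all` and `max_workers` are hypothetical names introduced for illustration; the fetch function is passed in as a parameter, so `get_text` from the script can be used as-is.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL concurrently.

    Results are returned in the same order as `urls`, so chapters
    can still be written to the file in reading order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(fetch, urls))


if __name__ == '__main__':
    # Stand-in fetch function so the example runs without network access.
    def fake_fetch(url):
        return f'<html>{url}</html>'

    pages = fetch_all(['/1/1.html', '/1/2.html'], fake_fetch, max_workers=2)
    print(pages)
```

In the script, the chapter pages of one book could be fetched with `fetch_all([f'http://www.minixiaoshuo.com{u}' for u, _ in section_url_title], get_text)` and then cleaned and written in a plain loop. A small `max_workers` value is the safer choice, since many parallel requests may get the crawler throttled or blocked by the site.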
Tags: spaghetti code, title, url, 1121, section, crawler, book, print, page Source: https://www.cnblogs.com/fwzzz/p/11907845.html