Web Scraping Exercise: Scraping Wallpapers from a Website
Author: 互联网
June 9, 2022, 21:38
After working through the first three chapters of 《python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice), I still felt rusty with the basic crawler libraries, so I looked online for some simple scraping exercises.
Requirements
Scrape high-definition 4K wallpapers from the site www.4kbizhi.com.
Function Modules
Global Variables
To make the scraping settings easier to modify, a few variables are placed at the top of the script as globals:
BASIC_URL = 'https://www.4kbizhi.com'
START_PAGE = 1
END_PAGE = 10
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
}
assemble_url(page)
This function assembles the URL of a wallpaper listing page. Observing the URLs reveals a very simple pattern: the first page lives at /meinv/index.html, while every subsequent page lives at /meinv/index_{page}.html, where page is the page number.
The URL construction code is therefore:
def assemble_url(page):
if page == 1:
        url = urljoin(BASIC_URL, '/meinv/index.html')
else:
url = urljoin(BASIC_URL, f'/meinv/index_{page}.html')
return url
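As a quick sanity check, here is a hypothetical snippet (not part of the original script) showing the URLs the function produces, which follow directly from the pattern above:

# Hypothetical quick check of the URL pattern
print(assemble_url(1))  # https://www.4kbizhi.com/meinv/index.html
print(assemble_url(5))  # https://www.4kbizhi.com/meinv/index_5.html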
scrape_pages(url)
This function requests the page at url and returns the HTML document. The GET request passes the header parameter to lightly disguise the client. The code is:
def scrape_pages(url):
    logging.info(f'Scraping {url}...')
    try:
        response = requests.get(url, headers=header)
        if response.status_code == 200:
            response.encoding = "gbk"
            return response.text
        logging.error(f'Unexpected status code {response.status_code}; failed to scrape {url}!')
    except requests.RequestException:
        logging.error(f'Request error; failed to scrape {url}!')
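The encoding is set to "gbk" by hand because that is what this site uses. If that assumption ever breaks, a hedged alternative (a sketch only; scrape_pages_autodetect is a hypothetical name) is to let requests guess the charset:

import requests

def scrape_pages_autodetect(url):
    # Hypothetical variant: apparent_encoding is inferred from the response
    # body by requests, so it is heuristic; the hard-coded "gbk" above is
    # what was verified for this site. Reuses the module-level header dict.
    try:
        response = requests.get(url, headers=header, timeout=10)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response.text
    except requests.RequestException:
        return None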
get_img_link(text)
This function parses the HTML document and extracts the image links. The @src attributes hold only relative paths, so each one has to be joined with the site root:
def get_img_link(text):
    html = etree.HTML(text)
    # Extract the image URLs; the @src values are relative paths,
    # so prepend the site root.
    img_links = ['https://www.4kbizhi.com' + url for url in html.xpath('//li/a/img/@src')]
    # Extract the image names.
    img_names = html.xpath('//li/a/p/text()')
    links_names = zip(img_links, img_names)
    return links_names
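Plain string concatenation works here because the scraped @src values are root-relative. A slightly more general sketch (get_img_link_joined is a hypothetical name) uses urljoin, which the script already imports, to handle relative and absolute paths alike:

from urllib.parse import urljoin
from lxml import etree

def get_img_link_joined(text):
    # Hypothetical variant: urljoin copes with root-relative, page-relative
    # and absolute src values. Reuses BASIC_URL from the globals above.
    html = etree.HTML(text)
    img_links = [urljoin(BASIC_URL, src) for src in html.xpath('//li/a/img/@src')]
    img_names = html.xpath('//li/a/p/text()')
    return list(zip(img_links, img_names))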
download_img(link_name)
This function downloads an image from its URL and saves it locally, using the image's name as the file name:
def download_img(link_name):
    try:
        r = requests.get(link_name[0], headers=header)
        name = link_name[1] + '.jpg'
        path = 'D:/桌面文件/4k壁纸/' + name
        with open(path, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Error downloading image {link_name[1]}:', e)
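One practical caveat: open() raises FileNotFoundError if the target directory does not exist. A small sketch (download_img_safe is a hypothetical name) creates it first:

import os
import requests

def download_img_safe(link_name, save_dir='D:/桌面文件/4k壁纸/'):
    # Hypothetical variant: ensure the directory exists before writing.
    # Reuses the module-level header dict.
    os.makedirs(save_dir, exist_ok=True)
    try:
        r = requests.get(link_name[0], headers=header, timeout=10)
        with open(os.path.join(save_dir, link_name[1] + '.jpg'), 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Error downloading image {link_name[1]}:', e)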
main(page)
This function ties the functions above together to complete one full scrape of a page:
def main(page):
    url = assemble_url(page)
    text = scrape_pages(url)
    if not text:  # the request failed; skip this page
        return
    # links_names is an iterator of (link, image name) tuples
    links_names = get_img_link(text)
    for link_name in links_names:
        download_img(link_name)
Multiprocessing Invocation
if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pages = range(START_PAGE, END_PAGE + 1)
    pool.map(main, pages)
    pool.close()
    pool.join()
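Scraping and downloading are I/O-bound rather than CPU-bound, so threads are a reasonable alternative to processes. A hedged sketch using the standard library's concurrent.futures (a different technique from the process pool above):

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # Alternative sketch: threads suit I/O-bound work and avoid the cost of
    # spawning interpreter processes on Windows. Reuses main, START_PAGE and
    # END_PAGE from the script.
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(main, range(START_PAGE, END_PAGE + 1))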
The complete code is given below:
# While thick ice still covered the North Sea, I saw plum blossoms in full bloom.
# 李志文 Ruohary
# Development environment: 2022/6/9 19:00
import requests
import logging
from urllib.parse import urljoin
from lxml import etree
import multiprocessing

BASIC_URL = 'https://www.4kbizhi.com'
START_PAGE = 1
END_PAGE = 10
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
}


def assemble_url(page):
    if page == 1:
        url = urljoin(BASIC_URL, '/meinv/index.html')
    else:
        url = urljoin(BASIC_URL, f'/meinv/index_{page}.html')
    return url


def scrape_pages(url):
    logging.info(f'Scraping {url}...')
    try:
        response = requests.get(url, headers=header)
        if response.status_code == 200:
            response.encoding = "gbk"
            return response.text
        logging.error(f'Unexpected status code {response.status_code}; failed to scrape {url}!')
    except requests.RequestException:
        logging.error(f'Request error; failed to scrape {url}!')


def get_img_link(text):
    html = etree.HTML(text)
    # Extract the image URLs; the @src values are relative paths,
    # so prepend the site root.
    img_links = ['https://www.4kbizhi.com' + url for url in html.xpath('//li/a/img/@src')]
    # Extract the image names.
    img_names = html.xpath('//li/a/p/text()')
    links_names = zip(img_links, img_names)
    return links_names


def download_img(link_name):
    try:
        r = requests.get(link_name[0], headers=header)
        name = link_name[1] + '.jpg'
        path = 'D:/桌面文件/4k壁纸/' + name
        with open(path, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Error downloading image {link_name[1]}:', e)


def main(page):
    url = assemble_url(page)
    text = scrape_pages(url)
    if not text:  # the request failed; skip this page
        return
    # links_names is an iterator of (link, image name) tuples
    links_names = get_img_link(text)
    for link_name in links_names:
        download_img(link_name)


if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pages = range(START_PAGE, END_PAGE + 1)
    pool.map(main, pages)
    pool.close()
    pool.join()