
Crawler Exercise: Scraping Wallpapers from a Website

June 9, 2022, 21:38

  After working through the first three chapters of the book 《Python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice), I still felt rusty with the basic crawler libraries, so I went looking for some simple crawler exercises online.

Requirements

  Scrape the HD 4K wallpapers on the site www.4kbizhi.com.

Function Modules

  Global variables

    To make the scraping settings easy to change, a few variables are placed at the top of the file as globals:

BASIC_URL = 'https://www.4kbizhi.com'
START_PAGE = 1
END_PAGE = 10
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
}

  assemble_url(page)

    This function assembles the URL of a wallpaper listing page. Observing the URLs reveals a very simple pattern: for the first page the path is /meinv/index.html, and for every later page it is /meinv/index_{page}.html, where page is the page number.

    The URL-building code is therefore:

def assemble_url(page):
    if page == 1:
        url = urljoin(BASIC_URL, '/meinv/index.html')
    else:
        url = urljoin(BASIC_URL, f'/meinv/index_{page}.html')
    return url
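
    Since both paths begin with a slash, urljoin replaces the path of BASIC_URL outright, so the result is always rooted at the site. A quick sanity check of the two branches:

print(assemble_url(1))  # https://www.4kbizhi.com/meinv/index.html
print(assemble_url(5))  # https://www.4kbizhi.com/meinv/index_5.html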

  scrape_pages(url)

    This function requests the page at url and returns the HTML document; the GET request carries the header parameter as a simple browser disguise. Note that requests.get sits inside the try block so that the requests.RequestException handler can actually catch request failures:

def scrape_pages(url):
    logging.info(f'Scraping {url}...')
    try:
        response = requests.get(url, headers=header)
        if response.status_code == 200:
            response.encoding = 'gbk'  # the site serves GBK-encoded pages
            return response.text
        logging.error(f'Unexpected status code {response.status_code}, failed to scrape {url}!')
    except requests.RequestException:
        logging.error(f'Request exception, failed to scrape {url}!')
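
    The info messages above only show up if logging has been configured, because the root logger defaults to the WARNING level. A minimal setup, with a format string of my own choosing:

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')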

  get_img_link(text)

    This function parses the HTML document and extracts the image links. The src attributes on the page are relative paths, so the site root has to be prepended to them:

def get_img_link(text):
    html = etree.HTML(text)
    # Extract the image addresses, prepending the site root
    # because the src attributes are relative paths
    img_links = ['https://www.4kbizhi.com' + url for url in html.xpath('//li/a/img/@src')]
    # Extract the image names
    img_names = html.xpath('//li/a/p/text()')
    links_names = zip(img_links, img_names)
    return links_names
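
    For reference, the two XPath expressions assume listing markup shaped roughly like the fragment below. The fragment is reconstructed from the XPath itself rather than copied from the site, so treat it as an illustration:

from lxml import etree

sample = '''
<ul>
  <li><a href="/tupian/1.html"><img src="/small/1.jpg"><p>Wallpaper 1</p></a></li>
  <li><a href="/tupian/2.html"><img src="/small/2.jpg"><p>Wallpaper 2</p></a></li>
</ul>
'''
html = etree.HTML(sample)
print(html.xpath('//li/a/img/@src'))  # ['/small/1.jpg', '/small/2.jpg']
print(html.xpath('//li/a/p/text()'))  # ['Wallpaper 1', 'Wallpaper 2']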

  download_img(link_name)

    This function downloads an image from its address and saves it locally, using the image's name as the file name:

def download_img(link_name):
    try:
        r = requests.get(link_name[0], headers=header)
        name = link_name[1] + '.jpg'
        path = 'D:/桌面文件/4k壁纸/' + name  # the author's local save directory
        with open(path, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Error while downloading image {link_name[1]}:', e)
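
    Note that the open call fails if the save directory does not exist yet. Below is a slightly more defensive variant, as a sketch: download_img_safe and SAVE_DIR are names I introduce here, and the directory creation is an addition, not part of the original:

import os

SAVE_DIR = 'D:/桌面文件/4k壁纸'  # the author's save path; adjust as needed

def download_img_safe(link_name):
    # Create the target directory on first use, then save as before
    os.makedirs(SAVE_DIR, exist_ok=True)
    link, name = link_name
    try:
        r = requests.get(link, headers=header)
        with open(os.path.join(SAVE_DIR, name + '.jpg'), 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Error while downloading image {name}:', e)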

  main(page)

    This function ties the functions above together to perform one complete scrape of a single listing page:

def main(page):
    url = assemble_url(page)
    text = scrape_pages(url)
    if text is None:  # the page failed to download, so skip it
        return
    links_names = get_img_link(text)
    # links_names is a zip iterator whose elements are (link, name) tuples
    for link_name in links_names:
        download_img(link_name)
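
    Before wiring in multiprocessing, the whole pipeline can be smoke-tested on a single page:

if __name__ == '__main__':
    main(1)  # fetch, parse, and download only the first listing page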

  Multiprocess invocation

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pages = range(START_PAGE, END_PAGE+1)
    pool.map(main, pages)
    pool.close()
    pool.join()
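
    Pool.map hands each page number to a worker process, so the listing pages are scraped in parallel. An equivalent form that uses the pool as a context manager is sketched below; the worker count of 4 is an arbitrary choice, not the author's:

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:  # 4 workers, chosen arbitrarily
        pool.map(main, range(START_PAGE, END_PAGE + 1))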


The complete code is given below:

# While thick ice still covered the North Sea, I saw plum blossoms in full bloom
# 李志文 Ruohary
# Created: 2022/6/9 19:00

import requests
import logging
from urllib.parse import urljoin
from lxml import etree
import multiprocessing

BASIC_URL = 'https://www.4kbizhi.com'
START_PAGE = 1
END_PAGE = 10
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
}

# Configure logging so the info/error messages below are actually shown;
# without basicConfig the root logger only emits WARNING and above
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

def assemble_url(page):
    if page == 1:
        url = urljoin(BASIC_URL, '/meinv/index.html')
    else:
        url = urljoin(BASIC_URL, f'/meinv/index_{page}.html')
    return url

def scrape_pages(url):
    logging.info(f'Scraping {url}...')
    try:
        response = requests.get(url, headers=header)
        if response.status_code == 200:
            response.encoding = 'gbk'  # the site serves GBK-encoded pages
            return response.text
        logging.error(f'Unexpected status code {response.status_code}, failed to scrape {url}!')
    except requests.RequestException:
        logging.error(f'Request exception, failed to scrape {url}!')

def get_img_link(text):
    html = etree.HTML(text)
    # Extract the image addresses, prepending the site root
    # because the src attributes are relative paths
    img_links = ['https://www.4kbizhi.com' + url for url in html.xpath('//li/a/img/@src')]
    # Extract the image names
    img_names = html.xpath('//li/a/p/text()')
    links_names = zip(img_links, img_names)
    return links_names

def download_img(link_name):
    try:
        r = requests.get(link_name[0], headers=header)
        name = link_name[1] + '.jpg'
        path = 'D:/桌面文件/4k壁纸/' + name  # the author's local save directory
        with open(path, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Error while downloading image {link_name[1]}:', e)

def main(page):
    url = assemble_url(page)
    text = scrape_pages(url)
    if text is None:  # the page failed to download, so skip it
        return
    links_names = get_img_link(text)
    # links_names is a zip iterator whose elements are (link, name) tuples
    for link_name in links_names:
        download_img(link_name)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pages = range(START_PAGE, END_PAGE+1)
    pool.map(main, pages)
    pool.close()
    pool.join()


Source: https://www.cnblogs.com/zhuohua/p/16361436.html