
Scraping 彼岸网 (netbian.com) with Python

2021-01-20 20:59:19 Author: 互联网 (Internet)

First, a look at the result.
[Image: the scraped pictures]
Start by importing the packages:

import requests
from bs4 import BeautifulSoup
import time

The site we are scraping is http://www.netbian.com/dongman/ and only the anime (动漫) section is covered.
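Let's analyze the page first.

All the data sits inside a single ul tag. Each li tag under it contains an a tag, and the href attribute of that a tag is the link we want: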

    response = requests.get(url=url, headers=headers).content.decode('gbk')
    soup = BeautifulSoup(response, 'lxml')
    for list_data in soup.find('div', class_='list').find_all('li'):
        href = list_data.find('a').get('href')
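
The `.decode('gbk')` call in the excerpt above implies that the site serves GBK-encoded bytes rather than UTF-8. A minimal offline sketch of what that step does, using a made-up sample string in place of a real response body (the sample text and the stray `0xff` byte are assumptions for illustration only):

```python
# Sample text standing in for a page body; it is not taken from the real site.
sample = "动漫壁纸"
raw = sample.encode("gbk")  # bytes as a GBK-encoded server would send them

# Decoding with the matching codec restores the text exactly.
round_trip = raw.decode("gbk")

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising,
# which can keep a long crawl running if one page contains a stray byte.
damaged = raw + b"\xff"
text = damaged.decode("gbk", errors="replace")
```

Whether to use `errors="replace"` in the real scraper is a judgment call: it hides encoding problems, but a single bad byte no longer aborts the whole run.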


We now have the links. It is a good idea to send headers with every request, just in case.
Among the links we collected, one turns out to be an ad,
so we skip it here:

        if href == 'http://pic.netbian.com/':
            continue
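
The exact-match check above skips the one ad link seen on the page. A slightly more general sketch follows, assuming that wallpaper links are always site-relative paths while ads are absolute URLs to other hosts. That assumption comes only from the two examples in this post and is not verified against the whole site. The helper name `is_internal_link` and the sample paths are made up for illustration:

```python
from urllib.parse import urlparse

SITE_HOST = "www.netbian.com"


def is_internal_link(href):
    """Return True for relative paths or URLs on SITE_HOST, False otherwise."""
    if not href:
        # An <a> tag without an href gives None; treat it as not usable.
        return False
    host = urlparse(href).netloc
    # Relative paths have an empty host; absolute URLs must match our site.
    return host == "" or host == SITE_HOST
```

Inside the loop this would be used as `if not is_internal_link(href): continue`.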

The href here is not a complete URL, so we open one of the images to see the full address and complete the URL accordingly:

f'http://www.netbian.com{href}'
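
String concatenation works as long as `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a relative path against a base URL and leaves an absolute URL untouched. The helper name `complete_url` and the sample path `/desk/12345.htm` are hypothetical:

```python
from urllib.parse import urljoin

# The listing page the href values were read from.
BASE_URL = "http://www.netbian.com/dongman/"


def complete_url(href):
    """Resolve href against the listing page URL."""
    return urljoin(BASE_URL, href)


relative_result = complete_url("/desk/12345.htm")
absolute_result = complete_url("http://pic.netbian.com/")
```

Here `relative_result` is `http://www.netbian.com/desk/12345.htm`, while `absolute_result` stays `http://pic.netbian.com/`.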

This is the completed URL.
Requesting the completed URL returns a page that is still not what we want, but it does contain a link, so we extract that:

        response_2 = requests.get(url=f'http://www.netbian.com{href}', headers=headers).content.decode('gbk')
        soup_2 = BeautifulSoup(response_2, 'lxml')
        div_pic_href = soup_2.find('div', class_="pic-down").find('a').get('href')

This is the link we just extracted.
We complete this URL in the same way.

Requesting it shows the image in high resolution, but the page itself does not let us download it.


        response_3 = requests.get(url=f'http://www.netbian.com{div_pic_href}', headers=headers).content.decode('gbk')
        soup_3 = BeautifulSoup(response_3, 'lxml')
        photo_url = soup_3.find('table', id="endimg").find('img').get('src')
        photo_title = soup_3.find('table', id="endimg").find('img').get('title')

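While extracting the link, we also extract the image's title.
At this point the image can be saved.
The a tag and the img tag carry the same attributes, so here I read them from the img tag.
Next we request the image link: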

response_photo = requests.get(url=photo_url, headers=headers).content

Once the request succeeds, we save the image:

with open("image/" + photo_title + '.jpg', "wb") as f:
    f.write(response_photo)

The file name uses the title we extracted.
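
A title taken from the page may contain characters that are not allowed in file names (for example `/` or `:`), and the image may not actually be a `.jpg`. A minimal sketch of a safer path builder follows, assuming the extension can be read from the image URL. The helper name `build_image_path` and the sample URLs in the comment are hypothetical:

```python
import os
import re
from urllib.parse import urlparse


def build_image_path(title, photo_url, folder="image"):
    """Build a file path from a page title and the image URL."""
    # Replace characters that common file systems reject; a missing title
    # (None or empty) falls back to a placeholder name.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title or "").strip() or "untitled"
    # Take the extension from the URL path; fall back to .jpg as the post does.
    ext = os.path.splitext(urlparse(photo_url).path)[1] or ".jpg"
    return os.path.join(folder, safe_title + ext)


# Example with made-up values:
# build_image_path("a/b:c", "http://example.com/x/y.png") gives image/a_b_c.png
```

The target folder still has to exist before `open()` is called. Calling `os.makedirs(folder, exist_ok=True)` once before the loop avoids a `FileNotFoundError`.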

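To avoid putting extra load on the server, add time.sleep(1) on the line before each request so the script pauses for one second. The images span more than one page, so wrap everything in a loop to crawl page after page.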
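
The paging and pausing can be sketched as two small generators. This assumes, as the full listing does, that page N (N >= 2) lives at `index_N.htm` under the section URL, and additionally that the first page is the section URL itself. `page_urls` and `polite` are made-up helper names:

```python
import time

SECTION_URL = "http://www.netbian.com/dongman/"


def page_urls(last_page):
    """Yield listing-page URLs from page 1 up to and including last_page."""
    if last_page >= 1:
        yield SECTION_URL  # page 1 has no index suffix
    for i in range(2, last_page + 1):
        yield f"{SECTION_URL}index_{i}.htm"


def polite(iterable, delay=1.0):
    """Yield items from iterable, sleeping `delay` seconds before each one."""
    for item in iterable:
        time.sleep(delay)
        yield item
```

A crawl loop could then read `for url in polite(page_urls(130)):`, so the one-second pause sits in one place instead of being repeated before every request.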

The complete code:

import os
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'Cookie': '__cfduid=d41108c9dadd3b3710630af78b80b39011611109443; Hm_lvt_14b14198b6e26157b7eba06b390ab763=1611109677; xygkqecookieztrecord=%2C6%2C; xygkqecookieclassrecord=%2C19%2C4%2C; xygkqecookieinforecord=%2C4-22305%2C19-23157%2C19-23151%2C; Hm_lpvt_14b14198b6e26157b7eba06b390ab763=1611137003',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}

# Create the output folder up front so open() below cannot fail on a missing directory.
os.makedirs('image', exist_ok=True)

# Listing pages 2..130; the first page (http://www.netbian.com/dongman/) is not covered by this loop.
for i in range(2, 131):
    url = f'http://www.netbian.com/dongman/index_{i}.htm'
    response = requests.get(url=url, headers=headers).content.decode('gbk')
    soup = BeautifulSoup(response, 'lxml')
    for list_data in soup.find('div', class_='list').find_all('li'):
        href = list_data.find('a').get('href')
        # Skip the ad entry that points to another site.
        if href == 'http://pic.netbian.com/':
            continue
        # Detail page: holds the link to the full-size preview page.
        time.sleep(1)
        response_2 = requests.get(url=f'http://www.netbian.com{href}', headers=headers).content.decode('gbk')
        soup_2 = BeautifulSoup(response_2, 'lxml')
        div_pic_href = soup_2.find('div', class_="pic-down").find('a').get('href')
        time.sleep(1)
        print(f'http://www.netbian.com{div_pic_href}')
        # Preview page: the img inside table#endimg carries the image URL and title.
        response_3 = requests.get(url=f'http://www.netbian.com{div_pic_href}', headers=headers).content.decode('gbk')
        soup_3 = BeautifulSoup(response_3, 'lxml')
        img = soup_3.find('table', id="endimg").find('img')
        photo_url = img.get('src')
        photo_title = img.get('title')
        print(photo_url)
        print(photo_title)
        # Download the image bytes and save them under the extracted title.
        time.sleep(1)
        response_photo = requests.get(url=photo_url, headers=headers).content
        with open("image/" + photo_title + '.jpg', "wb") as f:
            f.write(response_photo)

Please do not repost!

Source: https://blog.csdn.net/huoshicang/article/details/112909196