
Scrapy: Passing Parameters Between Requests, Image Scraping, and Middleware


  Use case: the data you need to parse is spread across more than one page (deep crawling).

  For example: suppose we first scrape the index page and then need to parse each job's detail page. How do we wire that up?

 1     # Parse the job titles on the index page
 2     def parse(self, response):
 3         li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li')
 4         for li in li_list:
 5             # Instantiate an item object
 6             item = BossproItem()
 7 
 8             detail_page_url = 'https://www.zhipin.com' + li.xpath('./div/div[1]/div[1]/div/@href').extract_first()
 9             job_name = li.xpath('.//span[@class="job-name"]//text()').extract()
10 
11             # Pagination
12             if self.page_num <= 5:
13                 new_url = self.url % self.page_num  # next page's URL (the full spider requests it as well)
14                 self.page_num += 1
15             # Request the detail page to get its HTML source
16             yield scrapy.Request(detail_page_url, callback=self.parse_detail)
17 
18             item['job_name'] = job_name

  In an earlier version of this code, the callback argument on line 16 was parse(). But at this point the detail data lives on a second page: if that response were handed back to parse(), it would still be parsed as if it were index-page data. So the callback here must be parse_detail(), the function that parses the detail page.

  Furthermore, the fields already parsed in parse() (such as job_name) are stored in item, and parse_detail() has to fill in the remaining fields on that same item, so parse_detail() needs access to it. We therefore hand item to the request through the meta parameter; line 16 becomes:

yield scrapy.Request(detail_page_url, callback=self.parse_detail, meta={'item': item})

  Finally, add a parse_detail() that retrieves the item from response.meta and then does its own parsing:

    # The callback retrieves the item passed through meta
    # and parses the job description on the detail page
    def parse_detail(self, response):
        item = response.meta['item']

        job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]//text()').extract()
        job_desc = ''.join(job_desc)

        item['job_desc'] = job_desc

        yield item
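
  As a side note: since Scrapy 1.7 the documented way to hand values to a callback is the cb_kwargs argument rather than meta (meta still works, but Scrapy also uses it internally). A minimal sketch, reusing the names from the example above:

            # inside parse(): attach the item as a keyword argument for the callback
            yield scrapy.Request(detail_page_url, callback=self.parse_detail,
                                 cb_kwargs={'item': item})

    # the callback then receives it as an ordinary parameter
    def parse_detail(self, response, item):
        job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]//text()').extract()
        item['job_desc'] = ''.join(job_desc)
        yield item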

  1. Scraping images in Scrapy requires a dedicated pipeline class, ImagesPipeline, because images are handled differently from strings:

  (1) Strings: just parse them out with XPath and submit them to a pipeline for persistent storage.

  (2) Images: XPath only yields the src attribute of the image; a separate request must then be sent to that address to obtain the image's binary data.

  2. ImagesPipeline:

    You only need to parse out the img src attribute and submit it to the pipeline; the pipeline sends a request to that src, fetches the binary image data, and also takes care of persisting it.

  3. Usage workflow

    (1) Parse the data (the image address alone is enough).

    (2) Submit the item holding the image address to the designated pipeline class.

    (3) In the pipelines file, define a custom pipeline class derived from ImagesPipeline:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImgsPipeline(ImagesPipeline):
    # Send a request for each image address stored on the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Decide the storage path; here we just keep the file name from the URL
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = request.url.split('/')[-1]
        return imgName

    # Pass the item on to the next pipeline class, if any
    def item_completed(self, results, item, info):
        return item

  (4) In the settings file, specify the directory for storing the images with IMAGES_STORE = './imgsName', and register the custom pipeline class in ITEM_PIPELINES.
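
  A minimal settings.py sketch; the project name imgsPro is a placeholder for whatever your own project is called:

# settings.py
IMAGES_STORE = './imgsName'   # directory where ImagesPipeline writes the files

ITEM_PIPELINES = {
    # placeholder dotted path; point it at your own pipelines module
    'imgsPro.pipelines.ImgsPipeline': 300,
}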

  Downloader middleware: sits between the engine and the downloader, and is implemented in middlewares.py.

    Purpose: intercept every request and response flowing through the project, all in one place.

    Intercepting requests: UA spoofing (process_request) and proxy IPs (process_exception, returning the request). For example:

import random


class MiddleproDownloaderMiddleware:
    # Pools of User-Agents and proxy IPs, rotated at random
    user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    def process_request(self, request, spider):
        # UA spoofing: attach a random User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

        # Setting a proxy here only serves to verify that the proxy IPs work;
        # the actual fallback switching happens in process_exception below
        request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        return None

 

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain

        # Pick a proxy that matches the scheme of the failed request
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        # Return the corrected request so that it is rescheduled and re-sent
        return request
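
  For the middleware to take effect it has to be enabled in the settings file. A minimal sketch, where the project name middlePro is a placeholder:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # placeholder dotted path; point it at your own middlewares module
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}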

 

    Intercepting responses: tampering with the response data, or replacing the response object outright (process_response). The classic case is a page whose content is rendered dynamically, so the body Scrapy downloaded is incomplete.
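
  A minimal sketch of that idea, assuming the spider exposes a Selenium webdriver as spider.driver and lists its dynamically rendered URLs in spider.dynamic_urls (both attributes are invented for this example):

from scrapy.http import HtmlResponse  # at the top of middlewares.py


# added to the same downloader middleware class as above
def process_response(self, request, response, spider):
    # Swap in browser-rendered HTML for dynamically loaded pages;
    # every other response passes through untouched
    if request.url in getattr(spider, 'dynamic_urls', []):  # assumed attribute
        spider.driver.get(request.url)                      # assumed Selenium driver
        return HtmlResponse(url=request.url,
                            body=spider.driver.page_source,
                            encoding='utf-8',
                            request=request)
    return response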

 

Source: https://www.cnblogs.com/TzySec/p/15851276.html