Scrapy Framework: CrawlSpider
一、When to Use It
CrawlSpider crawls a whole site automatically by following links that match a set of rules. It is best suited to sites whose URLs follow a recognizable pattern, but by loosening the rules it can also be pointed at sites with a less regular structure.
二、Code Walkthrough
(1) Create the Scrapy project

```shell
E:\myweb>scrapy startproject mycwpjt
New Scrapy project 'mycwpjt', using template directory 'd:\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\Python35\myweb\part16\mycwpjt

You can start your first spider with:
    cd mycwpjt
    scrapy genspider example example.com
```
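For orientation, `startproject` generates the standard project skeleton shown below (exact contents vary slightly by Scrapy version; newer releases also add a middlewares.py):

```
mycwpjt/
    scrapy.cfg            # deployment configuration
    mycwpjt/              # the project's Python package
        __init__.py
        items.py          # item definitions, edited in step (3)
        pipelines.py      # item pipelines, edited in step (4)
        settings.py       # project settings, edited in step (5)
        spiders/          # spider modules, edited in step (6)
            __init__.py
```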
(2) Create the spider

```shell
E:\myweb>scrapy genspider -t crawl weisuen sohu.com
Created spider 'weisuen' using template 'crawl' in module:
    mycwpjt.spiders.weisuen
```
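The `-t crawl` option picks the crawl template. The available templates can be listed first with `scrapy genspider -l`; the output on a typical installation looks like this (the exact list may differ by version):

```shell
E:\myweb>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```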
(3) Write the item (items.py)

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MycwpjtItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    link = scrapy.Field()
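As a quick, hypothetical illustration of how such an item is used (the values below are made up): fields declared with `scrapy.Field()` are accessed with dict-style keys, and an item converts cleanly to a plain dict:

```python
from mycwpjt.items import MycwpjtItem

item = MycwpjtItem()
item["name"] = ["Example news title"]   # hypothetical value; XPath extract() returns a list
item["link"] = ["http://news.sohu.com/20160926/n469167364.shtml"]
print(dict(item))                       # {'name': [...], 'link': [...]}
```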
(4) Write the pipeline (pipelines.py)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MycwpjtPipeline(object):
    def process_item(self, item, spider):
        # Print the title and link extracted from each crawled news page
        print(item["name"])
        print(item["link"])
        return item
```
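The pipeline above only prints each item. As a sketch of something slightly more useful (not part of the original tutorial), a second pipeline could append every item to a JSON Lines file; `open_spider` and `close_spider` are the hooks Scrapy calls when the spider starts and stops:

```python
# -*- coding: utf-8 -*-
import codecs
import json


class JsonWriterPipeline(object):
    """Illustrative pipeline: write each crawled item as one JSON line."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = codecs.open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Like MycwpjtPipeline, it would only run if registered in ITEM_PIPELINES, which is what step (5) does.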
(5) Configure settings (settings.py)
```python
ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,
}
```
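The value 300 is the pipeline's order: Scrapy expects an integer in the 0-1000 range and runs enabled pipelines from the lowest value to the highest. If the JSON-writing sketch from step (4) were enabled too, the setting might look like this (the second entry is hypothetical):

```python
ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,     # runs first (lower number)
    'mycwpjt.pipelines.JsonWriterPipeline': 800,  # hypothetical, runs after the printer
}
```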
(6) Write the spider
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mycwpjt.items import MycwpjtItem

# List the available spider templates:    scrapy genspider -l
# Generate a CrawlSpider-based skeleton:  scrapy genspider -t crawl weisun sohu.com
# Start the crawl:                        scrapy crawl weisun --nolog


class WeisunSpider(CrawlSpider):
    name = 'weisun'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']

    rules = (
        # News page URLs look like:
        # "http://news.sohu.com/20160926/n469167364.shtml"
        # so the links to follow can be matched with the regex '.*?/n.*?shtml'
        Rule(LinkExtractor(allow=('.*?/n.*?shtml'), allow_domains=('sohu.com')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = MycwpjtItem()
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # Extract the news page title with an XPath expression
        i["name"] = response.xpath("/html/head/title/text()").extract()
        # Extract the canonical link of the current news page with an XPath expression
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i
```
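Before launching the full crawl, the two XPath expressions can be sanity-checked in the Scrapy shell against a single article page (URL taken from the comment above; the actual output depends on the live page):

```python
# scrapy shell "http://news.sohu.com/20160926/n469167364.shtml"
>>> response.xpath("/html/head/title/text()").extract()        # should yield the article title
>>> response.xpath("//link[@rel='canonical']/@href").extract() # should yield the canonical URL
```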
CrawlSpider is the spider commonly used for crawling sites that follow a certain pattern. It builds on Spider and adds a few attributes of its own:
- rules: a collection of Rule objects used to match the pages to crawl on the target site and to screen out unwanted links.
- parse_start_url: parses the responses of the start URLs and must return an Item, a Request, or an iterable containing them (a minimal override sketch follows this list).
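A minimal sketch of overriding parse_start_url (the spider name and field values here are illustrative, not part of the tutorial's project):

```python
from scrapy.spiders import CrawlSpider
from mycwpjt.items import MycwpjtItem


class StartUrlDemoSpider(CrawlSpider):
    """Hypothetical spider showing only the parse_start_url hook."""
    name = 'startdemo'                   # hypothetical spider name
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']

    def parse_start_url(self, response):
        # Called for each start_urls response; may return an Item,
        # a Request, or an iterable containing them.
        i = MycwpjtItem()
        i["name"] = response.xpath("/html/head/title/text()").extract()
        i["link"] = [response.url]
        return i
```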
Because rules is a collection of Rule objects, Rule itself deserves a quick look. Its parameters are: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
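An illustrative Rule using most of those parameters (the drop_print_pages helper and the 'source' keyword are hypothetical, not part of the tutorial's code): process_links receives the list of extracted links and must return the, possibly filtered, list, while cb_kwargs is forwarded to the callback as extra keyword arguments.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


def drop_print_pages(links):
    # Hypothetical filter: discard printer-friendly versions of articles.
    return [link for link in links if 'print' not in link.url]


rule = Rule(
    LinkExtractor(allow=('.*?/n.*?shtml',), allow_domains=('sohu.com',)),
    callback='parse_item',           # spider method that parses matched pages
    cb_kwargs={'source': 'sohu'},    # hypothetical extra kwargs passed to parse_item
    follow=True,                     # keep following links found on matched pages
    process_links=drop_print_pages,  # filter links before they are scheduled
)
```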
The link_extractor can be a custom implementation, but the existing LinkExtractor class is normally used. Its main parameters are listed below (a combined sketch follows the list):
- allow: only URLs matching this regular expression (or list of regular expressions) are extracted; if empty, all URLs match.
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted, even if they also match allow.
- allow_domains: only links pointing to these domains are extracted.
- deny_domains: links pointing to these domains are never extracted.
- restrict_xpaths: XPath expressions that, together with allow, restrict link extraction to the matching regions of the page.
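Combining those parameters, a stricter extractor might look like the sketch below (the deny pattern, deny_domains value, and XPath are illustrative guesses, not taken from the tutorial). In the Scrapy shell, extract_links(response) shows exactly which links such an extractor would pick up from a page:

```python
from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=('.*?/n.*?shtml',),                  # keep news-article URLs
    deny=('.*?/comment/.*',),                  # illustrative: skip comment pages
    allow_domains=('sohu.com',),
    deny_domains=('passport.sohu.com',),       # illustrative: skip the login subdomain
    restrict_xpaths=('//div[@class="news"]',)  # illustrative: only look inside this region
)

# In "scrapy shell <url>":
# links = link_extractor.extract_links(response)
```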
Source: https://blog.csdn.net/magicboom/article/details/89791680