How to populate a scrapy.Field as a dictionary

Author: Internet

I am building a scraper for www.apkmirror.com using Scrapy (with the SitemapSpider spider). So far, the following works:

DEBUG = True

from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem


class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        item = ApkmirrorScraperItem()
        item['url'] = response.url
        item['title'] = response.xpath('//h1[@title]/text()').extract_first()
        item['developer'] = response.xpath('//h3[@title]/a/text()').extract_first()
        return item

where ApkmirrorScraperItem is defined in items.py as follows:

import scrapy


class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()

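If I run it from the project directory with the command

scrapy crawl apkmirror-spider -o data.json

the resulting JSON output is an array of JSON dictionaries with the keys url, title, and developer, each with the corresponding string as its value. However, I would like to change this so that the value of developer is itself a dictionary with a name field, so that I can populate it like this: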

item['developer']['name'] = response.xpath('//h3[@title]/a/text()').extract_first()

However, if I try this I get KeyErrors, and I also get KeyErrors if I initialize the developer Field as developer = scrapy.Field(name=None), following https://doc.scrapy.org/en/latest/topics/items.html#item-fields. What should I do?

Solution:

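Scrapy implements fields internally as dicts, but that does not mean they should be accessed as dicts. When you call item['developer'], what you are really doing is getting the value of the field, not the field itself. So if the value has not been set yet, a KeyError is raised. The keyword arguments passed to scrapy.Field(...) are only stored as field metadata, which is why scrapy.Field(name=None) does not create a value either.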
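To make the distinction between a field and its value concrete, here is a minimal sketch that runs without Scrapy installed. MiniItem and the local Field class are hypothetical stand-ins written for this illustration, not Scrapy's real classes; they only mimic the behaviour described above (declared fields kept separately from assigned values). The developer name is made up.

```python
from collections.abc import MutableMapping


class Field(dict):
    """Stand-in for scrapy.Field: a plain dict that only holds metadata."""


class MiniItem(MutableMapping):
    """Hypothetical, heavily simplified model of an item (not scrapy.Item)."""

    # Declared fields: name -> Field metadata, written out by hand here.
    fields = {"developer": Field(name=None)}

    def __init__(self):
        self._values = {}  # values actually assigned so far

    def __getitem__(self, key):
        # Looks up the stored value, never the Field metadata.
        return self._values[key]

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"unsupported field: {key}")
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


item = MiniItem()

try:
    item["developer"]["name"] = "Example Dev"
    raised = False
except KeyError:
    # No value is stored yet, so the lookup fails before ["name"] is applied.
    raised = True

# Field(name=None) only records metadata; it does not create a value.
metadata = item.fields["developer"]

# setdefault (inherited from MutableMapping) stores a dict first, after
# which the nested assignment works.
item.setdefault("developer", {})["name"] = "Example Dev"
```

In this model, the first assignment fails for the same reason as in the question, while the setdefault call succeeds because it stores an empty dict as the value before the nested key is written. Whether the same one-liner is appropriate in a real spider is an assumption worth checking against the Scrapy version in use.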

With that in mind, there are two ways to solve the problem.

First, simply set the value of the developer field to a dict:

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    item['developer'] = {'name': response.xpath('//h3[@title]/a/text()').extract_first()}
    return item
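As a rough illustration of what this first approach produces, the sketch below serializes one item-like dict with the standard json module. The URL, title, and developer name are hypothetical placeholders, not real scraped data, and json.dumps is used here only to show the shape of the output, not how Scrapy's feed export is implemented.

```python
import json

# Hypothetical values standing in for what the XPath queries might return.
item = {
    "url": "http://www.apkmirror.com/example-android-apk-download/",
    "title": "Example App 1.0",
    "developer": {"name": "Example Dev"},
}

# The exported file is an array of such objects; developer is now a
# nested object instead of a plain string.
output = json.dumps([item], indent=2)
print(output)
```

Each entry keeps url and title as strings, while developer becomes a nested object with a name key, which is the structure asked for in the question.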

Second, create a new Developer item class and set the developer value to an instance of that class:

# this can go to items.py
class Developer(scrapy.Item):
    name = scrapy.Field()

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()

    dev = Developer()
    dev['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
    item['developer'] = dev

    return item

Hope this helps :)

Source: https://codeday.me/bug/20191111/2020199.html