How to populate a scrapy.Field as a dictionary
I am building a scraper for www.apkmirror.com using Scrapy (with the SitemapSpider spider). So far, the following works:
DEBUG = True

from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem

class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        item = ApkmirrorScraperItem()
        item['url'] = response.url
        item['title'] = response.xpath('//h1[@title]/text()').extract_first()
        item['developer'] = response.xpath('//h3[@title]/a/text()').extract_first()
        return item
where ApkmirrorScraperItem is defined in items.py as follows:
import scrapy

class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()
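If I run it from the project directory with the command
scrapy crawl apkmirror-spider -o data.json
the resulting JSON output is an array of JSON dictionaries with the keys url, title, and developer and the corresponding strings as values. However, I would like to modify this so that the value of developer is itself a dictionary with a name field, so that I could populate it like this: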
item['developer']['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
However, if I try this I get KeyErrors, and I get them as well if I initialize the developer Field (following https://doc.scrapy.org/en/latest/topics/items.html#item-fields) as developer = scrapy.Field(name=None). How can I go about this?
Solution:
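Scrapy implements fields internally as dicts, but that does not mean they should be accessed as dicts. When you call item['developer'], what you are really doing is getting the value of the field, not the field itself. So if the value has not been set yet, this throws a KeyError.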
With that in mind, there are two ways to solve your problem.
First, simply set the developer field value to a dict:
def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    item['developer'] = {'name': response.xpath('//h3[@title]/a/text()').extract_first()}
    return item
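As a rough illustration of what one record in data.json might look like with this approach, the sketch below serializes a plain dict of the same shape using only the standard library. The URL, title, and developer name are hypothetical placeholders, and the exact formatting Scrapy's exporter produces may differ.

```python
import json

# Hypothetical values standing in for what the XPath expressions might return.
record = {
    "url": "http://www.apkmirror.com/example-android-apk-download/",
    "title": "Example App 1.0",
    "developer": {"name": "Example Developer"},
}

# The feed export is an array of such records.
serialized = json.dumps([record], indent=2)
print(serialized)

# Reading the file back gives nested access to the developer name.
loaded = json.loads(serialized)
developer_name = loaded[0]["developer"]["name"]
```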
Second, create a new Developer class and set the developer value to an instance of that class:
# this can go to items.py
class Developer(scrapy.Item):
    name = scrapy.Field()

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    dev = Developer()
    dev['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
    item['developer'] = dev
    return item
Hope that helps :)
Tags: scrapy-spider, scrapy, python Source: https://codeday.me/bug/20191111/2020199.html