Integrating the Scrapy framework over HTTP
If you simply call a Scrapy spider from inside Flask, you are likely to hit errors such as:
ValueError: signal only works in main thread
# or
twisted.internet.error.ReactorNotRestartable
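For reference, here is a minimal sketch of the naive pattern that produces these errors (DmozSpider from the dirbot example project is used purely for illustration): starting a CrawlerProcess inside a request handler fails because Scrapy tries to install signal handlers outside the main thread, and Twisted's reactor cannot be started a second time for the next request.

# naive_server.py -- illustrative only; this is the pattern that breaks
from flask import Flask
from scrapy.crawler import CrawlerProcess

from dirbot.spiders.dmoz import DmozSpider

app = Flask(__name__)


@app.route('/')
def crawl():
    process = CrawlerProcess()
    process.crawl(DmozSpider)
    # raises "signal only works in main thread" in a Flask worker thread,
    # and ReactorNotRestartable on any request after the first
    process.start()
    return "done"


if __name__ == '__main__':
    app.run(debug=True)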
There are several ways to solve this.
1 Use a Python subprocess (subprocess)
First, make sure the directory structure looks something like this:
> tree -L 1
├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py
Then start the spider in a new process:
# server.py
import subprocess

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file.

    Simply issue command:
    > scrapy crawl dmoz -o "output.json"
    wait for this command to finish, and read output.json to client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()


if __name__ == '__main__':
    app.run(debug=True)
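Each request shells out to scrapy crawl, waits for it to finish, and returns the contents of output.json. Assuming Flask's default development port (5000), you can try it like this:

> python server.py
> curl http://localhost:5000/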
2 Use Twisted-Klein + Scrapy
The code is as follows:
# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done, call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)
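Because Klein runs on the same Twisted reactor that Scrapy uses, everything stays in one process and the scraped items are returned directly in the HTTP response once the crawl's Deferred fires. Using the host and port from the snippet above:

> python server.py
> curl http://localhost:8080/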
3 Use ScrapyRT
Install ScrapyRT, then start it from your Scrapy project directory:
> scrapyrt
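ScrapyRT then serves an HTTP API for the spiders in the project; by default it listens on port 9080, and a crawl is requested by passing the spider name and start URL as query parameters. The spider name and URL below are placeholders for your own:

> pip install scrapyrt
> curl "http://localhost:9080/crawl.json?spider_name=dmoz&url=http://www.dmoz.org/"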
Source: https://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy