python – Running multiple spiders sequentially
Author: Internet
class Myspider1(scrapy.Spider):
    # do something....
    pass

class Myspider2(scrapy.Spider):
    # do something...
    pass
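The above is the structure of my spider.py file. I am trying to run Myspider1 first and then, depending on some conditions, run Myspider2 multiple times. How can I do that? Any tips?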
configure_logging()
runner = CrawlerRunner()

def crawl():
    yield runner.crawl(Myspider1,arg.....)
    yield runner.crawl(Myspider2,arg.....)

crawl()
reactor.run()
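I tried it this way, but I don't know how to run it. Should I run some command in cmd (which one?), or just run the Python file?
Thanks a lot!!!
Solution:
Just run the Python file.
For example:
test.py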
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        print("first spider")


class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print("second spider")


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # Each yield waits for that crawl to finish before the next one starts
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished
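The @defer.inlineCallbacks decorator on crawl() is what the attempt in the question was missing. As a rough illustration of why it matters, the sketch below uses only the standard library. FakeRunner is a hypothetical stand-in for CrawlerRunner that merely records calls; it is not part of Scrapy. The sketch shows that calling an undecorated generator function executes none of its body until something drives the generator, which is the role inlineCallbacks plays in the script above.

```python
calls = []


class FakeRunner:
    """Hypothetical stand-in for CrawlerRunner: records calls, crawls nothing."""

    def crawl(self, spider_name):
        calls.append(spider_name)


runner = FakeRunner()


def crawl():
    # No decorator, as in the question's attempt
    yield runner.crawl("MySpider1")
    yield runner.crawl("MySpider2")


gen = crawl()                    # a plain call only creates a generator object
scheduled_before = list(calls)   # snapshot: nothing has been scheduled yet
print(scheduled_before)          # []

for _ in gen:                    # something has to advance the generator;
    pass                         # inlineCallbacks does so as each Deferred fires
print(calls)                     # ['MySpider1', 'MySpider2']
```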
Now run test.py with: python test.py > output.txt
In output.txt you can see that your spiders ran one after the other.
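The question also asks about running the second spider several times depending on a condition. One possible way to sketch that, under some assumptions: crawl_plan, main, pages and the page keyword are hypothetical names chosen for illustration, and the list passed as pages stands for whatever your own condition produces. Keyword arguments given to runner.crawl() are passed on to the spider's constructor. The Scrapy and Twisted imports sit inside main() so that the planning generator can be exercised without them; reactor setup may need adjusting for your Scrapy version.

```python
def crawl_plan(runner, first_spider, second_spider, pages):
    """Crawl first_spider once, then second_spider once per entry in pages.

    Written as a plain generator; each yielded value is whatever
    runner.crawl() returns (a Deferred with a real CrawlerRunner).
    """
    yield runner.crawl(first_spider)
    for page in pages:
        # `page` is a hypothetical spider argument
        yield runner.crawl(second_spider, page=page)


def main(first_spider, second_spider, pages):
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()
    # Wrapping the generator makes each yield wait for its crawl to finish
    sequential = defer.inlineCallbacks(crawl_plan)
    d = sequential(runner, first_spider, second_spider, pages)
    d.addBoth(lambda _: reactor.stop())  # stop on success or failure
    reactor.run()  # blocks until the last crawl has finished
```

To try it, the crawl() and reactor.run() lines at the bottom of test.py could be replaced with a call such as main(MySpider1, MySpider2, pages=[1, 2, 3]), after which the script is still started with python test.py.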
Tags: scrapy-spider, python, scrapy, web-crawler Source: https://codeday.me/bug/20191006/1858674.html