Scrapy 2.3: Running multiple spiders in the same process

Updated 2021-06-16 10:50

By default, Scrapy runs a single spider per process when you run the scrapy crawl command. However, Scrapy also supports running multiple spiders per process through its internal API.

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

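The same example again, but running the spiders sequentially by chaining the deferreds, so that the second crawl starts only after the first one has finished: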

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
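
The difference between running the spiders simultaneously and chaining them can be seen in a small sketch that does not need Scrapy or Twisted. It is only an analogy: it uses the standard library's asyncio in place of the Twisted reactor, and fake_crawl, run_concurrently and run_sequentially are hypothetical names made up for this illustration, not part of Scrapy's API. asyncio.gather plays the role of scheduling every crawl and then waiting on runner.join(), while awaiting each call in turn plays the role of yielding each runner.crawl() inside the inlineCallbacks function.

```python
import asyncio


async def fake_crawl(name, delay, log):
    """Hypothetical stand-in for one crawl: log start, wait, log finish."""
    log.append(f"{name} start")
    await asyncio.sleep(delay)  # simulated crawl time, not a real request
    log.append(f"{name} done")


async def run_concurrently():
    # Analogy for scheduling both crawls and then waiting on runner.join():
    # both jobs are started before either one has finished.
    log = []
    await asyncio.gather(
        fake_crawl("spider1", 0.2, log),
        fake_crawl("spider2", 0.05, log),
    )
    return log


async def run_sequentially():
    # Analogy for yielding each runner.crawl() in turn:
    # the second job starts only after the first one is done.
    log = []
    await fake_crawl("spider1", 0.2, log)
    await fake_crawl("spider2", 0.05, log)
    return log


concurrent_log = asyncio.run(run_concurrently())
sequential_log = asyncio.run(run_sequentially())
print("concurrent:", concurrent_log)
print("sequential:", sequential_log)
```

In the concurrent run both "start" entries are logged before any "done" entry, whereas in the sequential run spider1 is fully finished before spider2 begins. Chaining is the safer choice when the second spider depends on data produced by the first.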

See also

Running Scrapy from a script.
