Scrapy 2.3: Running Scrapy from a script

Updated 2021-06-16 10:49

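You can use the Scrapy API to run Scrapy from a script, instead of the usual approach of running it via scrapy crawl.

Remember that Scrapy is built on top of the Twisted asynchronous networking library, so it has to run inside the Twisted reactor.

The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and sets up shutdown handlers. It is the class used by all Scrapy commands.

Here is an example showing how to run a single spider with it.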

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

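Settings are passed to CrawlerProcess as a dictionary. Be sure to check the CrawlerProcess documentation for details on its usage.

If you are inside a Scrapy project, some additional helpers let you import the project's components. You can have spiders loaded automatically by passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.

What follows is a working example of how to do that, using the testspiders project as an example.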

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

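There is another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it will not start or interfere with existing reactors in any way.

With this class, the reactor must be run explicitly after scheduling your spiders. CrawlerRunner is recommended over CrawlerProcess if your application already uses Twisted and you want to run Scrapy in the same reactor.

Note that you also have to shut down the Twisted reactor yourself after the spider has finished. This can be done by adding callbacks to the Deferred returned by the CrawlerRunner.crawl method.

Here is an example of its usage, along with a callback that manually stops the reactor after MySpider has finished running.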

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

See also

Reactor Overview
