Scrapy 2.3: Taking a screenshot of an item

Updated 2021-06-08 15:22

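This example demonstrates how to use coroutine syntax in the process_item() method.

This item pipeline sends a request to a locally running instance of Splash to render a screenshot of the item's URL. Once the response has been downloaded, the pipeline saves the screenshot to a file and adds the filename to the item.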

import hashlib
from urllib.parse import quote

import scrapy
from itemadapter import ItemAdapter

class ScreenshotPipeline:
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        encoded_item_url = quote(adapter["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await spider.crawler.engine.download(request, spider)

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = adapter["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = f"{url_hash}.png"
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        adapter["screenshot_filename"] = filename
        return item
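
The two pure steps of the pipeline above, building the Splash request URL and deriving the output filename, can be tried without a running crawler. The following is a minimal sketch under that assumption; the helper names build_screenshot_url and screenshot_filename and the example.com URL are hypothetical and only serve the illustration.

```python
import hashlib
from urllib.parse import quote

# Same endpoint template as in ScreenshotPipeline.SPLASH_URL.
SPLASH_URL = "http://localhost:8050/render.png?url={}"


def build_screenshot_url(item_url: str) -> str:
    """Return the Splash render.png URL for an item URL (hypothetical helper)."""
    # quote() leaves "/" unescaped by default but percent-encodes
    # characters such as ":", "?" and "=".
    return SPLASH_URL.format(quote(item_url))


def screenshot_filename(item_url: str) -> str:
    """Return the MD5 hex digest of the URL plus a .png suffix (hypothetical helper)."""
    url_hash = hashlib.md5(item_url.encode("utf8")).hexdigest()
    return f"{url_hash}.png"


if __name__ == "__main__":
    url = "https://example.com/page?id=1"  # assumed example URL
    print(build_screenshot_url(url))
    print(screenshot_filename(url))
```

Because the filename depends only on the URL, two items that share a URL would write to the same file. In a real project the pipeline class would also need to be enabled through the ITEM_PIPELINES setting.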