python - Caching items in Scrapy

Question

I am scraping a website with the following structure:

Archive
    Article 1
        Authors
            Author 1
            Author 2
        Title
        Body
        Comments
            Comment 1
            Comment 2
    ...

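Each author has their own profile page. The problem is that authors write multiple articles, so as my spider crawls the site I end up scraping the same authors' profiles over and over again.

How would I go about caching author profiles with Scrapy?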

score 1 · Accepted Answer

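I think you need to implement a new cache policy. See the Scrapy documentation on HTTP cache policies.

Also have a look at HttpCacheMiddleware.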

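I'm still puzzled as to why it visits already-visited pages again. Their documentation says this about the default policy:

This policy has no awareness of any HTTP Cache-Control directives. Every request and its corresponding response are cached. When the same request is seen again, the response is returned without transferring anything from the Internet.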
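One possible explanation, offered as an assumption about the asker's setup rather than a diagnosis: Scrapy's HTTP cache is switched off unless HTTPCACHE_ENABLED is set, so the default policy quoted above never applies. A minimal settings.py sketch that turns it on might look like this (the values shown are illustrative, not taken from the question):

```python
# settings.py (sketch): enable Scrapy's built-in HTTP cache.
# The cache is disabled by default, so the policy only takes effect
# once this flag is set.
HTTPCACHE_ENABLED = True

# 0 means cached responses never expire; use a number of seconds
# if author profiles should be refreshed periodically.
HTTPCACHE_EXPIRATION_SECS = 0

# Directory for stored responses, relative to the project's data dir.
HTTPCACHE_DIR = "httpcache"
```

With the cache enabled, a repeated request for the same profile URL is answered from disk instead of going out to the network, across runs as well as within one.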

score 1 · Accepted Answer

You should add a duplicates filter, as in the following example:

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop any item whose author_id has already been seen in this crawl."""

    def __init__(self):
        self.author_ids_seen = set()

    def process_item(self, item, spider):
        if item['author_id'] in self.author_ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        # Remember the id so later items for the same author are dropped.
        self.author_ids_seen.add(item['author_id'])
        return item

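Then activate DuplicatesPipeline in the ITEM_PIPELINES setting:

# Recent Scrapy versions expect a dict mapping each pipeline to an order value (0-1000).
ITEM_PIPELINES = {
    'myproject.pipeline.DuplicatesPipeline': 300,
}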
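Keep in mind that an item pipeline runs only after the profile page has been downloaded and parsed, so the pipeline above removes duplicate items but not duplicate downloads. Scrapy's scheduler already filters repeated requests for the same URL by default (unless a request is created with dont_filter=True), but if you want explicit control you can track the profile URLs yourself. The following is a minimal sketch, assuming an author can be identified by profile URL; the SeenAuthors class, its first_visit method and the example.com URLs are hypothetical names made up for illustration.

```python
class SeenAuthors:
    """Tracks which author profile URLs have already been scheduled."""

    def __init__(self):
        self._seen = set()

    def first_visit(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        # Treat ".../author-1" and ".../author-1/" as the same profile.
        key = url.rstrip("/")
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenAuthors()

# Hypothetical author links collected from two different articles.
links = [
    "https://example.com/authors/author-1",
    "https://example.com/authors/author-2/",
    "https://example.com/authors/author-1/",
]

# Only URLs not seen before would be turned into requests.
to_fetch = [url for url in links if seen.first_visit(url)]
```

In a spider callback, the same check would guard the line that yields the request for the profile page, so each author is fetched at most once per crawl.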
