
I am now scraping this website on a daily basis, and am using DeltaFetch to ignore pages which have already been visited (a lot of them).

The issue I am facing is that for this website, I need to first scrape page A, and then scrape page B to retrieve additional information about the item. DeltaFetch works well in ignoring requests to page B, but that also means that every time the scraping runs, it runs requests to page A regardless of whether it has visited it or not.

This is how my code is structured right now:

# Gathering links from a page, creating an item, and passing it to parse_A
def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        item = ItemLoader(item=ItemClass(), response=response)
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_A,
                             meta={'item': item.load_item()})

# Parsing elements in page A, and passing the item to parse_B
def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age', u'//td[contains(@class,"age")]/text()')
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    yield scrapy.Request(response.urljoin(page_B),
                         callback=self.parse_B,
                         meta={'item': item.load_item()})

# Parsing elements in page B, and yielding the item
def parse_B(self, response):
    item = ItemLoader(item=response.meta['item'])
    item.add_value('url_B', response.url)
    yield item.load_item()

Any help would be appreciated to ignore the first request to page A when this page has already been visited, using DeltaFetch.


1 Answer


DeltaFetch only records in its database the requests that produced items, which means that by default only those requests are skipped.

However, you can customize the key used for storing those records with the deltafetch_key meta key. If you set this key on the request created inside parse_A() to the same value as that of the request that called parse_A(), you should get the effect you want.

Something like this should work (untested):

from scrapy.utils.request import request_fingerprint

# (...)

    def parse_A(self, response):
        # (...)
        yield scrapy.Request(
            response.urljoin(page_B),
            callback=self.parse_B,
            meta={
                'item': item.load_item(),
                'deltafetch_key': request_fingerprint(response.request)
            }
        )

Note: the example above effectively replaces the filtering of requests to the parse_B() URLs with filtering of requests to the parse_A() URLs. You may want to use a different key, depending on your needs.
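If you would rather control the key explicitly instead of reusing the request fingerprint, one option (a sketch of my own, not part of the answer above; the deltafetch_key_for helper is hypothetical) is to derive a stable key from page A's URL and pass it along in meta to the page-B request. scrapy-deltafetch reads deltafetch_key from request.meta, so both requests end up sharing the same record:

```python
import hashlib

def deltafetch_key_for(url):
    # Hypothetical helper: build a stable, compact DeltaFetch key
    # from the page-A URL so page A and page B share one record.
    return hashlib.sha1(url.encode('utf-8')).hexdigest()

# In parse(), set the key on the request to page A:
#     yield scrapy.Request(url_A, callback=self.parse_A,
#                          meta={'item': item.load_item(),
#                                'deltafetch_key': deltafetch_key_for(url_A)})
#
# In parse_A(), forward the same key to the request for page B:
#     yield scrapy.Request(response.urljoin(page_B), callback=self.parse_B,
#                          meta={'item': item.load_item(),
#                                'deltafetch_key': response.meta['deltafetch_key']})
```

Because the key is derived only from page A's URL, a second crawl skips the page-A request as soon as an item has been produced for it.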

Answered 2018-03-01T23:23:54.393