I am now scraping this website on a daily basis, and I am using DeltaFetch to skip pages that have already been visited (there are a lot of them).
The issue is that for this website I need to scrape page A first, and then scrape page B to retrieve additional information about the item. DeltaFetch works well for ignoring requests to page B, but every run still issues requests to page A regardless of whether those pages have been visited before, presumably because parse_A only yields further requests and never an item, so DeltaFetch never records page A as visited.
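For reference, DeltaFetch is enabled in settings.py along the lines of the standard scrapy-deltafetch setup (the middleware order shown is the value from the plugin's README, so treat the exact numbers as illustrative):

# settings.py -- standard scrapy-deltafetch configuration
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True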
This is how my code is structured right now:
import scrapy
from scrapy.loader import ItemLoader
# ItemClass is my item definition (import omitted here)

# Gathering links from a page, creating an item, and passing it to parse_A
def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        item = ItemLoader(item=ItemClass(), response=response)
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_A,
                             meta={'item': item.load_item()})

# Parsing elements in page A, and passing the item to parse_B
def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age', u'//td[contains(@class,"age")]/text()')
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    yield scrapy.Request(response.urljoin(page_B),
                         callback=self.parse_B,
                         meta={'item': item.load_item()})

# Parsing elements in page B, and yielding the item
def parse_B(self, response):
    item = ItemLoader(item=response.meta['item'])
    item.add_value('url_B', response.url)
    yield item.load_item()
Any help on getting DeltaFetch to also skip the request to page A when that page has already been visited would be appreciated.
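In case it helps frame an answer: scrapy-deltafetch supports a deltafetch_key entry in request meta, which overrides the request fingerprint it stores and checks. My untested guess is that tagging the page A request with an explicit key and propagating that same key to the page B request would make DeltaFetch record page A as done once the item is finally yielded. A minimal sketch of what I mean (the key handling is my assumption, not something I have verified):

def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        url = response.urljoin(href)
        item = ItemLoader(item=ItemClass(), response=response)
        # Assumption: tag the page A request with an explicit DeltaFetch key
        yield scrapy.Request(url,
                             callback=self.parse_A,
                             meta={'item': item.load_item(),
                                   'deltafetch_key': url})

def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age', u'//td[contains(@class,"age")]/text()')
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    # Propagate the same key: when parse_B yields the item, DeltaFetch
    # stores the key of the request that produced it, which should then
    # match (and filter out) future requests to page A
    yield scrapy.Request(response.urljoin(page_B),
                         callback=self.parse_B,
                         meta={'item': item.load_item(),
                               'deltafetch_key': response.meta['deltafetch_key']})

Is this the right direction, or is there a cleaner way to do it?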