I made a Scrapy crawler that collects data from forum threads. The list page shows each thread's last-modified date, and I want to use that date to decide whether to crawl the thread again. I store the data in MySQL through an item pipeline. While my CrawlSpider processes the list page, I want to look up the thread's record in MySQL and, based on that record, either yield a Request or skip it. (I do NOT want to load the URL unless there is a new post.)

What's the best way to do this?

1 Answer

Use a CrawlSpider Rule with a process_request callback:

Rule(SgmlLinkExtractor(), follow=True, process_request='check_moddate'),

Then, in your spider:

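def check_moddate(self, request):
    def dateisnew():
        # check the date here, e.g. against the record stored in MySQL
        return True  # placeholder so the sketch runs; replace with the real check
    if dateisnew():
        return request
    # returning None drops the request, so the thread page is never loaded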
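
The date check itself is left open above. As a rough sketch of what it might look like, the helper below compares the date seen on the list page with the stored one. Everything in it is an assumption for illustration: the `thread_needs_crawl` name, the `threads` table with `url` and `last_modified` columns, and the use of an in-memory `sqlite3` database as a stand-in for MySQL so the example runs on its own. With a MySQL driver the query placeholder is typically `%s` rather than `?`.

```python
import sqlite3
from datetime import datetime


def thread_needs_crawl(conn, url, listed_modified):
    """Return True if the thread is unknown or the listed date is newer.

    Hypothetical helper: `conn` is any DB-API connection, and the
    `threads(url, last_modified)` table is an assumed schema.
    """
    row = conn.execute(
        "SELECT last_modified FROM threads WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return True  # never crawled before, so fetch it
    stored = datetime.fromisoformat(row[0])
    return listed_modified > stored  # crawl only if there is a newer post


# Stand-in database (sqlite3 instead of MySQL) with one known thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (url TEXT PRIMARY KEY, last_modified TEXT)")
conn.execute(
    "INSERT INTO threads VALUES (?, ?)",
    ("http://example.com/thread/1", "2013-03-01T10:00:00"),
)

# Newer date on the list page: crawl. Same date: skip.
print(thread_needs_crawl(conn, "http://example.com/thread/1", datetime(2013, 3, 14, 13, 0)))
print(thread_needs_crawl(conn, "http://example.com/thread/1", datetime(2013, 3, 1, 10, 0)))
```

Note that `process_request` only receives the Request, so the last-modified date from the list page still has to reach the check somehow; how to do that is not covered here.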
Answered 2013-03-14T13:15:16.457