python - Scrapy: best way to select urls based on mysql

Question

I made a Scrapy crawler that collects data from forum threads. On the list page, I can see each thread's last-modified date. Based on that date, I want to decide whether to crawl the thread again. I store the data in MySQL using an item pipeline. While processing the list page with my CrawlSpider, I want to look up the thread's record in MySQL and, based on that record, either yield a Request or not. (I do NOT want to load the URL unless there is a new post.)

What's the best way to do this?

score 0 · Accepted Answer

Use a CrawlSpider Rule:

Rule(SgmlLinkExtractor(), follow=True, process_request='check_moddate'),

Then, in your spider:

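def check_moddate(self, request):
    def dateisnew():
        # Placeholder: compare the thread's date with the record stored in MySQL.
        return True

    # Returning None drops the request, so unchanged threads are never fetched.
    if dateisnew():
        return request
    return None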
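
The date check itself could look roughly like the sketch below. It is only an illustration under stated assumptions: the table name `threads`, its columns `url` and `last_modified`, and the helper names `should_crawl` and `demo_connection` are hypothetical and do not come from the question. The standard-library `sqlite3` module stands in for MySQL so the example runs without a server; with a MySQL driver the query would go through a cursor, and the parameter placeholder is usually `%s` instead of `?`.

```python
import sqlite3
from datetime import datetime


def should_crawl(conn, url, listed_modified):
    """Return True if the thread is unknown or has a post newer than the stored record.

    conn is an sqlite3 connection here (a stand-in for MySQL);
    listed_modified is the datetime scraped from the list page.
    """
    row = conn.execute(
        "SELECT last_modified FROM threads WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        # No record yet, so the thread has never been crawled.
        return True
    stored = datetime.fromisoformat(row[0])
    return listed_modified > stored


def demo_connection():
    """Build an in-memory database with one previously crawled thread (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE threads (url TEXT PRIMARY KEY, last_modified TEXT)"
    )
    conn.execute(
        "INSERT INTO threads VALUES (?, ?)",
        ("http://example.com/thread/1", "2014-01-10T12:00:00"),
    )
    return conn
```

The spider would then return or yield the Request only when `should_crawl(...)` is true. How the list-page date reaches the check (for example, by parsing the list page in your own code) depends on the forum's markup and is left open here.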
