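Not sure if this fits your setup, but you could fetch lastseen from MySQL when initializing your spider, and stop generating Requests in your callbacks once a response contains an item with postdate < lastseen. This basically moves the logic that stops the crawl into the spider itself instead of the pipeline.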
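A minimal sketch of the "fetch lastseen at startup" step. To keep it runnable anywhere, the standard-library sqlite3 module stands in for MySQL here; with a real MySQL driver the connection call would differ. The table name posts, the column post_date, and the helper fetch_lastseen are all hypothetical.

```python
import sqlite3


def fetch_lastseen(conn):
    """Return the newest stored post date, or None if the table is empty."""
    cur = conn.cursor()
    # MAX() over an empty table yields a single row containing NULL.
    cur.execute("SELECT MAX(post_date) FROM posts")
    row = cur.fetchone()
    return row[0] if row else None


# Demo with an in-memory database (stand-in for the real MySQL table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_date TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?)",
    [("20130710",), ("20130715",), ("20130701",)],
)
lastseen = fetch_lastseen(conn)
```

The spider could call a helper like this in its __init__ and keep the result in self.lastseen.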
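Sometimes it is simpler to pass the value to your spider as an argument

    scrapy crawl myspider -a lastseen=20130715

and set it as an attribute of your spider to test against in your callback (see http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments):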
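
    from scrapy.http import Request
    from scrapy.spider import BaseSpider  # named scrapy.Spider in newer Scrapy releases

    class MySpider(BaseSpider):
        name = 'myspider'

        def __init__(self, lastseen=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            # values passed with -a arrive as strings, e.g. '20130715'
            self.lastseen = lastseen
            # ...

        def parse_new_items(self, response):
            follow_next_page = True

            # item fetch logic
            for element in <some_selector>:
                # get post_date
                post_date = <extract post_date from element>

                # check post_date: skip old items and stop paginating
                if post_date < self.lastseen:
                    follow_next_page = False
                    continue

                item = MyItem()
                # populate item...
                yield item

            # find next page to crawl, only if no old item was seen on this page
            if follow_next_page:
                next_page_url = ...
                yield Request(url=next_page_url, callback=self.parse_new_items)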
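
To check the stop condition without running a crawl, here is a small sketch of the same control flow in plain Python, with no Scrapy involved. PAGES, fetch_page and crawl_new_items are made-up names for illustration, and dates are assumed to be YYYYMMDD strings, which compare in chronological order as plain strings.

```python
# Fake "site": three pages of posts, newest first (hypothetical data).
PAGES = [
    [{"id": 1, "post_date": "20130720"}, {"id": 2, "post_date": "20130718"}],
    [{"id": 3, "post_date": "20130716"}, {"id": 4, "post_date": "20130710"}],
    [{"id": 5, "post_date": "20130705"}],
]

visited = []  # records which pages were actually requested


def fetch_page(n):
    """Stand-in for a network request; returns None past the last page."""
    visited.append(n)
    return PAGES[n] if n < len(PAGES) else None


def crawl_new_items(lastseen):
    """Yield items not older than lastseen; stop paginating once an old one appears."""
    n = 0
    while True:
        page = fetch_page(n)
        if page is None:
            return
        follow_next_page = True
        for element in page:
            if element["post_date"] < lastseen:
                follow_next_page = False
                continue  # skip the old item but keep scanning this page
            yield element
        if not follow_next_page:
            return
        n += 1


new_ids = [item["id"] for item in crawl_new_items("20130715")]
```

With lastseen set to 20130715, the old item on the second page ends the crawl, so the third page is never requested.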