warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
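To make the "0 means no limit" behaviour concrete, here is a minimal sketch. The helper `exceeds_depth_limit` is hypothetical and not part of Scrapy; it only illustrates the assumed rule that a request is dropped when a non-zero limit is set and the request's depth is greater than it.

```python
# settings.py equivalent: 0 is the default and means "no depth limit".
DEPTH_LIMIT = 0


def exceeds_depth_limit(depth, depth_limit=DEPTH_LIMIT):
    """Return True if a request at `depth` should be dropped.

    Hypothetical helper, not a Scrapy function. A limit of 0 is falsy,
    so nothing is ever dropped in that case.
    """
    return bool(depth_limit) and depth > depth_limit


print(exceeds_depth_limit(50))    # False: no limit configured
print(exceeds_depth_limit(3, 2))  # True: depth 3 is past a limit of 2
```

So with the default setting, a crawl that stops early is not being stopped by the depth limit.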
So let's scrape mininova and see what happens. Starting at the today
page, we see that there are two tor links:
stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]
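Let's scrape the first link. That page has no new tor links, only a link to itself, which is not recrawled by default because duplicate requests are filtered (scrapy.http.Request(url[, ... dont_filter=False, ...])):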
>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
No luck with the first link: we are still at depth 1. Let's try the other link:
>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]
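Nope, this page also contains only one link, a link to itself, which gets filtered as well. So there are actually no links left to scrape, and Scrapy closes the spider (at depth == 1).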
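The session above can be replayed as a toy breadth-first crawl. This is only a sketch under stated assumptions: `LINKS` is a hand-written copy of what the link extractor returned in the shell, `crawl` is an illustrative function rather than anything from Scrapy, and the `seen` set stands in for Scrapy's duplicate request filter.

```python
from collections import deque

# Link graph copied by hand from the shell session: each page maps to the
# /tor/ links that the extractor found on it.
LINKS = {
    "http://www.mininova.org/today": [
        "http://www.mininova.org/tor/13204738",
        "http://www.mininova.org/tor/13204737",
    ],
    "http://www.mininova.org/tor/13204738": [
        "http://www.mininova.org/tor/13204738",
    ],
    "http://www.mininova.org/tor/13204737": [
        "http://www.mininova.org/tor/13204737",
    ],
}


def crawl(start_url, links, depth_limit=0):
    """Breadth-first crawl; returns {url: depth} for every page visited."""
    seen = {start_url}        # stand-in for the duplicate filter
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in links.get(url, []):
            depth = depths[url] + 1
            if depth_limit and depth > depth_limit:
                continue      # only applies when a non-zero limit is set
            if link in seen:
                continue      # self-links end up here
            seen.add(link)
            depths[link] = depth
            queue.append(link)
    return depths


depths = crawl("http://www.mininova.org/today", LINKS)
print(max(depths.values()))  # deepest level reached: 1
```

Even with no depth limit, the crawl runs out of unseen links after the two tor pages, so it never gets past depth 1.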