scrapy - Unable to crawl deeper than depth 1 with Scrapy

Question

I can't configure Scrapy to run with a depth > 1. I tried the following three options, but none of them worked; request_depth_max in the summary log is always 1:

1) Adding:

from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2

to the spider file (the example from the Scrapy site, just with a different site)

2) Running the command line with the -s option:

/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org

3) Adding this to settings.py and scrapy.cfg:

DEPTH_LIMIT=2

How should it be configured to go deeper than 1?

score 4 · Accepted Answer

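I had a similar problem. It helped to set follow=True when defining the Rule:

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.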
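The default described above can be sketched in a few lines. This is only a stand-alone model of the documented behaviour, not Scrapy's own code; `resolve_follow` is a hypothetical helper name. In the spider from the question, the fix would correspond to passing follow=True as an extra argument to the existing Rule.

```python
def resolve_follow(callback=None, follow=None):
    """Return the effective follow flag for a rule, per the documented default."""
    if follow is None:
        # No explicit value given: follow links only when the rule has no callback.
        return callback is None
    return bool(follow)


# Rule with a callback and no explicit follow: links are not followed.
print(resolve_follow(callback="parse_torrent"))
# Rule without a callback: links are followed by default.
print(resolve_follow(callback=None))
# Rule with a callback and follow=True: scrape the page and keep following links.
print(resolve_follow(callback="parse_torrent", follow=True))
```

So a rule that names a callback, like the one in the question, stops at the pages it scrapes unless follow=True is given explicitly.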

score 4 · Accepted Answer

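warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

So let's scrape mininova and see what happens. Starting at the today page, we see there are two tor links: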

stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]

Let's fetch the first link. We see that there are no new tor links on that page, just a link to itself, which is not re-crawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):

>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]

No luck there, we are still at depth 1. Let's try the other link:

>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]

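Nope, this page also contains only one link, a link to itself, which gets filtered as well. So there are actually no further links to crawl, and Scrapy closes the spider (at depth == 1).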
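The walk-through above can be reproduced with a toy crawler. This is a rough sketch and not Scrapy's scheduler: the `LINKS` graph just mirrors what the shell session showed, `crawl` is a hypothetical helper, and it assumes links are followed from every page and that depth means the depth of pages actually fetched.

```python
from collections import deque

# Link graph observed in the shell session: the listing links to two torrent
# pages, and each torrent page links only to itself.
LINKS = {
    "/today": ["/tor/13204738", "/tor/13204737"],
    "/tor/13204738": ["/tor/13204738"],
    "/tor/13204737": ["/tor/13204737"],
}


def crawl(start, links):
    """Breadth-first crawl that skips URLs it has already requested."""
    seen = {start}
    queue = deque([(start, 0)])
    depths = {}    # url -> depth at which it was fetched
    filtered = []  # duplicate requests that were dropped
    while queue:
        url, depth = queue.popleft()
        depths[url] = depth
        for target in links.get(url, []):
            if target in seen:
                filtered.append(target)
                continue
            seen.add(target)
            queue.append((target, depth + 1))
    return depths, filtered


depths, filtered = crawl("/today", LINKS)
print(max(depths.values()))  # deepest page actually fetched
print(filtered)              # the self-links that were dropped as duplicates
```

Because the only links on the torrent pages are duplicates, nothing is ever scheduled at depth 2, whatever the depth limit is.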

score 1 · Accepted Answer

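The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".

You wrote:

request_depth_max in the summary log is always 1

What you see in the log is a statistic, not a setting. When it says request_depth_max is 1, it means that no further requests were yielded from the first callback.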
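To make the difference between the setting and the statistic concrete, here is a small stand-alone sketch. It is an assumption-laden model rather than Scrapy's depth middleware: `within_depth_limit` and `observed_max_depth` are hypothetical names, and the lists of request depths are made up for illustration.

```python
def within_depth_limit(depth, depth_limit=0):
    """Setting: 0 means no limit; otherwise requests deeper than the limit are dropped."""
    return depth_limit == 0 or depth <= depth_limit


def observed_max_depth(request_depths, depth_limit=0):
    """Statistic: the deepest request that was actually produced and allowed."""
    allowed = [d for d in request_depths if within_depth_limit(d, depth_limit)]
    return max(allowed, default=0)


# A spider whose callbacks only ever produce depth-1 requests:
# raising the limit changes nothing, the statistic stays the same.
shallow = [1, 1]
for limit in (0, 2, 5):
    print(limit, observed_max_depth(shallow, limit))

# The limit only matters for a spider that does produce deeper requests.
deep = [1, 2, 3]
print(observed_max_depth(deep, 0))
print(observed_max_depth(deep, 2))
```

In other words, DEPTH_LIMIT can only cap the statistic; it cannot raise it when the spider never asks for deeper pages.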

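You would have to show your spider code for us to understand what is going on.

But please open a separate question for that.

Update:

Ah, I see you are running the mininova spider from the Scrapy intro: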

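from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# TorrentItem is assumed to be an Item subclass defined elsewhere,
# with url, name, description and size fields.

class MininovaSpider(CrawlSpider):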

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent

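As you can see from the code, the spider never issues any requests for other pages; it scrapes all the data right from the top-level pages. That is why the maximum depth is 1.

If you write your own spider that follows links to other pages, the maximum depth will be greater than 1.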
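As a rough illustration of that last point, the sketch below crawls a made-up site twice, once without and once with link following on the scraped pages. The `SITE` graph and the `run` helper are hypothetical, and this is a simplified model, not how CrawlSpider is implemented.

```python
from collections import deque

# Hypothetical site: a listing page, item pages, and one "related" page.
SITE = {
    "/list": ["/item/1", "/item/2"],
    "/item/1": ["/related/a"],
    "/item/2": [],
    "/related/a": [],
}


def run(start, follow):
    """Return the depth of the deepest page fetched."""
    seen = {start}
    queue = deque([(start, 0)])
    max_depth = 0
    while queue:
        url, depth = queue.popleft()
        max_depth = max(max_depth, depth)
        # Links on the start page are always extracted; deeper pages only
        # contribute new requests when follow is enabled.
        if depth == 0 or follow:
            for link in SITE[url]:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return max_depth


print(run("/list", follow=False))  # item pages are scraped, nothing beyond them
print(run("/list", follow=True))   # the related page is reached as well
```

With following enabled and at least one new link on the scraped pages, the crawl goes past depth 1.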
