python - Robotparser doesn't seem to parse correctly

Question

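I am writing a crawler, and for it I am implementing a robots.txt parser using the standard library module robotparser.

It seems that robotparser is not parsing correctly. I am debugging my crawler against Google's robots.txt.

(The following examples are from IPython.)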

In [1]: import robotparser

In [2]: x = robotparser.RobotFileParser()

In [3]: x.set_url("http://www.google.com/robots.txt")

In [4]: x.read()

In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on Disallow
Out[5]: False

In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's Allowed
Out[6]: False

In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False

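This is odd, because sometimes it seems to "work" and sometimes it seems to fail. I also tried the robots.txt files from Facebook and Stack Overflow. Is this a bug in the robotparser module, or am I doing something wrong here? If so, what?

I was wondering whether this bug is related in any way.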

score 4 · Accepted Answer

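This isn't a bug, but rather a difference in interpretation. According to the draft robots.txt specification (which was never approved, nor is it likely to be):

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

(Section 3.2.2, The Allow and Disallow Lines)

Under that interpretation, "/catalogs/p?" should be rejected, because a "Disallow: /catalogs" directive appears earlier in the record.

At some point, Google started interpreting robots.txt differently from that specification. Their method appears to be: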

Check for Allow. If it matches, crawl the page.
Check for Disallow. If it matches, don't crawl.
Otherwise, crawl.
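
To make the difference concrete, here is a minimal sketch of both interpretations. It is not the code of robotparser or of any real crawler: the function names, the `(directive, prefix)` rule format, and the two-rule record are assumptions made for illustration, and matching is plain prefix matching with no wildcard support.

```python
def first_match_allowed(rules, path):
    """1996-draft style: check rules in file order; the first match wins."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # no rule matched, so the URL is allowed by default


def allow_first_allowed(rules, path):
    """Allow-before-Disallow style, following the three steps listed above."""
    if any(d == "allow" and path.startswith(p) for d, p in rules):
        return True
    if any(d == "disallow" and path.startswith(p) for d, p in rules):
        return False
    return True


# Hypothetical record, ordered like the rules discussed in this answer.
RULES = [
    ("disallow", "/catalogs"),
    ("allow", "/catalogs/p?"),
]

print(first_match_allowed(RULES, "/catalogs/p?"))  # False
print(allow_first_allowed(RULES, "/catalogs/p?"))  # True
```

The same record gives opposite answers for the same path, which is the disagreement described in this answer.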

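The problem is that there is no formal agreement on how robots.txt should be interpreted. I've seen crawlers that use the Google method and others that use the draft standard from 1996. When I was operating a crawler, I got nastygrams from webmasters when I used the Google interpretation, because I crawled pages they thought shouldn't be crawled, and I got nastygrams from others when I used the other interpretation, because pages they thought should be indexed weren't.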

score 3 · Accepted Answer

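After a few Google searches I didn't find anything about the robotparser issue. I ended up with something else: I found a module called reppy, ran some tests with it, and it seems very powerful. You can install it through pip: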

pip install reppy

Here are some examples (in IPython) using reppy, again with Google's robots.txt:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

In [10]: # It also has a x.disallowed function. The contrary of x.allowed

score 2 · Accepted Answer

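Interesting question. I had a look at the source (I only have the Python 2.4 source available, but I bet it hasn't changed), and the code normalizes the URL being tested by executing: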

urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])

This is the source of your problem:

>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo"))[2]) 
'/foo'
>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo?"))[2]) 
'/foo'

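So it's either a bug in Python's library, or Google is breaking the robots.txt spec by including a "?" character in a rule (which is a bit unusual).

[Just in case it's not clear, I'll say it again in a different way. The code above is used by the robotparser library as part of checking the URL. So when the URL ends in a "?", that character is dropped. When you checked /catalogs/p?, the test actually executed was for /catalogs/p. Hence your surprising result.]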
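
For readers on Python 3, the same expression can be sketched with `urllib.parse`. This is a translation of the Python 2 one-liner quoted above, not the current source of the standard library's robots.txt parser, and the helper name `normalize` is made up for illustration.

```python
from urllib.parse import quote, unquote, urlparse


def normalize(url):
    # Index 2 of the parse result is the path component only, so the
    # query string, including a bare trailing "?", is discarded.
    return quote(urlparse(unquote(url))[2])


print(normalize("/catalogs/p"))   # /catalogs/p
print(normalize("/catalogs/p?"))  # /catalogs/p
print(normalize("http://www.google.com/catalogs/p?"))  # /catalogs/p
```

All three inputs collapse to the same path, so a rule that differs only by a trailing "?" cannot be told apart after this normalization.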

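I'd suggest filing a bug with the Python folks (you can post a link to this page as part of the explanation) [edit: thanks]. And then use the other library you found...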

score 1 · Accepted Answer

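About a week ago we merged a commit containing a bug that caused this issue. We have just pushed version 0.2.2 to pip and to master in the repo, including a regression test for exactly this issue.

Version 0.2 contains a slight interface change: you now must create a RobotsCache object, which has the exact interface that reppy originally had. This was mostly to make the caching explicit and to make it possible to have different caches within the same process. But behold, it now works again!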

from reppy.cache import RobotsCache
cache = RobotsCache()
cache.allowed('http://www.google.com/catalogs', 'foo')
cache.allowed('http://www.google.com/catalogs/p', 'foo')
cache.allowed('http://www.google.com/catalogs/p?', 'foo')
