robots.txt - Googlebot not respecting robots.txt

Question

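For some reason, when I use "Analyze robots.txt" in Google Webmaster Tools to see which URLs our robots.txt file blocks, the results are not what I expect. Here is a snippet from the beginning of our file: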

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

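For both Googlebot and Mediapartners-Google, anything in the scripts folder is blocked correctly. I can tell the two robots are seeing the right directives, because Googlebot reports that scripts is blocked by line 7, while Mediapartners-Google reports line 4. But none of the other URLs I enter from the Disallow list under the second User-agent directive are blocked!

I'm wondering whether my comment or the use of absolute URLs is screwing things up...

Any insight is appreciated. Thanks.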

score 11 · Accepted Answer

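They are being ignored because the Disallow entries in your robots.txt file use fully qualified URLs, which the specification does not allow. A Disallow value should be a path only, written from the site root and starting with /. The Sitemap line is the one exception: it takes a full URL, so leave it as you had it. Try the following:

Sitemap: http://[omitted]/sitemap_index.xml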

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

As for caching, Google tries to fetch a copy of the robots.txt file about every 24 hours on average.
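To see the difference between the two forms without waiting for a recrawl, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Google's parser, so treat it only as an illustration of path-prefix matching. The host `example.com` is a hypothetical stand-in for the omitted domain, and `build_parser` is a helper name made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical host standing in for the omitted domain in the question.
HOST = "http://example.com"
ARTICLE = "/Living/books/book-review-not-stupid.aspx"


def build_parser(disallow_value: str) -> RobotFileParser:
    """Parse a tiny robots.txt whose second rule uses the given Disallow value."""
    lines = [
        "User-agent: *",
        "Disallow: /scripts",
        f"Disallow: {disallow_value}",
    ]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


broken = build_parser(HOST + ARTICLE)  # fully qualified URL, as in the question
fixed = build_parser(ARTICLE)          # path only, as in this answer

url = HOST + ARTICLE
script_url = HOST + "/scripts/site.js"

print("absolute rule blocks article:", not broken.can_fetch("Googlebot", url))
print("path rule blocks article:", not fixed.can_fetch("Googlebot", url))
print("scripts rule blocks script:", not broken.can_fetch("Googlebot", script_url))
```

The parser compares each rule against the path portion of the requested URL, so a rule that begins with `http://` never lines up with a path that begins with `/`, while the `/scripts` rule keeps working in both files. That matches what the question describes.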

score 2 · Accepted Answer

It's the absolute URLs. robots.txt should contain only relative URIs; the domain is inferred from the domain the robots.txt file was fetched from.
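If the list of articles arrives as full URLs, the rules can be rewritten mechanically. The following is a rough sketch, assuming Python and its standard `urllib.parse`; `to_path_rule` is a hypothetical helper name and `example.com` again stands in for the omitted domain. It only touches Allow and Disallow lines and leaves everything else, including the Sitemap line, untouched.

```python
from urllib.parse import urlsplit


def to_path_rule(line: str) -> str:
    """Rewrite 'Disallow: http://host/path' as 'Disallow: /path'.

    Lines that are not Allow/Disallow rules, or that already hold a
    path, are returned unchanged.
    """
    field, sep, value = line.partition(":")
    if not sep or field.strip().lower() not in ("allow", "disallow"):
        return line  # Sitemap, User-agent, comments, and blank lines
    parts = urlsplit(value.strip())
    if not parts.scheme or not parts.netloc:
        return line  # already a path-only rule
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{field.strip()}: {path}"


# Hypothetical input resembling the file in the question.
original = [
    "Sitemap: http://example.com/sitemap_index.xml",
    "User-agent: *",
    "Disallow: /scripts",
    "# list of articles given by the Content group",
    "Disallow: http://example.com/Living/books/book-review-not-stupid.aspx",
]

converted = [to_path_rule(line) for line in original]
print("\n".join(converted))
```

Dropping the scheme and host is safe here because a robots.txt file only ever applies to the host it is served from.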

score 0 · Accepted Answer

It has been there for at least a week, and Google says it last downloaded the file 3 hours ago, so I'm sure it has the recent version.

score -1 · Accepted Answer

Did you make this change to the robots.txt file recently? In my experience, Google seems to cache these files for a long time.
