ruby-on-rails-3 - Using robots.txt to keep search spiders away from Rails 3 nested resources

Question

I'm trying to stop Google, Yahoo, etc. from hitting my /products/ID/purchase pages, but I'm not sure how to do it.

I currently block them from the sign-in page with the following:

User-agent: *
Disallow: /sign_in

Can I do the following?

User-agent: *
Disallow: /products/*/purchase

Or should it be:

User-agent: *
Disallow: /purchase

score 2 · Accepted Answer

I assume you want to block /products/ID/purchase but allow /products/ID.

Your last suggestion would only block URLs whose path starts with /purchase:

User-agent: *
Disallow: /purchase

So this is not what you want.

You need your second suggestion:

User-agent: *
Disallow: /products/*/purchase

This blocks all URLs whose path starts with /products/, followed by any characters, followed by /purchase.

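Note: this uses the wildcard *. In the original robots.txt "specification", * in a Disallow path is not a character with a special meaning. However, some search engines extended the spec and treat it as a kind of wildcard. So it should work for Google and probably some other search engines, but you can't count on it working with every other crawler or bot.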
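The difference between the two interpretations can be sketched in a few lines of Python. This is only an illustration of the matching idea, not how any particular crawler is implemented; the function names `wildcard_match` and `prefix_match` are made up for this example.

```python
import re


def wildcard_match(pattern: str, path: str) -> bool:
    """Wildcard-aware reading (hypothetical helper): '*' stands for any
    run of characters, and the pattern only has to match the start of
    the path."""
    # Escape each literal piece, then join the pieces with '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


def prefix_match(pattern: str, path: str) -> bool:
    """Original-spec reading (hypothetical helper): the pattern is a
    plain prefix and '*' is just a literal character."""
    return path.startswith(pattern)


rule = "/products/*/purchase"

# A wildcard-aware crawler treats the purchase page as blocked
# and the product page as allowed.
print(wildcard_match(rule, "/products/42/purchase"))  # True
print(wildcard_match(rule, "/products/42"))           # False

# A crawler that only does prefix matching never sees a match,
# because no real path contains a literal '*'.
print(prefix_match(rule, "/products/42/purchase"))    # False

# "Disallow: /purchase" only covers paths that begin with /purchase.
print(prefix_match("/purchase", "/products/42/purchase"))  # False
```

Under these assumptions, the rule protects /products/ID/purchase only from crawlers that implement the wildcard extension; a strict prefix matcher ignores it entirely.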

So your robots.txt could look like:

User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase

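Also note that some search engines (including Google) might still list a URL in their search results (without title or snippet) even though it is blocked in robots.txt. This can happen when they find a link to the blocked page on a page that they are allowed to crawl. To prevent this, you would have to noindex the documents.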

score 0 · Accepted Answer

According to Google, Disallow: /products/*/purchase should work. According to robotstxt.org, however, it doesn't.
