robots.txt - How to disallow search pages in robots.txt

Question

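I need to prevent search pages such as http://example.com/startup?page=2 from being indexed.

I want http://example.com/startup to be indexed, but not http://example.com/startup?page=2, page=3, and so on.

Also, the "startup" part can be arbitrary, e.g. http://example.com/XXXXX?page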

score 9 · Accepted Answer

As confirmed by the "test robots.txt" feature in Google Webmaster Tools, something like this works:

User-Agent: *
Disallow: /startup?page=

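Disallow: the value of this field specifies a partial URL that is not to be visited. This can be a full path or a partial path; any URL that starts with this value will not be retrieved.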
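The prefix rule quoted above can be checked with a few lines of Python. This is a minimal sketch, not a full robots.txt parser: the helper name `is_disallowed` is hypothetical, and it only models plain prefix matching of a single Disallow value against the URL's path plus query string.

```python
from urllib.parse import urlsplit

def is_disallowed(url: str, disallow_value: str) -> bool:
    """Hypothetical helper: plain prefix match against one Disallow value."""
    parts = urlsplit(url)
    # The rule is compared against the path plus the query string, not the host
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value blocks nothing
    return bool(disallow_value) and target.startswith(disallow_value)

rule = "/startup?page="
print(is_disallowed("http://example.com/startup", rule))         # False
print(is_disallowed("http://example.com/startup?page=2", rule))  # True
```

Because the match is a prefix match, one rule covers page=2, page=3, and every later page, while the bare /startup URL stays crawlable.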

However, if the first part of the URL varies, you have to use wildcards:

User-Agent: *
Disallow: /startup?page=
Disallow: *page=
Disallow: *?page=
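To see what the wildcard rules above would match, here is a rough Python sketch of wildcard matching as crawlers such as Googlebot describe it: `*` stands for any run of characters and a trailing `$` anchors the end of the URL. The function name `pattern_matches` is an assumption made for illustration, and real crawlers may differ in details such as percent-encoding.

```python
import re

def pattern_matches(pattern: str, target: str) -> bool:
    """Hypothetical helper: match a robots.txt path pattern against path + query."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every '*' match any run of characters
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so a pattern without '*' is a prefix match
    return re.match(regex, target) is not None

for rule in ("/startup?page=", "*page=", "*?page="):
    for target in ("/startup", "/startup?page=2", "/XXXXX?page=3"):
        print(rule, target, pattern_matches(rule, target))
```

Under this sketch, `*page=` also matches a URL like /list?homepage=1, so `*?page=` is the narrower of the two wildcard rules.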

score 3 · Accepted Answer

You can put this on the pages you don't want indexed:

<META NAME="ROBOTS" CONTENT="NONE">

This tells robots not to index the page or follow its links (NONE is equivalent to NOINDEX,NOFOLLOW).

On search pages, this one may be more useful:

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

This instructs robots not to index the current page but still follow the links on it, so they can reach the pages found by the search.
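If the pages are generated server-side, the tag can be emitted only for paginated URLs. The following is a sketch under assumptions: the helper name `robots_meta_for` is hypothetical, and it treats any URL carrying a `page` query parameter as a search page that should not be indexed.

```python
from urllib.parse import urlsplit, parse_qs

def robots_meta_for(url: str) -> str:
    """Hypothetical helper: return a robots meta tag for paginated URLs, else ''."""
    # keep_blank_values=True so a bare "?page" (no value) is still detected
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "page" in query:
        return '<meta name="robots" content="noindex,follow">'
    return ""

print(robots_meta_for("http://example.com/startup?page=2"))
print(repr(robots_meta_for("http://example.com/startup")))
```

The returned string would go inside the page's head element; how it gets there depends on the templating system in use.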

score 2 · Accepted Answer

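Create a text file and name it robots.txt
Add the User-agent and Disallow sections (see the sample below)
Place the file in the root directory of your site

Sample: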

###############################
#My robots.txt file
#
User-agent: *
#
#list directories robots are not allowed to index 
#
Disallow: /testing/
Disallow: /staging/
Disallow: /admin/
Disallow: /assets/
Disallow: /images/
#
#
#list specific files robots are not allowed to index
#
Disallow: /startup?page=2
Disallow: /startup?page=3
# 
#
#End of robots.txt file
#
###############################

Here is a link to Google's actual robots.txt file

You can find some useful information in the Google Webmaster help topic on blocking or removing pages using a robots.txt file.
