search-engine - Googlebot finds robots.txt, but can't download it

Question

Can anyone tell me what is wrong with this robots.txt?

http://bizup.cloudapp.net/robots.txt

Here is the error I get in Google Webmaster Tools:

Sitemap errors and warnings
Line    Status  Details
Errors  -   
Network unreachable: robots.txt unreachable
We were unable to crawl your Sitemap because we found a robots.txt file at the root of
your site but were unable to download it. Please ensure that it is accessible or remove
it completely.

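The link above is actually a route mapped to a Robots action. That action fetches the file from storage and returns its content as text/plain. Google says it can't download the file. Is that the reason?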

score 4 · Accepted Answer

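It looks like Google is reading robots.txt fine, but your robots.txt then claims that http://bizup.cloudapp.net/robots.txt is also the URL of your XML sitemap, when the sitemap is really at http://bizup.cloudapp.net/sitemap.xml. The error seems to come from Google trying to parse robots.txt as an XML sitemap. You need to change your robots.txt to: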

User-agent: *
Allow: /
Sitemap: http://bizup.cloudapp.net/sitemap.xml
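
A quick way to catch this kind of mistake is to check where each Sitemap line points. The following is only a minimal Python sketch, not anything taken from the site itself; the helper names `sitemap_urls` and `points_at_robots_txt` are made up for the illustration.

```python
from urllib.parse import urlsplit


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs named by Sitemap: lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # Field names in robots.txt are case-insensitive.
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def points_at_robots_txt(url: str) -> bool:
    """True if a sitemap URL actually names a robots.txt file."""
    return urlsplit(url).path.lower().endswith("/robots.txt")


broken = """User-agent: *
Allow: /
Sitemap: http://bizup.cloudapp.net/robots.txt
"""
fixed = broken.replace("/robots.txt", "/sitemap.xml")

# Sitemap entries that wrongly point back at robots.txt.
print([u for u in sitemap_urls(broken) if points_at_robots_txt(u)])
print([u for u in sitemap_urls(fixed) if points_at_robots_txt(u)])
```

The first list contains the self-referencing entry from the original file; the second is empty once the line points at sitemap.xml.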

Edit

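It actually goes a bit deeper than that: Googlebot cannot download any page on your site at all. Here is the exception returned when Googlebot requests either robots.txt or the home page:

Cookieless Forms Authentication is not supported for this application.

Exception Details: System.Web.HttpException: Cookieless Forms Authentication is not supported for this application.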

[HttpException (0x80004005): Cookieless Forms Authentication is not supported for this application.]
AzureBright.MvcApplication.FormsAuthentication_OnAuthenticate(Object sender, FormsAuthenticationEventArgs args) in C:\Projectos\AzureBrightWebRole\Global.asax.cs:129
System.Web.Security.FormsAuthenticationModule.OnAuthenticate(FormsAuthenticationEventArgs e) +11336832
System.Web.Security.FormsAuthenticationModule.OnEnter(Object source, EventArgs eventArgs) +88
System.Web.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +80
System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +266

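FormsAuthentication is trying to use cookieless mode because it recognizes that Googlebot does not support cookies, but something in your FormsAuthentication_OnAuthenticate method then throws an exception because it refuses to accept cookieless authentication.

I think the easiest fix is to change the following in web.config, which stops FormsAuthentication from ever trying to use cookieless mode: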

<authentication mode="Forms"> 
    <forms cookieless="UseCookies" ...>
    ...

score 2 · Accepted Answer

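I fixed this problem in a simple way: by adding a robots.txt file (in the same directory as my index.html file) that allows all access. I had left it out, intending to allow all access that way, but perhaps Google Webmaster Tools then found another robots.txt controlled by my ISP?

So, at least with some ISPs, you should have a robots.txt file even if you don't want to exclude any bots, just to prevent this possible glitch.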

score 1 · Accepted Answer

There is a problem with the script that generates the robots.txt file. When Googlebot requests the file, it gets a 500 Internal Server Error. Here are the results of a header check:

Request: http://bizup.cloudapp.net/robots.txt
GET /robots.txt HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept: */*
Host: bizup.cloudapp.net
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Server Response: 500 Internal Server Error
Cache-Control: private
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 19 Aug 2010 16:52:09 GMT
Content-Length: 4228
Final Destination Page

You can test the headers here: http://www.seoconsultants.com/tools/headers/#Report
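
The same check can be reproduced from a script. Below is a rough sketch using only Python's standard library, assuming the server varies its response based on the User-Agent header alone; the helper names `googlebot_request` and `fetch_status` are illustrative, not part of any tool mentioned above.

```python
import urllib.error
import urllib.request

# User-Agent string copied from the header check above.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def googlebot_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this request."""
    try:
        with urllib.request.urlopen(googlebot_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return err.code
```

Calling `fetch_status("http://bizup.cloudapp.net/robots.txt")` and comparing the result with a request sent under a normal browser User-Agent would show whether the 500 is specific to crawlers.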

score 1 · Accepted Answer

There is no problem fetching your robots.txt:

User-agent: *
Allow: /
Sitemap: http://bizup.cloudapp.net/robots.txt

But isn't it making a recursive robots.txt call?

A sitemap should be an XML file; see Wikipedia.
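
To illustrate the difference, here is a small Python sketch that checks whether a document looks like an XML sitemap by inspecting its root element. The namespace is the one defined by the sitemaps.org protocol; the function name `looks_like_sitemap` and the sample sitemap body are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# A sitemap file has either <urlset> or <sitemapindex> as its root element.
SITEMAP_ROOTS = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}


def looks_like_sitemap(body: str) -> bool:
    """True if body parses as XML with a sitemap root element."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Plain text such as robots.txt is not well-formed XML.
        return False
    return root.tag in SITEMAP_ROOTS


robots_body = "User-agent: *\nAllow: /\nSitemap: http://bizup.cloudapp.net/robots.txt\n"
sitemap_body = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://bizup.cloudapp.net/</loc></url>"
    "</urlset>"
)

print(looks_like_sitemap(robots_body))   # False
print(looks_like_sitemap(sitemap_body))  # True
```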
