web-crawler - How to prevent search engines from crawling a specific site in a Sitecore multisite environment

Question

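We have implemented a multisite solution in our Sitecore project. We placed a robots.txt in the website root to prevent crawling of specific directories on the production server.

We are now going to host one more site, beta.example.com, on the production server, and we want to keep this subdomain from being crawled.

How can we achieve this, given that it is a multisite environment with only one robots.txt file? How can we make the crawl rules apply to one specific site?

Do we need to write a pipeline processor for this?

Thanks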

score 3 · Accepted Answer

You can add a custom handler to serve your robots.txt, like this:

<customHandlers>
  <handler trigger="robots.txt" handler="RobotsTxtHandler.ashx" />
</customHandlers>

Then, in the code-behind of the ashx, you can write the logic needed to load the required robots.txt.

public void ProcessRequest(HttpContext context)
{
    // Read from the published ("web") database.
    var database = Factory.GetDatabase("web");
    // Context here is Sitecore.Context (the resolved site), not the HttpContext parameter.
    // Root path + start item gives the path of the current site's start item.
    var path = string.Format("{0}{1}", Context.Site.RootPath, Context.Site.StartItem);
    Item siteRoot = database.GetItem(path);
    if (siteRoot != null)
    {
        context.Response.Clear();
        context.Response.ContentType = "text/plain";
        context.Response.ContentEncoding = System.Text.Encoding.UTF8;

        // Write your code to fetch the robots.txt from the Sitecore item
    }

    context.Response.End();
}

Note that the code-behind class of the ashx must implement IHttpHandler.
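
As a rough illustration of what such a class could look like, here is a minimal sketch. It assumes the robots.txt text is kept in a field on each site's start item. The field name "Robots Txt" is hypothetical and would need to match whatever field you add to your own template. It compiles only inside a project that references the Sitecore assemblies.

```csharp
using System.Text;
using System.Web;
using Sitecore.Configuration;
using Sitecore.Data.Items;

public class RobotsTxtHandler : IHttpHandler
{
    // Hypothetical field name (an assumption): use the field defined on your own template.
    private const string RobotsFieldName = "Robots Txt";

    // The handler keeps no per-request state, so one instance can be reused.
    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        var site = Sitecore.Context.Site;
        var database = Factory.GetDatabase("web");

        // StartPath is the site's root path combined with its start item.
        Item siteRoot = site == null ? null : database.GetItem(site.StartPath);

        // The string indexer returns the field value, or an empty string if the field is absent.
        string robots = siteRoot == null ? string.Empty : siteRoot[RobotsFieldName];

        context.Response.Clear();
        context.Response.ContentType = "text/plain";
        context.Response.ContentEncoding = Encoding.UTF8;
        context.Response.Write(robots);
        context.Response.End();
    }
}
```

Because the site is resolved from the incoming host name, a request to beta.example.com/robots.txt would read the field from the beta site's start item, while the other sites keep serving their own content.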

You also need to register the handler in web.config under the <system.webServer> <handlers> element.

<add verb="*" path="RobotsTxtHandler.ashx" type="YourNamespace.RobotsTxtHandler, YourAssembly" name="RobotsTxtHandler" />

My recommendation is to store each site's robots.txt in a Sitecore item rather than in the website root. That way every site gets its own robots.txt.
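
For the beta.example.com case from the question, the content stored for that site could be as simple as the following, which asks all compliant crawlers to stay away from every path. Where exactly this text is stored (for example, a field on the beta site's item) is an assumption and depends on how you model it in Sitecore.

```text
User-agent: *
Disallow: /
```

Keep in mind that robots.txt is only a request to well-behaved crawlers and does not restrict access to the site.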
