
I want to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that play nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like "bot". But that seems awkward, incomplete, and unmaintainable. So does anyone have more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?

In case you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we'd always give it the same version so that the index is consistent.

Also, I'm using Java, but I would imagine the approach would be similar for any server-side technology.


7 Answers


You said matching the user agent on "bot" may be awkward, but we've found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven't come across any false positive matches with it yet either. If you want to raise this up to 99.9%, you can include a few other well-known matches such as "crawler", "baiduspider", "ia_archiver", "curl", etc. We've tested this on our production systems over millions of hits.

Here are a few C# solutions for you:

1) Simplest

Fastest when processing a miss, i.e. traffic from a non-bot: a normal user. Catches 99+% of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);

2) Medium

Fastest when processing a hit, i.e. traffic from a bot. Pretty fast for misses too. Catches close to 100% of crawlers. Matches "bot", "crawler", and "spider" up front. You can add any other known crawlers to it.

List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));

3) Paranoid

Pretty fast, but a little slower than options 1 and 2. It's the most accurate, and it lets you maintain the lists if you want. You can maintain a separate list of names containing "bot" if you're worried about false positives in the future. If we get a short match, we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;

Notes:

  • It's tempting to just keep adding names to regex option 1. But if you do, it will get slower. If you want a more complete list, then LINQ with a lambda is faster.
  • Make sure .ToLower() is outside your LINQ method; remember the method is a loop, and you would be lowercasing the string on every iteration.
  • Always put the heaviest-hitting bots at the start of the lists, so they match sooner.
  • Put the lists into a static class so that they are not rebuilt on every page view (a minimal sketch follows this list).
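
To make those last two notes concrete, here is a minimal sketch; the class name CrawlerDetector is illustrative rather than from the code above, and the list is abbreviated, so fill it with the full lists:

using System.Collections.Generic;

public static class CrawlerDetector
{
    // Built once per app domain and reused across requests,
    // rather than being rebuilt on every page view.
    private static readonly List<string> Crawlers = new List<string>
    {
        "bot", "crawler", "spider", "baiduspider", "yahoo! slurp" // ...full list from above
    };

    public static bool IsCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLower(); // lowercase once, outside the loop
        return Crawlers.Exists(x => ua.Contains(x));
    }
}

// usage: bool iscrawler = CrawlerDetector.IsCrawler(Request.UserAgent);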

Honeypot

The only real alternative to this is to create a "honeypot" link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database, and you can use those logged strings to classify crawlers.

Positives: it will match some unknown crawlers that don't declare themselves.

Negatives: not all crawlers dig deep enough to hit every link on your site, so they may never reach your honeypot.
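
If you go the honeypot route, the logging side can be as small as this sketch (a Web Forms code-behind for the trap page; the page name and log file path are placeholders, and appending to a file stands in for the database write mentioned above):

using System;
using System.IO;

public partial class DontFollowMe : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Record who reached the trap: timestamp, IP, and user agent string.
        string line = DateTime.UtcNow.ToString("o") + "\t" +
                      Request.UserHostAddress + "\t" +
                      (Request.UserAgent ?? "(none)") + Environment.NewLine;
        File.AppendAllText(Server.MapPath("~/App_Data/honeypot-hits.log"), line);
        // The page itself serves nothing of interest; only the logging matters.
    }
}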

Answered 2013-01-26T10:27:55.743

You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using this data would be far more effective than just matching "bot" in the user agent.
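
As a rough sketch of how you might consume such a list server-side (the file name and the one-entry-per-line format are my assumptions about how you would export the database, not something robotstxt.org prescribes):

using System;
using System.IO;
using System.Linq;

public static class RobotsDatabase
{
    // Loaded once; refresh the file as the database changes.
    private static readonly string[] KnownAgents =
        File.ReadAllLines("crawler-user-agents.txt")
            .Select(l => l.Trim().ToLower())
            .Where(l => l.Length > 0)
            .ToArray();

    public static bool IsKnownCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLower();
        return KnownAgents.Any(agent => ua.Contains(agent));
    }
}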

Answered 2009-02-13T02:12:00.400

One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users won't see the link, leaving spiders and bots to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs...

<a href="dontfollowme.aspx"></a>

Many people use this method while running a honeypot to catch malicious bots that don't follow the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
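
If the goal is to catch only bots that ignore robots.txt, the trap URL should also be explicitly disallowed so that well-behaved crawlers never request it; a standard exclusion matching the anchor above would be:

User-agent: *
Disallow: /dontfollowme.aspx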

Answered 2009-02-13T02:18:43.473

Any visitor whose entry page is /robots.txt is probably a bot.
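
A minimal sketch of this heuristic in Global.asax (the in-memory set is for illustration only: it grows without bound and resets when the app restarts):

using System;
using System.Collections.Concurrent;
using System.Web;

public class Global : HttpApplication
{
    private static readonly ConcurrentDictionary<string, bool> SeenIps =
        new ConcurrentDictionary<string, bool>();

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        bool isRobotsTxt = Request.Path.Equals("/robots.txt",
                                               StringComparison.OrdinalIgnoreCase);
        // TryAdd succeeds only on the first request we see from this IP,
        // so the flag records what the client's entry page was.
        if (SeenIps.TryAdd(Request.UserHostAddress, isRobotsTxt) && isRobotsTxt)
        {
            Response.AppendToLog("entry-page-was-robots.txt");
        }
    }
}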

Answered 2009-02-13T02:06:57.323

Something quick and dirty like this might be a good start:

return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i

Note: Rails code, but the regex is generally applicable.

Answered 2010-06-25T22:43:04.757

I'm pretty sure a large proportion of bots don't use robots.txt, but that was my first thought.

It seems to me that the best way to detect a bot is with the time between requests: if the time between requests is consistently fast, then it's a bot.
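
A rough sketch of that idea in C# (the 500 ms threshold is an arbitrary example, and since the point is "consistently" fast, a real implementation would require several fast intervals in a row before flagging):

using System;
using System.Collections.Concurrent;

public static class RequestTimingDetector
{
    // Last request time per client IP. Unbounded; a real version would evict old entries.
    private static readonly ConcurrentDictionary<string, DateTime> LastSeen =
        new ConcurrentDictionary<string, DateTime>();

    // Anything arriving faster than this after the previous request looks automated.
    private static readonly TimeSpan MinHumanInterval = TimeSpan.FromMilliseconds(500);

    public static bool IsSuspiciouslyFast(string clientIp)
    {
        DateTime now = DateTime.UtcNow;
        bool fast = LastSeen.TryGetValue(clientIp, out DateTime previous) &&
                    (now - previous) < MinHumanInterval;
        LastSeen[clientIp] = now; // read and write are not atomic; fine for a sketch
        return fast;
    }
}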

Answered 2012-08-21T07:57:13.890
// Uses ASP.NET's built-in browser capabilities (browscap) data to flag crawlers.
void CheckBrowserCaps()
{
    String labelText = "";
    System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
    if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler)
    {
        labelText = "Browser is a search engine.";
    }
    else
    {
        labelText = "Browser is not a search engine.";
    }

    Label1.Text = labelText;
}

HttpCapabilitiesBase.Crawler Property

Answered 2021-04-13T10:13:43.663