
So far I have been able to detect bots by matching the user agent strings in my list against known bot user agents, but I would like to know what other ways there are to do this in PHP, because this method is catching fewer bots than expected.
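For reference, the kind of matching I mean looks roughly like this (the keyword list below is only an illustration, not my actual list):

<!-- language: php -->

    <?php
    // Simplified illustration of keyword matching against the user agent.
    // The keyword list is an example only, not a real bot database.
    function looksLikeBot($userAgent) {
        $botKeywords = array('bot', 'crawler', 'spider', 'slurp', 'curl', 'wget');
        $ua = strtolower($userAgent);
        foreach ($botKeywords as $keyword) {
            if (strpos($ua, $keyword) !== false) {
                return true;
            }
        }
        return false;
    }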

I am also looking for a way to detect whether a browser or bot is spoofing another browser via its user agent string.

Any suggestions are appreciated.

Edit: This has to be done using a log file with lines like the following:

129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"

This means I cannot check user behavior beyond access times.
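For reference, the individual fields can be pulled out of such a line with something along these lines (a rough sketch for the combined log format shown above; the variable names are just my own labels):

<!-- language: php -->

    <?php
    // Sketch: split an Apache "combined" log line into its fields.
    $line = '129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"';

    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (preg_match($pattern, $line, $m)) {
        list(, $ip, $time, $request, $status, $bytes, $referer, $userAgent) = $m;
        // $ip and $userAgent are the two fields most relevant for bot detection.
    }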


5 Answers


In addition to filtering keywords in the user agent string, I have had luck with putting a hidden honeypot link on all pages:

<a style="display:none" href="autocatch.php">A</a>

Then in "autocatch.php" record the session (or IP address) as a bot. The link is invisible to users, but its hidden nature would hopefully not be noticed by bots. Taking the style attribute out and putting it into a CSS file might help even more.
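A minimal sketch of what "autocatch.php" might do (the log file name and session flag are just placeholders):

<!-- language: php -->

    <?php
    // autocatch.php - anything requesting this hidden link is treated as a bot.
    // Sketch only: flags the session and appends the client IP and user agent
    // to a flat file; you could just as easily write to a database instead.
    session_start();
    $_SESSION['is_bot'] = true;

    $entry = sprintf(
        "%s\t%s\t%s\n",
        date('c'),
        $_SERVER['REMOTE_ADDR'],
        isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-'
    );
    file_put_contents('bot_ips.log', $entry, FILE_APPEND | LOCK_EX);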

Answered 2012-11-14T04:09:22.483

Because, as previously stated, both user agents and IPs can be spoofed, neither can be used for reliable bot detection on its own.

I work for a security company, and our bot detection algorithm looks something like this:

  1. Step 1 - Gathering data:

    a. Cross-check the user agent against the IP (both need to match; see the sketch after this list).

    b. Check header parameters (what is missing, what order they appear in, etc.).

    c. Check behavior (early access to and compliance with robots.txt, general behavior, number of pages visited, visit rates, etc.).

  2. Step 2 - Classification:

    By cross-verifying the data, the bot is classified as "Good", "Bad" or "Suspicious".

  3. Step 3 - Active challenges:

    Suspicious bots undergo the following challenges:

    a. JS challenge (can it execute JavaScript?)

    b. Cookie challenge (can it accept cookies?)

    c. If still not conclusive -> CAPTCHA
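For step 1a, one check that an individual can replicate is verifying self-declared search-engine crawlers by reverse and then forward DNS lookup. Here is a sketch using Googlebot as an example; this is the generic technique Google documents for verifying its crawler, not necessarily our exact implementation:

<!-- language: php -->

    <?php
    // Sketch: verify that a request claiming to be Googlebot really comes
    // from a Google-owned host, via reverse DNS followed by forward DNS.
    function isVerifiedGooglebot($ip, $userAgent) {
        if (stripos($userAgent, 'Googlebot') === false) {
            return false; // does not claim to be Googlebot at all
        }
        $host = gethostbyaddr($ip);          // reverse DNS lookup
        if ($host === false || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
            return false;                    // hostname is not in Google's domains
        }
        return gethostbyname($host) === $ip; // forward DNS must resolve back to the IP
    }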

This filtering mechanism is VERY effective, but I don't really think it could be replicated by a single person or even an unspecialized provider (for one thing, the challenges and the bot database need to be constantly updated by a security team).

We offer a sort of "do it yourself" tool in the form of Botopedia.org, our directory that can be used for IP/user-agent cross-verification, but for a truly efficient solution you will have to rely on specialized services.

There are several free bot-monitoring solutions, including our own, and most use a strategy similar to the one I've described above.

GL

Answered 2012-11-14T11:51:30.750

Beyond just comparing user agents, you would keep a log of activity and look for robot behavior. Often this includes requests for /robots.txt and never loading images. Another trick is to ask the client whether it has JavaScript, since most bots won't report it as enabled.
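Given that you only have the log file, a rough sketch of that kind of behavioral check might look like this (field positions assume the combined log format from the question; the thresholds are arbitrary and would need tuning):

<!-- language: php -->

    <?php
    // Sketch: flag IPs that fetched /robots.txt or never requested any page assets.
    $stats = array(); // ip => array('hits' => int, 'assets' => int, 'robots' => bool)

    foreach (file('access.log') as $line) {
        if (!preg_match('/^(\S+) .*? "(?:GET|POST|HEAD) (\S+)[^"]*"/', $line, $m)) {
            continue;
        }
        list(, $ip, $path) = $m;
        if (!isset($stats[$ip])) {
            $stats[$ip] = array('hits' => 0, 'assets' => 0, 'robots' => false);
        }
        $stats[$ip]['hits']++;
        if (preg_match('/\.(jpe?g|png|gif|css|js)(\?|$)/i', $path)) {
            $stats[$ip]['assets']++; // images, CSS, JS
        }
        if (strpos($path, '/robots.txt') === 0) {
            $stats[$ip]['robots'] = true;
        }
    }

    foreach ($stats as $ip => $s) {
        // Heuristic: touched robots.txt, or many page hits with no assets loaded.
        if ($s['robots'] || ($s['hits'] > 20 && $s['assets'] === 0)) {
            echo "$ip looks like a bot\n";
        }
    }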

However, beware: you may well accidentally flag some visitors who are genuinely human.

Answered 2012-11-14T04:03:32.430

No, user agents can be spoofed, so they cannot be trusted.

Beyond checking for JavaScript or image/CSS loading, you can also measure page-request speed, since a bot will usually crawl your site much faster than any human visitor clicks around. But this only works for small sites: on a popular site, a crowd of visitors behind a shared external IP address (a large company or a university campus) can hit your site at bot-like rates.
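A sketch of measuring that from the log file (timestamps parsed from the combined log format in the question; the rate threshold is a guess you would have to tune):

<!-- language: php -->

    <?php
    // Sketch: flag IPs whose requests arrive faster than a human plausibly clicks.
    $times = array(); // ip => list of unix timestamps

    foreach (file('access.log') as $line) {
        if (preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\]/', $line, $m)) {
            $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', $m[2]);
            if ($dt) {
                $times[$m[1]][] = $dt->getTimestamp();
            }
        }
    }

    foreach ($times as $ip => $stamps) {
        if (count($stamps) < 10) {
            continue; // too few requests to judge
        }
        sort($stamps);
        $avgGap = (end($stamps) - $stamps[0]) / (count($stamps) - 1);
        if ($avgGap < 1.0) { // arbitrary threshold: under 1 second between requests
            echo "$ip is requesting pages at bot-like speed\n";
        }
    }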

I suppose you could also measure the order in which pages are loaded, since a bot will crawl in a first-come, first-served order while human users usually don't fit that pattern, but that is a bit more complicated to track.

Answered 2012-11-14T04:14:12.197

Your question is specifically about detection using the user agent string. As many have mentioned, this can be spoofed.

To understand what spoofing makes possible and how hard it is to detect, your best bet is to learn the art yourself in PHP using cURL.

Essentially, with cURL almost everything sent in a browser (client) request can be spoofed, the IP being the notable exception, but even there a determined spoofer will hide behind a proxy server to defeat your attempts to detect their IP.
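To illustrate how little the user agent proves, a few lines of cURL in PHP are enough to impersonate a regular browser (the URLs below are placeholders, and the Firefox string is simply copied from the log line in the question):

<!-- language: php -->

    <?php
    // Sketch: a cURL request that presents itself as a desktop Firefox browser.
    // Everything here except the target URL can be made up freely by the client.
    $ch = curl_init('http://example.com/loanertracker/webcam.html');
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23',
        CURLOPT_REFERER        => 'http://example.com/some-page.html',
        CURLOPT_HTTPHEADER     => array('Accept-Language: en-US,en;q=0.5'),
    ));
    $html = curl_exec($ch);
    curl_close($ch);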

It goes without saying that a spoofer who uses the same parameters on every request can be detected, but one who rotates different parameters will be very hard, if not impossible, to pick out of genuine traffic logs.

Answered 2013-03-18T15:39:26.310