41

Some other websites use cURL and a fake HTTP referer to copy my website's content. Is there any way to detect cURL, or any client that is not a real web browser?

4

6 Answers

101

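There is no magic solution to stop automated crawling. Anything a human can do, a robot can do too. There are only measures that make the job harder, hard enough that only highly skilled geeks will try to get past them.

I ran into the same trouble a few years ago, and my first piece of advice is: if you have time, be a crawler yourself (by "crawler" I mean the person who crawls your website). It is the best school for the subject. By crawling several websites I learned about different kinds of protection, and by combining them I have been effective.

Here are some examples of protections you may try.


Sessions per IP

If a user opens 50 new sessions per minute, you can assume this user may be a crawler that does not handle cookies. Of course, curl manages cookies perfectly well, but if you couple this check with a per-session visit counter (explained later), or if your crawler is a novice with cookies, it may be effective.

It is hard to imagine 50 people behind the same shared connection hitting your website at the same time (this depends on your traffic, of course; the threshold is up to you). If it happens, you can lock the pages of your website until a captcha is filled in.

Idea:

1) Create 2 tables: one to store banned IPs, one to store IPs and sessions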

create table if not exists sessions_per_ip (
  ip int unsigned,
  session_id varchar(32),
  creation timestamp default current_timestamp,
  primary key(ip, session_id)
);

create table if not exists banned_ips (
  ip int unsigned,
  creation timestamp default current_timestamp,
  primary key(ip)
);

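2) At the beginning of your script, delete entries that are too old from both tables

3) Next, check whether your user's IP is banned (if so, set a flag to true)

4) If not, count how many sessions he has for his IP

5) If he has too many sessions, insert the IP into your banned table and set the flag

6) Insert his IP into the sessions-per-IP table if it is not already there

I wrote a code sample to show the idea more clearly.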

<?php

try
{

    // Some configuration (small values for demo)
    $max_sessions = 5; // 5 simultaneous sessions per IP allowed
    $check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
    $lock_duration = 60; // time to lock your website for this ip if max_sessions is reached

    // Mysql connection
    require_once("config.php");
    $dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Delete old entries in tables
    $query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
    $dbh->exec($query);

    $query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
    $dbh->exec($query);

    // Get useful info attached to our user...
    session_start();
    $ip = ip2long($_SERVER['REMOTE_ADDR']);
    $session_id = session_id();

    // Check if IP is already banned
    $banned = false;
    $count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
    if ($count > 0)
    {
        $banned = true;
    }
    else
    {
        // Count entries in our db for this ip
        $query = "select count(*)  from sessions_per_ip where ip = '{$ip}'";
        $count = $dbh->query($query)->fetchColumn();
        if ($count >= $max_sessions)
        {
            // Lock website for this ip
            $query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
            $dbh->exec($query);
            $banned = true;
        }

        // Insert a new entry on our db if user's session is not already recorded
        $query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
        $dbh->exec($query);
    }

    // At this point you have a $banned if your user is banned or not.
    // The following code will allow us to test it...

    // We do not display anything now because we'll play with sessions :
    // to make the demo more readable I prefer going step by step like
    // this.
    ob_start();

    // Displays your current sessions
    echo "Your current sessions keys are : <br/>";
    $query = "select session_id from sessions_per_ip where ip = '{$ip}'";
    foreach ($dbh->query($query) as $row) {
        echo "{$row['session_id']}<br/>";
    }

    // Display and handle a way to create new sessions
    echo str_repeat('<br/>', 2);
    echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
    if (isset($_GET['new']))
    {
        session_regenerate_id();
        session_destroy();
        header("Location: " . basename(__FILE__));
        die();
    }

    // Display if you're banned or not
    echo str_repeat('<br/>', 2);
    if ($banned)
    {
        echo '<span style="color:red;">You are banned: wait 60secs to be unbanned... a captcha must be more friendly of course!</span>';
        echo '<br/>';
        echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
    }
    else
    {
        echo '<span style="color:blue;">You are not banned!</span>';
        echo '<br/>';
        echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
    }
    ob_end_flush();
}
catch (PDOException $e)
{
    error_log($e->getMessage()); // log the error rather than silently discarding it
}

?>

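Visit counter

If your user crawls your pages with the same cookie, you can use his session to block him. The idea is simple: is it plausible that your user visits 60 pages in 60 seconds?

Idea:

  1. Create an array in the user session that will hold the time() of each visit.
  2. Remove visits older than X seconds from this array
  3. Add a new entry for the current visit
  4. Count the entries in this array
  5. Ban your user if he visited Y pages

Sample code: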

<?php

$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits

session_start();

// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
    $_SESSION['visit_counter'] = array();
}

// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
    if ((time() - $time) > $visit_counter_secs) {
        unset($_SESSION['visit_counter'][$key]);
    }
}

// we add the current visit into our array
$_SESSION['visit_counter'][] = time();

// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);

echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

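An image to download

When a crawler does his dirty work, it is for a large amount of data in the shortest possible time. That is why crawlers do not download the images on a page: it takes too much bandwidth and slows the crawl down.

This idea (which I think is the most elegant and the easiest to implement) uses mod_rewrite to hide code behind a .jpg/.png/... image file. The image should be present on every page you want to protect. It could be your website's logo, but choose a small image, because it must not be cached.

Idea:

1/ Add these lines to your .htaccess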

RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php

2/ Create your logo.php with the security check

<?php

// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;

// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");

// displays image
header("Content-Type: image/jpeg");
readfile("logo.jpg");
die();

3/ Increment your no_logo_count on every page that needs the protection, and check whether it has reached your limit.

Sample code:

<?php

$no_logo_limit = 5; // number of allowed pages without logo

// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
    $_SESSION['no_logo_count'] = 0;
}
else
{
    $_SESSION['no_logo_count']++;
}

// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "You did not load the image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display "show image" link : note that we're using .jpg file
echo <<< EOT

<div id="image_container">
    <a id="image_load" href="#">Load image</a>
</div>
<br/>

<script type="text/javascript">

  // In your implementation, you'll of course use <img src="logo.jpg" />
  $('#image_load').click(function(e) {
    e.preventDefault();
    $('#image_load').html('<img src="logo.jpg" />');
  });

</script>

EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

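Cookie check

You can create a cookie on the JavaScript side to check whether your user interprets JavaScript (a crawler using curl does not, for example).

The idea is simple and works much like the image check.

  1. Initialize a $_SESSION counter to 0 and increment it on each visit where the cookie is missing
  2. If the cookie (set in JavaScript) does exist, reset the session counter to 0
  3. If this counter reaches a limit, ban your user

Code: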

<?php

$no_cookie_limit = 5; // number of allowed pages without a valid cookie

// Start session and reset counter
session_start();

if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
    $_SESSION['cookie_check_count'] = 0;
}

// Initializes cookie (note: rename it to a more discrete name of course) or check cookie value
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
    // Cookie does not exist or is incorrect...
    $_SESSION['cookie_check_count']++;
}
else
{
    // Cookie is properly set so we reset counter
    $_SESSION['cookie_check_count'] = 0;
}

// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<br/>
<a id="reload" href="#">Reload</a>
<br/>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

// Display "set cookie" link
echo <<< EOT

<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#cookie_link').click(function(e) {
    e.preventDefault();
    var expires = new Date();
    expires.setTime(new Date().getTime() + 3600000);
    document.cookie="cookie_check=42;expires=" + expires.toUTCString();
  });

</script>
EOT;


// Display "unset cookie" link
echo <<< EOT

<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#unset_cookie').click(function(e) {
    e.preventDefault();
    document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
  });

</script>
EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}

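Protection against proxies

A few words about the different kinds of proxies found on the web:

  • A "normal" proxy reveals information about the user's connection (notably his IP)
  • An anonymous proxy does not reveal the IP, but its headers indicate that a proxy is in use.
  • A high-anonymity proxy does not reveal the user's IP and sends no information that a browser would not send.

It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymity proxies.

If your user is behind a proxy, $_SERVER may contain some of the following keys (exhaustive list taken from this question):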

  • CLIENT_IP
  • FORWARDED
  • FORWARDED_FOR
  • FORWARDED_FOR_IP
  • HTTP_CLIENT_IP
  • HTTP_FORWARDED
  • HTTP_FORWARDED_FOR
  • HTTP_FORWARDED_FOR_IP
  • HTTP_PC_REMOTE_ADDR
  • HTTP_PROXY_CONNECTION
  • HTTP_VIA
  • HTTP_X_FORWARDED
  • HTTP_X_FORWARDED_FOR
  • HTTP_X_FORWARDED_FOR_IP
  • HTTP_X_IMFORWARDS
  • HTTP_XROXY_CONNECTION
  • VIA
  • X_FORWARDED
  • X_FORWARDED_FOR

If you detect one of these keys in $_SERVER, you may make your anti-crawl protections behave differently (lower limits, etc.).
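
As a rough sketch of that last point, the check could look like the following. The helper names behind_proxy() and visit_limit() and the limits 5 and 2 are made up for illustration; only the key names come from the list above. Keep in mind that these headers are supplied by the client or by intermediaries, and that a site sitting behind its own load balancer or CDN will typically see X-Forwarded-For on every request, so treat the result as a hint rather than proof.

```php
<?php

// Keys taken from the list above; finding one in $_SERVER hints at a proxy.
const PROXY_KEYS = [
    'CLIENT_IP', 'FORWARDED', 'FORWARDED_FOR', 'FORWARDED_FOR_IP',
    'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_FORWARDED_FOR',
    'HTTP_FORWARDED_FOR_IP', 'HTTP_PC_REMOTE_ADDR', 'HTTP_PROXY_CONNECTION',
    'HTTP_VIA', 'HTTP_X_FORWARDED', 'HTTP_X_FORWARDED_FOR',
    'HTTP_X_FORWARDED_FOR_IP', 'HTTP_X_IMFORWARDS', 'HTTP_XROXY_CONNECTION',
    'VIA', 'X_FORWARDED', 'X_FORWARDED_FOR',
];

// Hypothetical helper: true if any proxy-related key is present.
function behind_proxy(array $server): bool
{
    return count(array_intersect_key($server, array_flip(PROXY_KEYS))) > 0;
}

// Hypothetical helper: pick a stricter page limit for proxied visitors.
// The default values 5 and 2 are illustrative only.
function visit_limit(array $server, int $normal = 5, int $proxied = 2): int
{
    return behind_proxy($server) ? $proxied : $normal;
}

// Example: feed the result into the visit counter shown earlier.
$visit_counter_pages = visit_limit($_SERVER);
```

The same pattern would apply to the other limits ($max_sessions, $no_logo_limit, $no_cookie_limit).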


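Conclusion

There are many ways to detect abuse on your website, so you will certainly find a solution. But you need to know precisely how your website is used, so that your protections are not aggressive toward your "normal" users.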

answered 2012-09-13T07:18:06.093
2

Remember: HTTP is not magic. There is a defined set of headers sent with each HTTP request; if a web browser can send these headers, any program can send them as well, including cURL (and libcurl).

Some consider it a curse, but on the other hand, it's a blessing, as it greatly simplifies functional testing of web applications.

UPDATE: As unr3al011 rightly noticed, curl doesn't execute JavaScript, so in theory it's possible to create a page that will behave differently when viewed by grabbers (for example, with setting and, later, checking a specific cookie by JS means).

Still, it would be a very fragile defense. The page's data still has to be fetched from the server, and this HTTP request (it is always an HTTP request) can be emulated by curl. Check this answer for an example of how to defeat such a defense.

... and I didn't even mention that some grabbers are able to execute JavaScript. :)

answered 2012-09-04T06:05:40.920
0

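The way to avoid fake referers is to track the user

You can track the user by one or more of these methods:

  1. Save a cookie in the browser client with some special code (e.g. last visited URL, a timestamp) and verify it on every response your server sends.

  2. Same as before, but using sessions instead of explicit cookies

For cookies, you should add cryptographic security, e.g.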

[Cookie]
url => http://someurl/
hash => dsafdshfdslajfd

The hash is calculated in PHP this way

$url = $_COOKIE['url'] ?? '';
$hash = $_COOKIE['hash'] ?? '';
$secret = 'This is a fixed secret in the code of your application';

// 'sha256' is an assumed choice; any algorithm listed by hash_algos() works
$isValidCookie = hash_equals(hash('sha256', $secret . $url), $hash);

$isValidReferer = $isValidCookie && (($_SERVER['HTTP_REFERER'] ?? '') === $url);
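
The check above only covers the verification side. A minimal sketch of the issuing side might look like this, assuming SHA-256 as the hash algorithm and the same secret-then-URL concatenation as the check. The helper names sign_url() and remember_current_url() are hypothetical, and the URL passed in is assumed to be the absolute URL of the page being served, since that is what the browser sends back as the Referer.

```php
<?php

// Assumption: the same fixed secret that the verification code uses.
$secret = 'This is a fixed secret in the code of your application';

// Hypothetical helper: hash that ties a URL to the server-side secret.
function sign_url(string $url, string $secret): string
{
    return hash('sha256', $secret . $url);
}

// Hypothetical helper: store the URL of the page being served together with
// its hash, so the next request's Referer can be compared against it.
function remember_current_url(string $url, string $secret): void
{
    setcookie('url', $url, 0, '/');
    setcookie('hash', sign_url($url, $secret), 0, '/');
}
```

remember_current_url() has to be called before any output is sent, because setcookie() works through response headers.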
answered 2012-09-13T07:47:46.713
0

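You can detect the cURL user agent with the following function. Note that the user agent can be overridden by the user, but the default settings can be recognized this way: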

function is_curl() {
    // stripos() returns false when 'curl' is absent; a missing header counts as "not curl"
    return stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'curl') !== false;
}
answered 2018-03-10T23:58:41.227
-1

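As some have mentioned, cURL cannot execute JavaScript (as far as I know), so you could try setting something up as raina77ow suggests, but that will not work against other grabbers/downloaders.

I suggest you try building a bot trap to deal with the grabbers/downloaders that can execute JavaScript.

I don't know of any single solution that fully prevents this, so my best recommendation is to try multiple solutions:

1) In your .htaccess file, allow only known user agents, such as all mainstream browsers

2) Set up your robots.txt to keep bots out

3) Set up a bot trap for bots that do not respect the robots.txt file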
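
To make point 3 a little more concrete, here is one possible shape for a bot trap, offered as a sketch rather than a finished solution. It assumes robots.txt contains "Disallow: /trap/" and that every page carries a link to /trap/ that human visitors cannot see, so that only bots ignoring robots.txt end up there. The names TRAP_PREFIX, is_trap_request() and handle_request() are made up for the example, and the in-memory array stands in for whatever store (a database table, for instance) a real site would use.

```php
<?php

// Assumption: robots.txt disallows this prefix and pages link to it invisibly.
const TRAP_PREFIX = '/trap/';

// Hypothetical helper: does this request path fall inside the trap?
function is_trap_request(string $path): bool
{
    return strncmp($path, TRAP_PREFIX, strlen(TRAP_PREFIX)) === 0;
}

// Hypothetical helper: ban the IP when it touches the trap, then return the
// HTTP status to send (403 for banned IPs, 200 otherwise).
function handle_request(string $path, string $ip, array &$banned): int
{
    if (is_trap_request($path)) {
        $banned[$ip] = time(); // remember when the ban started
    }
    return isset($banned[$ip]) ? 403 : 200;
}
```

Well-behaved crawlers that honor robots.txt never request the trap path, so they are not affected.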

answered 2012-09-12T17:19:39.060
-4

Put this in your root folder as a .htaccess file. It may help. I found it on a web hosting provider's site, but I don't know what it means :)

SetEnvIf User-Agent ^Teleport graber   
SetEnvIf User-Agent ^w3m graber    
SetEnvIf User-Agent ^Offline graber   
SetEnvIf User-Agent Downloader graber  
SetEnvIf User-Agent snake graber  
SetEnvIf User-Agent Xenu graber   
Deny from env=graber
answered 2012-09-05T08:04:57.647