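It is indeed a cookie set by JavaScript, after which the page redirects to the original image. The problem is that curl/fgc (file_get_contents) will not parse the HTML or run the script, so the cookie never gets set; the only cookies curl stores in its cookie jar are the ones set by the server.
This is the code you get before the redirect. It uses JavaScript to create a cookie with no name and the literal string location.href as its value: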
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<HEAD>
<TITLE>http://phim.xixam.com/thumb/giotdang.jpeg</TITLE>
<meta http-equiv="Refresh" content="0;url=http://phim.xixam.com/thumb/giotdang.jpeg">
</HEAD>
<script type="text/javascript">
window.onload = function checknow() {
var today = new Date();
var expires = 3600000*1*1;
var expires_date = new Date(today.getTime() + (expires));
var ua = navigator.userAgent.toLowerCase();
if ( ua.indexOf( "safari" ) != -1 ) { document.cookie = "location.href"; } else { document.cookie = "location.href;expires=" + expires_date.toGMTString(); }
}
</script>
<BODY>
</BODY></HTML>
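But all is not lost: by pre-setting (forging) the cookie yourself you can get around this security measure, which is one reason why relying on cookies for any kind of security is a bad idea.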
cookie.txt (the fields on the cookie line must be separated by tabs):
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
phim.xixam.com FALSE /thumb/ FALSE 1338867990 location.href
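Because a cookie file is easy to break when its tabs get turned into spaces, it can help to build the line in code. The following is only a sketch: netscape_cookie_line() is a hypothetical helper, not part of curl or PHP, and the cookie name and value in the demo are placeholders. It assumes the usual seven-field Netscape layout (domain, subdomain flag, path, secure flag, expiry, name, value).

```php
<?php
// Hypothetical helper: build one line of a Netscape-format cookie file.
// The seven fields are joined with tabs, never spaces.
function netscape_cookie_line($domain, $path, $expire, $name, $value,
                              $includeSubdomains = false, $secure = false)
{
    return implode("\t", array(
        $domain,
        $includeSubdomains ? 'TRUE' : 'FALSE', // match subdomains too?
        $path,
        $secure ? 'TRUE' : 'FALSE',            // send over HTTPS only?
        (string) $expire,                      // Unix timestamp in seconds
        $name,
        $value,
    ));
}

// Placeholder name/value; expiry one hour from now.
$line = netscape_cookie_line('phim.xixam.com', '/thumb/', time() + 3600,
                             'example_name', 'example_value');
echo $line, "\n";
```

Writing that string (plus a trailing newline) to the file passed as CURLOPT_COOKIEFILE should have the same effect as hand-editing cookie.txt.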
So the finished curl script looks like this:
<?php
function curl_get($url){
    if (!function_exists('curl_init')) {
        die('cURL must be installed!');
    }

    // Forge the cookie. Fields in a Netscape cookie file must be
    // separated by tabs, so join them explicitly.
    $cookieFile = dirname(__FILE__).'/cookie.txt';
    // One hour. The page's script uses 3600000 because JavaScript counts
    // milliseconds; time() counts seconds.
    $expire = time()+3600;
    $cookie = "# Netscape HTTP Cookie File\n"
            . "# http://curl.haxx.se/rfc/cookie_spec.html\n"
            . implode("\t", array('phim.xixam.com', 'FALSE', '/thumb/', 'FALSE', $expire, 'location.href'))
            . "\n";
    file_put_contents($cookieFile, $cookie);

    // Browser masquerade cURL request
    $curl = curl_init();
    $header = array();
    $header[] = "Accept: text/xml,application/xml,application/json,application/xhtml+xml,"
              . "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";

    curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieFile);
    curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieFile);
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    // Pass the referer check
    curl_setopt($curl, CURLOPT_REFERER, 'http://xixam.com/forum.php');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

    // curl_exec() returns the body as a string, or false on failure
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}

$image = curl_get('http://phim.xixam.com/thumb/giotdang.jpeg');
if ($image === false) {
    die('Request failed');
}
file_put_contents('test.jpg',$image);
?>
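The only way to slow scrapers down is to log every visitor's IP in a database and increment a counter for that IP on each hit. Then, once a week or so, look at the IPs with the most hits, do a reverse lookup on each one, and see whether it belongs to a hosting provider; if it does, block it in your firewall or .htaccess. Beyond that, you cannot really stop requests for a resource that is publicly available, because any obstacle can be overcome.
Hope it helps.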
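The counting idea can be sketched roughly as follows. This is an illustration under assumptions, not a drop-in solution: IpHitCounter and reverse_lookup() are hypothetical names, the counts live in memory rather than in a database, and the IPs are documentation-range examples.

```php
<?php
// Hypothetical in-memory hit counter; a real deployment would persist
// the counts in a database table keyed by IP address.
class IpHitCounter
{
    private $hits = array();

    // Record one request from the given IP address.
    public function record($ip)
    {
        if (!isset($this->hits[$ip])) {
            $this->hits[$ip] = 0;
        }
        $this->hits[$ip]++;
    }

    // Return the $limit busiest IPs as [ip => hits], highest first.
    public function top($limit = 10)
    {
        $sorted = $this->hits;
        arsort($sorted);
        return array_slice($sorted, 0, $limit, true);
    }
}

// Reverse-resolve an IP for manual review. gethostbyaddr() returns the
// unmodified IP when the lookup fails and false on malformed input;
// both cases are mapped to null here.
function reverse_lookup($ip)
{
    $host = gethostbyaddr($ip);
    return ($host === false || $host === $ip) ? null : $host;
}

$counter = new IpHitCounter();
foreach (array('203.0.113.5', '198.51.100.7', '203.0.113.5') as $ip) {
    $counter->record($ip);
}
print_r($counter->top(1));
```

In practice record() would be called with $_SERVER['REMOTE_ADDR'] on each request, and reverse_lookup() would be run by hand on the top entries to decide whether a host name looks like a hosting provider before blocking it.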