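I wrote a crawler, and it is apparently being blocked by some websites. What I'd like to do is fetch pages with a fake user agent string (something like Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1). Note that fsockopen() does not support this directly, so I'm trying to do it a different way.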

This is my code:

$sock = fsockopen($host, 80, $errno, $errstr, 30);

Then, further down, this is what I do with the socket handle:

    $request  = "HEAD "  . $path . " HTTP/1.1\r\n"; 
    $request .= 'Host: ' . $host . "\r\n"; 
    $request .= "Connection: Close\r\n\r\n"; 
    fwrite($sock, $request);

Again, how do I set a fake browser user agent? Can I set it in the $request string?

3 Answers


With fsockopen, you can add the User-Agent just like any other header:

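$sock = fsockopen($host, 80, $errno, $errstr, 30);
if (!$sock) {
    // fsockopen() returns false on failure; $errstr and $errno describe why.
    die("Connection failed: $errstr ($errno)");
}

$request  = "HEAD " . $path . " HTTP/1.1\r\n";
$request .= 'Host: ' . $host . "\r\n";
// The User-Agent is sent like any other request header.
$request .= "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1\r\n";
$request .= "Connection: Close\r\n\r\n";

fwrite($sock, $request);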

Tested with PHP 5.3.
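
As a rough sketch of the same idea in reusable form, the request string can be built by a small function, which also makes it easy to check what is actually sent. The helper names `build_head_request` and `fetch_status_line` are hypothetical, and the type hints assume PHP 7.1 or later rather than the 5.3 mentioned above.

```php
<?php
// Hypothetical helper: builds a raw HTTP/1.1 HEAD request with a custom User-Agent.
function build_head_request(string $host, string $path, string $userAgent): string
{
    // Strip CR/LF so a caller-supplied value cannot inject extra header lines.
    $userAgent = str_replace(["\r", "\n"], '', $userAgent);

    return "HEAD {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "User-Agent: {$userAgent}\r\n"
         . "Connection: close\r\n\r\n";
}

// Hypothetical helper: sends the request on port 80 and returns the status line,
// or null if the connection or the read failed.
function fetch_status_line(string $host, string $path, string $userAgent): ?string
{
    $sock = @fsockopen($host, 80, $errno, $errstr, 30);
    if ($sock === false) {
        return null; // connection failed; $errno and $errstr hold the reason
    }
    fwrite($sock, build_head_request($host, $path, $userAgent));
    $statusLine = fgets($sock); // first line of the response, e.g. the HTTP status line
    fclose($sock);

    return $statusLine === false ? null : rtrim($statusLine);
}
```

Calling `fetch_status_line($host, $path, $agent)` with your own values would then return the first response line, which is usually enough to see whether the site still rejects the request.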

answered 2014-07-09T04:03:40.370

If you use PHP cURL (as your tags suggest), you should be able to do:

curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1'));
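
For a HEAD request like the one in the question, a minimal sketch might combine `CURLOPT_USERAGENT` (a dedicated option for this header) with `CURLOPT_NOBODY`. The helper names `head_request_options` and `head_status` and the 30-second timeout are assumptions for illustration; the code needs the cURL extension and PHP 7.1 or later.

```php
<?php
// Hypothetical helper: cURL options for a HEAD request with a custom User-Agent.
function head_request_options(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent, // sets the User-Agent request header
        CURLOPT_NOBODY         => true,       // send HEAD instead of GET
        CURLOPT_HEADER         => true,       // include response headers in the output
        CURLOPT_RETURNTRANSFER => true,       // return the response instead of printing it
        CURLOPT_TIMEOUT        => 30,         // assumed value, matching the fsockopen timeout
    ];
}

// Hypothetical helper: returns the HTTP status code, or null if the transfer failed.
function head_status(string $url, string $userAgent): ?int
{
    $ch = curl_init();
    curl_setopt_array($ch, head_request_options($url, $userAgent));
    $ok = curl_exec($ch) !== false;

    return $ok ? curl_getinfo($ch, CURLINFO_HTTP_CODE) : null;
}
```

Keeping the options in an array makes it simple to reuse the same user agent across every URL the crawler visits.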
answered 2012-11-12T18:22:03.597

This works for me:

// $url and $timeout are assumed to be defined earlier by the caller.
$cookie = tempnam("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # disables certificate verification (insecure); https URLs do not require this
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
$content = curl_exec( $ch );
echo $content;
$response = curl_getinfo( $ch );
curl_close ( $ch );
answered 2014-06-28T22:41:44.137