php - Wikipedia doesn't like file_get_contents

Question

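I use the PHP function file_get_contents as a proxy to fetch websites, running on two different web hosts.

It works for every website except Wikipedia.

It gives me this output every time:

WIKIMEDIA FOUNDATION
Error
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes.

Does anyone know what the problem is?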

score 5 · Accepted Answer

You are probably not sending a proper User-Agent header.

You should pass a context to file_get_contents:
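The snippet that originally followed this answer is not preserved here. As a minimal sketch of what passing a context might look like: the helper names buildContextOptions and fetchPage, the timeout, and the example tool name and URL are all assumptions for illustration, not the answer's original code. Fetching an https:// URL this way also assumes the openssl extension is enabled.

```php
<?php
// Hypothetical helper: builds HTTP context options carrying a descriptive User-Agent.
function buildContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds; an arbitrary illustrative value
        ],
    ];
}

// Hypothetical helper: fetches a URL with the context above.
// Returns the response body as a string, or false on failure.
function fetchPage(string $url, string $userAgent)
{
    $context = stream_context_create(buildContextOptions($userAgent));
    // The second argument (use_include_path) stays false; the third is the context.
    return file_get_contents($url, false, $context);
}

// Example call (performs a real network request, so it is left commented out):
// $html = fetchPage(
//     'https://en.wikipedia.org/wiki/PHP',
//     'MyCoolTool/1.1 (https://example.com/MyCoolTool/; MyCoolTool@example.com)'
// );
```

Replace the tool name, URL, and e-mail address with values that actually identify your script.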

score 1 · Accepted Answer

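Wikimedia Foundation policy is to block requests with a non-descriptive or missing User-Agent header, since these tend to originate from misbehaving scripts. "PHP" is one of the blacklisted values for this header.

You should change the default User-Agent header to one that identifies your script and tells the system administrators how to contact you if necessary: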

ini_set('user_agent', 'MyCoolTool/1.1 (http://example.com/MyCoolTool/; MyCoolTool@example.com)');

Of course, be sure to change the name, URL, and e-mail address rather than copying the code verbatim.

score 0 · Accepted Answer

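Wikipedia requires a User-Agent HTTP header to be sent with the request. By default, file_get_contents does not send one.

You should use fsockopen, fputs, feof and fgets to send a full HTTP request, or you can do it with cURL. My personal experience is with the f* functions, so here is an example: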

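$attempts = 0;
// Wikipedia serves pages over HTTPS, so connect with TLS on port 443
// (requires the openssl extension). Retry the connection up to 5 times.
do {
    $fp = @fsockopen("ssl://en.wikipedia.org", 443, $errno, $errstr, 5);
    $attempts++;
} while (!$fp && $attempts < 5);
if (!$fp) die("Failed to connect: $errstr ($errno)");
fputs($fp, "GET /wiki/Page_name_here HTTP/1.0\r\n"
     ."Host: en.wikipedia.org\r\n"
     ."User-Agent: PHP-scraper (your-email@yourwebsite.com)\r\n\r\n");
// Read the whole response until the server closes the connection.
$out = "";
while (!feof($fp)) {
    $out .= fgets($fp);
}
fclose($fp);
// Split headers from body at the first blank line only.
list($head, $body) = explode("\r\n\r\n", $out, 2);
$head = explode("\r\n", $head);
list($http, $status, $statustext) = explode(" ", array_shift($head), 3);
if ($status != 200) die("HTTP status ".$status." ".$statustext);
echo $body;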

score 0 · Accepted Answer

Use cURL for this:

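$ch = curl_init('http://wikipedia.org');
curl_setopt_array($ch, array(
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 5.1; rv:18.0) Gecko/20100101 Firefox/18.0',
    CURLOPT_FOLLOWLOCATION => true,  // follow the redirect to the HTTPS site
    CURLOPT_RETURNTRANSFER => true   // return the body instead of printing it
));
$data = curl_exec($ch);
if ($data === false) die("cURL error: ".curl_error($ch));
echo $data;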

score -1 · Accepted Answer

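I assume you have already "tried again in a few minutes".

The next thing you could try is using cURL instead of file_get_contents and setting the user agent to that of a common browser.

If it still doesn't work, it should at least give you more information.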
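As a rough sketch of how cURL can surface that extra information, assuming the curl extension is installed: the helper names describeFetch and fetchWithDiagnostics and the timeout value are hypothetical, chosen only for illustration. It reports either the transport error or the HTTP status code alongside the body.

```php
<?php
// Hypothetical helper: turns cURL's status information into a one-line diagnostic.
function describeFetch(int $httpCode, string $curlError): string
{
    if ($curlError !== '') {
        return "Transport error: {$curlError}";
    }
    if ($httpCode >= 400) {
        return "Server answered with HTTP {$httpCode}";
    }
    return "OK (HTTP {$httpCode})";
}

// Hypothetical helper: fetches a URL and returns [body or false, diagnostic string].
function fetchWithDiagnostics(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,   // seconds; an arbitrary illustrative value
    ]);
    $body = curl_exec($ch);
    $report = describeFetch(
        (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        curl_error($ch)
    );
    return [$body, $report];
}

// Example call (performs a real network request, so it is left commented out):
// list($body, $report) = fetchWithDiagnostics(
//     'https://en.wikipedia.org/wiki/PHP',
//     'MyCoolTool/1.1 (https://example.com/MyCoolTool/; MyCoolTool@example.com)'
// );
// echo $report, "\n";
```

An HTTP 403 in the report would be consistent with a rejected User-Agent, while a transport error would point to a connectivity or TLS problem instead.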
