Hi, I have a domain that I want to parse with cURL. The situation is as follows:

When I go to http://register.metsad.ee/avalik/info_teatis.php?too_id=2942704201

it redirects me to [register.metsad.ee/avalik/info_teatis.php?too_id=2942704201]

The same happens without http://www. The code I use for parsing is:

function get_data($url) {
        $ch = curl_init();
        $timeout = 5;
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }
$src = 'http://register.metsad.ee/avalik/info_teatis.php?too_id=2942704201';

Then I call $c = get_data($src); echo $c; and the result is a blank white page. I also tried the Simple_Html_Dom parser like this:

echo file_get_html($src)->plaintext;

But I still get a blank white page. When I try to parse the URL without http://, I get an error:

Warning: file_get_contents(register.metsad.ee/avalik/info_teatis.php?too_id=2942704201) [function.file-get-contents]: failed to open stream: Result too large in C:\xampp\htdocs\Trash\metsakontroll\system\c_simple_html_dom.php on line 70

cURL still gives a white screen, with no effect. When I try to address it as a folder, like this:

http://www.metsad.ee/register/avalik/info_teatis.php?too_id=2942704201 then the server says Not Found.

I have searched the whole internet =/ Any ideas how to read that page with cURL or Simple_html_dom?

1 Answer

There is some kind of protection on the register.metsad.ee side. They return an empty response unless the User-Agent header is set.

Failed call (empty response):

feedbee@server:~$ telnet register.metsad.ee 80
Trying 213.184.43.115...
Connected to register.metsad.ee.
Escape character is '^]'.
GET /avalik/info_teatis.php?too_id=2942704201 HTTP/1.1
Host: register.metsad.ee

HTTP/1.1 200 OK
Date: Thu, 13 Dec 2012 20:07:11 GMT
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8

Successful call (returns the HTML page):

feedbee@server:~$ telnet register.metsad.ee 80
GET http://register.metsad.ee/avalik/info_teatis.php?too_id=2942704201 HTTP/1.1
Host: register.metsad.ee
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0

HTTP/1.1 200 OK
Date: Thu, 13 Dec 2012 20:13:07 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: SNS=a0e425c2aec17c38be3716b366f75749; path=/
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

762
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...
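
The comparison shown in the two telnet sessions can probably be reproduced from PHP as well. The sketch below is only an illustration: make_context() and body_length() are hypothetical helper names, not part of any library. It relies on the 'user_agent' option of PHP's HTTP stream context. As far as I know, PHP's http:// wrapper sends no User-Agent header unless that option (or the user_agent ini setting) is set, which would be consistent with file_get_contents() and Simple_Html_Dom also returning a blank page here.

```php
<?php
// Hypothetical helper: build an HTTP stream context, optionally with a User-Agent.
function make_context(?string $userAgent = null)
{
    $http = ['method' => 'GET', 'timeout' => 5];
    if ($userAgent !== null) {
        // 'user_agent' is the HTTP context option that sets the User-Agent header.
        $http['user_agent'] = $userAgent;
    }
    return stream_context_create(['http' => $http]);
}

// Hypothetical helper: return the length of the response body, or -1 on failure.
function body_length(string $url, ?string $userAgent = null): int
{
    $body = @file_get_contents($url, false, make_context($userAgent));
    return $body === false ? -1 : strlen($body);
}
```

Calling body_length($src) and then body_length($src, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0') should, if the server still behaves as in the sessions above, give 0 for the first call and a positive number for the second.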

So you need to add

curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");

for example (or any other user agent string).
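
Applied to the get_data() function from the question, the change might look like the sketch below. build_curl_options() is a hypothetical helper I introduce only to keep the options in one place, and the extra $userAgent parameter and the error check are my additions, not part of the original code. It assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper: collect the cURL options in one array so they are easy to inspect.
function build_curl_options(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_USERAGENT      => $userAgent, // the header the server appears to require
    ];
}

function get_data(string $url, string $userAgent)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $userAgent));
    $data = curl_exec($ch);
    if ($data === false) {
        // Report transport errors instead of silently showing a blank page.
        trigger_error('cURL error: ' . curl_error($ch), E_USER_WARNING);
    }
    curl_close($ch);
    return $data;
}
```

The call then becomes get_data($src, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'). If you want to keep using Simple_html_dom, the returned string can be passed to its str_get_html() function instead of fetching the URL through file_get_html().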
answered 2012-12-13T20:14:20.510