php - simple_html_dom does not fetch data from some websites

Question

simple_html_dom does not fetch data from some websites. For www.google.pl it downloads the page source, but for others, such as gearbest.com and stooq.pl, it downloads nothing.

require('simple_html_dom.php');

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/"); // works

/*
curl_setopt($ch, CURLOPT_URL, "https://www.gearbest.com/"); // doesn't work
curl_setopt($ch, CURLOPT_URL, "https://stooq.pl/"); // doesn't work
*/

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);

$html = new simple_html_dom();
$html->load($response);

echo $html;

What changes should I make to the code so that it receives data from these websites?

score 0 · Accepted Answer

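The underlying problem here (at least on my machine, which may differ from your setup...) is that the site returns gzip-compressed data, and PHP and curl do not decompress it properly before it is passed to the DOM parser. If you are on PHP 5.4 or later, you can use gzdecode and file_get_contents to decompress it yourself. The example below uses gzinflate on the payload with the gzip header and trailer stripped off, which also works on older versions.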
<?php
    // simple_html_dom provides str_get_html()
    include("simple_html_dom.php");
    // download the page (this server sends the body gzip-compressed)
    $data = file_get_contents("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035");
    // decompress it (a bit hacky: strip the 10-byte gzip header and the
    // 8-byte trailer, then inflate the raw deflate stream in between)
    $data = gzinflate(substr($data, 10, -8));
    // parse and use
    $html = str_get_html($data);
    echo $html->root->innertext();
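Note that this hack will not work on most websites. As far as I can tell, the main reason behind the problem is that curl does not announce that it accepts gzip data... but the web server on that domain pays no attention to that header and gzips the response anyway. Then neither curl nor PHP actually checks the Content-Encoding header on the response; they assume the body is not gzipped, so it passes through without an error and without a call to gunzip. Both the server and the client are at fault here!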
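
If the client side of the problem is that curl never announces gzip support, one thing worth trying is letting curl negotiate and decode compression itself. The following is a minimal sketch, not a verified fix for the sites in the question: `fetch_html` is a hypothetical helper name, the 30-second timeout is an arbitrary choice, and it assumes the curl extension is installed. Setting `CURLOPT_ENCODING` to an empty string makes curl send an `Accept-Encoding` header listing every encoding it supports and decode the response accordingly.

```php
<?php
// Hypothetical helper: fetch a page and let cURL handle compression.
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Empty string: advertise all encodings this cURL build supports
        // and transparently decode the response body.
        CURLOPT_ENCODING       => '',
        CURLOPT_TIMEOUT        => 30, // arbitrary value for this sketch
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope.
    return $body;
}
```

The returned string can then be passed to `str_get_html()` as in the code above. Whether this is enough for a given site depends on how that server behaves; it only addresses the compression negotiation described here.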

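For a more robust solution, you could use curl to fetch the headers and inspect them yourself to decide whether the body needs to be decompressed. Alternatively, to keep things simple, you could use this hack for this one site and the normal method for all others.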
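
The header-checking approach could look roughly like the sketch below. It assumes PHP 7 or later with the curl and zlib extensions; `maybe_gunzip` and `fetch_and_decode` are hypothetical names introduced for illustration. The body is decompressed only if the response declares gzip in `Content-Encoding` or starts with the gzip magic bytes, and it is returned unchanged if decompression fails.

```php
<?php
// Hypothetical helper: gunzip $body only when it appears to be gzip data.
function maybe_gunzip(string $body, string $contentEncoding = ''): string
{
    $declared = stripos($contentEncoding, 'gzip') !== false;
    // Every gzip stream starts with the magic bytes 0x1f 0x8b.
    $looksGzipped = strncmp($body, "\x1f\x8b", 2) === 0;

    if ($declared || $looksGzipped) {
        $decoded = @gzdecode($body); // returns false on invalid data
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body; // not gzip, or decoding failed: hand back the original
}

// Hypothetical helper: fetch a URL, record Content-Encoding, then decode.
function fetch_and_decode(string $url): string
{
    $contentEncoding = '';
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HEADERFUNCTION => function ($ch, string $line) use (&$contentEncoding): int {
            if (strncasecmp($line, 'HTTP/', 5) === 0) {
                // New response (for example after a redirect): start over.
                $contentEncoding = '';
            } elseif (stripos($line, 'Content-Encoding:') === 0) {
                $contentEncoding = trim(substr($line, strlen('Content-Encoding:')));
            }
            return strlen($line); // cURL expects the number of bytes handled
        },
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return maybe_gunzip($body, $contentEncoding);
}
```

Checking the magic bytes as well as the header covers the case where a server compresses the body without declaring it, at the cost of occasionally attempting to decode binary content that merely starts with the same two bytes; the fallback to the original body keeps that harmless.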

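Setting the character encoding on your output may still help. Add this before you echo anything, so that the data you read does not get corrupted again in the user's browser by being interpreted with the wrong character set: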
header('Content-Type: text/html; charset=utf-8');
