php - Echoing UTF-8 text when scraping a web page

Question

I am using this code to scrape specific data from a website:

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">

    <title>scrap</title>
  </head>
  <body>
<?php
$url = 'http://xn--mgbaam1d9c.com';
$html = file_get_contents( $url);

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);

// A name attribute on a <div>???
$node = $xpath->query( '//div[@class="list"]')->item( 0);

echo $node->textContent; 

?>

</body>
</html>

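It scrapes fine, but:

The output shows only 1 result, and I want it to show all results (the website has pagination).
The Arabic text is not displayed correctly, as shown in this image - http://i.stack.imgur.com/Z9VMn.png

So how can I make it fetch all the results and display them in Arabic as they appear on the site?

Thanks in advance.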

score 2 · Accepted Answer

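You only get the first item because of .item(0). Look at what $xpath->query returns: a DOMNodeList, which has a length property.
Convert the encoding from windows-1256 to utf-8 using iconv.

Something like this: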

$nodeList = $xpath->query( '//div[@class="list"]');

for ( $i = 0; $i < $nodeList->length; $i++ ) {
    $node = $nodeList->item($i);
    echo iconv('WINDOWS-1256','UTF-8',$node->textContent);
}
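
As a variation on the loop above, a DOMNodeList can also be traversed with foreach. The sketch below is only an illustration: the inline $html string is made-up markup standing in for the fetched page, and the encoding conversion is left out so that the iteration stands on its own.

```php
<?php
// Made-up markup standing in for the fetched page (an assumption for this sketch).
$html = '<html><body>'
      . '<div class="list">first</div>'
      . '<div class="list">second</div>'
      . '<div class="other">ignored</div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// DOMNodeList is traversable, so every matching node is visited in document order.
$texts = [];
foreach ($xpath->query('//div[@class="list"]') as $node) {
    $texts[] = trim($node->textContent);
}

echo implode("\n", $texts), "\n";
```

Collecting the text into an array first, rather than echoing inside the loop, makes it easier to post-process the results before output.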

Edit: mb_convert_encoding does not support windows-1256, so I changed it to iconv.
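
To check the conversion in isolation, the following minimal sketch runs iconv on a hard-coded byte string. The bytes are sample data chosen for illustration (a short Arabic word encoded as Windows-1256), not content from the site in the question, and the sketch assumes the local iconv build recognizes the WINDOWS-1256 charset name.

```php
<?php
// Sample Windows-1256 bytes (illustrative data, not taken from the scraped site).
$raw = "\xE3\xD1\xCD\xC8\xC7";

// iconv() returns false when the conversion fails, so check before using the result.
$utf8 = iconv('WINDOWS-1256', 'UTF-8', $raw);
if ($utf8 === false) {
    throw new RuntimeException('Conversion from WINDOWS-1256 failed');
}

echo $utf8, "\n";
```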

You can also retrieve the content encoding dynamically from the HTML meta tag:

$fromEncoding = '';
$contentType = $xpath->query('//meta[@http-equiv="content-type"]')->item(0)->getAttribute('content');
preg_match('/charset=([A-Za-z0-9_-]+)$/',$contentType,$contentTypeMatches);
if ( isset($contentTypeMatches[1]) ) {
    $fromEncoding = strtoupper($contentTypeMatches[1]);
}
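
The snippet above raises a fatal error if the page has no such meta tag, because item(0) then returns null, and its XPath match on the attribute value is case-sensitive. A more defensive variant might look like the sketch below. detectCharset is a hypothetical helper name, the UTF-8 default is an assumption, and the inline HTML strings are made-up test input.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the charset
// declared in <meta http-equiv="content-type">, or $default when none is found.
function detectCharset(DOMXPath $xpath, string $default = 'UTF-8'): string
{
    // translate() lowercases the attribute value so "Content-Type" also matches.
    $meta = $xpath->query(
        '//meta[translate(@http-equiv, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", '
        . '"abcdefghijklmnopqrstuvwxyz") = "content-type"]'
    )->item(0);

    if (!$meta instanceof DOMElement) {
        return $default; // no matching meta tag
    }
    if (preg_match('/charset=([A-Za-z0-9_-]+)/i', $meta->getAttribute('content'), $m)) {
        return strtoupper($m[1]);
    }
    return $default; // meta tag present but without a charset
}

libxml_use_internal_errors(true);

// Made-up page that declares its charset.
$withMeta = new DOMDocument;
$withMeta->loadHTML(
    '<html><head><meta http-equiv="Content-Type" '
    . 'content="text/html; charset=windows-1256"></head><body></body></html>'
);
$declared = detectCharset(new DOMXPath($withMeta));

// Made-up page without any charset declaration.
$withoutMeta = new DOMDocument;
$withoutMeta->loadHTML('<html><head><title>t</title></head><body></body></html>');
$fallback = detectCharset(new DOMXPath($withoutMeta));

echo $declared, "\n", $fallback, "\n";
```

The returned name can then be passed as the first argument to iconv in place of the hard-coded 'WINDOWS-1256'.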
