php - Scraping a page with PHP results in unexpected characters

Question

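I am using PHP to scrape some data from a web page, and I am somehow picking up unexpected characters that do not exist in the source document. I think this is because I am interpreting the character encoding incorrectly, but I am not sure how to fix it.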

Here is a sample of the HTML that gives me the error:

<tr>
    <td>Aug 2013</td>
    <td>TEDxColbyCollege</td>
    <td>
        <a href="/talks/daniel_h_cohen_for_argument_s_sake.html">Daniel H. Cohen: For argument’s sake</a>       </td>
   . 
   . 
   . 
// more of the table

The resulting string, both when echoed and when stored in the database, looks like this: Daniel H. Cohen: For argumentÃ¢ÂÂs sake

I am using the following code to load the HTML document and scrape it:

$html = file_get_contents('url_of_html_page_being_scrapped');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
$table = $sxml->xpath('//table');
foreach ($table[0]->tr as $vid)
{
 .
 .
 echo $vid->td[2]->a;  // line giving me the problem
 .
 .
}

The head of the document states:

 <!doctype html>
 <html lang="en">
 <head>
 <meta charset="utf-8">
 .
 .
 </head>

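So I assume my approach is not interpreting the character set correctly, although I am not sure how to specify it, or whether that is even the problem. The error also seems to occur only on this character: ’. Any insight into what is happening and how I can fix it would be great.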
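A commonly reported cause of this symptom is that DOMDocument::loadHTML() treats the input as ISO-8859-1 when it cannot determine the encoding, and depending on the libxml2 version it may not pick up the HTML5-style <meta charset> tag. The sketch below shows one widely used workaround, prefixing the markup with an XML encoding declaration. The inline $html string is a made-up stand-in for the fetched page, not the real TED markup.

```php
<?php
// Hypothetical stand-in for the result of file_get_contents() on the scraped page.
$html = '<!doctype html><html lang="en"><head><meta charset="utf-8"></head><body>'
      . '<table><tr><td>Aug 2013</td><td>TEDxColbyCollege</td>'
      . '<td><a href="/talks/daniel_h_cohen_for_argument_s_sake.html">'
      . 'Daniel H. Cohen: For argument’s sake</a></td></tr></table>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The XML declaration tells the parser that the bytes are UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Read the link text from the third cell of the first table row.
$xpath = new DOMXPath($doc);
$title = $xpath->query('//table//tr/td[3]/a')->item(0)->textContent;

echo $title, PHP_EOL;
```

If this prints the apostrophe correctly in a UTF-8 terminal, any remaining garbling is more likely to come from how the string is displayed or stored than from the parsing step.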

Update: After some suggestions from @Patrick Manser, I tried solutions I found elsewhere on SO.

Mainly:

 $html =stripslashes(mb_convert_encoding( file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8" ));
 //AND
 $html = mb_convert_encoding( file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8" );

Both result in output that looks like this: Daniel H. Cohen: For argumentâ€™s sake
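The sequence â€™ is what the three UTF-8 bytes of ’ look like when they are decoded as Windows-1252, which suggests that at this point the string may already be valid UTF-8 and is only being displayed in the wrong charset. A minimal sketch that reproduces and reverses the effect, assuming the mbstring extension is available (variable names are illustrative):

```php
<?php
// The right single quotation mark (U+2019) as raw UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// Misreading those bytes as Windows-1252 and re-encoding them as UTF-8
// produces three separate characters.
$mojibake = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');
echo $mojibake, PHP_EOL;

// Reversing the conversion recovers the original three bytes.
$restored = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
echo bin2hex($restored), PHP_EOL;
```

Checking the raw bytes with bin2hex() is a quick way to tell whether a string is actually damaged or merely rendered with the wrong encoding.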

score 1 · Accepted Answer

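Although the text still looks garbled when echoed and in my database tables, putting this line in the head of the HTML document that displays the data makes the ’ render correctly: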

 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
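As a rough illustration of that fix, the sketch below emits a page that declares UTF-8 both in the meta tag and in the HTTP response header. The helper renderTalkTitle() is a hypothetical name introduced only for this example, and sending the header as well is an extra precaution that the answer itself does not mention.

```php
<?php
// Hypothetical helper: wraps a scraped UTF-8 string in a page that declares its charset.
function renderTalkTitle(string $title): string
{
    return "<!doctype html>\n<html>\n<head>\n"
         . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . "\n"
         . "</head>\n<body>\n"
         . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "\n"
         . "</body>\n</html>\n";
}

// Declare the charset in the HTTP response as well (has no visible effect on the CLI).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderTalkTitle("Daniel H. Cohen: For argument\u{2019}s sake");
```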

score 1 · Accepted Answer

Even with htmlspecialchars_decode(), html_entity_decode(), and mb_convert_encoding() applied correctly, this problem can be hard to get rid of.

I use a modified version of Sebastián Grignoli's forceUTF8() function to clean strings thoroughly. I don't know of anything better suited to this in PHP.

You can find a version of the function on GitHub.

If you really need a thorough cleanup regardless of the characters involved, it produces amazing results.

Here are the examples from the README.

Example usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÃ©dÃÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÃÃ©dÃÃÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÃÃÃ©dÃÃÃÃ©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
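To give a feel for what such a repair involves, here is a deliberately simplified sketch that peels off layers of "UTF-8 bytes read as Windows-1252" one at a time. It is not the forceutf8 library's algorithm; the function name undoDoubleEncoding() and the round limit are assumptions made for this example, and it relies on the mbstring extension.

```php
<?php
// Hypothetical helper: repeatedly undoes one layer of mis-decoding,
// stopping as soon as another round would no longer be safe.
function undoDoubleEncoding(string $s, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Map each character back to the single byte it was misread from.
        $candidate = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

        // Stop if nothing changed or the result is not valid UTF-8.
        if ($candidate === $s || !mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        // Stop if the step lost information (characters outside Windows-1252).
        if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $s) {
            break;
        }
        $s = $candidate;
    }
    return $s;
}

// "Fédération" with each é (C3 A9) misread as the two characters Ã and ©.
$garbled = "F\u{00C3}\u{00A9}d\u{00C3}\u{00A9}ration Camerounaise de Football";
echo undoDoubleEncoding($garbled), PHP_EOL;
```

The validity and round-trip checks are there so that text that is already correct, including text with characters outside Windows-1252, is returned unchanged rather than damaged.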

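Edit

Also, note that if you use a web-based DB browser such as phpMyAdmin, you may see character differences between the encoding of what is stored in the DB and the encoding the web page declares. I have had cases where the content stored in the database was entirely correct but looked wrong in the interface.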
