I have converted the results of a web scrape from DOMNodeLists to strings:

$node = $the_sentence->item(0);
$the_sentence = "{$node->nodeName} - {$node->nodeValue}";

But now when I print the result, it includes the tag the text was in on the page, as well as stray Â characters:

Before:

"This is the sentence"

Now:

"h2 - This is the Âsentence Â"

Any ideas on how to get rid of these characters? Thanks for your help.

1 Answer

This looks like a character set problem.

Have a look at the source page and see what character set it is encoded in. This might be in a Content-Type HTTP header, or it might be in a <meta> tag at the start of the document. Then, when you handle the data, make sure that everything you do handles it in the same format.
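As a rough illustration of that check, here is a minimal sketch that looks for a charset first in a Content-Type header value and then in a <meta> tag. The function name detect_charset and the UTF-8 fallback are assumptions made for this example, not part of any library, and the regular expressions only cover the common forms of the declaration.

```php
<?php
// Hypothetical helper: guess the declared charset of a scraped page.
// $contentTypeHeader is the raw Content-Type header value, if you have one.
function detect_charset(?string $contentTypeHeader, string $html, string $default = 'UTF-8'): string
{
    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*"?([\w-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }

    // 2. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared: fall back to an assumed default.
    return $default;
}
```

When both are present, the header is checked first here because browsers give it priority over the <meta> declaration.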

You probably want to store the data in UTF-8. Thus, if you capture in another format, in general it is a good idea to convert it from that charset to UTF-8; this will mean you can capture from a wide range of sources and store it in the same database. Look at iconv in the PHP manual if you wish to learn more about charset conversion.
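A minimal sketch of such a conversion with iconv follows. It assumes the captured bytes are ISO-8859-1; substitute whatever charset the source page actually declares. The sample string is made up for the example.

```php
<?php
// Assumption: the scraped bytes are ISO-8859-1.
$sourceCharset = 'ISO-8859-1';
$raw = "caf\xE9"; // "café" as ISO-8859-1 bytes (sample data)

// Convert to UTF-8 before storing or processing further.
$utf8 = iconv($sourceCharset, 'UTF-8', $raw);
if ($utf8 === false) {
    // iconv() returns false when the conversion fails.
    throw new RuntimeException("Could not convert from {$sourceCharset} to UTF-8");
}

// $utf8 now holds the UTF-8 bytes "caf\xC3\xA9".
```

If the data is already UTF-8, skip this step: converting it a second time as if it were ISO-8859-1 produces exactly the kind of stray Â characters shown in the question.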

Are you printing the output to a console or a browser? If the former, note that some consoles (old versions of Windows in particular) do not handle UTF-8 well at all. If you are echoing to a browser, make sure your character set is set to "UTF-8" in your own HTML.
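For the browser case, a minimal sketch might look like the following. It assumes the text is already UTF-8, and render_page is a hypothetical helper written for this example. The charset is declared twice, once in the HTTP header and once in the HTML, so the browser does not have to guess.

```php
<?php
// Hypothetical helper: wrap already-UTF-8 text in a minimal page
// that declares its own charset.
function render_page(string $utf8Text): string
{
    // Escape the text so scraped markup is not interpreted as HTML.
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head><meta charset=\"UTF-8\"></head>\n"
         . "<body><p>{$safe}</p></body>\n</html>\n";
}

// Send the HTTP header before any output, then the page itself.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo render_page("This is the sentence");
```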

Answered 2013-11-05T19:38:17.833