php - Converting special HTML characters back to their original string

Question

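I'm building a small parser that scrapes web pages and records data from them. One of the things to record is a forum post's title. I'm using an XML parser to walk the DOM and get this information, and I store it as follows: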

// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));

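This works in most cases, but some posts contain special HTML character codes such as &#8211;, which is a dash (–). How would I convert these character codes back to their original characters?

Thanks.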

score 3 · Accepted Answer

Use html_entity_decode. Here is a simple example.

$string = "hyphenated&#8211;words"; // the entity needs its closing semicolon to be decoded

$new = html_entity_decode($string);

echo $new;

You should see...

hyphenated–words
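
One caveat, offered as a sketch rather than a definitive diagnosis: html_entity_decode only decodes entities whose character exists in the target charset. Before PHP 5.4 the default charset was ISO-8859-1, which has no en dash, so &#8211; would be left untouched. Passing the flags and charset explicitly avoids depending on the default. UTF-8 here is an assumption about the page's encoding.

```php
<?php
// Assumption: the scraped page and the rest of the script use UTF-8.
$string = "hyphenated&#8211;words";

// Name the target charset explicitly; ENT_QUOTES also decodes quote entities.
$utf8 = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

// ISO-8859-1 cannot represent the en dash, so the entity stays in place.
$latin1 = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-1');

echo $utf8, "\n";
echo $latin1, "\n";
```

If the second line still shows the raw entity while the first shows the dash, the charset is the likely culprit.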

score 0 · Answer

The documentation is your friend:

html_entity_decode(trim($title->plaintext), ENT_XHTML, YOUR_ENCODING);
                                            ^^^^^^^^^^^^^^^^^^^^^^^^
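
As a minimal sketch of that call in context, assuming the scraped page is UTF-8 (substitute your own encoding). decode_title is a hypothetical helper name, not part of any library. It decodes once, since the nested htmlspecialchars_decode(html_entity_decode(...)) from the question runs two passes and can over-decode text such as &amp;lt;.

```php
<?php
// Hypothetical helper: decodes a scraped title exactly once.
// Assumption: the title text is UTF-8.
function decode_title($raw)
{
    // ENT_QUOTES decodes both quote styles; ENT_XHTML selects the XHTML entity table.
    return html_entity_decode(trim($raw), ENT_QUOTES | ENT_XHTML, 'UTF-8');
}

echo decode_title("  Tips &amp; tricks &#8211; part 2  "), "\n";
```

With a single pass, a title that literally contains &amp;lt; comes out as &lt; instead of being decoded a second time into <.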

score 0 · Answer

This might help:

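<?php
// Note: this encodes characters as entities; it does not decode them.
function clean_up($str) {
    // Remove backslash escaping (e.g. \' becomes ')
    $str = stripslashes($str);
    // Replace each character that has a named HTML entity with that entity
    $str = strtr($str, get_html_translation_table(HTML_ENTITIES));
    // Replace Windows-1252 punctuation bytes with numeric entities
    $str = str_replace(
        array("\x82", "\x84", "\x85", "\x91", "\x92", "\x93", "\x94", "\x95", "\x96", "\x97"),
        array("&#8218;", "&#8222;", "&#8230;", "&#8216;", "&#8217;", "&#8220;", "&#8221;", "&#8226;", "&#8211;", "&#8212;"),
        $str
    );
    return $str;
}
?>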
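
The function above goes from characters to entities. If the goal is the opposite direction, a rough sketch might look like the following. It assumes the input holds raw Windows-1252 bytes and that the mbstring extension is loaded. cp1252_to_utf8_text is a hypothetical name.

```php
<?php
// Hypothetical helper; assumes $str holds Windows-1252 bytes and mbstring is available.
function cp1252_to_utf8_text($str)
{
    // Re-encode raw bytes such as "\x96" (en dash) as UTF-8.
    $str = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
    // Then decode any remaining entities into UTF-8 characters.
    return html_entity_decode($str, ENT_QUOTES, 'UTF-8');
}

echo cp1252_to_utf8_text("caf\xE9 \x96 open &#8211; late"), "\n";
```

If the input is already UTF-8, skip the mb_convert_encoding step, since re-encoding it would corrupt multibyte characters.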
