php - Does PHP have no function for XML-safe entity decoding? No xml_entity_decode?

Question

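PROBLEM: I need an XML file that is "fully encoded" in UTF-8; that is, with no entities standing in for symbols. Every symbol must be a real UTF-8 character, except the only 3 that are XML-reserved: "&" (amp), "<" (lt) and ">" (gt). I also need a built-in function that does this fast: convert entities into real UTF-8 characters without corrupting my XML.
PS: this is a "real-world problem" (!). At PMC/journals, for example, there are 2.8 million scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to change the numeric entities into UTF-8 characters.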

ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!), because it also converts the 3 XML-reserved symbols.

Illustrating the problem

Suppose

  $xmlFrag ='<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

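where the entities 160 (nbsp) and x222C (double integral) must be converted to UTF-8, while the XML-reserved lt must not. After conversion the XML text will be,

  $xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';

The text "A<B" needs an XML-reserved character, so it MUST stay as A&lt;B.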

Frustrated solutions

I tried to use html_entity_decode to solve the problem (directly!)... So I updated my PHP to v5.5 to try the ENT_XML1 option,

  $s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
                                                        // as I expected
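To make the failure concrete, here is a minimal sketch that applies the call above to the fragment from the question. It assumes a PHP version where ENT_XML1 exists (5.4 or later), and the variable names are only illustrative. It shows that the numeric entities are converted as wanted, but that &lt; is decoded along with them.

```php
<?php
$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

// ENT_XML1 limits the named-entity table to the XML predefined entities,
// but those are still decoded, so "&lt;" becomes a raw "<".
$decoded = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The numeric entity was converted (U+222C, written here as UTF-8 bytes) ...
var_dump(strpos($decoded, "\xE2\x88\xAC") !== false);
// ... but the reserved character was converted too, so the markup is broken.
var_dump(strpos($decoded, 'A<B') !== false);
```

Both checks print bool(true), which is exactly the behavior the question complains about.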

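Perhaps another question is, "WHY is there no other option that does what I expect?" This matters for many other XML applications (!), not only for me.

I do not need a workaround as an answer... OK, I will show my ugly function; maybe it helps you to understand the problem,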

  function xml_entity_decode($s) {
    // here an illustration (by user-defined function) 
    // about how the hypothetical PHP-build-in-function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s); 

    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+

    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }  // you see? no need for a benchmark:
     //  it is not as fast as direct use of html_entity_decode;
     //  an XML-safe option would be ideal.

PS: corrected after this answer. The ENT_HTML5 flag is required to really convert all named entities.

score 6 · Accepted Answer

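This question keeps attracting "false answers" (see the answers). That is probably because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

...So, let me repeat my workaround (which is NOT an answer!) to avoid more confusion:

The best workaround

Pay attention:

The function xml_entity_decode() below is the best workaround (better than any other).
The function below is not an answer to the present question; it is only a workaround.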

  function xml_entity_decode($s) {
  // illustrating how a (hypothetical) PHP-build-in-function MUST work
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s); 
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
 }
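As a usage illustration, the following sketch runs the workaround on the fragment from the question. The function body is the one shown above, repeated only so the snippet runs on its own; the $out variable is illustrative.

```php
<?php
function xml_entity_decode($s) {
    static $XENTITIES = array('&amp;', '&gt;', '&lt;');
    static $XSAFENTITIES = array('#_x_amp#;', '#_x_gt#;', '#_x_lt#;');
    // Hide the 3 reserved entities behind placeholders.
    $s = str_replace($XENTITIES, $XSAFENTITIES, $s);
    // Decode every other named or numeric entity to UTF-8.
    $s = html_entity_decode($s, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
    // Put the reserved entities back.
    return str_replace($XSAFENTITIES, $XENTITIES, $s);
}

$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';
$out = xml_entity_decode($xmlFrag);
echo $out, "\n";
```

The result keeps A&lt;B intact while &#160; and &#x222C; become real UTF-8 characters. Because of ENT_NOQUOTES, &quot; and &apos; would also be left as entities.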

To test and to demonstrate that you have a better solution, please test it first with this simple benchmark:

  $countBchMk_MAX=1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){

    $A = xml_entity_decode($xml); // 0.0002

    /* 0.0014
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
     ($end_time  - $start_time)/$countBchMk_MAX, 
     " seconds</h1>";

score 2 · Accepted Answer

Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:

$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
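The same idea can be tried without any files. The sketch below is an assumption-laden stand-in: the internal DTD subset with a single nbsp entity plays the role of the JATS DTD, so LIBXML_DTDLOAD is not needed, and the sample markup is made up for illustration. It requires the DOM extension.

```php
<?php
// Hypothetical in-memory document; the internal subset stands in for the JATS DTD.
$xml = '<?xml version="1.0"?>'
     . '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>'
     . '<p>A&lt;B&nbsp;and A=&#x222C;dxdy</p>';

$doc = new DOMDocument();
$doc->loadXML($xml, LIBXML_NOENT); // substitute entity references while parsing
$doc->encoding = 'UTF-8';          // serialize non-ASCII characters as UTF-8 bytes
$out = $doc->saveXML();
echo $out;
```

The serializer re-escapes the reserved "<" as &lt;, while the named and numeric entities come out as UTF-8 characters. Note that LIBXML_NOENT enables entity substitution, so it is best reserved for trusted input.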

score 2 · Accepted Answer

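I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. Sigh... Anyway, I came up with the following. It is not as fast as yours, but it is not an order of magnitude slower either, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, although that string is unlikely to appear in the source XML.

Note: I'm assuming the default encoding is UTF-8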

// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

/* <Foo>€&amp;foo Ç</Foo> */

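Also, if you want to replace special characters with numbered entities (in case you don't want UTF-8 XML), you can easily add one function call to the code above: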

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>&#8364;&amp;foo &#199;</Foo> */
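The array passed to mb_encode_numericentity is a conversion map: each group of four integers is a start code point, an end code point, an offset and a mask. The sketch below illustrates what the map used above covers. The wider $allPlanes map is an assumption added here for comparison, not part of the answer, and the variable names are illustrative.

```php
<?php
$bmpOnly   = [0x80, 0xffff, 0, 0xffff];      // the map used in the answer
$allPlanes = [0x80, 0x10ffff, 0, 0x1fffff];  // assumption: widened to all code points

$euro  = "\xE2\x82\xAC";      // U+20AC, inside the range
$emoji = "\xF0\x9F\x98\x80";  // U+1F600, above 0xffff

echo mb_encode_numericentity($euro, $bmpOnly, 'UTF-8'), "\n";    // &#8364;
echo mb_encode_numericentity($emoji, $bmpOnly, 'UTF-8'), "\n";   // left as a raw character
echo mb_encode_numericentity($emoji, $allPlanes, 'UTF-8'), "\n"; // &#128512;

// Decoding touches numeric references only; "&amp;" is left alone.
echo mb_decode_numericentity('&#8364;&amp;', $bmpOnly, 'UTF-8'), "\n";
```

So with the original map, characters outside the Basic Multilingual Plane pass through unencoded, which may or may not matter for a given document.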

In your case you want it the other way around: decode numbered entities to UTF-8:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

// Encodes (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>€&amp;foo Ç</Foo> */

Benchmark

I've added a benchmark for good measure. For clarity, it also demonstrates the flaw in your solution. Below is the input string I used.

<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>

Your method

php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é &amp; ∬</Foo>
=====
Time taken: 2.0397531986237

My method

php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
=====
Time taken: 4.045273065567

My method (with unicode to numbered entity):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
=====
Time taken: 5.4407880306244

My method (with numbered entity to unicode):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661

score 1 · Accepted Answer

public function entity_decode($str, $charset = NULL)
{
    if (strpos($str, '&') === FALSE)
    {
        return $str;
    }

    static $_entities;

    isset($charset) OR $charset = $this->charset;
    $flag = is_php('5.4')
        ? ENT_COMPAT | ENT_HTML5
        : ENT_COMPAT;

    do
    {
        $str_compare = $str;

        // Decode standard entities, avoiding false positives
        if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
        {
            if ( ! isset($_entities))
            {
                $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));

                // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                // entities to the array manually
                if ($flag === ENT_COMPAT)
                {
                    $_entities[':'] = '&colon;';
                    $_entities['('] = '&lpar;';
                    $_entities[')'] = '&rpar;';
                    $_entities["\n"] = '&newline;';
                    $_entities["\t"] = '&tab;';
                }
            }

            $replace = array();
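            $matches = array_unique(array_map('strtolower', $matches[0]));
            // array_unique() preserves keys, so iterate the values directly
            // instead of indexing 0..$c-1, which would hit missing offsets.
            foreach ($matches as $match)
            {
                if (($char = array_search($match.';', $_entities, TRUE)) !== FALSE)
                {
                    $replace[$match] = $char;
                }
            }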

            $str = str_ireplace(array_keys($replace), array_values($replace), $str);
        }

        // Decode numeric & UTF16 two byte entities
        $str = html_entity_decode(
            preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
            $flag,
            $charset
        );
    }
    while ($str_compare !== $str);
    return $str;
}

score 0 · Accepted Answer

For those who come here because numeric entities in the range 128 to 159 remain numeric entities instead of being converted to characters:

echo xml_entity_decode('&#128;');
//Output &#128; instead expected €

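This depends on the PHP version (at least for PHP >= 5.6 the entity remains) and on the affected characters. The reason is that code points 128 to 159 are not printable characters in UTF-8; in Unicode they are C1 control characters. This can happen if the data to be converted mixes in windows-1252 content (where &#128; is the € sign).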
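If the data really does carry windows-1252 numbers in that range, one possible way to deal with it is to reinterpret those references as windows-1252 bytes before decoding the rest. The helper below is a hypothetical sketch, not a PHP built-in. It assumes the mbstring extension and handles only decimal references without leading zeros.

```php
<?php
// Hypothetical helper: maps &#128;..&#159; through Windows-1252 to UTF-8.
function decode_c1_entities($s) {
    // The pattern matches the decimal numbers 128..159 only.
    return preg_replace_callback('/&#(12[89]|1[3-5][0-9]);/', function ($m) {
        // Treat the number as a single Windows-1252 byte and convert it.
        return mb_convert_encoding(chr((int) $m[1]), 'UTF-8', 'Windows-1252');
    }, $s);
}

echo decode_c1_entities('Price: 10 &#128;'), "\n"; // Price: 10 €
```

Five byte values in that range (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in windows-1252, and what mbstring returns for them is not covered by this sketch.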

score -1 · Accepted Answer

Try this function:

function xmlsafe($s,$intoQuotes=1) {
if ($intoQuotes)
     return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
else
     return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
}

Example usage:

echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';

Also: https://stackoverflow.com/a/9446666/2312709

This code, used in production, seems to have no problems with UTF-8.