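I am exporting a set of records to XML and then transforming that XML to XLIFF with XSLT. The export itself works, but some characters in the exported file come out wrong after the transformation. Here are the step-by-step details: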

Step 1. The user enters a mixed string, for example: Autocomplete On' see the wrong character ==> í

The MySQL database/table field encoding is set to utf8, e.g.

  `unicode longtext COLLATE utf8_unicode_ci`

and it stores the text above.

Step 2. An HTML fragment is generated for export, e.g.

<html version="1.2">
    <table>
        <tr>
            <td id="Autocomplete_On">Autocomplete On' see the wrong character ==&#62; í</td>
        </tr>
    </table>
    </html>

Step 3. Convert to XML:

  <?xml version="1.0" standalone="yes"?>
     <html version="1.2"><body><table><tr><td id="Autocomplete_On">
        Autocomplete On' see the wrong character ==&gt; &#xC3;&#xAD;</td>
</tr></table></body></html>

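Step 4. Transform with XSLT:

(Only the relevant part of the output is pasted. This is what I see when viewing it in a browser, while the file itself contains the actual character Ã.)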

 <body>
      <group id="id796986axmarkhtml-0">
        <group id="id533787bxmarkbody-1">
          <group id="id533788bxmarktable-2">
            <group id="id533790bxmarktr-3">
              <trans-unit id="td-4">
                <source>Autocomplete On' see the wrong character ==&gt; í</source>
                <target>Autocomplete On' see the wrong character ==&gt; í</target>
              </trans-unit>
            </group>
          </group>
        </group>
      </group>
    </body>

Actual code:

  private function xml2xliff($htmlStr,$source,$target){
        $xml=new \DOMDocument();
        //hacky way to tidy html
        @$xml->loadHTML($htmlStr);//step 3
        $xsl = new \DOMDocument;
        $xsl->load(__DIR__.'/xliff/xsl/xml2xliff.xsl');
        $proc = new \XSLTProcessor();
        $proc->ImportStyleSheet($xsl);
        $proc->setParameter('', 'source', $this->getIsoName($source));
        $proc->setParameter('', 'target', $this->getIsoName($target));
        return $proc->transformToXML($xml); //step 4
    }

$htmlStr is the HTML fragment generated in step 2.

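So the problem is that the string gets converted twice. The actual character in question, at each step, is:

Step 1. í

Step 2. Still í

Step 3. Converted to Ã, i.e. &#xC3;&#xAD;

Step 4. Converted to í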

Another example:

Input: Autocomplete On They’re gone now

XML output: Autocomplete On Theyâre gone now

1 Answer

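DOMDocument::loadHTML() loads your HTML as ANSI (ISO-8859-1 when the document declares no encoding), but your string is UTF-8. So each multibyte special character is split into its individual bytes, and every byte is decoded as a separate character, which destroys it: í (bytes C3 AD) becomes &#xC3;&#xAD;. You can trick it into using UTF-8 with an XML processing instruction: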

$html = <<<HTML
<html>
  <table>
    <tr>
      <td id="Autocomplete_On">Autocomplete On' see the wrong character ==&#62; í</td>
    </tr>
  </table>
</html>
HTML;

$dom = new DOMDocument('1.0', 'UTF-8');

$dom->loadHTML('<?xml encoding="UTF-8"?>'.$html);
var_dump(
  $dom->saveXml()
);

Output:

string(331) "<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="UTF-8"??>
<html version="1.2"><body><table><tr><td id="Autocomplete_On">Autocomplete On' see the wrong character ==&gt; &#xED;</td>&#xD;
    </tr></table></body></html>
"
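To make the "split and destroyed" part concrete, here is a minimal sketch that reproduces the same damage without DOMDocument. It assumes the mbstring extension is available; the helper names `misreadAsLatin1` and `undoMisread` are made up for this illustration and are not part of your code or of any library.

```php
<?php
// Hypothetical helper: simulate a parser that reads UTF-8 bytes as
// ISO-8859-1 and then re-encodes the result as UTF-8.
function misreadAsLatin1(string $utf8): string
{
    // Every input byte is treated as one Latin-1 character.
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: reverse the damage, assuming it happened exactly once.
function undoMisread(string $garbled): string
{
    return mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
}

$original = "\u{ED}";                  // í, UTF-8 bytes C3 AD
$garbled  = misreadAsLatin1($original);

echo bin2hex($original), "\n";         // c3ad
echo bin2hex($garbled), "\n";          // c383c2ad (U+00C3 "Ã" followed by U+00AD)
echo bin2hex(undoMisread($garbled)), "\n"; // c3ad

// The second example from the question: ’ (U+2019) is three bytes in UTF-8,
// so it turns into "â" followed by two invisible control characters.
echo bin2hex(misreadAsLatin1("\u{2019}")), "\n"; // c3a2c280c299
```

Reversing the conversion afterwards is only a workaround for data that is already damaged; telling loadHTML() the right encoding up front, as above, avoids the problem in the first place.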
answered 2014-05-17T13:55:46.940