python - Converting illegal XML &char references to UTF-8

Question

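There is a list of XML and HTML character entity references: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

However, some references are not defined in that list at all, yet they were used in older HTML documents. While processing the Senseval-2 format (with fixes) dataset from http://www.d.umn.edu/~tpederse/data.html, I came across the following words, which break the script I use to parse the data with xml.etree.ElementTree.

What are the Unicode equivalents of these words?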

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

My script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

It gives this traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in <module>
    tree = et.parse(s1)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113

score 4 · Accepted Answer

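The "words" look like malformed entity references. A valid entity reference ends with a semicolon. I looked at test-fix.xml (in Sval1to2.fix.tar.gz), and it seems likely that &dash (or &dash.) stands for some kind of dash or hyphen. The file has an .xml extension, and it would be very close to well-formed XML if the bad entity references were fixed.

The page you linked to (http://www.d.umn.edu/~tpederse/data.html) says:

Please note that our converted data will not "parse" as real xml text. This is because in the original sense-tagged text, characters that need special handling in xml were not escaped, etc. We are thinking about ways to make this data "real" xml, and would very much appreciate any feedback as to how best to do that.

So even though the document looks a lot like XML, it is not XML, and the people who published it are well aware of that.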
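To see how many such references a file contains before deciding how to repair them, something like the following might help. It is only a sketch: the regular expression, the function name find_bad_entities, and the sample string are illustrative assumptions, and it treats every name other than the five predefined XML entities as suspect.

```python
import re
from collections import Counter

# The five entity names that XML predefines; anything else needs a DTD.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# '&' followed by a name; the optional ';' records whether the
# reference was terminated properly.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")

def find_bad_entities(text):
    """Count names after '&' that are unterminated or not predefined."""
    bad = Counter()
    for name, semicolon in ENTITY_RE.findall(text):
        if semicolon != ";" or name not in PREDEFINED:
            bad[name] += 1
    return bad

# Hypothetical sample built from fragments quoted in this thread.
sample = "Lyndon &and. Baden-Wu&umlaut.rttemberg &dash. fish &amp; chips &and.A"
print(find_bad_entities(sample))
```

Running the same function over the full text of train-fix.xml should give the complete inventory of names that need a decision.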

score 3 · Accepted Answer

The basic but disappointing answer is: they are typos (using . instead of ;).

Here are most of them:

times → http://www.fileformat.info/info/unicode/char/d7/index.htm
degree → http://www.fileformat.info/info/unicode/char/b0/index.htm
dash → http://www.fileformat.info/info/unicode/char/search.htm?q=dash&preview=entity
ellip → http://www.fileformat.info/info/unicode/char/2026/index.htm

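…and so on. You will have to look at the context of some of these to judge whether the original author meant something specific, or whether it is just a typo or something even worse (dashq‽).

Your best course of action is to fix the mess with a simple chain of string replace method calls before parsing.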
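A minimal sketch of that repair step follows. It uses one regular expression with a lookup table in place of a literal chain of replace calls, which is a substitution of technique for brevity. The mapping is an assumption guessed from the entity names (in particular, "and" as an ampersand and "dash" as an em dash), and fix_entities and the sample string are made up for illustration. Names missing from the table are not guessed at: their ampersand is escaped so the text still parses and can be reviewed by hand.

```python
import re
import xml.etree.ElementTree as et

# Assumed meanings, guessed from the entity names; verify against the corpus.
REPLACEMENTS = {
    "times": u"\u00d7",   # MULTIPLICATION SIGN
    "degree": u"\u00b0",  # DEGREE SIGN
    "dash": u"\u2014",    # EM DASH (assumption; may be another kind of dash)
    "ellip": u"\u2026",   # HORIZONTAL ELLIPSIS
    "and": u"&amp;",      # ampersand, escaped so the result stays valid XML
}

# '&name' with an optional trailing '.', but never a well-formed '&name;'.
BAD_ENTITY_RE = re.compile(r"&([A-Za-z]+)(?![A-Za-z;])\.?")

def fix_entities(text):
    """Replace known pseudo-entities; escape the '&' of unknown ones."""
    def repl(match):
        name = match.group(1)
        if name in REPLACEMENTS:
            return REPLACEMENTS[name]
        # Unknown name (e.g. 'dashq'): keep the text visible but parseable.
        return u"&amp;" + match.group(0)[1:]
    return BAD_ENTITY_RE.sub(repl, text)

# Hypothetical one-element document using tokens from the question.
sample = u"<s>20 &degree.C &dash. R&and.D &ellip.11 &dashq. fish &amp; chips</s>"
fixed = fix_entities(sample)
root = et.fromstring(fixed.encode("utf-8"))
print(root.text)
```

For the real dataset, the file would be read into a string, passed through fix_entities, and then handed to et.fromstring. The file's encoding is not stated in this thread, so it has to be checked first. Other unescaped characters, or the tag mismatches that xmllint reports, may still need separate handling.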

score 3 · Accepted Answer

I found this answer, which can parse your XML using the Python lxml package:

Getting data with Python and lxml

Install the lxml package from here: http://lxml.de/

and use this code:

import lxml.html
root = lxml.html.parse('train-fix.xml').getroot()

Hope it works for you.

score 2 · Accepted Answer

If you have Linux available, use xmllint to find the errors and fix them:

xmllint --recover ~/tmp/test-fix.xml --output ~/tmp/test-fix-fixed.xml 
/home/luis/tmp/test-fix.xml:179: parser error : EntityRef: expecting ';'
inate, Hesse and the Saarland; North Rhine-Westphalia, Baden-Wu&umlaut.rttemberg
                                                                           ^
/home/luis/tmp/test-fix.xml:179: parser error : EntityRef: expecting ';'
Bavaria would remain untouched, and the planned five East German La&umlaut.nder
...
/home/luis/tmp/test-fix.xml:3832: parser error : EntityRef: expecting ';'
Charlie Watts today) we should be ready to hit the road together as Lyndon &and.
                                                                           ^
/home/luis/tmp/test-fix.xml:3841: parser error : Opening and ending tag mismatch: corpus line 1 and lexelt
</lexelt>
     ^
/home/luis/tmp/test-fix.xml:3842: parser error : Extra content at the end of the document
<lexelt item="behaviour-n">


                                                                           ^
