python - A central way to filter invalid Unicode characters in lxml?

Question

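As is well known, certain character ranges are not allowed in XML documents. I am aware of solutions for filtering these characters out (such as [1], [2]).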
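For reference, the ranges in question come from the `Char` production of the XML 1.0 specification: tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. A minimal sketch of a check against that production follows; the function name `is_valid_xml_char` is made up for illustration and is not part of lxml.

```python
def is_valid_xml_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # everything below the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


# The three problem characters from this question, plus an ordinary letter.
for ch in (u'\u0002', u'\ud800', u'\uffff', u'A'):
    print(hex(ord(ch)), is_valid_xml_char(ch))
```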

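Following the Don't Repeat Yourself principle, I would prefer to implement one of these solutions at a single central point. Right now, I have to sanitize any potentially unsafe text before it is fed to lxml. Is there a way to achieve this, for example by subclassing an lxml filter class, by catching some exceptions, or by setting a configuration switch?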

Edit: To hopefully clarify the question a bit, here is some sample code:

from lxml import etree

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += u'\u0002'

Running this produces the following output:

<root>&#65535;&#55296;</root>

Traceback (most recent call last):
  File "[…]", line 9, in <module>
    root.text += u'\u0002'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

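As you can see, an exception is raised for the \x02 control character (U+0002), but lxml happily escapes the other two out-of-range characters. The real trouble is that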

s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)

also throws an exception. I find this behavior somewhat unsettling, especially because it produces invalid XML documents.
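The round-trip failure is consistent with the XML 1.0 rule that a character reference must itself name a code point allowed by the `Char` production. As a rough sketch, serialized output can be scanned for references that break this rule; `invalid_char_refs` and its helper are hypothetical names, not lxml functions.

```python
import re

# Matches decimal (&#65535;) and hexadecimal (&#xFFFF;) character references.
_CHAR_REF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')


def _is_xml_code_point(cp):
    """XML 1.0 Char production, expressed on integer code points."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def invalid_char_refs(xml_text):
    """Return the character references in xml_text that name disallowed code points."""
    bad = []
    for match in _CHAR_REF.finditer(xml_text):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not _is_xml_code_point(cp):
            bad.append(match.group(0))
    return bad


print(invalid_char_refs("<root>&#65535;&#55296;</root>"))
```

Both references in the sample string are reported, since 65535 is U+FFFF and 55296 is U+D800.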

It turns out this may be a Python 2 vs. 3 issue. With Python 3.4, the code above raises the exception

Traceback (most recent call last):
  File "[…]", line 5, in <module>
    root.text += u'\ud800'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed

The only remaining problem is the \uffff character, which lxml still happily accepts.

score 1 · Accepted Answer

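Just filter the string before parsing it with lxml: see "Cleaning invalid characters from XML" (a gist).

I tried it with your code and it seems to work, except that you need to change the gist so that it imports re and sys!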

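from lxml import etree
# Assumes the gist has been saved locally as cleaner.py.
from cleaner import invalid_xml_remove

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800'

print(etree.tostring(root))

# The control character is stripped before lxml sees it, so no ValueError is raised.
root.text += invalid_xml_remove(u'\u0002')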
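Since the gist itself is not reproduced here, the following is only a sketch of what such a filter might look like, written for Python 3 and based on the XML 1.0 `Char` production rather than on the gist's actual code. The names `strip_invalid_xml_chars` and `set_text` are hypothetical. Routing every assignment through one helper like `set_text` is one way to get the single central point the question asks for.

```python
import re

# Negated character class: anything NOT allowed by the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text):
    """Return text with every character that XML 1.0 disallows removed."""
    return _INVALID_XML_CHARS.sub(u'', text)


def set_text(element, value):
    """Assign sanitized text to any object with a .text attribute (e.g. an lxml element)."""
    element.text = strip_invalid_xml_chars(value)
    return element
```

With this in place, `set_text(root, u'\uffff\ud800\u0002ok')` would leave `root.text` equal to `u'ok'`, covering the \uffff case that lxml accepts on its own.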
