python - Parsing XML with "not well-formed" characters in Python

Question

I am getting XML data from an application and want to parse it in Python:

#!/usr/bin/python

import xml.etree.ElementTree as ET
import re

xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()

xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)

This works on smaller data sets with sample data, but when I run it on real, live data I get:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72

Looking at the XML file, this is line 364658:

WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>

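I guess the ^[ is what makes Python choke; vim also highlights it in blue. I had hoped my regex substitution would clean up the data, but it does not.

The best fix would be to repair the application that generates the XML, but that is out of scope, so I need to handle the data as it is. How can I work around this? I can live with simply throwing away the "illegal" characters.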

score 3 · Accepted Answer

You already do this:

xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)

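But the ^[ character is most likely the escape character, \x1b in Python. It falls inside the \x01-\x7f range, so your regex leaves it in place. If xml.parsers.expat chokes on it, you simply need to clean up more and accept only a few of the characters below 0x20 (space). For example: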

xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
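The regex above also discards every non-ASCII character. If the data may contain legitimate non-ASCII text, one alternative is to remove only the characters that XML 1.0 forbids. The following is a minimal sketch under that assumption, not part of the original answer: the helper name strip_illegal_xml_chars and the sample string are made up for illustration, and it assumes Python 3 with the input already decoded to str. It does not handle lone surrogate code points.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow: the C0 control characters other
# than tab (\x09), line feed (\x0a) and carriage return (\x0d), plus the
# non-characters U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def strip_illegal_xml_chars(text):
    """Return text with characters forbidden by XML 1.0 removed (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub(u'', text)


# Made-up sample: ANSI colour escapes (\x1b) around text containing a non-ASCII letter.
sample = u'<description>\x1b[0;36mnotice: caf\xe9\x1b[0m</description>'

root = ET.fromstring(strip_illegal_xml_chars(sample))
print(root.text)
```

Only the escape bytes are dropped here; the rest of each ANSI colour sequence (such as [0;36m) stays in the text, which matches the "just throw away the illegal characters" requirement in the question.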

score 0 · Accepted Answer

I know this is quite old, but I found a list of all the major characters and their encodings at the URL below.

https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e
