python - Parsing the Stackoverflow Posts.xml data dump crashes the program with an ascii encoding error

Question

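I have downloaded the Stackoverflow data dump from June 2013 and am now parsing the XML files and storing them in a MySQL database. I am using Python ElementTree to do this, but it keeps crashing and giving me an encoding error.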

Snippet of the parsing code:

post = open('a.xml', 'r')
a = post.read()  
tree = xml.parse((a).encode('ascii', 'ignore')) # I also tried .encode('utf-8').strip() it doesn't work

#Get the root node

row = tree.findall("row")

It gives me the following error:

'ascii' codec can't encode character u'\u2019' in position 248: ordinal not in range(128)

I also tried the following, but the problem persists.

.encode('ascii', 'ignore')

Any suggestions for solving the problem would be appreciated. A link to a clean copy of the data would also help.

Also, my end goal is to convert the data to RDF, so if anyone has the StackOverflow data dump in RDF format, I would be grateful.

Thanks in advance!

P.S. This is the XML line that causes the problem and crashes the program:

<row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />

Edit: @Arjan, the solution you mentioned here did not work for me.

score 0 · Accepted Answer

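You did not mention which version of Python you are using, and versions 2 and 3 handle unicode differently, so that may be a factor. Since you are running into trouble, I am guessing you are on 2.x, because version 3 generally handles unicode more gracefully.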
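To show where a message like the one in the question comes from, here is a minimal sketch. It assumes Python 3 (not the 2.x the question most likely used), and the sample string is only an excerpt of the Body attribute. The curly apostrophe U+2019 has no ASCII representation, so a strict ASCII encode raises, while the 'ignore' handler silently drops the character.

```python
# Python 3 sketch (an assumption; the question most likely ran Python 2.x).
# U+2019 (right single quotation mark) cannot be represented in ASCII.
text = "the system\u2019s timer"

try:
    text.encode("ascii")  # strict error handling is the default
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# errors="ignore" drops the offending character instead of raising,
# which loses data rather than fixing the underlying problem.
stripped = text.encode("ascii", "ignore")

print(error_message)
print(stripped)
```

Dropping characters this way changes the post text, so keeping the data as unicode and letting the parser decode the file is usually the safer route.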

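ElementTree knows how to parse an xml file (or string) that contains unicode, without any need for str.encode(). Assuming Python 2.7, the code below parses an xml file containing the line from the question with the unicode character: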

First, here are the contents of an xml file named "test.xml", created for testing, which includes the problematic line:

<?xml version="1.0"?>
<rows>
    <row Id="99" PostTypeId="2" ParentId="88" CreationDate="2008-08-01T14:55:08.477" Score="2" Body="&lt;blockquote&gt;&#xD;&#xA;  &lt;p&gt;The actual resolution of gettimeofday() depends on the hardware architecture. Intel processors as well as SPARC machines offer high resolution timers that measure microseconds. Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz. In such cases, the time resolution will be less accurate. &lt;/p&gt;&#xD;&#xA;&lt;/blockquote&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;I obtained this answer from &lt;a href=&quot;http://www.informit.com/guides/content.aspx?g=cplusplus&amp;amp;seqNum=272&quot; rel=&quot;nofollow&quot;&gt;High Resolution Time Measurement and Timers, Part I&lt;/a&gt;&lt;/p&gt;" OwnerUserId="25" LastActivityDate="2008-08-01T14:55:08.477" />
</rows>

Code to parse the file above:

>>> import xml.etree.ElementTree as xml
>>> tree = xml.parse('test.xml') # Assuming code lives in same directory as file
>>> # File is now parsed into variable 'tree',
>>> # and we can check the problematic unicode character is in there
>>> body = tree.find('row').attrib['Body']
>>> # We can look at the escaped unicode character...
>>> body[238:256]
u'the system\u2019s timer'
>>> # Or we can view it represented as we would expect to read it
>>> print body[238:256]
the system’s timer

If this example still produces an error for you, perhaps you could provide some additional information about your problem.
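For readers on Python 3, the same check can be sketched as below. This is an illustration under stated assumptions, not the code from the answer: the XML is a shortened, hypothetical stand-in for the row in the question, and an in-memory `io.BytesIO` object takes the place of "test.xml" so the snippet is self-contained. Note that `parse()` takes a filename or file object, not the file's contents as a string.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the row quoted in the question.
XML_TEXT = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<rows>\n'
    '  <row Id="99" Body="&lt;p&gt;fall back to the system\u2019s timer&lt;/p&gt;" />\n'
    '</rows>\n'
)

# Simulate a UTF-8 file on disk; ET.parse accepts a filename or a file object.
source = io.BytesIO(XML_TEXT.encode("utf-8"))
tree = ET.parse(source)

# Attribute values come back as ordinary (unicode) strings; no encode() needed.
body = tree.find("row").attrib["Body"]

# Locate the phrase instead of hard-coding offsets for this shortened row.
start = body.index("the system")
snippet = body[start:start + 18]
print(snippet)
```

With a real dump file, passing the path directly, for example `ET.parse('test.xml')`, works the same way.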
