python - UnicodeEncodeError when parsing XML with cElementTree from AppleScript

Question

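Apologies if this is a duplicate or something very obvious, but please bear with me as I'm new to Python. I'm trying to use cElementTree (Python 2.7.5) to parse an XML file from AppleScript. The XML file contains some fields in which non-ASCII text is encoded as entities, e.g. <foo>café</foo>.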

Running the following basic code in Terminal prints the tag/content pairs as expected:

import xml.etree.cElementTree as etree
parser = etree.XMLParser(encoding="utf-8")
tree = etree.parse("myfile.xml", parser=parser)
root = tree.getroot()
for child in root:
    print child.tag, child.text

But when I run the same code from AppleScript using do shell script, I get the dreaded UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128).

I found that if I change the print line to

    print [child.tag, child.text]

then I do get a string with the XML tag/value pairs wrapped in [''], but any non-ASCII characters are passed to AppleScript as literal Unicode escape strings (so I end up with u'caf\\xe9').

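I've tried several things, including a.) reading the .xml file into a string and using .fromstring instead of .parse, b.) converting the .xml file to str before handing it to cElementTree, and c.) simply sticking .encode anywhere I could to see whether I could avoid the ASCII codec, but no solution so far. Unfortunately, I'm stuck with AppleScript as the container. Thanks in advance for any advice!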

score 0 · Accepted Answer

You need to at least encode child.text into something AppleScript can handle. If you want character entity references back, this will do it:

print child.tag.encode('ascii', 'xmlcharrefreplace'), child.text.encode('ascii', 'xmlcharrefreplace')

Or, if it can handle something like utf-8:

print child.tag.encode('utf-8'), child.text.encode('utf-8')
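
To make the difference between the two options concrete, here is a small sketch using the sample value from the question. The variable names are illustrative and not part of the original code; the snippet should run on both Python 2.7 and Python 3.

```python
# Illustrative value only: "cafe" with an e-acute (U+00E9), as in the question.
text = u'caf\xe9'

# Option 1: stay within ASCII; each non-ASCII character becomes a
# numeric character reference.
as_charrefs = text.encode('ascii', 'xmlcharrefreplace')

# Option 2: emit UTF-8 bytes; U+00E9 becomes the two bytes 0xC3 0xA9.
as_utf8 = text.encode('utf-8')

print(repr(as_charrefs))
print(repr(as_utf8))
```

Which one is appropriate depends on what the receiving AppleScript does with the text: the first is safe over any ASCII channel but leaves references to decode, while the second needs the receiver to read the bytes as UTF-8.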

score 0 · Accepted Answer

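This isn't AppleScript's fault: it's Python being "helpful" by guessing for you which output encoding to use. (Unfortunately, it guesses differently depending on whether or not a terminal is attached.)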
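
A rough way to see that guess is to inspect sys.stdout from inside the script. The helper name describe_stdout below is hypothetical. On Python 2, sys.stdout.encoding is typically None when output goes to a pipe rather than a terminal, and print then falls back to the ASCII codec; the last few lines imitate that fallback for the sample text from the question.

```python
import sys


def describe_stdout():
    """Report whether stdout is a terminal and which encoding Python picked.

    Hypothetical helper, for illustration only.
    """
    return {
        'isatty': sys.stdout.isatty(),
        # May be None on Python 2 when stdout is a pipe or a file.
        'encoding': getattr(sys.stdout, 'encoding', None),
    }


# The report is plain ASCII, so printing it is safe under any encoding.
print(describe_stdout())

# What an ASCII fallback amounts to for text containing U+00E9:
try:
    u'caf\xe9'.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

print(ascii_error)
```

Running this once in Terminal and once through do shell script should show the two environments disagreeing about the encoding.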

The simplest solution (Python 2.6+) is to set the PYTHONIOENCODING environment variable before invoking python:

do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python '/path/to/script.py'"

Or:

do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python << EOF

# -*- coding: utf-8 -*-

# your Python code goes here...

print u'A Møøse once bit my sister ...'

EOF"
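
The effect of the variable can also be checked without AppleScript. The sketch below launches a child interpreter with its stdout captured through a pipe, which is roughly the situation under do shell script, and sets PYTHONIOENCODING for that child only. It assumes sys.executable points at a usable interpreter; child_code, env and raw are illustrative names.

```python
import os
import subprocess
import sys

# Child program: print text containing one non-ASCII character (U+00E9).
child_code = "print(u'caf\\xe9')"

# Copy the current environment and force UTF-8 for the child's stdio.
env = dict(os.environ)
env['PYTHONIOENCODING'] = 'UTF-8'

# check_output captures stdout through a pipe, so no terminal is attached.
raw = subprocess.check_output([sys.executable, '-c', child_code], env=env)

# The captured bytes are the UTF-8 encoding of the printed text.
print(repr(raw.strip()))
```

On Python 2, removing the env['PYTHONIOENCODING'] line should bring back the UnicodeEncodeError from the question, surfacing here as a CalledProcessError from the failed child; Python 3 chooses its default differently, so the result there may vary.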
