python - Python unicode trouble

Question

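I'm running into a unicode problem in a script I'm writing. I've searched the internet, including this site, and tried many things, but I still can't tell what is going wrong.

My code is quite long, but here is an excerpt: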

raw_results = get_raw(args)
write_raw(raw_results)
parsed_results = parse_raw(raw_results)
write_parsed(parsed_results)

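Basically, I get the raw results as XML encoded in UTF-8. Writing the RAW data works without problems, but writing the parsed data does not. So I'm fairly sure the problem is somewhere inside the function that parses the data.

I've tried everything, but I don't understand what the problem is. Even this simple line gives me an error: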

def parse_raw(raw_results):
    content = raw_results.replace(u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', u'')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 570: ordinal not in range(128)

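Ideally I would like to work with unicode without any problems, but I also have no issue with replacing/ignoring any unicode and just using regular text. I know I haven't provided my full code, but please understand that I can't, since it is work related. I hope this is enough to get some help.

Edit: The top of my parse_raw function: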

from xml.etree.ElementTree import XML, fromstring, tostring
def parse_raw(raw_results):
    raw_results = raw_results.decode("utf-8")
    content = raw_results.replace('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', '')
    content = "<root>\n%s\n</root>" % content
    mxml = fromstring(content)

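Edit2: I thought it would be good to point out that the code works fine unless there are special characters. When it's 100% English, no problem; whenever any foreign or accented letters are involved, the problem appears.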

score 3 · Accepted Answer

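raw_results is probably a str object, not a unicode object.

raw_results.replace(u'...', ...) causes Python to first decode the str raw_results into unicode. Python 2 uses the ascii codec by default. raw_results contains the byte '\xd7' at position 570, which the ascii codec cannot decode (i.e., it is not an ascii character).

Here is a demonstration of how this error can occur: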

In [27]: '\xd7'.replace(u'a',u'b')      
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)

Whereas if raw_results were unicode, there would be no silent decoding with ascii, and so no error would occur:

In [28]: u'\xd7'.replace(u'a',u'b')
Out[28]: u'\xd7'

If you know the appropriate codec, you can fix the problem by decoding raw_results explicitly:

raw_results = raw_results.decode('latin-1')

latin-1 is just a guess. It might be correct if the character at position 570 is a multiplication symbol:

In [26]: print('\xd7'.decode('latin-1'))
×
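
The decode-first approach above can be sketched as follows. This is a minimal Python 3 illustration, not the asker's actual code: the names XML_DECL and strip_declaration are hypothetical, and the default codec is an assumption. Since the question says the data is UTF-8, where 0xd7 is the first byte of a two-byte sequence, utf-8 may well be the right codec rather than latin-1. In Python 2 the equivalent step is raw_results.decode('utf-8').

```python
# Hypothetical constant: the declaration the question strips out.
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def strip_declaration(raw_results, encoding="utf-8"):
    """Decode bytes exactly once, then do all text work on the decoded string."""
    # Only decode when we were actually given bytes, so that text which
    # has already been decoded is never decoded a second time.
    if isinstance(raw_results, bytes):
        raw_results = raw_results.decode(encoding)
    return raw_results.replace(XML_DECL, "")


# Sample input with non-ASCII characters (a multiplication sign and a
# Hebrew letter whose UTF-8 encoding starts with the byte 0xd7).
raw = (XML_DECL + "<item>\u00d7 \u05d0</item>").encode("utf-8")
content = strip_declaration(raw)
```

Passing the already decoded content through strip_declaration again leaves it unchanged, which is the property that avoids the double-decoding trap.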

score 0 · Accepted Answer

Thank you everyone for the input and the nudges. I have subsequently solved my own problem by going over my code for the millionth time with a fine-toothed comb, and I have found the culprit. And I have solved all my problems now.

For anyone with a similar problem, I have the following information that could help you:

Use the codecs module for writing your files.
Do not try to handle encoding throughout your code. Most of your code should ignore character sets entirely, and only specific methods (or calls to them) should change the charset. (This helped me find the problem.)

My problem was that at a certain point I was trying to turn unicode into unicode. And in another place I was trying to turn normal ASCII into ASCII again. So whenever I solved one issue, another arose and I figured it was the same problem.
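
A rough sketch of that advice, assuming UTF-8 files: decode once when reading, encode once when writing, and keep everything in between as text. The helper names read_text and write_text and the file name are hypothetical. It uses io.open, which accepts an encoding argument on both Python 2 and Python 3; codecs.open(path, "w", encoding="utf-8") is the codecs-module call that plays the same role.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    # Decode once, on the way in.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    # Encode once, on the way out; callers only ever pass text.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Round trip through a temporary file with non-ASCII content.
path = os.path.join(tempfile.mkdtemp(), "parsed.txt")
write_text(path, u"caf\u00e9 \u00d7 2")
round_trip = read_text(path)
```

With the conversions confined to these two functions, a stray decode or encode anywhere else in the code stands out as a bug.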

Break your issue into sections... and then you might find your problem!
