python - How to convert \xXY encoded characters to UTF-8 in Python?

Question

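I have a text that contains characters such as "\xaf" and "\xbe", which, as I understand from this question, are ASCII-encoded characters.

I want to convert them to their UTF-8 equivalents in Python. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there a better way, e.g., using the codecs standard library?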

score 3 · Accepted Answer

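.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it is legal to call .encode on a str object. Python first implicitly decodes the string to Unicode: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).

The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding: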

>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
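
In Python 3 there is no implicit decode step, so the same conversion has to start from a bytes object. A minimal sketch, assuming (as in the example above) that the input really is ISO-8859-1; the variable names are illustrative only:

```python
raw = b"\xAF \xBE"  # bytes as they might arrive from a file or socket

# Decoding as ASCII fails; this is the step Python 2 performed implicitly
# inside str.encode(), which is where the UnicodeDecodeError came from.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Assumption: the data is ISO-8859-1, as in the answer above.
text = raw.decode("iso-8859-1")  # bytes -> str
utf8 = text.encode("utf-8")      # str -> UTF-8 bytes

print(ascii_ok)  # False
print(utf8)      # b'\xc2\xaf \xc2\xbe'
```

If the source encoding is something other than ISO-8859-1, only the argument to decode() changes.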

score 2 · Accepted Answer

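It's not ASCII (ASCII only goes up to 127; \xaf is 175). You first need to find out the correct encoding, decode with it, and then re-encode as UTF-8.

Could you provide a sample of the actual string? Then we can probably guess the current encoding.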
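
When no sample is at hand, one rough way to narrow down the encoding is to try a few candidates and see which ones decode without error. The sketch below is Python 3; the function name try_decodings and the candidate list are illustrative assumptions, and a decode that succeeds does not prove the encoding is the right one (ISO-8859-1, for instance, accepts any byte sequence).

```python
def try_decodings(data, candidates=("ascii", "utf-8", "iso-8859-1", "cp1252")):
    """Return {encoding: decoded text, or None if decoding failed}."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


raw = b"\xaf \xbe"
print(max(raw))  # 190, so the data cannot be pure ASCII (limit is 127)

for name, decoded in try_decodings(raw).items():
    print(name, repr(decoded))
```

The candidates that survive still need to be checked by eye against what the text is supposed to say.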

score 2 · Accepted Answer

Your file is already a UTF-8 encoded file.

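# saved encoding-sample to /tmp/encoding-sample
import codecs
import sys
import unicodedata as ud

# Decode the file as UTF-8 while reading
fp = codecs.open("/tmp/encoding-sample", "r", "utf8")
data = fp.read()
fp.close()

# Print every distinct character with its code point and Unicode name
chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        # control characters have no name in the Unicode database
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))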

and fill in the unknown names manually:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
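
For Python 3, the same inventory can be sketched without the try/except by passing a default to unicodedata.name. The helper name describe_chars and the in-memory sample string are assumptions for illustration; with a real file, the text would come from open(path, encoding="utf-8").read() instead.

```python
import unicodedata as ud


def describe_chars(text):
    """List each distinct character with its code point and Unicode name."""
    lines = []
    for char in sorted(set(text)):
        # Control characters have no name; fall back to a placeholder.
        charname = ud.name(char, "<unknown>")
        lines.append("char U%04x %s" % (ord(char), charname))
    return lines


sample = "a\n\x1e\x1f"  # hypothetical sample, not the poster's actual data
for line in describe_chars(sample):
    print(line)
```

The control characters still come out as <unknown> here, so their names have to be filled in by hand as above.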
