python - How do I convert a unicode string containing cp1252 characters to UTF-8 with Python?

Question

I'm getting text through an API that returns strings containing a Windows-encoded apostrophe (\x92):

> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>

I'm trying to convert this string to UTF-8 so that it returns: "There’s thirty days in June"

When I try to decode or encode this unicode string, it raises an error:

>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)

>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>

If I initialize the string as a plain byte string (str) and then decode it, it works:

>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June

My question is: how do I convert the unicode string I'm getting into a plain byte string so that I can decode it?

score 7 · Accepted Answer

It looks like your string was decoded as latin1 (given that it is of type unicode).
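
One way to check this diagnosis is sketched below. It is written in Python 3 syntax, which is an assumption (the session in the question is Python 2.7, where the same codecs exist), and the variable names are illustrative only. Decoding the raw byte 0x92 as latin1 yields the code point U+0092 seen in the question, while cp1252 maps the same byte to the right single quotation mark U+2019.

```python
# Python 3 sketch; variable names are illustrative, not from the question.
raw = b'There\x92s thirty days in June'  # bytes as the API presumably sent them

as_latin1 = raw.decode('latin1')  # latin1 maps every byte to the same code point
as_cp1252 = raw.decode('cp1252')  # cp1252 maps byte 0x92 to U+2019

print(ascii(as_latin1))  # contains \x92, matching the string in the question
print(ascii(as_cp1252))  # contains \u2019, the intended apostrophe
```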

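1. To convert it back to the bytes it originally was, encode it with that same encoding (latin1).
2. Then, to get text (unicode) back, decode it with the correct codec (cp1252).
3. Finally, if you want UTF-8 bytes, encode it with the UTF-8 codec.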

In code:

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

Depending on whether the API takes text (unicode) or bytes, step 3 may not be necessary.
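
The three steps can also be wrapped in a small helper. This is a minimal sketch in Python 3 syntax (an assumption; in Python 2.7 the same chain works on unicode objects as shown above), and the function name `repair_cp1252` is hypothetical, not part of any library.

```python
def repair_cp1252(text):
    """Undo a latin1 mis-decoding of cp1252 bytes and return proper text."""
    original_bytes = text.encode('latin1')  # step 1: recover the original bytes
    return original_bytes.decode('cp1252')  # step 2: decode with the right codec


title = 'There\x92s thirty days in June'
fixed = repair_cp1252(title)
utf8_bytes = fixed.encode('utf-8')  # step 3: only needed if bytes are required

print(ascii(fixed))
print(repr(utf8_bytes))
```

Note that `text.encode('latin1')` raises UnicodeEncodeError if the input contains code points above U+00FF, since such a string cannot have come from a latin1 decode.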
