I'm looking for a simple way of converting a user-supplied string to UTF-8. It doesn't have to be very smart; it should handle all ASCII byte strings and all Unicode strings (2.x unicode, 3.x str).
Since unicode is gone in 3.x and str changed meaning, I thought it might be a good idea to check for the presence of a decode method and call it without arguments to let Python figure out what to do based on the locale, instead of doing isinstance checks. Turns out that's not a good idea at all:
>>> u"één"
u'\xe9\xe9n'
>>> u"één".decode()
Traceback (most recent call last):
File "<ipython-input-36-85c1b388bd1b>", line 1, in <module>
u"één".decode()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
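For reference, the duck-typed helper I had in mind looks roughly like this (just a sketch; to_utf8 is my own name for it, nothing standard):

def to_utf8(s):
    # If the object has a decode method, treat it as a byte string and
    # decode it without arguments. This is the call that blows up for
    # 2.x unicode values, as the traceback above shows.
    if hasattr(s, 'decode'):
        s = s.decode()
    # Encode the resulting text object to UTF-8 bytes.
    return s.encode('utf-8')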
My question is two-fold:
- Why is there a unicode.decode method at all? I thought Unicode strings were considered "not encoded". This looks like a sure way of getting doubly encoded strings.
- How do I tackle this problem in a way that is forward-compatible with Python 3?