I'm looking for a simple way of converting a user-supplied string to UTF-8. It doesn't have to be very smart; it should handle all ASCII byte strings and all Unicode strings (2.x unicode, 3.x str).
Since unicode is gone in 3.x and str changed meaning, I thought it might be a good idea to check for the presence of a decode method and call it without arguments to let Python figure out what to do based on the locale, instead of doing isinstance checks. Turns out that's not a good idea at all:
>>> u"één"
u'\xe9\xe9n'
>>> u"één".decode()
Traceback (most recent call last):
File "<ipython-input-36-85c1b388bd1b>", line 1, in <module>
u"één".decode()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
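For reference, the duck-typed helper I had in mind looks roughly like this (just a sketch; to_utf8 is my own name for it, nothing standard):

def to_utf8(s):
    # If the object has a decode method, treat it as a byte string and
    # decode it without arguments. This is the call that blows up for
    # 2.x unicode values, as the traceback above shows.
    if hasattr(s, 'decode'):
        s = s.decode()
    # Encode the resulting text object to UTF-8 bytes.
    return s.encode('utf-8')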
My question is two-fold:
- Why is there a unicode.decode method at all? I thought Unicode strings were considered "not encoded". This looks like a sure way of getting doubly encoded strings.
- How do I tackle this problem in a way that is forward-compatible with Python 3?