python - What is the right way to convert to unicode?

Question

Suppose you have a string

s = "C:\Users\Eric\Desktop\beeline.txt"

and you want to convert it to Unicode if it is not Unicode already.

return s if PY3 or type(s) is unicode else unicode(s, "unicode_escape")

If the string may contain \U (for example, a user directory), you will probably run into a Unicode decode error.

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 3-4: truncated \UXXXXXXXX escape
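The failure can be reproduced on Python 3 as well. The following is a minimal sketch, assuming the path arrives as a byte string; the names `raw`, `failed`, and `as_text` are only illustrative.

```python
# The path as raw bytes; each "\\" in the literal is a single backslash byte.
raw = b"C:\\Users\\Eric\\Desktop\\beeline.txt"

failed = False
try:
    # unicode_escape interprets backslash sequences, so "\U" is read as
    # the start of a \UXXXXXXXX escape and "sers\Eri" is not valid hex.
    raw.decode("unicode_escape")
except UnicodeDecodeError:
    failed = True

# Decoding with a real character encoding leaves the backslashes alone.
as_text = raw.decode("ascii")

print(failed)
print(as_text)
```

The point of the sketch is that unicode_escape is an escape-sequence interpreter rather than a character encoding, so any backslash in the data can change the result.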

Is there anything wrong with forcing it like this:

return s if PY3 or type(s) is unicode else unicode(s.encode('string-escape'), "unicode_escape")

Or is it OK to check explicitly for the presence of \U, since it is the only corner case?

I want the code to work on both Python 2 and Python 3.

score 0 · Accepted Answer

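It works fine for English, but when faced with actual unicode examples, forcing the conversion may not use the same encoding that would be used by default, which leaves you with unpleasant errors.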

I wrapped your given code in a function called assert_unicode (replacing the is check with isinstance) and ran a test on text in Hebrew (it simply says "hello"). Check it out:

In [1]: def assert_unicode(s):
            return s if isinstance(s, unicode) else unicode(s, 'unicode_escape')    

In [2]: assert_unicode(u'שלום')
Out[2]: u'\u05e9\u05dc\u05d5\u05dd'

In [3]: assert_unicode('שלום')
Out[3]: u'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'

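See? Both return a unicode object, but there is still a big difference: the second one holds the eight UTF-8 bytes, each turned into its own code point, rather than the four Hebrew letters. If you try to print or work with the second example, it will probably fail (for example, a simple print failed for me, and I am using Console2, which is quite unicode-friendly).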
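The same difference can be shown on Python 3. This is a small sketch, assuming the byte string is the UTF-8 encoding of the Hebrew word; the variable names are illustrative only.

```python
# The four Hebrew letters, written with escapes so the source stays ASCII.
text = u"\u05e9\u05dc\u05d5\u05dd"
raw = text.encode("utf-8")          # 8 bytes, 2 per letter

# unicode_escape maps every non-escape byte to the code point with the
# same number, so the result matches a Latin-1 decode: 8 wrong characters.
wrong = raw.decode("unicode_escape")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("utf-8")

print(len(wrong), len(right))
```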

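How to solve this? Use utf-8. It is a standard nowadays, and as long as you make sure everything else is treated as utf-8 too, it should work like a charm for any given language: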

In [4]: def assert_unicode(s):
            return s if isinstance(s, unicode) else unicode(s, 'utf-8')    

In [5]: assert_unicode(u'שלום')
Out[5]: u'\u05e9\u05dc\u05d5\u05dd'

In [6]: assert_unicode('שלום')
Out[6]: u'\u05e9\u05dc\u05d5\u05dd'
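Since the question asks for code that runs on both Python 2 and Python 3, here is one possible portable variant of the function above. It is a sketch, not the answer's own code: the `PY3` flag, the `text_type` alias, and the `encoding` parameter are assumptions added for illustration.

```python
import sys

PY3 = sys.version_info[0] >= 3
# On Python 3 the text type is str; the name `unicode` is only looked up
# on Python 2, where it exists.
text_type = str if PY3 else unicode  # noqa: F821


def assert_unicode(s, encoding="utf-8"):
    """Return s unchanged if it is already text, else decode it."""
    if isinstance(s, text_type):
        return s
    return s.decode(encoding)


hebrew_bytes = b"\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
print(assert_unicode(hebrew_bytes) == u"\u05e9\u05dc\u05d5\u05dd")
```

Passing text through untouched matters because calling decode on a unicode object in Python 2 triggers an implicit ASCII encode first, which fails on non-ASCII text.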

score 0 · Accepted Answer

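The routine below is similar in spirit to @yuvi's answer, but it goes through multiple encodings (configurable) and returns the encoding that was used. It also handles errors more gracefully (by only converting objects that are basestrings).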

# Unicode practice (Python 2): this routine forces string-like objects to unicode,
# preferring utf-8 but working through other encodings on error.
# Return values are the decoded string and the name of the encoding used.
# Note: latin-1 accepts every byte sequence, so encodings listed after it are never tried.
def to_unicode_or_bust_multiple_encodings(obj, encoding=('utf-8', 'latin-1', 'Windows-1252')):
  # only byte strings need decoding; unicode and non-string objects pass through
  if isinstance(obj, basestring) and not isinstance(obj, unicode):
    for elem in encoding:
      try:
        # if we succeed then exit early
        return unicode(obj, elem), elem
      except (UnicodeDecodeError, LookupError):
        # encoding did not work, try the next one
        pass

  return obj, 'no_encoding_found'
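For comparison, a Python 3 flavoured sketch of the same fallback idea follows. The function name `decode_with_fallback` and the default encoding order are assumptions made for this example, not part of the answer. The order places latin-1 last because it accepts every possible byte and would otherwise hide the encodings after it.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn; return (text, encoding_used).

    Objects that are not bytes are returned unchanged with None.
    """
    if not isinstance(data, bytes):
        return data, None
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"caf\xe9"))      # not UTF-8, falls through
```

A successful decode only means the bytes are legal in that encoding, not that the guess is right, so the returned encoding name is best treated as a hint.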

score 0 · Accepted Answer

What is the right way to convert to unicode?

Here it is:

unicode_string = bytes_object.decode(character_encoding)

Now the question becomes: I have a sequence of bytes, so which character encoding should I use to convert them to a Unicode string?

The answer depends on where the bytes come from.

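In your case, the bytestring is specified using a Python literal for bytestrings (Python 2), so the encoding is the character encoding of your Python source file. If there is no character encoding declaration at the top of the file (a comment that looks like # -*- coding: utf-8 -*-), then the default source encoding is 'ascii' on Python 2 ('utf-8' on Python 3). So the answer in your case is: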

if isinstance(s, str) and not PY3:
   return s.decode('ascii')

Or you could use Unicode literals directly (Python 2 and Python 3.3+):

unicode_string = u"C:\\Users\\Eric\\Desktop\\beeline.txt"
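The two options above can be checked against each other. The snippet is a minimal sketch, assuming the bytes really are pure ASCII as in the question's path; the variable names are illustrative.

```python
# Byte-string literal: each "\\" is one backslash byte.
path_bytes = b"C:\\Users\\Eric\\Desktop\\beeline.txt"

# Unicode literal with the same doubled backslashes.
path_text = u"C:\\Users\\Eric\\Desktop\\beeline.txt"

# Decoding the bytes with the source encoding gives the same text.
decoded = path_bytes.decode("ascii")
print(decoded == path_text)

# ASCII only covers bytes 0-127, so anything else must be decoded with
# the encoding it was actually written in.
try:
    b"caf\xc3\xa9".decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```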
