python - Converting a string to unicode type in Python

Question

I am trying this code:

s = "سلام"
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))

But this error occurs:

'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)

I also tried '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)), but nothing changed.

What should I do?

score 7 · Accepted Answer

Since you are using Python 2, s = "سلام" is a byte string (in whatever encoding your terminal uses, presumably UTF-8):

>>> s = "سلام"
>>> s
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
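To see why the call in the question fails on such a byte string: in Python 2, calling encode on a byte string first decodes it implicitly with the ascii codec, and that hidden step is what raises the error. The sketch below reproduces the step explicitly; the helper name implicit_ascii_decode is hypothetical and only used for illustration.

```python
# UTF-8 bytes of the word from the example above
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

def implicit_ascii_decode(data):
    """Mimic the ascii decode that Python 2 performs before str.encode()."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        # Non-ASCII bytes cannot be decoded, so encode() never gets to run
        return 'UnicodeDecodeError: {}'.format(exc)

result = implicit_ascii_decode(raw)
print(result)
```

Pure ASCII input passes through this step unchanged, which is why the problem only shows up with non-ASCII text.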

You cannot encode byte strings (they are already "encoded"). What you are looking for is a unicode ("real") string, which in Python 2 must be prefixed with u:

>>> s = u"سلام"
>>> s
u'\u0633\u0644\u0627\u0645'
>>> '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
'1101100010110011110110011000010011011000101001111101100110000101'
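The 'hex' codec used above exists only in Python 2. As a rough sketch, assuming you want the same result in code that also runs on Python 3, binascii.hexlify from the standard library can stand in for it. The helper name to_bits is hypothetical.

```python
import binascii

def to_bits(text, encoding='utf-8'):
    """Return the encoded bytes of a unicode string as a string of 0s and 1s."""
    raw = text.encode(encoding)                      # unicode -> bytes
    hex_digits = binascii.hexlify(raw).decode('ascii')
    # Note: '{:b}' drops leading zero bits of the first byte
    return '{:b}'.format(int(hex_digits, 16))

# The same word as above, written with escapes to avoid source-encoding issues
bits = to_bits(u'\u0633\u0644\u0627\u0645')
print(bits)
```

For this input the first byte is 0xd8, so no leading zeros are lost and the output matches the 64-bit string shown above.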

If you are getting a byte string from a function such as raw_input, then your string is already encoded, so just skip the encode part:

'{:b}'.format(int(s.encode('hex'), 16))

Or (if you plan to do anything else with it) convert it to unicode:

s = s.decode('utf8')

This assumes your input is UTF-8 encoded; if that is not the case, check sys.stdin.encoding first.
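One way to combine the two points above is a small helper that decodes input bytes with an explicitly given encoding, then sys.stdin.encoding, then UTF-8 as a last resort. This is only a sketch: the name to_unicode and the fallback order are assumptions, not part of the original answer.

```python
import sys

def to_unicode(raw, encoding=None):
    """Decode a byte string to unicode; return unicode input unchanged."""
    if not isinstance(raw, bytes):
        return raw  # already a unicode string
    # Prefer the caller's encoding, then the terminal's, then UTF-8
    enc = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(enc)

text = to_unicode(b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85', 'utf-8')
print(repr(text))
```

In Python 2, bytes is an alias for str, so the same check covers the value returned by raw_input.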

i18n stuff is complicated; here are two articles that can help you further:
