python - Working with UTF-8 in Python

Question

As it is summer now, I decided to learn a new language, and Python was my choice. What I would really like to learn is how to manipulate Arabic text using Python. I have found many great resources on using Python. However, when I apply what I learned to Arabic strings, I get numbers and letters mixed together.

Take for example this for English:

>>> ebook = 'The American English Dictionary'
>>> ebook[2]
'e'

Now, for Arabic:

>>> abook = 'القاموس العربي'
>>> abook[2]
'\xde'                  #the correct output should be 'ق'

However, using print works fine, as in:

>>> print abook[2]
ق

What do I need to modify to get Python to always recognize Arabic letters?

score 4 · Accepted Answer

Use Unicode explicitly:

>>> s = u'القاموس العربي'
>>> s
u'\u0627\u0644\u0642\u0627\u0645\u0648\u0633 \u0627\u0644\u0639\u0631\u0628\u064a'
>>> print s
القاموس العربي

>>> print s[2]
ق

Or even character by character:

>>> for i, c in enumerate(s):
...     print i,c
... 
0 ا
1 ل
2 ق
3 ا
4 م
5 و
6 س
7  
8 ا
9 ل
10 ع
11 ر
12 ب
13 ي

I recommend the Python Unicode page which is short, practical and useful.

score 3 · Accepted Answer

Use Python 3.x: strings are now Unicode. See the Python 3 "What's New" page.

>>> abook = 'القاموس العربي'
>>> abook[0]
'ا'
>>> abook[4]
'م'
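
To make the difference concrete, here is a minimal Python 3 sketch (the variable names are my own, not from the answer). It assumes the source file is saved as UTF-8 and contrasts indexing the text, which works per character, with indexing its UTF-8 encoding, which works per byte.

```python
# Python 3: a str is a sequence of Unicode code points.
abook = 'القاموس العربي'

third_char = abook[2]             # third character of the text
utf8_bytes = abook.encode('utf-8')
third_byte = utf8_bytes[2]        # third byte of the encoding, an int

print(third_char)
print(len(abook), len(utf8_bytes))  # characters vs. bytes
print(hex(third_byte))
```

Each Arabic letter here takes two bytes in UTF-8, which is why the byte count is almost twice the character count.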

score 1 · Accepted Answer

If you want the input:

>>> abook[2]

to produce the following output:

'ق'

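that will never happen. The interactive shell prints repr(abook[2]), which in Python 2 always uses escape sequences for Arabic characters. I don't know the exact rules, but I would guess that most characters outside the ASCII range are escaped. To make it work as advertised, you use the u prefix, but it will still output an escape sequence (although the correct one this time):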

>>> abook = u'القاموس العربي'
>>> abook[2]
u'\u0642'
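
For comparison, here is a small sketch of how this looks in Python 3, where the rules changed. There, repr() leaves printable non-ASCII characters as they are, and the built-in ascii() gives the escaped form that Python 2's repr() produced. The variable names are illustrative only.

```python
# Python 3: compare the three ways of showing one character.
ch = 'القاموس العربي'[2]

shown_by_repr = repr(ch)    # what the interactive prompt echoes
shown_by_ascii = ascii(ch)  # ASCII-only form with \u escapes

print(shown_by_repr)
print(shown_by_ascii)
print(ch)                   # the character itself, no quotes
```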

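The reason you get '\xde' is that without the u prefix, abook holds the phrase as encoded bytes (UTF-8 in my case) rather than as characters. My output differs from yours (perhaps because the code points were changed by copy-and-paste; I'm not sure), but the principle still holds: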

>>> abook = 'القاموس العربي'
>>> ' '.join( hex(ord(c))[-2:] for c in abook )
'd8 a7 d9 84 d9 82 d8 a7 d9 85 d9 88 d8 b3 20 d8 a7 d9 84 d8 b9 d8 b1 d8 a8 d9 8a'
>>> abook[2]
'\xd9'

You can confirm this with:

>>> abook = 'القاموس العربي'
>>> unicode(abook, 'utf-8')[2]
u'\u0642'
>>> print unicode(abook, 'utf-8')[2]
ق
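
The same check can be sketched in Python 3, where the byte string has to be created explicitly with encode() and turned back into text with decode(). The last part is my own guess and is not confirmed anywhere in the thread: the '\xde' in the question is consistent with a terminal that used the Windows-1256 (cp1256) code page, in which every Arabic letter is a single byte.

```python
abook = 'القاموس العربي'

# UTF-8: two bytes per Arabic letter, so index 2 is only half a character.
raw_utf8 = abook.encode('utf-8')
hex_dump = ' '.join('%02x' % b for b in raw_utf8)
print(hex_dump)
print(raw_utf8[2:3])

# Decoding first restores character-based indexing.
decoded = raw_utf8.decode('utf-8')
print(decoded[2])

# Assumption: the asker's terminal used cp1256 (one byte per letter).
raw_cp1256 = abook.encode('cp1256')
print(raw_cp1256[2:3])
```

If that guess is right, it would also explain why print abook[2] displayed the right letter in the question: the single byte is a complete character in that code page.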

score 0 · Accepted Answer

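Based on the results in the question's comments, this looks like repr causing a mojibake problem: it is confused about the encoding and uses the wrong one. print tries to use the encoding it thinks your STDOUT uses and writes the resulting bytes directly, while repr tries to print an ASCII-safe representation, although it seems to be failing in this case.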

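The good news is that this is a problem with repr, not with Python's Unicode handling. As long as the round trip s.encode('utf8').decode('utf8') == s holds, you are fine. Use print when you want to inspect a value instead of just naming it at the interactive prompt, and use Unicode strings everywhere. Python 3 helps with that; in Python 2 you can at least do:

from __future__ import unicode_literals
from io import open

Keep track of your encodings, and your program will work correctly even if repr does something odd.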
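
As a rough illustration of keeping track of encodings, the sketch below writes the phrase to a file and reads it back with io.open, naming the encoding explicitly both times. It is written for Python 3; the temporary file and the variable names are hypothetical and exist only for this example.

```python
import io
import os
import tempfile

text = 'القاموس العربي'

# Hypothetical scratch file, created only for this illustration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Encode on the way out: text -> UTF-8 bytes on disk.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    # Decode on the way in: UTF-8 bytes on disk -> text.
    with io.open(path, 'r', encoding='utf-8') as f:
        read_back = f.read()
    # Binary mode shows the bytes that were actually stored.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

print(read_back == text)
print(len(read_back), len(raw))
```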

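Also note that your problem has nothing to do with UTF-8. It has to do with Unicode, which is a different (though related) concept. If the resources you have been reading do not make that distinction clear, find better ones, because misunderstanding these concepts will cause you a lot of pain.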
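
One way to see the distinction is the short Python 3 sketch below. It takes a single character, looks up its Unicode code point (which does not depend on any encoding), and then encodes it in three different encodings. The choice of encodings and the names are mine, for illustration.

```python
qaf = 'ق'

# Unicode: the character has one code point, whatever the encoding.
code_point = ord(qaf)
print(hex(code_point))

# Encodings: different byte sequences for that same code point.
encoded = {name: qaf.encode(name)
           for name in ('utf-8', 'utf-16-le', 'utf-32-le')}

for name, data in encoded.items():
    # Every encoding round-trips back to the same character.
    print(name, data.hex(), data.decode(name) == qaf)
```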
