python - Incorrect character encoding comparison in Python

Question

I have a problem comparing Cyrillic characters in Python. Here is a small test case:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def convert(text):
    result = []
    for i in xrange(len(text)):
        if text[i].lower() == 'й':
            result.append('q')
    print result

if __name__ == '__main__':
    convert('йцукенг')

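As you can see, the first character should match the one in the condition. But the condition fails and the result is empty.

Also, if I print the whole string (text), it works fine, but if I print just a single character (such as text[2]), I get '?' in the output.

I am sure the problem is in the encoding, but how do I compare individual characters correctly?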

score 3 · Accepted Answer

You are seeing this behavior because you are looping over the bytes in a UTF-8 string, not over the characters. Here is an example of the difference:

>>> 'й'               # note that this is two bytes
'\xd0\xb9'
>>> 'йцукенг'[0]      # but when you loop you are looking at a single byte
'\xd0'
>>> len('йцукенг')    # 7 characters, but 14 bytes
14

This is why it is necessary to use Unicode for checking the character, as in mVChr's answer.

The easiest way to do this is to leave all of your code exactly the same, and just add a u prefix to all of your string literals (u'йцукенг' and u'й').
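
As a rough sketch of that suggestion, here is the question's function with the u prefixes added. Two small departures from the original are assumptions made for illustration: it returns the list instead of printing it, and it iterates over the characters directly rather than by index, so the snippet runs unchanged on Python 2 and on Python 3.3+.

```python
# -*- coding: utf-8 -*-

def convert(text):
    """Collect a 'q' for every 'й' (upper or lower case) in a unicode string."""
    result = []
    for ch in text:                 # iterates over characters, not bytes
        if ch.lower() == u'й':      # unicode on both sides of the comparison
            result.append('q')
    return result

if __name__ == '__main__':
    print(convert(u'йцукенг'))
```

With a unicode argument, each step of the loop sees one whole character, so the comparison with u'й' can succeed.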

score 1 · Accepted Answer

Assuming you are using Python 2.x, you should use unicode strings. Try:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def convert(text):
    result = []
    for i in xrange(len(text)):
        if text[i].lower() == unicode('й', 'utf8'):
            result.append('q')
    print result

if __name__ == '__main__':
    convert(unicode('йцукенг', 'utf8'))

Or you can simply write the unicode string literals directly: u'йцукенг' and u'й'.
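
To see why the two spellings give the same result, the following sketch decodes the UTF-8 bytes by hand and compares them with the u literals. It uses .decode('utf-8') in place of unicode(..., 'utf8') only so that it also runs on Python 3; the names raw and text are illustrative, and raw stands in for the bytes a plain 'йцукенг' literal holds in a UTF-8 source file under Python 2.

```python
# -*- coding: utf-8 -*-

raw = u'йцукенг'.encode('utf-8')    # the UTF-8 encoded byte string
text = raw.decode('utf-8')          # decoded back to unicode characters

print(len(raw))                     # number of bytes
print(len(text))                    # number of characters
print(text == u'йцукенг')           # decoded bytes equal the u literal
print(raw[0:2].decode('utf-8') == u'й')   # the first two bytes form 'й'
```

Each Cyrillic letter here takes two bytes in UTF-8, which is why indexing the byte string yields half a character.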
