python - Word counter won't print foreign characters

Question

How can I set this up so that it prints Chinese and accented characters?

from twill.commands import *
from collections import Counter

with open('names.txt') as inf:
    words = (line.strip() for line in inf)
    freqs = Counter(words)
    print (freqs)

score 3 · Accepted Answer

To handle Chinese characters correctly, I would use codecs.open instead of plain open, and pass it the file's correct encoding.

For example, if you have a file "unicode.txt" that contains the string "aèioሴ ሴ":

>>> import codecs
>>> open('unicode.txt').read()    # has utf-8 BOM
'\xef\xbb\xbfa\xc3\xa8io\xe1\x88\xb4 \xe1\x88\xb4'
>>> codecs.open('unicode.txt').read()    # without encoding is the same as open
'\xef\xbb\xbfa\xc3\xa8io\xe1\x88\xb4 \xe1\x88\xb4'
>>> codecs.open('unicode.txt', encoding='utf-8').read()
u'\ufeffa\xe8io\u1234 \u1234'

For Counters you get:

>>> Counter(open('unicode.txt').read())
Counter({'\xe1': 2, '\x88': 2, '\xb4': 2, 'a': 1, '\xc3': 1, ' ': 1, 'i': 1, '\xa8': 1, '\xef': 1, 'o': 1, '\xbb': 1, '\xbf': 1})
>>> Counter(codecs.open('unicode.txt', encoding='utf-8').read())
Counter({u'\u1234': 2, u'a': 1, u' ': 1, u'i': 1, u'\xe8': 1, u'o': 1, u'\ufeff': 1})
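Applied to the line-based counting in the question, this could look roughly like the sketch below. It makes several assumptions: the file is UTF-8, the sample contents are invented for illustration, and io.open is used because it accepts an encoding argument on both python2 and python3. The 'utf-8-sig' codec is chosen so that a leading BOM (the u'\ufeff' in the output above) is dropped instead of being counted; skipping blank lines is an extra choice that the question's code does not make.

```python
import io
import os
import tempfile
from collections import Counter

# Hypothetical sample data standing in for the question's names.txt.
sample = u'\u4e0d\nJos\xe9\n\u4e0d\n'

# Write the sample with a UTF-8 BOM so the read step has something to strip.
path = os.path.join(tempfile.mkdtemp(), 'names.txt')
with io.open(path, 'w', encoding='utf-8-sig') as out:
    out.write(sample)

# 'utf-8-sig' decodes UTF-8 and silently drops a leading BOM if present.
with io.open(path, encoding='utf-8-sig') as inf:
    words = (line.strip() for line in inf)
    freqs = Counter(word for word in words if word)  # skip blank lines

# Print each entry as text rather than relying on the Counter's repr.
for word, count in freqs.most_common():
    print(u'%s: %d' % (word, count))
```

Because each line is decoded before it reaches the Counter, the keys are whole characters and words rather than individual bytes.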

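If by "how can I set this up to print Chinese characters" you mean that print(freqs) should show something like Counter({'不': 1}), then this is not possible in python2, while it is the default behaviour in python3.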

In python2 the __str__ method of the Counter class uses the __repr__ of its string keys, so you will always see things like \u40ed instead of the real characters:

>>> Counter(u'不')
Counter({u'\u4e0d': 1})
>>> repr(u'不')
"u'\\u4e0d'"

In python3 all strings are unicode, and the repr of '不' is "'不'":

>>> Counter('不')
Counter({'不': 1})
>>> repr('不')
"'不'"
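To see both renderings side by side on python3 only, the built-in ascii() can stand in for the escaped python2-style output. This is just an illustrative sketch, not something the original answer relies on; the variable names are made up.

```python
from collections import Counter

freqs = Counter('不')

# str() keeps non-ASCII characters readable on python3.
readable = str(freqs)

# ascii() takes the repr and escapes every non-ASCII character,
# which resembles what python2 shows for unicode keys.
escaped = ascii(freqs)

print(readable)
print(escaped)
```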

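So, if you want a solution that works on both python2 and python3, you should create a function str_counter that on python3 simply returns the str of the Counter, while on python2 it has to iterate over the key and value pairs and build the string representation itself: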

>>> import sys
>>> def str_counter(counter):
...     if sys.version_info.major > 2:
...         # python3, no need to do anything
...         return str(counter)
...     # python2: we manually create a unicode representation.
...     result = u'{%s}'
...     parts = [u'%s: %s' % (unicode(key), unicode(value)) for key, value in counter.items()]
...     return result % u', '.join(parts)
... 
>>> print str_counter(Counter(u'不'))   # python2
{不: 1}
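A variant without the version check is also possible if you are happy to get the same "{key: count}" layout on both versions. The sketch below is one such option, not the answer's own function: format_counter is a hypothetical name, the keys are assumed to be unicode strings, and most_common() is used so the most frequent entries come first.

```python
from collections import Counter

def format_counter(counter):
    """Build a readable text form of a Counter whose keys are unicode strings."""
    # u'' literals keep the result unicode on python2 and are valid on python3.3+.
    parts = [u'%s: %d' % (key, count) for key, count in counter.most_common()]
    return u'{%s}' % u', '.join(parts)

# Escapes are used here so the source stays pure ASCII on python2.
text = format_counter(Counter([u'\u4e0d', u'\u4e0d', u'\xe8']))
print(text)
```

Whether the characters show up correctly on python2 still depends on the terminal's encoding, since print has to encode the unicode result.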
