python - Simple word count with Python MapReduce on Cyrillic text

Question

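I am trying to implement a very basic word count example with MRJob. Everything works fine with ASCII input, but when I mix Cyrillic words into the input, I get output like this: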

"\u043c\u0438\u0440"    1
"again!"    1
"hello" 2
"world" 1

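As I understand it, the first line above is the encoded single occurrence of the Cyrillic word "мир", which is the correct result for my sample input text. Here is the MR code: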

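from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, key, line):
        # Input lines arrive as byte strings; decode them from cp1251 first.
        line = line.decode('cp1251').strip()
        words = line.split()
        for term in words:
            yield term, 1

    def reducer(self, term, howmany):
        # Sum the per-word counts emitted by the mapper.
        yield term, sum(howmany)


if __name__ == '__main__':
    MRWordCount.run()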

I am using Python 2.7 and mrjob 0.4.2 on Windows. My questions are:

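a) How can I produce readable Cyrillic output for Cyrillic input?
b) What is the root cause of this behavior: is it due to the Python/mrjob version, or would it be expected to work differently on non-Windows systems? Any clues?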

Here is the output I get from python -c "print u'мир'":

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

score 2 · Accepted Answer

To make this more readable in Python 2.x, you need to explicitly tell the interpreter that it is a unicode string:

>>> print(u"\u043c\u0438\u0440") # note leading u
мир

To convert a string to a unicode string, use unicode:

>>> print(unicode("\u043c\u0438\u0440", "unicode_escape"))
мир
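
On the root cause: the escaped form in the job output looks like JSON string escaping. As far as I know, mrjob's default output protocol is JSON-based, and Python's json module writes non-ASCII characters as \uXXXX escapes unless told otherwise. The sketch below uses only the standard json module (not mrjob itself) to show the effect and how an escaped value can be turned back into readable text. The sample word is the one from the question.

```python
import json

word = u"\u043c\u0438\u0440"  # the Cyrillic word from the question

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(word)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(word, ensure_ascii=False)

# Decoding the escaped form gives back the original unicode string.
decoded = json.loads(escaped)
```

If that assumption holds, the counts themselves are correct and only the serialization is ASCII-safe. Running json.loads over each tab-separated field of an output line would be one way to post-process the results into readable text.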

score 0 · Accepted Answer

To print to your console, you need to encode the characters to an encoding your terminal understands. Most of the time that'll be UTF-8: print u"\u043c\u0438\u0440".encode("utf-8"), but on Windows you might need to use another one (cp1251, maybe?).
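
A minimal sketch of that idea, assuming the console encoding is either known or can be read from sys.stdout.encoding. The helper name encode_for_console is hypothetical, and passing "replace" is my own choice so that characters the encoding cannot represent become "?" instead of raising UnicodeEncodeError. The traceback in the question suggests that the console there uses cp866.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode unicode text to bytes for the given (or detected) console encoding."""
    # sys.stdout.encoding can be None when output is piped; fall back to UTF-8.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for characters the encoding cannot represent.
    return text.encode(encoding, "replace")


word = u"\u043c\u0438\u0440"  # the Cyrillic word from the question
utf8_bytes = encode_for_console(word, "utf-8")
cp866_bytes = encode_for_console(word, "cp866")
ascii_bytes = encode_for_console(word, "ascii")
```

In Python 2 the resulting byte string can be passed straight to print. Whether it displays correctly still depends on the console font and code page.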
