Context
Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2 with no custom configuration
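Running locally
I am implementing the basic word count example as a local MapReduce job. My mapper uses a simple regular expression to emit a 1 for each word on each line of the book's .txt file. The reducer counts the occurrences of each word, i.e. it sums the 1s grouped under each word.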
from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes, e.g. "they've"
WORD_REGEXP = re.compile(r"[\w']+")


class WordCounter(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word on the line
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, word, times_seen):
        # Sum the 1s grouped under each word
        yield word, sum(times_seen)


if __name__ == '__main__':
    WordCounter.run()
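Problem
The counts in the output file are correct, but the key-value pairs are not globally sorted. The results appear to be sorted alphabetically only within chunks of the data.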
"customers'" 1
"customizing" 1
"cut" 2
"cycle" 1
"cycles" 1
"d" 10
"dad" 1
"dada" 1
"daily" 3
"damage" 1
"deductible" 6
...
"exchange" 10
"excited" 4
"excitement" 1
"exciting" 4
"executive" 2
"executives" 2
"theft" 1
"their" 122
"them" 166
"theme" 2
"themselves" 16
"then" 59
"there" 144
"they've" 2
...
"anecdotes" 1
"angel" 1
"angie's" 1
"angry" 1
"announce" 2
"announced" 1
"announcement" 3
"announcements" 3
"announcing" 2
...
"patents" 3
"path" 19
"paths" 1
"patterns" 1
"pay" 45
"exercise" 1
"exercises" 1
"exist" 6
"expansion" 1
"expect" 11
"expectation" 3
"expectations" 5
"expected" 4
...
"customer" 41
"customers" 122
"yours" 15
"yourself" 78
"youth" 1
"zealand" 1
"zero" 7
"zoho" 1
"zone" 2
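The listing above looks like what would result if several reduce tasks each wrote their own sorted chunk and the chunks were then concatenated. Below is a minimal sketch of that effect in plain Python. It assumes, without this being confirmed for mrjob's local runner, that keys are split across tasks by a hash; `NUM_TASKS`, `partition_of`, and `simulate` are hypothetical names used only for illustration.

```python
from collections import defaultdict
import zlib

NUM_TASKS = 3  # hypothetical number of reduce tasks


def partition_of(word, num_tasks=NUM_TASKS):
    # Deterministic hash so the example behaves the same on every run.
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def simulate(counts, num_tasks=NUM_TASKS):
    """Sort keys inside each partition, then concatenate the partitions."""
    partitions = defaultdict(list)
    for word, n in counts.items():
        partitions[partition_of(word, num_tasks)].append((word, n))
    output = []
    for task in range(num_tasks):
        output.extend(sorted(partitions[task]))
    return output


counts = {"cut": 2, "dad": 1, "their": 122, "angel": 1, "pay": 45, "zone": 2}
for word, n in simulate(counts):
    print(word, n)
```

Each chunk is sorted on its own, but the concatenation need not be. With a single task (`num_tasks=1`) the same function returns a globally sorted list.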
Question
Is there some initial configuration required to get globally sorted output from MRJob?
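One fallback that does not depend on any MRJob configuration is to sort the output after the job finishes. Below is a minimal sketch, assuming each output line is a JSON-encoded key and a JSON-encoded value separated by a tab (which is what the listing above looks like); `sort_output_lines` is a hypothetical helper, not part of mrjob.

```python
import json


def sort_output_lines(lines):
    """Return (word, count) pairs sorted by word."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        pairs.append((json.loads(key), json.loads(value)))
    return sorted(pairs)


# Hypothetical sample lines in the assumed "key<TAB>value" format
sample = ['"theft"\t1\n', '"cut"\t2\n', '"angel"\t1\n']
for word, count in sort_output_lines(sample):
    print(f"{json.dumps(word)}\t{count}")
```

This loads every pair into memory, so it only suits outputs that fit on one machine. It would not be a substitute for a sort performed by the job itself.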