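Context

Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2 with no custom configuration
Running locally

I'm implementing a basic word-count example as a local MapReduce job. My mapper uses a simple regular expression to map a 1 to every word on every line of a book's .txt file. The reducer counts the occurrences of each word, i.e. the number of 1s grouped under each word.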

from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")

class WordCounter(MRJob):
  def mapper(self, _, line):
    words = WORD_REGEXP.findall(line)
    for word in words:
      yield word.lower(), 1

  def reducer(self, word, times_seen):
    yield word, sum(times_seen)

if __name__ == '__main__':
  WordCounter.run()

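Problem

The counts in the output file are correct, but the key-value pairs are not globally sorted. The results appear to be sorted alphabetically only within chunks of data.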

"customers'"    1
"customizing"   1
"cut"   2
"cycle" 1
"cycles"    1
"d" 10
"dad"   1
"dada"  1
"daily" 3
"damage"    1
"deductible"    6
...
"exchange"  10
"excited"   4
"excitement"    1
"exciting"  4
"executive" 2
"executives"    2
"theft" 1
"their" 122
"them"  166
"theme" 2
"themselves"    16
"then"  59
"there" 144
"they've"   2
...
"anecdotes" 1
"angel" 1
"angie's"   1
"angry" 1
"announce"  2
"announced" 1
"announcement"  3
"announcements" 3
"announcing"    2
...
"patents"   3
"path"  19
"paths" 1
"patterns"  1
"pay"   45
"exercise"  1
"exercises" 1
"exist" 6
"expansion" 1
"expect"    11
"expectation"   3
"expectations"  5
"expected"  4
....
"customer"  41
"customers" 122
"yours" 15
"yourself"  78
"youth" 1
"zealand"   1
"zero"  7
"zoho"  1
"zone"  2

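One plausible reading of this pattern, assuming the output is assembled from several parts that are each sorted on their own, can be sketched in plain Python. The three `part_*` lists below are hypothetical; their words and counts are copied from the listing above, but how they were actually grouped is a guess. The sketch does not use mrjob at all; it only shows that concatenating sorted chunks is not the same as a global sort, and that merging them is.

```python
import heapq

# Hypothetical output parts, each already sorted by key on its own.
# The grouping into parts is an assumption made for illustration.
part_a = [("exchange", 10), ("excited", 4), ("executives", 2)]
part_b = [("theft", 1), ("their", 122), ("them", 166)]
part_c = [("anecdotes", 1), ("angel", 1), ("angry", 1)]

# Concatenating the parts keeps each chunk sorted internally,
# but the combined sequence is not globally sorted.
concatenated = part_a + part_b + part_c
is_globally_sorted = concatenated == sorted(concatenated)

# Merging the already-sorted parts yields one globally sorted sequence.
merged = list(heapq.merge(part_a, part_b, part_c))

if __name__ == "__main__":
    print(is_globally_sorted)
    for word, count in merged:
        print(f'"{word}"\t{count}')
```

If that assumption holds, sorting or merging the output lines after the job finishes would be enough to get a single alphabetical listing.
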
Question

Is some initial configuration required to get globally sorted output from MRJob?

1 Answer

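You're missing the combiner step. It appears in the first example of a single-step job in this guide: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html

I'll copy the code here so that this answer is complete: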

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
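
To make it clearer what the combiner step contributes, here is a minimal plain-Python sketch that does not use mrjob. The `splits` data and the helper names `combine` and `reduce_counts` are hypothetical, made up for illustration. It assumes a combiner simply pre-sums counts within each mapper's output before they reach the reducer, and it does not model how the output is ordered.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")

def mapper(line):
    # Emit (word, 1) for every word on the line, as in the job above.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def combine(pairs):
    # Pre-sum counts within a single input split (the combiner's role).
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def reduce_counts(pairs):
    # Sum all counts per word (the reducer's role).
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input: two splits, each a list of lines.
splits = [["the cat and the dog", "The dog"], ["the end"]]

# Without a combiner, every (word, 1) pair goes to the reducer.
raw_pairs = [p for split in splits for line in split for p in mapper(line)]

# With a combiner, each split ships one pair per distinct word.
combined_pairs = [
    p
    for split in splits
    for p in combine(q for line in split for q in mapper(line))
]

without_combiner = reduce_counts(raw_pairs)
with_combiner = reduce_counts(combined_pairs)

if __name__ == "__main__":
    print(len(raw_pairs), len(combined_pairs))
    print(with_combiner == without_combiner)
```

In this sketch the totals are identical either way; the combiner only reduces the number of pairs handed to the reducer.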
answered 2018-05-10T00:29:52.143