
I'm new to MRJob and MapReduce, and I have a question about the traditional MRJob word-count example in Python:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()

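Is it possible to store the (word, sum(occurrences)) tuples in a dictionary instead of yielding them, so that I can access them later? What would the syntax be? Thanks!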


2 Answers


You can simply use a list instead of yield:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        results = []
        for word in line.split():
            results.append((word, 1))  # note: append a (word, 1) tuple here
        return results

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
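
A list can stand in for yield here because code that consumes mapper output only needs something it can loop over for (key, value) pairs. The plain-Python sketch below does not use mrjob; the two function names are made up for illustration, and it only shows that both styles produce the same pairs.

```python
def mapper_with_yield(line):
    """Generator style: hand back one (word, 1) pair at a time."""
    for word in line.split():
        yield word, 1


def mapper_with_list(line):
    """List style: collect every (word, 1) pair, then return them all."""
    results = []
    for word in line.split():
        results.append((word, 1))
    return results


line = "to be or not to be"
pairs_from_generator = list(mapper_with_yield(line))
pairs_from_list = mapper_with_list(line)

# Both mappers emit the same pairs in the same order.
print(pairs_from_generator == pairs_from_list)
```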
answered 2012-12-13T07:38:08.777

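Keep in mind that the job you write will be run on another server. Input and output are treated as concerns to be managed by the script that runs the module.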

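If you want to use the job's output, you need to either read it from wherever it was written (stdout by default) or run the job programmatically.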

It sounds like you want the latter. In a separate module, you would do something like this:

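# Assumes the job class above is saved as mr_word_counter.py (hypothetical name).
from mr_word_counter import MRWordCounter

mr_job = MRWordCounter(args=['-r', 'emr'])  # '-r emr' runs the job on Amazon EMR
with mr_job.make_runner() as runner:
    runner.run()
    # Stream the job's output and decode each line into a (key, value) pair.
    # Newer mrjob releases use mr_job.parse_output(runner.cat_output()) instead.
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        ...  # do something with the parsed output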

See the documentation for more details. The code sample above is taken from: http://pythonhosted.org/mrjob/guides/runners.html#runners-programmatically
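
For the other option, reading output that has already been written, each line in mrjob's default output format is a JSON-encoded key and a JSON-encoded value separated by a tab. Here is a minimal sketch of loading such lines into a dictionary. The helper name `load_counts` and the sample lines are made up for illustration, and it assumes the job's output protocol was left at its default.

```python
import json


def load_counts(lines):
    """Build a {word: count} dict from tab-separated JSON output lines."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        raw_key, raw_value = line.split("\t", 1)
        counts[json.loads(raw_key)] = json.loads(raw_value)
    return counts


# Hypothetical sample of what a word-count job might write to stdout.
sample_output = ['"hello"\t2\n', '"world"\t1\n']
word_counts = load_counts(sample_output)
print(word_counts["hello"])
```

Because `load_counts` accepts any iterable of lines, an open file handle on the saved job output can be passed in directly.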

answered 2013-07-09T21:16:56.040