python - MRJob MR: assign to a dictionary instead of yield?

Question

I'm new to MRJob and MapReduce, and I have a question about the classic word-count Python example for MRJob:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()

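Is it possible to store the (word, sum(occurrences)) tuples in a dictionary instead of yielding them, so that I can access them later? What would the syntax be? Thanks!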

score 2 · Accepted Answer

You can simply use a list instead of yield:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        results = []
        for word in line.split():
            results.append((word, 1))  # Note that the list should append a tuple here.
        return results

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
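
To see why returning a list can stand in for yield here, the following is a minimal sketch in plain Python with no mrjob involved. It assumes the framework only iterates over whatever the mapper returns; the function names mapper_with_yield and mapper_with_list are made up for illustration. Note that the list exists only for the duration of a single mapper call on whichever machine runs the task, so it does not by itself give you a dictionary you can read afterwards.

```python
def mapper_with_yield(line):
    # Generator version: emits one (word, 1) pair at a time.
    for word in line.split():
        yield word, 1


def mapper_with_list(line):
    # List version: builds all (word, 1) pairs first, then returns them.
    results = []
    for word in line.split():
        results.append((word, 1))
    return results


line = "to be or not to be"

# A caller that simply iterates over the result sees the same pairs either way.
pairs_from_yield = list(mapper_with_yield(line))
pairs_from_list = list(mapper_with_list(line))
```

Both variables end up holding the same sequence of pairs, which is all that the code consuming the mapper's result needs.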

score 0 · Accepted Answer

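Keep in mind that the job you write will run on another server. Input and output are treated as concerns to be managed by the script that runs the module.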

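If you want to use the job's output, you need to either read it from wherever it was written (standard output by default) or run the job programmatically.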
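
For the first option, reading the written output, here is a rough sketch. It assumes the job uses mrjob's default output format, where each line is a JSON-encoded key, a tab, and a JSON-encoded value. The helper name parse_job_output and the sample lines are invented for illustration; in practice the lines would come from a file or pipe holding the job's standard output.

```python
import json


def parse_job_output(lines):
    """Turn 'key<TAB>value' lines (both JSON-encoded) into a dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        raw_key, raw_value = line.split("\t", 1)
        counts[json.loads(raw_key)] = json.loads(raw_value)
    return counts


# Made-up sample of what a word-count job might print.
sample_output = ['"be"\t2\n', '"not"\t1\n', '"to"\t2\n']
word_counts = parse_job_output(sample_output)
```

If the job is configured with a different output protocol, the parsing step would need to change to match it.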

It sounds like you want the latter, running the job programmatically. In a separate module, you would do something like this:

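# Hypothetical module name: import MRWordCounter from wherever the job class is defined.
from mr_word_counter import MRWordCounter

mr_job = MRWordCounter(args=['-r', 'emr'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        ... # do something with the parsed output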

See the documentation for more details. The code sample above is taken from: http://pythonhosted.org/mrjob/guides/runners.html#runners-programmatically
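
To get the dictionary the question asks for, a sketch along these lines may help. As far as I know, mrjob 0.6 and later replaced stream_output() and parse_output_line() with runner.cat_output() and job.parse_output(), so this version assumes those newer calls; on an older release you would keep the loop from the documentation sample instead. The helper name collect_output and the input file name in the usage comment are made up.

```python
def collect_output(job):
    """Run an mrjob job object and return its output as a dict.

    Assumes a newer mrjob release where runner.cat_output() and
    job.parse_output() are available.
    """
    results = {}
    with job.make_runner() as runner:
        runner.run()
        # parse_output() decodes the raw output into (key, value) pairs.
        for key, value in job.parse_output(runner.cat_output()):
            results[key] = value
    return results


# Usage, from a module separate from the one defining the job
# (input.txt is a hypothetical input file):
# word_counts = collect_output(MRWordCounter(args=['input.txt']))
```

After the call returns, word_counts is an ordinary dictionary in the driver process, so it can be used like any other Python object.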
