python - Running MRJob from an IPython notebook

Question

I am trying to run the mrjob example from an IPython notebook:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

I then run it with this code:

mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

and get the error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?

score 3 · Accepted Answer

I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell's contents to a file:

%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Then, in a later cell, run that file with mrjob:

import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

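Note that I named my file wordcount.py and imported the class MRWordFrequencyCount from the module wordcount; the file name and the module name must match. Python also caches imported modules, so when you change the wordcount.py file, IPython does not reload the module but keeps using the old cached one. That is why I put the reload() call in there.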
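The caching behaviour described above can be reproduced without mrjob. The following is a minimal sketch, assuming Python 3, where reload lives in importlib rather than being the Python 2 builtin used in the answer. The module name demo_job_module and its VERSION constant are hypothetical stand-ins for wordcount.py.

```python
import importlib
import os
import sys
import tempfile

# Skip .pyc files so that a quick rewrite of the source is never
# masked by stale cached bytecode.
sys.dont_write_bytecode = True

module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "demo_job_module.py")  # hypothetical name


def write_module(version):
    """Write a tiny module, playing the role of the %%file cell."""
    with open(module_path, "w") as f:
        f.write("VERSION = %d\n" % version)


write_module(1)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()  # make the import system notice the new file

import demo_job_module
first = demo_job_module.VERSION

# Edit the file on disk, then import again: the cached module is reused.
write_module(2)
import demo_job_module
stale = demo_job_module.VERSION

# Only an explicit reload re-executes the changed source.
importlib.reload(demo_job_module)
fresh = demo_job_module.VERSION

print(first, stale, fresh)
```

The second import statement is effectively a no-op because the module is already in sys.modules, which is the situation the reload() call in the answer works around.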

Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ

Update (shorter)
For a shorter second notebook cell, you can run the mrjob by calling the shell from the notebook:

! python mrjob.py shakespeare.txt

Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
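For the shell approach to do anything, the job file has to start the job itself when run as a script. A rough sketch of that setup follows, assuming Python 3.7+ and an installed mrjob. The helper names write_job_file, job_command and run_job, and the file name wordcount_job.py, are made up for illustration. subprocess is used here as a plain-Python counterpart to the notebook's ! syntax.

```python
import subprocess
import sys

# Source of the job file. Unlike the %%file cell above, it ends with a
# __main__ guard so that "python wordcount_job.py input.txt" runs the job.
JOB_SOURCE = '''\
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    # Entry point used when the file is run as a script.
    MRWordFrequencyCount.run()
'''


def write_job_file(path):
    """Write the job source to disk (hypothetical helper)."""
    with open(path, "w") as f:
        f.write(JOB_SOURCE)


def job_command(script_path, input_path):
    """Build the command line: the current interpreter, the script, the input."""
    return [sys.executable, script_path, input_path]


def run_job(script_path, input_path):
    """Run the job script in a child process and return its stdout as text."""
    result = subprocess.run(
        job_command(script_path, input_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

In a notebook cell, calling write_job_file('wordcount_job.py') and then print(run_job('wordcount_job.py', 'example.txt')) should print the job's key/value output, provided mrjob is installed and example.txt exists. The script should not itself be named mrjob.py, because it would then shadow the mrjob package when the import inside it runs.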

score 1 · Accepted Answer

I suspect this is due to the limitation stated on the MRJob website:

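The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.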

Alternatively, it may be because you don't have the following (reference):

if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
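What this guard does can be checked with the standard library alone. The sketch below assumes Python 3.7+. The file guard_demo.py and its run() function are hypothetical and merely stand in for a job file: importing the file leaves the guarded call untouched, while running it as a script triggers it.

```python
import importlib
import os
import subprocess
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the temporary directory free of .pyc files

work_dir = tempfile.mkdtemp()
script_path = os.path.join(work_dir, "guard_demo.py")  # hypothetical file name

with open(script_path, "w") as f:
    f.write(
        "def run():\n"
        "    print('job launched')\n"
        "\n"
        "if __name__ == '__main__':\n"
        "    run()\n"
    )

# Imported as a module: __name__ is 'guard_demo', so run() is not called.
sys.path.insert(0, work_dir)
importlib.invalidate_caches()
import guard_demo
imported_name = guard_demo.__name__

# Executed as a script: __name__ is '__main__', so run() is called.
script_output = subprocess.run(
    [sys.executable, script_path],
    capture_output=True,
    text=True,
    check=True,
).stdout

print(imported_name)
print(script_output.strip())
```

This is why a file containing the guard can still be imported from code that builds its own runner: the import does not launch a second job.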
