python - Hadoop program in Python - reading a file with a generator

Question

I'm trying to understand how to write Hadoop programs in Python by following this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

Here is mapper.py:

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()

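I don't understand yield. read_input yields one line at a time. However, main calls read_input only once, which corresponds to the first line of the file. How do the remaining lines get read as well?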

score 1 · Accepted Answer

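main does call read_input only once, but that call does not read any lines by itself. It returns a generator, and the for loop then resumes that generator once per line.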

data = read_input(sys.stdin)
# Causes a generator to be assigned to data.
for words in data:

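On each iteration of the for loop, data (the generator returned by read_input) is advanced: the body of read_input runs until it reaches the next yield, and the yielded value is assigned to words.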
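A minimal sketch of that behaviour, assuming Python 3. Here io.StringIO is a stand-in for sys.stdin, and the events list and the fake_stdin name are hypothetical additions used only to record the order of execution. It shows that calling read_input runs none of its body, and that each next() call runs it up to the following yield.

```python
import io

events = []  # hypothetical log of what happens, in order

def read_input(file):
    events.append("body started")
    for line in file:
        events.append("yielding %r" % line.strip())
        yield line.split()
    events.append("input exhausted")

# Stand-in for sys.stdin so the example is self-contained.
fake_stdin = io.StringIO("foo bar\nbaz\n")

data = read_input(fake_stdin)  # creates the generator; the body has not run yet
assert events == []

first = next(data)   # runs the body up to the first yield
second = next(data)  # resumes right after that yield, runs to the next one

print(first, second)
print(events)
```

The for loop in main performs these next() calls implicitly, once per iteration.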

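Basically, for words in data is shorthand for "call next() on data and assign the result to words, then execute the loop block", repeated until the generator is exhausted.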
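That shorthand can be written out by hand. The following is a rough sketch, assuming Python 3; collect_with_for and collect_with_next are hypothetical helper names, and io.StringIO again stands in for sys.stdin. The second function spells out approximately what the first one's for loop does.

```python
import io

def read_input(file):
    for line in file:
        yield line.split()

def collect_with_for(file):
    # The ordinary for loop, as in main().
    result = []
    for words in read_input(file):
        result.append(words)
    return result

def collect_with_next(file):
    # Roughly what the for loop above does under the hood.
    result = []
    iterator = iter(read_input(file))  # a generator is its own iterator
    while True:
        try:
            words = next(iterator)     # resume read_input until its next yield
        except StopIteration:          # raised once the input runs out
            break
        result.append(words)
    return result

text = "foo bar\nbaz\n"
print(collect_with_for(io.StringIO(text)))
print(collect_with_next(io.StringIO(text)))
```

Both calls should produce the same list of word lists, one entry per input line.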
