I am trying to learn how to write Hadoop programs in Python by following this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

This is mapper.py:

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()

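I don't understand yield. read_input yields one line at a time. However, main calls read_input only once, which corresponds to the first line of the file. How do the remaining lines get read as well?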

1 Answer

Actually, main calls read_input only once, but the generator that call returns is resumed several times.

data = read_input(sys.stdin)
# Causes a generator to be assigned to data.
for words in data:

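On each iteration of the for loop, the next value is requested from data, which is the generator returned by read_input. The generator resumes where it left off, runs until its next yield, and the yielded value is assigned to words.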
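
To see the generator being resumed step by step, here is a minimal sketch that drives read_input by hand with next(). The io.StringIO object and its two sample lines are assumptions standing in for sys.stdin; they are not part of the tutorial.

```python
import io

def read_input(file):
    # Same generator as in mapper.py: yields one list of words per line.
    for line in file:
        yield line.split()

# Hypothetical input: io.StringIO stands in for sys.stdin.
fake_stdin = io.StringIO(u"foo bar\nbaz\n")

data = read_input(fake_stdin)  # creates the generator; no line is read yet
first = next(data)             # runs the body up to the first yield
second = next(data)            # resumes after the yield and reads line two

print(first)
print(second)
```

Each next() call picks up inside the for loop of read_input, which is why a single call to read_input is enough to reach every line.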

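Basically, for words in data is shorthand for "ask data for its next value, assign it to words, then execute the loop block", repeated until the generator has nothing left to yield.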
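
Written out by hand, the loop looks roughly like the sketch below. It relies only on the built-in iter() and next() and on the StopIteration exception; the list named lines and the list named collected are assumptions for illustration, standing in for sys.stdin and for the loop body.

```python
def read_input(file):
    # Same generator as in mapper.py.
    for line in file:
        yield line.split()

# Hypothetical input: a list of lines stands in for sys.stdin.
lines = ["a b\n", "c\n"]

data = read_input(lines)
iterator = iter(data)            # a generator is its own iterator
collected = []
while True:
    try:
        words = next(iterator)   # resume read_input until its next yield
    except StopIteration:        # raised once read_input finishes
        break
    collected.append(words)      # stands in for the loop block

print(collected)
```

The while loop ends when read_input runs out of lines, which is the same condition that ends the original for loop.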

Answered 2013-08-16T14:48:54.803