I am trying to learn how to write Hadoop programs in Python by following this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
This is mapper.py:
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            # parenthesized so the line runs under both Python 2 and Python 3
            print('%s%s%d' % (word, separator, 1))


if __name__ == "__main__":
    main()
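I don't understand the use of yield. read_input generates one line at a time. However, main calls read_input only once, which corresponds to the first line of the file. How do the rest of the lines get read as well?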
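
To make the "called only once" part concrete, here is a minimal sketch, assuming standard Python generator semantics. The io.StringIO object and its three sample lines are stand-ins for sys.stdin and are not part of the tutorial. The single call to read_input does not read anything by itself; it returns a generator object, and each next() on that object (which is what a for loop does behind the scenes) resumes the function body until the next yield.

```python
import io


def read_input(file):
    for line in file:
        # each yield hands back one split line, then pauses here
        # until the caller asks for the next value
        yield line.split()


# Hypothetical stand-in for sys.stdin, used only for illustration.
fake_stdin = io.StringIO(u"foo bar\nbaz\nqux quux\n")

# One call: none of the function body has run yet,
# and no line has been read from the file so far.
data = read_input(fake_stdin)

# "for words in data" calls next(data) repeatedly; doing it by hand:
first = next(data)   # runs the body up to the first yield
second = next(data)  # resumes after the yield and reads the second line
rest = list(data)    # drains whatever lines are left

print(first)
print(second)
print(rest)
```

So the loop in main never calls read_input again; it keeps pulling values from the same generator object until the input is exhausted.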