I am running a basic MapReduce program with hadoop-streaming.
The mapper looks like this:
import sys
index = int(sys.argv[1])
max = 0
for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():
        val = int(fields[index])
        if val > max:
            max = val
else:
    print max
I run it with:
hadoop jar /usr/local/Cellar/hadoop/1.0.3/libexec/contrib/streaming/hadoop-streaming-1.0.3.jar -D mapred.reduce.tasks=1 -input input -output output -mapper '/Users/hhimanshu/code/p/java/hadoop-programs/hadoop-programs/src/main/python_scripts/AttributeMax.py 8' -file /Users/me/code/p/java/hadoop-programs/hadoop-programs/src/main/python_scripts/AttributeMax.py
I read in Hadoop in Action that with mapred.reduce.tasks=1:
"As we haven't specified any particular reducer, it will use the default IdentityReducer. As its name implies, IdentityReducer passes its input straight to output."
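To make the quoted behavior concrete, here is a minimal sketch of what an identity-style reducer amounts to in a streaming job: every input line is written to the output unchanged. The function name identity_reduce is hypothetical and used only for illustration; this is not Hadoop's IdentityReducer class, just a Python stand-in with the same pass-through behavior.

```python
import sys


def identity_reduce(lines, out):
    """Copy every input line to the output unchanged (pass-through)."""
    for line in lines:
        out.write(line)


def main():
    # In a streaming job the reducer reads from stdin and writes to stdout.
    identity_reduce(sys.stdin, sys.stdout)
```

Calling main() at the bottom of a script would make it behave much like using cat as the reducer.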
When I look at my console, I see:
12/07/30 16:01:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/07/30 16:01:33 WARN snappy.LoadSnappy: Snappy native library not loaded
12/07/30 16:01:33 INFO mapred.FileInputFormat: Total input paths to process : 1
12/07/30 16:01:34 INFO streaming.StreamJob: getLocalDirs(): [/Users/me/app/hadoop/tmp/mapred/local]
12/07/30 16:01:34 INFO streaming.StreamJob: Running job: job_201207291003_0037
12/07/30 16:01:34 INFO streaming.StreamJob: To kill this job, run:
12/07/30 16:01:34 INFO streaming.StreamJob: /usr/local/Cellar/hadoop/1.0.3/libexec/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201207291003_0037
12/07/30 16:01:34 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201207291003_0037
12/07/30 16:01:35 INFO streaming.StreamJob: map 0% reduce 0%
12/07/30 16:01:51 INFO streaming.StreamJob: map 100% reduce 0%
It makes no further progress and just keeps running. It does not seem to work; how can I fix this?
Update
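With
-D mapred.reduce.tasks=0
I see two files, part-00000 and part-00001, and both contain the single line 0.
With
-D mapred.reduce.tasks=1
and -reduce 'cat', the behavior is the same as before: the reduce step does not appear to do anything.
When I run
cat file | python AttributeMax.py 8
I get 868.
This means that -D mapred.reduce.tasks=0 and cat file | python AttributeMax.py 8 do not produce the same output (but they should, right?).
What could cause this difference in behavior when the input data is the same?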
Update 1
- With -D mapred.reduce.tasks=0 I see 4 files, part-00000, part-00001, part-00002 and part-00003, each containing a single line: 268, 706, 348, 868.
- When I run
$ cat ~/Downloads/hadoop/input/apat63_99.txt | python ../../../src/main/python_scripts/AttributeMax.py 8 | cat
I do see the desired output of 868.
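One possible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: with zero reducers each map task writes the maximum of its own input split to its own part file, so 268, 706, 348 and 868 would be per-split maxima, and the overall maximum is the largest of them. A minimal sketch of a reducer that takes that last step is shown here; the names max_of_lines and main are hypothetical.

```python
import sys


def max_of_lines(lines):
    """Return the largest integer found among the input lines, or None."""
    best = None
    for line in lines:
        # strip() drops the newline and any surrounding whitespace or tab.
        token = line.strip()
        if token.isdigit():
            value = int(token)
            if best is None or value > best:
                best = value
    return best


def main():
    # Read the per-mapper maxima from stdin and print the global maximum.
    result = max_of_lines(sys.stdin)
    if result is not None:
        print(result)
```

Saved as a script that calls main(), something like this could be passed to the job through the streaming -reducer option together with -D mapred.reduce.tasks=1, and shipped with -file in the same way as the mapper script.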