我有两个小的 python 脚本
CountWordOccurence_mapper.py
#!/usr/bin/env python
import sys
#print(sys.argv[1])
text = sys.argv[1]
wordCount = text.count(sys.argv[2])
#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)
PrintWordCount_reducer.py
#!/usr/bin/env python
import sys
finalCount = 0
for line in sys.stdin:
line = line.strip()
word, count = line.split('\t')
count=int(count)
finalCount += count
print(word,finalCount)
我执行如下相同:
$ ./CountWordOccurence_mapper.py \
"I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
"Accord" \
| /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)
正如所见,我的目标是愚蠢的 - 数一数。给定文本中提供的单词(在本例中为 Accord)的出现次数。
现在,我打算使用 Hadoop 流执行相同的操作。HDFS(部分)上的文本文件是:
"message" : "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it. Yesterday I tried to start it unsuccessfully. After hours at the auto mechanics it was found that there was a glitch in the electric/computer system. The news was disappointing enough (and expensive) but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful. When I bought a NEW Honda I thought I bought quality. I was wrong! Will Honda step up?"
我修改了 CountWordOccurence_mapper.py
#!/usr/bin/env python
import sys
for text in sys.stdin:
wordCount = text.count(sys.argv[1])
print '%s\t%s' % (sys.argv[1], wordCount)
我的第一个困惑是 - 如何发送要计算的单词,例如“Accord”、“Honda”作为映射器的参数(-cmdenv name=value)只会让我感到困惑。我仍然继续执行以下命令:
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
-input /random_data/honda_service_within_warranty.txt \
-output /random_op/cnt.txt \
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py
正如预期的那样,作业失败了,我收到以下错误:
Traceback (most recent call last):
File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
请更正我所犯的语法和基本错误。
谢谢并恭祝安康 !