python - Hadoop 流式调用 python 脚本

Question

我有两个小的 python 脚本

CountWordOccurence_mapper.py

#!/usr/bin/env python
import sys

#print(sys.argv[1])

text = sys.argv[1]

wordCount = text.count(sys.argv[2])

#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)

PrintWordCount_reducer.py

#!/usr/bin/env python
import sys

finalCount = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')

    count=int(count)

    finalCount += count

    print(word,finalCount)

我执行如下相同：

$ ./CountWordOccurence_mapper.py \
    "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
    "Accord" \
     | /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)

正如所见，我的目标是愚蠢的 - 数一数。给定文本中提供的单词（在本例中为 Accord）的出现次数。

现在，我打算使用 Hadoop 流执行相同的操作。HDFS（部分）上的文本文件是：

"message" : "I am a Honda customer 100%.. 94 Accord ex   96 Accord exV6  98 Accord exv6 cpe  2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower   motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it.  Yesterday   I tried to start it unsuccessfully.  After hours at the auto mechanics  it was found that there was a glitch in the electric/computer system.  The news was disappointing enough (and expensive)    but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful.  When I bought a NEW Honda I thought I bought quality.  I was wrong! Will Honda step up?"

我修改了 CountWordOccurence_mapper.py

#!/usr/bin/env python
import sys

for text in sys.stdin:  

    wordCount = text.count(sys.argv[1])

    print '%s\t%s' % (sys.argv[1], wordCount)

我的第一个困惑是 - 如何发送要计算的单词，例如“Accord”、“Honda”作为映射器的参数（-cmdenv name=value）只会让我感到困惑。我仍然继续执行以下命令：

$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
 -input /random_data/honda_service_within_warranty.txt \
 -output /random_op/cnt.txt \
 -file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
 -mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
 -file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
 -reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py

正如预期的那样，作业失败了，我收到以下错误：

Traceback (most recent call last):
  File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
    wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

请更正我所犯的语法和基本错误。

谢谢并恭祝安康！

score 1 · Accepted Answer

我认为您的问题在于命令行调用的以下部分：

-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"

我认为您在这里的假设是字符串“Accord”作为第一个参数传递给映射器。我很确定情况并非如此，事实上字符串“Accord”很可能被流驱动程序入口点类（StreamJob.java）忽略了。

要解决此问题，您需要返回使用该-cmdenv参数，然后在您的 python 代码中提取此键/值对（我不是 python 程序员，但我确信快速的 Google 会指向您需要的代码段）。

score 0 · Accepted Answer

实际上，我曾尝试使用 -cmdenv wordToBeCounted="Accord" 但问题出在我的 python 文件上 - 我忘记对其进行更改以从环境变量（而不是从参数数组）中读取值“Accord”。我附上了 CountWordOccurence_mapper.py 的代码，以防万一有人想使用它作为参考：

#!/usr/bin/env python
import sys
import os

wordToBeCounted = os.environ['wordToBeCounted']

for text in sys.stdin:  

    wordCount = text.count(wordToBeCounted)

    print '%s\t%s' % (wordToBeCounted,wordCount)

谢谢并恭祝安康！

python - Hadoop 流式调用 python 脚本

2 回答 2

Related

Reference