mapreduce - Using hadoop-streaming-0.20.205.0.jar as a custom JAR on Amazon Elastic MapReduce

Question

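When using Amazon Elastic MapReduce, I want to run Hadoop streaming with hadoop-streaming-0.20.205.0.jar instead of the streaming implementation that Elastic MapReduce provides. I need to write a custom partitioner, input format, output format, and so on.

So I tried creating a new custom JAR job: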

JAR Location: stt.streaming/hadoop-streaming-0.20.205.0.jar
JAR Arguments: 
    -input s3n://stt.streaming/test_input 
    -output s3n://stt.streaming/test_output 
    -mapper s3n://stt.streaming/mapper.py 
    -reducer s3n://stt.streaming/reducer.py

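Running the Python scripts mapper.py and reducer.py as an EMR streaming job works without any problems.

With the custom JAR job, however, I get the following error message: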

java.io.IOException: Cannot run program "s3n://stt.streaming/mapper.py": java.io.IOException: error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:166)
    at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ...
    ...
    ...
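
The trace shows `ProcessBuilder.start` being handed the string `s3n://stt.streaming/mapper.py` as the program to run, and `error=2` is the operating system's "No such file or directory" code (ENOENT). As an illustration only, the following Python sketch reproduces the same failure mode on a local machine. The helper name `try_exec` is hypothetical, and the sketch assumes a POSIX host; it does not touch S3 or Hadoop.

```python
import errno
import subprocess


def try_exec(command):
    """Try to start `command` as a local program.

    Returns None if the process could be started, otherwise the errno
    reported by the operating system.
    """
    try:
        proc = subprocess.Popen(
            [command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        proc.wait()
        return None
    except OSError as exc:
        return exc.errno


if __name__ == "__main__":
    # The URI is treated as a plain local path, which does not exist.
    code = try_exec("s3n://stt.streaming/mapper.py")
    print("errno:", code, "is ENOENT:", code == errno.ENOENT)
```

The point of the sketch is that the URI is never fetched: it is treated as a local path, so the script would have to be present on each task node under a local name before it can be executed.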

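My questions are:

Are there other potential problems with my job? I cannot tell what else might go wrong, because the job already fails when it tries to access mapper.py and reducer.py.
How can I make my mapper.py and reducer.py accessible to the job?
EMR's streaming jobs appear to use /home/hadoop/contrib/streaming/hadoop-streaming.jar. Can I get the source code for it? If I could, my problem would be solved. Thanks.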
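
One possible direction for the second question, sketched under assumptions: Hadoop streaming in this release line has a `-cacheFile <uri>#<linkname>` option that places a file in each task's working directory under `<linkname>`. The hypothetical helper `localize_scripts` below rewrites the JAR arguments so that `-mapper` and `-reducer` refer to local link names instead of URIs. Whether the distributed cache can read `s3n://` paths on a given EMR image is an assumption that has not been verified here. The scripts may also need to be invoked as `python mapper.py` if they are not executable.

```python
import posixpath

# URI schemes that a local exec call cannot resolve (illustrative list).
REMOTE_SCHEMES = ("s3://", "s3n://", "hdfs://")


def localize_scripts(args):
    """Rewrite remote -mapper/-reducer URIs into -cacheFile entries.

    Each remote script becomes `-cacheFile <uri>#<basename>` followed by
    the original flag pointing at `<basename>`. Other arguments pass
    through unchanged.
    """
    out = []
    i = 0
    while i < len(args):
        flag = args[i]
        has_value = i + 1 < len(args)
        if (
            flag in ("-mapper", "-reducer")
            and has_value
            and args[i + 1].startswith(REMOTE_SCHEMES)
        ):
            uri = args[i + 1]
            link = posixpath.basename(uri)
            out += ["-cacheFile", f"{uri}#{link}", flag, link]
            i += 2
        else:
            out.append(flag)
            i += 1
    return out


if __name__ == "__main__":
    original = [
        "-input", "s3n://stt.streaming/test_input",
        "-output", "s3n://stt.streaming/test_output",
        "-mapper", "s3n://stt.streaming/mapper.py",
        "-reducer", "s3n://stt.streaming/reducer.py",
    ]
    print(" ".join(localize_scripts(original)))
```

The rewritten list would then be pasted into the "JAR Arguments" field in place of the original one. The `-input` and `-output` paths are left untouched because the job reads them through the Hadoop filesystem layer rather than executing them.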
