shell - Passing directories to Hadoop streaming: some help needed

Question

For context, I am trying to run a streaming job on Amazon EMR (through the web UI) with a bash script as the mapper, using these arguments:

-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output -mapper s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE

The input directory contains sub-directories, and those sub-directories contain gzip-compressed data files.

The relevant part of mapperScript.sh that fails is:

for filename in "$input"/*; do

dir_name=$(dirname "$filename")
fname=$(basename "$filename")

echo "$fname">/dev/stderr

modelname=${fname}.model

modelfile=$model_location/$modelname

echo "$modelfile">/dev/stderr

inputfile=$dir_name/$fname

echo "$inputfile">/dev/stderr

outputfile=$output/$fname

echo "$outputfile">/dev/stderr

# Will do some processing on the files in the sub-directories here

done # this is the loop for getting input from all sub-directories

Basically, I need to read the sub-directories in streaming mode, and when I run this, Hadoop complains:

2013-03-01 10:41:26,226 ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop cause:java.io.IOException: Not a file: s3://emrdata/test_data/input/data1
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : Not a file: s3://emrdata/test_data/input/data1

I am aware that a similar question has already been asked here.

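The suggestion there was to write your own InputFormat. I am wondering whether I am missing something in the way my script is written or in the way the EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.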

I have also tried giving my input to EMR as "input/*", but no luck.

score 2 · Accepted Answer

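It seems that, while there may be some temporary workarounds, Hadoop inherently does not support this yet, as you can see from the open ticket here. So inputpath/*/* may work for two levels of sub-directories, but it may fail with deeper nesting.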

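The best thing you can do for now is to get a listing of the files, or of the folders that have no sub-directories of their own, build a comma-separated (CSV) list of inputPaths from it, and pass those paths as the input. You can use a simple tool such as s3cmd for this.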
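
As a rough sketch of that approach (not a tested EMR recipe), the helper below turns a recursive object listing into a comma-separated list of the directories that directly contain files. The function name leaf_dirs_csv is made up for illustration, and it assumes that each listing line ends with the object's s3:// URI, as the output of `s3cmd ls --recursive` does, and that keys contain no spaces or commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop, EMR, or s3cmd).
# Reads an object listing on stdin and prints a comma-separated list of the
# directories that directly contain files.
# Assumption: the last whitespace-separated field of every line is the
# object URI, and keys contain no spaces or commas.
leaf_dirs_csv() {
  awk '{ print $NF }' |   # keep only the last field: the object URI
    grep -v '/$' |        # skip directory placeholder keys, if any
    sed 's|/[^/]*$||' |   # drop the file name, leaving the parent directory
    sort -u |             # one entry per directory
    paste -sd, -          # join all entries with commas
}
```

It could be used as `s3cmd ls --recursive s3://emrdata/test_data/input/ | leaf_dirs_csv`, with the resulting string supplied as the -input value. Whether a comma-separated -input is accepted should be checked against the Hadoop version that EMR runs.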
