hadoop - What happens if I add the same path to Hadoop twice?

Question

I am using Elastic MapReduce. I would like to know what happens if I put the exact same line in my main method twice.

FileInputFormat.addInputPath(job, new Path("s3n://mybucket/data/lolcat/*"));

Will Hadoop process the same files twice? Or will it notice that they are the same files and skip the duplicates?

score 4 · Accepted Answer

Here is the source code for addInputPath:


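public static void addInputPath(JobConf conf, Path path ) {
    // Resolve a relative path against the job's working directory.
    path = new Path(conf.getWorkingDirectory(), path);
    // Escape the path so it can be stored in a comma-separated list.
    String dirStr = StringUtils.escapeString(path.toString());
    String dirs = conf.get("mapred.input.dir");
    // Append to the existing value; nothing checks for duplicates.
    conf.set("mapred.input.dir", dirs == null ? dirStr :
      dirs + StringUtils.COMMA_STR + dirStr);
}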

So, as you can see, it simply appends your input to mapred.input.dir without looking at what is already there.
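To make the effect concrete, here is a minimal stand-in written in plain Java rather than against Hadoop. The class InputDirAppendDemo and its Map-based configuration are hypothetical, and the sketch skips the working-directory resolution and escaping that the real method performs. It only mirrors the append step to show what the property would roughly look like after two identical calls.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the append logic; this is not Hadoop code.
class InputDirAppendDemo {
    static final String KEY = "mapred.input.dir";

    // Mirrors the append step only: the new entry is joined with a comma,
    // and no check is made for an identical entry already in the value.
    static void addInputPath(Map<String, String> conf, String path) {
        String dirs = conf.get(KEY);
        conf.put(KEY, dirs == null ? path : dirs + "," + path);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addInputPath(conf, "s3n://mybucket/data/lolcat/*");
        addInputPath(conf, "s3n://mybucket/data/lolcat/*");
        // The same path now appears twice in the comma-separated value.
        System.out.println(conf.get(KEY));
    }
}
```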

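In addition, the getSplits function only uses a List, not a Set, so if the same input path appears N times, it will be processed N times. I tested this on a Hadoop streaming job: when I duplicate the same input path, I get twice as many mappers.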
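If duplicates are not wanted, one option is to de-duplicate the path strings yourself before handing them to the job. The sketch below is a suggestion rather than anything Hadoop provides; the class InputPathDedup and its dedupe helper are hypothetical names. It compares exact strings only, so two different globs that happen to match the same files would still both be kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper; not part of Hadoop.
class InputPathDedup {
    // Drops exact string duplicates while keeping first-seen order.
    static List<String> dedupe(List<String> paths) {
        return new ArrayList<>(new LinkedHashSet<>(paths));
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList(
            "s3n://mybucket/data/lolcat/*",
            "s3n://mybucket/data/lolcat/*");
        // Each surviving entry would then be added to the job once.
        for (String p : dedupe(requested)) {
            System.out.println(p);
        }
    }
}
```

In a real job, each surviving string would be passed to FileInputFormat.addInputPath exactly once.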
