hadoop - Hadoop mapper reading from 2 different source input files

Question

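I have a tool that chains many Mappers and Reducers, and at some point I need to merge the results of previous map-reduce steps. For example, as input I have two files containing data: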

/input/a.txt
apple,10
orange,20

/input/b.txt
apple;5
orange;40

The result should be c.txt, where c.value = a.value * b.value

/output/c.txt
apple,50   // 10 * 5
orange,800 // 20 * 40

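How can this be done? I solved it by introducing a simple Key => MyMapWritable (type=1,2, value) and merging (actually multiplying) the data in the reducer. It works, but:

It feels like this could be done more easily (it smells bad).
Is it possible to know inside the Mapper which file is providing the current record (a.txt or b.txt)? Right now I just rely on the different separators: comma and semicolon :(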
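
For reference, the tagged-value approach described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the logic, not the original code: `TaggedJoinSketch`, `Tagged`, `mapInto`, `reduce` and `join` are hypothetical names, `Tagged` stands in for MyMapWritable, and an in-memory map stands in for the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TaggedJoinSketch {
    /** Stand-in for MyMapWritable: source is 1 for a.txt, 2 for b.txt. */
    static final class Tagged {
        final int source;
        final long value;
        Tagged(int source, long value) { this.source = source; this.value = value; }
    }

    /** Stand-in for map + shuffle: parse "key<delimiter>value" lines and group them by key. */
    static void mapInto(Map<String, List<Tagged>> grouped, int source,
                        String delimiter, List<String> lines) {
        for (String line : lines) {
            String[] parts = line.split(delimiter, 2);
            grouped.computeIfAbsent(parts[0].trim(), k -> new ArrayList<>())
                   .add(new Tagged(source, Long.parseLong(parts[1].trim())));
        }
    }

    /** Stand-in for the reducer: multiply the value tagged 1 by the value tagged 2. */
    static Map<String, Long> reduce(Map<String, List<Tagged>> grouped) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, List<Tagged>> e : grouped.entrySet()) {
            Long a = null;
            Long b = null;
            for (Tagged t : e.getValue()) {
                if (t.source == 1) { a = t.value; } else { b = t.value; }
            }
            // Inner join: emit only keys that appear in both inputs.
            if (a != null && b != null) {
                out.put(e.getKey(), a * b);
            }
        }
        return out;
    }

    static Map<String, Long> join(List<String> aLines, List<String> bLines) {
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        mapInto(grouped, 1, ",", aLines);   // a.txt uses commas
        mapInto(grouped, 2, ";", bLines);   // b.txt uses semicolons
        return reduce(grouped);
    }
}
```

Calling `TaggedJoinSketch.join` with the lines of a.txt and b.txt from the example yields apple mapped to 50 and orange mapped to 800. The sketch assumes each key appears at most once per file; with duplicates, the last value seen for a source wins.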

score 3 · Accepted Answer

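Assuming both files have already been partitioned and sorted in the same way, you can use CompositeInputFormat to perform a map-side join. There is an article about using it here. I don't think it has been ported to the new mapreduce API.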
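
To see why the inputs need to be sorted the same way, here is a minimal plain-Java sketch of a join over two key-sorted lists. It is not CompositeInputFormat itself and uses no Hadoop classes; `SortedMergeJoinSketch` and `mergeJoin` are hypothetical names, and the sketch assumes each key occurs at most once per input.

```java
import java.util.ArrayList;
import java.util.List;

class SortedMergeJoinSketch {
    /**
     * Joins two lists of {key, value} pairs in a single pass.
     * Both lists must already be sorted by key in the same order.
     */
    static List<String> mergeJoin(List<String[]> a, List<String[]> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i)[0].compareTo(b.get(j)[0]);
            if (cmp < 0) {
                i++;                       // key only in a: skip it
            } else if (cmp > 0) {
                j++;                       // key only in b: skip it
            } else {
                long product = Long.parseLong(a.get(i)[1]) * Long.parseLong(b.get(j)[1]);
                out.add(a.get(i)[0] + "," + product);
                i++;
                j++;
            }
        }
        return out;
    }
}
```

Because both cursors only move forward, no grouping or shuffle step is needed, which is what makes a join on the map side possible. If either input were unsorted, matching keys could be skipped.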

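Secondly, you can find out the input file inside the mapper by calling context.getInputSplit(). This returns an InputSplit which, if you are using TextInputFormat, you can cast to a FileSplit and then call getPath() to get the file name. I don't think you can use this method with CompositeInputFormat, because you won't know where the Writables in the TupleWritable came from.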

score 1 · Answer

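// Requires: import org.apache.hadoop.mapreduce.lib.input.FileSplit;
// Use inside the Mapper (setup() or map()), where context is the Mapper.Context.
String fileName = ((FileSplit) context.getInputSplit()).getPath()
                .toString();

if (fileName.contains("file_1")) {
   // TODO: handle records from file 1
} else {
   // TODO: handle records from file 2
}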
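
Applied to the a.txt / b.txt example from the question, the file name could decide both the source tag and the delimiter. The following is a hedged plain-Java sketch with no Hadoop dependency: `SourceTagSketch`, `sourceOf` and `parse` are hypothetical helpers, and the path string is assumed to be what getPath().toString() returns in the mapper. It matches the last path component exactly, since a substring check such as contains("file_1") would also match a name like "file_10".

```java
class SourceTagSketch {
    /** Returns 1 for a.txt and 2 for b.txt, looking only at the last path component. */
    static int sourceOf(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.equals("a.txt")) {
            return 1;
        }
        if (name.equals("b.txt")) {
            return 2;
        }
        throw new IllegalArgumentException("Unexpected input file: " + path);
    }

    /** Splits a line into {key, value} using the delimiter of its source file. */
    static String[] parse(String path, String line) {
        String delimiter = sourceOf(path) == 1 ? "," : ";";
        String[] parts = line.split(delimiter, 2);
        return new String[] { parts[0].trim(), parts[1].trim() };
    }
}
```

In a real mapper the tag would typically be computed once per split rather than once per record, because every record in a split comes from the same file.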
