hadoop - Hadoop: How to output different format types in the same job?

Question

I want to output both gzip and lzo formats from a single job.

I use MultipleOutputs and add two named outputs, like this:

MultipleOutputs.addNamedOutput(job, "LzoOutput", GBKTextOutputFormat.class, Text.class, Text.class);

GBKTextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);

MultipleOutputs.addNamedOutput(job, "GzOutput", TextOutputFormat.class, Text.class, Text.class);

TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

(GBKTextOutputFormat here is a class I wrote myself that extends FileOutputFormat.)

They are used in the reducer like this:

multipleOutputs.write("LzoOutput", NullWritable.get(), value, "/user/hadoop/lzo/"+key.toString());

multipleOutputs.write("GzOutput", NullWritable.get(), value, "/user/hadoop/gzip/"+key.toString());

The result is:

I get output in both paths, but both are in gzip format.

Can anyone help me? Thanks!

==============================================================================

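Update:

I just looked at the source code of setOutputCompressorClass in FileOutputFormat, which calls conf.setClass("mapred.output.compression.codec", codecClass, CompressionCodec.class);

So each call to setOutputCompressorClass appears to overwrite mapred.output.compression.codec in the configuration.

Does that mean the compression format actually used is whichever one was set last, and that two different compression formats cannot be set in the same job? Or is there something I have overlooked?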
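The overwrite described above can be illustrated with a small plain-Java model. This is only a sketch: the HashMap stands in for the Hadoop Configuration, the codec names are plain strings, and the class CodecOverwriteDemo and its helper methods are hypothetical names made up for illustration. The one detail taken from the source is that the codec is stored under a single job-wide key.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecOverwriteDemo {
    // The single job-wide key quoted from the FileOutputFormat source above.
    static final String CODEC_KEY = "mapred.output.compression.codec";

    // Hypothetical stand-in for setOutputCompressorClass: it stores the codec
    // under one key, no matter which output format class it is called on.
    static void setOutputCompressorClass(Map<String, String> conf, String codecClassName) {
        conf.put(CODEC_KEY, codecClassName);
    }

    // Repeats the two calls from the job setup and returns what the job sees.
    static String effectiveCodec() {
        Map<String, String> conf = new HashMap<>();
        setOutputCompressorClass(conf, "LzoCodec");  // meant for "LzoOutput"
        setOutputCompressorClass(conf, "GzipCodec"); // meant for "GzOutput", replaces the value above
        return conf.get(CODEC_KEY);
    }

    public static void main(String[] args) {
        System.out.println(effectiveCodec()); // prints "GzipCodec"
    }
}
```

Because both calls write to the same key, only the last codec survives, which matches the observation that both paths end up gzip-compressed.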

score 2 · Accepted Answer

So maybe as a work-around, try setting the correct outputCompressorClass directly in the configuration

context.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

just before your write call to each of the outputs. It does look like any output format configuration parameters other than key class, value class and output path are not handled well by MultipleOutputs and we may have to write a bit of code to offset that oversight.
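Why switching the codec before a write could help can be sketched with a simplified plain-Java model. It rests on an assumption that should be checked against the Hadoop version in use: that MultipleOutputs reads the configuration lazily, the first time a given named output is written to, and reuses that snapshot afterwards. ModelOutputs, PerOutputCodecDemo and run are hypothetical names, and the codecs are plain strings rather than real codec classes.

```java
import java.util.HashMap;
import java.util.Map;

public class PerOutputCodecDemo {
    static final String CODEC_KEY = "mapred.output.compression.codec";

    // Hypothetical model of a multi-output writer. ASSUMPTION: the codec is
    // read from the shared configuration on the first write to each named
    // output and cached from then on.
    static class ModelOutputs {
        private final Map<String, String> conf;
        private final Map<String, String> codecByOutput = new HashMap<>();

        ModelOutputs(Map<String, String> conf) {
            this.conf = conf;
        }

        // Returns the codec that this write would use.
        String write(String namedOutput) {
            return codecByOutput.computeIfAbsent(namedOutput, n -> conf.get(CODEC_KEY));
        }
    }

    static String[] run() {
        Map<String, String> conf = new HashMap<>();
        conf.put(CODEC_KEY, "GzipCodec"); // job-level value: the last setter won
        ModelOutputs out = new ModelOutputs(conf);

        conf.put(CODEC_KEY, "LzoCodec");  // switch just before the first lzo write
        String lzo = out.write("LzoOutput");

        conf.put(CODEC_KEY, "GzipCodec"); // switch just before the first gzip write
        String gz = out.write("GzOutput");

        // Under the caching assumption, later writes keep the first codec.
        String lzoAgain = out.write("LzoOutput");
        return new String[] {lzo, gz, lzoAgain};
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", run())); // prints "LzoCodec, GzipCodec, LzoCodec"
    }
}
```

If the assumption holds, what matters is the value of the key at the first write to each named output; if it does not hold for a given Hadoop release, the key would need to be set before every write, as the answer suggests.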
