
I am compressing the output of a Hadoop MR job:

conf.setOutputFormat(TextOutputFormat.class);
TextOutputFormat.setCompressOutput(conf, true);
TextOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

I am also using MultipleOutputs, for example:

MultipleOutputs.addMultiNamedOutput(conf, "a", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(conf, "b", TextOutputFormat.class, Text.class, Text.class);
LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);

Each of these works fine on its own: I can get the output split the way I want, and I can get gzipped output. But when I use them together, the gzip-compressed files appear to end up corrupted. Each output file has roughly 25,000 lines. When I do something like

hadoop dfs -cat output/*.gz | less +24000

everything looks normal. But if I do

hadoop dfs -cat output/*.gz | less +40000

I get errors such as

zcat: stdin: invalid compressed data--crc error
zcat: stdin: invalid compressed data--length error
zcat: stdin: invalid compressed data--format violated

If I repeat the first command and scan through the file, I eventually hit one of the errors above, after either an incomplete line or a few very long, badly corrupted lines (I assume they are long because the newline characters are corrupted too), and less cannot go any further.

So, my question is: has anyone seen this before, and is there a way to fix it?

Note: I am using the mapred API instead of mapreduce. I can try to translate to the new API, but if I can find a solution using mapred, that would be preferable.


1 Answer


Just a guess (without seeing your reducer code), but are you calling close() on your MultipleOutputs instance (it is an instance method, not a static one, which doesn't exist) in the close() method of your reducer (the mapred equivalent of the new API's cleanup())?

It looks like the final block of the gzip files is not being written, which is consistent with not calling the method above: the gzip codec can only emit its final block, CRC, and length trailer when the underlying RecordWriter is closed.
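
For reference, a minimal sketch of a mapred-API reducer that does this (the class name, field name, and part name here are hypothetical; the essential part is mos.close() in close()):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MyReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf conf) {
        // Create the MultipleOutputs instance once per task.
        mos = new MultipleOutputs(conf);
    }

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            // Write to the multi-named output "a" (with a part name)
            // and to the named output "b", as configured in the question.
            mos.getCollector("a", "part", reporter).collect(key, value);
            mos.getCollector("b", reporter).collect(key, value);
        }
    }

    @Override
    public void close() throws IOException {
        // Crucial: flushes and closes every underlying RecordWriter,
        // letting the gzip codec write its final block and trailer.
        mos.close();
    }
}

If close() is never called, each gzip stream is truncated mid-block, which matches the symptom of files that decompress fine up to a point and then fail with CRC/length errors.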

answered 2013-01-01T16:26:55.433