hadoop - GZIP files corrupted at the end when using MultipleOutputs in Hadoop

Question

I am compressing the output of a Hadoop MR job:

conf.setOutputFormat(TextOutputFormat.class);
TextOutputFormat.setCompressOutput(conf, true);
TextOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

I am also using MultipleOutputs, e.g.:

MultipleOutputs.addMultiNamedOutput(conf, "a", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(conf, "b", TextOutputFormat.class, Text.class, Text.class);
LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);

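Separately, these work fine: I can get the output split I want, and I can get gzipped output. But when I use them together, the gzipped files appear to be corrupted at the end. Each output file has about 25000 lines. When I do something like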

hadoop dfs -cat output/*.gz | less +24000

everything looks fine. But if I do

hadoop dfs -cat output/*.gz | less +40000

I get errors such as

zcat: stdin: invalid compressed data--crc error
zcat: stdin: invalid compressed data--length error
zcat: stdin: invalid compressed data--format violated

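If I repeat the first command and start scrolling through the file, I eventually hit one of the above errors after an incomplete line or a few very long, very corrupt lines (I assume they are long because the newlines are corrupted too), and less cannot go any further.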

So, my question is: has anyone seen this before, and is there a way to fix it?

Note: I am using the mapred API instead of mapreduce. I can try to translate to the new API, but if I can find a solution using mapred, that would be preferable.

score 3 · Accepted Answer

Just a guess (without seeing your reducer code), but are you calling MultipleOutputs.close() (on your MultipleOutputs instance; there is no static method) in the cleanup method of your reducer? With the old mapred API that you are using, the equivalent hook is the reducer's close() method.

It looks like the final block of each gzip file is not being written, which is consistent with not calling the above method.
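To illustrate why a missing close() would produce exactly this symptom, here is a minimal sketch that uses only the JDK's java.util.zip classes, not Hadoop itself. The class name GzipCloseDemo and its helper methods are made up for this example, and I am assuming Hadoop's GzipCodec behaves like any gzip writer in this respect: the last deflate block and the CRC/length trailer are only emitted when the stream is finished or closed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it is not part of Hadoop or of the asker's job.
public class GzipCloseDemo {

    // Gzip-compresses `lines` numbered text lines into memory.
    // The stream is closed only when `close` is true.
    public static byte[] compress(int lines, boolean close) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(sink);
            for (int i = 0; i < lines; i++) {
                gz.write(("line " + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
            if (close) {
                // close() flushes the final deflate block and writes the
                // CRC32 and length trailer that readers verify.
                gz.close();
            }
            // When `close` is false the stream is deliberately left open,
            // mimicking a writer that is never closed.
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the whole byte array decompresses without an error.
    public static boolean readsCleanly(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Discard the data; only reaching the end matters here.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("closed:   " + readsCleanly(compress(25000, true)));
        System.out.println("unclosed: " + readsCleanly(compress(25000, false)));
    }
}
```

The closed stream should read back cleanly while the unclosed one fails partway through, which matches a file that looks fine near the start and breaks near the end. In a mapred job, the usual pattern (as far as I know) is to create the MultipleOutputs instance in the reducer's configure(JobConf) method and call its close() from the reducer's own close() override.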
