
I'm new to Hadoop, and I'm working with a program whose map output is very large compared to the size of the input file.

I installed the LZO library and changed the configuration files, but it had no effect on my program. How do I compress the map output? Is LZO the best option?

If so, how do I implement it in my program?


2 Answers


To compress the intermediate output (your map output), you need to set the following properties in your mapred-site.xml:

<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>

<property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>

If you want to do it on a per-job basis, you can also set it directly in your code in one of the following ways:

conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.LzoCodec");

or

jobConf.setMapOutputCompressorClass(LzoCodec.class);

It's also worth mentioning that the property mapred.output.compression.type should be left at its default of RECORD, because BLOCK compression for intermediate output causes poor performance.

When choosing what type of compression to use, I think you need to consider 2 aspects:

  • Compression ratio: how much compression actually occurs. The higher the %, the better the compression.
  • IO performance: since compression is an IO-intensive operation, different compression methods have different performance implications.

The goal is to balance compression ratio and IO performance: you can have a compression codec with a very high compression ratio but poor IO performance.

It's really hard to tell you which one you should use and which one you should not; it also depends on your data, so you should try a few and see what makes the most sense. In my experience, Snappy and LZO are the most efficient ones. Recently I heard about LZF, which sounds like a good candidate too. I found a post proposing a benchmark of compression codecs here, but I would definitely advise against taking that as ground truth; do your own benchmark.
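To get a feel for the ratio-versus-speed tradeoff on your own data, a quick micro-benchmark helps. The sketch below uses Python's standard-library codecs (zlib, bz2, lzma) purely as stand-ins; these are not the Hadoop codecs discussed above, but the same principle applies: measure both the compressed size and the time taken, on data that resembles yours.

```python
# Illustrative micro-benchmark of the compression-ratio vs. speed tradeoff.
# zlib/bz2/lzma are stdlib stand-ins, NOT Hadoop codecs; swap in your own data.
import bz2
import lzma
import time
import zlib

# Repetitive sample data, loosely shaped like intermediate map output.
data = b"key\tvalue some repetitive intermediate map output\n" * 20000

for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data)  # lower = better compression
    print(f"{name}: ratio={ratio:.4f} time={elapsed * 1000:.1f}ms")
```

On highly repetitive data all three compress well, but the timings diverge sharply; that difference, not the ratio alone, is usually what matters for intermediate map output.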

Answered 2013-02-01T16:30:13.103

If you are using Hadoop 0.21 or later, you have to set these properties in your mapred-site.xml:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
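Equivalently, these newer property names can be set per job from code. A minimal sketch (it assumes, as the XML above does, that the hadoop-lzo codec is on the classpath):

```java
// Per-job equivalent of the XML above (Hadoop 0.21+ property names).
// Assumes com.hadoop.compression.lzo.LzoCodec is available on the classpath.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec",
         "com.hadoop.compression.lzo.LzoCodec");
```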

And don't forget to restart Hadoop after making the changes. Also make sure you have both the 32-bit and 64-bit liblzo2 installed. For detailed help on setting this up, you can refer to the following links:

https://github.com/toddlipcon/hadoop-lzo

https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1

In addition to the points Charles raised, there is one more aspect you should keep in mind:

  • CPU cycles: the compression algorithm you choose should consume as few CPU cycles as possible. Otherwise, the cost of compressing will offset, or even reverse, the speed gains.

Snappy is another option, but it is optimized mainly for 64-bit machines. If you are on a 32-bit machine, proceed with care.

Based on recent developments, LZ4 also looks good and has recently been integrated into Hadoop. It is fast, but has higher memory requirements. You can go here to find out more about LZ4.
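Assuming your Hadoop version ships the built-in org.apache.hadoop.io.compress.Lz4Codec, trying LZ4 for map output is just a configuration change, mirroring the LZO setup above (a sketch; verify the codec class exists in your distribution):

```xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
```

Unlike LZO, LZ4 needs no separate native library install in distributions that bundle the codec, which makes it a cheap experiment.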

But as Charles said, a fair decision can only be made after some experimentation.

Answered 2013-06-15T23:28:03.610