
When I run a job on YARN (2.4.0) with map-output compression (Snappy) enabled, there is a big impact on job completion time. For example, I ran the following experiments. Job: invertedindex. Cluster: 10 slave VMs (4 CPUs, 8 GB RAM each).

Job completion time of the 5 GB invertedindex job: 226 s without compression, 1600 s with Snappy compression.

Job completion time of the 50 GB invertedindex job: 2000 s without compression, 14000 s with Snappy compression.

My configuration in mapred-site.xml is like this:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>  
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

I have read a lot of material that says compression should improve performance, but here it has slowed the job down by almost 7 times. What am I doing wrong here?


2 Answers


I solved this compression problem with the following steps:

1. Fix the Hadoop "Unable to load native-hadoop library for your platform" warning

2. Install Snappy from http://code.google.com/p/snappy/

3. Copy /usr/local/lib/libsnappy* to $HADOOP_HOME/lib/native/

4. Configure LD_LIBRARY_PATH in hadoop-env.sh and mapred-site.xml:

<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=$HADOOP_HOME/lib/native</value>
</property>
answered 2014-07-23T01:57:21.400

It might be the default setting of mapreduce.output.fileoutputformat.compress.type, which is RECORD.

Basically it tries to compress every record individually, and if your records are small snippets of text (e.g. the tokens in an inverted index), the compressed output can end up larger than the input.
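The per-record overhead is easy to demonstrate. A minimal sketch, using zlib from Python's standard library as a stand-in for Snappy (the framing-overhead argument is the same for both codecs):

```python
import zlib

# Simulated inverted-index records: many short, highly redundant tokens.
tokens = [b"hadoop", b"mapreduce", b"snappy", b"compression"] * 250

# RECORD-style: compress each tiny record on its own.
# Each call pays the codec's fixed header/checksum overhead.
per_record = sum(len(zlib.compress(t)) for t in tokens)

# BLOCK-style: compress many records together, so the codec
# can exploit redundancy across records.
block = len(zlib.compress(b"".join(tokens)))

raw = sum(len(t) for t in tokens)
print(f"raw={raw}  per-record={per_record}  block={block}")
```

Running this shows the per-record total exceeding even the uncompressed size, while block-level compression shrinks the redundant data dramatically.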

You can try setting this property to BLOCK, which compresses at the block level and should give much better compression on redundant text data.
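If you want to try it, the property can be set in mapred-site.xml like the snippets above (property name as documented for Hadoop 2.x; verify it against your version's mapred-default.xml):

```xml
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
```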

answered 2014-07-22T07:29:20.897