
When I run a job on YARN (2.4.0) with map-output compression (Snappy) enabled, there is a big impact on job completion time. For example, I ran the following experiments. Job: invertedindex. Cluster: 10 slave VMs (4 CPUs, 8 GB RAM each).

Job completion time of the 5 GB invertedindex job: 226 s without compression, 1600 s with Snappy compression.

Job completion time of the 50 GB invertedindex job: 2000 s without compression, 14000 s with Snappy compression.

My configuration in mapred-site.xml is like this:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>  
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

I have read a lot of material that says compression should improve performance, but here it has slowed the job down by almost 7 times. What am I doing wrong here?


2 Answers


I solved this compression problem with the following steps:

1. Fix the Hadoop "Unable to load native-hadoop library for your platform" warning

2. Install Snappy from http://code.google.com/p/snappy/

3. Copy /usr/local/lib/libsnappy* to $HADOOP_HOME/lib/native/

4. Configure LD_LIBRARY_PATH in hadoop-env.sh and mapred-site.xml:

<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=$HADOOP_HOME/lib/native</value>
</property>
answered 2014-07-23T01:57:21.400

It might be the default setting of mapreduce.output.fileoutputformat.compress.type, which is RECORD.

Basically it tries to compress every record individually, and if your records are small snippets of text (e.g. the tokens in an inverted index), the compressed output can end up larger than the input.
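The per-record overhead is easy to demonstrate. A minimal sketch, using zlib from Python's standard library as a stand-in for Snappy (the framing-overhead argument is the same for both codecs):

```python
import zlib

# Simulated inverted-index records: many short, highly redundant tokens.
tokens = [b"hadoop", b"mapreduce", b"snappy", b"compression"] * 250

# RECORD-style: compress each tiny record on its own.
# Each call pays the codec's fixed header/checksum overhead.
per_record = sum(len(zlib.compress(t)) for t in tokens)

# BLOCK-style: compress many records together, so the codec
# can exploit redundancy across records.
block = len(zlib.compress(b"".join(tokens)))

raw = sum(len(t) for t in tokens)
print(f"raw={raw}  per-record={per_record}  block={block}")
```

Running this shows the per-record total exceeding even the uncompressed size, while block-level compression shrinks the redundant data dramatically.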

You can try setting this property to BLOCK, which compresses at the block level and should give much better compression on redundant text data.
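If you want to try it, the property can be set in mapred-site.xml like the snippets above (property name as documented for Hadoop 2.x; verify it against your version's mapred-default.xml):

```xml
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
```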

answered 2014-07-22T07:29:20.897