When I run a job on YARN (2.4.0) with Snappy compression enabled, the job completion time increases dramatically. For example, I ran the following experiments. Job: invertedindex. Cluster: 10 slave VMs (4 CPUs, 8 GB RAM).
Job completion time for 5 GB invertedindex: 226 s without compression, 1600 s with Snappy compression.
Job completion time for 50 GB invertedindex: 2000 s without compression, 14000 s with Snappy compression.
My configuration in mapred-site.xml is as follows:
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
I have read a lot of material saying that compression should improve performance, but here it has made the job about 7 times slower. What am I doing wrong?