I am trying to read LZ4-compressed files with Apache Spark, and my understanding is that the regular textFile method should be sufficient. If I load the file uncompressed, everything works as expected, but if I load the lz4-compressed version, the resulting RDD ends up empty.
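To illustrate, this is a minimal version of what I run (the paths are placeholders, not my real ones):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class Lz4ReadTest {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("lz4-read-test"));

            // Reading the uncompressed file works as expected.
            JavaRDD<String> plain = sc.textFile("hdfs:///data/input.txt");
            System.out.println("uncompressed lines: " + plain.count());

            // Reading the lz4-compressed copy of the same file yields an empty RDD.
            JavaRDD<String> compressed = sc.textFile("hdfs:///data/input.txt.lz4");
            System.out.println("lz4 lines: " + compressed.count()); // prints 0

            sc.stop();
        }
    }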
I am wondering if the issue is related to the way I am compressing and decompressing. I compress my files using the Java library https://github.com/jpountz/lz4-java, version 1.3.0 (lz4 version 123).
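Roughly, the compression side looks like this (a simplified sketch using the library's LZ4BlockOutputStream; file names are placeholders):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import net.jpountz.lz4.LZ4BlockOutputStream;

    public class Lz4Compress {
        public static void main(String[] args) throws Exception {
            // File names are placeholders for the real input/output.
            try (FileInputStream in = new FileInputStream("input.txt");
                 LZ4BlockOutputStream out =
                         new LZ4BlockOutputStream(new FileOutputStream("input.txt.lz4"))) {
                byte[] buffer = new byte[8192];
                int n;
                // Copy the input through the lz4 block stream.
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }
    }

However, on the machine where the Spark workers are installed, I have the Hadoop native libraries for other versions. If I run the command to check them, it shows: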
./hadoop checknative -a
15/03/04 05:11:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
15/03/04 05:11:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: false
lz4: true revision:99
bzip2: false
The RPM I am installing to provide the lz4.so library is the following:
As you can see, it looks like I have three different versions of LZ4 involved (the lz4-java library, the Hadoop native library, and the system RPM), and I have not been able to make them match. My first question is: should this work even if I don't have the same version everywhere?
If not, what should I do to correctly configure the native libraries so that Spark can read lz4-compressed files?
I am using Spark 1.1.0 and passing the location of the native libraries to spark-submit via --driver-library-path.
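For completeness, the submit command looks roughly like this (the class and jar names are placeholders; the library path is the one shown by checknative above):

    spark-submit \
      --driver-library-path /opt/hadoop/hadoop-2.4.0/lib/native \
      --class com.example.MyJob \
      my-job.jar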