I am trying to read LZ4-compressed files with Apache Spark, and my understanding is that the regular textFile method should be sufficient. If I load the file uncompressed, everything works as expected, but if I load the lz4-compressed version, the resulting RDD ends up empty.
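To illustrate, this is a minimal version of what I run (the paths are placeholders, not my real ones):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class Lz4ReadTest {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("lz4-read-test"));

            // Reading the uncompressed file works as expected.
            JavaRDD<String> plain = sc.textFile("hdfs:///data/input.txt");
            System.out.println("uncompressed lines: " + plain.count());

            // Reading the lz4-compressed copy of the same file yields an empty RDD.
            JavaRDD<String> compressed = sc.textFile("hdfs:///data/input.txt.lz4");
            System.out.println("lz4 lines: " + compressed.count()); // prints 0

            sc.stop();
        }
    }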
I am wondering if the issue is related to the way I am compressing and decompressing. I compress my files using the Java library https://github.com/jpountz/lz4-java, version 1.3.0 (lz4 version 123).
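Roughly, the compression side looks like this (a simplified sketch using the library's LZ4BlockOutputStream; file names are placeholders):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import net.jpountz.lz4.LZ4BlockOutputStream;

    public class Lz4Compress {
        public static void main(String[] args) throws Exception {
            // File names are placeholders for the real input/output.
            try (FileInputStream in = new FileInputStream("input.txt");
                 LZ4BlockOutputStream out =
                         new LZ4BlockOutputStream(new FileOutputStream("input.txt.lz4"))) {
                byte[] buffer = new byte[8192];
                int n;
                // Copy the input through the lz4 block stream.
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }
    }

However, on the machine where the Spark workers are installed, I have the Hadoop native libraries for other versions. If I run the command to check them, it shows: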
./hadoop checknative -a
15/03/04 05:11:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
15/03/04 05:11:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/hadoop/hadoop-2.4.0/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: false
lz4: true revision:99
bzip2: false
The RPM I am installing to provide the lz4.so library is the following:
As you can see, it looks like I have three different versions of LZ4 involved (the lz4-java library, the Hadoop native library, and the system RPM), and I have not been able to make them match. My first question is: should this work even if I don't have the same version everywhere?
If not, what should I do to correctly configure the native libraries so that Spark can read lz4-compressed files?
I am using Spark 1.1.0 and passing the location of the native libraries to spark-submit via --driver-library-path.
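For completeness, the submit command looks roughly like this (the class and jar names are placeholders; the library path is the one shown by checknative above):

    spark-submit \
      --driver-library-path /opt/hadoop/hadoop-2.4.0/lib/native \
      --class com.example.MyJob \
      my-job.jar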