java - SequenceFile compactor of several small files into a single file.seq

Question

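I am new to HDFS and Hadoop. I am developing a program that should take all the files in a specific directory, which contains several small files of any type.

It should take every file and append it, compressed, to a SequenceFile, where the key must be the path of the file and the value must be the file itself. For now my code is: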

    import java.net.*;

    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.compress.BZip2Codec;

    public class Compact {
        public static void main(String[] args) throws Exception {
            try {
                Configuration conf = new Configuration();
                FileSystem fs =
                        FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"), conf);
                Path destino = new Path("/user/cloudera/data/testPractice.seq"); // test args[1]

                if (fs.exists(destino)) {
                    System.out.println("exist : " + destino);
                    return;
                }
                BZip2Codec codec = new BZip2Codec();

                SequenceFile.Writer outSeq = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(fs.makeQualified(destino)),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(FSDataInputStream.class));

                FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); // args[0]
                for (int i = 0; i < status.length; i++) {
                    FSDataInputStream in = fs.open(status[i].getPath());

                    outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()), new FSDataInputStream(in));
                    fs.close();
                }
                outSeq.close();
                System.out.println("End Program");
            } catch (Exception e) {
                System.out.println(e.toString());
                System.out.println("File not found");
            }
        }
    }

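But after every execution I get this exception:

    java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.fs.FSDataInputStream'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
    File not found

I know the error must lie in the mismatch between the type of object I am creating and the value type I declared for the SequenceFile, but I don't know which type I should use. Can anyone help me?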

score 0 · Accepted Answer

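FSDataInputStream, like any other InputStream, is not intended to be serialized. What should serializing an "iterator" over a stream of bytes even do?

What you most likely want to do is store the content of the file as the value. For example, you can switch the value type from FSDataInputStream to BytesWritable and read all the bytes out of the FSDataInputStream. One drawback of using a key/value SequenceFile for this purpose is that the content of each file must fit in memory. That may be fine for small files, but you have to be aware of this issue.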
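To illustrate the "read all the bytes" step and the memory caveat, here is a minimal sketch that uses only the JDK so it runs without a cluster. The class name `SmallFileReader`, the helper `readFully`, and the `MAX_BYTES` cap are hypothetical names chosen for this example, not Hadoop APIs. With Hadoop, the stream would be the FSDataInputStream returned by `fs.open(...)`, the length would come from the file's FileStatus, and the resulting array would be wrapped in a BytesWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SmallFileReader {
    // Hypothetical cap for this sketch; not a Hadoop setting.
    static final long MAX_BYTES = 64L * 1024 * 1024;

    // Reads exactly `length` bytes from the stream into memory.
    // Refuses files that are too large to buffer safely.
    static byte[] readFully(InputStream in, long length) throws IOException {
        if (length < 0 || length > MAX_BYTES) {
            throw new IOException("File too large to buffer in memory: " + length + " bytes");
        }
        byte[] content = new byte[(int) length];
        // DataInputStream.readFully loops until the array is full,
        // and throws EOFException if the stream ends early.
        new DataInputStream(in).readFully(content);
        return content;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a small file; in Hadoop this would be fs.open(path).
        byte[] data = "hello sequence file".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(data)) {
            byte[] copy = readFully(in, data.length);
            System.out.println(new String(copy, StandardCharsets.UTF_8));
        }
    }
}
```

The size check matters because the length is cast to `int` to allocate the array; a file larger than the cap is rejected instead of being silently truncated or exhausting the heap.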

I am not sure what you are really trying to achieve, but perhaps you could avoid reinventing the wheel by using something like Hadoop Archives?

score 0 · Accepted Answer

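Thank you very much for your comments. The problem was the serializer, as you said, and in the end I used BytesWritable:

    // The writer must be created with valueClass(BytesWritable.class).
    FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); // args[0]
    for (int i = 0; i < status.length; i++) {
        // Buffer the whole file in memory; acceptable only for small files.
        byte[] content = new byte[(int) status[i].getLen()];
        try (FSDataInputStream in = fs.open(status[i].getPath())) {
            in.readFully(content);
        }
        // Key: file path; value: raw file bytes.
        outSeq.append(new Text(status[i].getPath().toString()), new BytesWritable(content));
    }
    outSeq.close();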
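There are probably better solutions in the Hadoop ecosystem, but this problem is an exercise for the degree I am studying, and for now we are reinventing the wheel to understand the concepts ;-).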
