java - Reading a large 80 GB text file with superCSV

Question

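I want to read a huge CSV file. We normally use superCSV to parse files. In this particular case the file is so large that, for obvious reasons, we always run out of memory.

My initial idea was to read the file in chunks, but I am not sure whether that works with superCSV: when I chunk the file, only the first chunk has the header row and can be loaded into the CSV bean, while the other chunks have no header, and I suspect that would throw an exception. So:

a) I would like to know whether my reasoning is correct.
b) Is there any other way to solve this problem?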

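So my main question is:

Is superCSV capable of handling very large CSV files? I see that superCSV reads the document through a BufferedReader, but I don't know what the buffer size is. Can we change it to suit our requirements?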

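@Gilbert Le Blanc I tried splitting the file into smaller chunks as you suggested, but breaking the large file into smaller chunks takes a very long time. Here is the code I wrote for it.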

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.LineNumberReader;

public class TestFileSplit {

    public static void main(String[] args) {
        File file = new File("C:\\Blah\\largetextfile.txt");
        String baseName = file.getName().substring(0, file.getName().indexOf("."));
        int noOfLines = 100000;
        int i = 1;
        String header = null;

        // mkdir() returns false if the directory already exists, so nothing is split on a rerun
        boolean chunkedFiles = new File("C:\\Blah\\chunks").mkdir();
        if (!chunkedFiles) {
            return;
        }

        try (LineNumberReader lnr = new LineNumberReader(new FileReader(file), 1024)) {
            String line;
            while ((line = lnr.readLine()) != null) {
                // the first line is the header; keep it so it can be copied into every chunk
                if (lnr.getLineNumber() == 1) {
                    header = line;
                    continue;
                }

                // move on to the next chunk file whenever the input line number is a multiple of 100000
                if ((lnr.getLineNumber() % noOfLines) == 0) {
                    i = i + 1;
                }

                File chunkedFile = new File("C:\\Blah\\chunks\\" + baseName + "_" + i + ".txt");
                boolean isNewChunk = !chunkedFile.exists();

                // the chunk file is reopened in append mode for every single line
                try (BufferedWriter bw = new BufferedWriter(new FileWriter(chunkedFile, true))) {
                    // a new chunk file gets the header as its first row
                    if (isNewChunk) {
                        bw.write(header);
                        bw.newLine();
                    }
                    bw.write(line);
                    bw.newLine();
                }
            }
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so one catch covers both
            e.printStackTrace();
        }
    }
}
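
A likely contributor to the slow split (an observation about the code above, not something confirmed in this thread) is that a FileWriter is opened and closed for every line. The following is a minimal sketch of the same header-per-chunk split that keeps one writer open per chunk. The class name `ChunkSplitSketch`, the `chunk_N.txt` naming, and the UTF-8 charset are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of superCSV.
class ChunkSplitSketch {

    /**
     * Splits input into files holding at most recordsPerChunk data rows each.
     * Every chunk starts with a copy of the header row.
     * Returns the number of chunk files written.
     */
    static int split(Path input, Path chunkDir, int recordsPerChunk) {
        if (recordsPerChunk < 1) {
            throw new IllegalArgumentException("recordsPerChunk must be at least 1");
        }
        int chunkCount = 0;
        // ASSUMPTION: the file is UTF-8; use the real charset of your data.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            Files.createDirectories(chunkDir);
            String header = reader.readLine();
            if (header == null) {
                return 0; // empty input, nothing to split
            }
            BufferedWriter writer = null;
            try {
                long recordsInChunk = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    // Open a new chunk only when the current one is full.
                    if (writer == null || recordsInChunk == recordsPerChunk) {
                        if (writer != null) {
                            writer.close();
                        }
                        chunkCount++;
                        Path chunk = chunkDir.resolve("chunk_" + chunkCount + ".txt");
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        writer.write(header);
                        writer.newLine();
                        recordsInChunk = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    recordsInChunk++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunkCount;
    }
}
```

It could be called as `ChunkSplitSketch.split(Paths.get("C:\\Blah\\largetextfile.txt"), Paths.get("C:\\Blah\\chunks"), 100000)`. Like the original, it splits on line boundaries, so it assumes no record contains an embedded line break inside a quoted field.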

score 2 · Accepted Answer

You can define the header in the parser Java class itself. That way you don't need a header row in the CSV file.

// only map the first 3 columns - setting header elements to null means those columns are ignored
final String[] header = new String[] { "customerNo", "firstName", "lastName", null, null, null, null, null, null, null };
beanReader.read(CustomerBean.class, header);

Or

You can also use the Dozer extension of the SuperCSV API.

score 1 · Accepted Answer

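I'm not sure what the problem is. Reading one row at a time as a bean takes roughly constant memory. If you store all the objects you read at once, then yes, you will run out of memory. But how is that superCSV's fault?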
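
To illustrate the constant-memory point, here is a minimal sketch using only the JDK: each row is handed to a callback and then dropped, so nothing accumulates however large the file is. The class `StreamingCsvSketch`, its method, and the naive comma split are assumptions for illustration only. With superCSV the loop would instead call `beanReader.read(...)` until it returns null. On the buffer size question: `CsvBeanReader` accepts any `Reader`, so a `BufferedReader` built with an explicit size can be passed in, although the default of 8192 chars has little bearing on heap usage.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper, not part of superCSV.
class StreamingCsvSketch {

    /**
     * Reads rows one at a time and passes each to the handler.
     * No row is retained after the handler returns, so memory use stays
     * roughly constant. bufferChars is the BufferedReader buffer size.
     * Returns the number of data rows processed.
     */
    static long forEachRow(Reader source, int bufferChars, boolean skipHeader,
                           Consumer<String[]> handler) {
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(source, bufferChars)) {
            if (skipHeader) {
                reader.readLine(); // discard the header row
            }
            String line;
            while ((line = reader.readLine()) != null) {
                // ASSUMPTION: naive split, does not handle quoted fields or
                // embedded commas; a real CSV parser such as superCSV does.
                handler.accept(line.split(",", -1));
                rows++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

The handler is where each record should be written to a database, aggregated, or otherwise consumed. Collecting every row into a list inside the handler would bring back the out-of-memory problem.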
