15

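I am trying to read big CSV and TSV (tab-separated) files with about 1000000 rows or more. I tried to read a TSV containing ~2500000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250000 lines. So I was wondering whether there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?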

For everybody who is interested in my code (I shortened it, so the try-catch is obviously invalid):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    } 
}

Edit: This is the method in which I construct the InputStreamReader:

private InputStreamReader replaceBackSlashes() throws Exception {
        FileInputStream fis = null;
        Scanner in = null;
        try {
            fis = new FileInputStream(this.csvFile);
            in = new Scanner(fis, this.encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            while (in.hasNext()) {
                String nextLine = in.nextLine().replace("\\", "/");
                // nextLine = nextLine.replaceAll(" ", "");
                nextLine = nextLine.replaceAll("'", "");
                out.write(nextLine.getBytes());
                out.write("\n".getBytes());
            }

            return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
        } catch (Exception e) {
            in.close();
            fis.close();
            this.logger.error("Problem at replaceBackSlashes", e);
        }
        throw new Exception();
    }

4 Answers

16

Do not use a CSV parser to parse TSV input. It will break, for example, if the TSV has fields that contain a quote character.
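To illustrate the point about quote characters, here is a minimal sketch. The class TsvQuoteDemo and both of its methods are hypothetical names made up for this example; splitQuoteAware is a deliberately simplified stand-in for CSV-style quoting and is not the actual logic of opencsv or any other library. A plain tab split treats the quote as data, while the quote-aware split swallows the tab that follows an unbalanced quote:

```java
import java.util.ArrayList;
import java.util.List;

public class TsvQuoteDemo {

    /** Plain TSV split: every tab is a delimiter and quotes are ordinary data. */
    public static String[] splitTsv(String line) {
        // A limit of -1 keeps trailing empty fields.
        return line.split("\t", -1);
    }

    /** Simplified CSV-style split: delimiters inside double quotes are ignored. */
    public static List<String> splitQuoteAware(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                // A quote toggles quoted mode and is not copied to the field.
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Two TSV fields; the first one contains a literal quote character.
        String line = "5\" floppy\tdisk";
        System.out.println(splitTsv(line).length);               // prints 2
        System.out.println(splitQuoteAware(line, '\t').size());  // prints 1
    }
}
```

With the quote-aware split, the unbalanced quote opens a "quoted field" that never closes, so the rest of the line ends up in a single field.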

uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

Example of parsing a TSV input:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

If your input is too big to be kept in memory, do this:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));

Disclosure: I am the author of this library. It is open-source and free (Apache V2.0 license).

answered 2014-11-23T08:38:33.193
7

I have not tried it, but I looked into superCSV earlier.

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

Check whether it works for your 2.5 million lines.

answered 2012-12-14T13:56:12.660
1

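Try what Satish suggested. If that doesn't help, you have to split the whole file into tokens yourself and process them.

Assuming your CSV has no escaped commas: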

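// r is a BufferedReader opened on your file
String line;
StringBuilder file = new StringBuilder();
// Load each line and append it to the buffer, followed by a comma so that
// the last field of one line does not merge with the first field of the next.
while ((line = r.readLine()) != null) {
    file.append(line).append(',');
}
// Split the buffer into an array of tokens
String[] tokens = file.toString().split(",");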

Then you can process it. Don't forget to trim the tokens before using them.
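Since the file in the question is large, a variant that never holds the whole file in memory may be more practical. The following is only a sketch under the same assumption (no escaped delimiters): it reads one line at a time, splits and trims it, and passes the tokens to a callback. The class name LineByLineSplit and the method forEachRow are hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Consumer;
import java.util.regex.Pattern;

public class LineByLineSplit {

    /**
     * Reads the source one line at a time, splits each line on the literal
     * delimiter, trims every token and hands the row to the handler.
     * Returns the number of rows processed.
     */
    public static long forEachRow(Reader source, String delimiter, Consumer<String[]> handler) {
        long rows = 0;
        String regex = Pattern.quote(delimiter); // treat the delimiter literally
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split(regex, -1); // -1 keeps trailing empty fields
                for (int i = 0; i < tokens.length; i++) {
                    tokens[i] = tokens[i].trim();
                }
                handler.accept(tokens);
                rows++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader on the real file.
        Reader input = new StringReader("a, b,c\nd,e , f\n");
        long rows = forEachRow(input, ",", row -> System.out.println(Arrays.toString(row)));
        System.out.println(rows + " rows");
    }
}
```

Only one line is held at a time, so memory use depends on the longest line rather than on the size of the file. For a TSV file, pass "\t" as the delimiter.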

answered 2012-12-14T13:59:08.710
1

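I don't know whether this question is still active, but here is the reader I use successfully. It may still need to implement more interfaces, such as Stream or Iterable, however: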

import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

/** Reader for the tab-separated values format (a basic table format without escaping, where the values in each row are separated by tabs). */
public class TSVReader implements Closeable 
{
    final Scanner in;
    String peekLine = null;

    public TSVReader(InputStream stream) throws FileNotFoundException
    {
        in = new Scanner(stream);
    }

    /** Constructs a new TSVReader which produces values scanned from the specified file. */
    public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}

    public boolean hasNextTokens()
    {
        if(peekLine!=null) return true;
        if(!in.hasNextLine()) {return false;}
        String line = in.nextLine().trim();
        if(line.isEmpty())  {return hasNextTokens();}
        this.peekLine = line;       
        return true;        
    }

    public String[] nextTokens()
    {
        if(!hasNextTokens()) return null;       
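        // Note: this splits on any run of whitespace, not only tabs, so fields
        // that contain spaces are split and empty fields are not preserved.
        String[] tokens = peekLine.split("[\\s\t]+");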
//      System.out.println(Arrays.toString(tokens));
        peekLine=null;      
        return tokens;
    }

    @Override public void close() throws IOException {in.close();}
}
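
As a rough sketch of the Iterable idea mentioned above, the following separate class can be used in a for-each loop. It is not the TSVReader class itself: TsvRows is a hypothetical name, and unlike the class above it splits on tab characters only and keeps empty fields. Lines are read lazily, so the file is never loaded as a whole.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Iterator;

public class TsvRows implements Iterable<String[]> {

    private final BufferedReader reader;

    public TsvRows(Reader source) {
        this.reader = new BufferedReader(source);
    }

    /** Single-pass iterator: all iterators share the same underlying reader. */
    @Override
    public Iterator<String[]> iterator() {
        // lines() reads lazily; zero-length lines are skipped.
        Iterator<String> lines = reader.lines().filter(l -> !l.isEmpty()).iterator();
        return new Iterator<String[]>() {
            @Override
            public boolean hasNext() {
                return lines.hasNext();
            }

            @Override
            public String[] next() {
                // Split on tabs only; -1 keeps trailing empty fields.
                return lines.next().split("\t", -1);
            }
        };
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader on the real file.
        Reader input = new StringReader("id\tname\n\n1\tAda Lovelace\n");
        for (String[] row : new TsvRows(input)) {
            System.out.println(Arrays.toString(row));
        }
        // prints [id, name] and then [1, Ada Lovelace]
    }
}
```

The caller remains responsible for closing the Reader it passed in, for example with try-with-resources.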
answered 2014-04-02T12:44:40.310