java - Comparing data in two large files line by line

Question

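I need to analyze the differences between two large data files that should have identical structure. Each file is several gigabytes in size, with perhaps 30 million lines of text data. The files are so large that I hesitate to load each one into its own array, when it might be easier to just iterate through the lines in sequence. Each line has the following structure: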

topicIdx, recordIdx, other fields...

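topicIdx and recordIdx are sequential, starting at zero and incrementing by 1 with each iteration, so they are easy to find in the files. (No searching around is needed; just step forward in order.)

I need to do something like this: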

for each line in fileA
    store line in String itemsA
    get topicIdx and recordIdx
    find line in fileB with same topicIdx and recordIdx
    if exists
        store this line in String itemsB
        for each item in itemsA
            compare value with same index in itemsB
            if these two items are not virtually equal
                // do something
    else
        // do something else

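I wrote the following code using FileReader and BufferedReader, but these APIs do not seem to provide the functionality I need. Can anyone show me how to fix the code below so that it does what I want?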

void checkData(){
    FileReader FileReaderA;
    FileReader FileReaderB;
    int topicIdx = 0;
    int recordIdx = 0;
    try {
        int numLines = 0;
        FileReaderA = new FileReader("B:\\mypath\\fileA.txt");
        FileReaderB = new FileReader("B:\\mypath\\fileB.txt");
        BufferedReader readerA = new BufferedReader(FileReaderA);
        BufferedReader readerB = new BufferedReader(FileReaderB);
        String lineA = null;
        while ((lineA = readerA.readLine()) != null) {
            if (lineA != null && !lineA.isEmpty()) {
                List<String> itemsA = Arrays.asList(lineA.split("\\s*,\\s*"));
                topicIdx = Integer.parseInt(itemsA.get(0));
                recordIdx = Integer.parseInt(itemsA.get(1));
                String lineB = null;
                //lineB = readerB.readLine();//i know this syntax is wrong
                setB = rows from FileReaderB where itemsB.get(0).equals(itemsA.get(0));
                for each lineB in setB{
                    List<String> itemsB = Arrays.asList(lineB.split("\\s*,\\s*"));
                    for(int m = 0;m<itemsB.size();m++){}
                    for(int j=0;j<itemsA.size();j++){
                        double myDblA = Double.parseDouble(itemsA.get(j));
                        double myDblB = Double.parseDouble(itemsB.get(j));
                        if(Math.abs(myDblA-myDblB)>0.0001){
                            //do something
                        }
                    }
                }
            }
        }
        readerA.close();
    } catch (IOException e) {e.printStackTrace();}
}

score 2 · Accepted Answer

You need both files sorted by your search keys (recordIdx and topicIdx), so that you can do a merge-style pass like this:

open file 1
open file 2
read lineA from file1
read lineB from file2
while (there is lineA and lineB) 
    if (key lineB < key lineA) 
        read lineB from file 2
        continue loop
    if (key lineB > key lineA)
        read lineA from file 1
        continue
    // at this point, you have lineA and lineB with matching keys
    process your data
    read lineB from file 2

Note that you will only ever have two records in memory at a time.
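For reference, here is a minimal Java sketch of the merge loop above. It is only one way to write it and rests on a few assumptions that the question does not confirm: both files are already sorted by topicIdx and then recordIdx, each (topicIdx, recordIdx) pair appears at most once per file (so both readers advance after a match), and every field after the two keys is numeric. The names MergeCompare, nextRecord, compareKeys, compareFields and the report callback are made up for illustration; the 0.0001 tolerance and the split pattern come from the question's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

class MergeCompare {
    // Threshold taken from the question's code.
    static final double TOLERANCE = 0.0001;

    // Returns the next non-empty line split into fields, or null at end of input.
    static String[] nextRecord(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed.split("\\s*,\\s*");
            }
        }
        return null;
    }

    // Orders two records by topicIdx first, then recordIdx (assumed sort order).
    static int compareKeys(String[] a, String[] b) {
        int byTopic = Integer.compare(Integer.parseInt(a[0]), Integer.parseInt(b[0]));
        if (byTopic != 0) {
            return byTopic;
        }
        return Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
    }

    static String key(String[] record) {
        return record[0] + "," + record[1];
    }

    // Compares the non-key fields of two records that share the same key.
    static void compareFields(String[] a, String[] b, Consumer<String> report) {
        if (a.length != b.length) {
            report.accept("field count differs at " + key(a));
            return;
        }
        for (int i = 2; i < a.length; i++) {
            double x = Double.parseDouble(a[i]); // assumes numeric fields
            double y = Double.parseDouble(b[i]);
            if (Math.abs(x - y) > TOLERANCE) {
                report.accept("field " + i + " differs at " + key(a) + ": " + a[i] + " vs " + b[i]);
            }
        }
    }

    // Single forward pass over both inputs; holds one record per file in memory.
    static void compare(BufferedReader readerA, BufferedReader readerB, Consumer<String> report) {
        try {
            String[] a = nextRecord(readerA);
            String[] b = nextRecord(readerB);
            while (a != null && b != null) {
                int order = compareKeys(a, b);
                if (order < 0) {            // key exists only in file A
                    report.accept("only in A: " + key(a));
                    a = nextRecord(readerA);
                } else if (order > 0) {     // key exists only in file B
                    report.accept("only in B: " + key(b));
                    b = nextRecord(readerB);
                } else {                    // matching keys: compare the fields
                    compareFields(a, b, report);
                    a = nextRecord(readerA);
                    b = nextRecord(readerB);
                }
            }
            // Whatever remains in either file has no counterpart in the other.
            while (a != null) {
                report.accept("only in A: " + key(a));
                a = nextRecord(readerA);
            }
            while (b != null) {
                report.accept("only in B: " + key(b));
                b = nextRecord(readerB);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To run it on the real files you would open two BufferedReaders (ideally in a try-with-resources block so both get closed) and pass something like System.out::println as the report callback. If a key can repeat within a file, the "advance both readers" step would need to change.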

score 2 · Accepted Answer

If you really need this in Java, why not use java-diff-utils? It implements a well-known diff algorithm.

score 1 · Accepted Answer

Consider https://code.google.com/p/java-diff-utils/. Let someone else do the heavy lifting.

Answered 2013-07-15T20:00:28.990
