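I need to analyze the differences between two large data files that are supposed to have identical structure. Each file is a few gigabytes in size, with perhaps 30 million lines of text data. The files are so large that I would rather not load each one into its own array; iterating through the lines in order seems easier. Each line has the following structure: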
topicIdx, recordIdx, other fields...
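topicIdx and recordIdx are sequential: they start at zero and increase by 1 with each iteration, so they are easy to locate in the files. (No searching around is needed; just step forward in order.)
I need to do something like this: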
for each line in fileA
    store line in String itemsA
    get topicIdx and recordIdx
    find line in fileB with same topicIdx and recordIdx
    if exists
        store this line in string itemsB
        for each item in itemsA
            compare value with same index in itemsB
            if these two items are not virtually equal
                //do something
            else
                //do something else
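I wrote the following code with FileReader and BufferedReader, but these APIs don't seem to provide the functionality I need. Can anyone show me how to fix the code below so that it does what I want?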
void checkData(){
    FileReader FileReaderA;
    FileReader FileReaderB;
    int topicIdx = 0;
    int recordIdx = 0;
    try {
        int numLines = 0;
        FileReaderA = new FileReader("B:\\mypath\\fileA.txt");
        FileReaderB = new FileReader("B:\\mypath\\fileB.txt");
        BufferedReader readerA = new BufferedReader(FileReaderA);
        BufferedReader readerB = new BufferedReader(FileReaderB);
        String lineA = null;
        while ((lineA = readerA.readLine()) != null) {
            if (lineA != null && !lineA.isEmpty()) {
                List<String> itemsA = Arrays.asList(lineA.split("\\s*,\\s*"));
                topicIdx = Integer.parseInt(itemsA.get(0));
                recordIdx = Integer.parseInt(itemsA.get(1));
                String lineB = null;
                //lineB = readerB.readLine();//I know this syntax is wrong
                // The next two lines are pseudocode, not valid Java:
                setB = rows from FileReaderB where itemsB.get(0).equals(itemsA.get(0));
                for each lineB in setB{
                    List<String> itemsB = Arrays.asList(lineB.split("\\s*,\\s*"));
                    for(int m = 0;m<itemsB.size();m++){}
                    for(int j=0;j<itemsA.size();j++){
                        double myDblA = Double.parseDouble(itemsA.get(j));
                        double myDblB = Double.parseDouble(itemsB.get(j));
                        if(Math.abs(myDblA-myDblB)>0.0001){
                            //do something
                        }
                    }
                }
            }
        }
        readerA.close();
        readerB.close();
    } catch (IOException e) {e.printStackTrace();}
}
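
For reference, one possible way to get the behavior described above is to advance both readers in lockstep instead of querying fileB for each line of fileA. This is only a minimal sketch under several assumptions that are not stated in the question: both files are sorted by (topicIdx, recordIdx), every non-blank line has at least those two integer fields, all remaining fields are numeric, and the files are UTF-8. The class name LockstepDiff, its method names, and the report strings are made up for illustration; the 0.0001 tolerance and the split pattern come from the code above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;

// Hypothetical helper: walks two sorted files side by side, holding one line of each in memory.
final class LockstepDiff {
    static final double TOLERANCE = 0.0001; // same threshold as in the question

    // Reports every difference to 'sink' (for example System.out::println).
    static void compare(BufferedReader readerA, BufferedReader readerB,
                        Consumer<String> sink) throws IOException {
        String[] a = nextRecord(readerA);
        String[] b = nextRecord(readerB);
        while (a != null && b != null) {
            int order = compareKeys(a, b);
            if (order < 0) {            // A's key is smaller, so B has no such record
                sink.accept("only in A: " + key(a));
                a = nextRecord(readerA);
            } else if (order > 0) {     // B's key is smaller, so A has no such record
                sink.accept("only in B: " + key(b));
                b = nextRecord(readerB);
            } else {                    // same key: compare the remaining fields
                compareFields(a, b, sink);
                a = nextRecord(readerA);
                b = nextRecord(readerB);
            }
        }
        // Whatever is left in either file has no counterpart in the other.
        for (; a != null; a = nextRecord(readerA)) sink.accept("only in A: " + key(a));
        for (; b != null; b = nextRecord(readerB)) sink.accept("only in B: " + key(b));
    }

    // Returns the fields of the next non-blank line, or null at end of input.
    private static String[] nextRecord(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) return line.split("\\s*,\\s*");
        }
        return null;
    }

    // Orders records by topicIdx first, then by recordIdx.
    private static int compareKeys(String[] a, String[] b) {
        int byTopic = Integer.compare(Integer.parseInt(a[0]), Integer.parseInt(b[0]));
        return byTopic != 0 ? byTopic
                : Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
    }

    // Assumes every field after the two indices parses as a double.
    private static void compareFields(String[] a, String[] b, Consumer<String> sink) {
        if (a.length != b.length) {
            sink.accept(key(a) + ": field count " + a.length + " vs " + b.length);
            return;
        }
        for (int i = 2; i < a.length; i++) {
            double x = Double.parseDouble(a[i]);
            double y = Double.parseDouble(b[i]);
            if (Math.abs(x - y) > TOLERANCE) {
                sink.accept(key(a) + ": field " + i + " differs: " + a[i] + " vs " + b[i]);
            }
        }
    }

    private static String key(String[] fields) {
        return fields[0] + "," + fields[1];
    }

    // Usage: java LockstepDiff <fileA> <fileB>
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedReader b = Files.newBufferedReader(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            compare(a, b, System.out::println);
        }
    }
}
```

Because each reader only ever moves forward, each file is read exactly once and only the current line of each is held in memory. If the sort-order assumption does not hold for the real data, this approach would report spurious "only in" entries, and a different strategy (such as indexing one file first) would be needed.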