java - Is there a fast way to repeatedly read/write large amounts of int[] (BitSet) data to a file in Java?

Question

My main program looks like this (pseudocode):

public static void main(String[] args) {

    // produce lots of int[] data which is stored inside a list of hashmaps
    List<HashMap<Integer, int[]>> dataArray1 = new
                                    ArrayList<HashMap<Integer, int[]>>();
    ...

    // create a new list of data, similar to dataArray1
    // now we will write into dataArray2 and read from dataArray1
    List<HashMap<Integer, int[]>> dataArray2 = new
                                    ArrayList<HashMap<Integer, int[]>>();
    while (true) {
        if (exitCondition) break;
        ...
        for index1, index2 in a set of indices {
            int[] a1 = dataArray1.get(index1).get(key1);
            int[] a2 = dataArray1.get(index2).get(key2);
            int[] b = intersect a1 and a2;
            int i = generateIndex(index1, index2);
            int key = generateKey(key1, key2);
            dataArray2.get(i).put(key, b);
        }
    }

    // now we can remove dataArray1
    dataArray1 = null;

    // create a new list of data, similar to dataArray2
    // now we will write into dataArray3 and read from dataArray2
    List<HashMap<Integer, int[]>> dataArray3 = new
                                    ArrayList<HashMap<Integer, int[]>>();
    while (true) {
        if (exitCondition) break;
        ...
        for index1, index2 in a set of indices {
            int[] a1 = dataArray2.get(index1).get(key1);
            int[] a2 = dataArray2.get(index2).get(key2);
            int[] b = intersect a1 and a2;
            int i = generateIndex(index1, index2);
            int key = generateKey(key1, key2);
            dataArray3.get(i).put(key, b);
        }
    }

    // now we can remove dataArray2
    dataArray2 = null;

    ...
    // and so on 20 times

}

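My problem is that at some point dataArray_k, for some k > 1, becomes very large (say 20 GB), so it is impossible to keep it in memory. I could replace int[] with BitSet, but that does not help: it consumes even more memory.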

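The solution is to use either a database or the file system. Which would you recommend? I need performance (execution time); memory does not matter. If your experience says database, then please recommend the fastest interface for a specific database (which one?), whether it is bd4 (Berkeley DB), PostgreSQL, or something else. If it says file system, then please recommend the fastest interface (file library).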

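As for read/write statistics: in each while loop of my code there are 3 times more reads than writes. For example, for one level k, I read from dataArray_k 12000 times and write to dataArray_(k+1) 4000 times.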

I can store each hash map of List<HashMap<Integer, int[]>> dataArray1 in a separate file.

score 4 · Accepted Answer

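Yesterday I evaluated the read performance of different Java io/nio techniques. It turned out that, on a PC, a java.nio IntBuffer backed by a memory-mapped file gives the best read performance. Details, including code, are here: Fastest way to read huge number of int from binary file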
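As an illustration of the memory-mapped approach mentioned above, here is a minimal sketch, not the code from the linked evaluation. The class name MappedIntStore and its one-array-per-file layout are assumptions made for this example. Note that a single mapping cannot exceed Integer.MAX_VALUE bytes, so a 20 GB data set would have to be split across several files or mapped in windows.

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: stores one int[] per file through a memory-mapped IntBuffer. */
final class MappedIntStore {

    /** Writes all values to the file, replacing any previous content. */
    static void write(Path file, int[] values) throws IOException {
        long bytes = (long) values.length * Integer.BYTES;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            // A READ_WRITE mapping that extends past the current end grows the file.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            map.asIntBuffer().put(values);
            map.force(); // flush the mapped pages to the storage device
        }
    }

    /** Reads the whole file back as an int[] (same default byte order as write). */
    static int[] read(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            IntBuffer ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
            int[] values = new int[ints.remaining()];
            ints.get(values);
            return values;
        }
    }
}
```

With this layout, each int[] value of a HashMap<Integer, int[]> would live in its own file, which is only practical when the number of arrays is moderate.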

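Of course, it also turned out that a change of algorithm is more likely to improve speed. In your case, for example, consider a multi-dimensional search structure such as a quadtree or an R* tree to reduce disk accesses for closely related biological data.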

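Update: now that I have seen the code, it seems that you always iterate over all values (this is not entirely clear). First try a short array, which needs half the space of an int array, provided your values fit into 16 bits.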
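A minimal sketch of the short-array suggestion, assuming every stored value fits into the signed 16-bit range (the question does not say whether this holds). The class name ShortPacking and its methods are hypothetical.

```java
/** Hypothetical helper: narrows int[] to short[] to halve the payload size. */
final class ShortPacking {

    /** Narrows to short[]; rejects any value outside the signed 16-bit range. */
    static short[] toShorts(int[] values) {
        short[] out = new short[values.length];
        for (int i = 0; i < values.length; i++) {
            int v = values[i];
            if (v < Short.MIN_VALUE || v > Short.MAX_VALUE) {
                throw new IllegalArgumentException("value does not fit in 16 bits: " + v);
            }
            out[i] = (short) v;
        }
        return out;
    }

    /** Widens back to int[]; the sign is preserved. */
    static int[] toInts(short[] values) {
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i];
        }
        return out;
    }
}
```

The range check matters because a plain (short) cast silently wraps values that are out of range, which would corrupt the data instead of failing.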

score 0 · Accepted Answer

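Honestly, reading that much data in Java is likely to be a nightmare. I have only worked with text files of up to 5 GB, and that was really slow and difficult. You could use something closer to the operating system (sed, grep, find, etc.). If Java is a must, then I think the NIO packages will be faster than plain file I/O.

See here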
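To make the NIO suggestion concrete, here is a minimal sketch that keeps many arrays in one file, for instance one file per HashMap as the question proposes. The class name IntArrayFile and the length-prefixed record format are assumptions for this example; the caller is expected to remember each returned offset, for example in a HashMap<Integer, Long> keyed like the original map.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical append-only file of length-prefixed int[] records. */
final class IntArrayFile implements AutoCloseable {
    private final FileChannel channel;

    IntArrayFile(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Appends one record and returns the offset at which it starts. */
    long append(int[] values) throws IOException {
        long offset = channel.size();
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (values.length + 1));
        buf.putInt(values.length);     // record header: element count
        buf.asIntBuffer().put(values); // view starts right after the header
        buf.rewind();                  // write from the start of the buffer
        long pos = offset;
        while (buf.hasRemaining()) {
            pos += channel.write(buf, pos);
        }
        return offset;
    }

    /** Reads the record that starts at the given offset. */
    int[] read(long offset) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(Integer.BYTES);
        readFully(header, offset);
        int length = header.getInt(0);
        ByteBuffer body = ByteBuffer.allocate(Integer.BYTES * length);
        readFully(body, offset + Integer.BYTES);
        body.flip();
        int[] values = new int[length];
        body.asIntBuffer().get(values);
        return values;
    }

    private void readFully(ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos);
            if (n < 0) {
                throw new EOFException("truncated record at offset " + pos);
            }
            pos += n;
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Positional reads and writes leave the channel's own position untouched, so records can be fetched in any order, which suits the read-heavy access pattern described in the question.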
