java - How can I quickly get the top N most frequent items from an unsorted array in Java?

Question

I have tried two approaches.

Use a HashMap to count the occurrences of each item, then iterate over the map:

HashMap<Integer, Integer> doc_counts = new HashMap<Integer, Integer>();
for (int i = 0; i < p; ++i) {
    int doc = alld[i];
    Integer count = doc_counts.get(doc);
    if (null == count)
        count = 0;
    doc_counts.put(doc, count + 1);
}
// up to this point it has already taken about 200 ms
for (Entry<Integer, Integer> item : doc_counts.entrySet()) {
    heapCheck(h, hsize, item.getKey(), item.getValue());    // keep the top hsize items in the heap h
}

Sort the array first, then use a heap to get the top N:

Arrays.sort(alld, 0, p); // the sort costs about 160ms
int curr = alld[0];
int count = 0;
for(int i = 0; i < p; i++) {
    int doc = alld[i];
    if(doc == curr) {
        ++count;
    } else {
        ++nHits;
        //curr += base;
        heapCheck(h, hsize, curr, count);
        curr = doc;
        count = 1;
    }
}
// Handle the last document that was collected: the loop above only
// emits a run when the value changes, so the final run is flushed here.
heapCheck(h, hsize, curr, count);

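Testing on an array of 1,600,000 elements shows that the second approach takes about 170 ms, most of it spent in the sort (about 160 ms), while the first approach takes 200 ms just to add all the elements to the HashMap. How can I improve performance: is there a faster map or sort function, or can this be made parallel so that it uses multiple threads?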

score 0 · Accepted Answer

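This task is well suited to parallelization. You can use the Fork/Join framework (ForkJoinPool) to implement a divide-and-conquer algorithm. For example, you could sort the array with a parallel sort algorithm and cut down the 160 ms spent sorting.

Alternatively, if you want to experiment with Java 8, it has a built-in Arrays.parallelSort() method.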
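A minimal sketch of the second approach with the sort swapped for Arrays.parallelSort(), assuming Java 8 or later. The RunConsumer interface and the forEachRun name are made up for illustration; the question's heapCheck(h, hsize, value, count) call would go where the callback is invoked. Whether this actually beats Arrays.sort() depends on the core count and the array size, so it needs measuring.

```java
import java.util.Arrays;

class ParallelSortRuns {
    /** Hypothetical callback; stands in for the question's heapCheck. */
    interface RunConsumer {
        void accept(int value, int count);
    }

    /** Sorts alld[0..p) in parallel, then reports each run of equal values once. */
    static void forEachRun(int[] alld, int p, RunConsumer consumer) {
        if (p <= 0) {
            return;
        }
        // Sorts only the index range [0, p); elements from p onward are untouched.
        Arrays.parallelSort(alld, 0, p);

        int curr = alld[0];
        int count = 1;
        for (int i = 1; i < p; i++) {
            if (alld[i] == curr) {
                count++;
            } else {
                consumer.accept(curr, count); // a run just ended
                curr = alld[i];
                count = 1;
            }
        }
        consumer.accept(curr, count); // flush the last run
    }
}
```

Usage would look like ParallelSortRuns.forEachRun(alld, p, (doc, count) -> heapCheck(h, hsize, doc, count)).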

score 0 · Accepted Answer

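Heap sort is O(n log n), while adding everything to a HashMap is O(n), so you are most likely paying a constant-factor penalty caused by the HashMap resizing and rehashing. Try specifying a large initial capacity to avoid excessive resize operations.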
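A sketch of what presizing could look like. The expectedDistinct parameter is an assumption: it is a caller-supplied estimate of how many distinct values the array holds, which the question does not give. The capacity is derived from HashMap's default load factor of 0.75, so the map should not need to resize as long as the estimate holds.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedCounts {
    /**
     * Counts occurrences in alld[0..p) using a HashMap sized up front.
     * expectedDistinct is a hypothetical estimate supplied by the caller.
     */
    static Map<Integer, Integer> count(int[] alld, int p, int expectedDistinct) {
        // HashMap resizes once size exceeds capacity * 0.75, so ask for
        // enough capacity that expectedDistinct entries stay below that.
        int initialCapacity = Math.max(16, (int) (expectedDistinct / 0.75f) + 1);
        Map<Integer, Integer> docCounts = new HashMap<>(initialCapacity);
        for (int i = 0; i < p; i++) {
            // merge inserts 1 for a new key, otherwise adds 1 to the old count
            docCounts.merge(alld[i], 1, Integer::sum);
        }
        return docCounts;
    }
}
```

This removes the resize cost only; every key and count is still boxed into an Integer, so it may not close the whole gap to the sorting approach.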

score 0 · Accepted Answer

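Using the Collections framework with primitive types is very expensive, because every int has to be boxed into an Integer object.

Try GNU Trove's TIntIntHashMap in place of the counting map in the first approach.

In my opinion and experience, the second approach should be faster, especially if you already have the data in memory and can sort primitives, which is much faster than sorting objects.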
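To show why a primitive map avoids the boxing cost, here is a small open-addressing int-to-int counter in plain Java. This is not Trove's implementation or API; the IntCounter class and its method names are invented for this sketch, which only illustrates the general idea of storing keys and counts in int[] arrays. It is meant for moderately sized inputs and does not guard against tables growing past 2^30 slots.

```java
class IntCounter {
    private int[] keys;
    private int[] counts; // a count of 0 marks an empty slot
    private int size;     // number of distinct keys stored

    IntCounter(int expectedKeys) {
        int cap = 16;
        // Power-of-two capacity, at least twice the expected key count.
        while (cap < expectedKeys * 2L && cap < (1 << 30)) {
            cap <<= 1;
        }
        keys = new int[cap];
        counts = new int[cap];
    }

    /** Adds one to the count stored for key. */
    void increment(int key) {
        if ((size + 1) * 2L > keys.length) {
            grow(); // keep the table at most half full so probing stays short
        }
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (counts[i] != 0 && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        if (counts[i] == 0) {
            keys[i] = key;
            size++;
        }
        counts[i]++;
    }

    /** Returns the count for key, or 0 if it was never incremented. */
    int get(int key) {
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (counts[i] != 0) {
            if (keys[i] == key) {
                return counts[i];
            }
            i = (i + 1) & mask;
        }
        return 0;
    }

    int size() {
        return size;
    }

    // Scrambles the key so that clustered ids spread over the table.
    private static int mix(int k) {
        int h = k * 0x9E3779B9;
        return h ^ (h >>> 16);
    }

    // Doubles the table and reinserts every occupied slot.
    private void grow() {
        int[] oldKeys = keys;
        int[] oldCounts = counts;
        keys = new int[oldKeys.length * 2];
        counts = new int[oldCounts.length * 2];
        int mask = keys.length - 1;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldCounts[j] == 0) {
                continue;
            }
            int i = mix(oldKeys[j]) & mask;
            while (counts[i] != 0) {
                i = (i + 1) & mask;
            }
            keys[i] = oldKeys[j];
            counts[i] = oldCounts[j];
        }
    }
}
```

The counting loop from the question then becomes a call to increment(alld[i]) per element, with no Integer objects allocated. A library such as Trove offers the same kind of structure with far more care taken over edge cases.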

score 0 · Accepted Answer

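Don't sort - that's O(n log n). There is an O(n) + O(N log N) solution:

Create a Map<Integer, Integer> to hold the count of each number, O(n)
Iterate over the array, creating/updating the counts, O(n)
Go through the map keeping the N largest counts, possibly using a NavigableMap, O(N log N)

If N << n, this is O(n). If N ≈ n, it is O(N log N).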
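The steps above can be sketched as follows. This version swaps the suggested NavigableMap for a PriorityQueue used as a min-heap capped at N entries, which is one common way to keep the N largest; the TopNByCount class and topN method are names made up for this example. With m distinct values the heap pass costs O(m log N), so the whole thing stays close to O(n) when N is small.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopNByCount {
    /** Returns up to n (value, count) entries from alld[0..p), highest count first. */
    static List<Map.Entry<Integer, Integer>> topN(int[] alld, int p, int n) {
        // Steps 1 and 2: count every value in one pass.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < p; i++) {
            counts.merge(alld[i], 1, Integer::sum);
        }

        // Step 3: min-heap ordered by count, never holding more than n entries.
        PriorityQueue<Map.Entry<Integer, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (heap.size() < n) {
                heap.offer(e);
            } else if (n > 0 && e.getValue() > heap.peek().getValue()) {
                heap.poll(); // drop the smallest of the current top n
                heap.offer(e);
            }
        }

        // The heap yields entries in ascending count order; reverse for descending.
        List<Map.Entry<Integer, Integer>> result = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Entries with equal counts come back in no particular order, and this still boxes every value, so on the question's data it may or may not beat the sort-based version.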
