algorithm - Find the 10,000 largest values out of 1,000,000 total values

Question

I have a file containing 1,000,000 floating-point values. I need to find the 10,000 largest values.

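I was thinking:

Read the file
Convert the strings to floats
Put the floats into a max-heap (a heap whose root is the largest value)
Once all the values are in the heap, remove the root 10,000 times and add those values to a list/ArrayList.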

I know I will have

1,000,000 insertions into the heap
10,000 removals from the heap
10,000 insertions into the result list

Would this be a good solution? This is for a homework assignment.

score 7 · Accepted Answer

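Your solution is mostly good. It is essentially a heapsort that stops after extracting K elements, which improves the running time from O(N log N) (for a full sort) to O(N + K log N). Here N = 1,000,000 and K = 10,000.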

However, you should not do N inserts to the heap initially, as this would take O(NlogN) - instead, use a heapify operation which turns an array to a heap in linear time.
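The heapify-then-extract idea described above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the function name `top_k_by_heapify` is hypothetical, and because Python's `heapq` module only provides a min-heap, the sketch negates the values to simulate a max-heap.

```python
import heapq

def top_k_by_heapify(values, k):
    """Return the k largest values in descending order.

    Hypothetical helper for illustration; assumes the values fit in memory.
    """
    # heapq is a min-heap, so store negated values to get max-heap behavior.
    heap = [-v for v in values]
    # Bottom-up heap construction runs in O(N), unlike N separate inserts.
    heapq.heapify(heap)
    count = max(0, min(k, len(heap)))
    # Each pop costs O(log N), so extracting K roots costs O(K log N).
    return [-heapq.heappop(heap) for _ in range(count)]
```

With N = 1,000,000 and K = 10,000, the heap is built in a single linear pass and only the 10,000 extractions pay the logarithmic cost.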

If the K numbers don't need to be sorted, you can find the Kth largest number in linear time using a selection algorithm, and then output all numbers larger than it. This gives an O(n) solution.
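One way to realize the selection idea is a quickselect-style partition, sketched below. Note the assumptions: `top_k_by_selection` is a hypothetical name, the random pivot gives expected (not worst-case) linear time, and a median-of-medians pivot would be needed for a worst-case O(n) guarantee. Instead of finding the Kth largest value first and filtering afterwards, this variant collects the answer while partitioning, which also handles duplicate values cleanly.

```python
import random

def top_k_by_selection(values, k):
    """Return the k largest values in no particular order.

    Hypothetical helper for illustration; assumes the input has no NaNs.
    """
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)
    candidates = list(values)
    need = k          # how many values still have to be collected
    result = []
    while candidates:
        pivot = random.choice(candidates)
        larger = [v for v in candidates if v > pivot]
        equal = [v for v in candidates if v == pivot]
        if len(larger) >= need:
            # Everything still needed lies strictly above the pivot.
            candidates = larger
        elif len(larger) + len(equal) >= need:
            # Take all larger values plus enough copies of the pivot.
            result.extend(larger)
            result.extend(equal[:need - len(larger)])
            return result
        else:
            # Keep everything at or above the pivot, then search below it.
            result.extend(larger)
            result.extend(equal)
            need -= len(larger) + len(equal)
            candidates = [v for v in candidates if v < pivot]
    return result
```

The result is unordered; if the 10,000 values must also be sorted, sorting just that subset afterwards adds O(K log K).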

score 0 · Accepted Answer

How about using mergesort (O(n log n) operations in the worst case) to sort the 1,000,000 integers into an array, and then just taking the last 10,000?
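For comparison, the sort-and-slice approach is only a couple of lines. This sketch uses Python's built-in `sorted`, which is a merge-sort hybrid rather than a textbook mergesort; the function name `top_k_by_sorting` is hypothetical.

```python
def top_k_by_sorting(values, k):
    """Return the k largest values in ascending order via a full sort."""
    if k <= 0:
        # Guard needed because a slice like [-0:] would return the whole list.
        return []
    # A full sort costs O(N log N) in the worst case; the slice takes the tail.
    return sorted(values)[-k:]
```

This is the simplest option to write, at the cost of ordering all 1,000,000 values when only 10,000 are needed.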

score 0 · Accepted Answer

Sorting is expensive, and your input set is not small. Fortunately, you don't care about order. All you need is to know that you have the top X numbers. So, don't sort.

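What if, instead of looking for the top 10,000 out of 1,000,000, you were looking for the top 1 (i.e. the single largest value) out of 100? How would you solve that? You would only need to keep track of the largest value seen so far and compare it with each following number, replacing it whenever you find a larger one, until the input runs out. Can you extend that idea back to the input size you are looking at? What would the big-O be (hint: you only look at each input number once)?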
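One possible way to extend the "largest so far" idea to K values is to keep the K largest values seen so far in a small min-heap, so the weakest of them is always at the root and can be replaced cheaply. This is an assumption about how to generalize the hint, not something the answer spells out, and `top_k_streaming` is a hypothetical name. Each input is examined once, but each replacement costs O(log K), so the total is O(N log K) rather than strictly O(N).

```python
import heapq

def top_k_streaming(values, k):
    """Return the k largest values from any iterable, in no particular order."""
    if k <= 0:
        return []
    heap = []  # min-heap of the k largest values seen so far; heap[0] is the smallest
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the weakest kept value: pop that one and push v in one step.
            heapq.heapreplace(heap, v)
    return heap
```

Because it accepts any iterable, this version can consume the file line by line (for example, a generator of `float(line)` values, assuming one number per line) without holding all 1,000,000 values in memory. The standard library's `heapq.nlargest(k, values)` offers the same result in sorted order.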

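One last point, since you said this is homework: if you have just been studying heaps in class and you think your teacher/professor is looking for a heap solution, then yes, your idea is a good one.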

score -1 · Accepted Answer

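Once you have read all the values into an array, could you merge sort them? It is a fast way to sort values. Then you could ask for your_array[10000] and you would know it is the 10,000th largest. Merge sort sounds like what you want. Also, if you really need speed, you could look into formatting your values for a radix sort; that would take some formatting work, but it sounds like it would be the absolute fastest way to solve this.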
