Let's assume we have a very large file which contains billions of integers, and we want to find the k largest of these values. The tricky part is that k itself is very large too, which means we cannot keep k elements in memory (for example, the file has 100 billion elements and we want to find the 10 billion largest). How can we do this in O(n)?
What I thought:

We start reading the file and compare each element against another file which keeps the k largest elements seen so far, sorted in increasing order. If the element we just read is larger than the first line of the second file, we delete that first line and insert the new element into the second file. The time complexity would be O(N log k) (if we have random access to that file; otherwise it would be O(Nk)).
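For reference, if k did fit in memory, the same idea would just be the usual min-heap approach, with the same O(N log k) bound. This is only a sketch of the in-memory analogue, not a solution to the actual question, since here k is too big for RAM:

```python
import heapq

def k_largest_in_memory(values, k):
    # In-memory analogue of the sorted-file idea: keep the k largest seen so
    # far in a min-heap; each new element either replaces the current minimum
    # or is discarded.  O(N log k) time, O(k) memory -- only workable when k
    # fits in RAM, which the question rules out.
    heap = []
    for x in values:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)
    return heap  # the k largest values, in heap order
```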
Any idea how to do this in O(n)? I guess if we had an external-memory version of the selection algorithm (the partitioning scheme used in quicksort) we would be able to do it in O(n), but I couldn't find one anywhere.
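For what it's worth, here is a rough sketch of what such an external selection could look like: sample a pivot, stream the file once to split it into a "greater than pivot" file and a "less than pivot" file, and recurse into whichever partition contains the k-th largest element. Each round keeps only one partition, so the expected total work is O(n). The one-integer-per-line file layout and the function name are just assumptions for illustration, not a known library routine:

```python
import os
import random
import tempfile

def find_kth_largest(path, k, delete_input=False):
    # Sketch of external quickselect over a file with one integer per line
    # (an assumed layout).  Each round streams the current file and recurses
    # into a single partition, so the expected total work is O(n).

    # Pass 1: pick a uniformly random pivot via reservoir sampling.
    pivot, n = None, 0
    with open(path) as f:
        for line in f:
            n += 1
            if random.randrange(n) == 0:
                pivot = int(line)

    # Pass 2: partition into "greater than pivot" / "less than pivot" files,
    # counting elements equal to the pivot so duplicates cannot cause loops.
    hi = tempfile.NamedTemporaryFile("w", delete=False)
    lo = tempfile.NamedTemporaryFile("w", delete=False)
    n_hi = n_eq = 0
    with open(path) as f, hi, lo:
        for line in f:
            x = int(line)
            if x > pivot:
                hi.write(line)
                n_hi += 1
            elif x == pivot:
                n_eq += 1
            else:
                lo.write(line)
    if delete_input:
        os.remove(path)

    # Recurse into the partition that holds the k-th largest element.
    if k <= n_hi:
        os.remove(lo.name)
        return find_kth_largest(hi.name, k, delete_input=True)
    if k <= n_hi + n_eq:
        os.remove(hi.name)
        os.remove(lo.name)
        return pivot
    os.remove(hi.name)
    return find_kth_largest(lo.name, k - n_hi - n_eq, delete_input=True)
```

Once the k-th largest value t is known, one more pass over the original file can write out every element greater than t, plus enough copies equal to t to reach exactly k elements.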