Select and filter algorithm

2019-08-13

An answer to this question on Stack Overflow.

Question

I would like to select the top n values from a dataset, but ignore elements based on what I have already selected - i.e., given a set of points (x,y), I would like to select the top 100 values of x (which are all distinct) but not select any points such that y equals the y of any already-selected point. I would like to make sure that the highest values of x are prioritized.

Is there any existing algorithm for this, or at least similar ones? I have a huge amount of data and would like to do this as efficiently as possible. Memory is not as much of a concern.

Answer

You can do this in O(n log k) time where n is the number of values in the dataset and k are the number of top values you'd like to get.

Store the values you wish to exclude in a hash table.
Make an empty min-heap.
Iterate over all of the values and for each value:
If it is in the hash table skip it.
If the heap contains fewer than k values, add it to the heap.
If the heap contains >=k values, if the value you're looking at is greater than the smallest member of the minheap, pop that value and add the new one.