I am working at Microsoft Bing as an intern this summer. To get an internship I had lots of interviews with various tech companies this year, and this was my favorite question: In an integer array with N elements (N is large), find the minimum k elements (k << N).
The first solution that comes to mind is sorting the array, and returning the first k elements. The complexity is O(NlogN) due to sorting. However, since N is large this algorithm is not very efficient, and it does more work than necessary. We just need the minimum k elements, we don’t need the absolute ordering of all elements. So, we would rather have a solution that depends on k.
A naive method that depends on k is having a list of size k, and comparing every element of the original array with all the elements in the list, and deciding whether the current element is one of the minimum k elements seen so far. This approach is O(Nk), and depending on the relative magnitudes of N and k, this method might perform better than the previous one (if k<logN this solution is more efficient).
Let’s try to improve the previous algorithm. We are comparing every element in the original array with all elements of the list of size k. Can we use a better data structure than a list so that we will perform less than k comparisons for every element? Apparently we can. We can use a max-heap of size k. The element that is at the root of the heap is the maximum element among the minimum k elements. So, for every element in the original array, we will compare it to the root if the heap. If it is larger than the root, we will do nothing and continue to the next element in the original array. But if the current element is smaller than the root of the max-heap, we will delete the root and insert the current element to the heap. The reason is, if the current element is smaller than the root, it means that the root of the heap is no longer one of the minimum k elements, since the current element has a smaller value. Why do we use a heap then? Because we can check whether the current element is one of the minimum k elements we have seen so far in O(1), we just look at the root of the heap. The reason we use a max-heap instead of a min-heap is that we want to compare the current element with the largest element among the minimum k elements. Because that value will be replaced with the current element and we want it to be at the root for O(1) comparison (if we wanted to find the maximum k elements, then we would use a min-heap). After deciding that the root should be replaced, extracting it from the heap and inserting the current element takes O(logk) time. Therefore, complexity of this solution is O(Nlogk). This is much better than the previous approach since it’s logarithmic on k instead of linear. And it is very close to optimum, which is O(N). We can’t do better than O(N) since we have to look at every element in the original array at least once. But can we do it in O(N)?
Without using any additional data structures, let’s form a min-heap from the original array in-place. This can be done in O(N) time using Floyd’s Algorithm (also known as fast heap construction), which forms the heap bottom-up. Remember that heaps are complete binary trees which can also be represented as arrays. When we start to construct a heap bottom-up we don’t perform any operations for the last N/2 elements (the leaves of the tree, level 0) because they don’t have any children, so they already satisfy the min-heap ordering. Then for N/4 elements at level 1 (parents of the leaves) we perform at most 1 bubble-down operation, because they may not satisfy min-heap property (the leaves may be smaller). Then for N/8 elements at level 2 we perform at most 2 bubble-down operations. And it goes on like this up to the root of the tree where we perform at most logN bubble-down operations. If we sum up, we get 1*N/4 + 2*N/8 + 3*N/16 + … + logN*N/2^(logN+1) <= 2N which is O(N). Since this is a geometric series, the sum converges. So, we can construct a heap from an array in linear time. Now we have a min-heap and we need its k minimum elements. We simply perform k delete-min operations on the heap, each of which takes O(logN) time, and get the minimum k elements. The complexity of this second part is O(klogN), and overall complexity of the algorithm is O(N) + O(klogN). Is this O(N)? Depending on the relative values of k and N, this algorithm can be O(N). Specifically, if k<O(N/logN), which is not a small number, then the algorithm is linear. Let’s see with an example. If N is 1 million, we can have k up to 50,000 and the algorithm would still be linear. Which is a reasonable value since we said k << N in the beginning.
So, overall we have a very efficient linear solution which uses no extra space. If we are not allowed to destroy the original array, we can copy its contents to a new array and perform this algorithm on that one, but this would require O(N) space, which may be prohibitive. Then, depending whether space or time is scarce resource, we can use either this approach or the previous solution, which uses O(k) space. The reason this is my favorite interview question is because it shows that use of correct data structures provide very elegant solutions. And heap is a very important data structure that is generally overlooked by candidates, while trying to excel binary search trees, stacks, queues etc. Also fast heap construction is a subtle algorithm that is not very widely known.