# My Favorite Interview Question

I am working at Microsoft Bing as an intern this summer. To get the internship I had lots of interviews with various tech companies this year, and this was my favorite question: in an integer array with N elements (N is large), find the minimum k elements (k << N).

The first solution that comes to mind is sorting the array and returning the first k elements. The complexity is O(NlogN) due to sorting. However, since N is large this algorithm is not very efficient, and it does more work than necessary: we just need the minimum k elements, not the absolute ordering of all elements. So we would rather have a solution whose cost depends on k.
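As a minimal sketch in Python (the function name `min_k_by_sorting` and the variable names are placeholders, not from the post):

```python
def min_k_by_sorting(nums, k):
    """Sort the whole array in O(N log N) and keep the first k elements."""
    return sorted(nums)[:k]
```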

A naive method that depends on k keeps a list of size k: compare every element of the original array against the elements in the list, and decide whether the current element is one of the minimum k elements seen so far. This approach is O(Nk), and depending on the relative magnitudes of N and k it might perform better than the previous one (if k < logN, this solution is more efficient).
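A sketch of this naive approach in Python (all names here are placeholders); each incoming element is checked against the candidate list at O(k) cost:

```python
def min_k_naive(nums, k):
    # Unsorted list holding the k smallest elements seen so far.
    best = []
    for x in nums:
        if len(best) < k:
            best.append(x)
        else:
            # Scanning for the maximum costs O(k) per element -> O(Nk) total.
            largest = max(best)
            if x < largest:
                best.remove(largest)  # evict the largest candidate
                best.append(x)
    return sorted(best)
```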

Let’s try to improve the previous algorithm. We are comparing every element in the original array with all elements of the list of size k. Can we use a better data structure than a list, so that we perform fewer than k comparisons per element? Apparently we can: a max-heap of size k. The element at the root of the heap is the maximum among the minimum k elements seen so far. For every element in the original array, we compare it to the root of the heap. If it is larger than the root, we do nothing and continue to the next element. But if it is smaller than the root, we delete the root and insert the current element into the heap. The reason is that if the current element is smaller than the root, the root can no longer be one of the minimum k elements.

Why do we use a heap, then? Because we can check whether the current element is one of the minimum k elements seen so far in O(1): we just look at the root. The reason we use a max-heap instead of a min-heap is that we want to compare the current element with the largest element among the minimum k elements; that is the value that gets replaced, and we want it at the root for the O(1) comparison (if we wanted to find the maximum k elements, we would use a min-heap instead). After deciding that the root should be replaced, extracting it and inserting the current element takes O(logk) time. The complexity of this solution is therefore O(Nlogk). This is much better than the previous approach, since it’s logarithmic in k instead of linear. And it is very close to the optimum of O(N); we can’t do better than O(N) since we have to look at every element of the original array at least once. But can we do it in O(N)?
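The max-heap approach can be sketched with Python’s `heapq` module; since `heapq` only provides a min-heap, the sketch stores negated values to simulate a max-heap (function and variable names are placeholders):

```python
import heapq

def min_k_heap(nums, k):
    # Negated values turn heapq's min-heap into a max-heap of the k candidates.
    heap = [-x for x in nums[:k]]
    heapq.heapify(heap)  # O(k)
    for x in nums[k:]:
        # -heap[0] is the largest of the current k minimum candidates: O(1) peek.
        if x < -heap[0]:
            # Pop the root and push the new element in one O(log k) operation.
            heapq.heapreplace(heap, -x)
    return sorted(-x for x in heap)
```

For what it’s worth, the standard library’s `heapq.nsmallest(k, nums)` uses essentially this technique internally.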

Without using any additional data structures, let’s form a min-heap from the original array in place. This can be done in O(N) time using Floyd’s algorithm (also known as fast heap construction), which builds the heap bottom-up. Remember that heaps are complete binary trees, which can also be represented as arrays. When we construct a heap bottom-up, we perform no operations for the last N/2 elements (the leaves of the tree, level 0) because they have no children, so they already satisfy the min-heap ordering. For the N/4 elements at level 1 (the parents of the leaves) we perform at most 1 bubble-down operation, because they may violate the min-heap property (a leaf may be smaller). For the N/8 elements at level 2 we perform at most 2 bubble-down operations, and it goes on like this up to the root of the tree, where we perform at most logN bubble-down operations. Summing up, we get 1*N/4 + 2*N/8 + 3*N/16 + … + logN*N/2^(logN+1) <= 2N, which is O(N); the series converges to a constant times N. So we can construct a heap from an array in linear time.

Now we have a min-heap and we need its k minimum elements. We simply perform k delete-min operations on the heap, each of which takes O(logN) time (the heap holds up to N elements), and get the minimum k elements. The complexity of this second part is O(klogN), and the overall complexity of the algorithm is O(N) + O(klogN). Is this O(N)? It depends on the relative values of k and N: if k = O(N/logN), which is not a small number, the algorithm is linear. Let’s see with an example. If N is 1 million, we can have k up to 50,000 and the algorithm is still linear. That is a reasonable bound, since we said k << N in the beginning.
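A sketch of this two-phase approach with Python’s `heapq` (names are placeholders). Note that `heapq.heapify` performs exactly this bottom-up O(N) construction, and that the sketch destroys its input, just like the in-place version described above:

```python
import heapq

def min_k_heapify(nums, k):
    # Phase 1: Floyd's bottom-up heap construction, in place, O(N).
    heapq.heapify(nums)
    # Phase 2: k delete-min operations, O(log N) each -> O(k log N) total.
    return [heapq.heappop(nums) for _ in range(k)]
```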

So, overall we have a very efficient linear solution which uses no extra space. If we are not allowed to destroy the original array, we can copy its contents to a new array and perform this algorithm on that one, but this requires O(N) space, which may be prohibitive. Then, depending on whether space or time is the scarcer resource, we can use either this approach or the previous solution, which uses O(k) space. The reason this is my favorite interview question is that it shows how using the correct data structure provides a very elegant solution. The heap is a very important data structure that is generally overlooked by candidates, who focus on binary search trees, stacks, queues, etc. Also, fast heap construction is a subtle algorithm that is not very widely known.

• mert

I’m coming at you with a comment.
If n is large compared to k (k << n) and the numbers are uniformly distributed, then as long as you compare each element with the largest of the current set of k smallest elements (stored as a plain list/array), the expected complexity will still be linear without using a heap.

• http://www.egeakpinar.net Ege

Arden, nice article, well done! :)

• Arden

Thanks, Ege! :)

• Li

Nice one. It really demonstrates when to use a heap.

• Ugur

Nice question. I’m not sure, but maybe you can use the worst-case O(N) selection algorithm, referred to as the linear general selection algorithm (median of medians): you can choose the kth element in linear time and take the elements to its left. Of course this adds O(N) space complexity if you don’t want to destroy the data, but you have that cost with the heap already :)

• Arden

Good idea, but we want to return all k minimum elements, instead of just the kth minimum element. So we can perform the selection algorithm k times for i=[1, k]. As a result, the complexity would be O(Nk).

• Ugur

I mean, as the algorithm progresses it recurses into the left and right sets, which represent the elements to the left and to the right of the selected pivot. So we can return the “left set” instead of returning the kth element, and that would be O(N), I guess.

• Arden

Ok, now I get it. That’s a great idea! Sorry for the confusion. Yes the time complexity would be O(N) and space complexity would be O(logN) which is the stack space of recursion. I’ll add this solution to the post by mentioning you.
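A rough Python sketch of the partition-based idea from this thread, written iteratively with a random pivot (expected O(N) time; all names here are made up for illustration):

```python
import random

def min_k_quickselect(nums, k):
    # Partition around random pivots, narrowing the range that straddles
    # position k-1, until the first k slots hold the k smallest elements.
    a = list(nums)  # copy so the caller's array is not destroyed
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        # Keep only the side of the partition that contains index k-1.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break  # k-1 lies in the band of pivot-equal elements: done
    return a[:k]  # the "left set": the k smallest, not necessarily sorted
```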

• Ugur

Sorry I forgot the link after writing here:) ==> http://en.wikipedia.org/wiki/Selection_algorithm

• Sumesh

Well said, excellent explanation. Thanks a lot.

• Saurabh Verma

The solution can be optimized to O(n) + O(klogk) instead of O(n + klogn), at the cost of O(k) extra space. After building the original heap in O(n) time, you can create a new heap of size k in the following way:
1) Initialize the new heap with the root node of the original heap.
2) Extract the root of the new heap, add the children (in the original heap) of the extracted node, and increase the count by 1.
3) Repeat this procedure until the count reaches k.

All the extracted nodes are the minimum k elements, not necessarily in the same order.
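A hedged Python sketch of the procedure above, using `heapq` and an auxiliary heap of (value, index) pairs over the array form of the heapified input; in the array representation, the children of node i sit at 2i+1 and 2i+2 (names are placeholders):

```python
import heapq

def min_k_lazy(nums, k):
    # Step 0: build a min-heap over the whole array in O(n).
    heap = list(nums)
    heapq.heapify(heap)
    # Traverse the big heap with a small auxiliary heap that never holds
    # more than about k entries, so each operation costs O(log k).
    aux = [(heap[0], 0)]  # start from the root of the original heap
    result = []
    while len(result) < k and aux:
        val, i = heapq.heappop(aux)
        result.append(val)
        # Push the extracted node's children from the original heap.
        for child in (2 * i + 1, 2 * i + 2):
            if child < len(heap):
                heapq.heappush(aux, (heap[child], child))
    return result  # the k minimum elements, in ascending order
```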


• 9ddang

Hi, thanks for this article. Very informative. “We simply perform k delete-min operations on the heap, each of which takes O(logN) time” ==> Isn’t it O(logK) time? Therefore, complexity of this second part is O(Nlogk), instead of O(klogN)?