Programming Interview Questions 13: Median of Integer Stream

Given a stream of unsorted integers, find the median element in sorted order at any given time. So, we will be receiving a continuous stream of numbers in some random order and we don’t know the stream length in advance. Write a function that finds the median of the already received numbers efficiently at any time. We will be asked to find the median multiple times. Just to recall, median is the middle element in an odd length sorted array, and in the even case it’s the average of the middle elements.

This is a data structure question. We will insert the received numbers into such a data structure that we’ll be able to find the median very efficiently. Let’s analyse the possible options.

We can insert the integers to an unsorted array, so we’ll just append the numbers to the array one by one as we receive. Insertion complexity is O(1) but finding the median will take O(N) time, if we use the Median of Medians algorithm that I described in my previous post. However, our goal is to find the median most efficiently, we don’t care that much about insertion performance. But this algorithm does the exact opposite, so unsorted array is not a feasible solution.

What about using a sorted array? We can find the position to insert the received number in O(logN) time using binary search. And at any time if we’re asked for the median we can just return the middle element if the array length is odd, or the average of middle elements if the length is even. This can be done in O(1) time, which is exactly what we’re looking for. But there’s a major drawback of using a sorted array. To keep the array sorted after inserting an element, we may need to shift the elements to the right, which will take O(N) time. So, even if finding the position to insert the number takes O(logN) time, the overall insertion complexity is O(N) due to shifting. But finding the median is still extremely efficient, constant time. However, linear time insertion is pretty inefficient and we would prefer a better performance.

Let’s try linked lists. First unsorted linked list. Insertion is O(1), we can insert either to the head or tail but we suffer from the same problem of unsorted array. Finding the median is O(N). What if we keep the linked list sorted? We can find the median in O(1) time if we keep track of the middle elements. Insertion to a particular location is also O(1) in any linked list, so it seems great thus far. But, finding the right location to insert is not O(logN) as in sorted array, it’s instead O(N) because we can’t perform binary search in a linked list even if it is sorted. So, using a sorted linked list doesn’t worth the effort, insertion is O(N) and finding median is O(1), same as the sorted array. In sorted array insertion is linear due to shifting, here it’s linear because we can’t do binary search in a linked list. This is a very fundamental data structure knowledge that we should keep at the top of our heads all the time.

Using a stack or queue wouldn’t help as well. Insertion would be O(1) but finding the median would be O(N), very inefficient.

What if we use trees? Let’s use a binary search tree with additional information at each node, number of children on the left and right subtrees. We also keep the number of total nodes in the tree. Using this additional information we can find the median in O(logN) time, taking the appropriate branch in the tree based on number of children on the left and right of the current node. However, the insertion complexity is O(N) because a standard binary search tree can degenerate into a linked list if we happen to receive the numbers in sorted order.

So, let’s use a balanced binary search tree to avoid worst case behaviour of standard binary search trees. In a height balanced binary search tree (i.e. AVL tree) the balance factor is the difference between the heights of left and right subtrees. A node with balance factor 0, +1, or -1 is considered to be balanced. However, in our tree the balance factor won’t be height, it is the number of nodes in the left subtree minus the number of nodes in the right subtree. And only the nodes with balance factor of +1 or 0 are considered to be balanced. So, the number of nodes on the left subtree is either equal to or 1 more than the number of nodes on the right subtree, but not less. If we ensure this balance factor on every node in the tree, then the root of the tree is the median, if the number of elements is odd. In the even case, the median is the average of the root and its inorder successor, which is the leftmost descendent of its right subtree. So, complexity of insertion maintaining balance condition is O(logN) and find median operation is O(1) assuming we calculate the inorder successor of the root at every insertion if the number of nodes is even. Insertion and balancing is very similar to AVL trees. Instead of updating the heights, we update the number of nodes information.

Balanced binary search trees seem to be the most optimal solution, insertion is O(logN) and find median is O(1). Can we do better? We can achieve the same complexity with a simpler and more elegant solution. We will use 2 heaps simultaneously, a max-heap and a min-heap with 2 requirements. The first requirement is that the max-heap contains the smallest half of the numbers and min-heap contains the largest half. So, the numbers in max-heap are always less than or equal to the numbers in min-heap. Let’s call this the order requirement. The second requirement is that, the number of elements in max-heap is either equal to or 1 more than the number of elements in the min-heap. So, if we received 2N elements (even) up to now, max-heap and min-heap will both contain N elements. Otherwise, if we have received 2N+1 elements (odd), max-heap will contain N+1 and min-heap N. Let’s call this the size requirement.

The heaps are constructed considering the two requirements above. Then once we’re asked for the median, if the total number of received elements is odd, the median is the root of the max-heap. If it’s even, then the median is the average of the roots of the max-heap and min-heap. Let’s now analyse why this approach works, and how we construct the heaps.

We will have two methods, insert a new received number to the heaps and find median. The insertion procedure takes the two requirements into account, and it’s executed every time we receive a new element. We take two different approaches depending on whether the total number of elements is even or odd before insertion.

Let’s first analyze the size requirement during insertion. In both cases we insert the new element to the max-heap, but perform different actions afterwards. In the first case, if the total number of elements in the heaps is even before insertion, then there are N elements both in max-heap and min-heap because of the size requirement. After inserting the new element to the max-heap, it contains N+1 elements but this doesn’t violate the size requirement. Max-heap can contain 1 more element than min-heap. In the second case, if the number of elements is odd before insertion, then there are N+1 elements in max-heap and N in min-heap. After we insert the new element to the max-heap, it contains N+2 elements. But this violates the size constraint, max-heap can contain at most 1 more element than min-heap. So we pop an element from max-heap and push it to min-heap. The details will be described soon.

Now let’s analyse the order requirement. This requirement forces every element in the max-heap to be less than or equal to all the elements in min-heap. So the max-heap contains the smaller half of the numbers and the min-heap contains the larger half. Note that by design the root of the max-heap is the maximum of the lower half, and root of the min-heap is the minimum of the upper half. Keeping these in mind, we again take two different actions depending on whether the total number of elements is even or odd before insertion. In the even case we just inserted the new element to the max-heap. If the new element is less than all the elements in the min-heap, then the order constraint is satisfied and we’re done. We can perform this check by comparing the new element to the root of the min-heap in O(1) time since the root of the min-heap is the minimum. But if the new element is larger than the root of min-heap then we should exchange those elements to satisfy the order requirement. Note that in this case the root of the max-heap is the new element. So we pop the the root of min-heap and insert it to max-heap. Also pop the root of max-heap and insert it to min-heap. In second case, where the total number of elements before insertion is odd, we inserted the new element to max-heap, then we popped an element and pushed it to the min-heap. To satisfy the order constraint, we pop the maximum element of the max-heap, the root, and insert it to the min-heap. Insertion complexity is O(logN), which is the insertion complexity of a heap.

That is exactly how the insertion procedure works. We ensured that both size and order requirements are satisfied during insertion. Find median function works as follows. At any time we will be queried for the median element. If the total number of elements at that time is odd, then the median is the root of the max-heap. Let’s visualize this with an example. Assume that we have received 7 elements up to now, so the median is the 4th number in sorted order. Currently, max-heap contains 4 smallest elements and min-heap contains 3 largest because of the requirements described above. And since the root of the max-heap is the maximum of the smallest four elements, it’s the 4th element in sorted order, which is the median. Else if the total number of elements is even, then the median is the average of the roots of max-heap and min-heap. Let’s say we have 8 elements, so the median is the average of 4th and 5th elements in sorted order. Currently, both the max-heap and min-heap contain 4 numbers. Root of the max-heap is the maximum of the smallest numbers, which is 4th in sorted order. And root of the min-heap is the minimum of the largest numbers, which is 5th in sorted order. So, the median is the average of the roots. In both cases we can find the median in O(1) time because we only access the roots of the heaps, neither insertion nor removal is performed. Therefore, overall this solution provides O(1) find heap and O(logN) insert.

A code is worth a thousand words, here is the code of the 2-heaps solution. As you can see, it’s much less complicated than it’s described. We can use the heapq module in python, which provides an implementation of min-heap only. But we need a max-heap as well, so we can make a min-heap behave like a max-heap by multiplying the number to be inserted by -1 and then inserting. So, every time we insert or access an element from the max-heap, we multiply the value by -1 to get the original number:

class streamMedian:
    def __init__(self):
        self.minHeap, self.maxHeap = [], []
        self.N=0
 
    def insert(self, num):
        if self.N%2==0:
            heapq.heappush(self.maxHeap, -1*num)
            self.N+=1
            if len(self.minHeap)==0:
                return
            if -1*self.maxHeap[0]>self.minHeap[0]:
                toMin=-1*heapq.heappop(self.maxHeap)
                toMax=heapq.heappop(self.minHeap)
                heapq.heappush(self.maxHeap, -1*toMax)
                heapq.heappush(self.minHeap, toMin)
        else:
            toMin=-1*heapq.heappushpop(self.maxHeap, -1*num)
            heapq.heappush(self.minHeap, toMin)
            self.N+=1
 
    def getMedian(self):
        if self.N%2==0:
            return (-1*self.maxHeap[0]+self.minHeap[0])/2.0
        else:
            return -1*self.maxHeap[0]

We have a class streamMedian, and every time we receive an element, insert function is called. The median is returned using the getMedian function.

This is a great interview question that tests data structure fundamentals in a subtle way.

VN:F [1.9.22_1171]
Rating: 9.5/10 (67 votes cast)
Programming Interview Questions 13: Median of Integer Stream, 9.5 out of 10 based on 67 ratings
This entry was posted in Programming Interview. Bookmark the permalink.
  • serdark

    This is a question I got in one of the interviews. I think it is a nice one that allows brainstorming about different structures. Arden, this is great explanation, and even better, this is very concise code –you might want to have a wrapper for min/max transformation but obviously not needed here.

  • codebook

    Great explaination ….You simply roxx :))

  • Kowshik

    Very well written!

  • King Panther

    Awesome article. Thanks.
    There is a typo in this paragraph “That is exactly how the insertion procedure works……………”
    9th line
    “Else if the total number of elements is odd………..”
    I think it should be “even”

    Thanks
    King Panther

    • Arden

      Yes you’re right, updated the article. Thanks a lot for noticing..

  • ANONYMOUS

    Arden, you are a great tech writer. concise, clean and well explained.

    • Arden

      Thanks a lot for the compliment, it really motivates me.

  • Yang

    Just awesome!

  • Elite

    Very eloquent explanation, thanks a lot! In a long text like this, some illustrations about the heaps would be helpful.
    The asymmetrical handling of the heaps looks a bit artificial. For example, consider the case when max heap has N+1 elements and min heap has N elements. If a new element is larger or equal to the root of the min heap, then we can simply insert that element to the min heap. Now both heaps have the same number of elements and there is no need to pop the root and push it to the other heap (2 log N time is saved).
    Similarly, if a new element is between the roots of the two heaps and the sizes of the heaps are not equal, then we can insert the element to the smaller heap. Again the additional push and pop operations are avoided.

  • http://ranga Ranga

    How about a small extension to your problem? You are maintaining 2 heaps right. Lets say now you are asked to remove certain numbers and then find the median.
    So you get an input stream of numbers and every number comes with a flag to indicate whether to add to the list of numbers or remove from the list of numbers and accordingly update the median
    Got the question? How would you solve this?

  • Glenn C

    Got asked this exact question today in a phone interview, disguised as a stream of query commands w/ their execution times.

    It was a slotted 30 minute interview, with 10 minutes spent introducing ourselves and the company.

    So in 20 minutes, someone is supposed to be able to come up with this solution and then explain it? Keep in mind we’re not supposed to have known the answer beforehand. Kinda rediculous.

    • Anonymous Bastard

      Which company?

    • Holden

      What you were asked? Order statistics? How the problem you said is related to this question?

  • http://vedri.ca Vedrana

    Hi Arden!
    This is a great explanation! Here’s also a Java implementation :)


    import java.util.Comparator;
    import java.util.PriorityQueue;
    import java.util.Queue;

    public class MedianOfIntegerStream {

    public Queue minHeap;
    public Queue maxHeap;
    public int numOfElements;

    public MedianOfIntegerStream() {
    minHeap = new PriorityQueue();
    maxHeap = new PriorityQueue(10, new MaxHeapComparator());
    numOfElements = 0;
    }

    public void addNumberToStream(Integer num) {
    maxHeap.add(num);
    if (numOfElements%2 == 0) {
    if (minHeap.isEmpty()) {
    numOfElements++;
    return;
    }
    else if (maxHeap.peek() > minHeap.peek()) {
    Integer maxHeapRoot = maxHeap.poll();
    Integer minHeapRoot = minHeap.poll();
    maxHeap.add(minHeapRoot);
    minHeap.add(maxHeapRoot);
    }
    } else {
    minHeap.add(maxHeap.poll());
    }
    numOfElements++;
    }

    public Double getMedian() {
    if (numOfElements%2 != 0)
    return new Double(maxHeap.peek());
    else
    return (maxHeap.peek() + minHeap.peek()) / 2.0;
    }

    private class MaxHeapComparator implements Comparator {
    @Override
    public int compare(Integer o1, Integer o2) {
    return o2 - o1;
    }
    }

    public static void main(String[] args) {
    MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();

    streamMedian.addNumberToStream(1);
    System.out.println(streamMedian.getMedian()); // should be 1

    streamMedian.addNumberToStream(5);
    streamMedian.addNumberToStream(10);
    streamMedian.addNumberToStream(12);
    streamMedian.addNumberToStream(2);
    System.out.println(streamMedian.getMedian()); // should be 5

    streamMedian.addNumberToStream(3);
    streamMedian.addNumberToStream(8);
    streamMedian.addNumberToStream(9);
    System.out.println(streamMedian.getMedian()); // should be 6.5
    }
    }

    • http://vedri.ca Vedrana

      … and now for some actually readable code: https://gist.github.com/3675434
      :)

      • Priyank

        Hi Vedrana, your code has a problem, try it for input,
        -50
        50
        10
        The answer should be 10, but your output is -50.

        • Berkay

          her implementation is correct, and works fine, and the complexity is logarithmic, O(logn).
          You should define the maxHeap as
          maxHeap = new PriorityQueue(10, new MaxHeapComparator());

          • James Fraser

            O(logn) ? isn’t it O(nlogn)? as we need to read the stream, and due to heap insertions/deletions. What do you say?

          • Berkay

            the extract/insert into a heap should be O(log N). I believe everything else involved should be constant complexity.

            you can read the post from here http://stackoverflow.com/questions/7842347/find-median-in-olog-n

  • Raghavendra

    Hi Arden, thank you very much for the excellent article. I usually do not prefer reading long articles, but this one is just simply awesome…

  • Coder

    Amazing explanation. Thank you!

  • http://www.advancedcoiltubing.com/prada-canapa-stampata-frame-bag-brings-out-memory-of-childhood/ Issac Maez

    There is no vagueness in it that if you can manage to buy luxury products then sandals or pumps will be surly going to be into your shopping list
    Issac Maez http://www.advancedcoiltubing.com/prada-canapa-stampata-frame-bag-brings-out-memory-of-childhood/

  • sachin gupta

    In the balanced bst method,how to balance in terms of size?

  • http://www.facebook.com/chander2 Chander Ramesh

    Wonderful!

  • Akash K.T

    Gr8 article

  • ue

    i had a question regarding the size requirement: “In the second case, if the number of elements is odd before insertion, then there are N+1 elements in max-heap and N in min-heap. After we insert the new element to the max-heap, it contains N+2 elements. But this violates the size constraint, max-heap can contain at most 1 more element than min-heap. So we pop an element from max-heap and push it to min-heap. ”

    Why do we need to insert element first for the odd case? since we already know that the number of elements are odd and that the N+1th element is in the maxheap, why not directly add the new element in the min-heap?

  • Berkay

    to better understand the problem, you can refer to http://www.youtube.com/watch?v=NGG-C6GLi7k 13:04

    • Holden

      The video link is broken …

  • Sat

    Excellent explanation. Just loved the post !

  • Vigya

    Hi Arden,

    Great Article. There’s just 1 doubt I wanted to clarify,
    In BST approach
    You say, left subtree should have size equal to or 1 greater than right subtree. Since you are considering the inorder successor for even number of nodes, it should be the other way round, i.e. right subtree should have size equal to or 1+leftSubtreeSize.
    Otherwise use inorder predecessor.


    So, the number of nodes on the left subtree is either equal to or 1 more than the number of nodes on the right subtree, but not less. If we ensure this balance factor on every node in the tree, then the root of the tree is the median, if the number of elements is odd. In the even case, the median is the average of the root and its inorder successor, which is the leftmost descendent of its right subtree.

  • PhilM

    Great explanation. Don’t have any problems with it. But I do take exception to the claim that this is a great interview question. Allow me explain.

    Answers to hard/clever questions are difficult to come by in the setting of an interview. Unless you have studied a particular algorithm or data structure, you really have near zero chance of coming up with something that is not an obvious solution. This may be difficult for some to believe but the insights required for these kinds of problems do not occur to most people. Please search the net for Jon Bentley’s discussion of Kadane’s algorithm and see for yourself that an O(n) solution was ruled out by those seriously smart Bell Labs people only to have Kadane come up with it in a matter of minutes. That doesn’t happen always and I don’t know if Kadane consistently produces miraculous algorithms.

    More seriously, this line of interviewing will bear no better result than evaluating one’s academic performance, job experience and administering simple tests which demonstrate skills that are required for the job.

  • http://algorithm3.wordpress.com/ fox_3

    Great one. !

  • Holden

    How we can get median in O(n) in an unsorted linked list?
    I think we need to sort it first, in O(nlogn), then find median in O(n), am I right?

  • atul_gavle

    while finding the median using avl tree……if we r maintaining the more number of nodes int the left subtree (i.e maintaining the balance factor as 1) always,then in case of even number of nodes,the median should be avg of root and its inorder predecessor ,i think .plz correct me if wrong . thank you :-)

  • Anonymous

    This solution assumes that the stream fits into system’s memory. I was curious on a solution where the stream does not fit in the system’s memory. Let’s say, there is a file with data for 100GB and there is only 8GB memory in the system.