Arden Dertat » Programming Interview Information Retrieval and Machine Learning Mon, 11 Mar 2013 06:32:35 +0000 en-US hourly 1 http://wordpress.org/?v=3.5.1 Programming Interview Questions 28: Longest Compound Word /2012/06/15/programming-interview-questions-28-longest-compound-word/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-28-longest-compound-word /2012/06/15/programming-interview-questions-28-longest-compound-word/#comments Fri, 15 Jun 2012 06:36:44 +0000 Arden /?p=1130 Continue reading ]]> This is my last post of the long programming interview questions series. I’m starting my first full-time job next week at Microsoft Bing Relevance team. Wish me luck..

Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: ['cat', 'cats', 'catsdogcats', 'catxdogcatsrat', 'dog', 'dogcatsdog', 'hippopotamuses', 'rat', 'ratcat', 'ratcatdog', 'ratcatdogcat']. Then the longest compound word is ‘ratcatdogcat’ with 12 letters. Note that the longest individual words are ‘catxdogcatsrat’ and ‘hippopotamuses’ with 14 letters, but they’re not fully constructed by other words. Former one has an extra ‘x’ letter, and latter is an individual word by itself not a compound word.

We will use the  data structure, also known as a prefix tree. Tries are space and time efficient structures for text storage and search.  They let words to share prefixes, and prefixes are exactly what we’ll be dealing with in this question. There’s a sample trie implementation in a previous question find word positions in text. The one we’re going to use is slightly modified, with a different node definition and two additional Trie class functions:

class Node:
    def __init__(self, letter=None, isTerminal=False):
        self.letter=letter
        self.children={}
        self.isTerminal=isTerminal
def insert(self, word):
    current=self.root
    for letter in word:
        if letter not in current.children:
            current.children[letter]=Node(letter)
        current=current.children[letter]
    current.isTerminal=True
 
def getAllPrefixesOfWord(self, word):
    prefix=''
    prefixes=[]
    current=self.root
    for letter in word:
        if letter not in current.children:
            return prefixes
        current=current.children[letter]
        prefix+=letter
        if current.isTerminal:
            prefixes.append(prefix)
    return prefixes

The isTerminal field in a Node class means that the word constructed by letters from the root to the current node is a valid word. For example the word cat is formed by 3 nodes, one for each letter. And the last letter ‘t’ has isTerminal value True.

The insert function of the trie just adds the given word to the trie. The more important getAllPrefixesOfWord function takes a word as input, and returns all the valid prefixes of that word. Where valid means that prefix word appears in the given input list. The list being sorted helps us here. While scanning the list from the beginning, all the prefixes of the current word have already been seen and added to the trie. So finding all the prefixes of a word can be done easily by following a single path down from the root with nodes being the letters of the word, and returning the prefix words corresponding to nodes with true isTerminal value

.

The algorithm works as follows, we scan the input list from the beginning and insert each word into the trie. Right after inserting a word, we check whether it has any prefixes. If yes, then it’s a candidate for the longest compound word. We append a pair of current word and its suffix (word-prefix) into a queue. The reason is that the current word is a valid construct only if the suffix is also a valid word or a compound word. So we build the trie and queue while scanning the array.

Let’s illustrate the process with an example, the first word is cat and we add it to the trie. Since it doesn’t have a prefix, we continue. Second word is cats, we add it to the trie and check whether it has a prefix, and yes it does, the word cat. So we append the pair <’cats’, ‘s’> to the queue, which is the current word and the suffix. The third word is catsdogcats, we again insert it to the trie and see that it has 2 prefixes, cat and cats. So we add 2 pairs <’catsdogcats’, ‘sdogcats’> and <’catsdogcats’, ‘dogcats’> where the former suffix correspond to the prefix cat and the latter to cats. We continue like this by adding <’catxdogcatsrat’, ‘xdogcatsrat’> to the queue and so on. And here’s the trie formed by adding example the words in the problem definition:

After building the trie and the queue, then we start processing the queue by popping a pair from the beginning. As explained above, the pair contains the original word and the suffix we want to validate. We check whether the suffix is a valid or compound word. If it’s a valid word and the original word is the longest up to now, we store the result. Otherwise we discard the pair. The suffix may be a compound word itself, so we check if the it has any prefixes. If it does, then we apply the above procedure by adding the original word and the new suffix to the queue. If the suffix of the original popped pair is neither a valid word nor has a prefix, we simply discard that pair.

An example will clarify the procedure, let’s check the pair <’catsdogcats’, ‘dogcats’> popped from the queue. Dogcats is not a valid word, so we check whether it has a prefix. Dog is a prefix, and cats is the new suffix. So we add the pair <’catsdogcats’, ‘cats’> to the queue. Next time we pop this pair we’ll see that cats is a valid word and finish processing the word catsdogcats. As you can see, the suffix will get shorter and shorter at each iteration, and at some point the pair will either be discarded or stored as the longest word. And as the pairs are being discarded, the length of the queue will decrease and the algorithm will gradually come to an end. Here’s the complete algorithm:

def longestWord(words):
    #Add words to the trie, and pairs to the queue
    trie=Trie()
    queue=collections.deque()
    for word in words:
        prefixes=trie.getAllPrefixesOfWord(word)
        for prefix in prefixes:
            queue.append( (word, word[len(prefix):]) )
        trie.insert(word)
 
    #Process the queue
    longestWord=''
    maxLength=0
    while queue:
        word, suffix = queue.popleft()
        if suffix in trie and len(word)&gt;maxLength:
            longestWord=word
            maxLength=len(word)
        else:
            prefixes=trie.getAllPrefixesOfWord(suffix)
            for prefix in prefixes:
                queue.append( (word, suffix[len(prefix):]) )
 
    return longestWord

The complexity of this algorithm is O(kN) where N is the number of words in the input list, and k the maximum number of words in a compound word. The number k may vary from one list to another, but it’ll generally be a constant number like 5 or 10. So, the algorithm is linear in number of words in the list, which is an optimal solution to the problem.

And here’s the impementation of the trie for completeness:

class Trie:
    def __init__(self):
        self.root=Node('')
 
    def __repr__(self):
        self.output([self.root])
        return ''
 
    def output(self, currentPath, indent=''):
        #Depth First Search
        currentNode=currentPath[-1]
        if currentNode.isTerminal:
            word=''.join([node.letter for node in currentPath])
            print indent+word
            indent+='  '
        for letter, node in sorted(currentNode.children.items()):
            self.output(currentPath[:]+[node], indent)
 
    def insert(self, word):
        current=self.root
        for letter in word:
            if letter not in current.children:
                current.children[letter]=Node(letter)
            current=current.children[letter]
        current.isTerminal=True
 
    def __contains__(self, word):
        current=self.root
        for letter in word:
            if letter not in current.children:
                return False
            current=current.children[letter]
        return current.isTerminal
 
    def getAllPrefixesOfWord(self, word):
        prefix=''
        prefixes=[]
        current=self.root
        for letter in word:
            if letter not in current.children:
                return prefixes
            current=current.children[letter]
            prefix+=letter
            if current.isTerminal:
                prefixes.append(prefix)
        return prefixes

]]> /2012/06/15/programming-interview-questions-28-longest-compound-word/feed/ 12 Programming Interview Questions 27: Squareroot of a Number /2012/01/26/programming-interview-questions-27-squareroot-of-a-number/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-27-squareroot-of-a-number /2012/01/26/programming-interview-questions-27-squareroot-of-a-number/#comments Thu, 26 Jan 2012 12:10:36 +0000 Arden /?p=920 Continue reading ]]> Find the squareroot of a given number rounded down to the nearest integer, without using the sqrt function. For example, squareroot of a number between [9, 15] should return 3, and [16, 24] should be 4.

The squareroot of a (non-negative) number N always lies between 0 and N/2. The straightforward way to solve this problem would be to check every number k between 0 and N/2, until the square of k becomes greater than or rqual to N. If k^2 becomes equal to N, then we return k. Otherwise, we return k-1 because we’re rounding down. Here’s the code:

def sqrt1(num):
    if num<0:
        raise ValueError
    if num==1:
        return 1
    for k in range(1+(num/2)):
        if k**2==num:
            return k
        elif k**2>num:
            return k-1
    return k

The complexity of this approach is O(N), because we have to check N/2 numbers in the worst case. This linear algorithm is pretty inefficient, we can use some sort of binary search to speed it up. We know that the result is between 0 and N/2, so we can first try N/4 to see whether its square is less than, greater than, or equal to N. If it’s equal then we simply return that value. If it’s less, then we continue our search between N/4 and N/2. Otherwise if it’s greater, then we search between 0 and N/4. In both cases we reduce the potential range by half and continue, this is the logic of binary search. We’re not performing regular binary search though, it’s modified. We want to ensure that we stop at a number k, where k^2<=N but (k+1)^2>N. The code will clarify any doubts:

def sqrt2(num):
    if num<0:
        raise ValueError
    if num==1:
        return 1
    low=0
    high=1+(num/2)
    while low+1<high:
        mid=low+(high-low)/2
        square=mid**2
        if square==num:
            return mid
        elif square<num:
            low=mid
        else:
            high=mid
    return low

One difference from regular binary search is the condition of the while loop, it’s low+1<high instead of low<high. Also we have low=mid instead of low=mid+1, and high=mid instead of high=mid-1. These are the modifications we make to standard binary search. The complexity is still the same though, it’s logarithmic O(logN). Which is much better than the naive linear solution.

There’s also a constant time O(1) solution which involves a clever math trick. Here it is:

\sqrt{N} = N^{0.5} = 2^{log_2N^{0.5}} = 2^{0.5log_2{N}}

This solution exploits the property that if we take the exponent of the log of a number, the result  doesn’t change, it’s still the number itself. So we can first calculate the log of a number, multiply with 0.5, take the exponent, and finally take the floor of that value to round it down. This way we can avoid using the sqrt function by using the log function. Logarithm of a number rounded down to the nearest integer can be calculated in constant time, by looking at the position of the leftmost 1 in the binary representation of the number. For example, the number 6 in binary is 110, and the leftmost 1 is at position 2 (starting from right counting from 0). So the logarithm of 6 rounded down is 2. This solution doesn’t always give the same result as above algorithms though, because of the rounding effects. And depending on the interviewer’s perspective this approach can be regarded as either very elegant and clever, or tricky and invalid. But in any case it definitely worths mentioning, ultimately it proves that you’re aware of math shortcuts. Which is always a desired talent in every job candidate.

]]>
/2012/01/26/programming-interview-questions-27-squareroot-of-a-number/feed/ 11
Programming Interview Questions 26: Trim Binary Search Tree /2012/01/17/programming-interview-questions-26-trim-binary-search-tree/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-26-trim-binary-search-tree /2012/01/17/programming-interview-questions-26-trim-binary-search-tree/#comments Tue, 17 Jan 2012 10:14:28 +0000 Arden /?p=918 Continue reading ]]> Given the root of a binary search tree and 2 numbers min and max, trim the tree such that all the numbers in the new tree are between min and max (inclusive). The resulting tree should still be a valid binary search tree. So, if we get this tree as input:

and we’re given min value as 5 and max value as 13, then the resulting binary search tree should be:

We should remove all the nodes whose value is not between min and max. We can do this by performing a post-order traversal of the tree. We first process the left children, then right children, and finally the node itself. So we form the new tree bottom up, starting from the leaves towards the root. As a result while processing the node itself, both its left and right subtrees are valid trimmed binary search trees (may be NULL as well).

At each node we’ll return a reference based on its value, which will then be assigned to its parent’s left or right child pointer, depending on whether the current node is left or right child of the parent. If current node’s value is between min and max (min<=node<=max) then there’s no action need to be taken, so we return the reference to the node itself. If current node’s value is less than min, then we return the reference to its right subtree, and discard the left subtree. Because if a node’s value is less than min, then its left children are definitely less than min since this is a binary search tree. But its right children may or may not be less than min we can’t be sure, so we return the reference to it. Since we’re performing bottom-up post-order traversal, its right subtree is already a trimmed valid binary search tree (possibly NULL), and left subtree is definitely NULL because those nodes were surely less than min and they were eliminated during the post-order traversal. Remember that in post-order traversal we first process all the children of a node, and then finally the node itself.

Similar situation occurs when node’s value is greater than max, we now return the reference to its left subtree. Because if a node’s value is greater than max, then its right children are definitely greater than max. But its left children may or may not be greater than max. So we discard the right subtree and return the reference to the already valid left subtree. The code is easier to understand:

def trimBST(tree, minVal, maxVal):
    if not tree:
        return
    tree.left=trimBST(tree.left, minVal, maxVal)
    tree.right=trimBST(tree.right, minVal, maxVal)
    if minVal<=tree.val<=maxVal:
        return tree
    if tree.val<minVal:
        return tree.right
    if tree.val>maxVal:
        return tree.left

The complexity of this algorithm is O(N), where N is the number of nodes in the tree. Because we basically perform a post-order traversal of the tree, visiting each and every node one. This is optimal because we should visit every node at least once. This is a very elegant question that demonstrates the effectiveness of recursion in trees.

]]>
/2012/01/17/programming-interview-questions-26-trim-binary-search-tree/feed/ 8
Programming Interview Questions /2012/01/09/programming-interview-questions/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions /2012/01/09/programming-interview-questions/#comments Mon, 09 Jan 2012 20:23:39 +0000 Arden /?p=1048 Continue reading ]]> The complete list of all my programming interview question articles with pointers to original posts. There are 28 questions in total, and since 28 is a  (as Donald Knuth also ) I decided that’s a good place to stop.

1. Array Pair Sum
Given an integer array, output all pairs that sum up to a specific value k.

2. Matrix Region Sum
Given a matrix of integers and coordinates of a rectangular region within the matrix, find the sum of numbers falling inside the rectangle. Our program will be called multiple times with different rectangular regions from the same matrix.

3. Largest Continuous Sum
Given an array of integers (positive and negative) find the largest continuous sum.

4. Find Missing Element
There is an array of non-negative integers. A second array is formed by shuffling the elements of the first array and deleting a random element. Given these two arrays, find which element is missing in the second array.

5. Linked List Remove Nodes
Given a linkedlist of integers and an integer value, delete every node of the linkedlist containing that value.

6. Combine Two Strings
We are given 3 strings: str1, str2, and str3. Str3 is said to be a shuffle of str1 and str2 if it can be formed by interleaving the characters of str1 and str2 in a way that maintains the left to right ordering of the characters from each string. For example, given str1=”abc” and str2=”def”, str3=”dabecf” is a valid shuffle since it preserves the character ordering of the two strings. So, given these 3 strings write a function that detects whether str3 is a valid shuffle of str1 and str2.

7. Binary Search Tree Check
Given a binary tree, check whether it’s a binary search tree or not.

8. Transform Word
Given a source word, target word and an English dictionary, transform the source word to target by changing/adding/removing 1 character at a time, while all intermediate words being valid English words. Return the transformation chain which has the smallest number of intermediate words.

9. Convert Array
Given an array [a1, a2, ..., aN, b1, b2, ..., bN, c1, c2, ..., cN]  convert it to [a1, b1, c1, a2, b2, c2, ..., aN, bN, cN] in-place using constant extra space

10. Kth Largest Element in Array
Given an array of integers find the kth element in the sorted order (not the kth distinct element). So, if the array is [3, 1, 2, 1, 4] and k is 3 then the result is 2, because it’s the 3rd element in sorted order (but the 3rd distinct element is 3).

11. All Permutations of String
Generate all permutations of a given string.

12. Reverse Words in a String
Given an input string, reverse all the words. To clarify, input: “Interviews are awesome!” output: “awesome! are Interviews”. Consider all consecutive non-whitespace characters as individual words. If there are multiple spaces between words reduce them to a single white space. Also remove all leading and trailing whitespaces. So, the output for ”   CS degree”, “CS    degree”, “CS degree   “, or ”   CS   degree   ” are all the same: “degree CS”.

13. Median of Integer Stream
Given a stream of unsorted integers, find the median element in sorted order at any given time. So, we will be receiving a continuous stream of numbers in some random order and we don’t know the stream length in advance. Write a function that finds the median of the already received numbers efficiently at any time. We will be asked to find the median multiple times. Just to recall, median is the middle element in an odd length sorted array, and in the even case it’s the average of the middle elements.

14. Check Balanced Parentheses
Given a string of opening and closing parentheses, check whether it’s balanced. We have 3 types of parentheses: round brackets: (), square brackets: [], and curly brackets: {}. Assume that the string doesn’t contain any other character than these, no spaces words or numbers. Just to remind, balanced parentheses require every opening parenthesis to be closed in the reverse order opened. For example ‘([])’ is balanced but ‘([)]‘ is not.

 15. First Non Repeated Character in String
Find the first non-repeated (unique) character in a given string.

16. Anagram Strings
Given two strings, check if they’re anagrams or not. Two strings are anagrams if they are written using the same exact letters, ignoring space, punctuation and capitalization. Each letter should have the same count in both strings. For example, ‘Eleven plus two’ and ‘Twelve plus one’ are meaningful anagrams of each other.

17. Search Unknown Length Array
Given a sorted array of unknown length and a number to search for, return the index of the number in the array. Accessing an element out of bounds throws exception. If the number occurs multiple times, return the index of any occurrence. If it isn’t present, return -1.

18. Find Even Occurring Element
Given an integer array, one element occurs even number of times and all others have odd occurrences. Find the element with even occurrences.

19. Find Next Palindrome Number
Given a number, find the next smallest palindrome larger than the number. For example if the number is 125, next smallest palindrome is 131.

20. Tree Level Order Print
Given a binary tree of integers, print it in level order. The output will contain space between the numbers in the same level, and new line between different levels.

21. Tree Reverse Level Order Print
This is very similar to the previous post level order print. We again print the tree in level order, but now starting from bottom level to the root.

22. Find Odd Occurring Element
Given an integer array, one element occurs odd number of times and all others have even occurrences. Find the element with odd occurrences.

23. Find Word Positions in Text
Given a text file and a word, find the positions that the word occurs in the file. We’ll be asked to find the positions of many words in the same file.

24. Find Next Higher Number With Same Digits
Given a number, find the next higher number using only the digits in the given number. For example if the given number is 1234, next higher number with same digits is 1243.

25. Remove Duplicate Characters in String
Remove duplicate characters in a given string keeping only the first occurrences. For example, if the input is ‘tree traversal’ the output will be ‘tre avsl’.

26. Trim Binary Search Tree
Given the root of a binary search tree and 2 numbers min and max, trim the tree such that all the numbers in the new tree are between min and max (inclusive). The resulting tree should still be a valid binary search tree.

27. Squareroot of a Number
Find the squareroot of a given number rounded down to the nearest integer, without using the sqrt function. For example, squareroot of a number between [9, 15] should return 3, and [16, 24] should be 4.

28. Longest Compound Word
Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: ['cat', 'cats', 'catsdogcats', 'catxdogcatsrat', 'dog', 'dogcatsdog', 'hippopotamuses', 'rat', 'ratcat', 'ratcatdog', 'ratcatdogcat']. Then the longest compound word is ‘ratcatdogcat’ with 12 letters. Note that the longest individual words are ‘catxdogcatsrat’ and ‘hippopotamuses’ with 14 letters, but they’re not fully constructed by other words. Former one has an extra ‘x’ letter, and latter is an individual word by itself not a compound word.

]]>
/2012/01/09/programming-interview-questions/feed/ 17
Programming Interview Questions 25: Remove Duplicate Characters in String /2012/01/06/programming-interview-questions-25-remove-duplicate-characters-in-string/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-25-remove-duplicate-characters-in-string /2012/01/06/programming-interview-questions-25-remove-duplicate-characters-in-string/#comments Fri, 06 Jan 2012 09:36:33 +0000 Arden /?p=916 Continue reading ]]> Remove duplicate characters in a given string keeping only the first occurrences. For example, if the input is ‘tree traversal’ the output will be ‘tre avsl’.

We need a data structure to keep track of the characters we have seen so far, which can perform efficient find operation. If the input is guaranteed to be in standard ASCII form, we can just create a boolean array of size 128 and perform lookups by accessing the index of the character’s ASCII value in constant time. But if the string is Unicode then we would need a much larger array of size more than 100K, which will be a waste since most of it would generally be unused.

Set data structure perfectly suits our purpose. It stores keys and provides constant time search for key existence. So, we’ll loop over the characters of the string, and at each iteration we’ll check whether we have seen the current character before by searching the set. If it’s in the set then it means we’ve seen it before, so we ignore it. Otherwise, we include it in the result and add it to the set to keep track for future reference. The code is easier to understand:

def removeDuplicates(string):
    result=[]
    seen=set()
    for char in string:
        if char not in seen:
            seen.add(char)
            result.append(char)
    return ''.join(result)

The time complexity of the algorithm is O(N) where N is the number of characters in the input string, because set supports O(1) insert and find. This is an optimal solution to one of the most common string interview questions.

Note: Complexity of insert and find for set is language and implementation dependent. Average case is mostly constant, but worst case can be logarithmic or even linear. But in an interview setting I think it’s safe to assume constant time insert and find for set.

]]>
/2012/01/06/programming-interview-questions-25-remove-duplicate-characters-in-string/feed/ 7
Programming Interview Questions 24: Find Next Higher Number With Same Digits /2012/01/02/programming-interview-questions-24-find-next-higher-number-with-same-digits/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-24-find-next-higher-number-with-same-digits /2012/01/02/programming-interview-questions-24-find-next-higher-number-with-same-digits/#comments Mon, 02 Jan 2012 11:41:37 +0000 Arden /?p=914 Continue reading ]]> Given a number, find the next higher number using only the digits in the given number. For example if the given number is 1234, next higher number with same digits is 1243.

The naive approach is to generate the numbers with all digit permutations and sort them. Then find the given number in the sorted sequence and return the next number in sorted order as a result. The complexity of this approach is pretty high though, because of the permutation step involved. A given number N has logN+1 digits, so there are O(logN!) permutations. After generating the permutations, sorting them will require O(logN!loglogN!) operations. We can simplify this further, remember that O(logN!) is equivalent to O(NlogN). And O(loglogN!) is O(logN). So, the complexity is O(N(logN)^2).

Let’s visualize a better solution using an example, the given number is 12543 and the resulting next higher number should be 13245. We scan the digits of the given number starting from the tenths digit (which is 4 in our case) going towards left. At each iteration we check the right digit of the current digit we’re at, and if the value of right is greater than current we stop, otherwise we continue to left. So we start with current digit 4, right digit is 3, and 4>=3 so we continue. Now current digit is 5, right digit is 4, and 5>= 4, continue. Now current is 2, right is 5, but it’s not 2>=5, so we stop. The digit 2 is our pivot digit. From the digits to the right of 2, we find the smallest digit higher than 2, which is 3. This part is important, we should find the smallest higher digit for the resulting number to be precisely the next higher than original number. We swap this digit and the pivot digit, so the number becomes 13542. Pivot digit is now 3. We sort all the digits to the right of the pivot digit in increasing order, resulting in 13245. This is it, here’s the code:

def nextHigher(num):
    strNum=str(num)
    length=len(strNum)
    for i in range(length-2, -1, -1):
        current=strNum[i]
        right=strNum[i+1]
        if current<right:
            temp=sorted(strNum[i:])
            next=temp[temp.index(current)+1]
            temp.remove(next)
            temp=''.join(temp)
            return int(strNum[:i]+next+temp)
    return num

Note that if the digits of the given number is monotonically increasing from right to left, like 43221 then we won’t perform any operations, which is what we want because this is the highest number obtainable from these digits. There’s no higher number, so we return the given number itself. The same case occurs when the number has only a single digit, like 7. We can’t form a different number since there’s only a single digit.

The complexity of this algorithm also depends on the number of digits, and the sorting part dominates. A given number N has logN+1 digits and in the worst case we’ll have to sort logN digits. Which happens when all digits are increasing from right to left except the leftmost digit, for example 1987. For sorting we don’t have to use comparison based algorithms such as quicksort, mergesort, or heapsort which are O(KlogK), where K is the number of elements to sort. Since we know that digits are always between 0 and 9, we can use counting sort, radix sort, or bucket sort which can work in linear time O(K). So the overall complexity of sorting logN digits will stay linear resulting in overall complexity O(logN). Which is optimal since we have to check each digit at least once.

]]>
/2012/01/02/programming-interview-questions-24-find-next-higher-number-with-same-digits/feed/ 19
Programming Interview Questions 23: Find Word Positions in Text /2011/12/20/programming-interview-questions-23-find-word-positions-in-text/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-23-find-word-positions-in-text /2011/12/20/programming-interview-questions-23-find-word-positions-in-text/#comments Tue, 20 Dec 2011 17:46:50 +0000 Arden /?p=911 Continue reading ]]> Given a text file and a word, find the positions that the word occurs in the file. We’ll be asked to find the positions of many words in the same file.

Since we’ll have to answer multiple queries, precomputation would be useful. We’ll build a data structure that stores the positions of all the words in the text file. This is known as inverted index in Information Retrieval. It’s basically a mapping between the words in the file and their positions. You can read more about it in my how to implement a search engine post, where I describe how to actually implement a working search engine with real code.

We can use hashtable as our inverted index. The key is a word in the file, and the value is the array of positions that word occurs. So, we’ll get all the words in the file and populate the hashtable with their positions. The words may contain capital letters, but we don’t want separate entries for apple, and Apple in our inverted index. So, we’ll write our  own parsing function which converts all the words to lowercase, eliminates non-alphanumeric characters, and then splits on whitespace to get all the actual words in the file. We can do it easily using regular expressions:

def getWords(text):
    return re.sub(r'[^a-z0-9]',' ',text.lower()).split()

The function first converts all letters to lowercase, then replaces non-alphanumeric characters with space, and splits on whitespace. The result is all the words in their original order. Then we build the inverted index by looping over the words in the file we just got, and append the position of each word to its list of positions in the hashtable:

def createIndex1(text):
    index=collections.defaultdict(list)
    words=getWords(text)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index

Once we build the index we can answer any query. Given a word, just return its position array if it exists in the hashtable index. Otherwise, if the word isn’t present in the index, it means that the word doesn’t appear in the file, so return an empty list. Here’s the code:

def queryIndex1(index, word):
    if word in index:
        return index[word]
    else:
        return []

This is it actually! We have built a search engine for a single file, if we generalize it to work on multiple files (web pages) we’ll have a basic search engine. The complexity of create index is O(N), linear in the number of words in the text. The complexity of query index is constant O(1), because it’s just a simple lookup operation in the hashtable index. So, both creating the index and querying for word positions is optimal. The space complexity is also O(N), linear in the number of words. Because we have each word as a key in the hashtable.

Better Solution
Using hashtable as the inverted index is pretty efficient. But we can do better by using a more space efficient data structure for text, namely a , also known as prefix tree. Trie is a tree which is very efficient in text storage and search. It saves space by letting the words that have the same prefixes to share data. And most languages contain many words that share the same prefix.

Let’s demonstrate this with an example using the words subtree, subset, and submit. In a hashtable index, each of these words will be stored separately as individual keys. So, it will take 7+6+6=19 characters of space in the index. But all these words share the same prefix ‘sub’. We can take advantage of this fact by using a trie and letting them share that prefix in the index. In a trie index these words will take 7+3+3=13 characters of space, cutting the size of the index around 30%. We can have even more gain with the words that share longer prefixes. For example author, authority, and authorize. Hashtable index uses 6+9+9=24 characters but trie uses only 6+3+2=11, leading to 55% compression. Considering the extremely big web indexes of search engines, reducing the index size even by 10% is a big gain, which results in saving terabytes of space and reduces the number of machines by 1/10. Which is huge given that thousands of machines are used at Google, Microsoft, and Yahoo.

To summarize, it’s very useful to use a trie while performing storage and retrieval on text data. There are existing implementations of tries, but they’re more complicated than we actually need. So let’s implement our own simple trie, it’ll be more fun and informative.

A trie is simply a tree where each node stores a character as key, and the value in our case will be the occurrence positions of the word associated with the node. The word associated with a node is concatenation of the characters from root of the tree to the node. Every node in the tree has a corresponding word, but not all of them are valid English words. Most of them are intermediate words. Only the words that occur in the given text will contain position data. We call these nodes terminal nodes. Maximum number of children of a node is the number of different characters that appear in the text, which is 36 if we only consider lowercase alphanumeric characters. Here is the structure of a node:

class Node:
    def __init__(self, letter):
        self.letter=letter
        self.isTerminal=False
        self.children={}
        self.positions=[]

The trie data structure is composed of these nodes, and it’ll include the following 3 functions: getItem, contains, and output. The getItem function both returns the occurrence positions of a given word and it’s used for insertion, contains function checks whether a word is in the trie or not, and output prints the trie in a nice formatted manner. We insert a new occurrence position to a word by first getting its existing position list using the getItem function, and then appending the new position to the end of the list. Trying to access an element that’s not in the trie automatically creates that element, similar to . Code of getItem and contains functions are very similar. Both navigate through the tree until they reach the node that corresponds to the given word. Output prints the words in the trie in sorted order and indented in a way that prefix relations can be visually identified. Here’s the complete implementation:

class Trie:
    def __init__(self):
        self.root=Node('')        
 
    def __contains__(self, word):
        current=self.root
        for letter in word:
            if letter not in current.children:
                return False
            current=current.children[letter]
        return current.isTerminal
 
    def __getitem__(self, word):
        current=self.root
        for letter in word:
            if letter not in current.children:
                current.children[letter]=Node(letter)
            current=current.children[letter]
        current.isTerminal=True
        return current.positions
 
    def __str__(self):
        self.output([self.root])
        return ''
 
    def output(self, currentPath, indent=''):
        #Depth First Search
        currentNode=currentPath[-1]
        if currentNode.isTerminal:
            word=''.join([node.letter for node in currentPath])
            print indent+word+' '+str(currentNode.positions)
            indent+='  '
        for letter, node in sorted(currentNode.children.items()):
            self.output(currentPath[:]+[node], indent)

New createIndex and queryIndex functions are very similar to hashtable ones:

def createIndex2(text):
    trie=Trie()
    words=getWords(text)
    for pos, word in enumerate(words):
        trie[word].append(pos)
    return trie
 
def queryIndex2(index, word):
    if word in index:
        return index[word]
    else:
        return []

Let’s see an example trie constructed from words that share prefixes, here’s an artificial example text for demo: ‘us use uses used user users using useful username user utah’. The output function prints the following trie:

Indentation is based on shared word prefixes. For example used, useful, and user all share the prefix word use, so they’re indented. Also username and users share the prefix word user. But us and utah are at the same indentation level because the prefix they share (character u) is not a proper word. The values of words are their occurrence positions in the text.

The complexity of adding or finding an element in a trie is also constant O(1), like hashtable. It doesn’t depend on the number of words stored in the trie, it only depends on the number of characters in the word, just like hashing. And maximum number of characters in a word for a particular language is constant. Average number of characters in an English word is around 4-5, and most of them are shorter than 15. Which means it’ll generally take around 4-5 operations, and mostly less than 15, to find a word in the trie independent of the number of words stored. It can be millions or even billions, the complexity stays the same. Which is perfect, because it’s not linear on the number of words, instead it’s just a constant number.

This is personally one of my favorite questions because we actually implement a basic search engine that operates on a single file, and can easily be extended to search multiple files. If you would like to see how to build an actual search engine with working code, I recommend my previous posts: create index, query index, and ranking.

]]>
/2011/12/20/programming-interview-questions-23-find-word-positions-in-text/feed/ 3
Programming Interview Questions 22: Find Odd Occurring Element /2011/12/13/programming-interview-questions-22-find-odd-occurring-element/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-22-find-odd-occurring-element /2011/12/13/programming-interview-questions-22-find-odd-occurring-element/#comments Tue, 13 Dec 2011 18:02:01 +0000 Arden /?p=883 Continue reading ]]> Given an integer array, one element occurs odd number of times and all others have even occurrences. Find the element with odd occurrences.

This question is very similar to the previous find even occurring element problem. And we can actually use the same solutions. One approach is again to build a hashtable of element occurrence counts and return the element with odd count. Both time and space complexity of this solution is O(N).

But we can do much better by using the XOR trick described in that post. It’s the following: if we XOR a number with itself odd number of times the result is 0, otherwise if we XOR even number of times the result is the number itself. So, if we XOR all the elements in the array, the result is the odd occurring element itself. Because all even occurring elements will be XORed with themselves odd number of times, producing 0. And the only odd occurring element will be XORed with itself even number of times, producing its own value.

Let’s say we’re given the following array: [1, 2, 3, 1, 2, 3, 1]. If  we XOR all the elements in this array the result is 1^2^3^1^2^3^1 = 1. Because the numbers 2 and 3 will be XORed with themselves 1 time, producing 0. And the number 1 will be XORed with itself 2 times, resulting in its own value. So, the overall result of the XOR operations is the number 1, odd occurring element in the array. Here’s the code:

def getOdd(arr):
    return reduce(lambda x, y: x^y, arr)

Simple as that! The time complexity of this solution is still O(N), but now the space complexity is constant O(1). Because we’re just using constant extra memory, not proportional to the size of the input array. This is the most optimal solution to the problem, since we’re accessing every element only once and using constant extra space.

This is a great question because it demonstrates the power and effectiveness of bit manipulation operators.

]]>
/2011/12/13/programming-interview-questions-22-find-odd-occurring-element/feed/ 7
Programming Interview Questions 21: Tree Reverse Level Order Print /2011/12/08/programming-interview-questions-21-tree-reverse-level-order-print/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-21-tree-reverse-level-order-print /2011/12/08/programming-interview-questions-21-tree-reverse-level-order-print/#comments Thu, 08 Dec 2011 17:54:27 +0000 Arden /?p=878 Continue reading ]]> This is very similar to the previous post level order print. We again print the tree in level order, but now starting from bottom level to the root. Using the same tree as before:

The output should be:
4 5 6
2 3
1

The solution of this problem is similar to level order print. We start a breadth first search (BFS) from the root of the tree and push each node to a queue. We don’t print any node at this point because we want to output bottom up, but BFS progresses from top to bottom. Printing will be take place a separate loop after completing breadth first search. We also count the number of nodes in each level and push them to a stack, in order to print the new lines in correct places. So after completion of BFS we have the following two data structures:

Queue of nodes: [1, 2, 3, 4, 5, 6]. This queue is constructed by BFS from top to bottom and left to right.
Stack of node counts at each level: [3, 2, 1]. Note that since this is a stack the first element is the node count at the deepest level, and the last count is always 1 which corresponds to the root of the tree.

After constructing these data structures, the nodes we want to print as the first line of output are at the end of the queue. And the number of nodes to print is at the top of the stack. So we start a loop where at each iteration we pop an element from the stack and print that many number of nodes from the end of the queue. Using the example tree above, the first line of the output contains 3 nodes, the first element in the stack. And the nodes to print are the 3 nodes at the end of the queue, which is [4, 5, 6]. The second line of the output contains 2 nodes, the second value in the stack. These are the 2 nodes in the queue that are just before the first line nodes, namely [2, 3]. Finally, the number of nodes in the last line is at the end of the stack, which is 1. And the node to print is at the beginning of the queue, the value 1.

Here is the code:

def reverseLevelOrderPrint(tree):
    if not tree:
        return
    nodes=[tree]    #queue
    levelCount=collections.deque([1])   #stack
    currentCount, nextCount = 1, 0
    i=0
    while i<len(nodes):
        currentNode=nodes[i]
        currentCount-=1
        if currentNode.left:
            nodes.append(currentNode.left)
            nextCount+=1
        if currentNode.right:
            nodes.append(currentNode.right)
            nextCount+=1
        if currentCount==0:
            #finished this level
            if nextCount==0:
                #no more nodes at next level
                break
            #continue with next level
            levelCount.appendleft(nextCount)
            currentCount, nextCount = nextCount, currentCount
        i+=1
    printIndex=len(nodes)
    for count in levelCount:
        output=nodes[printIndex-count:printIndex]
        print ' '.join(map(str, output)), '\n',
        printIndex-=count

This is a great question that uses the most fundamental data structures: tree, stack, and queue.

]]>
/2011/12/08/programming-interview-questions-21-tree-reverse-level-order-print/feed/ 6
Programming Interview Questions 20: Tree Level Order Print /2011/12/05/programming-interview-questions-20-tree-level-order-print/?utm_source=rss&utm_medium=rss&utm_campaign=programming-interview-questions-20-tree-level-order-print /2011/12/05/programming-interview-questions-20-tree-level-order-print/#comments Mon, 05 Dec 2011 20:03:40 +0000 Arden /?p=862 Continue reading ]]> Given a binary tree of integers, print it in level order. The output will contain space between the numbers in the same level, and new line between different levels. For example, if the tree is:

The output should be:
1
2 3
4 5 6

It won’t be practical to solve this problem using recursion, because recursion is similar to depth first search, but what we need here is breadth first search. So we will use a queue as we did previously in breadth first search. First, we’ll push the root node into the queue. Then we start a while loop with the condition queue not being empty. Then, at each iteration we pop a node from the beginning of the queue and push its children to the end of the queue. Once we pop a node we print its value and space.

To print the new line in correct place we should count the number of nodes at each level. We will have 2 counts, namely current level count and next level count. Current level count indicates how many nodes should be printed at this level before printing a new line. We decrement it every time we pop an element from the queue and print it. Once the current level count reaches zero we print a new line. Next level count contains the number of nodes in the next level, which will become the current level count after printing a new line. We count the number of nodes in the next level by counting the number of children of the nodes in the current level. Understanding the code is easier than its explanation:

class Node:
    def __init__(self, val=None):
        self.left, self.right, self.val = None, None, val        
 
def levelOrderPrint(tree):
    if not tree:
        return
    nodes=collections.deque([tree])
    currentCount, nextCount = 1, 0
    while len(nodes)!=0:
        currentNode=nodes.popleft()
        currentCount-=1
        print currentNode.val,
        if currentNode.left:
            nodes.append(currentNode.left)
            nextCount+=1
        if currentNode.right:
            nodes.append(currentNode.right)
            nextCount+=1
        if currentCount==0:
            #finished printing current level
            print '\n',
            currentCount, nextCount = nextCount, currentCount

The time complexity of this solution is O(N), which is the number of nodes in the tree, so it’s optimal. Because we should visit each node at least once. The space complexity depends on maximum size of the queue at any point, which is the most number of nodes at one level. The worst case occurs when the tree is a complete binary tree, which means each level is completely filled with maximum number of nodes possible. In this case, the most number of nodes appear at the last level, which is (N+1)/2 where N is the total number of nodes. So the space complexity is also O(N). Which is also optimal while using a queue.

This is one of the most common tree interview questions and everyone should know it off the top of their head.

]]>
/2011/12/05/programming-interview-questions-20-tree-level-order-print/feed/ 10