The post About me.. appeared first on Arden Dertat.

]]>

I completed my Master’s in Computer Science at Brown University on May 2012. My specialization was Information Retrieval and Machine Learning. My thesis is a personal news aggregation website. The system enables users to receive fresh and relevant news about the topics they are particularly interested in. Performs personalized retrieval and ranking by analyzing user behavior. It gives the audience full control about the news they want to read, and filters the information based on their specific needs. Which I believe is a fundamental matter in today’s data-driven knowledge era. I worked with Prof Eli Upfal during my Master’s degree.

I was a visiting student at MIT Computer Science and Artificial Intelligence Laboratory – CSAIL for a semester on Fall 2011, while I was pursuing my degree at Brown. I completed my Bachelor’s in Computer Science at Bogazici University on May 2010. I was born and raised in Istanbul – Turkey, and this is my 5th year in the U.S. I currently live in San Francisco, previously I lived in Seattle for 2 years, and before that Rhode Island for 2 years.

I write blog posts about search engines where I explain how to implement one in detail with the code that I have written. I also share various programming interview questions with my solutions. I hope that these articles will be informative to anyone interested in these areas.

Feel free to contact me at: ardendertat AT gmail DOT com.

The post About me.. appeared first on Arden Dertat.

]]>The post Programming Interview Questions 28: Longest Compound Word appeared first on Arden Dertat.

]]>Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: [‘cat’, ‘cats’, ‘catsdogcats’, ‘catxdogcatsrat’, ‘dog’, ‘dogcatsdog’, ‘hippopotamuses’, ‘rat’, ‘ratcat’, ‘ratcatdog’, ‘ratcatdogcat’]. Then the longest compound word is ‘ratcatdogcat’ with 12 letters. Note that the longest individual words are ‘catxdogcatsrat’ and ‘hippopotamuses’ with 14 letters, but they’re not fully constructed by other words. Former one has an extra ‘x’ letter, and latter is an individual word by itself not a compound word.

We will use the trie data structure, also known as a prefix tree. Tries are space and time efficient structures for text storage and search. They let words to share prefixes, and prefixes are exactly what we’ll be dealing with in this question. There’s a sample trie implementation in a previous question find word positions in text. The one we’re going to use is slightly modified, with a different node definition and two additional Trie class functions:

class Node: def __init__(self, letter=None, isTerminal=False): self.letter=letter self.children={} self.isTerminal=isTerminal |

def insert(self, word): current=self.root for letter in word: if letter not in current.children: current.children[letter]=Node(letter) current=current.children[letter] current.isTerminal=True def getAllPrefixesOfWord(self, word): prefix='' prefixes=[] current=self.root for letter in word: if letter not in current.children: return prefixes current=current.children[letter] prefix+=letter if current.isTerminal: prefixes.append(prefix) return prefixes |

The isTerminal field in a Node class means that the word constructed by letters from the root to the current node is a valid word. For example the word cat is formed by 3 nodes, one for each letter. And the last letter ‘t’ has isTerminal value True.

The insert function of the trie just adds the given word to the trie. The more important getAllPrefixesOfWord function takes a word as input, and returns all the valid prefixes of that word. Where valid means that prefix word appears in the given input list. The list being sorted helps us here. While scanning the list from the beginning, all the prefixes of the current word have already been seen and added to the trie. So finding all the prefixes of a word can be done easily by following a single path down from the root with nodes being the letters of the word, and returning the prefix words corresponding to nodes with true isTerminal value

.

The algorithm works as follows, we scan the input list from the beginning and insert each word into the trie. Right after inserting a word, we check whether it has any prefixes. If yes, then it’s a candidate for the longest compound word. We append a pair of current word and its suffix (word-prefix) into a queue. The reason is that the current word is a valid construct only if the suffix is also a valid word or a compound word. So we build the trie and queue while scanning the array.

Let’s illustrate the process with an example, the first word is cat and we add it to the trie. Since it doesn’t have a prefix, we continue. Second word is cats, we add it to the trie and check whether it has a prefix, and yes it does, the word cat. So we append the pair <‘cats’, ‘s’> to the queue, which is the current word and the suffix. The third word is catsdogcats, we again insert it to the trie and see that it has 2 prefixes, cat and cats. So we add 2 pairs <‘catsdogcats’, ‘sdogcats’> and <‘catsdogcats’, ‘dogcats’> where the former suffix correspond to the prefix cat and the latter to cats. We continue like this by adding <‘catxdogcatsrat’, ‘xdogcatsrat’> to the queue and so on. And here’s the trie formed by adding example the words in the problem definition:

After building the trie and the queue, then we start processing the queue by popping a pair from the beginning. As explained above, the pair contains the original word and the suffix we want to validate. We check whether the suffix is a valid or compound word. If it’s a valid word and the original word is the longest up to now, we store the result. Otherwise we discard the pair. The suffix may be a compound word itself, so we check if the it has any prefixes. If it does, then we apply the above procedure by adding the original word and the new suffix to the queue. If the suffix of the original popped pair is neither a valid word nor has a prefix, we simply discard that pair.

An example will clarify the procedure, let’s check the pair <‘catsdogcats’, ‘dogcats’> popped from the queue. Dogcats is not a valid word, so we check whether it has a prefix. Dog is a prefix, and cats is the new suffix. So we add the pair <‘catsdogcats’, ‘cats’> to the queue. Next time we pop this pair we’ll see that cats is a valid word and finish processing the word catsdogcats. As you can see, the suffix will get shorter and shorter at each iteration, and at some point the pair will either be discarded or stored as the longest word. And as the pairs are being discarded, the length of the queue will decrease and the algorithm will gradually come to an end. Here’s the complete algorithm:

def longestWord(words): #Add words to the trie, and pairs to the queue trie=Trie() queue=collections.deque() for word in words: prefixes=trie.getAllPrefixesOfWord(word) for prefix in prefixes: queue.append( (word, word[len(prefix):]) ) trie.insert(word) #Process the queue longestWord='' maxLength=0 while queue: word, suffix = queue.popleft() if suffix in trie and len(word)>maxLength: longestWord=word maxLength=len(word) else: prefixes=trie.getAllPrefixesOfWord(suffix) for prefix in prefixes: queue.append( (word, suffix[len(prefix):]) ) return longestWord |

The complexity of this algorithm is O(kN) where N is the number of words in the input list, and k the maximum number of words in a compound word. The number k may vary from one list to another, but it’ll generally be a constant number like 5 or 10. So, the algorithm is linear in number of words in the list, which is an optimal solution to the problem.

And here’s the impementation of the trie for completeness:

class Trie: def __init__(self): self.root=Node('') def __repr__(self): self.output([self.root]) return '' def output(self, currentPath, indent=''): #Depth First Search currentNode=currentPath[-1] if currentNode.isTerminal: word=''.join([node.letter for node in currentPath]) print indent+word indent+=' ' for letter, node in sorted(currentNode.children.items()): self.output(currentPath[:]+[node], indent) def insert(self, word): current=self.root for letter in word: if letter not in current.children: current.children[letter]=Node(letter) current=current.children[letter] current.isTerminal=True def __contains__(self, word): current=self.root for letter in word: if letter not in current.children: return False current=current.children[letter] return current.isTerminal def getAllPrefixesOfWord(self, word): prefix='' prefixes=[] current=self.root for letter in word: if letter not in current.children: return prefixes current=current.children[letter] prefix+=letter if current.isTerminal: prefixes.append(prefix) return prefixes |

The post Programming Interview Questions 28: Longest Compound Word appeared first on Arden Dertat.

]]>The post Programming Interview Questions 27: Squareroot of a Number appeared first on Arden Dertat.

]]>

The squareroot of a (non-negative) number N always lies between 0 and N/2. The straightforward way to solve this problem would be to check every number k between 0 and N/2, until the square of k becomes greater than or rqual to N. If k^2 becomes equal to N, then we return k. Otherwise, we return k-1 because we’re rounding down. Here’s the code:

def sqrt1(num): if num<0: raise ValueError if num==1: return 1 for k in range(1+(num/2)): if k**2==num: return k elif k**2>num: return k-1 return k |

The complexity of this approach is O(N), because we have to check N/2 numbers in the worst case. This linear algorithm is pretty inefficient, we can use some sort of binary search to speed it up. We know that the result is between 0 and N/2, so we can first try N/4 to see whether its square is less than, greater than, or equal to N. If it’s equal then we simply return that value. If it’s less, then we continue our search between N/4 and N/2. Otherwise if it’s greater, then we search between 0 and N/4. In both cases we reduce the potential range by half and continue, this is the logic of binary search. We’re not performing regular binary search though, it’s modified. We want to ensure that we stop at a number k, where k^2<=N but (k+1)^2>N. The code will clarify any doubts:

def sqrt2(num): if num<0: raise ValueError if num==1: return 1 low=0 high=1+(num/2) while low+1<high: mid=low+(high-low)/2 square=mid**2 if square==num: return mid elif square<num: low=mid else: high=mid return low |

One difference from regular binary search is the condition of the while loop, it’s low+1<high instead of low<high. Also we have low=mid instead of low=mid+1, and high=mid instead of high=mid-1. These are the modifications we make to standard binary search. The complexity is still the same though, it’s logarithmic O(logN). Which is much better than the naive linear solution.

There’s also a constant time O(1) solution which involves a clever math trick. Here it is:

This solution exploits the property that if we take the exponent of the log of a number, the result doesn’t change, it’s still the number itself. So we can first calculate the log of a number, multiply with 0.5, take the exponent, and finally take the floor of that value to round it down. This way we can avoid using the sqrt function by using the log function. Logarithm of a number rounded down to the nearest integer can be calculated in constant time, by looking at the position of the leftmost 1 in the binary representation of the number. For example, the number 6 in binary is 110, and the leftmost 1 is at position 2 (starting from right counting from 0). So the logarithm of 6 rounded down is 2. This solution doesn’t always give the same result as above algorithms though, because of the rounding effects. And depending on the interviewer’s perspective this approach can be regarded as either very elegant and clever, or tricky and invalid. But in any case it definitely worths mentioning, ultimately it proves that you’re aware of math shortcuts. Which is always a desired talent in every job candidate.

The post Programming Interview Questions 27: Squareroot of a Number appeared first on Arden Dertat.

]]>The post Programming Interview Questions 26: Trim Binary Search Tree appeared first on Arden Dertat.

]]>and we’re given min value as 5 and max value as 13, then the resulting binary search tree should be:

We should remove all the nodes whose value is not between min and max. We can do this by performing a post-order traversal of the tree. We first process the left children, then right children, and finally the node itself. So we form the new tree bottom up, starting from the leaves towards the root. As a result while processing the node itself, both its left and right subtrees are valid trimmed binary search trees (may be NULL as well).

At each node we’ll return a reference based on its value, which will then be assigned to its parent’s left or right child pointer, depending on whether the current node is left or right child of the parent. If current node’s value is between min and max (min<=node<=max) then there’s no action need to be taken, so we return the reference to the node itself. If current node’s value is less than min, then we return the reference to its right subtree, and discard the left subtree. Because if a node’s value is less than min, then its left children are definitely less than min since this is a binary search tree. But its right children may or may not be less than min we can’t be sure, so we return the reference to it. Since we’re performing bottom-up post-order traversal, its right subtree is already a trimmed valid binary search tree (possibly NULL), and left subtree is definitely NULL because those nodes were surely less than min and they were eliminated during the post-order traversal. Remember that in post-order traversal we first process all the children of a node, and then finally the node itself.

Similar situation occurs when node’s value is greater than max, we now return the reference to its left subtree. Because if a node’s value is greater than max, then its right children are definitely greater than max. But its left children may or may not be greater than max. So we discard the right subtree and return the reference to the already valid left subtree. The code is easier to understand:

def trimBST(tree, minVal, maxVal): if not tree: return tree.left=trimBST(tree.left, minVal, maxVal) tree.right=trimBST(tree.right, minVal, maxVal) if minVal<=tree.val<=maxVal: return tree if tree.val<minVal: return tree.right if tree.val>maxVal: return tree.left |

The complexity of this algorithm is O(N), where N is the number of nodes in the tree. Because we basically perform a post-order traversal of the tree, visiting each and every node one. This is optimal because we should visit every node at least once. This is a very elegant question that demonstrates the effectiveness of recursion in trees.

The post Programming Interview Questions 26: Trim Binary Search Tree appeared first on Arden Dertat.

]]>The post Implementing Search Engines appeared first on Arden Dertat.

]]>

1. Create Index

Building the index of our search engine.

2. Query Index

Answering search queries on the index that we built.

3. Ranking

Ranking the search results.

The post Implementing Search Engines appeared first on Arden Dertat.

]]>The post Programming Interview Questions appeared first on Arden Dertat.

]]>

1. Array Pair Sum

Given an integer array, output all pairs that sum up to a specific value k.

2. Matrix Region Sum

Given a matrix of integers and coordinates of a rectangular region within the matrix, find the sum of numbers falling inside the rectangle. Our program will be called multiple times with different rectangular regions from the same matrix.

3. Largest Continuous Sum

Given an array of integers (positive and negative) find the largest continuous sum.

4. Find Missing Element

There is an array of non-negative integers. A second array is formed by shuffling the elements of the first array and deleting a random element. Given these two arrays, find which element is missing in the second array.

5. Linked List Remove Nodes

Given a linkedlist of integers and an integer value, delete every node of the linkedlist containing that value.

6. Combine Two Strings

We are given 3 strings: str1, str2, and str3. Str3 is said to be a shuffle of str1 and str2 if it can be formed by interleaving the characters of str1 and str2 in a way that maintains the left to right ordering of the characters from each string. For example, given str1=”abc” and str2=”def”, str3=”dabecf” is a valid shuffle since it preserves the character ordering of the two strings. So, given these 3 strings write a function that detects whether str3 is a valid shuffle of str1 and str2.

7. Binary Search Tree Check

Given a binary tree, check whether it’s a binary search tree or not.

8. Transform Word

Given a source word, target word and an English dictionary, transform the source word to target by changing/adding/removing 1 character at a time, while all intermediate words being valid English words. Return the transformation chain which has the smallest number of intermediate words.

9. Convert Array

Given an array [a1, a2, …, aN, b1, b2, …, bN, c1, c2, …, cN] convert it to [a1, b1, c1, a2, b2, c2, …, aN, bN, cN] in-place using constant extra space

10. Kth Largest Element in Array

Given an array of integers find the kth element in the sorted order (not the kth distinct element). So, if the array is [3, 1, 2, 1, 4] and k is 3 then the result is 2, because it’s the 3rd element in sorted order (but the 3rd distinct element is 3).

11. All Permutations of String

Generate all permutations of a given string.

12. Reverse Words in a String

Given an input string, reverse all the words. To clarify, input: “Interviews are awesome!” output: “awesome! are Interviews”. Consider all consecutive non-whitespace characters as individual words. If there are multiple spaces between words reduce them to a single white space. Also remove all leading and trailing whitespaces. So, the output for ” CS degree”, “CS degree”, “CS degree “, or ” CS degree ” are all the same: “degree CS”.

13. Median of Integer Stream

Given a stream of unsorted integers, find the median element in sorted order at any given time. So, we will be receiving a continuous stream of numbers in some random order and we don’t know the stream length in advance. Write a function that finds the median of the already received numbers efficiently at any time. We will be asked to find the median multiple times. Just to recall, median is the middle element in an odd length sorted array, and in the even case it’s the average of the middle elements.

14. Check Balanced Parentheses

Given a string of opening and closing parentheses, check whether it’s balanced. We have 3 types of parentheses: round brackets: (), square brackets: [], and curly brackets: {}. Assume that the string doesn’t contain any other character than these, no spaces words or numbers. Just to remind, balanced parentheses require every opening parenthesis to be closed in the reverse order opened. For example ‘([])’ is balanced but ‘([)]‘ is not.

15. First Non Repeated Character in String

Find the first non-repeated (unique) character in a given string.

16. Anagram Strings

Given two strings, check if they’re anagrams or not. Two strings are anagrams if they are written using the same exact letters, ignoring space, punctuation and capitalization. Each letter should have the same count in both strings. For example, ‘Eleven plus two’ and ‘Twelve plus one’ are meaningful anagrams of each other.

17. Search Unknown Length Array

Given a sorted array of unknown length and a number to search for, return the index of the number in the array. Accessing an element out of bounds throws exception. If the number occurs multiple times, return the index of any occurrence. If it isn’t present, return -1.

18. Find Even Occurring Element

Given an integer array, one element occurs even number of times and all others have odd occurrences. Find the element with even occurrences.

19. Find Next Palindrome Number

Given a number, find the next smallest palindrome larger than the number. For example if the number is 125, next smallest palindrome is 131.

20. Tree Level Order Print

Given a binary tree of integers, print it in level order. The output will contain space between the numbers in the same level, and new line between different levels.

21. Tree Reverse Level Order Print

This is very similar to the previous post level order print. We again print the tree in level order, but now starting from bottom level to the root.

22. Find Odd Occurring Element

Given an integer array, one element occurs odd number of times and all others have even occurrences. Find the element with odd occurrences.

23. Find Word Positions in Text

Given a text file and a word, find the positions that the word occurs in the file. We’ll be asked to find the positions of many words in the same file.

24. Find Next Higher Number With Same Digits

Given a number, find the next higher number using only the digits in the given number. For example if the given number is 1234, next higher number with same digits is 1243.

25. Remove Duplicate Characters in String

Remove duplicate characters in a given string keeping only the first occurrences. For example, if the input is ‘tree traversal’ the output will be ‘tre avsl’.

26. Trim Binary Search Tree

Given the root of a binary search tree and 2 numbers min and max, trim the tree such that all the numbers in the new tree are between min and max (inclusive). The resulting tree should still be a valid binary search tree.

27. Squareroot of a Number

Find the squareroot of a given number rounded down to the nearest integer, without using the sqrt function. For example, squareroot of a number between [9, 15] should return 3, and [16, 24] should be 4.

28. Longest Compound Word

Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: [‘cat’, ‘cats’, ‘catsdogcats’, ‘catxdogcatsrat’, ‘dog’, ‘dogcatsdog’, ‘hippopotamuses’, ‘rat’, ‘ratcat’, ‘ratcatdog’, ‘ratcatdogcat’]. Then the longest compound word is ‘ratcatdogcat’ with 12 letters. Note that the longest individual words are ‘catxdogcatsrat’ and ‘hippopotamuses’ with 14 letters, but they’re not fully constructed by other words. Former one has an extra ‘x’ letter, and latter is an individual word by itself not a compound word.

The post Programming Interview Questions appeared first on Arden Dertat.

]]>The post Programming Interview Questions 25: Remove Duplicate Characters in String appeared first on Arden Dertat.

]]>

We need a data structure to keep track of the characters we have seen so far, which can perform efficient find operation. If the input is guaranteed to be in standard ASCII form, we can just create a boolean array of size 128 and perform lookups by accessing the index of the character’s ASCII value in constant time. But if the string is Unicode then we would need a much larger array of size more than 100K, which will be a waste since most of it would generally be unused.

Set data structure perfectly suits our purpose. It stores keys and provides constant time search for key existence. So, we’ll loop over the characters of the string, and at each iteration we’ll check whether we have seen the current character before by searching the set. If it’s in the set then it means we’ve seen it before, so we ignore it. Otherwise, we include it in the result and add it to the set to keep track for future reference. The code is easier to understand:

def removeDuplicates(string): result=[] seen=set() for char in string: if char not in seen: seen.add(char) result.append(char) return ''.join(result) |

The time complexity of the algorithm is O(N) where N is the number of characters in the input string, because set supports O(1) insert and find. This is an optimal solution to one of the most common string interview questions.

Note: Complexity of insert and find for set is language and implementation dependent. Average case is mostly constant, but worst case can be logarithmic or even linear. But in an interview setting I think it’s safe to assume constant time insert and find for set.

The post Programming Interview Questions 25: Remove Duplicate Characters in String appeared first on Arden Dertat.

]]>The post Programming Interview Questions 24: Find Next Higher Number With Same Digits appeared first on Arden Dertat.

]]>

The naive approach is to generate the numbers with all digit permutations and sort them. Then find the given number in the sorted sequence and return the next number in sorted order as a result. The complexity of this approach is pretty high though, because of the permutation step involved. A given number N has logN+1 digits, so there are O(logN!) permutations. After generating the permutations, sorting them will require O(logN!loglogN!) operations. We can simplify this further, remember that O(logN!) is equivalent to O(NlogN). And O(loglogN!) is O(logN). So, the complexity is O(N(logN)^2).

Let’s visualize a better solution using an example, the given number is 12543 and the resulting next higher number should be 13245. We scan the digits of the given number starting from the tenths digit (which is 4 in our case) going towards left. At each iteration we check the right digit of the current digit we’re at, and if the value of right is greater than current we stop, otherwise we continue to left. So we start with current digit 4, right digit is 3, and 4>=3 so we continue. Now current digit is 5, right digit is 4, and 5>= 4, continue. Now current is 2, right is 5, but it’s not 2>=5, so we stop. The digit 2 is our pivot digit. From the digits to the right of 2, we find the smallest digit higher than 2, which is 3. This part is important, we should find the smallest higher digit for the resulting number to be precisely the next higher than original number. We swap this digit and the pivot digit, so the number becomes 13542. Pivot digit is now 3. We sort all the digits to the right of the pivot digit in increasing order, resulting in 13245. This is it, here’s the code:

def nextHigher(num): strNum=str(num) length=len(strNum) for i in range(length-2, -1, -1): current=strNum[i] right=strNum[i+1] if current<right: temp=sorted(strNum[i:]) next=temp[temp.index(current)+1] temp.remove(next) temp=''.join(temp) return int(strNum[:i]+next+temp) return num |

Note that if the digits of the given number is monotonically increasing from right to left, like 43221 then we won’t perform any operations, which is what we want because this is the highest number obtainable from these digits. There’s no higher number, so we return the given number itself. The same case occurs when the number has only a single digit, like 7. We can’t form a different number since there’s only a single digit.

The complexity of this algorithm also depends on the number of digits, and the sorting part dominates. A given number N has logN+1 digits and in the worst case we’ll have to sort logN digits. Which happens when all digits are increasing from right to left except the leftmost digit, for example 1987. For sorting we don’t have to use comparison based algorithms such as quicksort, mergesort, or heapsort which are O(KlogK), where K is the number of elements to sort. Since we know that digits are always between 0 and 9, we can use counting sort, radix sort, or bucket sort which can work in linear time O(K). So the overall complexity of sorting logN digits will stay linear resulting in overall complexity O(logN). Which is optimal since we have to check each digit at least once.

The post Programming Interview Questions 24: Find Next Higher Number With Same Digits appeared first on Arden Dertat.

]]>The post Programming Interview Questions 23: Find Word Positions in Text appeared first on Arden Dertat.

]]>

Since we’ll have to answer multiple queries, precomputation would be useful. We’ll build a data structure that stores the positions of all the words in the text file. This is known as inverted index in Information Retrieval. It’s basically a mapping between the words in the file and their positions. You can read more about it in my how to implement a search engine post, where I describe how to actually implement a working search engine with real code.

We can use hashtable as our inverted index. The key is a word in the file, and the value is the array of positions that word occurs. So, we’ll get all the words in the file and populate the hashtable with their positions. The words may contain capital letters, but we don’t want separate entries for apple, and Apple in our inverted index. So, we’ll write our own parsing function which converts all the words to lowercase, eliminates non-alphanumeric characters, and then splits on whitespace to get all the actual words in the file. We can do it easily using regular expressions:

def getWords(text): return re.sub(r'[^a-z0-9]',' ',text.lower()).split() |

The function first converts all letters to lowercase, then replaces non-alphanumeric characters with space, and splits on whitespace. The result is all the words in their original order. Then we build the inverted index by looping over the words in the file we just got, and append the position of each word to its list of positions in the hashtable:

def createIndex1(text): index=collections.defaultdict(list) words=getWords(text) for pos, word in enumerate(words): index[word].append(pos) return index |

Once we build the index we can answer any query. Given a word, just return its position array if it exists in the hashtable index. Otherwise, if the word isn’t present in the index, it means that the word doesn’t appear in the file, so return an empty list. Here’s the code:

def queryIndex1(index, word): if word in index: return index[word] else: return [] |

This is it actually! We have built a search engine for a single file, if we generalize it to work on multiple files (web pages) we’ll have a basic search engine. The complexity of create index is O(N), linear in the number of words in the text. The complexity of query index is constant O(1), because it’s just a simple lookup operation in the hashtable index. So, both creating the index and querying for word positions is optimal. The space complexity is also O(N), linear in the number of words. Because we have each word as a key in the hashtable.

**Better Solution**

Using hashtable as the inverted index is pretty efficient. But we can do better by using a more space efficient data structure for text, namely a trie, also known as prefix tree. Trie is a tree which is very efficient in text storage and search. It saves space by letting the words that have the same prefixes to share data. And most languages contain many words that share the same prefix.

Let’s demonstrate this with an example using the words subtree, subset, and submit. In a hashtable index, each of these words will be stored separately as individual keys. So, it will take 7+6+6=19 characters of space in the index. But all these words share the same prefix ‘sub’. We can take advantage of this fact by using a trie and letting them share that prefix in the index. In a trie index these words will take 7+3+3=13 characters of space, cutting the size of the index around 30%. We can have even more gain with the words that share longer prefixes. For example author, authority, and authorize. Hashtable index uses 6+9+9=24 characters but trie uses only 6+3+2=11, leading to 55% compression. Considering the extremely big web indexes of search engines, reducing the index size even by 10% is a big gain, which results in saving terabytes of space and reduces the number of machines by 1/10. Which is huge given that thousands of machines are used at Google, Microsoft, and Yahoo.

To summarize, it’s very useful to use a trie while performing storage and retrieval on text data. There are existing implementations of tries, but they’re more complicated than we actually need. So let’s implement our own simple trie, it’ll be more fun and informative.

A trie is simply a tree where each node stores a character as key, and the value in our case will be the occurrence positions of the word associated with the node. The word associated with a node is concatenation of the characters from root of the tree to the node. Every node in the tree has a corresponding word, but not all of them are valid English words. Most of them are intermediate words. Only the words that occur in the given text will contain position data. We call these nodes terminal nodes. Maximum number of children of a node is the number of different characters that appear in the text, which is 36 if we only consider lowercase alphanumeric characters. Here is the structure of a node:

class Node: def __init__(self, letter): self.letter=letter self.isTerminal=False self.children={} self.positions=[] |

The trie data structure is composed of these nodes, and it’ll include the following 3 functions: getItem, contains, and output. The getItem function both returns the occurrence positions of a given word and it’s used for insertion, contains function checks whether a word is in the trie or not, and output prints the trie in a nice formatted manner. We insert a new occurrence position to a word by first getting its existing position list using the getItem function, and then appending the new position to the end of the list. Trying to access an element that’s not in the trie automatically creates that element, similar to collections.defaultdict. Code of getItem and contains functions are very similar. Both navigate through the tree until they reach the node that corresponds to the given word. Output prints the words in the trie in sorted order and indented in a way that prefix relations can be visually identified. Here’s the complete implementation:

class Trie: def __init__(self): self.root=Node('') def __contains__(self, word): current=self.root for letter in word: if letter not in current.children: return False current=current.children[letter] return current.isTerminal def __getitem__(self, word): current=self.root for letter in word: if letter not in current.children: current.children[letter]=Node(letter) current=current.children[letter] current.isTerminal=True return current.positions def __str__(self): self.output([self.root]) return '' def output(self, currentPath, indent=''): #Depth First Search currentNode=currentPath[-1] if currentNode.isTerminal: word=''.join([node.letter for node in currentPath]) print indent+word+' '+str(currentNode.positions) indent+=' ' for letter, node in sorted(currentNode.children.items()): self.output(currentPath[:]+[node], indent) |

New createIndex and queryIndex functions are very similar to hashtable ones:

def createIndex2(text): trie=Trie() words=getWords(text) for pos, word in enumerate(words): trie[word].append(pos) return trie def queryIndex2(index, word): if word in index: return index[word] else: return [] |

Let’s see an example trie constructed from words that share prefixes, here’s an artificial example text for demo: ‘us use uses used user users using useful username user utah’. The output function prints the following trie:

Indentation is based on shared word prefixes. For example used, useful, and user all share the prefix word use, so they’re indented. Also username and users share the prefix word user. But us and utah are at the same indentation level because the prefix they share (character u) is not a proper word. The values of words are their occurrence positions in the text.

The complexity of adding or finding an element in a trie is also constant O(1), like hashtable. It doesn’t depend on the number of words stored in the trie, it only depends on the number of characters in the word, just like hashing. And maximum number of characters in a word for a particular language is constant. Average number of characters in an English word is around 4-5, and most of them are shorter than 15. Which means it’ll generally take around 4-5 operations, and mostly less than 15, to find a word in the trie independent of the number of words stored. It can be millions or even billions, the complexity stays the same. Which is perfect, because it’s not linear on the number of words, instead it’s just a constant number.

This is personally one of my favorite questions because we actually implement a basic search engine that operates on a single file, and can easily be extended to search multiple files. If you would like to see how to build an actual search engine with working code, I recommend my previous posts: create index, query index, and ranking.

The post Programming Interview Questions 23: Find Word Positions in Text appeared first on Arden Dertat.

]]>The post Programming Interview Questions 22: Find Odd Occurring Element appeared first on Arden Dertat.

]]>

This question is very similar to the previous find even occurring element problem. And we can actually use the same solutions. One approach is again to build a hashtable of element occurrence counts and return the element with odd count. Both time and space complexity of this solution is O(N).

But we can do much better by using the XOR trick described in that post. It’s the following: if we XOR a number with itself odd number of times the result is 0, otherwise if we XOR even number of times the result is the number itself. So, if we XOR all the elements in the array, the result is the odd occurring element itself. Because all even occurring elements will be XORed with themselves odd number of times, producing 0. And the only odd occurring element will be XORed with itself even number of times, producing its own value.

Let’s say we’re given the following array: [1, 2, 3, 1, 2, 3, 1]. If we XOR all the elements in this array the result is 1^2^3^1^2^3^1 = 1. Because the numbers 2 and 3 will be XORed with themselves 1 time, producing 0. And the number 1 will be XORed with itself 2 times, resulting in its own value. So, the overall result of the XOR operations is the number 1, odd occurring element in the array. Here’s the code:

def getOdd(arr): return reduce(lambda x, y: x^y, arr) |

Simple as that! The time complexity of this solution is still O(N), but now the space complexity is constant O(1). Because we’re just using constant extra memory, not proportional to the size of the input array. This is the most optimal solution to the problem, since we’re accessing every element only once and using constant extra space.

This is a great question because it demonstrates the power and effectiveness of bit manipulation operators.

The post Programming Interview Questions 22: Find Odd Occurring Element appeared first on Arden Dertat.

]]>