Programming Interview Questions 28: Longest Compound Word

This is my last post of the long programming interview questions series. I’m starting my first full-time job next week at Microsoft Bing Relevance team. Wish me luck..

Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: [‘cat’, ‘cats’, ‘catsdogcats’, ‘catxdogcatsrat’, ‘dog’, ‘dogcatsdog’, ‘hippopotamuses’, ‘rat’, ‘ratcat’, ‘ratcatdog’, ‘ratcatdogcat’]. Then the longest compound word is ‘ratcatdogcat’ with 12 letters. Note that the longest individual words are ‘catxdogcatsrat’ and ‘hippopotamuses’ with 14 letters, but they’re not fully constructed by other words. Former one has an extra ‘x’ letter, and latter is an individual word by itself not a compound word.

We will use the trie data structure, also known as a prefix tree. Tries are space and time efficient structures for text storage and search.  They let words to share prefixes, and prefixes are exactly what we’ll be dealing with in this question. There’s a sample trie implementation in a previous question find word positions in text. The one we’re going to use is slightly modified, with a different node definition and two additional Trie class functions:

class Node:
    def __init__(self, letter=None, isTerminal=False):
        self.letter=letter
        self.children={}
        self.isTerminal=isTerminal
def insert(self, word):
    current=self.root
    for letter in word:
        if letter not in current.children:
            current.children[letter]=Node(letter)
        current=current.children[letter]
    current.isTerminal=True
 
def getAllPrefixesOfWord(self, word):
    prefix=''
    prefixes=[]
    current=self.root
    for letter in word:
        if letter not in current.children:
            return prefixes
        current=current.children[letter]
        prefix+=letter
        if current.isTerminal:
            prefixes.append(prefix)
    return prefixes

The isTerminal field in a Node class means that the word constructed by letters from the root to the current node is a valid word. For example the word cat is formed by 3 nodes, one for each letter. And the last letter ‘t’ has isTerminal value True.

The insert function of the trie just adds the given word to the trie. The more important getAllPrefixesOfWord function takes a word as input, and returns all the valid prefixes of that word. Where valid means that prefix word appears in the given input list. The list being sorted helps us here. While scanning the list from the beginning, all the prefixes of the current word have already been seen and added to the trie. So finding all the prefixes of a word can be done easily by following a single path down from the root with nodes being the letters of the word, and returning the prefix words corresponding to nodes with true isTerminal value

.

The algorithm works as follows, we scan the input list from the beginning and insert each word into the trie. Right after inserting a word, we check whether it has any prefixes. If yes, then it’s a candidate for the longest compound word. We append a pair of current word and its suffix (word-prefix) into a queue. The reason is that the current word is a valid construct only if the suffix is also a valid word or a compound word. So we build the trie and queue while scanning the array.

Let’s illustrate the process with an example, the first word is cat and we add it to the trie. Since it doesn’t have a prefix, we continue. Second word is cats, we add it to the trie and check whether it has a prefix, and yes it does, the word cat. So we append the pair <‘cats’, ‘s’> to the queue, which is the current word and the suffix. The third word is catsdogcats, we again insert it to the trie and see that it has 2 prefixes, cat and cats. So we add 2 pairs <‘catsdogcats’, ‘sdogcats’> and <‘catsdogcats’, ‘dogcats’> where the former suffix correspond to the prefix cat and the latter to cats. We continue like this by adding <‘catxdogcatsrat’, ‘xdogcatsrat’> to the queue and so on. And here’s the trie formed by adding example the words in the problem definition:

After building the trie and the queue, then we start processing the queue by popping a pair from the beginning. As explained above, the pair contains the original word and the suffix we want to validate. We check whether the suffix is a valid or compound word. If it’s a valid word and the original word is the longest up to now, we store the result. Otherwise we discard the pair. The suffix may be a compound word itself, so we check if the it has any prefixes. If it does, then we apply the above procedure by adding the original word and the new suffix to the queue. If the suffix of the original popped pair is neither a valid word nor has a prefix, we simply discard that pair.

An example will clarify the procedure, let’s check the pair <‘catsdogcats’, ‘dogcats’> popped from the queue. Dogcats is not a valid word, so we check whether it has a prefix. Dog is a prefix, and cats is the new suffix. So we add the pair <‘catsdogcats’, ‘cats’> to the queue. Next time we pop this pair we’ll see that cats is a valid word and finish processing the word catsdogcats. As you can see, the suffix will get shorter and shorter at each iteration, and at some point the pair will either be discarded or stored as the longest word. And as the pairs are being discarded, the length of the queue will decrease and the algorithm will gradually come to an end. Here’s the complete algorithm:

def longestWord(words):
    #Add words to the trie, and pairs to the queue
    trie=Trie()
    queue=collections.deque()
    for word in words:
        prefixes=trie.getAllPrefixesOfWord(word)
        for prefix in prefixes:
            queue.append( (word, word[len(prefix):]) )
        trie.insert(word)
 
    #Process the queue
    longestWord=''
    maxLength=0
    while queue:
        word, suffix = queue.popleft()
        if suffix in trie and len(word)&gt;maxLength:
            longestWord=word
            maxLength=len(word)
        else:
            prefixes=trie.getAllPrefixesOfWord(suffix)
            for prefix in prefixes:
                queue.append( (word, suffix[len(prefix):]) )
 
    return longestWord

The complexity of this algorithm is O(kN) where N is the number of words in the input list, and k the maximum number of words in a compound word. The number k may vary from one list to another, but it’ll generally be a constant number like 5 or 10. So, the algorithm is linear in number of words in the list, which is an optimal solution to the problem.

And here’s the impementation of the trie for completeness:

class Trie:
    def __init__(self):
        self.root=Node('')
 
    def __repr__(self):
        self.output([self.root])
        return ''
 
    def output(self, currentPath, indent=''):
        #Depth First Search
        currentNode=currentPath[-1]
        if currentNode.isTerminal:
            word=''.join([node.letter for node in currentPath])
            print indent+word
            indent+='  '
        for letter, node in sorted(currentNode.children.items()):
            self.output(currentPath[:]+[node], indent)
 
    def insert(self, word):
        current=self.root
        for letter in word:
            if letter not in current.children:
                current.children[letter]=Node(letter)
            current=current.children[letter]
        current.isTerminal=True
 
    def __contains__(self, word):
        current=self.root
        for letter in word:
            if letter not in current.children:
                return False
            current=current.children[letter]
        return current.isTerminal
 
    def getAllPrefixesOfWord(self, word):
        prefix=''
        prefixes=[]
        current=self.root
        for letter in word:
            if letter not in current.children:
                return prefixes
            current=current.children[letter]
            prefix+=letter
            if current.isTerminal:
                prefixes.append(prefix)
        return prefixes

VN:F [1.9.22_1171]
Rating: 7.9/10 (43 votes cast)
Programming Interview Questions 28: Longest Compound Word, 7.9 out of 10 based on 43 ratings
This entry was posted in Programming Interview. Bookmark the permalink.
  • Ilker Acar

    Arden, siteni begeniyle takip ediyorum. 28 soruda mukemmel! Yeni isin hayirli olsun bu arada. Ben Building 34’teyim. Birseye ihtiyacin olursa pingleyebilirsin.

  • Elite

    Great question and explanation, thank you. Small typo in sentence: “Second word is cat” should be “cats”.

    • Arden

      Thanks for noticing, I updated the post.

  • Dejun Zhou

    Arden, I think you don’t have to build the queue while building the trie. You can first build the trie. then try to find the longest compound word back from the longest words to second longest words and so on.

  • anki

    All the best with your career,thoroughly enjoyed going through your solutions,would be great if you would find time to continue with more

  • http://skyline skyline

    I am thinking a recursive function that check a word is a compound word based on input list.

    bool is_compound(word w, list[] words) {
    if (words.contain(w)) return true;
    for (int i=0; i<w.len-1) {
    part1 = w.substring(:i);
    part2 = w.substring(i:);
    if (is_compound(part1, words) && is_compound(part2, words))
    return true;
    }

    return false;
    }

    then we just need to check whether every word is a compound word and get maxium length. I feel the recursive approach’s logic is more straightforward.

    • Ajay

      Could you please share complete code.

    • YShen

      In this case, ‘dog’ ‘cat’ will also be considered as compound word. Is there any way to only consider word that has two or more components?

  • PhoenixNirvana

    Thanks for all your interview articles, it really helps. Wish you good luck!

  • nisha

    Thanks for the wonderful posts on Interview preparation. You are Brilliant :)
    Hope you have a wonderful and bright career

  • khalkho

    awesmm posts..
    good luck for MS

  • sanjay

    /********** C++ implementation is given below************/

    struct Trie
    {
    char data;
    struct Trie *child[26];
    bool issuffix;
    char *str;
    };
    struct Node
    {
    int count;
    struct Trie *trie;
    };
    struct QueueData
    {
    char *str;
    char suffix[15];
    };
    struct Trie * NewNode(char data)
    {
    struct Trie *temp=(struct Trie *)malloc(sizeof(struct Trie));
    temp->data=data;
    for(int i=0;ichild[i]=NULL;
    temp->issuffix=false;
    temp->str=NULL;
    return temp;
    }
    void Insert(struct Node &node,char *str,queue &myqueue)
    {
    int len=strlen(str);
    struct Trie *trie;
    bool flag=false;
    node.count++;
    trie=node.trie;
    for(int i=0;ichild[str[i]-97]==NULL)
    {
    trie->child[str[i]-97]=NewNode(str[i]);
    trie=trie->child[str[i]-97];
    }
    else
    {
    trie=trie->child[str[i]-97];
    if(trie->issuffix)
    {
    struct QueueData data;
    data.str=(char *)malloc(sizeof(char)*strlen(str)+1);
    memset(data.str,”,sizeof(char)*strlen(str)+1);
    strcpy(data.str,str);
    //data.suffix=(char *)malloc(sizeof(char)*(i+1));
    memset(data.suffix,”,sizeof(data.suffix));
    strncpy(data.suffix,str+i+1,strlen(str)-i+1);
    myqueue.push(data);
    }
    }
    }
    trie->str=(char *)malloc(sizeof(char)*strlen(str)+1);
    strcpy(trie->str,str);
    trie->issuffix=true;
    }
    bool search(struct Node node,char *str)
    {
    struct Trie *trie=node.trie;
    int len=strlen(str);
    for(int i=0;ichild[str[i]-97]==NULL)
    return false;
    else
    trie=trie->child[str[i]-97];
    }
    if(trie->str)
    {
    if(strcmp(str,trie->str)==0)
    return true;
    }
    return false;
    }
    bool isHassuffix(struct Node node,char *origstr,char *str,queue &myqueue)
    {
    if(str==NULL)
    return false;
    struct Trie *trie=node.trie;
    bool flag=false;
    int len=strlen(str);
    for(int i=0;iissuffix)
    {
    struct QueueData temp;
    temp.str=origstr;
    //temp.suffix=(char *)malloc(sizeof(char)*(i+1));
    strncpy(temp.suffix,str+i,strlen(str)-i+1);
    myqueue.push(temp);
    flag=true;
    }
    else
    {
    if(trie->child[str[i]-97])
    trie=trie->child[str[i]-97];
    else
    return false;
    }
    }
    return flag;
    }
    void Initialize(struct Node &node)
    {
    node.count=0;
    node.trie=NewNode(”);
    }
    int main()
    {
    char keys[][20] = {“cat”, “cats”,”catsdogcats”,”catxdogcatsrat”,”dog”,”dogcatsdog”,”hippopotamuses”,”rat”,”ratcat”,”ratcatdog”,”ratcatdogcat”};
    struct Node node;
    int max=-999;
    char *ans_str;
    queue myqueue;
    Initialize(node);
    for(int i=0; i<8; i++)
    Insert(node,keys[i],myqueue);
    bool ret=search(node,"there");
    while(!myqueue.empty())
    {
    QueueData temp=myqueue.front();
    myqueue.pop();
    if(search(node,temp.suffix))
    {
    int l_len=strlen(temp.str);
    if(max<l_len)
    {
    ans_str=temp.str;
    max=l_len;
    }
    }
    else if(isHassuffix(node,temp.str,temp.suffix,myqueue))
    continue;
    }
    printf("Longest Compound word is %s",ans_str);
    getchar();
    return 0;
    }
    Note: There might be chances that code gets execute in null pointers.

  • İbrahim Aygül

    If we change the order of the input list, this code fails to find the longest compound. Put ratcatdogcat to the first index in the input. queue will not have any prefix value for this word to be a suffix later. so while processing the queue that word will not be considered.
    I beleive, input order should not be important and ratcatdogcat should be considered as a compound word even if its components are coming later. Am I right ?

    • William

      Note to anyone reading this: Ibrahim’s last optimization, “returning only the length of the longest valid prefix, instead of a list of lengths,” will not work.

      Just take the case ‘catxdogcatrat’. If you have both prefixes ‘catx’ and ‘cat’, this code will only check for ‘catx’ and it will think that catxdogcatrat is an invalid compound word. However, if ‘xdogcatrat’ ends up being a valid suffix, you should have checked for ‘cat’ as a prefix as well and found that ‘catxdogcatrat’ is a valid compound word.

  • Zhou Fangfei

    I think there is a bug… when you build prefix tree, the first scanned word doesn’t have any prefixes, but this word may be the correct answer. In your example, you skip ‘cat’ because it was the first word scanned and the prefix tree was empty at that moment, so you skipped it; what if the first scanned word was a long word? you may miss the answer

    • Dilip

      I think you are correct. If the first word is “catsdogxxxxxxxxx”, followed by “dog”, “xx”,”xxx”,”xxxxx”. The code will not return the correct result

      • su

        Nope, because catsdogxxxxxxxxxx is not a valid word, since we don’t have the word ‘cats’ to build ‘catsdogxxxxxxxxxx’. So the solution is still correct.

    • Yogo

      I think the list should be processed in order of string length to fix this, since a prefix will always be shorter than any words that contain it. This would add an n*logn sorting step though.

      • Yogo

        Oh, just realised, the problem states that the input list is sorted, his solution should work fine as is.

  • Sorrowfull Blinger

    Thanks for the detailed expanantion !! . One question instead of storing the (current word ,Suffix) pair would’nt it be better if we store (array_index_of_the_word,suffix) so that we can easily identify the word which is the match?

  • Nikhil

    Can anyone give c# implementation of this

  • Chris B

    The complexity of this solution is not O(kN), it is exponential. Consider the input [‘a’, ‘aa’, ‘aaa’, ‘aaaa’, …]. When walking the trie (a-a-a-a..), every a node will match as a prefix, so all proper prefixes of every node will be added to the queue. And then the same will happen to those prefixes, and the prefixes of those prefixes, and so on. I modified your code to return the maximum length of the queue as well as the longest word then iterated over inputs of repeating ‘a’:

    >>> for x in [longestWord([‘a’*i for i in range(j)]) for j in range(25)]: print x

    (1, ‘aa’)
    (3, ‘aaa’)
    (6, ‘aaaa’)
    (10, ‘aaaaa’)
    (20, ‘aaaaaa’)
    (38, ‘aaaaaaa’)
    (73, ‘aaaaaaaa’)
    (134, ‘aaaaaaaaa’)
    (260, ‘aaaaaaaaaa’)
    (484, ‘aaaaaaaaaaa’)
    (946, ‘aaaaaaaaaaaa’)
    (1780, ‘aaaaaaaaaaaaa’)
    (3496, ‘aaaaaaaaaaaaaa’)
    (6631, ‘aaaaaaaaaaaaaaa’)
    (13066, ‘aaaaaaaaaaaaaaaa’)
    (24935, ‘aaaaaaaaaaaaaaaaa’)
    (49245, ‘aaaaaaaaaaaaaaaaaa’)
    (94433, ‘aaaaaaaaaaaaaaaaaaa’)
    (186811, ‘aaaaaaaaaaaaaaaaaaaa’)
    (359633, ‘aaaaaaaaaaaaaaaaaaaaa’)
    (712349, ‘aaaaaaaaaaaaaaaaaaaaaa’)
    (1375791, ‘aaaaaaaaaaaaaaaaaaaaaaa’)

    The number on the left is the maximum queue length, as you can see it does not grow linearly.