find the longest word made of other words

2015-10-07

An answer to this question on Stack Overflow.

Question

I am working on a problem, which is to write a program to find the longest word made of other words in a list of words.

EXAMPLE
Input: test, tester, testertest, testing, testingtester
Output: testingtester

I searched and find the following solution, my question is I am confused in step 2, why we should break each word in all possible ways? Why not use each word directly as a whole? If anyone could give some insights, it will be great.

The solution below does the following:

Sort the array by size, putting the longest word at the front
For each word, split it in all possible ways. That is, for “test”, split it into {“t”, “est”}, {“te”, “st”} and {“tes”, “t”}.
Then, for each pairing, check if the first half and the second both exist elsewhere in the array.
“Short circuit” by returning the first string we find that fits condition #3.

Answer

Answering your question indirectly, I believe the following is an efficient way to solve this problem using tries.

Build a trie from all of the words in your string.

Sort the words so that the longest word comes first.

Now, for each word W, start at the top of the trie and begin following the word down the tree one letter at a time using letters from the word you are testing.

Each time a word ends, recursively re-enter the trie from the top making a note that you have "branched". If you run out of letters at the end of the word and have branched, you've found a compound word and, because the words were sorted, this is the longest compound word.

If the letters stop matching at any point, or you run out and are not at the end of the word, just back track to wherever it was that you branched and keep plugging along.

I'm afraid I don't know Java that well, so I'm unable to provide you sample code in that language. I have, however, written out a solution in Python (using a trie implementation from this answer). Hopefully it is clear to you:

#!/usr/bin/env python3
#End of word symbol
_end = '_end_'
#Make a trie out of nested HashMap, UnorderedMap, dict structures
def MakeTrie(words):
  root = dict()
  for word in words:
    current_dict = root
    for letter in word:
      current_dict = current_dict.setdefault(letter, {})
    current_dict[_end] = _end
  return root
def LongestCompoundWord(original_trie, trie, word, level=0):
  first_letter = word[0]
  if not first_letter in trie:
    return False
  if len(word)==1 and _end in trie[first_letter]:
    return level>0
  if _end in trie[first_letter] and LongestCompoundWord(original_trie, original_trie, word[1:], level+1):
    return True
  return LongestCompoundWord(original_trie, trie[first_letter], word[1:], level)
#Words that were in your question
words = ['test','testing','tester','teste', 'testingtester', 'testingtestm', 'testtest','testingtest']
trie = MakeTrie(words)
#Sort words in order of decreasing length
words = sorted(words, key=lambda x: len(x), reverse=True)
for word in words:
  if LongestCompoundWord(trie,trie,word):
    print("Longest compound word was '{0:}'".format(word))
    break

With the above in mind, the answer to your original question becomes clearer: we do not know ahead of time which combination of prefix words will take us successfully through the tree. Therefore, we need to be prepared to check all possible combinations of prefix words.

Since the algorithm you found does not have an efficient way of knowing what subsets of a word are prefixes, it splits the word at all possible points in word to ensure that all prefixes are generated.