The n-gram that is the most frequent one among all the words

An answer to this question on Stack Overflow.

Question

I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:

 write a function that finds the n-gram that is the most frequent one among all the words
 print the result to the standard output (stdout)
 if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

 text
	 which is a string containing words separated by whitespaces
 ngramLength
	 which is an integer value giving the length of the n-gram

Data constraints

 the length of the text string will not exceed 250,000 characters
 all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

 your function is expected to print the result in less than 2 seconds

Example

Input

 text: "aaaab a0a baaab c"
 ngramLength: 3

Output: aaa

Explanation

For the input presented above the 3-grams sorted by frequency are:

 "aaa" with a frequency of 3
 "aab" with a frequency of 2
 "a0a" with a frequency of 1
 "baa" with a frequency of 1

If I have only one hour to solve the problem and I choose to use the C language: is it a good idea to implement a hash table to count the frequency of the n-grams in that amount of time? The C standard library has no hash table implementation...

If so, I was thinking of implementing a hash table using separate chaining with ordered linked lists, since ordered lists reduce the search time within each bucket. Would that implementation fit in the time available?

Is that the fastest option possible?

Thank you!!!

Answer

You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You're correct in thinking that a hash table is a good solution to the problem.

However, since you have limited time to code a solution, I'd suggest using open addressing instead of linked lists. The implementation is simpler: on a collision you just probe the next slot in the table until you find the matching n-gram or an empty slot.

Also, be sure to allocate enough memory for your hash table: about twice the expected number of n-grams keeps the load factor low. Since the number of n-grams cannot exceed the text length of 250,000, a table of 500,000 slots is more than sufficient.

In terms of coding speed, the small input length (250,000 characters) also makes sorting and counting a feasible option. The quickest way is probably to generate an array of pointers to each n-gram, sort the array with an appropriate comparator, and then walk along it keeping track of which n-gram appeared the most.