How to efficiently compare large lists in Python?

2017-01-27

An answer to this question on Stack Overflow.

Question

I am trying to find 9 letter words, that when you split evenly into 3 parts, and jumble around, you get another nine letter word.

for i in nineWordList:
	for j in nineWordList:
		if (i[3:5] + i[0:2] + i[6:8]) == j:
			correctWords.append(i)
		elif (i[3:5] + i[6:8] + i[0:2]) == j:
			correctWords.append(i)
		elif (i[0:2] + i[6:8] + i[3:5]) == j:
			correctWords.append(i)
		elif (i[6:8] + i[0:2] + i[3:5]) == j:
			correctWords.append(i)
		elif (i[6:8] + i[3:5] + i[0:2]) == j:
			correctWords.append(i)

This is how I do it. The only problem is nineWordList is 68,000 elements long, and this takes ages. How can I improve this, to make it more efficient?

Answer

Put all your valid words in a Python set and then loop through the set rearranging the words in the manner you've described. For each rearrangement, check to see if it is in the set.

Since Python's set is based on a hash table the look-ups occur in O(1) (constant) time. For a constant number of rearrangements per word, your algorithm then works in O(n) time, which is far better than the O(n^2) algorithm you have now.

The revised code looks like this:

nineWordSet = set(nineWordList)
for i in nineWordSet:
  if i[3:5] + i[0:2] + i[6:8] in nineWordSet:
    correctWords.append(i)
  elif i[3:5] + i[6:8] + i[0:2] in nineWordSet:
    correctWords.append(i)
  elif i[0:2] + i[6:8] + i[3:5] in nineWordSet:
    correctWords.append(i)
  elif i[6:8] + i[0:2] + i[3:5] in nineWordSet:
    correctWords.append(i)
  elif i[6:8] + i[3:5] + i[0:2] in nineWordSet:
    correctWords.append(i)

Your previous code was slow because for each word you had to look at all the other words (technically, half of them, on average). That's about 2,312,000,000 words you have to look at; that's what's meant by O(n^2). In the new code for each word you only have to look in one well-defined place, so you only look at 68,000 words. That's the benefit of hash tables, which can often give you O(n) performance on a dataset.