Improve performance: Remove all strings in a (big) list appearing only once
An answer to this question on Stack Overflow.
Question
I have a big list (25000 items, 14000 words) like this:
before (see the edit below for the actual list format):
texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', ... 'hello Lorem ipsum']
I'd like to remove all words appearing only once in the whole list.
after:
texts = ['Lorem generator ipsum', 'Lorem ipsum generator machine', ..., 'Machine Lorem ipsum']
I'm doing this already now, but it's really slow (about 2 hours).
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
How can I improve the performance?
EDIT:
@DSM was right, my input list actually looks like this. My fault, sorry:
texts = [['Lorem', 'hello', 'ipsum'], ['Lorem', 'ipsum', 'generator', 'machine'], ... ['hello', 'Lorem', 'ipsum']]
Answer
Both your code and collections.Counter have a lot of hidden machinery: I recommend settling on a good algorithm first and only then thinking about the Python specifics.
Let's take a look at your code:
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
Here we build a set out of all_tokens. Python's set is implemented as a hash table, so building it takes O(N) expected time; if a set were backed by a balanced binary tree instead, building it would take O(N log N) time.
Okay, so you now have N' unique elements, and for each one you call all_tokens.count, which scans the entire list. That means you're doing O(N·N') work, or O(N^2) for short.
That's bad.
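To make the cost concrete, here is a minimal sketch of the pattern with toy data (not the asker's list): each call to all_tokens.count is a full O(N) scan, so the set comprehension performs N' such scans.

```python
# Toy data: 4000 tokens, of which only 'z' appears exactly once
all_tokens = ['a', 'b', 'a', 'c'] * 1000 + ['z']

# Each .count() call scans all ~4000 tokens: O(N * N') work in total
tokens_once = set(word for word in set(all_tokens)
                  if all_tokens.count(word) == 1)

print(tokens_once)
```

With 25000 texts and 14000 distinct words, that quadratic blowup is where the two hours go.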
How can we do better?
One way would be to sort all_tokens.
Now we know that duplicate elements must be adjacent to each other in the sorted list. Therefore, we just walk along the list and compare each element with the one preceding it: if they match, we have a duplicate. Sorting takes O(N log N) time and the scan for duplicates afterwards takes O(N) time. Appending the duplicated elements to a new list costs amortized O(1) per element, so the whole algorithm runs in O(N log N) time, which is quite good.
The algorithm for that looks like this:
lorem="lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet"
all_tokens=lorem.split()
all_tokens.sort()
duplicate_count=0
more_than_once=[]
for i in range(1, len(all_tokens)):
    if all_tokens[i-1] == all_tokens[i]:
        duplicate_count += 1
        if duplicate_count == 1:
            more_than_once.append(all_tokens[i])
    else:
        duplicate_count = 0
Can we do better? Yes!
Imagine that we have a hash table that allows us to insert/retrieve an element in O(1) time. We can walk down a list of items in O(N) time and increment their entry in the hash table every time we see the item. We can then retrieve the list of duplicated items in O(N) time.
What does that look like?
hash_table = {}
for i in all_tokens:
    if i in hash_table:
        hash_table[i] += 1
    else:
        hash_table[i] = 1
more_than_once = [i for i in hash_table if hash_table[i] >= 2]
Python's dictionary is implemented as a hash table, so this code runs in O(N) expected time; if a dictionary were instead backed by some kind of balanced binary search tree, the same code would fall back to being an O(N log N) algorithm.
Of course there are all kinds of fancy containers and libraries that Python could use to abstract this process, but, on a deep level, those will all be doing something a lot like what I've just described.
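In the standard library, collections.Counter is exactly that hash-table counting loop. A sketch of the whole task with it, assuming texts is a list of token lists as in the question's edit (the sample data here is abbreviated from the question):

```python
from collections import Counter

texts = [['Lorem', 'hello', 'ipsum'],
         ['Lorem', 'ipsum', 'generator', 'machine'],
         ['hello', 'Lorem', 'ipsum']]

# One O(N) pass to count every token across all texts
counts = Counter(token for text in texts for token in text)

# One more O(N) pass to drop tokens that appear only once
texts = [[token for token in text if counts[token] > 1] for text in texts]
```

Membership tests against counts are O(1) expected, so the whole thing is linear in the total number of tokens rather than quadratic.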
Personally, I prefer to find a good algorithm for solving a problem before looking for the language-specific features that will speed up an operation.