Implementing simple string interning

2016-05-18

An answer to this question on Stack Overflow.

Question

Problem

I have a huge collection of strings that are duplicated among some objects. What is need is string interning. These objects are serialized and deserialized with protobuf-net. I know it should handle .NET string intering, but my tests have shown that taking all those strings myself and creating a Dictionary<string, int> (mapping between a value and its unique identifier), replacing original string values by ints, gives better results.

The problem, though, is in the mapping. It is only one-way searchable (I mean O(1)-searchable). But I would like to search by key or by value in O(1). Not just by key.

Approach

The set of strings is fixed. This sounds like an array. Search by value is O(1), blinding fast. Not even amortized as in the dictionary - just constant, by the index.

The problem with an array is searching by keys. This sounds like hashes. But hey, n hashes aren't said to be evenly distributed among exactly n cells of the n-element array. Using modulo, this will likely lead to collisions. That's bad.

I could create, let's say, an n * 1.1-length array, and try random hashing functions until I get no collisions but... that... just... feels... wrong.

Question

How can I solve the problem and achieve O(1) lookup time both by keys (strings) and values (integers)?

Two dictionaries is not an option ;)

Answer

Can you use an array to store the strings and a hash table to relate the strings back to their indices in the array?

Your n*1.1 length array idea might be improved on by some reading on perfect hashing and dynamic perfect hashing. Wikipedia has a nice article about the latter here. Unfortunately, all of these solutions seem to involve hash tables which contain hash tables. This may break your requirement that only one hash table be used, but perhaps the way in which the hash tables are used is different here.