Why C++ Unordered Maps (Hash tables) Are Not Great For Strings

Random Woman Image With Enlarged Breasts
A random picture of a woman
for no good reason.
Hash maps, hash tables, unordered maps: what ever you call them - they are really popular these days, but in C++ they are not always a great idea.

C++ containers were not built with hashing in mind. Unordered map came in with TR1 which as an update to the first version of the standard library. It should be no shock therefore that the standard library containers were not designed with hashing in mind. This leads to all sorts of problems. I will discuss a few of those and some more general issues with the approach.

Problem 1 - lookup performance:
Containers to not have a hash code property. This means that a hashing function must run over the contents of the container and create a hash code each and every time a lookup is to be performed. This can be very expensive. We can compare this to the comparisons of the ordered implementation which only require traversing over however many elements are needed to give ordering information. For both implementations, a full comparison may be done at the end.

From this mind experiment it should be clear that unless the map is large (many nodes traversed in the ordered version) or the keys short the cost of the initial hash computation can outweigh any performance advantage given by the unordered implementation.

The effect of the hash overhead can be seen in this paper which shows lookup times for long strings being slower for unordered_maps up to 100 elements in size.

The classic solution to this problem is to pre-compute the hash function for all objects which we are going to maybe use in unordered_maps. This makes lookup of unordered_maps very much faster and in my Java experience means that we require maps to be tiny (4 or less elements) before other the effects like cache locality make the ordered maps faster.

Problem 2 - cost of precomputing:
As discussed above, precomputing hash functions can overcome the lookup cost. However, it has its own problems.

A) We need to wrap or sub-class standard containers to have the hash function computation and lookup members.
B) We could incur the cost of hash generation when we don't need it.

Issue B is more significant than we might expect. Let us consider vectors of integers. This means we need to iterate over the contents of the entire vector to create a new hash value every time any value in the vector changes. We could say 'we will only generate the hash when we need it'. That is a good idea but it only saves us if we re-use the looked up value. The end result is that we need to be very careful not to let our expensive to create or mutate objects leak into places where we don't need the unordered map functionality otherwise we could compromise the entire application's performance for the benefit of improving just a few lookups. This is why poor choices like the early Java hash for strings, which only included the first 7 characters, came about.

Issue A is less of a pain but it is a pain never the less. Any specialisation away from the standard containers risks losing the subtle optimisation benefits worked into modern standard library implementations. It also risks bugs working their way into the code as team members turn over and each new person needs to learn all the specialisations.

Problems 3:
Large regeneration pauses on insert. This nice thing about a tree based map (or set) is that the cost of insertion is relatively stable and relatively fast. OK, the cost of memory allocation is an issue but we will assume we are not trying for real-time here so we can also assume we have an OK allocator which will work well. From a concurrency stand point we can just use a lock-check-lock-check approach for insertion:

1) Read lock (concurrent lock)
2) Does my set contain this key: Yes continue :No goto No
3) Get value
4) Release read lock
5) Return
6) No:
7) Write lock- singular lock
8) Does my set contain the key: Yes - continue: :No goto Write
9) Release all locks
10) Return
11) Write: 
12) Insert the value
13) Release all locks
14) Return

Or some variation of the above. This works nicely because even though we lock all access to the map when we insert, the insertion is fast and has a dependable time. 

The same approach for unordered_map has a big problem. As the map fills up the number of buckets in the lookup becomes insufficient and lookup starts to move from zero ordered to linear. In other words, an over full hash based map scales no better than linear searching a vector!

To overcome the overfill issue when a particular threshold is reached the map must be regenerated with a larger number of buckets. This regeneration is complex and takes a significant amount of time. In the concurrent situation we can end up with very bad thread starvation as all the threads pile up trying to read from a map which is locked by a single writer.

If you think this post is a carefully constructed, maths based argument in favour of tree based sets and maps you have gotten it all wrong. It is much more a casual discussion on how the common assumption that unordered_map and unordered_set dominate the ordered equivalents in performance is just wrong. It is further an explanation why benchmarking has lead me to use set and map by default and treat unordered_set and unordered_map as post-hock optimisations.

My rule of thumb: use tree based ordered containers and only move over to the unordered version when benchmarks and careful analysis indicates there will be a real world benefit from doing so.