Studying Open Source game engines is a great learning tool for any game developer, specially for game programmers. Luckily we have a few good ones to look into, even though it is only a tiny fraction of the number of games out there that go Open Source. One of my favorites is the DOOM 3 Engine, AKA id Tech 4. Even if somewhat old by now, it was a very solid and well written game engine and it still offers a lot of knowledge about game programming.
Whenever I start looking into a new game engine, I like to begin with the core and support systems. They usually set the tone for the rest of the codebase and tend to be the most modularized parts that are the easiest ones to grasp for an outsider. Id Tech 4 has a very nice pair of support libraries called the idlib and the framework. I suggest forking the GitHub repo and checking out for yourself. The code snippets I’ll show here are taken from the DOOM 3 BFG source-code repository.
idlib is full of interesting stuff, but one little container class caught my eye
the first time I looked into the engine:
idHashIndex. This class caries a header
comment that reads:
Fast hash table for indexes and arrays.
Does not allocate memory until the first key/index pair is added.
When I tried to look at the implementation for the first time to find out what made it a “fast hash table”, (or rather, what made it faster than a regular hash table?), I was a bit thrown off by code involving bitwise operations and multiple underlaying arrays. So I just left it be and moved on to look at other stuff. This hash-table-like class is used in many places by the game and engine code. They pretty much use it anywhere a key+value store is required, so it must be a pretty efficient container.
Anyways, the other day I was watching this interesting CppCon talk by Chandler Carruth
where he ranted a couple times about how much the
suck for not being designed in a cache efficient way (to sum up,
map is a binary
unordered_map is a linked hash table, so both rely on sparsely allocated
nodes). After watching the talk on YT I immediately remembered about the mysterious
idHashIndex and was prompted to do some digging into that class to find out what
really makes it tick and check to see if it would beat the C++ equivalents
unordered_map) this many years after it was first written.
Figuring out idHashIndex
idHashIndex is not your usual hash table in the sense that it stores pairs of keys and values.
The name is fitting, as it actually stores integer numbers (indexes) that are themselves indexed by a hash key.
So the idea it that you will always use
idHashIndex combined with an array that stores the
actual values (the value store), while the hash-index will provide a quick way of mapping an
arbitrary key into a value in that array.
We can already begin to see how it might be more efficient. The “hot” lookup data is kept separate from the values, so you waste no CPU cache with unneeded data related to the values when looking up by key. When we are looking up a key+value store we only care about the keys, so it is a design mistake to marry both the keys and the values under the same structure. This is one of the examples Mike Acton uses in his Data-Oriented Design and C++ talk.
So the overall usage pattern on
idHashIndex is this:
- Values are kept in a separate array (usually a dynamic array, like
- The hash-index provides a way of mapping keys (like a
std::string) to a corresponding index into the value-store array/vector.
For instance, this is how they use a hash-index to lookup models by name in the engine:
In the above example, pointers to models are stored in a
models vector-like container,
while the hash-index provides lookup by the filename.
hash is a
GenerateKey() just computes a hash value for the input string. It has no business being
a member of the
idHashIndex class really, but for some reason they decided to go strict
OOPy there and made the hash function a public member of the class.
Now let’s take a look at the hash-index interface. I stripped all the comments and some irrelevant methods and constructors for brevity. You can find the original at the source repo.
The most interesting methods to note are
Next(). We can also see that they are
used in the lookup example above. They are necessary because the internal hash-index tables have
a finite size, so eventually a hash key produced by
GenerateKey() will collide with an already
in-use entry. For those cases, a short search must be performed until either the wanted item is found
or the end of the buckets chain is reached (-1). This is similar to the idea of Open Addressing.
In the average case, when the load factor of the hash-index is low, the lookup loop will only perform
a single iteration, finding the wanted element in the first try. Thus we have constant-time lookup, like
in any hash table implementation. The performance gain should come from the worst case when the loop has
to iterate several times. Thanks to the way
idHashIndex is implemented, subsequent iterations will
access neighboring elements in the array, which are already likely to be in the CPU cache due to hardware
prefetching. This is the departure from
std::unordered_map, for instance, which is implemented in terms
of a liked list. The first lookup it constant time, but if there’s a key collision, then a linked-list
iteration is required. Unless the user provides a custom pool allocator, the list nodes will be sourced
operator new, so
unordered_map will potentially touch memory that is scattered through the RAM,
idHashIndex will only visit neighboring items in the user-defined value-store array.
Let’s now take a closer look at how
idHashIndex is implemented and how they made some
clever little optimizations to make lookup as efficient as possible:
hashSize are a pointer to an array of integers and the array size, respectively.
hash is indexed by the incoming hash key in methods like
Add/Remove. It either
holds the index of the requested data in the user-defined value-store if the keys never collided,
or an index into
indexChain. The hash-index has no way of knowing if the value of
and index into
indexChain or the user array, that’s why you need to perform the test yourself
in the lookup loop. If the data it returned was the corresponding data for the key, then it was
a user index, otherwise, it points to
indexChain and you should call
indexSize are also a pointer to an array of integers and the array size.
indexChain is either empty (full of -1s) or it has the data for key-collision resolution.
hash points to
indexChain, the given entry will be
!= -1. If you test
with your array and find the item you wanted, then it was the index of the data, otherwise,
the resulting value is another index into
indexChain. Keep looking up with
you either find the data or until -1 is encountered.
granularity is just the resizing factor for
hash array size will
remain unchanged after the table is constructed, but for the above collision resolution to
indexChain size must match the size of the external value-store, so whenever a new
index is added to the hash-index, a check is performed to see if the new index will fit the
index chain, if not,
indexChain is resized to match, with some extra taken from
to amortize future growths of the array.
Now comes a little bit of old school black magic… This is what
Next() look like:
INVALID_INDEX are basically an effort to make those two methods
as fast as possible. They effectively allow implementing lookup without using any
if tests, which
back then was a bigger deal than it is today. Nowadays you don’t pay much for a jump instruction,
but nonetheless, they are actually very elegant optimizations once you get the gist of it.
INVALID_INDEX is a static array with one element that is set to -1, the “invalid index” sentinel
value used by the hash-index. When the hash-index is empty and no heap memory is assigned to it, both
indexChain will point to
INVALID_INDEX, having a single element array with a -1 in it to return the
expected value if a lookup is performed in the empty table. The
INVALID_INDEX is never changed by the code but
it is not declared as
const because it has to be assigned to the non-const pointers of
hashMask will always be the same as
hashSize is enforced to be always a power-of-two,
which means the expensive integer modulo operation (
%) can be replaced with a cheap AND (
&) with the
length of the table -1 when indexing it. So the cached
hashMask also avoids and extra subtraction when
hash array from an incoming hash key.
lookupMask was a bit mysterious at first. I didn’t really get the point of it until I actually ran
the code in a few tests.
lookupMask will always be set to either 0 or -1. It is of type
this is important. It’s sole objective it to avoid the need for an extra
if test for the empty hash-index
case. When empty,
lookupMask is 0, so the AND op we see above will always yield zero, which is the only
index available in
INVALID_INDEX and equals to -1, the correct value to return for the empty case.
When not empty,
lookupMask is -1. If we remember how signed integers are represented in two’s complement, we can see why.
-1 is represented as all 1s in binary, or
0xFFFFFFFF... in hexadecimal. ANDing a bit with 1 will yield back
the input, so when the table is not empty, the AND with
lookupMask is just a no-op. This smart little
hack makes the code branch-less by trading a compare and jump for a much cheaper AND instruction.
Okay, now that we have a pretty good idea of how
idHashIndex works and why it might be faster than
the Standard C++ equivalents, it’s time to write up a few benchmarks and stack up
I made some changes to
idHashIndex so that it would compile as a standalone class and wrote
a small test program to stress insertions, erasure and lookup. The hash-index is paired with a
std::vector in my test code, while
std::map/unordered_map are used as standalone containers.
The benchmark code was built and tested on Clang with
-03. These are the results:
testing insertions on std::map 4096 iterations average time taken...: 372 ns lowest time sample...: 145 ns largest time sample..: 9461 ns ---------------------------------- testing insertions on std::unordered_map 4096 iterations average time taken...: 212 ns lowest time sample...: 133 ns largest time sample..: 80079 ns ---------------------------------- testing insertions on hash_index + std::vector 4096 iterations average time taken...: 82 ns lowest time sample...: 23 ns largest time sample..: 53708 ns
testing erasures on std::map 4096 iterations average time taken...: 308 ns lowest time sample...: 166 ns largest time sample..: 9216 ns ---------------------------------- testing erasures on std::unordered_map 4096 iterations average time taken...: 177 ns lowest time sample...: 157 ns largest time sample..: 281 ns ---------------------------------- testing erasures on hash_index + std::vector 4096 iterations average time taken...: 45 ns lowest time sample...: 22 ns largest time sample..: 80 ns
testing lookup on std::map 4096 iterations average time taken...: 220 ns lowest time sample...: 123 ns largest time sample..: 9021 ns ---------------------------------- testing lookup on std::unordered_map 4096 iterations average time taken...: 488 ns lowest time sample...: 60 ns largest time sample..: 16152 ns ---------------------------------- testing lookup on hash_index + std::vector 4096 iterations average time taken...: 76 ns lowest time sample...: 43 ns largest time sample..: 401 ns
std::map as expected lost on all three operations. It only won at the worst case insertion time.
Map is implemented as a binary search tree (usually a Red-Black balanced tree), so
it allocates sparse node structures for each new key-value pair in the tree. This will have terrible
data locality. My guess is that it only won in the worst case insertion because it always allocates
a new node for each insertion, so the cost is pretty much distributed evenly, while
probably only allocated memory at construction, but it might grow the table if needed, thus the
huge spike for the worst time, which is probably resulting from a realloc and copy of the whole table.
idHashIndex) also measures the time to
push_back() in an accompanying
that acts as our backing store of values.
hash_index proved to be significantly faster than
std::unordered_map and much
std::map for the common operations of insertion, removal and lookup. I am definitely
curious to use it again in a more practical scenario and try to compare it again with its closest
Standard sibling, the
idHashIndex seems like a pretty handy tool to have, after looking at the tests above,
so I went ahead and extracted the code into a standalone single-file template class that
I can now use for my own projects! If you’d like to use it as well, you’ll find it together
with the test and benchmark code in my GitHub repository. I have kept
the GPL license though, to comply with the license used by the DOOM 3 project. So that might
limit its uses a little, I’m afraid. If that is not a problem for you, then please, make
it useful and send feedback!
Three very relevant presentations, available on YouTube, I mention above that are worth watching for anyone interested in C++ and optimizations: