String Hashes

This long overdue post takes a brief look at string hashes.

I was trolling through blogs ( in the passing ship sense, not the flame-baiting, nor monster under the bridge sense ) and came across John Gior's post on FNV in games. Having never heard of FNV, I thought I'd take a look. Recognizing that covering just FNV is likely to be pretty boring, I thought I might take a look at string hashes in general.

In case you haven't used them before, or you don't think their useful enough to bother using them in your game, I'd like to introduce to the string hash:

String hash, meet the reader, reader meet the string hash.

String hashes can shrink the size of your data, and speed the execution of your application. Even as PCs and consoles get ever more RAM, and ever faster processors, string hashes can still have a place.

Hashing: A Quick Introduction

Hashing is typically used to store objects into an array in such a way that the objects can be retrieved quickly later on. A normal hashing algorithm goes something like this:

An object is reduced to a number ( the hash ):

   unsigned long ComputeHash(Object * object);
   ComputeHash (object1)= 0xfade;

That number is used to generate an index into an array of buckets ( the hash table ) in some cases perhaps simply by the hash modulus (%) the number of buckets:

   int CalcIndex(int hash, int numBuckets);
   CalcIndex (0xfade)= 0;

The object then gets stored in the bucket at the index calculated from its hash:

   array[0].Add( Object )

Since there are a fixed number of buckets but a large number of potential hashes, the buckets are usually implemented as linked lists. That way when (not if) two objects fall into the same bucket, they still can be retrieved as needed.

   array[0] => { (0xfade,Object1), (0x1234d,Object2) }

You might use a hash table in an RPG. For instance, you might want to keep a big list of all the usable items in memory. Rather than forcing your designers to give each one a compact, unique index, you might instead use the name of the item definition to generate a hash. Any time you load a new item from disk you can hash its name, and store the item at the calculated position in the hash table. Later, when someone requests the definition by name, you can hash the requested name, peek into the hash table, and retrieve the desired definition.

Fast, quick, flexible lookups.

At some point, however, someone realized that you could also do away the string altogether, and in fact, reduce most internal string data down to just the hash itself.

String Hashes

Internal strings appear a lot in game development. They might be designer specified names for your RPG's usable items, class names in a custom scripting language, console commands typed in by a user, or the name of network packets.

Essentially, a string hash allows you to "ossy compress" an arbitrary length string down into a single, fixed bit length, number. Instead of passing those internal strings of characters, you can instead just pass around the hash.

Note that word "lossy“ you can't reconstruct your string from the hash number alone. This means you can't go out and string hash all your localizable text and hope to reconstruct that text later for display, but it does mean you can hash all those strings that you don't display. Alleviating the storage space needed for internal strings, and reducing string compares to simple integer compares.

Ideally, you'd be able to get completely rid of all these strings altogether, except perhaps in debug mode where it can be useful to have in order to track what data is flying around.

A good first step in this direction is to introduce a StringHash class. Instead of passing around the references to strings everywhere, pass around references to the StringHash class.

A StringHash Class

Example 1-1 provides a very simple string hash class. This class allows you to generate a hash from a string, and then pass that hash around for use as needed in your searching, storage, and lookup functions.

Note that there's actually two implementations.

When IOND_KEEP_STRINGS is defined, perhaps for instance in a debug build, the string hash class selected actually copies and stores the string that created the hash. When the "keep string" is not defined, the string hash class selected does not keep the string around; it only stores the hash itself.

   Example 1-1. StringHash wrapper
//unsigned long under 32-bit compilers
typedef boost::uint_least32_t u32;
u32 ComputeHash(const char *);
struct RawStringHash {
. RawStringHash( const char * src ) {
. . m_hash=ComputeHash();
. }
. u32 Hash() const {
. . return m_hash;
. }
. bool operator==(const RawStringHash&o) const {
. . return m_hash==o.m_hash;
. }
. bool operator<(const RawStringHash&o) const {
. . return m_hash o.m_hash;
. }
};

struct StringHashWithString : RawStringHash {
. StringHashWithString( const char * src ) :
. . RawStringHash(src)
. . m_string(src) {
. }
. const char * String() const {
. . return m_string.c_str();
. }
private:
. std::string m_string;
. u32 m_hash;
};
#ifdef IOND_KEEP_STRINGS
. typedef StringHashWithString StringHash;
#else
. typedef RawStringHash StringHash;
#endif

Evaluating Hashes

The big question, of course, is just what goes in that ComputeHash() function.

Metrics

Not all hashes are created equal, and there are several metrics worth considering when choosing a hash. A few metrics that I personally consider important include: speed to generate a hash from a string, complexity of hash code, frequency of collisions between hash values, and the ability to in-place strlwr.

The first two metrics are hopefully pretty understandable, the second two, however, probably deserve a brief explanation.

First: not all hashes are perfect. Try as they might, it's certainly possible for two different strings to wind up generating the same hash value. Some algorithms are better at avoiding collisions than others.

Second: when talking about internal strings, it's often worthwhile to ignore case. Whether a designer specifies "Magical Treasure Chest of Doo" or "Magical Treasure Chest Of Doom" they probably actually intended to refer to the same object. This is a matter of personal, and team, taste of course. I generally prefer case-insensitivity.

Given those metrics there are actually a large number of hashing algorithms worth considering out there. For purposes of this post, I simply chose some of the more common public domain ones.

25,760 Strings

In order to compare the speed and collision frequency of different hashes you will need some real world test cases. Ideally, you can use the actual data from your game.

For this post however I wanted to use a large set of strings that others could easily obtain to replicate my experiments.

Easier said than done. I needed a big batch of strings to test collision against, but all the games I could think of (and could hack into) had only tiny amounts of data. I was thinking I might try and hunt down some World Of Warcraft item and monster lists, but then decided instead to go to my shelf and crack open my copy of Freelancer. I went on to Lancer's Reactor, downloaded the data file un-packers, and *poof* twenty five thousand object names at my finger tips.

Comparisons

Table 1-1 compares the generation of 25,760 strings in a release build ( auto-selected optimizations ) using Visual C++ 2005 Express.

Table 1-1
Name Bit Depth Speed (ms) Collision Complexity Lower
boost::crc_32_type 32 162 none! high no
fast_boost_crc 32 61 none! high no
custom CRC 32 54 none! med yes
FNVA 32 56 none! low yes
FNV 32 51 1 low yes
Jenkins 32 37 none! high no
The native boost crc type is slow. It took me a while to understand why, but basically to be compatible with the crc "standard" (pkzip) the algorithm has to invert the bits of the source and destination hash -- believe it or not, that extra 100ms comes all from there.

The fast boost crc (shown in Example 1-2) tweaks the parameters of the boost crc template, to remove the bit inversions. I believe that this is the fastest boost crc possible.

   Example 1-2. Fast Boost CRC
typedef boost::crc_optimal<32, 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF, false,  false> fast_boost_crc;

My custom CRC type is basically just the boost algorithm manually expanded and tweaked to lay the statements out in a good order. The nice thing about manually expanding the algorithm is that it's possible to manually insert the strlwr'ing along the way.

The FNV(S) aren't bad but the FNV collides (just once), and FNVA is a little slower than my custom CRC code.

Jenkins looks awesome but it is worth noting that it works 4 bytes at a time, and thus can't be easily in-place strlwr'd. Further, since it doesn't look at individual characters, it can't automatically detect the end of your strings, and therefore also needs to know how long your strings are in advance.

The Hash Algorithms

More information on the various hashes can be found at:
Of these, I like the custom CRC. It has a good speed, manages to avoid collision well, and can be outfitted with strlwr. Unfortunately, CRC generation does require a table -- in this case a 1K table. If you need the space, than FNV/A would be good alternatives.

It's interesting to note that the stdext::hash_map in Visual Studio C++ 2005 uses FNV for it's hash computations.

Future Work

I will post my code here a little later on, and I also follow up with two additional posts:

"A purist's guide to string hashes" taking a look at keeping consistent StringHash structure sizes between builds, computing crcs at compile time, and how to obtain the perfect hash

"Applications of the string hash" showing how the string hash can be used in practice.

0 comments: