Hi,
JamesM wrote:
bewing wrote: It depends on the number of data items you will be storing vs. the number of entries in the hash table. Brendan is completely correct if the dictionary is not "dense". It is easily possible to go too far with the modulo buckets and the prime numbers. Doing bucket = (uint8_t) crc32(data) generally works quite well. The length of time needed to calculate the modulus is generally much smaller than the time to calculate the hash.
Two points: Firstly, if you know you have sparse data, there are better ways to make your table more efficient than converting the modulo to an AND. If that is the case, I would start by reducing the size of the table as it is obviously too large for the dataset you have. For best time/space efficiency it should be around 50% full.
If the original key is a random or pseudo-random number (e.g. a V4 GUID) then any modulo (including any power of 2) is going to be as good as any other for avoiding collisions. If the original key is something like the age of each student that graduated from primary school at the same time you did, then you're going to be screwed regardless of whether you use a prime number or not (it's a very bad key).
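To illustrate (the sizes and names here are made up for the example), with a key that's already well distributed either of these spreads entries evenly - and neither rescues a key that isn't:

#include <stdint.h>

#define POW2_BUCKETS  1024u     /* power of two - can use a cheap AND */
#define PRIME_BUCKETS 1021u     /* nearby prime - needs a real modulo */

/* Both give an even spread if 'key' is already well distributed (e.g. a
 * random GUID); neither helps a badly distributed key. */
static uint32_t bucket_pow2(uint32_t key)  { return key & (POW2_BUCKETS - 1u); }
static uint32_t bucket_prime(uint32_t key) { return key % PRIME_BUCKETS; }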
JamesM wrote:You have to box a little clever with string data, but again, storing strings in a hashtable is a horrid waste of compute power (think of all the string comparisons you need to do once you hit the correct bucket!)
No.
The hash function always has 2 parts. The first part is some sort of calculation to convert the key into an integer ID (that hopefully has good distribution), and the second part is using modulo to create a hash from this ID. For string lookups, you use the hash to find the bucket, then the full integer ID to avoid almost all string comparisons.
For an example, you might use CRC32 to create a 32-bit ID from the string, then do "hash = CRC32 & 0xFFFF" to get the hash and find the right bucket, then (for each entry in the bucket) you do "if(entry.id == CRC32) { if(strcmp(entry.string, string) == 0) { /* Yay! */ } }". The chance of doing an unnecessary string comparison is extremely small (unless the hash table is so small that you get a very high number of collisions and have to worry about the birthday problem).
Of course the CRC32 calculation is a bit expensive, and for most situations it's probably overkill.
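For what it's worth, here's a rough self-contained sketch of that lookup scheme in C - the 16-bit mask, the names and the slow bit-by-bit CRC-32 are purely for illustration:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_MASK 0xFFFFu                 /* 65536 buckets, as in "CRC32 & 0xFFFF" */

struct entry {
    uint32_t id;                          /* full 32-bit ID (CRC32 of the string) */
    const char *string;                   /* original key */
    struct entry *next;                   /* next entry in the same bucket */
};

static struct entry *buckets[HASH_MASK + 1];

/* Simple bit-by-bit CRC-32 (reflected, polynomial 0xEDB88320) - slow, but
 * enough for a sketch */
static uint32_t crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

static struct entry *lookup(const char *string)
{
    uint32_t id = crc32(string, strlen(string));
    uint32_t hash = id & HASH_MASK;

    for (struct entry *e = buckets[hash]; e != NULL; e = e->next) {
        /* Cheap integer comparison first; the full string comparison only
         * happens when the 32-bit IDs match */
        if (e->id == id && strcmp(e->string, string) == 0) {
            return e;
        }
    }
    return NULL;
}

(Insertion would compute the same id/hash pair and just prepend a new entry to buckets[hash].)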
Fanael wrote:
Brendan wrote: Finding a good hash function isn't that hard; and you only need to do rehashing if your hash size was wrong/too small to begin with.
Still, an implementation of a hash table has to cope with these things, and this obviously adds some complexity. An implementation of a trie doesn't even know what they are.
There's no "always best" method, different solutions are better/worse for different situations, and it's hard to make generalisations without knowing anything about the exact situation/s. In a similar way, it's hard to estimate the complexity of each method without knowing anything about the situation.
There's only 2 assumptions we can really make about the situation - we can assume that performance matters, and we can assume there's enough data that implementation details matter. Those 2 initial assumptions lead to a third assumption - you want to avoid pounding the living daylights out of the CPU/s caches.
For a hash table, optimising for caches means finding a good hash table size - not so big that you risk unnecessary cache misses when accessing the hash table, and not so small that you risk unnecessary cache misses when searching for the right entry in a bucket. Most people would do a good enough job of this without thinking about caches at all.
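As a rough illustration (the "roughly one bucket per expected entry" target is my assumption, not a rule), the sizing logic can be as simple as:

#include <stddef.h>

static size_t pick_table_size(size_t expected_entries)
{
    /* Round up to the next power of 2 that gives roughly one bucket per
     * expected entry - big enough that chains stay short, small enough
     * that the bucket array itself doesn't thrash the caches */
    size_t size = 64;

    while (size < expected_entries) {
        size *= 2;
    }
    return size;
}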
For a trie, it gets messy. The first thing I'd do is make sure each node is split into "hot data" (fields that are needed for searches) and "cold data" (everything else); where the trie only contains "hot data" (and the "hot data" contains a pointer to the "cold data" that's allocated separately). Then I'd try to allocate (and/or reallocate) chunks of RAM that are large enough to contain many adjacent nodes - for example, maybe split the entire trie into "sub-tries" such that each sub-trie fits into a single 4 KiB page. I'm not too sure if there's a way of doing any of this without losing more (from alloc/free overhead) than you gain (from cache locality).
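As a rough illustration of that hot/cold split (the field names and the fan-out are invented for the example):

#include <stdint.h>

struct trie_cold {                  /* anything not needed during a search */
    char *full_key;
    void *value;
    uint64_t stats;
};

struct trie_node {                  /* "hot" data only, kept small so that
                                     * more nodes fit in each cache line or
                                     * 4 KiB page */
    struct trie_node *child[16];    /* e.g. one child per nibble of the key */
    struct trie_cold *cold;         /* allocated separately */
};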
The next assumption I'd make is that scalability matters (e.g. many threads or CPUs reading and potentially modifying the data at the same time). For a hash table, having some sort of lock for each bucket is both obvious and easy - if there's enough buckets that collisions aren't a problem then it's likely there's enough buckets that lock contention won't be a problem either. Of course there's more complex methods that might be suitable (e.g. doing a lockless "read copy update" operation on buckets) but there's at least a clear starting point to work with.
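A minimal sketch of the lock-per-bucket idea (pthread mutexes are used purely for illustration - a kernel would use its own spinlock type, and the names here are made up):

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 4096u

struct entry {
    uint32_t id;                    /* full integer ID, as described earlier */
    const char *string;
    struct entry *next;
};

struct bucket {
    pthread_mutex_t lock;           /* one lock per bucket */
    struct entry *head;
};

/* Each lock still needs pthread_mutex_init() (or PTHREAD_MUTEX_INITIALIZER)
 * before first use */
static struct bucket table[NUM_BUCKETS];

static struct entry *locked_lookup(uint32_t id, const char *string)
{
    struct bucket *b = &table[id % NUM_BUCKETS];
    struct entry *found = NULL;

    pthread_mutex_lock(&b->lock);   /* with thousands of buckets, two CPUs
                                     * rarely want the same lock at once */
    for (struct entry *e = b->head; e != NULL; e = e->next) {
        if (e->id == id && strcmp(e->string, string) == 0) {
            found = e;
            break;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return found;
}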
For a trie, I'd probably spend the first day or so doing research (finding and reading white-papers, etc), and after that I'd give up. Combined with the cache optimisations above, you end up with a huge over-complicated mess that reminds me of a Brian W. Kernighan quote ("Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.").
Cheers,
Brendan