Hi,
JamesM wrote:
bewing wrote: It depends on the number of data items you will be storing vs. the number of entries in the hash table. Brendan is completely correct if the dictionary is not "dense". It is easily possible to go too far with the modulo buckets and the prime numbers. Doing bucket = (uint8_t) crc32(data) generally works quite well. The length of time needed to calculate the modulus is generally much smaller than the time to calculate the hash.
Two points: Firstly, if you know you have sparse data, there are better ways to make your table more efficient than converting the modulo to an AND. If that is the case, I would start by reducing the size of the table as it is obviously too large for the dataset you have. For best time/space efficiency it should be around 50% full.
If the original key is a random or pseudo-random number (e.g. a V4 GUID) then any modulo (including any power of 2) is going to be as good as any other for avoiding collisions. If the original key is something like the age of each student that graduated from primary school at the same time you did, then you're going to be screwed regardless of whether you use a prime number or not (it's a very bad key).
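To illustrate (the sizes and names here are made up for the example), with a key that's already well distributed either of these spreads entries evenly - and neither rescues a key that isn't:

#include <stdint.h>

#define POW2_BUCKETS  1024u     /* power of two - can use a cheap AND */
#define PRIME_BUCKETS 1021u     /* nearby prime - needs a real modulo */

/* Both give an even spread if 'key' is already well distributed (e.g. a
 * random GUID); neither helps a badly distributed key. */
static uint32_t bucket_pow2(uint32_t key)  { return key & (POW2_BUCKETS - 1u); }
static uint32_t bucket_prime(uint32_t key) { return key % PRIME_BUCKETS; }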
JamesM wrote:You have to box a little clever with string data, but again, storing strings in a hashtable is a horrid waste of compute power (think of all the string comparisons you need to do once you hit the correct bucket!)
No.
The hash function always has 2 parts. The first part is some sort of calculation to convert the key into an integer ID (that hopefully has good distribution), and the second part is using modulo to create a hash from this ID. For string lookups, you use the hash to find the bucket, then the full integer ID to avoid almost all string comparisons.
For an example, you might use CRC32 to create a 32-bit ID from the string, then do "hash = CRC32 & 0xFFFF" to get the hash and find the right bucket, then (for each entry in the bucket) you do "if(entry.id == CRC32) { if(strcmp(entry.string, string) == 0) { /* Yay! */ } }". The chance of doing an unnecessary string comparison is extremely small (unless the hash table is so small that you get a very high number of collisions and have to worry about the birthday problem).
Of course the CRC32 calculation is a bit expensive, and for most situations it's probably overkill.
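For what it's worth, here's a rough self-contained sketch of that lookup scheme in C - the 16-bit mask, the names and the slow bit-by-bit CRC-32 are purely for illustration:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_MASK 0xFFFFu                 /* 65536 buckets, as in "CRC32 & 0xFFFF" */

struct entry {
    uint32_t id;                          /* full 32-bit ID (CRC32 of the string) */
    const char *string;                   /* original key */
    struct entry *next;                   /* next entry in the same bucket */
};

static struct entry *buckets[HASH_MASK + 1];

/* Simple bit-by-bit CRC-32 (reflected, polynomial 0xEDB88320) - slow, but
 * enough for a sketch */
static uint32_t crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

static struct entry *lookup(const char *string)
{
    uint32_t id = crc32(string, strlen(string));
    uint32_t hash = id & HASH_MASK;

    for (struct entry *e = buckets[hash]; e != NULL; e = e->next) {
        /* Cheap integer comparison first; the full string comparison only
         * happens when the 32-bit IDs match */
        if (e->id == id && strcmp(e->string, string) == 0) {
            return e;
        }
    }
    return NULL;
}

(Insertion would compute the same id/hash pair and just prepend a new entry to buckets[hash].)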
Fanael wrote:
Brendan wrote: Finding a good hash function isn't that hard; and you only need to do rehashing if your hash size was wrong/too small to begin with.
Still, an implementation of a hash table has to cope with these things, and this obviously adds some complexity. An implementation of a trie doesn't even know what they are.
There's no "always best" method, different solutions are better/worse for different situations, and it's hard to make generalisations without knowing anything about the exact situation/s. In a similar way, it's hard to estimate the complexity of each method without knowing anything about the situation.
There's only 2 assumptions we can really make about the situation - we can assume that performance matters, and we can assume there's enough data that implementation details matter. Those 2 initial assumptions lead to a third assumption - you want to avoid pounding the living daylights out of the CPU/s caches.
For a hash table, optimising for caches means finding a good hash table size - not so big that you risk unnecessary cache misses when accessing the hash table, and not so small that you risk unnecessary cache misses when searching for the right entry in a bucket. Most people would do a good enough job of this without thinking about caches at all.
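As a rough illustration (the "roughly one bucket per expected entry" target is my assumption, not a rule), the sizing logic can be as simple as:

#include <stddef.h>

static size_t pick_table_size(size_t expected_entries)
{
    /* Round up to the next power of 2 that gives roughly one bucket per
     * expected entry - big enough that chains stay short, small enough
     * that the bucket array itself doesn't thrash the caches */
    size_t size = 64;

    while (size < expected_entries) {
        size *= 2;
    }
    return size;
}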
For a trie, it gets messy. The first thing I'd do is make sure each node is split into "hot data" (fields that are needed for searches) and "cold data" (everything else); where the trie only contains "hot data" (and the "hot data" contains a pointer to the "cold data" that's allocated separately). Then I'd try to allocate (and/or reallocate) chunks of RAM that are large enough to contain many adjacent nodes - for example, maybe split the entire trie into "sub-tries" such that each sub-trie fits into a single 4 KiB page. I'm not too sure if there's a way of doing any of this without losing more (from alloc/free overhead) than you gain (from cache locality).
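As a rough illustration of that hot/cold split (the field names and the fan-out are invented for the example):

#include <stdint.h>

struct trie_cold {                  /* anything not needed during a search */
    char *full_key;
    void *value;
    uint64_t stats;
};

struct trie_node {                  /* "hot" data only, kept small so that
                                     * more nodes fit in each cache line or
                                     * 4 KiB page */
    struct trie_node *child[16];    /* e.g. one child per nibble of the key */
    struct trie_cold *cold;         /* allocated separately */
};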
The next assumption I'd make is that scalability matters (e.g. many threads or CPUs reading and potentially modifying the data at the same time). For a hash table, having some sort of lock for each bucket is both obvious and easy - if there's enough buckets that collisions aren't a problem then it's likely there's enough buckets that lock contention won't be a problem either. Of course there's more complex methods that might be suitable (e.g. doing a lockless "read copy update" operation on buckets) but there's at least a clear starting point to work with.
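A minimal sketch of the lock-per-bucket idea (pthread mutexes are used purely for illustration - a kernel would use its own spinlock type, and the names here are made up):

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 4096u

struct entry {
    uint32_t id;                    /* full integer ID, as described earlier */
    const char *string;
    struct entry *next;
};

struct bucket {
    pthread_mutex_t lock;           /* one lock per bucket */
    struct entry *head;
};

/* Each lock still needs pthread_mutex_init() (or PTHREAD_MUTEX_INITIALIZER)
 * before first use */
static struct bucket table[NUM_BUCKETS];

static struct entry *locked_lookup(uint32_t id, const char *string)
{
    struct bucket *b = &table[id % NUM_BUCKETS];
    struct entry *found = NULL;

    pthread_mutex_lock(&b->lock);   /* with thousands of buckets, two CPUs
                                     * rarely want the same lock at once */
    for (struct entry *e = b->head; e != NULL; e = e->next) {
        if (e->id == id && strcmp(e->string, string) == 0) {
            found = e;
            break;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return found;
}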
For a trie, I'd probably spend the first day or so doing research (finding and reading white-papers, etc), and after that I'd give up. Combined with the cache optimisations above, you end up with a huge over-complicated mess that reminds me of a Brian W. Kernighan quote ("Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.").
Cheers,
Brendan