
Posted: Tue May 29, 2007 4:42 pm
by Kevin McGuire
You will know to increase the array size when you start having a lot of collisions. I would keep a count, either as you traverse a chain or when adding to a chain, to determine when you are slowly losing performance.

You could schedule a rehash for when there is a lot of idle time, or do a background rehash and dynamically adjust its priority so that it can catch up, or come close enough to rehash the entire array in one go, so that you never make the entire process (program) stop for a rehash.
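A minimal sketch of that counting idea in C, assuming chaining and a simple load trigger. The structures, names, and the threshold of four entries per bin are invented for illustration, the table is assumed to be already initialized, and allocation checks are omitted:

Code:

#include <stdint.h>
#include <stdlib.h>

struct entry {
    char *key;
    struct entry *next;
};

struct table {
    struct entry **bins;
    uint32_t nbins;
    uint32_t count;   /* total stored entries, updated on insert */
};

/* Same style of hash as the one below: just sum the bytes of the key. */
static uint32_t hash_key(const char *key)
{
    uint32_t hash = 0;
    while (*key)
        hash += (uint8_t)*key++;
    return hash;
}

/* Move every entry into a new bin array, rebuilding the chains as we go. */
static void rehash(struct table *t, uint32_t nbins)
{
    struct entry **bins = calloc(nbins, sizeof(*bins));
    for (uint32_t i = 0; i < t->nbins; ++i) {
        for (struct entry *e = t->bins[i], *next; e; e = next) {
            next = e->next;
            uint32_t b = hash_key(e->key) % nbins;
            e->next = bins[b];
            bins[b] = e;
        }
    }
    free(t->bins);
    t->bins = bins;
    t->nbins = nbins;
}

void insert(struct table *t, char *key)
{
    /* Average chain length too long? Rehash into twice the bins. */
    if (t->count / t->nbins >= 4)
        rehash(t, t->nbins * 2);

    uint32_t b = hash_key(key) % t->nbins;
    struct entry *e = malloc(sizeof(*e));
    e->key = key;
    e->next = t->bins[b];
    t->bins[b] = e;
    t->count++;
}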

(KEYSPACE_SIZE / ARRAY_SIZE) is how many keys map to each array entry, so it is roughly the worst-case number of collisions you will have per entry.

Take for example:

Code:

#include <stdint.h>

uint32_t doHash(uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;   /* accumulate the byte sum */
    for (uint32_t x = 0; x < len; ++x)
    {
        hash = hash + in[x];
    }
    return hash;
}
You have to understand the hash, and since I am not very good at math I will use something simple like this. Knowing the algorithm will help: the hash algorithm (above) gives you your key space size (below):
KEYSPACE_SIZE = (255 * m) + 1, where m is the length in bytes of the longest string you hash (a sum of m bytes ranges from 0 to 255 * m).
(KEYSPACE_SIZE - ARRAY_SIZE) is how many collisions you would have in total if you exhausted the entire key space: each array entry receives (KEYSPACE_SIZE / ARRAY_SIZE) keys, and every key after the first in an entry is a collision.

If you can make an array of this size then you should never have a collision. With an array a quarter of this size you can have up to three collisions per entry in the worst case, even if you use only half of the key space.

So knowing the size of your biggest string (m in the equation above) helps you determine an optimal array size, weighed against the expected nominal number of items over the entire period your array of hashes is expected to be in use.
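Worked through with made-up numbers: for strings of at most m = 16 bytes, KEYSPACE_SIZE = 255 * 16 + 1 = 4081. An array of 4081 entries would never see a collision, while an array a quarter of that size (about 1020 entries) maps four hash values to each entry, giving up to three collisions per entry in the worst case.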

(And the above algorithm might be wrong. I am still thinking it over.)

Posted: Tue May 29, 2007 5:15 pm
by Alboin
Hm. Sounds like a good idea. I'll keep track of the number of items added, and the number of collisions, and when that reaches a certain point, I'll just rehash it. Beautiful.

Thanks again!!!

Posted: Tue May 29, 2007 5:29 pm
by Kevin McGuire
You might want to get a second opinion on this too, since I am no expert.

Posted: Tue May 29, 2007 7:45 pm
by frank
In my hashing attempts I always doubled the size whenever the array got more than half full. Kind of the cheap way out but it worked.
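In code, that policy is just a load check on the insert path. A sketch with invented names (should_grow would be called before each insert, next to whatever rehash routine you already have):

Code:

#include <stddef.h>

/* frank's rule: grow (e.g. double the bin count) once the table
   holds more entries than half its bin count. */
static int should_grow(size_t count, size_t nbins)
{
    return count > nbins / 2;
}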

Posted: Tue May 29, 2007 8:20 pm
by Alboin
frank wrote:In my hashing attempts I always doubled the size whenever the array got more than half full. Kind of the cheap way out but it worked.
That does sound a bit simpler. Maybe I'll do that instead...

Just to clarify something: A hash table is pretty much the same as an associative array, correct? (The terms seem to be used interchangeably.)

Posted: Tue May 29, 2007 8:27 pm
by Colonel Kernel
Alboin wrote:
frank wrote:In my hashing attempts I always doubled the size whenever the array got more than half full. Kind of the cheap way out but it worked.
That does sound a bit simpler. Maybe I'll do that instead...

Just to clarify something: A hash table is pretty much the same as an associative array, correct? (The terms seem to be used interchangeably.)
An associative array is an abstraction from the programmer's point of view (i.e. -- it is a convenience for whoever is using it). A hash table is one way to implement associative arrays, but by no means the only way (e.g. -- STL's std::map often uses trees instead).
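To make the distinction concrete, here is a sketch of an associative-array interface in C (all names invented). Nothing in it says how the data is stored:

Code:

struct map;   /* opaque handle: the abstraction the caller sees */

struct map *map_create(void);
void        map_put(struct map *m, const char *key, void *value);
void       *map_get(struct map *m, const char *key);

/* Behind this interface, map_put/map_get could be implemented with a
   hash table, a balanced tree (as std::map typically is), or even a
   plain linked list; the caller cannot tell the difference. */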

Posted: Tue May 29, 2007 8:35 pm
by Alboin
Colonel Kernel wrote:
Alboin wrote:
frank wrote:In my hashing attempts I always doubled the size whenever the array got more than half full. Kind of the cheap way out but it worked.
That does sound a bit simpler. Maybe I'll do that instead...

Just to clarify something: A hash table is pretty much the same as an associative array, correct? (The terms seem to be used interchangeably.)
An associative array is an abstraction from the programmer's point of view (i.e. -- it is a convenience for whoever is using it). A hash table is one way to implement associative arrays, but by no means the only way (e.g. -- STL's std::map often uses trees instead).
But are associative arrays the same as hash tables when implemented using hash tables? That is, what's being abstracted, exactly?

Posted: Tue May 29, 2007 8:44 pm
by Kevin McGuire
No. As shown earlier, you can use a regular array and convert the hash into an index into that array, which shows that an associative array is not necessarily a hash table, though it is most likely used as one.

You are abstracting nothing, as far as I know, but rather using an abstract data type to store the hashes, which may or may not increase performance.

The abstraction governs how you store the hashes and defines a set of operations to manage them (removing, adding, searching).

You are not abstracting the hashing.

"...an associative array maps arbitrarily typed objects to arbitrarily typed objects".

You are using an abstract data type to map a hash to a string.

Posted: Tue May 29, 2007 9:19 pm
by Colonel Kernel
Kevin McGuire wrote:No. As shown earlier, you can use a regular array and convert the hash into an index into that array, which shows that an associative array is not necessarily a hash table, though it is most likely used as one.
No, you have it backwards. A hashtable can be used as an associative array, but so can other data structures. A hashtable is just a data structure with a particular implementation and particular performance characteristics. It is not as "high-level" (i.e. -- as close to the problem domain) as an associative array. It's like saying that a mode of transportation can be used as an airplane -- it's more the other way around.
"...an associative array maps arbitrarily typed objects to arbitrarily typed objects".

You are using an abstract data type to map a hash to a string.
No, the hash function is just how you locate a key in the table; the associative array maps from your key type (I guess string in this case) to your value type, whatever it may be. The key and value could even be the same thing, in which case the C++/STL equivalent would be std::set, not std::map.

Posted: Wed May 30, 2007 7:47 am
by Kevin McGuire
I can understand the point you are making, and among all the fine lines between a hash table and an associative array I find that you are correct in that:

"...an associative array maps arbitrarily typed objects to arbitrarily typed objects." (found at en.wikipedia.org)
That definition represents the purest form of an associative array. It is based on the fact that one of the two parts that make up an association has to be a key, hash, or numerical value, which fits with a hash being a mathematical function.

A hash table has an algorithmic functionality.
An associative array has a structural functionality.

An associative array can be used as a hash table.
(HASHTABLE)ARRAY .. HASHTABLE :: ARRAY
A hash table can be used as an associative array.
(ARRAY)HASHTABLE .. ARRAY :: HASHTABLE

HASHTABLE:ARRAY
It would seem confusing to say that the hash (the value) is entirely represented as an array, linked list, tree, or any other arbitrary type. A higher-level associative array could map an arbitrary type, such as a numerical value, to another arbitrary type, such as a null-terminated string.
You could let the structure and functionality of an associative array be inherited by a hash table. If so, you must allow the hash table to map one arbitrary object to another using hashes. That would be half okay, for the apparent reason that you could call one of the two items that make the association an arbitrary type; but how else do you store a mathematical result than with a number (at least without adding complexity for no reason)? Therefore the hash table has trouble inheriting the entire functionality of an associative array.
ARRAY:HASHTABLE
Of course, if you flip the positions you can make it a lot less complex and more understandable, by saying that the functionality of a hash table is inherited by an associative array, but not its structure, and by adding that the hash table provides an algorithmic functionality rather than a structural form or structural functionality.

Of course I wrote all that trying to prove you wrong, and the best part is that as I was writing it I realized that I myself was completely wrong. It was very amazing, and I very much enjoyed the feeling of excitement as I started frantically looking back and forth between your words and mine, trying to find that last little glimpse of hope that I was still correct in some insignificant technical way that could flaw your entire argument, but unfortunately I was wrong, and your argument was as hard as trying to break concrete with my hands.

Posted: Wed May 30, 2007 8:39 am
by Colonel Kernel
Kevin McGuire wrote:Of course I wrote all that trying to prove you wrong, and the best part is that as I was writing it I realized that I myself was completely wrong. It was very amazing, and I very much enjoyed the feeling of excitement as I started frantically looking back and forth between your words and mine, trying to find that last little glimpse of hope that I was still correct in some insignificant technical way that could flaw your entire argument, but unfortunately I was wrong, and your argument was as hard as trying to break concrete with my hands.
I'm glad to hear it. Of course, you could have saved me a lot of time and trouble by deleting your invalid argument from the rest of your post. ;)

Posted: Wed May 30, 2007 9:21 am
by Kevin McGuire
I am afraid it is all valid. What do you disagree with?

Or in other words, it had already been deleted before I posted it, and any remaining artifacts of my argument were removed.

So I am confused as to what you mean now.

Posted: Wed May 30, 2007 10:37 am
by Colonel Kernel
Kevin McGuire wrote:I am afraid it is all valid. What do you disagree with?

Or in other words, it had already been deleted before I posted it, and any remaining artifacts of my argument were removed.

So I am confused as to what you mean now.
To be honest, the part of your post above the part I quoted made my eyes glaze over. This happens a lot when I read your posts. I'm beginning to think you're in a perpetual state of confusion. :P

In any case, I have to get back to work now... Maybe if I get bored I'll try again to decrypt the rest of your post later. :D

Posted: Wed May 30, 2007 11:26 am
by Kevin McGuire
Just read it like I am talking to myself.

Posted: Wed May 30, 2007 12:49 pm
by Candy
Alboin wrote:Hey, fancy that. Another question.

If to get the index I do hash%table_size, when do I need to change my table size? (I'm using chaining to fix collisions.) Wouldn't all the elements have to fit in the table, because of the modulo?

Also, if I don't change size, then what would be a good table size?

Thanks.
That's because a hash table is in effect a mix between a pure linked list and a direct mapping: the linked list is a hash table with one chain (all hashes are 0), and the direct map is one with, for N possible objects, N possible hashes, so every possible object has its own slot and there can't be a collision; by definition, no chains. Both have really bad cases, and you're mostly trying to find the optimal point for your application:

Making it a linked list gives average lookup times of O(n), which is bad. Making it a direct map when there are huge gaps between objects gives space complexity O(lots), which is even worse. If you fold the space down, you can reduce it to O(n) space without raising the average amortized lookup time significantly beyond O(1). That requires keeping your chains short enough that their length is not related to N. That means if your average chain length gets too large, you should rehash; inversely, if your table has lots of empty slots, you should also rehash.

The point of using primes is to ensure that any input will distribute more or less evenly over the bins. More precisely: inputs arriving with a regular stride will spread over (number of bins) / gcd(number of bins, stride) distinct bins.

That means, if you have objects spaced every 32 bytes or a multiple of that (say you allocated a lot of 24-byte objects, padded to 32 bytes, and stuffed them into a hashtable) and the hash table uses 32 bins with a simple modulus over the pointer, all objects (ALL) will end up in bin 0. If you have 64 bins, they'll end up in 64 / gcd(64, 32) = 2 bins. If you have 63 bins, even though 63 isn't prime (but it is coprime to 32), you'll use all 63 bins.
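You can check those numbers with a few lines of C (the stride and bin counts are the ones from the example above):

Code:

#include <stdio.h>
#include <stdint.h>

/* Euclid's algorithm. */
static uint32_t gcd(uint32_t a, uint32_t b)
{
    while (b) { uint32_t t = a % b; a = b; b = t; }
    return a;
}

int main(void)
{
    uint32_t stride = 32;             /* spacing of the object addresses */
    uint32_t bins[] = { 32, 64, 63 };
    for (int i = 0; i < 3; ++i)
        printf("%u bins, stride %u -> %u bins used\n",
               (unsigned)bins[i], (unsigned)stride,
               (unsigned)(bins[i] / gcd(bins[i], stride)));
    return 0;
}

It prints 1, 2, and 63 bins used, matching the example.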