CRC32 Hash
Hi all, I have been using my own hash for a while now and have decided to switch to one that distributes better.
Does anyone have a suitable hashing function?
I found these, which look pretty good:
http://burtleburtle.net/bob/hash/spooky.html
https://code.google.com/p/cityhash/
Then I found that Intel has a built-in CRC instruction, which is better again.
http://lwn.net/Articles/292984/
Has anyone in OS dev land tried the Intel CRC instruction?
I am about to try it now and would be interested in others' findings.
I am testing on an AMD processor, so I need to make sure it is available first.
Alistair.
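One way to check availability at runtime: the CRC32 instruction is part of SSE4.2, which CPUID leaf 1 reports in ECX bit 20. A sketch using GCC/Clang's <cpuid.h> helpers (the function name has_crc32c is mine):

```c
#include <cpuid.h>  /* GCC/Clang CPUID helpers: __get_cpuid(), bit_SSE4_2 */

/* Returns 1 if the CPU supports SSE4.2 (and therefore the CRC32 instruction),
   0 otherwise. Works on both Intel and AMD. */
int has_crc32c(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                       /* leaf 1 not supported */
    return (ecx & bit_SSE4_2) != 0;     /* ECX bit 20 = SSE4.2 */
}
```

On AMD this bit is set from Bulldozer onwards, as noted later in the thread.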
Re: CRC32 Hash
OpenBSD has recently switched most of its tree to SipHash. I think that's worth looking into. I'm assuming you are not talking about cryptographic use.
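For reference, SipHash-2-4 is small enough to sketch in full. This follows the construction from the SipHash paper (2 compression rounds per 8-byte message word, 4 finalization rounds) and assumes a little-endian host:

```c
#include <stdint.h>
#include <string.h>

static uint64_t rotl64(uint64_t x, int b) { return (x << b) | (x >> (64 - b)); }

#define SIPROUND do { \
    v0 += v1; v1 = rotl64(v1, 13); v1 ^= v0; v0 = rotl64(v0, 32); \
    v2 += v3; v3 = rotl64(v3, 16); v3 ^= v2; \
    v0 += v3; v3 = rotl64(v3, 21); v3 ^= v0; \
    v2 += v1; v1 = rotl64(v1, 17); v1 ^= v2; v2 = rotl64(v2, 32); \
} while (0)

/* SipHash-2-4 with a 128-bit key split into k0, k1. */
uint64_t siphash24(const void *data, size_t len, uint64_t k0, uint64_t k1)
{
    uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
    uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
    uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
    uint64_t v3 = k1 ^ 0x7465646279746573ULL;
    const uint8_t *p = data;
    const uint8_t *end = p + (len & ~(size_t)7);
    uint64_t m;

    for (; p != end; p += 8) {              /* whole 8-byte words */
        memcpy(&m, p, 8);                   /* assumes little-endian host */
        v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;
    }
    m = (uint64_t)len << 56;                /* last word: tail bytes + length */
    for (size_t i = 0; i < (len & 7); i++)
        m |= (uint64_t)p[i] << (8 * i);
    v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;

    v2 ^= 0xff;                             /* finalization */
    SIPROUND; SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}
```

Unlike CRC32 or the simple hashes below, the keyed construction makes bucket collisions hard for an attacker to force, which is why OpenBSD adopted it for hash tables fed with untrusted input.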
Re: CRC32 Hash
> OpenBSD has recently switched most of its tree to SipHash
Thanks, nice to know.
> I'm assuming you are not talking cryptographic use
Correct, just using it internally as a bucket selector.
Re: CRC32 Hash
Hi,
tsdnz wrote: Correct, just using it internally as a bucket selector.
For a hash tree, CRC32 is likely to be overkill - the overhead of computing the CRC32 is likely to be higher than the overhead of the slightly higher chance of collisions caused by a much faster method. Of course this depends on the type of data, the amount of data and the number of buckets.
In terms of overhead, for CRC32 you mostly use lookup tables and have a compromise between lookup table size and number of lookups. E.g. for 1234 bytes of data, you could use a 16 entry lookup table and do 2468 lookups, or a 256 entry lookup table and do 1234 lookups, or a 65536 entry lookup table and do 617 lookups, etc. In any case it ends up being "cache pounding".
If the data is numerical, an extremely simple "hash = number % buckets;" is likely to be better, and if the data is a string something like "for each character in string { hash = ( hash ^ (hash << 32) ^ character) % buckets; }" is likely to be better; mostly because additional memory accesses are avoided in both cases.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
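The table-lookup compromise Brendan describes can be sketched concretely. This is a minimal byte-at-a-time table-driven CRC-32 using the 256-entry table size (0xEDB88320 is the reflected form of the standard polynomial 0x04C11DB7); crc32_init() must be called once before hashing:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Build the 256-entry lookup table for the reflected CRC-32 polynomial. */
void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
    }
}

/* One table lookup per input byte - this is the "1234 lookups for
   1234 bytes" case from the post above. */
uint32_t crc32(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t c = 0xFFFFFFFFu;
    while (len--)
        c = crc_table[(c ^ *p++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```

The standard check string "123456789" should hash to 0xCBF43926, which is a handy sanity test for any CRC-32 implementation.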
Re: CRC32 Hash
Brendan wrote: For a hash tree, CRC32 is likely to be overkill - the overhead of computing the CRC32
Yes, as I thought, but as Intel has a built-in instruction I thought this might be the way to go?
Brendan wrote: "for each character in string { hash = ( hash ^ (hash << 32) ^ character) % buckets; }"
This is basically what I have, but I use QWORD(s) then the final bytes.
I am using simple adds without XORs.
Do you think I should add XORs? I was going for performance.
This hash is for string keys.
Code:
BYTE* pData;
QWORD Length;
FIL DWORD Hash()
{
	QWORD r = 0;
	BYTE* pChar = pData;
	DWORD Tail = Length & (sizeof(QWORD) - 1); // bytes left over after the whole QWORDs
	DWORD Loop = Length / sizeof(QWORD);       // number of whole QWORDs
	while (Tail--)
	{
		r = rol(r, 8) + *pChar;
		pChar += sizeof(BYTE);
	}
	while (Loop--)
	{
		r = rol(r, 23) + *(QWORD*)pChar;
		pChar += sizeof(QWORD);
	}
	// Fold the high bits down, then mask to the bucket index width
	return (r ^ (r >> KeyBucketSize)) & (((QWORD)1 << KeyBucketSize) - 1);
}
Re: CRC32 Hash
Hi,
tsdnz wrote: Yes, as I thought, but as Intel has a built-in instruction I thought this might be the way to go?
As far as I know, the CRC32 instruction is only supported by Nehalem and later (it can't be used on earlier CPUs); and while it's about 2 to 3 times faster than calculating a CRC32 without the CRC32 instruction, it's still slower than not using CRC32 at all.
tsdnz wrote: This is basically what I have, but I use QWORD(s) then the final bytes. I am using simple adds without XORs.
It still depends far too much on the data. Is the data people's names, or vehicle registration numbers, or URLs, or entire text files, or...? All of these cases are "strings", and all of them are very very different and would have a different "ideal method of creating a hash".
For example; does it matter if the hashes for the strings "abc", "acb", "bac", "bca", "cab" and "cba" all collide? For people's names or email addresses it probably doesn't matter if they collide (and you might be able to do "hash = sum of individual characters"), but for vehicle registration numbers it probably does matter (but for vehicle registration numbers maybe there's only 36 characters and you can use that to your advantage).
Cheers,
Brendan
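For comparison, here is a sketch of the hardware path using the SSE4.2 intrinsic (the function name crc32c_hw is mine; it computes CRC-32C, the Castagnoli variant, which is the only polynomial the instruction supports). The target attribute lets GCC/Clang compile it without a global -msse4.2 flag, but the CPU must still support SSE4.2 at runtime:

```c
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics, including _mm_crc32_u8 */

/* CRC-32C using the hardware instruction, one byte at a time.
   A faster version would consume 8 bytes per step with _mm_crc32_u64. */
__attribute__((target("sse4.2")))
uint32_t crc32c_hw(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t c = 0xFFFFFFFFu;
    while (len--)
        c = _mm_crc32_u8(c, *p++);
    return c ^ 0xFFFFFFFFu;
}
```

Guard calls to this with a CPUID check (SSE4.2 bit) so it isn't executed on CPUs older than Nehalem/Bulldozer.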
Re: CRC32 Hash
Brendan wrote: It still depends far too much on the data. All of these cases are "strings", and all of them are very very different and would have a different "ideal method of creating a hash".
Good points; my bucket table is large (4 GB), and most keys will be above 12 characters.
I will put counters in to track collisions and make changes if necessary.
Thanks
Re: CRC32 Hash
Hi,
tsdnz wrote: Good points; my bucket table is large (4 GB), and most keys will be above 12 characters. I will put counters in to track collisions and make changes if necessary.
If you assume an average of one entry per bucket and 32 bytes per entry, then ~4 billion buckets would only make sense if you're expecting about 128 GiB of data. In that case, the overhead of creating the hash is going to be irrelevant, because over 75% of hash table lookups will involve fetching from swap space with no way to pre-fetch or predict what will be needed next (and no way to avoid massive performance problems caused by disk IO).
Of course perhaps you don't have that much data, and ~4 billion buckets is just far too many buckets?
When you say "above 12 characters", is that like "between 12 and 15 characters" or more like "between 12 and 12 thousand characters"; and what is a character (upper-case ASCII only with no control characters, where you're struggling to get 6 bits of entropy per character; or Unicode with a wide variety of different languages, where there's a lot more entropy per character)?
Is there any sort of pattern in the strings? For example, maybe 50% of the strings begin with the 5 characters "/home", 25% of the strings begin with the 5 characters "/sys/" and the other 25% of the strings begin with the 5 characters "/usr/"; and you can ignore the first 5 characters completely and it won't make any difference to the number of hash collisions.
Cheers,
Brendan
Re: CRC32 Hash
tsdnz wrote: Has anyone in the OS dev land tried the intel CRC instruction?
Not yet, but I noted that instruction a long while ago. I have my own file format header that contains a CRC checksum field. First I planned on using the "standard" CRC-32 (polynomial 0x04C11DB7), but I guess the "Castagnoli" CRC-32C (polynomial 0x1EDC6F41) is a better choice because it makes the hardware-accelerated checksum calculation possible.
- Kazinsal
Re: CRC32 Hash
tsdnz wrote: Testing on an AMD processor, so need to make sure it is available first.
Bulldozer and newer AMD microarchitectures (Piledriver, Steamroller, and future) support SSE4.2 and the CRC32 instruction.
Re: CRC32 Hash
Hi,
Antti wrote: First I planned using the "standard" CRC-32 (polynomial 0x04C11DB7) but I guess using the "Castagnoli" CRC-32C (polynomial 0x1EDC6F41) is a better choice because it is possible to use that hardware accelerated checksum calculation.
D'oh. In 2008, I wrote a "native file format specification" (which describes a generic header used by all of my OS's own file formats) that uses CRC-32 (polynomial 0x04C11DB7). I've been recycling this specification (and the utilities, etc.) in later versions of the project. I didn't even realise Intel used the less common CRC-32C for their "CRC32" instruction until I read your post.
Today, I switched everything from CRC-32 to CRC-32C.
tsdnz wrote: Has anyone in the OS dev land tried the intel CRC instruction?
I tried it today, mostly to make sure that my software implementation gives the same results as Intel's instruction. Of course performance wasn't important for this, so I only did a simple "one byte at a time" loop.
If you're interested in performance, then you might find this stack overflow question useful (specifically, the example code in the first answer that includes optimised code using Intel's CRC32 instruction in GCC/inline assembly).
Cheers,
Brendan
Re: CRC32 Hash
Brendan wrote: If you're interested in performance, then you might find this stack overflow question useful (specifically, the example code in the first answer that includes optimised code using Intel's CRC32 instruction in GCC/inline assembly).
Mate, nice searching.
Re: CRC32 Hash
Brendan wrote: I wrote a "native file format specification" (which describes a generic header used by all of my OS's own file formats)
As an off-topic aside (as usual): I spent almost a week designing my file format header. The end result looks very simple:
Code:
00000000h: 8D 41 54 0D 0A 73 74 64 0C 31 0C 30 0C 0A 71 F8 ; .AT..std.1.0..q.
00000010h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................
Offset  Length  Description
09h     1       Major file type (standard)
0Bh     1       Minor file type (standard)
10h     8       File size (little endian)
18h     4       CRC-32C
1Ch     4       Start offset of file-type-specific header (if any)
- Author's initials.
- The hex/ASCII layout is eye-pleasing for me.
- The first byte is 0x8D, which indicates that it is not an ASCII file (not very reliable).
- There are "0x0D, 0x0A" and "0x0A" bytes that will detect "end-of-line conversions".
Code:
16-bit disassembly:
00000000 8D4154 lea ax,[bx+di+0x54]
00000003 0D0A73 or ax,0x730a
00000006 7464 jz 0x6c
00000008 0C31 or al,0x31 ; Major file type version can be anything
0000000A 0C30 or al,0x30 ; Minor file type version can be anything
0000000C 0C0A or al,0xa
0000000E 71F8 jno 0x8 ; branch is always taken
32-bit disassembly:
00000000 8D4154 lea eax,[ecx+0x54]
00000003 0D0A737464 or eax,0x6474730a
00000008 0C31 or al,0x31 ; Major file type version can be anything
0000000A 0C30 or al,0x30 ; Minor file type version can be anything
0000000C 0C0A or al,0xa
0000000E 71F8 jno 0x8 ; branch is always taken
64-bit disassembly:
00000000 8D4154 lea eax,[rcx+0x54]
00000003 0D0A737464 or eax,0x6474730a
00000008 0C31 or al,0x31 ; Major file type version can be anything
0000000A 0C30 or al,0x30 ; Minor file type version can be anything
0000000C 0C0A or al,0xa
0000000E 71F8 jno 0x8 ; branch is always taken
- Owen
Re: CRC32 Hash
For reference, ARMv8 implements instructions for both the traditional CRC-32 polynomial (CRC32[BHWX]) and CRC-32C (CRC32C[BHWX]).
(And I'm not sure of the use of an executable header, unless you think somebody might try to run your file under DOS...)
Re: CRC32 Hash
If there were an option to use a "rep" prefix, we could basically just say "calculate a checksum for this memory area". Perhaps this would be more abstract and allow more hardware optimizations.
Code:
PseudoCode:
mov ecx, MEMORY_AREA_LENGTH ; how many bytes
mov esi, MEMORY_AREA ; memory area
rep crc32 [esi] ; not valid!!!