CRC32 Hash

tsdnz · Post by **tsdnz** » Mon Dec 29, 2014 10:25 pm

Hi all, I have been using my own Hash for a while now and have decided to use a better one that distributes better.

Does anyone have a suitable hashing function?
I found these which look pretty good.
http://burtleburtle.net/bob/hash/spooky.html
https://code.google.com/p/cityhash/

Then I found that intel have a built in CRC instruction, which is better again.
http://lwn.net/Articles/292984/

Has anyone in the OS dev land tried the intel CRC instruction?
I am about to try it now and would be interested in others findings.
Testing on an AMD processor, so need to make sure it is available first.

Alistair.

sortie · Post by **sortie** » Mon Dec 29, 2014 10:27 pm

OpenBSD has recently switched most of its tree to Siphash. I think that's worth looking into. I'm assuming you are not talking cryptographic use.

tsdnz · Post by **tsdnz** » Mon Dec 29, 2014 10:35 pm

OpenBSD has recently switched most of its tree to Siphash

Thanks nice to know.

I'm assuming you are not talking cryptographic use

Correct, just using internally for a bucket selector

Brendan · Post by **Brendan** » Mon Dec 29, 2014 10:53 pm

Hi,

tsdnz wrote:
I'm assuming you are not talking cryptographic use
Correct, just using internally for a bucket selector

For a hash tree, CRC32 is likely to be overkill - the overhead of computing the CRC32 is likely to be higher than overhead of "slightly higher chance of collisions" caused by a much faster method. Of course this depends on the type of data, the amount of data and the number of buckets.

In terms of overhead, for CRC32 you mostly use lookup tables and have a compromise between lookup table size and number of lookups. E.g. for 1234 bytes of data, you could use a 16 entry lookup table and do 2468 lookups, or a 256 entry lookup table and do 1234 lookups, or a 65536 entry lookup table and do 617 lookups, etc. In any case it ends up being "cache pounding".

If the data is numerical, an extremely simple "hash = number % buckets;" is likely to be better, and if the data is a string something like "for each character in string { hash = ( hash ^ (hash << 32) ^ character) % buckets; }" is likely to be better; mostly because additional memory accesses are avoided in both cases.

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Mon Dec 29, 2014 11:05 pm

For a hash tree, CRC32 is likely to be overkill - the overhead of computing the CRC32

Yes as I thought, but as intel has a built-in instruction I thought this might be the way to go??

"for each character in string { hash = ( hash ^ (hash << 32) ^ character) % buckets; }"

This is basically what I have, but I use QWORD(s) then final bytes.
I am using simple adds without the XOR's

Do you think I should add XOR's, I was going for performance?
This hash is for string keys.

Code: Select all

BYTE* pData;
					QWORD Length;
FIL DWORD Hash()
					{
						QWORD r = 0;

						BYTE* pChar = pData;

						DWORD ShiftHash = Length & (64 - 1);
						DWORD Loop = Length & ~(64 - 1);

						while (ShiftHash--)
						{
							r = rol(r, 8) + *pChar;
							pChar += sizeof(BYTE);
						}

						while (Loop--)
						{
							r = rol(r, 23) + *(QWORD*)pChar;
							pChar += sizeof(QWORD);
						}

						return (r ^ (r >> KeyBucketSize)) & (((QWORD)1 << KeyBucketSize) - 1);
					}

Edit: It is working really well, I had a few minutes before my shower and thought I would visit this part of my OS.

Brendan · Post by **Brendan** » Tue Dec 30, 2014 10:47 pm

Hi,

tsdnz wrote:
For a hash tree, CRC32 is likely to be overkill - the overhead of computing the CRC32
Yes as I thought, but as intel has a built-in instruction I thought this might be the way to go??

As far as I know, the CRC32 instruction is only supported by Nehalem and later (can't be used on earlier CPUs), and while it's about 2 to 3 times faster than calculating a CRC32 without the CRC32 instruction it's still slower than not using CRC32.

tsdnz wrote:
"for each character in string { hash = ( hash ^ (hash << 32) ^ character) % buckets; }"
This is basically what I have, but I use QWORD(s) then final bytes.
I am using simple adds without the XOR's

It still depends far too much on the data. Is the data people's names, or vehicle registration numbers, or URLs, or entire text files, or...? All of these cases are "strings", and all of them are very very different and would have a different "ideal method of creating a hash".

For example; does it matter if the hashes for the strings "abc", "acb", "bac", "bca", "cab" and "cba" all collide? For people's names or email addresses it probably doesn't matter if they collide (and you might be able to do "hash = sum of individual characters"), but for vehicle registration numbers it probably does matter (but for vehicle registration numbers maybe there's only 36 characters and you can use that to your advantage).

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Tue Dec 30, 2014 10:56 pm

It still depends far too much on the data. Is the data people's names, or vehicle registration numbers, or URLs, or entire text files, or...? All of these cases are "strings", and all of them are very very different and would have a different "ideal method of creating a hash".

For example; does it matter if the hashes for the strings "abc", "acb", "bac", "bca", "cab" and "cba" all collide? For people's names or email addresses it probably doesn't matter if they collide (and you might be able to do "hash = sum of individual characters"), but for vehicle registration numbers it probably does matter (but for vehicle registration numbers maybe there's only 36 characters and you can use that to your advantage).

Good points, my bucket is large, 4 GB, most keys will be above 12 characters.
I will put counters in to track and make changes if necessary.

Thanks

Brendan · Post by **Brendan** » Wed Dec 31, 2014 1:10 am

Hi,

tsdnz wrote:
It still depends far too much on the data. Is the data people's names, or vehicle registration numbers, or URLs, or entire text files, or...? All of these cases are "strings", and all of them are very very different and would have a different "ideal method of creating a hash".

For example; does it matter if the hashes for the strings "abc", "acb", "bac", "bca", "cab" and "cba" all collide? For people's names or email addresses it probably doesn't matter if they collide (and you might be able to do "hash = sum of individual characters"), but for vehicle registration numbers it probably does matter (but for vehicle registration numbers maybe there's only 36 characters and you can use that to your advantage).
Good points, my bucket is large, 4 GB, most keys will be above 12 characters.
I will put counters in to track and make changes if necessary.

If you assume an average of one entry per bucket and 32 bytes per entry, then ~4 billion buckets would only make sense if you're expecting about 128 GiB of data. In that case, the overhead of creating the hash is going to be irrelevant because over 75% of hash table lookups will involve fetching from swap space with no way to pre-fetch or predict what will be needed next (and no way to avoid massive performance problems caused by disk IO).

Of course perhaps you don't have that much data, and ~4 billion buckets is just far too many buckets?

When you say "above 12 characters", is that like "between 12 an 15 characters" or more like "between 12 and 12 thousand characters"; and what is a character (upper-case ASCII only with no control characters where you're struggling to get 6 bits of entropy per character; or Unicode with a wide variety of different languages where there's a lot more entropy per character)?

Is there any sort of pattern in the strings? For example, maybe 50% of the strings begin with the 5 characters "/home", 25% of the strings begin with the 5 characters "/sys/" and the other 25% of the strings begin with the 5 characters "/usr/"; and you can ignore the first 5 characters completely and it won't make any difference to the number of hash collisions.

Cheers,

Brendan

Antti · Post by **Antti** » Wed Dec 31, 2014 4:07 am

tsdnz wrote:Has anyone in the OS dev land tried the intel CRC instruction?

Not yet but I noted that instruction a long while ago. I have my own file format header that contains a CRC checksum field. First I planned using the "standard" CRC-32 (polynomial 0x04C11DB7) but I guess using the "Castagnoli" CRC-32C (polynomial 0x1EDC6F41) is a better choice because it is possible to use that hardware accelerated checksum calculation.

araxestroy · Post by **araxestroy** » Wed Dec 31, 2014 4:52 am

tsdnz wrote: Testing on an AMD processor, so need to make sure it is available first.

Bulldozer and newer (Piledriver, Steamroller, and future) AMD microarchitectures support SSE4.2 and the CRC32 instruction.

Brendan · Post by **Brendan** » Wed Dec 31, 2014 10:49 pm

Hi,

Antti wrote:Not yet but I noted that instruction a long while ago. I have my own file format header that contains a CRC checksum field. First I planned using the "standard" CRC-32 (polynomial 0x04C11DB7) but I guess using the "Castagnoli" CRC-32C (polynomial 0x1EDC6F41) is a better choice because it is possible to use that hardware accelerated checksum calculation.

D'oh. In 2008, I wrote a "native file format specification" (which describes a generic header used by all of my OS's own file formats) that uses CRC-32 (polynomial 0x04C11DB7). I've been recycling this specification (and the utilities, etc) in later versions of the project. I didn't even realise Intel went and used the less common CRC-32C for their "CRC32" instruction until I read your post.

Today, I switched everything from CRC-32 to CRC-32C.

tsdnz wrote:Has anyone in the OS dev land tried the intel CRC instruction?

I tried it today, mostly to make sure that my software implementation gives the same results as Intel's instruction. Of course performance wasn't important for this, so I only did a simple "one byte at a time" loop.

If you're interested in performance, then you might find this stack overflow question useful (specifically, the example code in the first answer that includes optimised code using Intel's CRC32 instruction in GCC/inline assembly).

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Thu Jan 01, 2015 12:55 am

If you're interested in performance, then you might find this stack overflow questionuseful (specifically, the example code in the first answer that includes optimised code using Intel's CRC32 instruction in GCC/inline assembly).

Mate, nice searching.

Antti · Post by **Antti** » Fri Jan 02, 2015 7:13 am

Brendan wrote:I wrote a "native file format specification" (which describes a generic header used by all of my OS's own file formats)

As an offtopic (as usual), I spent almost a week when I designed my file format header. The end result looks very simple:

Code: Select all

00000000h: 8D 41 54 0D 0A 73 74 64 0C 31 0C 30 0C 0A 71 F8  ; .AT..std.1.0..q.
00000010h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ; ................

Offset          Length          Description
09h             1               Major file type (standard)
0Bh             1               Minor file type (standard)
10h             8               File size (little endian)
18h             4               CRC-32C
1Ch             4               Start offset of file-type-specific header (if any)

However, there interesting features in the first 16 bytes:

Author's initials.
Hex/ascii layout is eye-pleasing for me.
The first byte is 0x8D and it indicates that it is not an ASCII file (not very realiable).
There are "0x0D, 0x0A" and "0x0A" bytes that will detect if there are "end-of-line conversions".

But the best part is here:

Code: Select all

16-bit disassembly:

00000000  8D4154            lea ax,[bx+di+0x54]
00000003  0D0A73            or ax,0x730a
00000006  7464              jz 0x6c
00000008  0C31              or al,0x31          ; Major file type version can be anything
0000000A  0C30              or al,0x30          ; Minor file type version can be anything
0000000C  0C0A              or al,0xa
0000000E  71F8              jno 0x8             ; branch is always taken

32-bit disassembly:

00000000  8D4154            lea eax,[ecx+0x54]
00000003  0D0A737464        or eax,0x6474730a
00000008  0C31              or al,0x31          ; Major file type version can be anything
0000000A  0C30              or al,0x30          ; Minor file type version can be anything
0000000C  0C0A              or al,0xa
0000000E  71F8              jno 0x8             ; branch is always taken

64-bit disassembly:

00000000  8D4154            lea eax,[rcx+0x54]
00000003  0D0A737464        or eax,0x6474730a
00000008  0C31              or al,0x31          ; Major file type version can be anything
0000000A  0C30              or al,0x30          ; Minor file type version can be anything
0000000C  0C0A              or al,0xa
0000000E  71F8              jno 0x8             ; branch is always taken

As we can see, it will always end into an endless loop. No undefined behavior even if the header is executed.

Owen · Post by **Owen** » Fri Jan 02, 2015 8:13 am

For reference, ARMv8 implements both the traditional CRC32 polynomial (CRC32[BHWX]) and CRC32-C (CRC32C[BHWX]) instructions.

(And I'm not sure of the use of an executable header, unless you think somebody might try to run your file under DOS...)

Antti · Post by **Antti** » Mon Jan 05, 2015 7:21 am

If there were an option to use a "rep" prefix, then we would basically just say "calculate checksum for this memory area". Perhaps this would be more abstract and there could be more hardware optimizations.

Code: Select all

PseudoCode:
        mov ecx, MEMORY_AREA_LENGTH     ; how many bytes
        mov esi, MEMORY_AREA            ; memory area
        rep crc32 [esi]                 ; not valid!!!

OSDev.org

CRC32 Hash

CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash

Re: CRC32 Hash