Fastest rep SCASD, 12 million items. AMD x64
Fastest rep SCASD, 12 million items. AMD x64
Hi, I am looking at looping through 12 million dwords to find a specific match.
I know rep SCASD is fast but I was wondering if using prefetch before using custom code would be faster.
I am not sure if rep does prefetch internally??
I was thinking something like
[x], where x is a byte
1. Prefetch [x + (16*4)]
2. Load 16 dwords using 8 x64 registers, starting at [x], then [x + 4], [x + 8], etc..
3. Then OR all x64 registers and do bit test with DWORD to look for, if match goto 10.
4. increase x by (16 * 4)
5. goto step 1 until end.
10. Check each register manually to make sure a match was found
11. No actual match, goto 1.
12. Match found.......
Hope this makes sense.
Any great ideas out there?
Regards, Alistair
I know rep SCASD is fast but I was wondering if using prefetch before using custom code would be faster.
I am not sure if rep does prefetch internally??
I was thinking something like
[x], where x is a byte
1. Prefetch [x + (16*4)]
2. Load 16 dwords using 8 x64 registers, starting at [x], then [x + 4], [x + 8], etc..
3. Then OR all x64 registers and do bit test with DWORD to look for, if match goto 10.
4. increase x by (16 * 4)
5. goto step 1 until end.
10. Check each register manually to make sure a match was found
11. No actual match, goto 1.
12. Match found.......
Hope this makes sense.
Any great ideas out there?
Regards, Alistair
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi,
I can guarantee that on different CPUs you'll get different results. I can also guarantee that any solution that uses MMX, SSE or AVX will not be faster than "rep scasd" on 80486 CPUs.
Cheers,
Brendan
My great idea is: profile it on many different CPUs.tsdnz wrote:Any great ideas out there?
I can guarantee that on different CPUs you'll get different results. I can also guarantee that any solution that uses MMX, SSE or AVX will not be faster than "rep scasd" on 80486 CPUs.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Fastest rep SCASD, 12 million items. AMD x64
Great, thanks.
Re: Fastest rep SCASD, 12 million items. AMD x64
Do you know if rep uses both pipes or does it use one?
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi again,
Some notes...
Most CPUs detect sequential accesses and do hardware pre-fetching (which includes the "rep scasd" case). It's possible (for some CPUs) that software pre-fetching will make things worse.
Hardware pre-fetching will not cross page boundaries, takes about 3 misses to trigger it, and will fetch an unnecessary cache line after you stop. There is room for improvement here.
Software pre-fetching will not pre-fetch at all if there's a TLB miss. Depending on usage; the fastest pre-fetching method may involve doing a dummy read for the first cache line in each page (to trick the CPU into doing a "TLB fetch") followed by software prefetching for the remaining cache lines in the page.
For software pre-fetching, there's an "optimal pre-fetch distance". You don't want to pre-fetch too early (as the pre-fetched data can be evicted before you actually use it) and you don't want to pre-fetch too late (so the data hasn't arrived before you need it). The "optimal pre-fetch distance" is different for different CPUs and also depends on RAM chips, etc (how quickly a CPU can fetch from RAM).
You can/should "peel off" the last iteration of the loop to avoid unnecessarily pre-fetching data past the end of the array.
This code will push all "more likely to be used" data out of the cache; and to prevent that you might want to consider doing "CLFLUSH" on each cache line after you've used that cache line.
Finally; depending on exactly why you're doing this, I wouldn't be surprised if the fastest way is to avoid the need to search the array in the first place.
Cheers,
Brendan
Some notes...
Most CPUs detect sequential accesses and do hardware pre-fetching (which includes the "rep scasd" case). It's possible (for some CPUs) that software pre-fetching will make things worse.
Hardware pre-fetching will not cross page boundaries, takes about 3 misses to trigger it, and will fetch an unnecessary cache line after you stop. There is room for improvement here.
Software pre-fetching will not pre-fetch at all if there's a TLB miss. Depending on usage; the fastest pre-fetching method may involve doing a dummy read for the first cache line in each page (to trick the CPU into doing a "TLB fetch") followed by software prefetching for the remaining cache lines in the page.
For software pre-fetching, there's an "optimal pre-fetch distance". You don't want to pre-fetch too early (as the pre-fetched data can be evicted before you actually use it) and you don't want to pre-fetch too late (so the data hasn't arrived before you need it). The "optimal pre-fetch distance" is different for different CPUs and also depends on RAM chips, etc (how quickly a CPU can fetch from RAM).
You can/should "peel off" the last iteration of the loop to avoid unnecessarily pre-fetching data past the end of the array.
This code will push all "more likely to be used" data out of the cache; and to prevent that you might want to consider doing "CLFLUSH" on each cache line after you've used that cache line.
Finally; depending on exactly why you're doing this, I wouldn't be surprised if the fastest way is to avoid the need to search the array in the first place.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi,
Cheers,
Brendan
Most CPUs optimise the "rep" instructions to work on cache lines. The CPU will probably be stalled (waiting for cache line fetch from L1 or worse) most of the time, regardless of how many pipes are used.tsdnz wrote:Do you know if rep uses both pipes or does it use one?
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Fastest rep SCASD, 12 million items. AMD x64
Thanks, excellent reply's.
The list is a list of Ethernet source IP's that are currently connected.
The list is a list of Ethernet source IP's that are currently connected.
Re: Fastest rep SCASD, 12 million items. AMD x64
The stall is what I was trying to get past using code prefetch.
the flow is a little wrong, I was not going to load into registers, I was going to or registers, oops.
I was hoping the prefetch would arrive quicker than rep was doing it.
I will have to test and find out.
Alistair.
the flow is a little wrong, I was not going to load into registers, I was going to or registers, oops.
I was hoping the prefetch would arrive quicker than rep was doing it.
I will have to test and find out.
Alistair.
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi,
For example, what if you use 2 bytes of the IP address as an index into an array of "pointers to bitmaps", and the other 2 bytes as a "bit number within bitmap"?
Example:
Cheers,
Brendan
Do you really want to do a linear search?tsdnz wrote:The list is a list of Ethernet source IP's that are currently connected.
For example, what if you use 2 bytes of the IP address as an index into an array of "pointers to bitmaps", and the other 2 bytes as a "bit number within bitmap"?
Example:
Code: Select all
checkIPaddress(uint32_t IPaddress) {
uint8_t *bitmap;
unsigned int byteNumber;
unsigned int bitNumber;
bitmap = myTable[IPaddress >> 16];
if(bitmap == NULL) {
return NOT_FOUND;
}
byteNumber = (IPaddress & 0xFFFF) >> 3;
bitNumber = IPaddress & 7;
if( (bitmap[byteNumber] & (1 << bitNumber)) == 0) {
return NOT_FOUND;
}
return FOUND;
}
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi, Would that be a lot of lost space??
These are going to change rapidly, will give it some thought, i like the idea.
Alistair
These are going to change rapidly, will give it some thought, i like the idea.
Alistair
Re: Fastest rep SCASD, 12 million items. AMD x64
Hold on miss read, you said bits!!!
Re: Fastest rep SCASD, 12 million items. AMD x64
Lovely, I like that a lot.
Much faster!!
Much faster!!
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi,
It'd cost "65536*sizeof(uint8_t *)" for the array of pointers, which is 512 KiB for 64-bit pointers. Each bitmap would cost 65536 bits or 8 KiB.
Worst case would be 65536 bitmaps; which works out to "512 KiB + 65536 * 8 KiB" or 512.5 MiB.
Note that this is just the first thing that popped into my head - there's many alternatives...
Cheers,
Brendan
I added example code to my previous post.tsdnz wrote:Hi, Would that be a lot of lost space??
These are going to change rapidly, will give it some thought, i like the idea.
It'd cost "65536*sizeof(uint8_t *)" for the array of pointers, which is 512 KiB for 64-bit pointers. Each bitmap would cost 65536 bits or 8 KiB.
Worst case would be 65536 bitmaps; which works out to "512 KiB + 65536 * 8 KiB" or 512.5 MiB.
Note that this is just the first thing that popped into my head - there's many alternatives...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Fastest rep SCASD, 12 million items. AMD x64
Awesome, thanks a lot.
alistair
alistair
Re: Fastest rep SCASD, 12 million items. AMD x64
Hi just had tea and thought about it.
It needs to hold a 64 bit pointer for each IP address. so the structure is a little large.
Using lists I can have one list for the IP address and another for the structure data.
Also planning to support IPv6
Alistair
It needs to hold a 64 bit pointer for each IP address. so the structure is a little large.
Using lists I can have one list for the IP address and another for the structure data.
Also planning to support IPv6
Alistair