Fastest rep SCASD, 12 million items. AMD x64

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 9:07 pm

Hi, I am looking at looping through 12 million dwords to find a specific match.
I know rep SCASD is fast but I was wondering if using prefetch before using custom code would be faster.
I am not sure if rep does prefetch internally??

I was thinking something like

[x], where x is a byte
1. Prefetch [x + (16*4)]
2. Load 16 dwords using 8 x64 registers, starting at [x], then [x + 4], [x + 8], etc..
3. Then OR all x64 registers and do bit test with DWORD to look for, if match goto 10.
4. increase x by (16 * 4)
5. goto step 1 until end.

10. Check each register manually to make sure a match was found
11. No actual match, goto 1.
12. Match found.......

Hope this makes sense.

Any great ideas out there?

Regards, Alistair

Brendan · Post by **Brendan** » Thu Mar 27, 2014 9:24 pm

Hi,

tsdnz wrote:Any great ideas out there?

My great idea is: profile it on many different CPUs.

I can guarantee that on different CPUs you'll get different results. I can also guarantee that any solution that uses MMX, SSE or AVX will not be faster than "rep scasd" on 80486 CPUs.

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 9:40 pm

Great, thanks.

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 9:50 pm

Do you know if rep uses both pipes or does it use one?

Brendan · Post by **Brendan** » Thu Mar 27, 2014 9:50 pm

Hi again,

Some notes...

Most CPUs detect sequential accesses and do hardware pre-fetching (which includes the "rep scasd" case). It's possible (for some CPUs) that software pre-fetching will make things worse.

Hardware pre-fetching will not cross page boundaries, takes about 3 misses to trigger it, and will fetch an unnecessary cache line after you stop. There is room for improvement here.

Software pre-fetching will not pre-fetch at all if there's a TLB miss. Depending on usage; the fastest pre-fetching method may involve doing a dummy read for the first cache line in each page (to trick the CPU into doing a "TLB fetch") followed by software prefetching for the remaining cache lines in the page.

For software pre-fetching, there's an "optimal pre-fetch distance". You don't want to pre-fetch too early (as the pre-fetched data can be evicted before you actually use it) and you don't want to pre-fetch too late (so the data hasn't arrived before you need it). The "optimal pre-fetch distance" is different for different CPUs and also depends on RAM chips, etc (how quickly a CPU can fetch from RAM).

You can/should "peel off" the last iteration of the loop to avoid unnecessarily pre-fetching data past the end of the array.

This code will push all "more likely to be used" data out of the cache; and to prevent that you might want to consider doing "CLFLUSH" on each cache line after you've used that cache line.

Finally; depending on exactly why you're doing this, I wouldn't be surprised if the fastest way is to avoid the need to search the array in the first place.

Cheers,

Brendan

Brendan · Post by **Brendan** » Thu Mar 27, 2014 9:53 pm

Hi,

tsdnz wrote:Do you know if rep uses both pipes or does it use one?

Most CPUs optimise the "rep" instructions to work on cache lines. The CPU will probably be stalled (waiting for cache line fetch from L1 or worse) most of the time, regardless of how many pipes are used.

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 9:56 pm

Thanks, excellent reply's.

The list is a list of Ethernet source IP's that are currently connected.

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 10:01 pm

The stall is what I was trying to get past using code prefetch.

the flow is a little wrong, I was not going to load into registers, I was going to or registers, oops.

I was hoping the prefetch would arrive quicker than rep was doing it.

I will have to test and find out.

Alistair.

Brendan · Post by **Brendan** » Thu Mar 27, 2014 10:05 pm

Hi,

tsdnz wrote:The list is a list of Ethernet source IP's that are currently connected.

Do you really want to do a linear search?

For example, what if you use 2 bytes of the IP address as an index into an array of "pointers to bitmaps", and the other 2 bytes as a "bit number within bitmap"?

Example:

Code: Select all

checkIPaddress(uint32_t IPaddress) {
    uint8_t *bitmap;
    unsigned int byteNumber;
    unsigned int bitNumber;

    bitmap = myTable[IPaddress >> 16];
    if(bitmap == NULL) {
        return NOT_FOUND;
    }

    byteNumber = (IPaddress & 0xFFFF) >> 3;
    bitNumber = IPaddress & 7;

    if( (bitmap[byteNumber] & (1 << bitNumber)) == 0) {
        return NOT_FOUND;
    }
    return FOUND;
}

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 10:11 pm

Hi, Would that be a lot of lost space??

These are going to change rapidly, will give it some thought, i like the idea.

Alistair

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 10:13 pm

Hold on miss read, you said bits!!!

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 10:15 pm

Lovely, I like that a lot.

Much faster!!

Brendan · Post by **Brendan** » Thu Mar 27, 2014 10:18 pm

Hi,

tsdnz wrote:Hi, Would that be a lot of lost space??

These are going to change rapidly, will give it some thought, i like the idea.

I added example code to my previous post.

It'd cost "65536*sizeof(uint8_t *)" for the array of pointers, which is 512 KiB for 64-bit pointers. Each bitmap would cost 65536 bits or 8 KiB.

Worst case would be 65536 bitmaps; which works out to "512 KiB + 65536 * 8 KiB" or 512.5 MiB.

Note that this is just the first thing that popped into my head - there's many alternatives...

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 10:22 pm

Awesome, thanks a lot.

alistair

tsdnz · Post by **tsdnz** » Thu Mar 27, 2014 10:50 pm

Hi just had tea and thought about it.

It needs to hold a 64 bit pointer for each IP address. so the structure is a little large.

Using lists I can have one list for the IP address and another for the structure data.

Also planning to support IPv6

Alistair

OSDev.org

Fastest rep SCASD, 12 million items. AMD x64

Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64