Fastest rep SCASD, 12 million items. AMD x64

Question about which tools to use, bugs, the best way to implement a function, etc should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Hi, I am looking at looping through 12 million dwords to find a specific match.
I know rep SCASD is fast but I was wondering if using prefetch before using custom code would be faster.
I am not sure if rep does prefetch internally??

I was thinking something like

[x], where x is a byte
1. Prefetch [x + (16*4)]
2. Load 16 dwords using 8 x64 registers, starting at [x], then [x + 4], [x + 8], etc..
3. Then OR all x64 registers and do bit test with DWORD to look for, if match goto 10.
4. increase x by (16 * 4)
5. goto step 1 until end.

10. Check each register manually to make sure a match was found
11. No actual match, goto 1.
12. Match found.......

Hope this makes sense.

Any great ideas out there?

Regards, Alistair
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi,
tsdnz wrote:Any great ideas out there?
My great idea is: profile it on many different CPUs.

I can guarantee that on different CPUs you'll get different results. I can also guarantee that any solution that uses MMX, SSE or AVX will not be faster than "rep scasd" on 80486 CPUs. ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Great, thanks.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Do you know if rep uses both pipes or does it use one?
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi again,

Some notes...

Most CPUs detect sequential accesses and do hardware pre-fetching (which includes the "rep scasd" case). It's possible (for some CPUs) that software pre-fetching will make things worse.

Hardware pre-fetching will not cross page boundaries, takes about 3 misses to trigger it, and will fetch an unnecessary cache line after you stop. There is room for improvement here.

Software pre-fetching will not pre-fetch at all if there's a TLB miss. Depending on usage; the fastest pre-fetching method may involve doing a dummy read for the first cache line in each page (to trick the CPU into doing a "TLB fetch") followed by software prefetching for the remaining cache lines in the page.

For software pre-fetching, there's an "optimal pre-fetch distance". You don't want to pre-fetch too early (as the pre-fetched data can be evicted before you actually use it) and you don't want to pre-fetch too late (so the data hasn't arrived before you need it). The "optimal pre-fetch distance" is different for different CPUs and also depends on RAM chips, etc (how quickly a CPU can fetch from RAM).

You can/should "peel off" the last iteration of the loop to avoid unnecessarily pre-fetching data past the end of the array.

This code will push all "more likely to be used" data out of the cache; and to prevent that you might want to consider doing "CLFLUSH" on each cache line after you've used that cache line.

Finally; depending on exactly why you're doing this, I wouldn't be surprised if the fastest way is to avoid the need to search the array in the first place.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi,
tsdnz wrote:Do you know if rep uses both pipes or does it use one?
Most CPUs optimise the "rep" instructions to work on cache lines. The CPU will probably be stalled (waiting for cache line fetch from L1 or worse) most of the time, regardless of how many pipes are used.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Thanks, excellent reply's.

The list is a list of Ethernet source IP's that are currently connected.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

The stall is what I was trying to get past using code prefetch.

the flow is a little wrong, I was not going to load into registers, I was going to or registers, oops.

I was hoping the prefetch would arrive quicker than rep was doing it.

I will have to test and find out.

Alistair.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi,
tsdnz wrote:The list is a list of Ethernet source IP's that are currently connected.
Do you really want to do a linear search?

For example, what if you use 2 bytes of the IP address as an index into an array of "pointers to bitmaps", and the other 2 bytes as a "bit number within bitmap"?

Example:

Code: Select all

checkIPaddress(uint32_t IPaddress) {
    uint8_t *bitmap;
    unsigned int byteNumber;
    unsigned int bitNumber;

    bitmap = myTable[IPaddress >> 16];
    if(bitmap == NULL) {
        return NOT_FOUND;
    }

    byteNumber = (IPaddress & 0xFFFF) >> 3;
    bitNumber = IPaddress & 7;

    if( (bitmap[byteNumber] & (1 << bitNumber)) == 0) {
        return NOT_FOUND;
    }
    return FOUND;
}

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Hi, Would that be a lot of lost space??

These are going to change rapidly, will give it some thought, i like the idea.

Alistair
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Hold on miss read, you said bits!!!
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Lovely, I like that a lot.

Much faster!!
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi,
tsdnz wrote:Hi, Would that be a lot of lost space??

These are going to change rapidly, will give it some thought, i like the idea.
I added example code to my previous post.

It'd cost "65536*sizeof(uint8_t *)" for the array of pointers, which is 512 KiB for 64-bit pointers. Each bitmap would cost 65536 bits or 8 KiB.

Worst case would be 65536 bitmaps; which works out to "512 KiB + 65536 * 8 KiB" or 512.5 MiB.

Note that this is just the first thing that popped into my head - there's many alternatives... ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Awesome, thanks a lot.

alistair
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Hi just had tea and thought about it.

It needs to hold a 64 bit pointer for each IP address. so the structure is a little large.

Using lists I can have one list for the IP address and another for the structure data.

Also planning to support IPv6

Alistair
Post Reply