Fastest rep SCASD, 12 million items. AMD x64

Question about which tools to use, bugs, the best way to implement a function, etc should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi,
tsdnz wrote:Hi just had tea and thought about it.

It needs to hold a 64 bit pointer for each IP address. so the structure is a little large.

Using lists I can have one list for the IP address and another for the structure data.

Also planning to support IPv6
In this case I'd use hash tables or trees or something.

For a linear search of 12 million items you can expect to check an average of 6 million entries per search. If each check costs an average of 2 cycles, then it works out to an average of 166 searches per second for a 2 GHz CPU.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Yes, good points. I need to sit down and do some solid calculations and deep thinking. Good fun.

Thanks,

Alistair
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Righty, have decided to use 1 x 256 bucket holding a pointer to a linked-list.
256 * sizeof(QWORD) = 524,288 Bytes

Each IPv4Connection is 42 bytes.
Each pointer (first line) points to a linked-list, each linked-list contains 1024 entries.

So 1024 * 42 = 43,008 bytes per linked-list.
43,008 * 256 = 11,010,048 bytes for the initial setup.

As I am working on IPv4, the TCP Source address bytes will be added together.

What are your thoughts?

Alistair
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Brendan »

Hi,
tsdnz wrote:Righty, have decided to use 1 x 256 bucket holding a pointer to a linked-list.
256 * sizeof(QWORD) = 524,288 Bytes

Each IPv4Connection is 42 bytes.
Each pointer (first line) points to a linked-list, each linked-list contains 1024 entries.

So 1024 * 42 = 43,008 bytes per linked-list.
43,008 * 256 = 11,010,048 bytes for the initial setup.

As I am working on IPv4, the TCP Source address bytes will be added together.

What are your thoughts?
Singly linked lists means an extra 8-byte pointer for each of the 12 million entries, and (for 12 million entries) that will cost an extra 96 MB of RAM just for the "next" pointers. It also means you're no longer doing sequential reads to iterate through the list and you're doing "pointer chasing" instead. This adds up to a pre-fetching nightmare - an average of 50 thousand entries per list means an average of 25 thousand ("un-pre-fetchable") potential cache misses per search.

I'd be tempted to use the "least significant n bits" as a hash; with an array of IP addresses and an array of structures for each hash. I'd want to ensure that the arrays don't have many entries (you'll see why soon); which means that for 12 million entries I'd want the number of hashes to be around 2 million. This works out to using the least significant 21 bits of an IP address as the hash (possibly XORed with the remaining/highest 11 bits for fun).

To keep the size of the main table small I'd only have one pointer per entry; which would point to a structure containing details for that hash (pointer to array of structures, number of entries in both arrays) and would be immediately followed by the array of IP addresses.

The main table would cost "sizeof(void *) * 2 million entries" or 16 MiB. On average, for 12 million IP addresses, you'd have 6 entries per array. That'd be a 12 byte header (8 byte pointer and 4 byte "number of entries" count) and then 4 bytes per IP address; or 36 bytes per list on average (not including the array of unknown structures). This also means that all lists that have 13 or fewer IP addresses fit in a single cache line with the header; and because you're expecting 6 entries per list on average it's unlikely you'll have many lists that use 2 or more cache lines. Note: I'm assuming that the header is aligned on a (64-byte) cache line boundary here - if you're using a generic "malloc()" thing then it'd probably be at least 50% more cache lines touched.

2 million lists at 36 bytes per list is 72 MiB, plus the main table's 16 MiB works out to 88 MiB for everything used for searching (for 12 million entries); not including the array of unknown structures. You're also only touching 2 cache lines for almost all searches.

To make this a bit easier to understand:

Code: Select all

%define HASH_BITS    21

typedef struct {
    /* ??? */
} UNKNOWN_DATA;

typedef struct {
    uint32_t listSize;
    UNKNOWN_DATA *unknownDataArray;
} ENTRY_HEADER;

ENTRY_HEADER *mainTable[1 << HASH_BITS)];

UNKNOWN_DATA *findDataForIPaddress(uint32_t IPaddress) {
    uint32_t hash;
    ENTRY_HEADER *header;
    uint32_t *IParray;
    unsigned int i;

    hash = (IPaddress ^ (IPaddress >> HASH_BITS)) & ((1 << HASH_BITS) - 1);
    header = mainTable[hash];                          // First cache line touched
    IParray = (void *)header + sizeof(ENTRY_HEADER);
                                                       // All memory accesses past here likely to be
                                                       //   contained in same/second cache line
    for(i = 0; i < header->listSize; i++) {
        if(IParray[i] == IPaddress) {
            return &(header->unknownDataArray[i]);
        }
    }
    return NULL;
}

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
bluemoon
Member
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by bluemoon »

tsdnz wrote:Righty, have decided to use 1 x 256 bucket holding a pointer to a linked-list.
256 * sizeof(QWORD) = 524,288 Bytes
My calculator disagree: 256 x sizeof(64-bit pointer) = 2048 bytes
However, 65536 x sizeof(64-bit pointer) = 524288 = 512K bytes, a reasonable budget for 12 millions records.
tsdnz wrote:What are your thoughts?
Check out different data structure algorithms.
With old school linked list with huge amount of records this not only unfriendly to cache and wait state (cpu has to stall/wait for data ready to proceed to next node), it also impose significant overhead per node: 12 millions node with just one "next pointer" consume 96M bytes.

Depends on what action you want to optimize, e.g. add, remove, or scan; and statistical pattern (which may affect tree balancing) or usage of the data (e.g. if scans are performed periodically so you want to scan only newly added record), there may be other better data structure and algorithms better fits.
CWood
Member
Member
Posts: 127
Joined: Sun Jun 20, 2010 1:21 pm

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by CWood »

Assuming memory isn't an issue here, skip lists are an alternative to vanilla linked lists. (Wiki)

According to Wikipedia they're probabilistic, however personally I think the best way to implement this would be in a binary fashion, with the number of "higher-order" lists each element is in being based on the number of bits set to 0, counting from the LSB to the least-significant bit set to 1. I can't think of a better way to describe what I mean, so I will use diagrams (see below), but as a consequence, the structure is deterministic, works when there is no random number generator available, and (may*) get better speeds.

Diagram:

Code: Select all

1 []----------------------[]
2 []----------[]----------[]----------[]
3 []----[]----[]----[]----[]----[]----[]----[]
4 []-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]
With loads as high as yours, you'll probably have to do a lot of performance tweaking, but this is (hopefully) a start.

* Don't quote me on that. Experiments yet to be done.
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

My calculator disagree: 256 x sizeof(64-bit pointer) = 2048 bytes
However, 65536 x sizeof(64-bit pointer) = 524288 = 512K bytes, a reasonable budget for 12 millions records.
Wicked, you passed that test! :D
I was meant to type 65536, LOL.

My link-lists are grouped, eg a list of 1024 then points to another list.

I like Brendan's answer, it does not fit within some of my constraints but I like the idea and need to massage it into place.
And I need performance, with minimal cache misses, very nice.

Time to do some more calcs.

Thanks for all the suggestions, it has really got my thinking.

Alistair
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Brendan, thanks, I like the idea, use a larger bucket.

I have my own memory allocator, a custom OS.
And I need to make things fit.
I will look at a smaller bucket or changing it slightly.

Thanks,

Alistair
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by Combuster »

CWood wrote:According to Wikipedia they're probabilistic, however personally I think the best way to implement this would be in a binary fashion
The probabilistic part is what gives skiplists their safety. Try for instance finding an IPv4 address ending in .0, which means that even if you have a million IP addresses, you'll probably only skip through them by the 128 - for purely technical reasons.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

Hi, not sure skiplists work for me, there are going to be a lot of connections changing.

Alistair
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

According to Wikipedia they're probabilistic, however personally I think the best way to implement this would be in a binary fashion
Yes looking into binary searching.

Alistair
hidnplayr
Posts: 23
Joined: Sat Aug 09, 2008 5:07 pm

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by hidnplayr »

I think what you're looking for is radix trees.

You need this for routing, right?
tsdnz
Member
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Fastest rep SCASD, 12 million items. AMD x64

Post by tsdnz »

I think what you're looking for is radix trees.
Thanks for this idea, it was interesting reading, I never thought of spliting data for such a tree, nice idea.

It does not suit my requirements, I need millions of TCP/IP connections every second.

I have changed my memory manager to suit Brendan's answer using buckets.
I have two memory managers to test it against.

Thanks everyone.
Post Reply