Fastest rep SCASD, 12 million items. AMD x64

Brendan · Post by **Brendan** » Thu Mar 27, 2014 11:03 pm

Hi,

tsdnz wrote:Hi just had tea and thought about it.

It needs to hold a 64 bit pointer for each IP address. so the structure is a little large.

Using lists I can have one list for the IP address and another for the structure data.

Also planning to support IPv6

In this case I'd use hash tables or trees or something.

For a linear search of 12 million items you can expect to check an average of 6 million entries per search. If each check costs an average of 2 cycles, then it works out to an average of 166 searches per second for a 2 GHz CPU.

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Fri Mar 28, 2014 1:15 am

Yes, good points. I need to sit down and do some solid calculations and deep thinking. Good fun.

Thanks,

Alistair

tsdnz · Post by **tsdnz** » Fri Mar 28, 2014 2:06 am

Righty, have decided to use 1 x 256 bucket holding a pointer to a linked-list.
256 * sizeof(QWORD) = 524,288 Bytes

Each IPv4Connection is 42 bytes.
Each pointer (first line) points to a linked-list, each linked-list contains 1024 entries.

So 1024 * 42 = 43,008 bytes per linked-list.
43,008 * 256 = 11,010,048 bytes for the initial setup.

As I am working on IPv4, the TCP Source address bytes will be added together.

What are your thoughts?

Alistair

Brendan · Post by **Brendan** » Fri Mar 28, 2014 6:21 am

Hi,

tsdnz wrote:Righty, have decided to use 1 x 256 bucket holding a pointer to a linked-list.
256 * sizeof(QWORD) = 524,288 Bytes

Each IPv4Connection is 42 bytes.
Each pointer (first line) points to a linked-list, each linked-list contains 1024 entries.

So 1024 * 42 = 43,008 bytes per linked-list.
43,008 * 256 = 11,010,048 bytes for the initial setup.

As I am working on IPv4, the TCP Source address bytes will be added together.

What are your thoughts?

Singly linked lists means an extra 8-byte pointer for each of the 12 million entries, and (for 12 million entries) that will cost an extra 96 MB of RAM just for the "next" pointers. It also means you're no longer doing sequential reads to iterate through the list and you're doing "pointer chasing" instead. This adds up to a pre-fetching nightmare - an average of 50 thousand entries per list means an average of 25 thousand ("un-pre-fetchable") potential cache misses per search.

I'd be tempted to use the "least significant n bits" as a hash; with an array of IP addresses and an array of structures for each hash. I'd want to ensure that the arrays don't have many entries (you'll see why soon); which means that for 12 million entries I'd want the number of hashes to be around 2 million. This works out to using the least significant 21 bits of an IP address as the hash (possibly XORed with the remaining/highest 11 bits for fun).

To keep the size of the main table small I'd only have one pointer per entry; which would point to a structure containing details for that hash (pointer to array of structures, number of entries in both arrays) and would be immediately followed by the array of IP addresses.

The main table would cost "sizeof(void *) * 2 million entries" or 16 MiB. On average, for 12 million IP addresses, you'd have 6 entries per array. That'd be a 12 byte header (8 byte pointer and 4 byte "number of entries" count) and then 4 bytes per IP address; or 36 bytes per list on average (not including the array of unknown structures). This also means that all lists that have 13 or fewer IP addresses fit in a single cache line with the header; and because you're expecting 6 entries per list on average it's unlikely you'll have many lists that use 2 or more cache lines. Note: I'm assuming that the header is aligned on a (64-byte) cache line boundary here - if you're using a generic "malloc()" thing then it'd probably be at least 50% more cache lines touched.

2 million lists at 36 bytes per list is 72 MiB, plus the main table's 16 MiB works out to 88 MiB for everything used for searching (for 12 million entries); not including the array of unknown structures. You're also only touching 2 cache lines for almost all searches.

To make this a bit easier to understand:

Code: Select all

%define HASH_BITS    21

typedef struct {
    /* ??? */
} UNKNOWN_DATA;

typedef struct {
    uint32_t listSize;
    UNKNOWN_DATA *unknownDataArray;
} ENTRY_HEADER;

ENTRY_HEADER *mainTable[1 << HASH_BITS)];

UNKNOWN_DATA *findDataForIPaddress(uint32_t IPaddress) {
    uint32_t hash;
    ENTRY_HEADER *header;
    uint32_t *IParray;
    unsigned int i;

    hash = (IPaddress ^ (IPaddress >> HASH_BITS)) & ((1 << HASH_BITS) - 1);
    header = mainTable[hash];                          // First cache line touched
    IParray = (void *)header + sizeof(ENTRY_HEADER);
                                                       // All memory accesses past here likely to be
                                                       //   contained in same/second cache line
    for(i = 0; i < header->listSize; i++) {
        if(IParray[i] == IPaddress) {
            return &(header->unknownDataArray[i]);
        }
    }
    return NULL;
}

Cheers,

Brendan

bluemoon · Post by **bluemoon** » Fri Mar 28, 2014 8:13 am

tsdnz wrote:Righty, have decided to use 1 x 256 bucket holding a pointer to a linked-list.
256 * sizeof(QWORD) = 524,288 Bytes

My calculator disagree: 256 x sizeof(64-bit pointer) = 2048 bytes
However, 65536 x sizeof(64-bit pointer) = 524288 = 512K bytes, a reasonable budget for 12 millions records.

tsdnz wrote:What are your thoughts?

Check out different data structure algorithms.
With old school linked list with huge amount of records this not only unfriendly to cache and wait state (cpu has to stall/wait for data ready to proceed to next node), it also impose significant overhead per node: 12 millions node with just one "next pointer" consume 96M bytes.

Depends on what action you want to optimize, e.g. add, remove, or scan; and statistical pattern (which may affect tree balancing) or usage of the data (e.g. if scans are performed periodically so you want to scan only newly added record), there may be other better data structure and algorithms better fits.

CWood · Post by **CWood** » Fri Mar 28, 2014 8:59 am

Assuming memory isn't an issue here, skip lists are an alternative to vanilla linked lists. (Wiki)

According to Wikipedia they're probabilistic, however personally I think the best way to implement this would be in a binary fashion, with the number of "higher-order" lists each element is in being based on the number of bits set to 0, counting from the LSB to the least-significant bit set to 1. I can't think of a better way to describe what I mean, so I will use diagrams (see below), but as a consequence, the structure is deterministic, works when there is no random number generator available, and (may*) get better speeds.

Diagram:

Code: Select all

1 []----------------------[]
2 []----------[]----------[]----------[]
3 []----[]----[]----[]----[]----[]----[]----[]
4 []-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]-[]

With loads as high as yours, you'll probably have to do a lot of performance tweaking, but this is (hopefully) a start.

* Don't quote me on that. Experiments yet to be done.

tsdnz · Post by **tsdnz** » Fri Mar 28, 2014 1:26 pm

My calculator disagree: 256 x sizeof(64-bit pointer) = 2048 bytes
However, 65536 x sizeof(64-bit pointer) = 524288 = 512K bytes, a reasonable budget for 12 millions records.

Wicked, you passed that test!

I was meant to type 65536, LOL.

My link-lists are grouped, eg a list of 1024 then points to another list.

I like Brendan's answer, it does not fit within some of my constraints but I like the idea and need to massage it into place.
And I need performance, with minimal cache misses, very nice.

Time to do some more calcs.

Thanks for all the suggestions, it has really got my thinking.

Alistair

tsdnz · Post by **tsdnz** » Fri Mar 28, 2014 1:28 pm

Brendan, thanks, I like the idea, use a larger bucket.

I have my own memory allocator, a custom OS.
And I need to make things fit.
I will look at a smaller bucket or changing it slightly.

Thanks,

Alistair

Combuster · Post by **Combuster** » Fri Mar 28, 2014 2:20 pm

CWood wrote:According to Wikipedia they're probabilistic, however personally I think the best way to implement this would be in a binary fashion

The probabilistic part is what gives skiplists their safety. Try for instance finding an IPv4 address ending in .0, which means that even if you have a million IP addresses, you'll probably only skip through them by the 128 - for purely technical reasons.

tsdnz · Post by **tsdnz** » Fri Mar 28, 2014 3:27 pm

Hi, not sure skiplists work for me, there are going to be a lot of connections changing.

Alistair

tsdnz · Post by **tsdnz** » Fri Mar 28, 2014 3:28 pm

According to Wikipedia they're probabilistic, however personally I think the best way to implement this would be in a binary fashion

Yes looking into binary searching.

Alistair

hidnplayr · Post by **hidnplayr** » Thu Apr 03, 2014 1:40 pm

I think what you're looking for is radix trees.

You need this for routing, right?

tsdnz · Post by **tsdnz** » Thu Apr 03, 2014 2:28 pm

I think what you're looking for is radix trees.

Thanks for this idea, it was interesting reading, I never thought of spliting data for such a tree, nice idea.

It does not suit my requirements, I need millions of TCP/IP connections every second.

I have changed my memory manager to suit Brendan's answer using buckets.
I have two memory managers to test it against.

Thanks everyone.

OSDev.org

Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64

Re: Fastest rep SCASD, 12 million items. AMD x64