connecting multiple ethernet 10g link to one processor

ggodw000
Member
Posts: 396
Joined: Wed Nov 18, 2015 3:04 pm
Location: San Jose San Francisco Bay Area

connecting multiple ethernet 10g link to one processor

Post by ggodw000 »

we are working on a design project with a multi-socket system in which several 10G Ethernet links are connected to one CPU, while the 2nd CPU has no connection in the hardware design.
Some folks are concerned that this could cause congestion, and the debate heated up. IMO this generally does not matter, because the APIC distributes whatever interrupts come from the devices. I also wondered whether there is a performance hit on NUMA-enabled systems, but NUMA is largely concerned with memory locality and has nothing to do with network connectivity. This is compared to a system with one 10G link connected to each CPU. Any thoughts on this? Thanks.
key takeaway after spending yrs in the sw industry: a big issue becomes small because everyone jumps on it and fixes it; a small issue becomes big because everyone ignores it and it causes a catastrophe later. #devilisinthedetails
dchapiesky
Member
Posts: 204
Joined: Sun Dec 25, 2016 1:54 am
Libera.chat IRC: dchapiesky

Re: connecting multiple ethernet 10g link to one processor

Post by dchapiesky »

You will want to examine everything associated with DPDK.... the Intel Data Plane Development Kit

They are working with not just 10gig cards but also 40gig cards

http://www.dpdk.org/

here is a paper on 40gig performance on off the shelf hardware...

http://perso.telecom-paristech.fr/~dros ... echrep.pdf

Now to the specifics....

DPDK's threading model is one thread per core where the core is exempted from the OS's scheduler.

Threads are run-to-completion and avoid OS syscalls...

In general practice, a whole core - A WHOLE CORE - is dedicated to *** just reading*** the hardware rings on the NICs connected to it...

Another whole core is dedicated to *** just writing *** to the hardware rings of NICs connected to it...

The DPDK software is very cache aware and the data structures in it are tuned to be cache line aligned.
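To make that concrete, here is a minimal sketch (not DPDK's own example code; the port, queue and burst size are illustrative, and EAL/port/mempool initialization is assumed to have happened elsewhere) of what a dedicated RX core does - it just spins on rte_eth_rx_burst() forever:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define PORT_ID    0    /* illustrative: first ethdev port */
#define QUEUE_ID   0
#define BURST_SIZE 32

/* Run-to-completion RX loop, launched on a core that the OS scheduler
 * never touches (isolated and pinned via the EAL core mask). */
static int rx_core_loop(void *arg)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    (void)arg;

    for (;;) {
        /* Poll the NIC's hardware RX ring; no interrupts, no syscalls. */
        uint16_t nb_rx = rte_eth_rx_burst(PORT_ID, QUEUE_ID, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... hand bufs[i] to the processing pipeline ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;   /* never reached */
}

The TX side mirrors this with rte_eth_tx_burst() running on its own dedicated core.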

Now towards your NUMA issues....

DPDK takes NUMA into account by mapping Huge Page memory (which is also not under the OS memory manager's control) to physical addresses, finding contiguous pages, and ***noting which physical memory is attached to which physical SOCKET***

This allows DPDK to allocate memory for a ring buffer and packet buffer that will be accessed by, say, core 3 on socket 0, and be assured that this memory is physically attached to socket 0.
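A rough sketch of that socket-aware allocation (names and sizes are illustrative, not taken from any particular DPDK sample): the mbuf pool and the RX descriptor ring are both created on the NUMA node that rte_eth_dev_socket_id() reports for the NIC:

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define NB_MBUFS   8192     /* illustrative pool size */
#define CACHE_SIZE 256
#define RX_DESCS   1024

/* Create the packet-buffer pool and RX queue on the NIC's own socket. */
static struct rte_mempool *setup_rx(uint16_t port_id, uint16_t queue_id)
{
    int socket = rte_eth_dev_socket_id(port_id);  /* socket the NIC hangs off */
    if (socket < 0)
        socket = (int)rte_socket_id();            /* fall back to caller's socket */

    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "rx_pool", NB_MBUFS, CACHE_SIZE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, socket);
    if (pool == NULL)
        return NULL;

    /* The descriptor ring is also allocated from memory local to that socket. */
    if (rte_eth_rx_queue_setup(port_id, queue_id, RX_DESCS,
                               (unsigned int)socket, NULL, pool) != 0)
        return NULL;

    return pool;
}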

This substantially reduces the NUMA hit your team members are concerned about.... but it is a software problem... not a hardware one.

Ultimately, the argument about whether your second, unconnected processor will lag in performance comes down to just what software you run on it....

For example...

socket 0 cores 0 - 8 -- run networking code and filtering algorithms

socket 1 cores 9 - 15 -- run the OS and a database application which consumes filtered events

the lag is one way from socket 0 (the network) to socket 1 (the database)

If the database has to communicate at wire speed - it would be better to have some of it running on socket 0.

So... please check out DPDK, and go ask your team just what applications this hardware is supposedly designed for..... Although I find it hard to believe the second socket doesn't have its own PCIe connections....

Good luck and cheers!
Last edited by dchapiesky on Tue Jan 03, 2017 10:31 pm, edited 1 time in total.
Plagiarize. Plagiarize. Let not one line escape thine eyes...
dchapiesky
Member
Posts: 204
Joined: Sun Dec 25, 2016 1:54 am
Libera.chat IRC: dchapiesky

Re: connecting multiple ethernet 10g link to one processor

Post by dchapiesky »

ggodw000 wrote:..... but NUMA is largely concerned with memory locality and has nothing to do with network connectivity
Please note that most 10gig / 40gig cards connect directly to the CPU socket and can write incoming packets directly into the last-level cache (e.g. via Intel DDIO)

For a packet to get from Socket 0 to Socket 1 it must traverse from Socket 0's cache to Socket 1's cache over the inter-socket links... While this probably doesn't invalidate memory and force a cache flush, it will saturate the inter-socket interconnect if core 9 on socket 1 is reading the h/w rings of a NIC attached to socket 0.....
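As a practical aside (a hypothetical helper, assuming a typical Linux system where the kernel exposes the device's NUMA node in sysfs), this is how you could check which socket a given NIC is attached to before deciding which cores should poll it:

#include <stdio.h>

/* Print the NUMA node the interface's PCI device is attached to.
 * A value of -1 means the platform did not report one. */
int main(int argc, char **argv)
{
    char path[256];
    int node = -1;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <interface>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/sys/class/net/%s/device/numa_node", argv[1]);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return 1;
    }
    if (fscanf(f, "%d", &node) != 1)
        node = -1;
    fclose(f);

    printf("%s is attached to NUMA node %d\n", argv[1], node);
    return 0;
}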
Plagiarize. Plagiarize. Let not one line escape thine eyes...
alexfru
Member
Posts: 1111
Joined: Tue Mar 04, 2014 5:27 am

Re: connecting multiple ethernet 10g link to one processor

Post by alexfru »

+1 to what dchapiesky said. I've worked for a while on a system like that (with many instances of one 10G NIC = one CPU core; poked around NIC drivers and TCP/IP). If I'm not mistaken, we didn't use hyperthreading (disabled it). And some of our old boxes had issues with PCIE shared between NICs and/or some other devices, making it hard to get and maintain the full speed (cheap(er) off-the-shelf hardware may have nasty surprises). DPDK is a way to start.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: connecting multiple ethernet 10g link to one processor

Post by Brendan »

Hi,
ggodw000 wrote:we are working on a design project with a multi-socket system in which several 10G Ethernet links are connected to one CPU, while the 2nd CPU has no connection in the hardware design.
Some folks are concerned that this could cause congestion, and the debate heated up. IMO this generally does not matter, because the APIC distributes whatever interrupts come from the devices. I also wondered whether there is a performance hit on NUMA-enabled systems, but NUMA is largely concerned with memory locality and has nothing to do with network connectivity. This is compared to a system with one 10G link connected to each CPU. Any thoughts on this? Thanks.
If you (e.g.) ask the NIC's "TCP offload engine" to send 64 KiB of data; how much communication does the NIC need to do to fetch all the cache lines and how much communication does the NIC need to do to issue a single "MSI write" to send an IRQ at the end (if that IRQ isn't skipped due to IRQ rate limiting or something)? I'm fairly sure you'll find that (for "number of transactions across buses/links") reads/writes to memory are more significant than IRQs by multiple orders of magnitude.
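To put rough numbers on that (assuming 64-byte cache lines): a 64 KiB send means 65536 / 64 = 1024 cache-line reads the NIC has to pull across the link, plus descriptor fetches and write-backs, versus a single small MSI write for the completion interrupt - roughly three orders of magnitude before you even count the descriptor traffic.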

The only case where IRQs might actually matter is the "extremely large number of extremely tiny packets" case, but for this case the application developer has lost the right to expect acceptable performance (for failing to do anything to combine multiple tiny packets into fewer larger packets); not least of all because the majority of the "10 gigabits of available bandwidth on the wire" will be wasted on inter-packet gaps and packet headers (regardless of what OS or NIC does).
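For a sense of scale (standard Ethernet framing assumed): a minimum-size 64-byte frame costs 64 + 8 (preamble) + 12 (inter-frame gap) = 84 bytes of wire time, so 10 Gbit/s tops out around 10^9 * 10 / (84 * 8) ≈ 14.88 million packets per second, and with only 46 bytes of payload per frame just ~55% of the raw bandwidth carries actual data.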

For congestion on the link between CPUs; don't forget that even the oldest and slowest version of QuickPath is (according to Wikipedia) running at 153.6 Gbit/s, and the NIC's worst case maximum of 10 Gbit/s is a relatively slow dribble. In practice, what is likely to be far more important is the software overhead of the OS and libraries (and not hardware) - things like routing and firewalls (and anti-virus), and crusty old socket APIs (based on "synchronous read/write").


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
dchapiesky
Member
Posts: 204
Joined: Sun Dec 25, 2016 1:54 am
Libera.chat IRC: dchapiesky

Re: connecting multiple ethernet 10g link to one processor

Post by dchapiesky »

Brendan wrote: In practice, what is likely to be far more important is the software overhead of the OS and libraries (and not hardware) - things like routing and firewalls (and anti-virus), and crusty old socket APIs (based on "synchronous read/write").
Very true. In DPDK they don't even use interrupts - they poll the NIC's h/w rings continuously (thus the one core for incoming packets and one core for outgoing) to achieve 40gig wire speed.

They are currently attempting to address the issue of power cost & idling in virtual machines... where these polling cores eat up empty cycles on a shared box. That is another thread of discussion altogether.

In any case I would love to hear more about the hardware ggodw000 opened with... can we know more details?
Plagiarize. Plagiarize. Let not one line escape thine eyes...
ggodw000
Member
Posts: 396
Joined: Wed Nov 18, 2015 3:04 pm
Location: San Jose San Francisco Bay Area

Re: connecting multiple ethernet 10g link to one processor

Post by ggodw000 »

Thanks all for the inputs. I know it is not a simple problem; it can be affected by a conglomeration of many pieces - hardware, kernel, OS, application, and all related features. It will probably take some time to digest all the inputs.
dchapiesky, the system info is really confidential; I wish I could tell more about it. It is a soon-to-be-released product, not some school project.
key takeaway after spending yrs in the sw industry: a big issue becomes small because everyone jumps on it and fixes it; a small issue becomes big because everyone ignores it and it causes a catastrophe later. #devilisinthedetails
dchapiesky
Member
Posts: 204
Joined: Sun Dec 25, 2016 1:54 am
Libera.chat IRC: dchapiesky

Re: connecting multiple ethernet 10g link to one processor

Post by dchapiesky »

ggodw000 wrote:the system info is really confidential
Glad you are getting paid brother 8) (or sister) (or...)

I love long term projects that you can't tell anyone about...

"How's your project going?" .... "Fine..."

9 months later

"How's your project going?" .... "Fine..."

Cheers and good luck on it.

Seriously... *everything* DPDK - particularly DPDK Pktgen (the packet generator) on their GitHub - it can generate a WIRESPEED packet storm to test your hardware/software combo
Plagiarize. Plagiarize. Let not one line escape thine eyes...
ggodw000
Member
Posts: 396
Joined: Wed Nov 18, 2015 3:04 pm
Location: San Jose San Francisco Bay Area

Re: connecting multiple ethernet 10g link to one processor

Post by ggodw000 »

dchapiesky wrote:You will want to examine everything associated with DPDK.... the Intel Data Plane Development Kit

[...]

So... please check out DPDK, and go ask your team just what applications this hardware is supposedly designed for..... Although I find it hard to believe the second socket doesn't have its own PCIe connections....
Couple of responses:

- The 2nd CPU does have PCIe connections through its root ports, but nothing is connected to them.
- It is a generic modern server that could be running any application; it is not tailored for a specific one.
key takeaway after spending yrs in the sw industry: a big issue becomes small because everyone jumps on it and fixes it; a small issue becomes big because everyone ignores it and it causes a catastrophe later. #devilisinthedetails