
Connecting multiple 10G Ethernet links to one processor

Posted: Tue Jan 03, 2017 7:45 pm
by ggodw000
We are working on a design project with a multi-socket system in which several 10G Ethernet links connect to one CPU, while the second CPU has basically no network connection in the hardware design.
There is some concern from select folks about whether this causes congestion, and the debate heated up. IMO this generally does not matter, since the APIC distributes whatever interrupts come from the devices. I was wondering whether there is a performance hit in NUMA-enabled systems, but NUMA is largely concerned with memory locality and has nothing to do with network connectivity. This is compared to a system that has one 10G link connected to each CPU. Any thoughts on this? Thanks.

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Tue Jan 03, 2017 10:24 pm
by dchapiesky
You will want to examine everything associated with DPDK... the Intel Data Plane Development Kit.

They are working not just with 10gig cards but also with 40gig cards.

http://www.dpdk.org/

Here is a paper on 40gig performance on off-the-shelf hardware...

http://perso.telecom-paristech.fr/~dros ... echrep.pdf

Now to the specifics....

DPDK's threading model is one thread per core where the core is exempted from the OS's scheduler.

Threads are run-to-completion and avoid OS syscalls...

In general practice, a whole core - A WHOLE CORE - is dedicated to ***just reading*** the hardware rings on the NICs connected to it...

Another whole core is dedicated to ***just writing*** to the hardware rings of the NICs connected to it...

The DPDK software is very cache aware and the data structures in it are tuned to be cache line aligned.
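
To make that concrete, here is a minimal sketch (mine, not from the DPDK docs) of what such a run-to-completion polling core looks like against the DPDK ethdev API; the port and queue numbers are assumptions:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Run-to-completion loop pinned to one core: poll the NIC's RX ring,
 * do the work inline, then hand the packets to the TX ring.
 * No interrupts and no syscalls inside the loop. */
static int lcore_poll_loop(void *arg)
{
    (void)arg;
    const uint16_t port_id  = 0;   /* assumed port  */
    const uint16_t queue_id = 0;   /* assumed queue */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the hardware RX ring; returns 0..BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;                       /* keep spinning, never sleep */

        /* ... process packets in place ... */

        /* Push them back out on the TX ring; free anything not sent. */
        uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, nb_rx);
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
    return 0;
}

In practice each such loop is launched on its own isolated lcore, so the OS scheduler never preempts it.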

Now towards your NUMA issues....

DPDK takes NUMA into account: it maps huge-page memory (also not under the OS memory manager's control) to physical addresses, finds contiguous pages, and ***notes which physical memory is connected to which physical SOCKET***.

This allows DPDK to allocate a ring buffer and packet buffer that will be accessed by, say, core 3 on socket 0 and be assured that this memory is physically attached to socket 0.

This substantially reduces the NUMA hit your team members are concerned about... but it is a software problem, not a hardware one.
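
As a rough sketch of how that socket affinity is expressed in code (my example; the pool name and sizes are arbitrary assumptions), a DPDK application typically asks the ethdev layer which socket the NIC sits on and allocates its mbuf pool there:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Ask which NUMA socket the NIC on 'port_id' is attached to, then
 * create the mbuf pool from huge pages on that same socket, so the
 * cores driving the port never cross the inter-socket link for buffers. */
static struct rte_mempool *make_local_pool(uint16_t port_id)
{
    int socket_id = rte_eth_dev_socket_id(port_id);
    if (socket_id < 0)
        socket_id = 0;  /* NUMA node unknown; fall back to socket 0 */

    return rte_pktmbuf_pool_create("rx_pool",      /* arbitrary name  */
                                   8192,           /* number of mbufs */
                                   256,            /* per-core cache  */
                                   0,              /* private area    */
                                   RTE_MBUF_DEFAULT_BUF_SIZE,
                                   socket_id);
}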

Ultimately, the argument about why your second, unconnected processor may lag in performance is a question of just what software you run on it...

For example...

socket 0 cores 0 - 8 -- run networking code and filtering algorithms

socket 1 cores 9 - 15 -- run the OS and a database application which consumes filtered events

The lag is one-way: from socket 0 (the network) to socket 1 (the database).

If the database has to communicate at wire speed - it would be better to have some of it running on socket 0.

So... please check out DPDK, and go ask your team just what applications this hardware is supposedly designed for... Although I find it hard to believe the second socket doesn't have its own PCIe connections...

Good luck and cheers!

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Tue Jan 03, 2017 10:28 pm
by dchapiesky
ggodw000 wrote:... but NUMA is largely concerned with memory locality and has nothing to do with network connectivity
Please note that most 10gig/40gig cards connect directly to the CPU socket and write directly into cache.

For a packet to get from socket 0 to socket 1 it must traverse from socket 0's cache to socket 1's cache over the inter-socket link. While this probably doesn't invalidate memory and force a cache flush, it will saturate the inter-socket communications if core 9 on socket 1 is reading the h/w rings of a NIC on socket 0...
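
Outside of DPDK, you can check which socket a NIC hangs off on a stock Linux box, since the kernel exposes it in sysfs. A small sketch, assuming an interface named eth0 (the name is just an example):

#include <stdio.h>

/* Print which NUMA node (socket) a NIC is attached to, as reported by
 * the Linux kernel. A value of -1 means the platform didn't report one. */
int main(void)
{
    FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
    if (!f) {
        perror("numa_node");
        return 1;
    }
    int node;
    if (fscanf(f, "%d", &node) == 1)
        printf("eth0 is attached to NUMA node %d\n", node);
    fclose(f);
    return 0;
}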

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Tue Jan 03, 2017 10:53 pm
by alexfru
+1 to what dchapiesky said. I've worked for a while on a system like that (with many instances of one 10G NIC = one CPU core; poked around NIC drivers and TCP/IP). If I'm not mistaken, we didn't use hyperthreading (disabled it). And some of our old boxes had issues with PCIE shared between NICs and/or some other devices, making it hard to get and maintain the full speed (cheap(er) off-the-shelf hardware may have nasty surprises). DPDK is a way to start.

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Tue Jan 03, 2017 11:56 pm
by Brendan
Hi,
ggodw000 wrote:We are working on a design project with a multi-socket system in which several 10G Ethernet links connect to one CPU, while the second CPU has basically no network connection in the hardware design.
There is some concern from select folks about whether this causes congestion, and the debate heated up. IMO this generally does not matter, since the APIC distributes whatever interrupts come from the devices. I was wondering whether there is a performance hit in NUMA-enabled systems, but NUMA is largely concerned with memory locality and has nothing to do with network connectivity. This is compared to a system that has one 10G link connected to each CPU. Any thoughts on this? Thanks.
If you (e.g.) ask the NIC's "TCP offload engine" to send 64 KiB of data, how much communication does the NIC need to do to fetch all the cache lines, and how much communication does it need to do to issue a single "MSI write" to send an IRQ at the end (if that IRQ isn't skipped due to IRQ rate limiting or something)? I'm fairly sure you'll find that (for "number of transactions across buses/links") reads/writes to memory are more significant than IRQs by multiple orders of magnitude.
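
A back-of-envelope count of those transactions, assuming 64-byte cache lines (my arithmetic, not from the post above):

#include <stdio.h>

int main(void)
{
    /* To DMA a 64 KiB send, the NIC has to read the payload one cache
     * line at a time; it then issues a single MSI write for the IRQ. */
    const unsigned payload_bytes = 64 * 1024;
    const unsigned cache_line    = 64;
    unsigned line_reads = payload_bytes / cache_line;   /* = 1024 */

    printf("cache-line reads: %u, MSI writes: 1 (ratio %u:1)\n",
           line_reads, line_reads);
    return 0;
}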

The only case where IRQs might actually matter is the "extremely large number of extremely tiny packets" case, but for this case the application developer has lost the right to expect acceptable performance (for failing to do anything to combine multiple tiny packets into fewer larger packets); not least of all because the majority of the "10 gigabits of available bandwidth on the wire" will be wasted on inter-packet gaps and packet headers (regardless of what OS or NIC does).
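
To put a number on the "tiny packets" case, here is a quick calculation, assuming minimum-size 64-byte Ethernet frames carrying TCP over IPv4 (my assumptions, not from the post above):

#include <stdio.h>

int main(void)
{
    /* Each minimum-size frame occupies 84 bytes on the wire:
     * 7 preamble + 1 SFD + 64 frame (incl. 4-byte FCS) + 12 inter-frame gap. */
    const double wire_bytes = 84.0;
    const double link_bps   = 10e9;
    /* TCP payload left in a 64-byte frame:
     * 64 - 14 (Ethernet) - 4 (FCS) - 20 (IPv4) - 20 (TCP) = 6 bytes. */
    const double payload_bytes = 6.0;

    double pps     = link_bps / (wire_bytes * 8.0);   /* ~14.88 Mpps     */
    double goodput = pps * payload_bytes * 8.0;       /* ~0.71 Gbit/s    */

    printf("packet rate: %.2f Mpps\n", pps / 1e6);
    printf("TCP goodput: %.2f Gbit/s of 10 Gbit/s (%.0f%%)\n",
           goodput / 1e9, 100.0 * goodput / link_bps);
    return 0;
}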

For congestion on the link between CPUs, don't forget that even the oldest and slowest version of QuickPath is (according to Wikipedia) running at 153.6 Gbit/s, and the NIC's worst-case maximum of 10 Gbit/s is a relatively slow dribble. In practice, what is likely to be far more important is the software overhead of the OS and libraries (and not hardware) - things like routing and firewalls (and anti-virus), and crusty old socket APIs (based on "synchronous read/write").


Cheers,

Brendan

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Wed Jan 04, 2017 1:02 am
by dchapiesky
Brendan wrote: In practice, what is likely to be far more important is the software overhead of the OS and libraries (and not hardware) - things like routing and firewalls (and anti-virus), and crusty old socket APIs (based on "synchronous read/write").
Very true. In DPDK they don't even use interrupts - they poll the NIC's h/w rings continuously (hence the one core for incoming packets and one core for outgoing) to achieve 40gig wire speed.

They are currently attempting to address the issue of power cost and idling in virtual machines, where these polling cores eat up empty cycles on a shared box. That is another thread of discussion altogether.

In any case, I would love to hear more about the hardware ggodw000 opened with... can we know more details?

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Wed Jan 04, 2017 4:03 pm
by ggodw000
Thanks all for the inputs. I know it is not a simple problem, and it can be affected by a conglomeration of many pieces: hardware, kernel, OS, application, and all related features. It will probably take some time to digest all the inputs.
dchapiesky, the system info is really confidential; I wish I could tell more about it. It is a soon-to-be-released product, not some school project.

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Wed Jan 04, 2017 8:31 pm
by dchapiesky
ggodw000 wrote:the system info is really confidential
Glad you are getting paid brother 8) (or sister) (or...)

I love long term projects that you can't tell anyone about...

"How's your project going?" .... "Fine..."

9 months later

"How's your project going?" .... "Fine..."

Cheers and good luck on it.

Seriously... look at *everything* DPDK - particularly DPDK Pktgen (packet generator) on their GitHub - a wire-speed packet storm to test your hardware/software combo.

Re: Connecting multiple 10G Ethernet links to one processor

Posted: Wed Jan 04, 2017 10:48 pm
by ggodw000
dchapiesky wrote:... So... please check out DPDK, and go ask your team just what applications this hardware is supposedly designed for... Although I find it hard to believe the second socket doesn't have its own PCIe connections...
A couple of responses:

- The 2nd CPU does have PCIe connections through its root ports, but nothing is connected to them.
- It is a generic modern server that could be running any application; it is not tailored for a specific application.