
Re: Flushing TLB in an SMP environment

Posted: Sat May 14, 2011 2:01 pm
by rdos
As I suspected, minimizing IPIs is the key to performance (at least with my message test application). When I first got it to work, the whole design scaled poorly. Here are some figures comparing AMD Athlon with Intel Atom:

* 1 core Athlon: 6.7 times faster than Atom
* 2 cores Athlon: 8.4 times faster than Atom
* 3 cores Athlon: 6.2 times faster than Atom
* 4 cores Athlon: 5.9 times faster than Atom

As can be seen, the growing number of IPIs as cores are added makes the application run slower with 3 and 4 cores (it is fastest with 2 cores).

The solution to regain scalability is simple. Instead of sending IPIs to all cores when a global page is freed, a smarter algorithm is used. If the destination core has no thread running, or is running the null thread, just set a flag in the core's private data that forces it to reload CR3 the next time it loads a thread.

With this algorithm, the 4-core Athlon runs 11.5 times faster than the Atom. IOW, the design scales well again.

Re: Flushing TLB in an SMP environment

Posted: Mon May 16, 2011 4:33 am
by Combuster
the design scales well
Because in the mathematical system of Rdos, 6.7 times 4 equals 11.5 :roll:

Re: Flushing TLB in an SMP environment

Posted: Mon May 16, 2011 5:05 am
by rdos
Combuster wrote:
the design scales well
Because in the mathematical system of Rdos, 6.7 times 4 equals 11.5 :roll:
It does scale well because the test application consists of a single receiver that is bombarded by requests from 8 IPC threads and 8 network threads. Because of this we would not expect it to run much faster with 4 cores than with 1. The best we could expect is about twice the speed of a single core, which is almost what is achieved with 4 cores. The important point is instead that performance does not degrade as more cores are added, something that can happen when IPIs are sent between many cores.

What the last fix essentially does is remove IPIs to cores that are idle, and instead set a flag indicating that a CR3 reload is needed when the next thread is loaded. This minimizes IPIs, and I expect the effect to persist with 8 or 16 cores as well (but I cannot verify this as I have no such machine).

An alternative approach to the same problem would be to simply shut down all cores except for 2-3, as the remaining cores would only be running idle code. An algorithm for this could be pretty straightforward. If the accumulated null-thread time across cores is above, say, 1.5 times the elapsed time, there are too many active cores for the present load.