Flushing TLB in an SMP environment

rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

As I suspected, minimizing IPIs is the key to performance (at least with my message test application). When I first got it to work, the whole design scaled poorly. Here are some figures comparing AMD Athlon with Intel Atom:

* 1 core Athlon: 6.7 times faster than Atom
* 2 cores Athlon: 8.4 times faster than Atom
* 3 cores Athlon: 6.2 times faster than Atom
* 4 cores Athlon: 5.9 times faster than Atom

As can be seen, the growing number of IPIs as more cores are added makes the application slower with 3 and 4 cores (it is fastest with 2 cores).

The solution to regain scalability is simple. Instead of sending IPIs to all cores when a global page is freed, a smarter algorithm is used: if the destination core has no thread running, or is running the null thread, just set a flag in the core's private data that forces it to reload CR3 the next time it loads a thread, as in the sketch below.
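In rough C (RDOS is not actually written in C, and all the names here - cpu_data, cpu_is_idle, send_invlpg_ipi - are made up for illustration), the send side looks something like this:

Code: Select all

#include <stdbool.h>

#define MAX_CPUS 16

struct cpu_private {
    volatile bool reload_cr3;    /* set when a flush was deferred for this core */
    /* ... other per-core state ... */
};

extern struct cpu_private cpu_data[MAX_CPUS];
extern int ncpus;

extern int  current_cpu(void);
extern bool cpu_is_idle(int cpu);      /* no thread loaded, or running the null thread */
extern void send_invlpg_ipi(int cpu);  /* the conventional shootdown IPI */

/* Called when a global page is freed. */
void global_page_freed(void)
{
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == current_cpu())
            continue;                         /* the local TLB is flushed directly */
        if (cpu_is_idle(cpu))
            cpu_data[cpu].reload_cr3 = true;  /* defer: no IPI needed */
        else
            send_invlpg_ipi(cpu);             /* busy core must flush now */
    }
}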

With this algorithm, the 4-core Athlon runs 11.5 times faster than the Atom. IOW, the design scales well again.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Flushing TLB in an SMP environment

Post by Combuster »

rdos wrote:
the design scales well
Because in the mathematical system of Rdos, 6.7 times 4 equals 11.5 :roll:
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Combuster wrote:
the design scales well
Because in the mathematical system of Rdos, 6.7 times 4 equals 11.5 :roll:
It does scale well, because the test application consists of a single receiver that is bombarded by requests from 8 IPC threads and 8 network threads. Since the lone receiver serializes the work, we would not expect the test to run much faster with 4 cores than with 1; the best we could expect is about twice the speed of a single core, which is almost what is achieved with 4 cores. The important property is instead that performance does not degrade as more cores are added, something that can happen when IPIs are sent between many cores.

What the last fix essentially does is remove IPIs to cores that are idle, and instead set a flag indicating that CR3 needs to be reloaded when the next thread is loaded, as in the scheduler-side sketch below. This minimizes IPIs, and I expect the effect to persist with 8 or 16 cores as well (but I cannot verify this, as I have no such machine).
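The scheduler-side counterpart could look roughly like this (same invented names as before; one caveat: on x86 a plain MOV to CR3 does not evict PTEs with the global bit set, so this assumes such pages are not marked with PTE.G, or that CR4.PGE is toggled as well):

Code: Select all

#include <stdbool.h>

struct thread;                                /* opaque here */
struct cpu_private { volatile bool reload_cr3; };

extern struct cpu_private cpu_data[];
extern int current_cpu(void);
extern unsigned long thread_page_dir(struct thread *t);
extern void write_cr3(unsigned long phys);    /* mov cr3, ... */
extern void context_switch(struct thread *t);

/* Run by each core when it picks the next thread to load. */
void dispatch(struct thread *next)
{
    struct cpu_private *me = &cpu_data[current_cpu()];

    if (me->reload_cr3) {
        me->reload_cr3 = false;               /* consume the deferred flush */
        write_cr3(thread_page_dir(next));     /* the reload flushes stale entries */
    }
    context_switch(next);
}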

An alternative approach to the same problem would be to simply shut down all cores except 2-3, as the remaining cores would only be running idle code. An algorithm for this could be pretty straightforward: if the accumulated null-thread time across the cores is above, say, 1.5 times the elapsed time, there are too many active cores for the present load. A sketch follows below.
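Again with invented names, the check could run periodically from the timer:

Code: Select all

extern unsigned long long null_thread_ticks(void);  /* summed over all active cores */
extern unsigned long long elapsed_ticks(void);
extern int  active_cores(void);
extern void park_one_core(void);                    /* e.g. leave it in HLT */

/* Called periodically, e.g. once per second. */
void balance_cores(void)
{
    static unsigned long long last_null, last_time;
    unsigned long long now_null = null_thread_ticks();
    unsigned long long now_time = elapsed_ticks();
    unsigned long long dnull = now_null - last_null;
    unsigned long long dtime = now_time - last_time;

    last_null = now_null;
    last_time = now_time;

    /* "above 1.5 times the elapsed time": 2*dnull > 3*dtime in integers */
    if (active_cores() > 2 && 2 * dnull > 3 * dtime)
        park_one_core();
}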