As I suspected, minimizing IPIs is the key to performance (at least with my message test application). When I first got it to work, the whole design scaled poorly. Here are some figures comparing AMD Athlon with Intel Atom:
* 1 core Athlon: 6.7 times faster than Atom
* 2 cores Athlon: 8.4 times faster than Atom
* 3 cores Athlon: 6.2 times faster than Atom
* 4 cores Athlon: 5.9 times faster than Atom
As can be seen, the increasing number of IPIs when more cores are added contributes to the application running slower with 3 and 4 cores (it is fastest with 2 cores).
The solution to regain scalability is simple. Instead of sending IPIs to all cores when a global page is freed, a smarter algorithm is used: if the destination core has no thread running, or is running the null thread, just set a flag in the core's private data that forces it to reload CR3 the next time it loads a thread.
With this algorithm, the 4-core Athlon runs 11.5 times faster than the Atom. IOW, the design scales well again.
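The flag-based approach above can be sketched roughly as follows. This is a minimal illustration, not RDOS's actual code; the structure and function names (`core_data`, `flush_global_page`, `on_thread_load`) are assumptions made for the example.

```c
#include <stdbool.h>

#define MAX_CORES 16

/* Hypothetical per-core private data; field names are illustrative. */
struct core_data {
    bool idle;          /* core is running the null thread (or nothing) */
    bool reload_cr3;    /* deferred TLB-flush flag */
};

struct core_data cores[MAX_CORES];
int ncores = 4;
int ipis_sent = 0;

/* Stand-in for the real IPI mechanism; here it just counts. */
static void send_ipi(int core_id)
{
    (void)core_id;
    ipis_sent++;
}

/* Called when a global page is freed: busy cores get an IPI as before,
   but idle cores merely have a flag set and will reload CR3 when the
   scheduler next loads a thread on them. */
void flush_global_page(int self)
{
    for (int i = 0; i < ncores; i++) {
        if (i == self)
            continue;
        if (cores[i].idle)
            cores[i].reload_cr3 = true;   /* deferred flush, no IPI */
        else
            send_ipi(i);                  /* immediate shootdown */
    }
}

/* Scheduler hook: honor the deferred flush before running a thread. */
void on_thread_load(struct core_data *c)
{
    if (c->reload_cr3) {
        c->reload_cr3 = false;
        /* reload CR3 here to flush the TLB */
    }
}
```

With most cores idle under light load, almost all shootdown IPIs disappear, which is why the figures recover.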
Flushing TLB in an SMP environment
- Combuster
Re: Flushing TLB in an SMP environment
Because in the mathematical system of Rdos, 6.7 times 4 equals 11.5, "the design scales well".
Re: Flushing TLB in an SMP environment
Combuster wrote: Because in the mathematical system of Rdos, 6.7 times 4 equals 11.5, "the design scales well"

It does scale well, because the test application consists of a single receiver that is bombarded by requests from 8 IPC threads and 8 network threads. Because of this, we would not expect it to run much faster with 4 cores than with 1; the best we could expect is about twice the speed of a single core, which is almost what is achieved with 4 cores. The important parameter instead is that performance does not degrade as more cores are added, something that could happen when IPIs are sent between many cores.
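A quick sanity check of that claim, using the Atom-relative figures above: going from 1 core to 4 cores takes the Athlon from 6.7x to 11.5x the Atom's speed, so the speedup relative to a single Athlon core is

```latex
\frac{11.5}{6.7} \approx 1.72
```

which is close to the roughly 2x ceiling that a single-receiver workload allows.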
What the last fix essentially does is remove IPIs to cores that are idle, and instead set a flag to indicate that a CR3 reload is needed when the next thread is loaded. This minimizes IPIs, and I expect the effect to persist with 8 or 16 cores as well (but I cannot verify this, as I have no such machine).
An alternative approach to the same problem would be to simply shut down all cores except 2-3, as the remaining cores would only be running idle code. An algorithm for this could be pretty straightforward: if the accumulated null-thread time across cores is above, say, 1.5 times the elapsed time, there are too many active cores for the present load.
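That heuristic can be expressed as a one-line test. This is a sketch under the assumption that both quantities are measured in the same time units (e.g. timer ticks); the function name is illustrative, not from RDOS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the accumulated null-thread (idle) time across all
   cores exceeds 1.5x the elapsed wall time, i.e. there are too many
   active cores for the present load and some could be shut down.
   Integer form avoids floating point:
   idle > 1.5 * elapsed  <=>  2 * idle > 3 * elapsed */
bool too_many_active_cores(uint64_t idle_ticks_sum, uint64_t elapsed_ticks)
{
    return 2 * idle_ticks_sum > 3 * elapsed_ticks;
}
```

A scheduler could evaluate this periodically and park one core at a time until the condition clears.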