Re: Flushing TLB in an SMP environment
Posted: Sat May 14, 2011 2:01 pm
As I suspected, minimizing IPIs is the key to performance (at least with my message test application). When I first got it to work, the whole design scaled poorly. Here are some figures comparing an AMD Athlon with an Intel Atom:
* 1 core Athlon: 6.7 times faster than Atom
* 2 cores Athlon: 8.4 times faster than Atom
* 3 cores Athlon: 6.2 times faster than Atom
* 4 cores Athlon: 5.9 times faster than Atom
As can be seen, the growing number of IPIs as more cores are added makes the application run slower with 3 and 4 cores (it is fastest with 2 cores).
The solution to regain scalability is simple. Instead of sending IPIs to all cores when a global page is freed, a smarter algorithm is used: if the destination core has no thread running, or is running the null thread, the kernel just sets a flag in that core's private data that forces it to reload CR3 the next time it loads a thread.
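Something like this, as a minimal sketch (not my actual kernel code; the names core_private, cpu_is_idle, send_tlb_flush_ipi and check_deferred_flush are made up for illustration, and it assumes the freed pages are not marked with the G bit, so a plain CR3 reload is enough to flush them):

```c
#include <stdbool.h>

#define MAX_CPUS 32

struct core_private {
    volatile bool reload_cr3;   /* a flush was deferred for this core */
    /* ... other per-core data ... */
};

extern struct core_private cpu[MAX_CPUS];
extern int num_cpus;
extern int current_cpu(void);
extern bool cpu_is_idle(int core);              /* no thread, or the null thread */
extern void send_tlb_flush_ipi(int core, void *addr);

/* Called when a global page mapping is freed. */
void tlb_shootdown(void *addr)
{
    int self = current_cpu();

    for (int core = 0; core < num_cpus; core++) {
        if (core == self)
            continue;

        if (cpu_is_idle(core)) {
            /* The core cannot touch the stale mapping before its next
               thread switch, so defer the flush instead of sending an IPI. */
            cpu[core].reload_cr3 = true;
        } else {
            send_tlb_flush_ipi(core, addr);
        }
    }

    /* Flush the entry on the local core. */
    asm volatile("invlpg (%0)" :: "r"(addr) : "memory");
}

/* Called by the scheduler just before it loads the next thread. */
void check_deferred_flush(void)
{
    struct core_private *me = &cpu[current_cpu()];

    if (me->reload_cr3) {
        me->reload_cr3 = false;
        unsigned long cr3;
        /* Reloading CR3 flushes the non-global TLB entries; pages marked
           with the G bit would need CR4.PGE to be toggled instead. */
        asm volatile("mov %%cr3, %0" : "=r"(cr3));
        asm volatile("mov %0, %%cr3" :: "r"(cr3) : "memory");
    }
}
```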
With this algorithm, 4 cores Athlon runs 11.5 times faster than Atom. IOW, the design scales well again.