Hi,
AlexHully wrote:I can hear that we cannot avoid TLB thrashing. But what if the OS caches the used addresses (by the process X) in the process X's structure and prefetches them when switching?
Prefetch what, and how?
If you try to prefetch everything you'll just slam into bus bandwidth limits and cripple performance. If you only prefetch a small number of things that you know will definitely be needed (specifically, the pages at user-space RIP and RSP) then it might be slightly beneficial sometimes, in theory.
Also note that (assuming 80x86); for software prefetching, if you ask to prefetch something but the TLB entry isn't present, the CPU simply ignores the prefetch (and doesn't load the translation into the TLB). The only way to avoid that (and force the TLB entry to be loaded) is something I call "pre-loading" (e.g. a dummy read, that isn't a prefetch, where the loaded value isn't actually used and therefore doesn't cause a pipeline stall).
The problem is that there are other factors involved (e.g. serialising instructions, limits on the number of pending fetches, etc), and those "pre-loads" will cost something (even if it's just additional instruction fetch/decode costs); and there's no guarantee that the costs will be smaller than the benefits (for "worst case", "average case", or even for "best case").
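As a rough sketch of what such a "pre-load" looks like (the helper name and the explicit page-size parameter are mine, purely for illustration): touch one byte per page with a real read, and feed the result into a dummy accumulator so the compiler can't optimise the reads away.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical "pre-load" helper: perform a real (non-prefetch) read of
   one byte per page, forcing the CPU to walk the page tables and populate
   the TLB entry for each page. The values read are XORed into a dummy
   sink so the volatile reads cannot be optimised away; the result itself
   is never used for anything. */
static inline uint8_t tlb_preload(const void *start, size_t len, size_t page_size)
{
    const volatile uint8_t *p = (const volatile uint8_t *)start;
    uint8_t sink = 0;
    for (size_t off = 0; off < len; off += page_size)
        sink ^= p[off];   /* dummy read: loads the TLB entry as a side effect */
    return sink;
}
```

In the task-switch scenario discussed above you'd point this at the user-space pages for RIP and RSP just before returning to user-space; but (as noted) the instruction fetch/decode cost of the loop itself eats into whatever the pre-loading saves.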
The other problem is something Intel calls "prefetch scheduling distance". Essentially you need to prefetch early enough so that the data is present before it's needed (which means doing something else for maybe 1000 cycles after prefetching/preloading but before returning to user-space), but you also need to prefetch late enough to make sure that the data isn't fetched and then evicted again before it's used.
Note that recent Intel CPUs support a feature called "process-context identifiers" (PCIDs; Intel's form of address space IDs); which (if used properly) means that TLB entries for processes aren't flushed when a task switch occurs (and makes the entire idea of single address space pointless). Also; note that there are "fast TLB misses" (where the data the CPU needs is in caches) that might cost as little as 5 cycles (and be far cheaper than any prefetching/preloading scheme); and there are also "slow TLB misses" (where the CPU has to fetch the data for the TLB entry all the way from slow RAM chips) that might cost as much as 1000 cycles. This means that if you're task switching extremely frequently you'd probably want to minimise the number of cache lines you touch (to maximise the chance of "fast TLB misses"), which probably implies that you'd want to avoid touching more cache lines during some sort of TLB prefetching scheme.
The other thing that may be worth pointing out is that task switching extremely frequently is extremely stupid; and it's far smarter to avoid task switches instead (e.g. use asynchronous communication where task switches can be postponed/skipped until actually necessary; rather than synchronous communication where every single send/receive costs 2 task switches). In other words, in my opinion, avoiding task switches is far more important than doing task switches faster (e.g. 10 task switches per second at 500 cycles each, rather than 1000 task switches per second at 50 cycles each).
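To make that arithmetic explicit (using the illustrative numbers above, not measurements):

```c
/* Total task-switch overhead per second = switches/second * cycles per switch.
   The numbers plugged in below are the rough figures from the text. */
static unsigned long switch_cost_per_second(unsigned long switches_per_sec,
                                            unsigned long cycles_per_switch)
{
    return switches_per_sec * cycles_per_switch;
}
```

With those figures, 10 switches/second at 500 cycles each costs 5000 cycles/second, while 1000 switches/second at 50 cycles each costs 50000 cycles/second; ten times more total overhead, even though each individual switch is ten times faster.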
Cheers,
Brendan