
Single address space system and TLB caching

Posted: Sun Sep 20, 2015 7:17 am
by AlexHully
Hi,

I was reading this thread with a lot of interest: http://forum.osdev.org/viewtopic.php?f= ... ext+switch

I gather that we cannot avoid TLB thrashing. But what if the OS caches the addresses used by process X in that process's structure and prefetches them when switching?

We could have fast TLB recovery.

I am interested in 64-bit systems because, obviously, a 32-bit single address space is a bit... pretentious (read: nowhere near enough) for a modern general-purpose OS.

Yes, it would still require some loads, but not the page walk. That in itself is a huge gain (fewer cache misses/CPU stalls).
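
Roughly what I have in mind (just a sketch; every name here is invented):

Code: Select all

#include <stddef.h>
#include <stdint.h>

#define HOT_PAGE_SLOTS 64   /* arbitrary; 8 entries per 64-byte cache line */

/* Per-process record of the most recently used page addresses, kept
   in one contiguous array so walking it is cache-line friendly. */
struct process {
    uintptr_t hot_pages[HOT_PAGE_SLOTS];
    size_t    hot_count;
    /* ... the rest of the process structure ... */
};

/* On a task switch, ask the CPU to start pulling the cached pages
   back in before the process runs again. */
static void warm_tlb(const struct process *p)
{
    for (size_t i = 0; i < p->hot_count; i++)
        __builtin_prefetch((const void *)p->hot_pages[i], 0, 0);
}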

Is that idea a no-go for some reason?

Bye (and thanks)

Re: Single address space system and TLB caching

Posted: Sun Sep 20, 2015 8:37 am
by Combuster
A TLB miss costs a few cycles, and the processor does a lot to figure out which pages it needs before it actually has to wait on them. Chances are that the gains you get from prefetching are smaller than the losses from figuring out what to prefetch in the first place.
AlexHully wrote:I gather that we cannot avoid TLB thrashing.
Thrashing implies excessive misses, which only really happens when a dataset is too large to fit in the TLB. Actual thrashing is usually an algorithmic issue and not really relevant to task switching.

You can however avoid TLB flushing: the regular task-switching method clears out the entire TLB just once per switch. Using small address space (segmentation+paging) or single address space (bytecode) techniques gives you process isolation that doesn't actually require emptying the TLB to switch tasks.

Re: Single address space system and TLB caching

Posted: Sun Sep 20, 2015 9:19 am
by AlexHully
Combuster,

OK, so the CPU is really good at prediction (maybe). Even so, it still has to do the page walk, because at the scheduler level so many things can happen that the CPU cannot predict, or predicts too late.

My solution is to give the final answer without the cost of the page walk. Even when started early and done largely in parallel, the walk is still slow because of the limited number of memory read ports (e.g. Haswell has 2, unlike many CPUs that only have 1; see Agner Fog's instruction tables).

So the prefetch would be interesting: the CPU will reorder anyway, but by the time we start the process again, the most-needed translations would already be on their way into the TLB (or already there), while the rest reloads incrementally, without waiting for a walk and without interfering with the process.

The address array would fit in contiguous memory, so it would be cache-line friendly.

Re: Single address space system and TLB caching

Posted: Sun Sep 20, 2015 9:48 am
by Brendan
Hi,
AlexHully wrote:I gather that we cannot avoid TLB thrashing. But what if the OS caches the addresses used by process X in that process's structure and prefetches them when switching?
Prefetch what, and how?

If you try to prefetch everything you'll just slam into bus bandwidth limits and cripple performance. If you only prefetch a small number of things that you know will definitely be needed (specifically, the pages at user-space RIP and RSP) then it might be slightly beneficial sometimes, in theory.

Also note that (assuming 80x86), for software prefetching, if you ask to prefetch something but the TLB entry isn't present, the CPU ignores the prefetch (and doesn't load the TLB entry). The only way to avoid that (and force the TLB entry to be loaded) is something I call "pre-loading" (e.g. a dummy read, that isn't a prefetch, but where the data loaded isn't actually used and therefore doesn't cause a pipeline stall).
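
To illustrate, a pre-load can be as simple as this (a minimal sketch only, not tested):

Code: Select all

/* Force the TLB entry for the page containing "addr" to be loaded by
   doing a dummy read. Unlike PREFETCHx this is a real load, so the
   CPU must do the page walk if the translation isn't already cached;
   the value itself is thrown away. */
static inline void preload_page(const void *addr)
{
    (void)*(volatile const char *)addr;
}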

The problem is that there are other factors involved (e.g. serialising instructions, limits on the number of pending fetches, etc), and those "pre-loads" will cost something (even if it's just additional instruction fetch/decode costs); and there's no guarantee that the costs will be smaller than the benefits (for "worst case", "average case", or even for "best case").

The other problem is something Intel calls "prefetch scheduling distance". Essentially you need to prefetch early enough that the data is present before it's needed (which means doing something else for maybe 1000 cycles after prefetching/preloading but before returning to user-space), but also late enough that the data isn't fetched and then evicted again before it's used.
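
In other words, the switch path would have to look something like this (illustrative only; the task layout and return_to_user() are invented):

Code: Select all

#include <stdint.h>

struct task { uint64_t user_rip, user_rsp; /* hypothetical layout */ };

static inline void preload_page(const void *addr)
{
    (void)*(volatile const char *)addr;   /* dummy read, as above */
}

void return_to_user(struct task *t);      /* defined elsewhere */

void switch_to(struct task *next)
{
    /* Issue the pre-loads first so the page walks overlap with the
       rest of the switch bookkeeping ("prefetch scheduling distance"). */
    preload_page((const void *)(uintptr_t)next->user_rip);
    preload_page((const void *)(uintptr_t)next->user_rsp);

    /* ... restore FPU state, update accounting, etc. ... */

    return_to_user(next);
}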

Note that recent Intel CPUs support process-context identifiers ("PCIDs", Intel's version of address space IDs); which (if used properly) means that TLB entries for processes aren't flushed when a task switch occurs (and makes the entire idea of a single address space pointless).

Also note that there are "fast TLB misses" (where the data the CPU needs is in caches) that might cost as little as 5 cycles (and be far cheaper than any prefetching/preloading scheme); and there are "slow TLB misses" (where the CPU has to fetch the data for the TLB entry all the way from slow RAM chips) that might cost as much as 1000 cycles. This means that if you're task switching extremely frequently you'd probably want to minimise the number of cache lines you touch (to maximise the chance of "fast TLB misses"); which probably implies that you'd want to avoid touching more cache lines during some sort of TLB prefetching scheme.
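
Going back to PCIDs for a second; using them looks something like this (a sketch only; it assumes long mode with CR4.PCIDE already set):

Code: Select all

#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)   /* keep the new PCID's TLB entries */

/* Switch address spaces without flushing TLB entries tagged with the
   new process's PCID (bits 0-11 of CR3). "pml4_phys" must be the
   4 KiB aligned physical address of the new top-level page table. */
static inline void switch_address_space(uint64_t pml4_phys, uint16_t pcid)
{
    uint64_t cr3 = pml4_phys | (pcid & 0xFFFu) | CR3_NOFLUSH;
    __asm__ volatile("mov %0, %%cr3" :: "r"(cr3) : "memory");
}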

The other thing that may be worth pointing out is that task switching extremely frequently is extremely stupid; and it's far smarter to avoid task switches instead (e.g. use asynchronous communication where task switches can be postponed/skipped until actually necessary; rather than synchronous communication where every single send/receive costs 2 task switches). In other words, in my opinion, avoiding task switches is far more important than doing task switches faster (e.g. 10 task switches per second at 500 cycles each is only 5000 cycles/second spent switching, while 1000 task switches per second at 50 cycles each is 50000 cycles/second). ;)
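
As a sketch of what I mean by asynchronous communication (invented names; single producer/single consumer assumed):

Code: Select all

#include <stdint.h>

#define QUEUE_SLOTS 256u                 /* arbitrary power of two */

struct message { uint64_t payload[4]; };

struct msg_queue {
    struct message slots[QUEUE_SLOTS];
    uint32_t head, tail;
};

/* Asynchronous send: enqueue and keep running. No task switch happens
   here; the scheduler wakes the receiver whenever it decides to, so
   many messages can be batched per switch. */
int async_send(struct msg_queue *q, const struct message *m)
{
    uint32_t next = (q->tail + 1u) % QUEUE_SLOTS;
    if (next == q->head)
        return -1;                       /* full; caller retries later */
    q->slots[q->tail] = *m;
    q->tail = next;
    return 0;
}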


Cheers,

Brendan

Re: Single address space system and TLB caching

Posted: Sun Sep 20, 2015 10:00 am
by AlexHully
Hi Brendan,

I accept your arguments (and Combuster's) and withdraw mine.

So, the only smart thing to do (wait, don't take my word too firmly on that) is to differentiate between compute tasks and user tasks.

One will need responsiveness more than anything else (context_switches++), the other pure speed (context_switches--).

Share them among the cores according to their respective ratios, so that at the end of the road we get a performant experience.

If there are only compute tasks (a server...), we do what we can so that no more processes run than there are CPU cores. Again, we reach plenitude :)

If there are almost only user-driven applications, we can have a huge number of them, since a human will only deal with one or two things concurrently. And as far as the machine is involved it doesn't matter; it is so much faster than us.
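
In rough C, the policy I am imagining (every name here is invented):

Code: Select all

#include <stddef.h>

enum task_kind { TASK_COMPUTE, TASK_INTERACTIVE };

struct task { enum task_kind kind; /* ... */ };

/* Place a new task on the least-loaded core of its class: compute
   tasks are spread out so they are rarely preempted
   (context_switches--), interactive tasks pile onto their cores and
   get switched often for responsiveness (context_switches++).
   "load" is the per-core runnable-task count for the task's class. */
size_t pick_core(const struct task *t, size_t ncores, const size_t *load)
{
    size_t best = 0;
    for (size_t i = 1; i < ncores; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}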

Is it much better that way?