Hellbender wrote: My 2c about the subject (edit: reading back, I see Combuster kind of said this already..)
Unfortunately, I do not feel convinced by some of the arguments. It looks like what you want is quite different from what I do, but I will still answer from my own perspective. This is not comprehensive, but it’s as long a coherent text as I’m currently able to write; sorry if the narrative is sloppy.
Hellbender wrote:
I assume kernel has linearly mapped all the memory into its own address space, so that it never needs to do any changes for itself.
There are systems where this is possible; some (MIPS) even have a built-in kernel-only “unpaged” area in memory. However, x86-32 + PAE is one example where it is not possible to have the whole physical memory space mapped at the same time: virtual addresses are 32 bits (addressing at most 4 GiB), while physical addresses are 36 bits (capable of addressing 64 GiB of memory). Even though not all of the physical address space is actual memory, it can realistically have big holes in it. Couple that with not wanting a separate address space for the kernel, and you’re in trouble.
In theory, x86-64 in its current form suffers from the same problem: (canonical) virtual addresses are 48 bits wide, while physical addresses can be up to 52 bits. I imagine this will be of only theoretical concern for years to come, though.
To cover these cases consistently, I feel it is cleaner for the kernel to maintain a ‘private mapping area’ for temporary mappings. If this area is per-CPU, it won’t need TLB shootdowns at all: the kernel doesn’t need protection from itself. I’m not sure all cases of kernel-private mappings can be handled in this manner, though.
Hellbender wrote:
Third, I feel like normal processes would never need to remove/update page mapping in their virtual address space. They should just be growing heap, growing stack, or mapping shared or private memory into their address space. Unmapping memory does not need to happen immediately as long as the virtual address space is not needed for anything else.
This is true in today’s systems. However, even with current approaches, cheap unmapping could be used to improve some things. For example, implementing read barriers for GC using page faults is not unheard of, but that work dates back some 20 years, so I don’t know whether it is still feasible today.
The main scenario I imagine is different, though. If a process submits data to a device driver using shared memory, the driver might let the process write the data in its final form, but will still want to check its consistency before submitting it to the device itself. Even if this is not needed, letting the user program overwrite data that is being read by a device is probably not a good idea. This means that the submitting process must give up the write rights until the device is done with it. Voilà: decreasing access rights.
Example: for zero-copy networking, let the process itself write TCP, IP and possibly even Ethernet headers, but check that it has the permission to use the specified port and protocol numbers and addresses before pointing the hardware to the data. You will need to ensure that the things you’ve checked won’t change in the interim.
Current systems show that this can be worked around, but I’d still like to have cheap unmapping if possible.
In any case, there will be an unmapping call available from user space, so the machinery needs to be in place. This is still problematic. On one hand, a process should be able to do whatever it wants with its own mappings. On the other hand, if any process that happens to have the right to occupy one processor at a particular moment can cause all processors in the system to stall for thousands of cycles (an IPI-based TLB shootdown), we have a DoS vulnerability, or at least a resource accounting problem.
Hellbender wrote:
Thus the problem reduces to the case when a kernel running on one core decides to remove a mapping of a process address space and should eventually notify other cores about the change.
In my world, the kernel makes almost no decisions by itself, so this is an infrequent case.
Hellbender wrote:
when a process dies, the heap is removed and physical pages are reused.
Make address spaces, not processes: don’t mix protection domains with execution states. Then ‘delete address space’ is an explicit user-level call, because even if nothing is currently running in an address space, that doesn’t mean it’s dead.
Hellbender wrote:
when a thread dies, the stack is removed and physical pages are reused.
The kernel needn’t occupy itself with threads. See the scheduler activations paper for evidence that kernel threads impede user-level threading performance, or the Haskell language runtime for an example of just how lightweight user-level threads can be. (Spoiler: it can handle millions of them without breaking a sweat, and this is enormously useful.) As always, moving stuff out of the kernel generally improves things, but the shrinking amount of trusted code means you still have to ensure the coherency of page protection.
Hellbender wrote:
Finally, virtual address space reuse feels even easier to avoid, just map new stuff to a different address space (48-bits should be enough for everything, right?). If other thread uses the now-unmapped space in another core, it is ill behaving anyway and can be let to access the old physical page (as long as it is not re-used yet).
I must be missing some crucial point here, because I feel like real time TLB invalidation it not really needed that much, as long as the system has enough RAM.
I think I know what the point is: if this is applied in full generality, either the user-level allocator will go mad or it will need to implement an additional level of address mapping just to remember where it put things.