Flushing TLB in an SMP environment

rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Flushing TLB in an SMP environment

Post by rdos »

This is pretty simple in a single-processor environment. Just reload cr3 (or use INVLPG) and you're done. Now that my SMP scheduler is stable, this seems like the next logical "target" in order to provide a stable environment. The simplest solution would be to just send IPIs to the other cores as cr3 is reloaded, and let those IPIs reload cr3 on the target core. Would this be an adequate solution, or would it need to be more precise? My current solution only reloads cr3; it doesn't use the more precise INVLPG, as I had problems with that before.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Flushing TLB in an SMP environment

Post by gerryg400 »

In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:This is pretty simple in a single-processor environment. Just reload cr3 (or use INVLPG) and you're done. Now that my SMP scheduler is stable, this seems like the next logical "target" in order to provide a stable environment. The simplest solution would be to just send IPIs to the other cores as cr3 is reloaded, and let those IPIs reload cr3 on the target core. Would this be an adequate solution, or would it need to be more precise? My current solution only reloads cr3; it doesn't use the more precise INVLPG, as I had problems with that before.
A TLB miss is expensive. For example:
  • Do TLB lookup, get TLB miss
  • Do cache lookup for page directory entry, get cache miss
  • Do "fetch from RAM" for page directory entry
  • Wait until RAM responds
  • Do cache lookup for page table entry, get cache miss
  • Do "fetch from RAM" for page table entry
  • Wait until RAM responds
For best performance on single-CPU (for "80486 or later"), you should be using INVLPG where possible to avoid flushing TLB entries for no reason (and to avoid lots of expensive TLB misses).

For best performance on single-CPU (for "P6 or later"), you should be using "global" pages (where possible) so that TLB entries for pages that are the same in all virtual address spaces aren't flushed when you change CR3. In this case, you have to use INVLPG to flush individual "global" pages (as reloading CR3 won't flush them) and if you must flush all TLB entries (e.g. you changed a lot of page directory entries in kernel space and flushing everything is faster than doing INVLPG thousands of times) you should toggle the PGE flag in CR4.
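The "toggle the PGE flag in CR4" full flush can be sketched like this. The register accessors here only simulate CR4 so the sketch is runnable; in a real kernel `read_cr4`/`write_cr4` would be inline assembly (`mov %%cr4, %0` / `mov %0, %%cr4`), and the names are illustrative, not from any particular kernel:

```c
#include <assert.h>

#define CR4_PGE (1ul << 7)   /* Page Global Enable bit in CR4 */

/* Simulated CR4; stands in for real privileged register access. */
static unsigned long cr4_shadow = CR4_PGE;
static unsigned long read_cr4(void)    { return cr4_shadow; }
static void write_cr4(unsigned long v) { cr4_shadow = v; }

/* Flush the entire TLB, including "global" entries that a CR3 reload
 * would leave intact: clear PGE, then restore the original value. */
void tlb_flush_all_global(void)
{
    unsigned long cr4 = read_cr4();
    write_cr4(cr4 & ~CR4_PGE);  /* clearing PGE invalidates all TLB entries */
    write_cr4(cr4);             /* restore, re-enabling global pages */
}
```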

Some notes:
  1. Some CPUs (e.g. Cyrix) do remember if a page is "not present", and you do need to invalidate TLB entries when you change a page from "not present" to "present".
  2. If you use the "self-reference trick" (e.g. a page directory entry that points to the page directory itself) to create a "paging structure mapping", then if you change a page directory entry you need to flush up to 1024 TLB entries for the affected area *plus* 1 TLB entry in the "paging structure mapping" area.
For SMP, it's the same as single CPU except that an IPI is involved to make sure the TLB entries are invalidated on all CPUs and not just one. Receiving an IPI is as expensive as receiving any other IRQ (e.g. it causes a full pipeline flush, followed by IDT and GDT lookups and protection checks, followed by the overhead of the interrupt handler itself). If a CPU doing useful work sends an average of 10 IPIs per second, then 32 CPUs will probably send an average of 320 IPIs per second. If those IPIs are received by all CPUs except the sender; then for 2 CPUs you'd get "2 * 1 * 10 = 20" IPIs received per second, for 4 CPUs you'd get "4 * 3 * 10 = 120" IPIs received per second, and for 128 CPUs you'd get "128 * 127 * 10 = 162560" IPIs received per second. It ends up being quadratic overhead (a scalability nightmare for large systems). Basically, it's important to avoid unnecessary IPIs.
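The arithmetic above is just "senders times receivers times rate", which makes the n * (n - 1) growth explicit:

```c
#include <assert.h>

/* Total IPIs received per second if each of n CPUs sends `rate` shootdown
 * IPIs per second and every IPI is delivered to the other n - 1 CPUs.
 * The growth is n * (n - 1), i.e. quadratic in the CPU count. */
long ipis_received_per_second(long n, long rate)
{
    return n * (n - 1) * rate;
}
```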

The main way of avoiding IPIs is called "lazy TLB invalidation". If using stale TLB information will cause a page fault, then the page fault handler can check if the page fault was caused by a stale TLB entry and invalidate the TLB entry itself, and you don't need to send an IPI. This means you don't need to send an IPI to other CPU/s if you change a page from "not present" to "present", or from "supervisor" to "user", or from "read-only" to "read/write", or from "no-execute" to "executable" (or any combination of these). That roughly halves the number of IPIs. You do get some page faults instead, but only when the CPU actually does have the info in its TLB and only if the TLB entry is used (so, the number of page faults is a lot less than the number of IPIs you would've received if you weren't using "lazy TLB invalidation").
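The page fault handler's "was this just a stale TLB entry?" check can be sketched as follows. The page-table lookup and INVLPG are stubbed out so the logic is runnable; in a real kernel `pte_lookup` would walk the page tables and `invlpg` would execute the instruction, and these names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u
#define PTE_WRITE   0x2u

/* Simulated single-page page table and INVLPG stub. */
static uint32_t fake_pte;
static int invlpg_calls;

static uint32_t pte_lookup(uintptr_t vaddr) { (void)vaddr; return fake_pte; }
static void invlpg(uintptr_t vaddr)         { (void)vaddr; invlpg_calls++; }

/* Lazy TLB invalidation: if the in-memory PTE already permits the faulting
 * access, the fault must have been caused by a stale TLB entry; flush that
 * entry locally and resume - no IPI was ever needed. */
bool handle_stale_tlb_fault(uintptr_t vaddr, bool write_access)
{
    uint32_t pte = pte_lookup(vaddr);
    if ((pte & PTE_PRESENT) && (!write_access || (pte & PTE_WRITE))) {
        invlpg(vaddr);   /* refresh this CPU's TLB entry */
        return true;     /* fault fully handled */
    }
    return false;        /* a genuine fault: hand it to the memory manager */
}
```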

There are other ways of avoiding IPIs in specific situations. If you change a page table for the currently running process, and if that process has only one thread, then it's impossible for any other CPUs to have TLB entries for that process and you don't need to send any IPI. If an OS implements "thread local storage" by giving different threads in the process (slightly) different virtual address spaces then you end up with a similar situation - if thread local storage is changed for one thread, then no other CPU can be running that thread and no IPIs are necessary. In a similar way, if a process (with multiple threads) has a "CPU affinity" that prevents those threads from being run on some CPUs, then (if the CPU affinity can't change and cause race conditions) you only need to send IPIs to CPUs that are in the process' CPU affinity (and not all CPUs).

There are also more advanced/complex schemes that involve modifying the local APIC's "logical APIC ID" during task switches, so that some of the bits are used to store a small hash of the process ID. This allows you to send IPIs using the "logical destination mode" to a subset of all CPUs instead of all of them when the TLB entry belongs to a multi-threaded process. For example, if 4 of the "logical APIC ID" bits are set to "1 << (process ID & 3)" during task switches, then you'd get rid of 75% of the IPIs received for TLB invalidation in user-space. Unfortunately, for x2APIC (unlike xAPIC) the "logical APIC ID" is hard-wired and this won't work. Fortunately, for x2APIC the hard-wired "logical APIC ID" is well suited to NUMA optimisations (e.g. you can broadcast an IPI to a subset of CPUs within a specific NUMA domain, and if a process' threads are constrained to a specific NUMA domain then it's easy to avoid sending unnecessary IPIs to other NUMA domains).
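The "1 << (process ID & 3)" hash mentioned above is a one-liner; on a task switch a kernel would write this value into four spare bits of the xAPIC logical APIC ID, and a shootdown in logical destination mode then only interrupts CPUs whose hash bit matches (on average 1 in 4 of them). The function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit hash of the process ID, as described in the post: exactly one of
 * four bits is set, chosen by the low two bits of the process ID. CPUs
 * running threads of the same process end up with the same bit set, so a
 * logical-destination IPI with this mask reaches only (roughly) them. */
uint32_t tlb_shootdown_hash(uint32_t process_id)
{
    return 1u << (process_id & 3);
}
```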


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Thanks, Brendan. That seems to more or less sum this problem up.
Brendan wrote:For best performance on single-CPU (for "P6 or later"), you should be using "global" pages (where possible) so that TLB entries for pages that are the same in all virtual address spaces aren't flushed when you change CR3. In this case, you have to use INVLPG to flush individual "global" pages (as reloading CR3 won't flush them) and if you must flush all TLB entries (e.g. you changed a lot of page directory entries in kernel space and flushing everything is faster than doing INVLPG thousands of times) you should toggle the PGE flag in CR4.
Ah, now I remember. This was what I had problems with a year or so ago when I tried global pages. The whole system seemed to become unstable, and there was no easy way to pinpoint exactly what went wrong. Probably the typical case of "if you make too many changes at the same time, there is no way of knowing which one broke the system".

I probably need to start out by just enabling the global-page feature (CR4.PGE) without setting the global bit in any page tables, and then move over one table at a time.
Brendan wrote:The main way of avoiding IPIs is called "lazy TLB invalidation". If using stale TLB information will cause a page fault, then the page fault handler can check if the page fault was caused by a stale TLB entry and invalidate the TLB entry itself, and you don't need to send an IPI. This means you don't need to send an IPI to other CPU/s if you change a page from "not present" to "present", or from "supervisor" to "user", or from "read-only" to "read/write", or from "no-execute" to "executable" (or any combination of these). That roughly halves the number of IPIs. You do get some page faults instead, but only when the CPU actually does have the info in its TLB and only if the TLB entry is used (so, the number of page faults is a lot less than the number of IPIs you would've received if you weren't using "lazy TLB invalidation").
Yes, I already use this method. The current TLB-invalidation calls are related to freeing pages only.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Flushing TLB in an SMP environment

Post by gerryg400 »

rdos wrote:
gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
It doesn't just apply to ISRs or the kernel. Let's say that 2 user threads are running at the same time on different cores and that they are using the same address space. If one thread frees a page, you need to make sure that that page has been removed from both (actually all) TLBs before it is added to the free list. This requires some sort of synchronisation after the IPI.
If a trainstation is where trains stop, what is a workstation ?
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

gerryg400 wrote:
rdos wrote:
gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
It doesn't just apply to ISRs or the kernel. Let's say that 2 user threads are running at the same time on different cores and that they are using the same address space. If one thread frees a page, you need to make sure that that page has been removed from both (actually all) TLBs before it is added to the free list. This requires some sort of synchronisation after the IPI.
Yes, you are correct, except that applications will never free pages (unless they use special APIs). The C/C++ heap in OpenWatcom is currently implemented without ever freeing pages. But there is another (similar) problem, which has to do with demand-loading pages into the application image. Two cores could try to demand-load the same page at roughly the same time, and this needs to be handled in some way. The single-CPU way would be to disable interrupts, but that no longer works with multiple cores. This is the (last?) issue I have. I need to go through every cli/sti to make sure it is not used to protect code from concurrent access, and add spinlocks where it is.
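The demand-load race can be closed with the usual "check, lock, re-check" pattern: after taking the per-address-space lock, re-read the PTE, because the other core may have completed the load while you were waiting. The lock/unlock stubs and all names here are illustrative, not from any particular kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated state: whether the page is mapped, and how many loads ran. */
static bool page_present;
static int  loads_performed;

static void lock(void)   {}   /* spin_lock(&as->fault_lock) in a kernel   */
static void unlock(void) {}   /* spin_unlock(&as->fault_lock)             */

/* Two cores may fault on the same page at roughly the same time; only the
 * first one past the re-check actually reads the page from the image. */
void demand_load_page(uintptr_t vaddr)
{
    (void)vaddr;
    if (page_present)          /* fast path: already loaded by someone */
        return;
    lock();
    if (!page_present) {       /* re-check: we may have lost the race */
        loads_performed++;     /* read page from image and map it here */
        page_present = true;
    }
    unlock();
}
```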

EDIT: I don't see why user code (which by definition should execute with interrupts enabled, without any pending interrupts that could stop an IPI from being served) would need to make sure that another user thread executing on another core has flushed its TLB.

This code should be enough to ensure this:

Code:

   for each active core
        SendFlushIPI
   mov eax,cr3
   mov cr3,eax

; it should be safe to assume that any non-ISR based code would not use stale TLB entries on any core at this point
A better (more selective) variant for TLB entries tied to the private process space would be this (this includes application memory, and kernel memory allocated for a specific process):

Code:

   for each active core
       if (core.cr3 == cr3)
           SendFlushIPI

   mov eax,cr3
   mov cr3,eax
Another remark on this issue: it is an error in the application if one thread tries to use memory that another thread has just freed. I have run extensive testing on this just to make sure that our terminal application will not reference freed memory, using a special dynamic heap allocator that allocates all requests at the page level, and really does free the pages and invalidate the TLB when a memory block is freed. Therefore, I'm pretty sure this error is rather uncommon.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Flushing TLB in an SMP environment

Post by Owen »

True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
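The "hot list" idea amounts to epoch-based reclamation: stamp each freed frame with the epoch in which it was freed, have every core bump its own epoch when it passes through the scheduler, and only reuse a frame once every core's epoch has advanced past the stamp. This is a sketch under those assumptions; all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

/* Each core's epoch is bumped every time it schedules (and thus reloads
 * CR3 or otherwise drops stale TLB entries). */
static unsigned long core_epoch[NCPUS];

struct hot_frame {
    unsigned long freed_epoch;  /* epoch at which the frame was freed */
};

void scheduler_tick(int cpu) { core_epoch[cpu]++; }

/* A frame is safe to reuse only once every core has scheduled since it
 * was freed, so no core can still hold a TLB entry that maps it. */
bool frame_safe_to_reuse(const struct hot_frame *f)
{
    for (int i = 0; i < NCPUS; i++)
        if (core_epoch[i] <= f->freed_epoch)
            return false;       /* this core has not rescheduled yet */
    return true;
}
```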
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:
Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.
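The counter-and-wait protocol described here can be modelled in a few lines. In this runnable sketch the "other CPUs" are simulated by calling the handler directly; on real hardware the handlers would run concurrently on the target CPUs after the IPI is sent, and the spinlock around the whole sequence (omitted here) serialises concurrent shootdowns:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int pending_acks;   /* CPUs that still have to flush */
static int flushes_done;

/* Runs on each target CPU when the flush IPI arrives. */
void shootdown_ipi_handler(void)
{
    flushes_done++;                          /* reload CR3 / INVLPG here */
    atomic_fetch_sub(&pending_acks, 1);      /* acknowledge completion */
}

/* Runs on the initiating CPU: set the counter, "send" the IPI, then spin
 * until every target CPU has acknowledged its flush. */
void shootdown_wait(int other_cpus)
{
    atomic_store(&pending_acks, other_cpus);
    for (int i = 0; i < other_cpus; i++)     /* stands in for send_ipi()  */
        shootdown_ipi_handler();             /* + asynchronous delivery   */
    while (atomic_load(&pending_acks) != 0)
        ;                                    /* wait for all acks */
}
```

Only after `shootdown_wait` returns may the freed pages be handed back to the physical allocator.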


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:Hi,
rdos wrote:
Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.


Cheers,

Brendan
This reminds me of my time-synchronization code. The major problem with this is that as the number of cores increases, the time all cores have to wait for a TLB flush approaches the maximum interrupt latency. This is in addition to the scaling problem that the flushing itself has. So, no, I would not do it this way. I would rather work towards other solutions that do not have to ensure that IPIs have been handled.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

I think I will start this out by removing all cr3 reloads from various modules, and substituting them with either PageFlushGlobal or PageFlushProcess. Then I'll add these two functions and let them do a flush by reloading cr3.

EDIT: Now this interface is in place. I have also implemented the code so that it only goes to the SMP version if multiple cores are installed; otherwise it will just reload CR3. Even on multi-core hardware, there should be no IPIs if the other cores are not yet started.

So now there is an empty procedure for process invalidation, and another for global invalidation, which I'll implement.

I might also have a flag in the core's private selector that indicates an invalidation is in progress. This flag can be cleared before CR3 is reloaded. If the flag is set, there is no need to send another IPI to this core, as one is already pending.
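That pending flag wants an atomic test-and-set on the sender side, and the clear has to happen *before* the CR3 reload so a request that arrives just after the clear gets a fresh IPI rather than being lost. A minimal sketch, with illustrative names and a stubbed IPI:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 4

static atomic_bool flush_pending[NCPUS];  /* per-core "IPI on the way" flag */
static int ipis_sent;

static void send_flush_ipi(int cpu) { (void)cpu; ipis_sent++; }

/* Sender: only emit an IPI if no flush is already pending on that core. */
void request_flush(int cpu)
{
    if (!atomic_exchange(&flush_pending[cpu], true))
        send_flush_ipi(cpu);
}

/* Receiver: clear the flag first, *then* reload CR3, so a request that
 * races with the reload still triggers a new IPI. */
void flush_ipi_handler(int cpu)
{
    atomic_store(&flush_pending[cpu], false);
    /* reload CR3 here */
}
```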
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

There was no need for spinlocks either. I now have a working TLB-flushing mechanism for SMP. I can tell it works because the message test application no longer prints erroneous messages as it used to on AMD (it won't do this on Atom with only hyperthreading either, possibly because of a shared TLB). However, it comes with a performance penalty: the message application now takes 35% longer to send the same number of messages.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

I'll do a slight redesign that should improve performance. I'll add space for up to 4 page flushes per core + a spinlock and a count. Then TLB flushes of up to 4 pages can use INVLPG instead of reloading cr3. The interface also must be changed to specify a base address and number of pages.
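The batched-flush mailbox described here might look like the following. The INVLPG and CR3-reload are stubbed so the logic is runnable; in the real thing the mailbox would be protected by the per-core spinlock mentioned above, and all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define BATCH_MAX 4

static uintptr_t batch[BATCH_MAX];  /* per-core flush mailbox */
static int batch_count;             /* > BATCH_MAX means "flush everything" */
static int invlpg_count, full_flushes;

static void invlpg(uintptr_t va) { (void)va; invlpg_count++; }
static void reload_cr3(void)     { full_flushes++; }

/* Sender side: queue a page address; on overflow just keep counting so
 * the handler knows to fall back to a full flush. */
void queue_flush(uintptr_t vaddr)
{
    if (batch_count < BATCH_MAX)
        batch[batch_count] = vaddr;
    batch_count++;
}

/* IPI handler: INVLPG each queued page if the batch fits, otherwise
 * reload CR3 and throw the whole TLB away. */
void batched_flush_handler(void)
{
    if (batch_count <= BATCH_MAX)
        for (int i = 0; i < batch_count; i++)
            invlpg(batch[i]);       /* cheap: only these entries are lost */
    else
        reload_cr3();               /* too many pages: full flush */
    batch_count = 0;
}
```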

Besides, having IOPL set to 0 (which prevents applications from manipulating the interrupt flag and thus blocking IRQs) should be enough to ensure freed pages cannot be accessed from another thread.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:Besides, having IOPL set to 0 (which prevents applications from manipulating the interrupt flag and thus blocking IRQs) should be enough to ensure freed pages cannot be accessed from another thread.
Here's what you're trying to avoid:
  1. Process 1, thread 1, running on CPU#1 frees the page (and TLB entry is invalidated on CPU#1, and the IPI is sent to other CPUs)
  2. Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry for the page (because the IPI hasn't arrived yet), and reads the passwords from process 2
  4. CPU #3 receives the IPI, but it's too late
You're making assumptions about timing that seem reasonable (specifically; that the time taken for a CPU to receive a sent IPI will be shorter than the time taken for one CPU to free a page and another CPU to allocate it). Unfortunately there's lots of security exploits that take advantage of timing that seemed reasonable.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.