Flushing TLB in an SMP environment

rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Flushing TLB in an SMP environment

Post by rdos »

This is pretty simple in a single-processor environment. Just reload cr3 (or use INVLPG) and you're done. Now that my SMP scheduler is stable, this seems like the next logical "target" in order to provide a stable environment. The simplest solution would be to just send IPIs to the other cores as cr3 is reloaded, and let those IPIs reload cr3 on the target core. Would this be an adequate solution, or would it need to be more precise? My current solution only reloads cr3; it doesn't use the more precise INVLPG, as I had problems with that before.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Flushing TLB in an SMP environment

Post by gerryg400 »

In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:This is pretty simple in a single-processor environment. Just reload cr3 (or use INVLPG) and you're done. Now that my SMP scheduler is stable, this seems like the next logical "target" in order to provide a stable environment. The simplest solution would be to just send IPIs to the other cores as cr3 is reloaded, and let those IPIs reload cr3 on the target core. Would this be an adequate solution, or would it need to be more precise? My current solution only reloads cr3; it doesn't use the more precise INVLPG, as I had problems with that before.
A TLB miss is expensive. For example:
  • Do TLB lookup, get TLB miss
  • Do cache lookup for page directory entry, get cache miss
  • Do "fetch from RAM" for page directory entry
  • Wait until RAM responds
  • Do cache lookup for page table entry, get cache miss
  • Do "fetch from RAM" for page table entry
  • Wait until RAM responds
For best performance on single-CPU (for "80486 or later"), you should be using INVLPG where possible to avoid flushing TLB entries for no reason (and to avoid lots of expensive TLB misses).

For best performance on single-CPU (for "P6 or later"), you should be using "global" pages (where possible) so that TLB entries for pages that are the same in all virtual address spaces aren't flushed when you change CR3. In this case, you have to use INVLPG to flush individual "global" pages (as reloading CR3 won't flush them) and if you must flush all TLB entries (e.g. you changed a lot of page directory entries in kernel space and flushing everything is faster than doing INVLPG thousands of times) you should toggle the PGE flag in CR4.
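The "toggle the PGE flag in CR4" full flush can be sketched like this. The register accessors here only simulate CR4 so the sketch is runnable; in a real kernel `read_cr4`/`write_cr4` would be inline assembly (`mov %%cr4, %0` / `mov %0, %%cr4`), and the names are illustrative, not from any particular kernel:

```c
#include <assert.h>

#define CR4_PGE (1ul << 7)   /* Page Global Enable bit in CR4 */

/* Simulated CR4; stands in for real privileged register access. */
static unsigned long cr4_shadow = CR4_PGE;
static unsigned long read_cr4(void)    { return cr4_shadow; }
static void write_cr4(unsigned long v) { cr4_shadow = v; }

/* Flush the entire TLB, including "global" entries that a CR3 reload
 * would leave intact: clear PGE, then restore the original value. */
void tlb_flush_all_global(void)
{
    unsigned long cr4 = read_cr4();
    write_cr4(cr4 & ~CR4_PGE);  /* clearing PGE invalidates all TLB entries */
    write_cr4(cr4);             /* restore, re-enabling global pages */
}
```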

Some notes:
  1. Some CPUs (e.g. Cyrix) do remember if a page is "not present", and you do need to invalidate TLB entries when you change a page from "not present" to "present".
  2. If you use the "self-reference trick" (e.g. a page directory entry that points to the page directory itself) to create a "paging structure mapping", then if you change a page directory entry you need to flush up to 1024 TLB entries for the affected area *plus* 1 TLB entry in the "paging structure mapping" area.
For SMP, it's the same as single CPU except that an IPI is involved to make sure the TLB entries are invalidated on all CPUs and not just one. Receiving an IPI is as expensive as receiving any other IRQ (e.g. it causes a full pipeline flush, followed by IDT and GDT lookups and protection checks, followed by the overhead of the interrupt handler itself). If a CPU doing useful work sends an average of 10 IPIs per second, then 32 CPUs will probably send an average of 320 IPIs per second. If those IPIs are received by all CPUs except the sender; then for 2 CPUs you'd get "2 * 1 * 10 = 20" IPIs received per second, for 4 CPUs you'd get "4 * 3 * 10 = 120" IPIs received per second, and for 128 CPUs you'd get "128 * 127 * 10 = 162560" IPIs received per second. It ends up being quadratic overhead (a scalability nightmare for large systems). Basically, it's important to avoid unnecessary IPIs.
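The arithmetic above is just "senders times receivers times rate", which makes the n * (n - 1) growth explicit:

```c
#include <assert.h>

/* Total IPIs received per second if each of n CPUs sends `rate` shootdown
 * IPIs per second and every IPI is delivered to the other n - 1 CPUs.
 * The growth is n * (n - 1), i.e. quadratic in the CPU count. */
long ipis_received_per_second(long n, long rate)
{
    return n * (n - 1) * rate;
}
```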

The main way of avoiding IPIs is called "lazy TLB invalidation". If using stale TLB information will cause a page fault, then the page fault handler can check if the page fault was caused by a stale TLB entry and invalidate the TLB entry itself, and you don't need to send an IPI. This means you don't need to send an IPI to other CPU/s if you change a page from "not present" to "present", or from "supervisor" to "user", or from "read-only" to "read/write", or from "no-execute" to "executable" (or any combination of these). That roughly halves the number of IPIs. You do get some page faults instead, but only when the CPU actually does have the info in its TLB and only if the TLB entry is used (so, the number of page faults is a lot less than the number of IPIs you would've received if you weren't using "lazy TLB invalidation").
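The page fault handler's "was this just a stale TLB entry?" check can be sketched as follows. The page-table lookup and INVLPG are stubbed out so the logic is runnable; in a real kernel `pte_lookup` would walk the page tables and `invlpg` would execute the instruction, and these names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u
#define PTE_WRITE   0x2u

/* Simulated single-page page table and INVLPG stub. */
static uint32_t fake_pte;
static int invlpg_calls;

static uint32_t pte_lookup(uintptr_t vaddr) { (void)vaddr; return fake_pte; }
static void invlpg(uintptr_t vaddr)         { (void)vaddr; invlpg_calls++; }

/* Lazy TLB invalidation: if the in-memory PTE already permits the faulting
 * access, the fault must have been caused by a stale TLB entry; flush that
 * entry locally and resume - no IPI was ever needed. */
bool handle_stale_tlb_fault(uintptr_t vaddr, bool write_access)
{
    uint32_t pte = pte_lookup(vaddr);
    if ((pte & PTE_PRESENT) && (!write_access || (pte & PTE_WRITE))) {
        invlpg(vaddr);   /* refresh this CPU's TLB entry */
        return true;     /* fault fully handled */
    }
    return false;        /* a genuine fault: hand it to the memory manager */
}
```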

There are other ways of avoiding IPIs in specific situations. If you change a page table for the currently running process, and if that process has only one thread, then it's impossible for any other CPUs to have TLB entries for that process and you don't need to send any IPI. If an OS implements "thread local storage" by giving different threads in the process (slightly) different virtual address spaces then you end up with a similar situation - if thread local storage is changed for one thread, then no other CPU can be running that thread and no IPIs are necessary. In a similar way, if a process (with multiple threads) has a "CPU affinity" that prevents those threads from being run on some CPUs, then (if the CPU affinity can't change and cause race conditions) you only need to send IPIs to CPUs that are in the process' CPU affinity (and not all CPUs).

There are also more advanced/complex schemes that involve modifying the local APIC's "logical APIC ID" during task switches, so that some of the bits are used to store a small hash of the process ID. This allows you to send IPIs using the "logical destination mode" to a subset of all CPUs instead of all of them when the TLB entry belongs to a multi-threaded process. For example, if 4 of the "logical APIC ID" bits are set to "1 << (process ID & 3)" during task switches, then you'd get rid of 75% of the IPIs received for TLB invalidation in user-space. Unfortunately, for x2APIC (unlike xAPIC) the "logical APIC ID" is hard-wired and this won't work. Fortunately, for x2APIC the hard-wired "logical APIC ID" is well suited to NUMA optimisations (e.g. you can broadcast an IPI to a subset of CPUs within a specific NUMA domain, and if a process' threads are constrained to a specific NUMA domain then it's easy to avoid sending unnecessary IPIs to other NUMA domains).
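The "1 << (process ID & 3)" hash mentioned above is a one-liner; on a task switch a kernel would write this value into four spare bits of the xAPIC logical APIC ID, and a shootdown in logical destination mode then only interrupts CPUs whose hash bit matches (on average 1 in 4 of them). The function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit hash of the process ID, as described in the post: exactly one of
 * four bits is set, chosen by the low two bits of the process ID. CPUs
 * running threads of the same process end up with the same bit set, so a
 * logical-destination IPI with this mask reaches only (roughly) them. */
uint32_t tlb_shootdown_hash(uint32_t process_id)
{
    return 1u << (process_id & 3);
}
```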


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Thanks, Brendan. That seems to more or less sum this problem up.
Brendan wrote:For best performance on single-CPU (for "P6 or later"), you should be using "global" pages (where possible) so that TLB entries for pages that are the same in all virtual address spaces aren't flushed when you change CR3. In this case, you have to use INVLPG to flush individual "global" pages (as reloading CR3 won't flush them) and if you must flush all TLB entries (e.g. you changed a lot of page directory entries in kernel space and flushing everything is faster than doing INVLPG thousands of times) you should toggle the PGE flag in CR4.
Ah, now I remember. This was what I had problems with a year or so ago when I tried global pages. The whole system seemed to become unstable, and there was no easy way to pinpoint exactly what went wrong. Probably the typical case of "if you make too many changes at the same time, there is no way of knowing which one broke the system".

I probably need to start out by just enabling the global-page feature (CR4.PGE) without setting the global bit in any page tables, and then move over one table at a time.
Brendan wrote:The main way of avoiding IPIs is called "lazy TLB invalidation". If using stale TLB information will cause a page fault, then the page fault handler can check if the page fault was caused by a stale TLB entry and invalidate the TLB entry itself, and you don't need to send an IPI. This means you don't need to send an IPI to other CPU/s if you change a page from "not present" to "present", or from "supervisor" to "user", or from "read-only" to "read/write", or from "no-execute" to "executable" (or any combination of these). That roughly halves the number of IPIs. You do get some page faults instead, but only when the CPU actually does have the info in its TLB and only if the TLB entry is used (so, the number of page faults is a lot less than the number of IPIs you would've received if you weren't using "lazy TLB invalidation").
Yes, I already use this method. The current TLB-invalidation calls are related to freeing pages only.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Flushing TLB in an SMP environment

Post by gerryg400 »

rdos wrote:
gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
It doesn't just apply to ISRs or the kernel. Let's say that 2 user threads are running at the same time on different cores and that they are using the same address space. If one thread frees a page, you need to make sure that that page has been removed from both (actually all) TLBs before it is added to the free list. This requires some sort of synchronisation after the IPI.
If a trainstation is where trains stop, what is a workstation ?
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

gerryg400 wrote:
rdos wrote:
gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
It doesn't just apply to ISRs or the kernel. Let's say that 2 user threads are running at the same time on different cores and that they are using the same address space. If one thread frees a page, you need to make sure that that page has been removed from both (actually all) TLBs before it is added to the free list. This requires some sort of synchronisation after the IPI.
Yes, you are correct, except that applications will never free pages (unless they use special APIs). The C/C++ heap in OpenWatcom is currently implemented without ever freeing pages. But there is another (similar) problem, which has to do with demand-loading pages into the application image. Two cores could try to demand-load the same page at roughly the same time, and this needs to be handled in some way. The single-CPU way would be to disable interrupts, but that no longer works with multiple cores. This is the (last?) issue I have. I need to go through every cli/sti to make sure it is not used to protect code from concurrent access, and add spinlocks where it is.
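The demand-load race can be closed with the usual "check, lock, re-check" pattern: after taking the per-address-space lock, re-read the PTE, because the other core may have completed the load while you were waiting. The lock/unlock stubs and all names here are illustrative, not from any particular kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated state: whether the page is mapped, and how many loads ran. */
static bool page_present;
static int  loads_performed;

static void lock(void)   {}   /* spin_lock(&as->fault_lock) in a kernel   */
static void unlock(void) {}   /* spin_unlock(&as->fault_lock)             */

/* Two cores may fault on the same page at roughly the same time; only the
 * first one past the re-check actually reads the page from the image. */
void demand_load_page(uintptr_t vaddr)
{
    (void)vaddr;
    if (page_present)          /* fast path: already loaded by someone */
        return;
    lock();
    if (!page_present) {       /* re-check: we may have lost the race */
        loads_performed++;     /* read page from image and map it here */
        page_present = true;
    }
    unlock();
}
```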

EDIT: I don't see why user code (which by definition should execute with interrupts enabled, without any pending interrupts that could stop an IPI from being served) would need to make sure that another user thread executing on another core has flushed its TLB.

This code should be enough to ensure this:

Code:

   for each active core
        SendFlushIPI
   mov eax,cr3
   mov cr3,eax

; it should be safe to assume that any non-ISR based code would not use stale TLB entries on any core at this point
A better (more selective) variant for TLB entries tied to the private process space would be this (this includes application memory, and kernel memory allocated for a specific process):

Code:

   for each active core
       if (core.cr3 == cr3)
           SendFlushIPI

   mov eax,cr3
   mov cr3,eax
Another remark on this issue: it is an error in the application if one thread tries to use memory that another thread has just freed. I have run extensive testing on this just to make sure that our terminal application will not reference freed memory, using a special dynamic heap allocator that allocates all requests at the page level, and really does free the pages and invalidate the TLB when a memory block is freed. Therefore, I'm pretty sure this error is rather uncommon.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Flushing TLB in an SMP environment

Post by Owen »

True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
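The "hot list" idea amounts to epoch-based reclamation: stamp each freed frame with the epoch in which it was freed, have every core bump its own epoch when it passes through the scheduler, and only reuse a frame once every core's epoch has advanced past the stamp. This is a sketch under those assumptions; all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

/* Each core's epoch is bumped every time it schedules (and thus reloads
 * CR3 or otherwise drops stale TLB entries). */
static unsigned long core_epoch[NCPUS];

struct hot_frame {
    unsigned long freed_epoch;  /* epoch at which the frame was freed */
};

void scheduler_tick(int cpu) { core_epoch[cpu]++; }

/* A frame is safe to reuse only once every core has scheduled since it
 * was freed, so no core can still hold a TLB entry that maps it. */
bool frame_safe_to_reuse(const struct hot_frame *f)
{
    for (int i = 0; i < NCPUS; i++)
        if (core_epoch[i] <= f->freed_epoch)
            return false;       /* this core has not rescheduled yet */
    return true;
}
```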
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:
Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.
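The counter-and-wait protocol described here can be modelled in a few lines. In this runnable sketch the "other CPUs" are simulated by calling the handler directly; on real hardware the handlers would run concurrently on the target CPUs after the IPI is sent, and the spinlock around the whole sequence (omitted here) serialises concurrent shootdowns:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int pending_acks;   /* CPUs that still have to flush */
static int flushes_done;

/* Runs on each target CPU when the flush IPI arrives. */
void shootdown_ipi_handler(void)
{
    flushes_done++;                          /* reload CR3 / INVLPG here */
    atomic_fetch_sub(&pending_acks, 1);      /* acknowledge completion */
}

/* Runs on the initiating CPU: set the counter, "send" the IPI, then spin
 * until every target CPU has acknowledged its flush. */
void shootdown_wait(int other_cpus)
{
    atomic_store(&pending_acks, other_cpus);
    for (int i = 0; i < other_cpus; i++)     /* stands in for send_ipi()  */
        shootdown_ipi_handler();             /* + asynchronous delivery   */
    while (atomic_load(&pending_acks) != 0)
        ;                                    /* wait for all acks */
}
```

Only after `shootdown_wait` returns may the freed pages be handed back to the physical allocator.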


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:Hi,
rdos wrote:
Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.


Cheers,

Brendan
This reminds me of my time-synchronization code. The major problem with this is that as the number of cores increases, the time all cores have to wait for a TLB flush approaches the maximum interrupt latency. This is in addition to the scaling problem that the flushing itself has. So, no, I would not do it this way. I would rather work towards other solutions that do not have to ensure that IPIs have been handled.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

I think I will start this out by removing all cr3 reloads from various modules, and substituting them with either PageFlushGlobal or PageFlushProcess. Then I'll add these two functions and let them do a flush by reloading cr3.

EDIT: Now this interface is in place. I have also implemented the code so that it only goes to the SMP version if multiple cores are installed; otherwise it will just reload CR3. Even on multi-core hardware, there should be no IPIs if the other cores are not yet started.

So now there is an empty procedure for process invalidation, and another for global invalidation, which I'll implement.

I might also have a flag in the core's private selector that indicates an invalidation is in progress. This flag can be cleared before CR3 is reloaded. If the flag is set, there is no need to send another IPI to this core, as one is already pending.
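That pending flag wants an atomic test-and-set on the sender side, and the clear has to happen *before* the CR3 reload so a request that arrives just after the clear gets a fresh IPI rather than being lost. A minimal sketch, with illustrative names and a stubbed IPI:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 4

static atomic_bool flush_pending[NCPUS];  /* per-core "IPI on the way" flag */
static int ipis_sent;

static void send_flush_ipi(int cpu) { (void)cpu; ipis_sent++; }

/* Sender: only emit an IPI if no flush is already pending on that core. */
void request_flush(int cpu)
{
    if (!atomic_exchange(&flush_pending[cpu], true))
        send_flush_ipi(cpu);
}

/* Receiver: clear the flag first, *then* reload CR3, so a request that
 * races with the reload still triggers a new IPI. */
void flush_ipi_handler(int cpu)
{
    atomic_store(&flush_pending[cpu], false);
    /* reload CR3 here */
}
```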
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

There was no need for spinlocks either. I now have a working TLB-flushing mechanism for SMP. I can tell it works because the message test application no longer prints erroneous messages as it used to on AMD (it won't do this on Atom with only hyperthreading either, possibly because of a shared TLB). However, it comes with a performance penalty: the message application now takes 35% longer to send the same number of messages.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

I'll do a slight redesign that should improve performance. I'll add space for up to 4 page flushes per core + a spinlock and a count. Then TLB flushes of up to 4 pages can use INVLPG instead of reloading cr3. The interface also must be changed to specify a base address and number of pages.
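The batched-flush mailbox described here might look like the following. The INVLPG and CR3-reload are stubbed so the logic is runnable; in the real thing the mailbox would be protected by the per-core spinlock mentioned above, and all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define BATCH_MAX 4

static uintptr_t batch[BATCH_MAX];  /* per-core flush mailbox */
static int batch_count;             /* > BATCH_MAX means "flush everything" */
static int invlpg_count, full_flushes;

static void invlpg(uintptr_t va) { (void)va; invlpg_count++; }
static void reload_cr3(void)     { full_flushes++; }

/* Sender side: queue a page address; on overflow just keep counting so
 * the handler knows to fall back to a full flush. */
void queue_flush(uintptr_t vaddr)
{
    if (batch_count < BATCH_MAX)
        batch[batch_count] = vaddr;
    batch_count++;
}

/* IPI handler: INVLPG each queued page if the batch fits, otherwise
 * reload CR3 and throw the whole TLB away. */
void batched_flush_handler(void)
{
    if (batch_count <= BATCH_MAX)
        for (int i = 0; i < batch_count; i++)
            invlpg(batch[i]);       /* cheap: only these entries are lost */
    else
        reload_cr3();               /* too many pages: full flush */
    batch_count = 0;
}
```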

Besides, having IOPL set to 0 (which prevents applications from manipulating the interrupt flag and thus blocking IRQs) should be enough to ensure freed pages cannot be accessed from another thread.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:Besides, having IOPL set to 0 (which prevents applications from manipulating the interrupt flag and thus blocking IRQs) should be enough to ensure freed pages cannot be accessed from another thread.
Here's what you're trying to avoid:
  1. Process 1, thread 1, running on CPU#1 frees the page (and TLB entry is invalidated on CPU#1, and the IPI is sent to other CPUs)
  2. Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry for the page (because the IPI hasn't arrived yet), and reads the passwords from process 2
  4. CPU #3 receives the IPI, but it's too late
You're making assumptions about timing that seem reasonable (specifically; that the time taken for a CPU to receive a sent IPI will be shorter than the time taken for one CPU to free a page and another CPU to allocate it). Unfortunately there's lots of security exploits that take advantage of timing that seemed reasonable.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.