Flushing TLB in an SMP environment
This is pretty simple in a single-processor environment. Just reload cr3 (or use invlpg) and you're done. Now that my SMP scheduler is stable, this seems like the next logical "target" in order to provide a stable environment. The simplest solution would be to just send IPIs to the other cores as cr3 is reloaded, and let those IPIs reload cr3 on the target core. Would this be an adequate solution, or would it need to be more precise? My current solution only reloads cr3. It doesn't use the more precise invlpg, as I had problems with that before.
Re: Flushing TLB in an SMP environment
In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
If a trainstation is where trains stop, what is a workstation ?
Re: Flushing TLB in an SMP environment
Hi,
rdos wrote:This is pretty simple in a single-processor environment. Just reload cr3 (or use invlpg) and you're done. Now that my SMP scheduler is stable, this seems like the next logical "target" in order to provide a stable environment. The simplest solution would be to just send IPIs to other cores as cr3 is reloaded, and let those IPIs reload cr3 in the target core. Would this be an adequate solution, or would it need to be more precise? My current solution only reloads cr3. It doesn't use the more precise invlpg, as I had problems with this before.
A TLB miss is expensive. For example:
- Do TLB lookup, get TLB miss
- Do cache lookup for page directory entry, get cache miss
- Do "fetch from RAM" for page directory entry
- Wait until RAM responds
- Do cache lookup for page table entry, get cache miss
- Do "fetch from RAM" for page table entry
- Wait until RAM responds
For best performance on single-CPU (for "P6 or later"), you should be using "global" pages (where possible) so that TLB entries for pages that are the same in all virtual address spaces aren't flushed when you change CR3. In this case, you have to use INVLPG to flush individual "global" pages (as reloading CR3 won't flush them) and if you must flush all TLB entries (e.g. you changed a lot of page directory entries in kernel space and flushing everything is faster than doing INVLPG thousands of times) you should toggle the PGE flag in CR4.
Some notes:
- Some CPUs (e.g. Cyrix) do remember if a page is "not present", and you do need to invalidate TLB entries when you change a page from "not present" to "present".
- If you use the "self-reference trick" (e.g. a page directory entry that points to the page directory itself) to create a "paging structure mapping", then if you change a page directory entry you need to flush up to 1024 TLB entries for the affected area *plus* 1 TLB entry in the "paging structure mapping" area.
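The choice between INVLPG per page and toggling CR4.PGE for a full flush of global entries can be captured as a small policy function. A minimal sketch only: the function and enum names are invented, and the threshold is a made-up tuning knob that a real kernel would determine by benchmarking each CPU model.

```c
#include <stddef.h>

/* Cost-based choice between flushing global pages one at a time with
 * INVLPG and flushing everything by toggling CR4.PGE (clear PGE,
 * which flushes all TLB entries including global ones, then set it
 * again).  The threshold of 64 is an assumption, not a measurement. */
enum flush_strategy { FLUSH_INVLPG_EACH, FLUSH_TOGGLE_PGE };

enum flush_strategy pick_global_flush(size_t npages)
{
    const size_t threshold = 64;          /* assumed crossover point */
    return npages <= threshold ? FLUSH_INVLPG_EACH : FLUSH_TOGGLE_PGE;
}
```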
The main way of avoiding IPIs is called "lazy TLB invalidation". If using stale TLB information will cause a page fault, then the page fault handler can check if the page fault was caused by a stale TLB entry and invalidate the TLB entry itself, and you don't need to send an IPI. This means you don't need to send an IPI to other CPUs if you change a page from "not present" to "present", or from "supervisor" to "user", or from "read-only" to "read/write", or from "no-execute" to "executable" (or any combination of these). That roughly halves the number of IPIs. You do get some page faults instead, but only when the CPU actually does have the info in its TLB and only if the TLB entry is used (so, the number of page faults is a lot less than the number of IPIs you would've received if you weren't using "lazy TLB invalidation").
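The rule of thumb above — only permission *removals* (or a frame change) can leave a stale TLB entry that doesn't fault — can be expressed as a predicate. A hedged sketch: the function name is invented, bit positions follow 32-bit x86 paging except `PTE_NX`, which is placed artificially here (real NX lives in bit 63 of a 64-bit entry).

```c
#include <stdbool.h>
#include <stdint.h>

/* 32-bit x86 PTE bits; PTE_NX placed artificially for this sketch. */
#define PTE_PRESENT   (1u << 0)
#define PTE_WRITE     (1u << 1)
#define PTE_USER      (1u << 2)
#define PTE_NX        (1u << 31)
#define PTE_ADDR_MASK 0xFFFFF000u

/* Hypothetical helper: true only when a PTE change can leave a stale
 * TLB entry that will NOT fault, i.e. a right was removed or the frame
 * moved.  Pure grants (not-present->present, RO->RW, supervisor->user,
 * NX->executable) fault harmlessly and can be fixed up lazily in the
 * page fault handler, so no shootdown IPI is needed.  (Caveat from the
 * post: CPUs that cache "not present", e.g. some Cyrix parts, break
 * the not-present->present case.) */
bool change_needs_shootdown(uint32_t old_pte, uint32_t new_pte)
{
    if (!(old_pte & PTE_PRESENT))
        return false;                     /* nothing could be cached */
    if ((old_pte ^ new_pte) & PTE_ADDR_MASK)
        return true;                      /* frame changed */
    /* Bits set before but clear now = rights removed. */
    uint32_t removed = old_pte & ~new_pte & (PTE_PRESENT | PTE_WRITE | PTE_USER);
    /* NX is inverted: setting NX removes execute permission. */
    bool nx_added = !(old_pte & PTE_NX) && (new_pte & PTE_NX);
    return removed != 0 || nx_added;
}
```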
There are other ways of avoiding IPIs in specific situations. If you change a page table for the currently running process, and if that process has only one thread, then it's impossible for any other CPUs to have TLB entries for that process and you don't need to send any IPI. If an OS implements "thread local storage" by giving different threads in the process (slightly) different virtual address spaces then you end up with a similar situation - if thread local storage is changed for one thread, then no other CPU can be running that thread and no IPIs are necessary. In a similar way, if a process (with multiple threads) has a "CPU affinity" that prevents those threads from being run on some CPUs, then (if the CPU affinity can't change and cause race conditions) you only need to send IPIs to CPUs that are in the process' CPU affinity (and not all CPUs).
There are also more advanced/complex schemes that involve modifying the local APIC's "logical APIC ID" during task switches, so that some of the bits are used to store a small hash of the process ID. This allows you to send IPIs using the "logical destination mode" to a subset of all CPUs instead of all of them when the TLB entry belongs to a multi-threaded process. For example, if 4 of the "logical APIC ID" bits are set to "1 << (process ID & 3)" during task switches, then you'd get rid of 75% of the IPIs received for TLB invalidation in user-space. Unfortunately, for x2APIC (unlike xAPIC) the "logical APIC ID" is hard-wired and this won't work. Fortunately, for x2APIC the hard-wired "logical APIC ID" is well suited to NUMA optimisations (e.g. you can broadcast an IPI to a subset of CPUs within a specific NUMA domain, and if a process' threads are constrained to a specific NUMA domain then it's easy to avoid sending unnecessary IPIs to other NUMA domains).
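The "1 << (process ID & 3)" hashing above can be sketched as two small functions: one that computes the logical APIC ID bits to program on a task switch, and one that checks whether a CPU would accept a shootdown IPI sent in logical destination mode. Both function names are invented for this sketch.

```c
#include <stdint.h>

/* On every task switch, 4 bits of the xAPIC "logical APIC ID" are
 * programmed to a 1-of-4 hash of the process ID, as the post suggests. */
uint8_t logical_id_for(uint32_t pid)
{
    return (uint8_t)(1u << (pid & 3));    /* one of 0x1, 0x2, 0x4, 0x8 */
}

/* A logical-destination IPI aimed at target_pid's hash bit is accepted
 * only by CPUs whose current process shares that hash — filtering out
 * roughly 75% of deliveries for user-space shootdowns. */
int accepts_ipi(uint32_t cpu_pid, uint32_t target_pid)
{
    return (logical_id_for(cpu_pid) & logical_id_for(target_pid)) != 0;
}
```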
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Flushing TLB in an SMP environment
gerryg400 wrote:In the case where you are flushing the TLB because some memory has been de-allocated, you may also need some sort of reply from the other cores to ensure that they have completed their TLB flush before your memory manager re-uses the recently freed pages.
That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/freeing uses critical sections, which ISRs are not allowed to use.
Re: Flushing TLB in an SMP environment
Thanks, Brendan. That seems to more or less sum this problem up.
I probably need to start out by just enabling global pages (but not setting the bit in the page-tables), and then move one table at a time.
Brendan wrote:For best performance on single-CPU (for "P6 or later"), you should be using "global" pages (where possible) so that TLB entries for pages that are the same in all virtual address spaces aren't flushed when you change CR3. In this case, you have to use INVLPG to flush individual "global" pages (as reloading CR3 won't flush them) and if you must flush all TLB entries (e.g. you changed a lot of page directory entries in kernel space and flushing everything is faster than doing INVLPG thousands of times) you should toggle the PGE flag in CR4.
Ah, now I remember. This was what I had problems with a year or so ago when I tried global pages. The whole system seemed to become unstable, and there was no easy way to pin-point exactly what went wrong. Probably the typical "if you make too many changes at the same time, there is no way of knowing which one broke the system".
Brendan wrote:The main way of avoiding IPIs is called "lazy TLB invalidation". If using stale TLB information will cause a page fault, then the page fault handler can check if the page fault was caused by a stale TLB entry and invalidate the TLB entry itself, and you don't need to send an IPI. This means you don't need to send an IPI to other CPU/s if you change a page from "not present" to "present", or from "supervisor" to "user", or from "read-only" to "read/write", or from "no-execute" to "executable" (or any combination of these). That roughly halves the number of IPIs.
Yes, I already use this method. The current TLB-invalidation calls are related to freeing pages only.
Re: Flushing TLB in an SMP environment
rdos wrote:That shouldn't be a problem, as ISRs and code that disables interrupts are not allowed to free memory. This is because memory allocation/free uses critical sections, which ISRs are not allowed to use.
It doesn't just apply to ISRs or the kernel. Let's say that 2 user threads are running at the same time on different cores and that they are using the same address space. If one thread frees a page, you need to make sure that that page has been removed from both (actually all) TLBs before it is added to the free list. This requires some sort of synchronisation after the IPI.
Re: Flushing TLB in an SMP environment
gerryg400 wrote:It doesn't just apply to ISRs or the kernel. Let's say that 2 user threads are running at the same time on different cores and that they are using the same address space. If one thread frees a page, you need to make sure that that page has been removed from both (actually all) TLBs before it is added to the free list. This requires some sort of synchronisation after the IPI.
Yes, you are correct, except that applications will never free pages (unless they use special APIs). The C/C++ heap is currently implemented without freeing pages in OpenWatcom. But there is another (similar) problem that has to do with demand-loading pages into the application image. Two cores could try to demand-load the same page at roughly the same time, and this needs to be handled in some way. The single-CPU way would be to disable interrupts, but this no longer works with multiple cores. This is the (last?) issue I have. I need to go through every cli/sti to make sure it is not used to protect code from multiple access, and add spinlocks where it is.
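The demand-load race described here is usually handled by taking a spinlock in the page fault handler and re-checking the page table before doing the load, so the loser of the race simply backs off. A minimal userspace sketch, with a C11 `atomic_flag` standing in for the spinlock and a toy boolean array standing in for the page table; all names are invented.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy address space: a lock plus a 16-page "page table". */
typedef struct {
    atomic_flag lock;
    bool        mapped[16];
} addr_space;

/* Returns true if this caller performed the load, false if another
 * core (or an earlier call) had already mapped the page.  The key
 * point is the re-check *under the lock*: two cores can both fault on
 * the same page, but only one does the load. */
bool demand_load(addr_space *as, unsigned page)
{
    while (atomic_flag_test_and_set_explicit(&as->lock, memory_order_acquire))
        ;                                  /* spin */
    bool did_load = false;
    if (!as->mapped[page]) {               /* re-check under the lock */
        as->mapped[page] = true;           /* stand-in for the real I/O */
        did_load = true;
    }
    atomic_flag_clear_explicit(&as->lock, memory_order_release);
    return did_load;
}
```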
EDIT: I don't see why user code (which by definition should execute with interrupts enabled, without any pending interrupts that could stop an IPI from being served) would need to make sure that another user thread executing on another core has flushed its TLB.
This code should be enough to ensure this:
Code:
for each active core
    SendFlushIPI
mov eax,cr3
mov cr3,eax
; it should be safe to assume that any non-ISR based code would not use stale TLB entries on any core at this point
Code:
for each active core
    if (core.cr3 == cr3)
        SendFlushIPI
mov eax,cr3
mov cr3,eax
Re: Flushing TLB in an SMP environment
True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Re: Flushing TLB in an SMP environment
Owen wrote:True. But it is a security or stability issue if an application accesses (successfully) memory that it has freed that has been re-allocated to someone else. Especially if that someone else is the kernel.
Yes. Perhaps a way to solve this is to place just-freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
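The "hot list" idea can be sketched with epochs: a freed frame is parked until every core has passed through the scheduler (and therefore reloaded CR3) at least once, and only then becomes allocatable. This is a sketch under invented names — `epoch_tick` would be called by whatever mechanism detects that all cores have rescheduled, and the fixed-size list is purely illustrative.

```c
#include <stddef.h>

#define MAX_HOT 64

static unsigned long hot_frame[MAX_HOT];   /* parked physical frames */
static unsigned      hot_epoch[MAX_HOT];   /* epoch each was freed in */
static size_t        hot_count;
static unsigned      current_epoch;
static size_t        released;             /* stand-in for the allocator */

/* Called once all cores have been through the scheduler. */
void epoch_tick(void) { current_epoch++; }

/* Park a freed frame instead of returning it to the allocator. */
void hot_free(unsigned long frame)
{
    if (hot_count < MAX_HOT) {
        hot_frame[hot_count] = frame;
        hot_epoch[hot_count] = current_epoch;
        hot_count++;
    }
}

/* Release frames whose epoch has passed (i.e. every core has reloaded
 * CR3 since the free); returns how many were released. */
size_t hot_drain(void)
{
    size_t kept = 0, n = 0;
    for (size_t i = 0; i < hot_count; i++) {
        if (hot_epoch[i] < current_epoch) {
            released++;                    /* really: free the frame */
            n++;
        } else {
            hot_frame[kept] = hot_frame[i];
            hot_epoch[kept] = hot_epoch[i];
            kept++;
        }
    }
    hot_count = kept;
    return n;
}
```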
Re: Flushing TLB in an SMP environment
Hi,
rdos wrote:Yes. Perhaps a way to solve this is to place just freed physical memory on a temporary "hot" list, and not make it available for re-allocation until all cores have scheduled (1ms).
I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.
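The countdown handshake Brendan describes can be sketched in userspace, with threads standing in for CPUs and an atomic flag standing in for the IPI delivery; all names are invented, and the real version would run the decrement in the IPI handler with the frames only reused after the wait completes.

```c
#include <pthread.h>
#include <stdatomic.h>

#define OTHER_CPUS 3

static atomic_int pending;                 /* CPUs that still must flush */
static atomic_int go;                      /* stand-in for the IPI itself */

/* What each target CPU's IPI handler would do. */
static void *ipi_handler(void *arg)
{
    (void)arg;
    while (!atomic_load(&go))
        ;                                  /* wait for the "IPI" */
    /* ...invalidate the TLB here (invlpg / mov cr3)... */
    atomic_fetch_sub(&pending, 1);         /* acknowledge */
    return NULL;
}

/* Initiator: set the counter, "send" the IPI, then spin until every
 * other CPU has acknowledged before the freed pages may be reused. */
int shootdown_and_wait(void)
{
    pthread_t t[OTHER_CPUS];
    atomic_store(&pending, OTHER_CPUS);
    for (int i = 0; i < OTHER_CPUS; i++)
        pthread_create(&t[i], NULL, ipi_handler, NULL);
    atomic_store(&go, 1);                  /* "send" the IPI */
    while (atomic_load(&pending) != 0)
        ;                                  /* wait for all acknowledgements */
    for (int i = 0; i < OTHER_CPUS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&pending);          /* 0: everyone has flushed */
}
```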
Cheers,
Brendan
Re: Flushing TLB in an SMP environment
Brendan wrote:I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.
This reminds me of my time-synchronization code. The major problem with this is that as the number of cores increases, the time all cores have to wait for a TLB flush approaches the maximum interrupt latency. This is in addition to the scaling problem that the flushing itself has. So, no, I would not do it this way. I would rather work towards other solutions that do not have to ensure that IPIs are handled.
Cheers,
Brendan
Re: Flushing TLB in an SMP environment
I think I will start this out by removing all cr3 reloads from various modules, and substituting them with either PageFlushGlobal or PageFlushProcess. Then I'll add these two functions and let them do a flush by reloading cr3.
EDIT: Now this interface is in-place. I also have implemented the code that only goes to the SMP-version if multiple cores are installed, and otherwise it will just reload CR3. Even for multi-core hardware, there should be no IPIs if the other cores are not yet started.
So now there is an empty procedure for process invalidation, and another for global invalidation, which I'll implement.
I might also have a flag in the core private selector that indicates an invalidation is in progress. This flag can be cleared before CR3 is reloaded. If the flag is set, there is no need to emit another IPI to this core, as one is already pending.
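That per-core pending flag can be sketched with an atomic exchange: the sender learns in one operation whether an IPI is still needed, and the handler clears the flag *before* reloading CR3 so that any request arriving after the clear raises it again. All names are invented.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NCORES 4

static atomic_bool flush_pending[NCORES];

/* Returns true if an IPI must actually be sent to `core`; false means
 * a flush is already pending there and the coming CR3 reload will
 * cover this request too. */
bool request_flush(int core)
{
    /* exchange: set the flag and learn whether it was already set */
    return !atomic_exchange(&flush_pending[core], true);
}

/* Called by the target core's IPI handler. */
void handle_flush(int core)
{
    atomic_store(&flush_pending[core], false); /* clear BEFORE reload */
    /* ...reload CR3 here; a request that arrives after the clear sets
       the flag again and triggers a fresh IPI, so nothing is lost... */
}
```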
Re: Flushing TLB in an SMP environment
There was no need for spinlocks either. I now have a working TLB-flushing mechanism for SMP. I can tell it works because the message test application no longer prints erroneous messages as it used to on AMD (it doesn't do this on Atom with only Hyperthreading either, possibly because of a shared TLB). However, it comes with a performance penalty: the message application now takes 35% longer to send the same number of messages.
Re: Flushing TLB in an SMP environment
I'll do a slight redesign that should improve performance. I'll add space for up to 4 page-flushes per core, plus a spinlock and a count. Then TLB flushes of up to 4 pages can use invlpg instead of reloading cr3. The interface also must be changed to specify a base address and a number of pages.
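The redesigned per-core request might look like the sketch below: up to 4 page addresses are queued for INVLPG, and anything larger degrades to a full CR3 reload. The struct layout and names are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLUSH_SLOTS 4
#define PAGE_SIZE   4096u

/* Per-core flush request: count > FLUSH_SLOTS means "flush everything". */
typedef struct {
    uintptr_t page[FLUSH_SLOTS];
    size_t    count;
} flush_request;

/* Queue (base, npages) for invlpg on the target core.  Returns false
 * when the request doesn't fit, in which case the request is marked as
 * a full flush and the handler should reload CR3 instead. */
bool queue_flush(flush_request *req, uintptr_t base, size_t npages)
{
    if (npages > FLUSH_SLOTS || req->count + npages > FLUSH_SLOTS) {
        req->count = FLUSH_SLOTS + 1;      /* mark as full flush */
        return false;
    }
    for (size_t i = 0; i < npages; i++)
        req->page[req->count++] = base + i * PAGE_SIZE;
    return true;                           /* handler INVLPGs req->page[] */
}
```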
Besides, having IOPL set to 0 (which disallows applications from manipulating interrupt-flag and thus hindering IRQs) should be enough to ensure freed pages cannot be accessed from another thread.
Re: Flushing TLB in an SMP environment
Hi,
rdos wrote:Besides, having IOPL set to 0 (which disallows applications from manipulating interrupt-flag and thus hindering IRQs) should be enough to ensure freed pages cannot be accessed from another thread.
Here's what you're trying to avoid:
- Process 1, thread 1, running on CPU#1 frees the page (and TLB entry is invalidated on CPU#1, and the IPI is sent to other CPUs)
- Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
- Process 1, thread 2, running on CPU#3 still has an old TLB entry for the page (because it hasn't invalidated yet), and reads the passwords from process 2
- CPU#3 receives the IPI, but it's too late
Cheers,
Brendan