Flushing TLB in an SMP environment

rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:Here's what you're trying to avoid:
  1. Process 1, thread 1, running on CPU#1 frees the page (and TLB entry is invalidated on CPU#1, and the IPI is sent to other CPUs)
  2. Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry for the page (because it hasn't invalidated yet), and reads the passwords from process 2
  4. CPU#3 receives the IPI, but it's too late
Rather perhaps this is the problem:
  1. Process 1, thread 1, running on CPU#1 frees the page (and makes it available in the free pool of pages)
  2. Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry (CPU#1 still hasn't reached the point where it invalidates TLBs with IPIs), and reads the passwords from process 2
  4. Process 1, thread 1, running on CPU#1 sends the IPIs
In order to avoid this problem, physical pages that are freed should not become reusable until there are no more pending IPIs. A usable algorithm could be built on that: rather than putting freed pages in the free pool for immediate reuse, they should be put in a "locked" pool where they cannot be reused. Every time the TLB flushing code runs, it could check whether there are any unprocessed TLB flushes left, and if not, move the "locked" pages to the free pool. Another alternative is to let the scheduler do this (probably better, since the scheduler isn't called from ISRs).

With this new algorithm, the scenario changes to this:
  1. Process 1, thread 1, running on CPU#1 frees the page and places it in a locked page-pool
  2. Process 2, thread ?, running on CPU#2 cannot allocate this page (but gets another page).
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry, and can read/modify it until it receives the IPI
  4. Process 1, thread 1, running on CPU#1 sends the IPIs
  5. CPU#2 and CPU#3 get the IPIs
  6. The scheduler on CPU#2 notices that the locked pool contains pages and that there are no pending IPIs, so it moves the page to the free pool.
I can see two implementation-related problems with this. The first is that some indication that IPIs are pending (perhaps on the CPU doing the invalidation) needs to be set before the pages are placed in the locked page pool, and remain set until the actual IPIs are sent (this might be a spinlock that also protects the locked page list). The second problem has to do with linking the locked pages into a list. It would not be safe to use the pages themselves for the links, as late writes through stale TLB entries could clobber them. Instead, a separate header needs to be allocated to link them.

OK. So the new interface should actually do one more thing: on a non-SMP system it should place the freed physical pages in the free pool, while on an SMP system it should place them in the locked page pool.

With this design, the header will contain the pages from one request, which could be one or more physical pages. The header could also contain a timestamp that can be used to decide when it is safe to move the pages to the free pool (especially under heavy load, when there might always be pending IPIs).
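
As a rough sketch of what the locked page pool and its headers might look like (all names, field sizes and helper functions below are my own assumptions, not actual RDOS code):

Code:
/* Header allocated per free request; the pages themselves must not be used
 * for the links, since stale TLB entries could still modify them. */
typedef struct locked_page_hdr {
    struct locked_page_hdr *next;     /* link into the locked pool          */
    unsigned long timestamp;          /* when the free request was issued   */
    int           page_count;         /* number of entries in pages[]       */
    unsigned long pages[8];           /* physical pages from one request    */
} locked_page_hdr;

/* hypothetical hooks into the physical memory manager */
extern void free_physical_page(unsigned long phys);
extern void free_header(locked_page_hdr *hdr);

static locked_page_hdr *locked_pool;  /* protected by the same spinlock that
                                         also guards the pending-IPI state  */

/* Called by the TLB flush code (or the scheduler) once it sees that no
 * TLB flushes are left unprocessed: move everything to the free pool.      */
static void release_locked_pages(void)
{
    locked_page_hdr *hdr = locked_pool;
    locked_pool = 0;
    while (hdr) {
        locked_page_hdr *next = hdr->next;
        for (int i = 0; i < hdr->page_count; i++)
            free_physical_page(hdr->pages[i]);
        free_header(hdr);
        hdr = next;
    }
}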
davidv1992
Member
Posts: 223
Joined: Thu Jul 05, 2007 8:58 am

Re: Flushing TLB in an SMP environment

Post by davidv1992 »

One problem I see with your suggested implementation is that the locked page pool could in theory stay locked indefinitely when the number of pending IPIs is large, which it could become on systems with many cores.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

davidv1992 wrote:One problem I see with your suggested implementation is that the locked page pool could in theory stay locked indefinitely when the number of pending IPIs is large, which it could become on systems with many cores.
Yes, this is a problem, but it could be solved with timestamps. Either it can be assumed that pages will not be accessed after some time (say 1 ms), and/or the IPI service routine could check whether all IPIs have been processed and, if so, set a timestamp indicating that all free requests older than it can be released. A "pending IPI" counter (like Brendan suggested previously) could be used for this; a counter is fast to check and update.
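
A minimal sketch of that check, assuming a global pending-IPI counter and a per-request timestamp (the names and the 1 ms figure are illustrative, not RDOS code):

Code:
extern volatile long pending_ipi_count;      /* incremented when IPIs are sent,
                                                decremented by each IPI handler */
extern unsigned long current_time_ms(void);  /* hypothetical clock source       */

#define MAX_STALE_MS 1   /* assume stale TLB entries are not used after 1 ms    */

static int request_can_be_released(unsigned long request_timestamp)
{
    /* all IPIs processed: safe to release everything */
    if (pending_ipi_count == 0)
        return 1;
    /* otherwise fall back on the age-based assumption */
    return (current_time_ms() - request_timestamp) > MAX_STALE_MS;
}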
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Going through the various calls to "FlushTLB", there are two categories of calls. One category deals with changing mappings (kernel only) or changing access attributes (user and kernel). These calls work with the original definition, as there is no page deletion involved. Then there is the page-aligned memory allocator for kernel and user. These are the calls that need to defer putting deleted physical pages in the free pool and instead put them in the locked page pool.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

I just found out that many of the online documents for invlpg are wrong. The operand is NOT a linear address, but an ordinary memory operand that goes through segment translation. There are hints about this in the Intel processor manual as well, though the strange thing is that it doesn't fault when segment limits are exceeded. In order for invlpg to work in a segmented OS, DS needs to be loaded with a flat selector (or at least a selector with base 0).
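
To illustrate the point (a sketch using GCC-style inline assembly, not RDOS's actual code): the operand goes through normal segment translation, so only with DS based at 0 does the effective address passed to invlpg equal the linear address being invalidated.

Code:
static inline void invlpg(void *linear_address)
{
    /* the memory operand below is addressed via DS, which is assumed to be
       a flat selector (base 0) at this point */
    __asm__ volatile ("invlpg (%0)" : : "r" (linear_address) : "memory");
}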
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

With invlpg now working correctly, I've also managed to turn on global pages and to mark kernel code, device-driver code and the kernel byte-aligned memory allocator as global, so there is never any need to invalidate those mappings. It works well on my single-core Pentium at least.
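
A sketch of what enabling global pages involves (illustrative C, not RDOS code): set CR4.PGE and mark the relevant kernel PTEs with the G bit, so that reloading CR3 no longer evicts them.

Code:
#define CR4_PGE     (1UL << 7)   /* CR4.PGE: enable global pages            */
#define PTE_GLOBAL  (1UL << 8)   /* G bit in a page-table entry             */

static inline void enable_global_pages(void)
{
    unsigned long cr4;
    __asm__ volatile ("mov %%cr4, %0" : "=r" (cr4));
    __asm__ volatile ("mov %0, %%cr4" : : "r" (cr4 | CR4_PGE));
}

/* PTEs for kernel code, device-driver code and the byte-aligned allocator
 * would be created with PTE_GLOBAL set. */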

EDIT: Also works on Intel Atom and AMD Athlon. The performance difference is small though.
Last edited by rdos on Tue May 10, 2011 12:16 pm, edited 1 time in total.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
Brendan wrote:I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.
More specifically, I had a generic "remote CPU call" mechanism:
  1. Acquire the spinlock for the "remote CPU call slot"
  2. Set the address of the routine you want the other CPU/s to call
  3. Set the number of CPUs that the IPI will be sent to
  4. Send the IPI/s
  5. Wait for the "number of CPUs" counter to be decreased by the remote CPUs (after they've executed the requested routine)
  6. Release the lock
I had multiple "remote CPU call slots" (although I think there were only 4 of these slots - probably not enough); and a set of functions to initiate a remote CPU call (one for "do the call on all CPUs", one for "do the call on a specific CPU", etc). It made things easier, because everything else in the kernel used it (and I didn't have code for handling IPIs scattered everywhere).
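
A sketch of how those six steps might map onto a slot structure (spinlock_t, the helpers and all names below are placeholders, not Brendan's actual API):

Code:
typedef volatile int spinlock_t;             /* placeholder lock type       */
extern void spinlock_acquire(spinlock_t *lock);
extern void spinlock_release(spinlock_t *lock);
extern int  number_of_cpus(void);
extern void send_ipi_all_except_self(int slot);

typedef struct {
    spinlock_t    lock;             /* step 1: serialises use of this slot  */
    void        (*function)(void);  /* step 2: routine the other CPUs call  */
    volatile int  cpus_remaining;   /* steps 3/5: counted down by handlers  */
} cpu_call_slot;

static cpu_call_slot slots[4];      /* "only 4 of these slots"              */

void call_on_all_other_cpus(int slot, void (*function)(void))
{
    cpu_call_slot *s = &slots[slot];

    spinlock_acquire(&s->lock);                  /* 1 */
    s->function = function;                      /* 2 */
    s->cpus_remaining = number_of_cpus() - 1;    /* 3 */
    send_ipi_all_except_self(slot);              /* 4 */
    while (s->cpus_remaining > 0)                /* 5 */
        ;                                        /* spin until all are done */
    spinlock_release(&s->lock);                  /* 6 */
}

/* The IPI handler for a slot calls s->function() and then atomically
 * decrements s->cpus_remaining. */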


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:More specifically, I had a generic "remote CPU call" mechanism:
  1. Acquire the spinlock for the "remote CPU call slot"
  2. Set the address of the routine you want the other CPU/s to call
  3. Set the number of CPUs that the IPI will be sent to
  4. Send the IPI/s
  5. Wait for the "number of CPUs" counter to be decreased by the remote CPUs (after they've executed the requested routine)
  6. Release the lock
I had multiple "remote CPU call slots" (although I think there were only 4 of these slots - probably not enough); and a set of functions to initiate a remote CPU call (one for "do the call on all CPUs", one for "do the call on a specific CPU", etc). It made things easier, because everything else in the kernel used it (and I didn't have code for handling IPIs scattered everywhere).
OK. My IPC function is built on top of some low-level multitasking primitives, and has no idea about multiple CPUs. It handles networks by running a custom (registered) IP protocol.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:
Brendan wrote:I had multiple "remote CPU call slots" (although I think there were only 4 of these slots - probably not enough); and a set of functions to initiate a remote CPU call (one for "do the call on all CPUs", one for "do the call on a specific CPU", etc). It made things easier, because everything else in the kernel used it (and I didn't have code for handling IPIs scattered everywhere).
OK. My IPC function is built on top of some low-level multitasking primitives, and has no idea about multiple CPUs. It handles networks by running a custom (registered) IP protocol.
Sorry - I shouldn't have said "remote". I meant "other CPU/s in the same computer".

Basically it's just an abstraction (that doesn't involve IPC or networking at all) that hides some lower level details (single-CPU vs. multi-CPU, xAPIC vs. x2APIC, arrangement of logical destination flags in the APIC, etc) from the rest of the kernel.

For a simple example, consider a function like "void do_function_on_all_except_self( void (*function)(void) )", which does nothing at all on single-CPU, and on multi-CPU sends an IPI (using either xAPIC or x2APIC) while taking care of any error handling (e.g. kernel panic if CPU/s fail to respond to the IPI after a time-out), doing the EOI, synchronisation, etc.
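
For instance, the single-CPU/multi-CPU split could be as simple as this (an illustrative sketch with hypothetical helper names, not Brendan's actual code):

Code:
extern int  number_of_cpus(void);
extern void broadcast_ipi_and_wait(void (*function)(void), int timeout_ms);
#define IPI_TIMEOUT_MS 100          /* arbitrary time-out for the example    */

void do_function_on_all_except_self(void (*function)(void))
{
    if (number_of_cpus() == 1)
        return;                     /* single CPU: nothing to do             */

    /* multi-CPU: send the IPI via xAPIC or x2APIC as appropriate, wait for
       all CPUs to respond, and panic on time-out */
    broadcast_ipi_and_wait(function, IPI_TIMEOUT_MS);
}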

Now consider several variations, with support for multiple "slots" (IDT entries):
  • void do_function_on_all_except_self( int slot, void (*function)(void) );
  • void do_function_on_all_including_self( int slot, void (*function)(void) );
  • void do_function_on_specific_CPU( int slot, void (*function)(void), int CPU_number);
  • void do_function_on_all_within_NUMA_node( int slot, void (*function)(void), int NUMA_node);
  • void do_function_on_one_CPU_in_each_core( int slot, void (*function)(void) );
  • void do_function_on_one_CPU_in_each_chip( int slot, void (*function)(void) );
  • void do_function_on_one_CPU_in_each_NUMA_domain( int slot, void (*function)(void) );

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

This is getting more complex. In order for other CPUs to know which TLB entries to invalidate, they need some type of data structure that at least contains the linear base address and the number of pages. This structure needs to be allocated from the byte-aligned memory pool. So, if it needs to be allocated anyway, why not add a few more members as well? It would be nice to have an array of physical page entries (if the request is a free; otherwise this array could be empty). It would also be nice to have a bitmask of CPUs that have not yet handled the request (so a particular CPU knows whether it handled it or not), and the number of remaining CPUs (so the last one can free the entries and the structure itself). While at it, let's add a link field so entries can be linked into a list.

The best thing about this is that the problem of freeing the entries themselves is solved. When a particular CPU finds that all CPUs have done their TLB invalidations, it can either free the entries and the data structure, or put the data structure on a "pending free" list.
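
A sketch of such a structure, with names and sizes guessed from the description above (not actual RDOS definitions):

Code:
struct tlb_flush_request {
    struct tlb_flush_request *next;    /* link field for the pending list    */
    unsigned long      linear_base;    /* first linear address to invalidate */
    unsigned long      page_count;     /* number of pages to invalidate      */
    unsigned long long cpu_pending;    /* bitmask: CPUs that haven't run it  */
    volatile int       cpus_remaining; /* last CPU frees pages + structure   */
    int                phys_count;     /* 0 unless the request is a free     */
    unsigned long      phys[];         /* physical pages to release          */
};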
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:This is getting more complex. In order for other CPUs to know which TLB entries to invalidate, they need some type of data structure that at least contains the linear base address and the number of pages. This structure needs to be allocated from the byte-aligned memory pool. So, if it needs to be allocated anyway, why not add a few more members as well? It would be nice to have an array of physical page entries (if the request is a free; otherwise this array could be empty). It would also be nice to have a bitmask of CPUs that have not yet handled the request (so a particular CPU knows whether it handled it or not), and the number of remaining CPUs (so the last one can free the entries and the structure itself). While at it, let's add a link field so entries can be linked into a list.

The best thing about this is that the problem of freeing the entries themselves is solved. When a particular CPU finds that all CPUs have done their TLB invalidations, it can either free the entries and the data structure, or put the data structure on a "pending free" list.
The most I've ever needed for TLB shootdown was a process ID, an "address of first page" and a "number of pages". For the "run a function on other CPU/s" system you can use space on the stack for anything you like - it's safe because the "run a function on other CPU/s" code won't let the initiating CPU continue until the other CPU/s have finished.

Because each "slot" has to use a pre-allocated IDT entry for the IPI and because the data for the slot itself is small (one spinlock, one pointer and one "CPU counter") there's no real reason not to have a pre-allocated table of structures for this (e.g. in the kernel's ".bss"); so you end up with no memory management needed at all.

For one bit per CPU (rather than a counter), I'd be worried about how many bits you'd need. For x2APIC the upper limit is something like 1 million CPUs/bits, or about 128 KiB. The main difference is that you'd be able to tell which CPU is failing to respond, but I'm not too sure how that can help anything unless there's some sort of (NMI based) fault recovery behind it to restart/reset CPUs that have locked up with interrupts disabled.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:The most I've ever needed for TLB shootdown was a process ID, an "address of first page" and a "number of pages". For the "run a function on other CPU/s" system you can use space on the stack for anything you like - it's safe because the "run a function on other CPU/s" code won't let the initiating CPU continue until the other CPU/s have finished.
Yes, but my idea is to let the initiating CPU continue as soon as it has set things up and flushed the TLB itself.
Brendan wrote:Because each "slot" has to use a pre-allocated IDT entry for the IPI and because the data for the slot itself is small (one spinlock, one pointer and one "CPU counter") there's no real reason not to have a pre-allocated table of structures for this (e.g. in the kernel's ".bss"); so you end up with no memory management needed at all.
Yes, I understand your method. I'm just testing an alternative as I have bad experience with waiting for CPUs to handle IPIs.
Brendan wrote:For one bit per CPU (rather than a counter), I'd be worried about how many bits you'd need. For x2APIC the upper limit is something like 1 million CPUs/bits, or about 128 KiB. The main difference is that you'd be able to tell which CPU is failing to respond, but I'm not too sure how that can help anything unless there's some sort of (NMI based) fault recovery behind it to restart/reset CPUs that have locked up with interrupts disabled.
I've limited the number of CPUs to 64 (but this can be changed if there is a reason to). That translates to 8 bytes. Today's software cannot successfully use that many cores anyway; most workloads will not be able to use more than 2-3 cores.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

It doesn't work to allocate the data structure associated with TLB invalidations from the kernel byte-aligned allocator. It seems the flushes happen in ISRs, and that allocator uses critical sections to synchronize access, which is not allowed from ISRs. It is also too slow to achieve acceptable performance. The alternative is to use one of the fixed-block allocators that are used by USB hardware drivers and similar. These are really fast and can use spinlocks to synchronize, but the blocks must be of a fixed size in order to achieve this performance. This means that larger free requests need to be split up into smaller requests that fit the format (which is no big problem, really).
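
A minimal sketch of such a fixed-block allocator (64-byte blocks carved from a pre-allocated region, protected by a spinlock with interrupts disabled so it can be used from ISRs; all names are illustrative, not RDOS code):

Code:
typedef volatile int spinlock_t;        /* placeholder lock type             */

/* hypothetical spinlock + cli/sti primitives (see the sketch further down) */
extern unsigned long spin_lock_irqsave(spinlock_t *lock);
extern void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);

struct fixed_block { struct fixed_block *next; };

static struct fixed_block *free_list;   /* 64-byte blocks, built once from a
                                           pre-allocated linear region       */
static spinlock_t pool_lock;

void *fixed_alloc(void)
{
    unsigned long flags = spin_lock_irqsave(&pool_lock);
    struct fixed_block *b = free_list;
    if (b)
        free_list = b->next;
    spin_unlock_irqrestore(&pool_lock, flags);
    return b;                           /* NULL when the pool is exhausted   */
}

void fixed_free(void *p)
{
    unsigned long flags = spin_lock_irqsave(&pool_lock);
    struct fixed_block *b = p;
    b->next = free_list;
    free_list = b;
    spin_unlock_irqrestore(&pool_lock, flags);
}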

The current size of the minimal data structure is 22 bytes. That means a fixed block size of 32 or 64 bytes would be possible. 64 is probably best, as it leaves room for 10 physical addresses.

Because linear memory would otherwise be allocated with the same allocator that is supposed to flush the TLB, the linear address space must be pre-allocated. When this space is exhausted, the system will simply have to halt until some pending requests have finished.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Finally, I have experimental code to handle TLB invalidations with fixed-size 64-byte memory blocks. It seems to work on single-processor systems; it just needs testing on my SMP systems later today. It will solve the issue of not reusing physical memory until all affected cores have invalidated their TLBs, as well as handle the flushing with invlpg. The flushing core will just allocate the required number of memory blocks, set up which cores should flush their TLBs, link the blocks into a list, and then send IPIs to the cores that are not done and have no pending IPIs. This can be extended to surveillance of cores in order to detect lockups (and eventually restart locked-up cores).
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Well, it kind of works on SMP now, at least almost. There is some nasty "bug" in the physical memory handler. Actually, the physical memory handler should use spinlocks (and cli) instead of critical sections, especially since physical memory could be allocated in the page-fault handler, and there is no guarantee that this will be non-ISR code. This would also simplify freeing the pages after all cores have flushed their TLBs: it could then be done in the same way regardless of whether the code is called from an ISR (for instance, an IPI) or on behalf of a thread.
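
A sketch of what such a "spinlock + cli" pair could look like, usable from both ISR and thread context (illustrative GCC-style code, not the actual RDOS primitives):

Code:
typedef volatile int spinlock_t;

static inline unsigned long spin_lock_irqsave(spinlock_t *lock)
{
    unsigned long flags;
    /* save EFLAGS and disable interrupts, then take the lock */
    __asm__ volatile ("pushf; pop %0; cli" : "=r" (flags) : : "memory");
    while (__sync_lock_test_and_set(lock, 1))
        while (*lock)
            ;                       /* spin while the lock is held elsewhere */
    return flags;
}

static inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
{
    __sync_lock_release(lock);      /* release the lock with a store barrier */
    /* restore EFLAGS (re-enables interrupts only if they were enabled)      */
    __asm__ volatile ("push %0; popf" : : "r" (flags) : "memory", "cc");
}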

So, back to the last problem, and fixing a spinlock interface for modules outside of the scheduler module.