Flushing TLB in an SMP environment

rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:Here's what you're trying to avoid:
  1. Process 1, thread 1, running on CPU#1 frees the page (and TLB entry is invalidated on CPU#1, and the IPI is sent to other CPUs)
  2. Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry for the page (because it hasn't invalidated yet), and reads the passwords from process 2
  4. CPU#3 receives the IPI, but it's too late
Rather perhaps this is the problem:
  1. Process 1, thread 1, running on CPU#1 frees the page (and makes it available in the free pool of pages)
  2. Process 2, thread ?, running on CPU#2 allocates the page and stores your passwords in it
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry (CPU#1 still hasn't reached the point where it invalidates TLBs with IPIs), and reads the passwords from process 2
  4. Process 1, thread 1, running on CPU#1 sends the IPIs
In order to avoid this problem, physical pages that are freed should not become reusable until there are no more pending IPIs. A usable algorithm could be built on that: rather than putting freed pages in the free pool for immediate reuse, they should be put in a "locked" pool where they cannot be reused. Every time the TLB flushing code runs, it could check whether there are any unprocessed TLB flushes left, and if not, move the "locked" pages to the free pool. Another alternative is to let the scheduler do this (probably better, since the scheduler isn't called from ISRs).

With this new algorithm, the scenario changes to this:
  1. Process 1, thread 1, running on CPU#1 frees the page and places it in a locked page-pool
  2. Process 2, thread ?, running on CPU#2 cannot allocate this page (but gets another page).
  3. Process 1, thread 2, running on CPU#3 still has an old TLB entry, and can read/modify it until it receives the IPI
  4. Process 1, thread 1, running on CPU#1 sends the IPIs
  5. CPU#2 and CPU#3 get the IPIs
  6. The scheduler on CPU#2 notices that the locked pool contains pages and that there are no pending IPIs, so it moves the page to the free pool.
I can see two implementation-related problems with this. The first is that some indication that IPIs are pending (perhaps on the CPU doing the invalidation) needs to be set before the pages are placed in the locked page pool, and remain set until the actual IPIs are sent (this might be a spinlock that also protects the locked page list). The second problem has to do with linking the locked pages into a list. It would not be safe to use the pages themselves for the links, as late writes through stale TLB entries could clobber them. Instead, a separate header needs to be allocated to link them.

OK. So the new interface should actually do one more thing: on a non-SMP system it should place the freed physical pages in the free pool, while on an SMP system it should place them in the locked page pool.

With this design, the header will contain the pages from one request, which could be one or more physical pages. The header could also contain a timestamp that can be used to decide when it is safe to move the pages to the free pool (especially under heavy load, when there might always be pending IPIs).
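
As a rough sketch of what the locked page pool and its headers might look like (all names, field sizes and helper functions below are my own assumptions, not actual RDOS code):

Code:
/* Header allocated per free request; the pages themselves must not be used
 * for the links, since stale TLB entries could still modify them. */
typedef struct locked_page_hdr {
    struct locked_page_hdr *next;     /* link into the locked pool          */
    unsigned long timestamp;          /* when the free request was issued   */
    int           page_count;         /* number of entries in pages[]       */
    unsigned long pages[8];           /* physical pages from one request    */
} locked_page_hdr;

/* hypothetical hooks into the physical memory manager */
extern void free_physical_page(unsigned long phys);
extern void free_header(locked_page_hdr *hdr);

static locked_page_hdr *locked_pool;  /* protected by the same spinlock that
                                         also guards the pending-IPI state  */

/* Called by the TLB flush code (or the scheduler) once it sees that no
 * TLB flushes are left unprocessed: move everything to the free pool.      */
static void release_locked_pages(void)
{
    locked_page_hdr *hdr = locked_pool;
    locked_pool = 0;
    while (hdr) {
        locked_page_hdr *next = hdr->next;
        for (int i = 0; i < hdr->page_count; i++)
            free_physical_page(hdr->pages[i]);
        free_header(hdr);
        hdr = next;
    }
}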
davidv1992
Member
Posts: 223
Joined: Thu Jul 05, 2007 8:58 am

Re: Flushing TLB in an SMP environment

Post by davidv1992 »

One problem I see with your suggested implementation is that the locked page pool could in theory stay locked indefinitely when the number of pending IPIs is large, which it could become on systems with many cores.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

davidv1992 wrote:One problem I see with your suggested implementation is that the locked page pool could in theory stay locked indefinitely when the number of pending IPIs is large, which it could become on systems with many cores.
Yes, this is a problem, but it could be solved with timestamps. Either it can be assumed that pages will not be accessed after some time (say 1 ms), and/or the IPI service routine could check whether all IPIs have been processed and, if so, set a timestamp indicating that all free requests older than it can be released. A "pending IPI" counter (like Brendan suggested previously) could be used for this; a counter is fast to check and update.
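
A minimal sketch of that check, assuming a global pending-IPI counter and a per-request timestamp (the names and the 1 ms figure are illustrative, not RDOS code):

Code:
extern volatile long pending_ipi_count;      /* incremented when IPIs are sent,
                                                decremented by each IPI handler */
extern unsigned long current_time_ms(void);  /* hypothetical clock source       */

#define MAX_STALE_MS 1   /* assume stale TLB entries are not used after 1 ms    */

static int request_can_be_released(unsigned long request_timestamp)
{
    /* all IPIs processed: safe to release everything */
    if (pending_ipi_count == 0)
        return 1;
    /* otherwise fall back on the age-based assumption */
    return (current_time_ms() - request_timestamp) > MAX_STALE_MS;
}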
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Going through the various calls to "FlushTLB", there are two categories of calls. One category deals with changing mappings (kernel only) or changing access attributes (user and kernel). These calls work with the original definition, as there is no page deletion involved. Then there is the page-aligned memory allocator for kernel and user. These are the calls that need to defer putting deleted physical pages in the free pool and instead put them in the locked page pool.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

I just found out that many of the online documents for invlpg are wrong. The operand is NOT a linear address, but an ordinary memory operand that goes through segment translation. There are hints about this in the Intel processor manual as well, though the strange thing is that it doesn't fault when segment limits are exceeded. In order for invlpg to work in a segmented OS, DS needs to be loaded with a flat selector (or at least a selector with base 0).
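
To illustrate the point (a sketch using GCC-style inline assembly, not RDOS's actual code): the operand goes through normal segment translation, so only with DS based at 0 does the effective address passed to invlpg equal the linear address being invalidated.

Code:
static inline void invlpg(void *linear_address)
{
    /* the memory operand below is addressed via DS, which is assumed to be
       a flat selector (base 0) at this point */
    __asm__ volatile ("invlpg (%0)" : : "r" (linear_address) : "memory");
}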
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

With invlpg now working correctly, I've also managed to turn on global pages and to mark kernel code, device-driver code and the kernel byte-aligned memory allocator as global, so there is never any need to invalidate those mappings. It works well on my single-core Pentium at least.
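
A sketch of what enabling global pages involves (illustrative C, not RDOS code): set CR4.PGE and mark the relevant kernel PTEs with the G bit, so that reloading CR3 no longer evicts them.

Code:
#define CR4_PGE     (1UL << 7)   /* CR4.PGE: enable global pages            */
#define PTE_GLOBAL  (1UL << 8)   /* G bit in a page-table entry             */

static inline void enable_global_pages(void)
{
    unsigned long cr4;
    __asm__ volatile ("mov %%cr4, %0" : "=r" (cr4));
    __asm__ volatile ("mov %0, %%cr4" : : "r" (cr4 | CR4_PGE));
}

/* PTEs for kernel code, device-driver code and the byte-aligned allocator
 * would be created with PTE_GLOBAL set. */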

EDIT: Also works on Intel Atom and AMD Athlon. The performance difference is small though.
Last edited by rdos on Tue May 10, 2011 12:16 pm, edited 1 time in total.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
Brendan wrote:I typically acquire a spinlock, set a "number of CPUs that need to handle the IPI" counter, send the IPI, then wait until the other CPUs have decreased the counter to zero before releasing the spinlock and allowing the originating CPU to continue. It's probably not the most elegant way, but it works.
More specifically, I had a generic "remote CPU call" mechanism:
  1. Acquire the spinlock for the "remote CPU call slot"
  2. Set the address of the routine you want the other CPU/s to call
  3. Set the number of CPUs that the IPI will be sent to
  4. Send the IPI/s
  5. Wait for the "number of CPUs" counter to be decreased by the remote CPUs (after they've executed the requested routine)
  6. Release the lock
I had multiple "remote CPU call slots" (although I think there were only 4 of these slots - probably not enough); and a set of functions to initiate a remote CPU call (one for "do the call on all CPUs", one for "do the call on a specific CPU", etc). It made things easier, because everything else in the kernel used it (and I didn't have code for handling IPIs scattered everywhere).
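
A sketch of how those six steps might map onto a slot structure (spinlock_t, the helpers and all names below are placeholders, not Brendan's actual API):

Code:
typedef volatile int spinlock_t;             /* placeholder lock type       */
extern void spinlock_acquire(spinlock_t *lock);
extern void spinlock_release(spinlock_t *lock);
extern int  number_of_cpus(void);
extern void send_ipi_all_except_self(int slot);

typedef struct {
    spinlock_t    lock;             /* step 1: serialises use of this slot  */
    void        (*function)(void);  /* step 2: routine the other CPUs call  */
    volatile int  cpus_remaining;   /* steps 3/5: counted down by handlers  */
} cpu_call_slot;

static cpu_call_slot slots[4];      /* "only 4 of these slots"              */

void call_on_all_other_cpus(int slot, void (*function)(void))
{
    cpu_call_slot *s = &slots[slot];

    spinlock_acquire(&s->lock);                  /* 1 */
    s->function = function;                      /* 2 */
    s->cpus_remaining = number_of_cpus() - 1;    /* 3 */
    send_ipi_all_except_self(slot);              /* 4 */
    while (s->cpus_remaining > 0)                /* 5 */
        ;                                        /* spin until all are done */
    spinlock_release(&s->lock);                  /* 6 */
}

/* The IPI handler for a slot calls s->function() and then atomically
 * decrements s->cpus_remaining. */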


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:More specifically, I had a generic "remote CPU call" mechanism:
  1. Acquire the spinlock for the "remote CPU call slot"
  2. Set the address of the routine you want the other CPU/s to call
  3. Set the number of CPUs that the IPI will be sent to
  4. Send the IPI/s
  5. Wait for the "number of CPUs" counter to be decreased by the remote CPUs (after they've executed the requested routine)
  6. Release the lock
I had multiple "remote CPU call slots" (although I think there were only 4 of these slots - probably not enough); and a set of functions to initiate a remote CPU call (one for "do the call on all CPUs", one for "do the call on a specific CPU", etc). It made things easier, because everything else in the kernel used it (and I didn't have code for handling IPIs scattered everywhere).
OK. My IPC function is built on top of some low-level multitasking primitives, and has no idea about multiple CPUs. It handles networks by running a custom (registered) IP protocol.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:
Brendan wrote:I had multiple "remote CPU call slots" (although I think there were only 4 of these slots - probably not enough); and a set of functions to initiate a remote CPU call (one for "do the call on all CPUs", one for "do the call on a specific CPU", etc). It made things easier, because everything else in the kernel used it (and I didn't have code for handling IPIs scattered everywhere).
OK. My IPC function is built on top of some low-level multitasking primitives, and has no idea about multiple CPUs. It handles networks by running a custom (registered) IP protocol.
Sorry - I shouldn't have said "remote". I meant "other CPU/s in the same computer".

Basically it's just an abstraction (that doesn't involve IPC or networking at all) that hides some lower level details (single-CPU vs. multi-CPU, xAPIC vs. x2APIC, arrangement of logical destination flags in the APIC, etc) from the rest of the kernel.

For a simple example, consider a function like "void do_function_on_all_except_self( void (*function)(void) )", which does nothing at all on single-CPU, and on multi-CPU sends an IPI (using either xAPIC or x2APIC) while taking care of any error handling (e.g. kernel panic if CPU/s fail to respond to the IPI after a time-out), doing the EOI, synchronisation, etc.
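
For instance, the single-CPU/multi-CPU split could be as simple as this (an illustrative sketch with hypothetical helper names, not Brendan's actual code):

Code:
extern int  number_of_cpus(void);
extern void broadcast_ipi_and_wait(void (*function)(void), int timeout_ms);
#define IPI_TIMEOUT_MS 100          /* arbitrary time-out for the example    */

void do_function_on_all_except_self(void (*function)(void))
{
    if (number_of_cpus() == 1)
        return;                     /* single CPU: nothing to do             */

    /* multi-CPU: send the IPI via xAPIC or x2APIC as appropriate, wait for
       all CPUs to respond, and panic on time-out */
    broadcast_ipi_and_wait(function, IPI_TIMEOUT_MS);
}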

Now consider several variations, with support for multiple "slots" (IDT entries):
  • void do_function_on_all_except_self( int slot, void (*function)(void) );
  • void do_function_on_all_including_self( int slot, void (*function)(void) );
  • void do_function_on_specific_CPU( int slot, void (*function)(void), int CPU_number);
  • void do_function_on_all_within_NUMA_node( int slot, void (*function)(void), int NUMA_node);
  • void do_function_on_one_CPU_in_each_core( int slot, void (*function)(void) );
  • void do_function_on_one_CPU_in_each_chip( int slot, void (*function)(void) );
  • void do_function_on_one_CPU_in_each_NUMA_domain( int slot, void (*function)(void) );

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

This is getting more complex. In order for other CPUs to know which TLB entries to invalidate, they need some type of data structure that at least contains the linear base address and the number of pages. This structure needs to be allocated from the byte-aligned memory pool. So, if it needs to be allocated anyway, why not add a few more members as well? It would be nice to have an array of physical page entries (if the request is a free; otherwise this array could be empty). It would also be nice to have a bitmask of CPUs that have not yet handled the request (so a particular CPU knows whether it handled it or not), and the number of remaining CPUs (so the last one can free the entries and the structure itself). While at it, let's add a link field so entries can be linked into a list.

The best thing about this is that the problem of freeing the entries themselves is solved. When a particular CPU finds that all CPUs have done their TLB invalidations, it can either free the entries and the data structure, or put the data structure on a "pending free" list.
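
A sketch of such a structure, with names and sizes guessed from the description above (not actual RDOS definitions):

Code:
struct tlb_flush_request {
    struct tlb_flush_request *next;    /* link field for the pending list    */
    unsigned long      linear_base;    /* first linear address to invalidate */
    unsigned long      page_count;     /* number of pages to invalidate      */
    unsigned long long cpu_pending;    /* bitmask: CPUs that haven't run it  */
    volatile int       cpus_remaining; /* last CPU frees pages + structure   */
    int                phys_count;     /* 0 unless the request is a free     */
    unsigned long      phys[];         /* physical pages to release          */
};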
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Flushing TLB in an SMP environment

Post by Brendan »

Hi,
rdos wrote:This is getting more complex. In order for other CPUs to know which TLB entries to invalidate, they need some type of data structure that at least contains the linear base address and the number of pages. This structure needs to be allocated from the byte-aligned memory pool. So, if it needs to be allocated anyway, why not add a few more members as well? It would be nice to have an array of physical page entries (if the request is a free; otherwise this array could be empty). It would also be nice to have a bitmask of CPUs that have not yet handled the request (so a particular CPU knows whether it handled it or not), and the number of remaining CPUs (so the last one can free the entries and the structure itself). While at it, let's add a link field so entries can be linked into a list.

The best thing about this is that the problem of freeing the entries themselves is solved. When a particular CPU finds that all CPUs have done their TLB invalidations, it can either free the entries and the data structure, or put the data structure on a "pending free" list.
The most I've ever needed for TLB shootdown was a process ID, an "address of first page" and a "number of pages". For the "run a function on other CPU/s" system you can use space on the stack for anything you like - it's safe because the "run a function on other CPU/s" code won't let the initiating CPU continue until the other CPU/s have finished.

Because each "slot" has to use a pre-allocated IDT entry for the IPI and because the data for the slot itself is small (one spinlock, one pointer and one "CPU counter") there's no real reason not to have a pre-allocated table of structures for this (e.g. in the kernel's ".bss"); so you end up with no memory management needed at all.

For one bit per CPU (rather than a counter), I'd be worried about how many bits you'd need. For x2APIC the upper limit is something like 1 million CPUs/bits, or about 128 KiB. The main difference is that you'd be able to tell which CPU is failing to respond, but I'm not too sure how that can help anything unless there's some sort of (NMI based) fault recovery behind it to restart/reset CPUs that have locked up with interrupts disabled.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Brendan wrote:The most I've ever needed for TLB shootdown was a process ID, an "address of first page" and a "number of pages". For the "run a function on other CPU/s" system you can use space on the stack for anything you like - it's safe because the "run a function on other CPU/s" code won't let the initiating CPU continue until the other CPU/s have finished.
Yes, but my idea is to let the initiating CPU continue as soon as it has set things up and flushed the TLB itself.
Brendan wrote:Because each "slot" has to use a pre-allocated IDT entry for the IPI and because the data for the slot itself is small (one spinlock, one pointer and one "CPU counter") there's no real reason not to have a pre-allocated table of structures for this (e.g. in the kernel's ".bss"); so you end up with no memory management needed at all.
Yes, I understand your method. I'm just testing an alternative as I have bad experience with waiting for CPUs to handle IPIs.
Brendan wrote:For one bit per CPU (rather than a counter), I'd be worried about how many bits you'd need. For x2APIC the upper limit is something like 1 million CPUs/bits, or about 128 KiB. The main difference is that you'd be able to tell which CPU is failing to respond, but I'm not too sure how that can help anything unless there's some sort of (NMI based) fault recovery behind it to restart/reset CPUs that have locked up with interrupts disabled.
I've limited the number of CPUs to 64 (but this can be changed if there is a reason to). That translates to 8 bytes. Today's software cannot successfully use that many cores anyway; most workloads will not be able to use more than 2-3 cores.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

It doesn't work to allocate the data structure associated with TLB invalidations from the kernel byte-aligned allocator. It seems the flushes happen in ISRs, and that allocator uses critical sections to synchronize access, which is not allowed from ISRs. It is also too slow to achieve acceptable performance. The alternative is to use one of the fixed-block allocators that are used by USB hardware drivers and similar. These are really fast and can use spinlocks to synchronize, but the blocks must be of a fixed size in order to achieve this performance. This means that larger free requests need to be split up into smaller requests that fit the format (which is no big problem, really).
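
A minimal sketch of such a fixed-block allocator (64-byte blocks carved from a pre-allocated region, protected by a spinlock with interrupts disabled so it can be used from ISRs; all names are illustrative, not RDOS code):

Code:
typedef volatile int spinlock_t;        /* placeholder lock type             */

/* hypothetical spinlock + cli/sti primitives (see the sketch further down) */
extern unsigned long spin_lock_irqsave(spinlock_t *lock);
extern void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);

struct fixed_block { struct fixed_block *next; };

static struct fixed_block *free_list;   /* 64-byte blocks, built once from a
                                           pre-allocated linear region       */
static spinlock_t pool_lock;

void *fixed_alloc(void)
{
    unsigned long flags = spin_lock_irqsave(&pool_lock);
    struct fixed_block *b = free_list;
    if (b)
        free_list = b->next;
    spin_unlock_irqrestore(&pool_lock, flags);
    return b;                           /* NULL when the pool is exhausted   */
}

void fixed_free(void *p)
{
    unsigned long flags = spin_lock_irqsave(&pool_lock);
    struct fixed_block *b = p;
    b->next = free_list;
    free_list = b;
    spin_unlock_irqrestore(&pool_lock, flags);
}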

The current size of the minimal data structure is 22 bytes. That means a fixed block size of 32 or 64 bytes would be possible. 64 is probably best, as it leaves room for 10 physical addresses.

Because linear memory would otherwise be allocated with the same allocator that is supposed to flush the TLB, the linear address space must be pre-allocated. When this space is exhausted, the system will simply have to halt until some pending requests have finished.
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Finally, I have experimental code to handle TLB invalidations with fixed-size 64-byte memory blocks. It seems to work on single-processor systems; it just needs testing on my SMP systems later today. It will solve the issue of not reusing physical memory until all affected cores have invalidated their TLBs, as well as handle the flushing with invlpg. The flushing core will just allocate the required number of memory blocks, set up which cores should flush their TLBs, link the blocks into a list, and then send IPIs to the cores that are not done and have no pending IPIs. This can be extended to surveillance of cores in order to detect lockups (and eventually restart locked-up cores).
rdos
Member
Posts: 3347
Joined: Wed Oct 01, 2008 1:55 pm

Re: Flushing TLB in an SMP environment

Post by rdos »

Well, it kind of works on SMP now, at least almost. There is some nasty "bug" in the physical memory handler. Actually, the physical memory handler should use spinlocks (and cli) instead of critical sections, especially since physical memory could be allocated in the page-fault handler, and there is no guarantee that this will be non-ISR code. This would also simplify freeing the pages after all cores have flushed their TLBs: it could then be done in the same way regardless of whether the code is called from an ISR (for instance, an IPI) or on behalf of a thread.
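
A sketch of what such a "spinlock + cli" pair could look like, usable from both ISR and thread context (illustrative GCC-style code, not the actual RDOS primitives):

Code:
typedef volatile int spinlock_t;

static inline unsigned long spin_lock_irqsave(spinlock_t *lock)
{
    unsigned long flags;
    /* save EFLAGS and disable interrupts, then take the lock */
    __asm__ volatile ("pushf; pop %0; cli" : "=r" (flags) : : "memory");
    while (__sync_lock_test_and_set(lock, 1))
        while (*lock)
            ;                       /* spin while the lock is held elsewhere */
    return flags;
}

static inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
{
    __sync_lock_release(lock);      /* release the lock with a store barrier */
    /* restore EFLAGS (re-enables interrupts only if they were enabled)      */
    __asm__ volatile ("push %0; popf" : : "r" (flags) : "memory", "cc");
}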

So, back to the last problem, and fixing a spinlock interface for modules outside of the scheduler module.