Use of INVLPG

Colonel Kernel · Post by **Colonel Kernel** » Sun Feb 01, 2009 7:27 pm

Ok, I thought about it some more.

Brendan wrote:
Colonel Kernel wrote:In the PowerPC architecture, the tlbie instruction sends a special "invalidate" message over the address bus to guarantee that all CPUs flush the right entry immediately. This is the kind of thing that should be hardware-accelerated, not A/D stuff.
That sounds like a race condition to me - on 80x86 you'd need to atomically change the page table entry and invalidate at the same time (in which case you could probably send a TLB correction rather than a TLB invalidation).

It turns out that this is only a problem in the "clearing dirty bit" case, just like on x86. Imagine you've implemented your TLB shootdown scheme by modifying the PTE, then sending an IPI, instead of sending the IPI first. In this case, the other CPUs don't need to spin -- they just get the IPI, do invlpg, then decrement the "ack counter". This is the moral equivalent of changing the PTE and using tlbie on PowerPC. The race condition inherent in this scheme has different consequences depending on the case:

For marking a page as "not present" in the case where a page is going to be freed, a few writes to that page by other threads might happen before the TLB shootdown. In the case of freeing a user page, this is a logic error in the other threads. The OS need not make any guarantees about when its "VMFree()" syscall is going to fully take effect. In the case of freeing kernel pages, I think some external synchronization is needed, like your proposed "maintenance mode".
For marking a page as "not present" in the case where a page is going to be swapped out (either written out to disk if it's dirty, or re-allocated to another process if it's clean), any extra writes by other threads will not be lost because the swapping thread (the one that marked it as "not present") isn't going to do anything until the TLB shootdown is complete anyway. The only problem is the possibility of a lost dirty bit in this case.
For marking a page as "read-only" that was formerly "read-write", I don't see why the OS needs to make any guarantees to other threads about when this change takes effect. User code should be using external synchronization in cases like this. A few extra writes snuck in by another thread in the same process isn't going to hurt the kernel, as far as I can tell.

So the only remaining evil case is the "lost dirty bit" problem, and your solution involving the scheduler makes sense there.

What do you think? Did I miss any cases? Analyzing these things hurts my head...

bewing · Post by **bewing** » Mon Feb 02, 2009 1:29 am

Honestly, with all this stuff, you guys are scaring me into holding on that much more tightly to my "all the threads of a process are bound to one and only one CPU" stance. That is, when the logic and algorithms needed to make a scheme like this "work" get this complex -- I tend to say it's time to rethink and simplify the whole concept, because it's too damned close to being unworkable.

Brendan · Post by **Brendan** » Mon Feb 02, 2009 5:22 am

Hi,

Colonel Kernel wrote:Ok, I thought about it some more.

Colonel Kernel wrote:What do you think? Did I miss any cases? Analyzing these things hurts my head...

It all works perfectly for me - I couldn't find any flaws.

But...

I can imagine a system with >= 32 CPUs, where each CPU has trouble executing more than 3 instructions in a row without being interrupted by an IPI.

Cheers,

Brendan

Colonel Kernel · Post by **Colonel Kernel** » Mon Feb 02, 2009 8:04 am

bewing wrote:Honestly, with all this stuff, you guys are scaring me into holding on that much more tightly to my "all the threads of a process are bound to one and only one CPU" stance. That is, when the logic and algorithms needed to make a scheme like this "work" get this complex -- I tend to say it's time to rethink and simplify the whole concept, because it's too damned close to being unworkable.

I know, it's crazy. If I were trying to invent something new, I wouldn't do it this way.

My goal is to learn how current systems are implemented, and go from there.

Brendan wrote:I can imagine a system with >= 32 CPUs, where each CPU has trouble executing more than 3 instructions in a row without being interrupted by an IPI.

Heh... This may be why you don't see mainstream OSes doing so well on machines with that many CPUs.

Brendan · Post by **Brendan** » Mon Feb 02, 2009 8:56 am

Hi,

bewing wrote:Honestly, with all this stuff, you guys are scaring me into holding on that much more tightly to my "all the threads of a process are bound to one and only one CPU" stance. That is, when the logic and algorithms needed to make a scheme like this "work" get this complex -- I tend to say it's time to rethink and simplify the whole concept, because it's too damned close to being unworkable.

If you always invalidate effected TLBs in all CPUs when any paging structure is modified (page table entries, page directories entries, etc) then it's relatively simple and entirely workable. Like many things, the more you try to optimize it the more complex it gets. IMHO it's only these optimizations that worry you (not the basic concepts).

This does raise an interesting point though - if you're planning on implementing optimized TLB invalidation, then it may also be worth implementing "simple as possible" TLB invalidation for regression testing (e.g. so that if the OS fails with "optimized" TLB invalidation you can switch to "simple as possible" TLB invalidation to determine if the TLB invalidation is the cause of the problem).

Colonel Kernel wrote:
Brendan wrote:I can imagine a system with >= 32 CPUs, where each CPU has trouble executing more than 3 instructions in a row without being interrupted by an IPI.
Heh... This may be why you don't see mainstream OSes doing so well on machines with that many CPUs.

Most (all?) mainstream OSs were designed back when CPU manufacturers could increase CPU clock speed to make faster CPUs. Things have changed since, and while mainstream OSs are trying to improve scalability it's hard to drastically change existing systems without breaking something (e.g. binary compatibility with existing applications). IMHO scalability is one area where an new/alternative OS can have a much needed advantage over existing/established OSs, especially if my ("double the number of CPUs every 18 to 24 months") prediction is close...

Cheers,

Brendan

Brendan · Post by **Brendan** » Thu Feb 05, 2009 12:00 am

Hi,

Brendan wrote:I can imagine a system with >= 32 CPUs, where each CPU has trouble executing more than 3 instructions in a row without being interrupted by an IPI.

I was thinking about this a little more.

It should be possible to use the process ID to limit the number of CPUs that receive an IPI by dynamically programming the "Logical Destination Register" during task switches. For example, if the "Logical Destination Register" has 8 unused bits, then you could do "bit = process_ID % 8" and set the corresponding bit in the LDR during the task switch. Idle CPUs would have no bits set. Then when you're invalidating a TLB entry for a specific process you'd send an IPI with the same ("process_ID % 8") bit set; and on average about 87.5% of "non-idle" CPUs and all idle CPUs won't receive the IPI and won't be interrupted. Note: I specifically mention idle CPUs here as they might be in a low power state (e.g. HLT) where IPIs take them out of the low power state (and therefore reduce the ability of the low power state to save power/heat).

For X2APIC (where large numbers of CPUs are more likely) it gets even better - if the process isn't pinned to a specific NUMA domain/cluster you could broadcast an IPI to all clusters and use up to 16 bits of the LDR to encode the process ID, so that on average about 93.75% of "non-idle" CPUs won't receive the IPI; and if the process is pinned to a specific NUMA domain/cluster it'd be even better.

Of course it's likely that the OS will use some of the bits in the LDR for other purposes (for e.g. I typically use up to 3 bits - one for "first logical CPU in the core", one for "first logical CPU in the chip" and one for "first logical CPU in the NUMA domain", in case I need one CPU in each core/chip/domain to do something). Even if there's only a few bits left for encoding the "process ID hash" it'd still make a significant difference though:

2 bits = 50% of "non-idle" CPUs won't receive the IPI (on average)
3 bits = 66.667%
4 bits = 75%
5 bits = 80%
6 bits = 83.333%
7 bits = 85.714%
8 bits = 87.5%
9 bits = 88.889%
10 bits = 90%
11 bits = 90.909%
12 bits = 91.667%
13 bits = 92.308%
14 bits = 92.857%
15 bits = 93.333%
16 bits = 93.75%

In general, I'd assume that it might be a very good idea to identify each situation that requires IPIs (and determine how often each of these cases occurs) and then plan the usage of bits in the LDR accordingly (possibly during boot), to minimize the number of unnecessary IPIs received for the entire system (e.g. not just for invalidating pages that belong to a process).

Cheers,

Brendan

Colonel Kernel · Post by **Colonel Kernel** » Sat Mar 21, 2009 11:41 pm

<Necromancer>
I resurrect this thread once again!

</Necromancer>

I thought about it even more, and I think (but am not 100% sure) that the "dirty bit problem" is actually not a problem in practice. It all has to do with what a reasonable memory manager would be doing at the time that the D bit is actually cleared. Please follow along with this scenario and let me know if I missed anything:

A page whose PTE's D bit is set is selected to be evicted from its current working set.
The PTE is marked "not present", with the corresponding TLB shootdown that this requires. Maybe other CPUs sneak in an extra read/write or two before the shootdown happens, but once the shootdown is complete, no CPU is referencing the dirty page.
The OS needs to do out-page I/O to write the contents of the dirty page to non-volatile storage. While the I/O is happening, other threads may try to reference the page, resulting in a "collided page fault". To capture and track these "collided page faults", the MM must leave the page marked as "not present" with some special state in the PTE to indicate that out-page I/O is in progress. It will probably maintain some kind of queue of threads waiting for the page fault to be resolved.
When the out-page I/O completes, the MM will need to mark the page as "present" and wake up all the threads that were blocked on the collided page fault. It must also clear the D bit in the PTE. However, it has the perfect opportunity to do this when it is guaranteed (due to the collided page fault mechanism) that no other threads/CPUs are using it. All it needs to do is clear the D bit and set the P bit in the same operation. All bits of the PTE are written to the page table simultaneously anyway.

I think this is the only reasonable way that this could be implemented, and if so, the "dirty bit data loss problem" is a non-issue.

Opinions...?

Brendan · Post by **Brendan** » Sun Mar 22, 2009 7:49 pm

Hi,

Colonel Kernel wrote:Opinions...?

Rather than doing one IPI per page, I'm in favour of putting the process (and all it's threads) into a "paging maintenance mode" and sending one IPI when it enters this mode.

For example:

23 pages that have the dirty bit set and 51 pages that have the dirty bit clear are all selected to be evicted from its current working set.
The originating CPU sets a "this process is in paging maintenance mode" scheduler flag (so that the scheduler won't switch to a thread that belongs to this process) and sends an "entering paging maintenance mode" IPI to other CPUs (so that if another CPU is already running a thread that belongs to this process, then the CPU stops running that thread).
The originating CPU does everything it needs to, knowing that it has exclusive access to the process.
The originating CPU clears the "this process is in paging maintenance mode" scheduler flag, and everything returns to normal. No other CPUs need to invalidate anything.

Cheers,

Brendan

Colonel Kernel · Post by **Colonel Kernel** » Mon Mar 23, 2009 9:38 am

I think you missed the point... Up until now, we have regarded clearing the D bit as a special case because doing it "atomically" is necessary for correctness (to avoid data loss). What I'm saying is that at the time the D bit is cleared, the P bit must already be cleared anyway due to the nature of how paging I/O works (i.e. -- no CPUs anywhere can be referencing the page while its contents are being written out, except maybe the one doing the I/O).

My question was, does this analysis look right? Can we basically forget about clearing D as a special case?

Your answer is about optimizing shootdowns, which I am hoping is an orthogonal concern to managing the D bit.

Combuster · Post by **Combuster** » Mon Mar 23, 2009 5:08 pm

doing it "atomically" is necessary for correctness

There is a non-atomic way to guarantuee correctness over the D bit:

1. clear D bit in memory
2. shoot down the TLB
3. copy the memory to a temporary buffer
4. check the D bit again to see which reads completed successful.
5a. if yes, you have the data ready for writing and the dirty status is correct
5b. if no, the process needs the memory, the read is not atomic, but the dirty status is correct

My question was, does this analysis look right? Can we basically forget about clearing D as a special case?

Basically what your saying is to integrate the D clear into a write-to-swap-and-page-out function. If your system always pages out a page when writing to swap then yes this would work. The real question is: is that a reasonable assumption/limitation.

Your answer is about optimizing shootdowns, which I am hoping is an orthogonal concern to managing the D bit.

In the general case, no. A write will be missed when the D bit is cleared in memory, but the TLBs haven't been updated yet. If you flush first, then the entry can be reloaded before the D bit was cleared, and all consecutive writes would be missed.
That means that a TLB flush is always required to guarantuee correctness over the dirty status. The optimisation comes from not having to issue a shootdown, but forcing the condition that the entry in question is not in any TLB by other means (like the TLB flush following a task switch)

$.02

Brendan · Post by **Brendan** » Tue Mar 24, 2009 2:24 am

Hi,

Combuster wrote:
doing it "atomically" is necessary for correctness
There is a non-atomic way to guarantuee correctness over the D bit:

1. clear D bit in memory
2. shoot down the TLB
3. copy the memory to a temporary buffer
4. check the D bit again to see which reads completed successful.
5a. if yes, you have the data ready for writing and the dirty status is correct
5b. if no, the process needs the memory, the read is not atomic, but the dirty status is correct

That seems to fit well with an idea I've been playing with. If all I/O is asynchronous, then you could:

Send an IPI to other CPUs.
Clear the accessed and dirty flags in the PTE (but not the "present" flag)
Copy the page's data into a message.
Tell other CPUs to INVLPG and return from the IPI.
Send the message to something (e.g. send a request to the swap manager asking it to store the data).
Wait for a reply from something (e.g. swap manager). While one thread waits, all other threads are free to continue doing useful work.
When the reply comes, then:
- If the request failed, set the accessed and dirty flags again and pretend nothing happened
- If the request succeeded:
  - If the dirty flag was set while the request was being handled, then the copy on disk is obsolete (and for swap space, the disk space can be freed); and the page-out operation is done
  - If the accessed flag was set (but not the dirty flag) while the request was being handled, then the copy on disk is still current (and could be kept in case the same page becomes "least recently used" again before it's modified); and the page-out operation is done
  - If the accessed flag and dirty flag are both clear then:
    - Send a second IPI to other CPUs (to ensure exclusive access to the accessed and dirty flags).
    - If the dirty flag is set now, then the copy on disk is obsolete (and for swap space, the disk space can be freed) - tell the other CPUs to return from the IPI, and the page-out operation is done
    - If the accessed flag is set now (but not the dirty flag), then the copy on disk is still current - tell the other CPUs to return from the IPI, and the page-out operation is done
    - If the accessed flag and dirty flag are both still clear then:
      - Free the physical page of RAM
      - Mark the page table entry as "not present"
      - Tell the other CPUs to INVLPG and return from the IPI
      - The page-out operation is done

This approach would (usually) cost an extra IPI, but it means other threads continue doing useful work (no blocking and only small amounts of spinning, rather than all threads potentially blocking or spinning for the entire duration of the I/O request) and an extra IPI is probably less expensive than the alternatives.

Colonel Kernel wrote:My question was, does this analysis look right? Can we basically forget about clearing D as a special case?

You can't assume that (for all possible OS designs) the dirty bit and the present bit will always be cleared at the same time; but making sure that the dirty bit and present bit are always cleared at the same time is one way to solve the "dirty bit problem". Basically, your scenario is a possible solution to the problem (rather than proof that the problem doesn't need a solution in practice).

Cheers,

Brendan

OSDev.org

Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG

Re: Use of INVLPG