Use of INVLPG
Posted: Mon Oct 20, 2008 5:32 pm
by CodeCat
As I was reading about paging-related stuff, I ran into the INVLPG instruction. From what I've gathered, it flushes part of the MMU's page cache, so that it re-fetches that page table entry from memory. But what I'm not clear on is when it should be used. Should I use it every time I change a page table entry? And what if I change a page directory entry, should I just loop over all 1024 entries, INVLPG-ing each one?
Re: Use of INVLPG
Posted: Mon Oct 20, 2008 8:50 pm
by geppy
Re: Use of INVLPG
Posted: Mon Oct 20, 2008 11:19 pm
by egos
It flushes the TLB entry for a specific page. You use it when the entry for that page may still be in the TLB but the mapping is no longer valid. To flush a large number of page entries you can reload CR3 instead, but only if they are not global.
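For reference, the two operations look something like this in C with GCC inline assembly (a minimal sketch for a 32-bit kernel; the helper names are just examples):
Code:
/* Invalidate the TLB entry for the page containing addr. */
static inline void invlpg(void *addr)
{
    asm volatile("invlpg (%0)" : : "r"(addr) : "memory");
}

/* Reloading CR3 flushes all non-global TLB entries at once. */
static inline void flush_tlb_all(void)
{
    unsigned long cr3;
    asm volatile("mov %%cr3, %0" : "=r"(cr3));
    asm volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}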
Re: Use of INVLPG
Posted: Tue Oct 21, 2008 5:31 am
by CodeCat
That file you posted, geppy, was really helpful. But it doesn't answer one question I still have: if you want to remove a whole PDE, do you have to call INVLPG on all 1024 pages it maps, or is reloading CR3 faster in this case? The reason I'm asking is that I initially have the first 4 MB both identity-mapped and higher-half mapped, and I want to unmap the identity mapping, but I'm not sure how to go about that.
Re: Use of INVLPG
Posted: Tue Oct 21, 2008 5:42 am
by samueldotj
I assume all your kernel mappings have the global bit set, so reloading CR3 might not help you.
The way I solve this problem is that my first 4 MB identity mapping is a single 4 MB large page (no page table), so invalidating it takes just one INVLPG instruction.
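Roughly, assuming the identity mapping lives in the first page directory entry (page_directory here is a hypothetical pointer to the current page directory):
Code:
#include <stdint.h>

/* Drop the 4 MB identity-mapped large page (PDE 0) and flush its one TLB entry. */
void unmap_identity_mapping(volatile uint32_t *page_directory)
{
    page_directory[0] = 0;
    asm volatile("invlpg (%0)" : : "r"((void *)0x00000000) : "memory");
}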
Re: Use of INVLPG
Posted: Tue Oct 21, 2008 7:08 am
by egos
Invalidating the PDE's mapping is only required if you mapped the page table itself into the virtual address space and accessed it through that mapping. If not, you only need to invalidate the PTEs that were actually used for translation when accessing the memory region built on the corresponding pages. If there are too many pages, reloading CR3 may be the better way.
Re: Use of INVLPG
Posted: Wed Oct 22, 2008 12:27 am
by Brendan
Hi,
samueldotj wrote:I assume all your kernel mappings have the global bit set, so reloading CR3 might not help you.
The way I solve this problem is that my first 4 MB identity mapping is a single 4 MB large page (no page table), so invalidating it takes just one INVLPG instruction.
I'm not sure if your first 4 MiB area is the area from 0x00000000 to 0x003FFFFF or not, but just in case...
Intel's System Programming Guide wrote:
10.11.9. Large Page Size Considerations
The MTRRs provide memory typing for a limited number of regions that have a 4 KByte granularity (the same granularity as 4-KByte pages). The memory type for a given page is cached in the processor’s TLBs. When using large pages (2 or 4 MBytes), a single page-table entry covers multiple 4-KByte granules, each with a single memory type. Because the memory type for a large page is cached in the TLB, the processor can behave in an undefined manner if a large page is mapped to a region of memory that MTRRs have mapped with multiple memory types.
Undefined behavior can be avoided by insuring that all MTRR memory-type ranges within a large page are of the same type. If a large page maps to a region of memory containing different MTRR-defined memory types, the PCD and PWT flags in the page-table entry should be set for the most conservative memory type for that range. For example, a large page used for memory mapped I/O and regular memory is mapped as UC memory. Alternatively, the operating system can map the region using multiple 4-KByte pages each with its own memory type.
The requirement that all 4-KByte ranges in a large page are of the same memory type implies that large pages with different memory types may suffer a performance penalty, since they must be marked with the lowest common denominator memory type.
The Pentium 4, Intel Xeon, and P6 family processors provide special support for the physical memory range from 0 to 4 MBytes, which is potentially mapped by both the fixed and variable MTRRs. This support is invoked when a Pentium 4, Intel Xeon, or P6 family processor detects a large page overlapping the first 1 MByte of this memory range with a memory type that conflicts with the fixed MTRRs. Here, the processor maps the memory range as multiple 4-KByte pages within the TLB. This operation insures correct behavior at the cost of performance. To avoid this performance penalty, operating-system software should reserve the large page option for regions of memory at addresses greater than or equal to 4 MBytes.
If the CPU automatically splits the large page into many separate 4 KiB TLB entries, then one INVLPG probably won't work, and you'd probably still need to do 1024 separate INVLPG's.
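If you have to fall back to that, it's just a loop (a sketch, using an invlpg() wrapper around the instruction):
Code:
/* Invalidate every 4 KiB page in the 0 - 4 MiB region, one INVLPG at a time. */
void invalidate_first_4mib(void)
{
    for (unsigned long addr = 0; addr < 0x400000; addr += 0x1000)
        invlpg((void *)addr);
}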
Cheers,
Brendan
Re: Use of INVLPG
Posted: Wed Oct 22, 2008 3:12 am
by Brendan
Hi,
General notes...
You are meant to invalidate the TLB entries any time you change the paging structures. This includes changing the dirty or accessed flags (e.g. if you clear them as part of an algorithm to decide which pages to send to swap space and the CPU still thinks the flags are set because you didn't invalidate the TLB entries, then the CPU won't set the accessed/dirty bits again and you'll think the pages aren't being used and send them to swap when they are being used). It also includes changing a page from "not present" to "present" (some CPUs do remember that a page is "not present"), or changing the PAT field, the read/write/execute permissions, the global flag, or the base address field.
The only thing you can change (without worrying about the TLB) in entries with the "present" bit set is the "available" bits. For entries with the "present" bit clear, you can change anything you like (except the "present" bit itself) without worrying about the TLB.
If you change a page directory entry, then you're meant to invalidate everything that could've been using that page directory entry. For plain 32-bit paging that means 1024 pages, for PAE and long mode it's 512 pages. For PAE and long mode the same applies to changing page directory pointer table entries, page map level 4 entries, etc (e.g. with PAE, if you change a PDPT entry you're meant to invalidate 262144 pages).
Reloading CR3 is an option (unless "global" pages are involved); it may or may not be a good idea (reloading CR3 may or may not be faster than doing 1024 INVLPG's). However, it opens up potential optimizations - for example, if a thread has almost used up the CPU time that the scheduler gave it, then you might be able to do the task switch early (if doing a task switch reloads CR3), which would be much better than invalidating everything and then doing a task switch that invalidates everything again. I'd be very careful with this idea though, especially if spinlocks might have been acquired.
Also don't forget that different CPUs support different things. 80386 doesn't support INVLPG and you have to reload CR3. 80486 and Pentium don't support global pages (so you could reload CR3 in kernel space instead of doing lots of INVLPGs in this case).
If page tables (and page directories, etc for PAE and long mode) are used in more than one place, then you'd need to invalidate multiple areas. For an example (for plain 32-bit paging), a lot of people insert the page directory into itself, so they end up with a 4 MiB mapping of all page table entries. In this case, if you change a page directory entry you'd need to invalidate 1024 pages plus one extra page in the mapping of all page table entries.
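With that trick, removing a page directory entry looks something like this (a sketch assuming the page directory is mapped into itself through slot 1023, so the 4 MiB window of page tables starts at 0xFFC00000; the names are made up):
Code:
#include <stdint.h>

/* Remove page directory entry pd_index and invalidate everything that
   could have been translated through it. */
void remove_pde(volatile uint32_t *page_directory, unsigned pd_index)
{
    page_directory[pd_index] = 0;

    /* The 1024 pages in the 4 MiB region this PDE used to cover... */
    unsigned long base = (unsigned long)pd_index << 22;
    for (unsigned long addr = base; addr < base + 0x400000; addr += 0x1000)
        invlpg((void *)addr);

    /* ...plus the one extra page where the old page table was visible in
       the recursive mapping window. */
    invlpg((void *)(0xFFC00000u + pd_index * 0x1000));
}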
For multi-CPU systems things get more complicated, because you need to invalidate the TLB on the current CPU and on all other CPUs that may have the TLB entry (something called "TLB shootdown"). Typically this is done with IPIs (InterProcessor Interrupts). For example, if you change a page table entry in kernel space you'd invalidate it on the current CPU, set a CPU counter to the number of other CPUs, send a "broadcast to all except self" IPI and then wait for the CPU counter to reach zero. The other CPUs would receive the IPI (interrupting anything they were doing), invalidate the page, decrease the CPU counter, then return to whatever they were doing.
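A sketch of that, assuming the counter is decremented atomically and the APIC routines, vector number and cpu_count are placeholders:
Code:
/* One shootdown at a time (a real kernel would protect this with a lock). */
static void *shootdown_addr;
static volatile int shootdown_pending;

/* Initiating CPU: a kernel-space page table entry was changed for addr. */
void tlb_shootdown(void *addr)
{
    shootdown_addr = addr;
    shootdown_pending = cpu_count - 1;          /* every CPU except this one */
    invlpg(addr);                               /* invalidate locally first  */
    apic_send_ipi_all_but_self(TLB_SHOOTDOWN_VECTOR);
    while (shootdown_pending > 0)
        ;                                       /* wait for the others       */
}

/* IPI handler, runs on every other CPU. */
void tlb_shootdown_handler(void)
{
    invlpg(shootdown_addr);
    __sync_fetch_and_sub(&shootdown_pending, 1);
    apic_send_eoi();
}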
TLB shootdown is expensive - it trashes pipelines, etc on all CPUs in the system, and the more CPUs there are the worse it gets. For example, if each CPU does a TLB shootdown once per second (on average), then for 32 CPUs you'd get an average of 32 TLB shootdowns per second. It's even worse if you're changing page directory entries or something and need to flush large areas on many CPUs (in this case you get the IPI costs plus lots of TLB misses everywhere).
If you change a page table entry or page directory entry in user-space, it'd be nice if you didn't have to send an IPI to every CPU (e.g. sending an IPI to all CPUs, where each CPU does "if(affected_process == my_current_threads_process) then INVLPG"). There are two things that make sending an IPI to every CPU difficult to avoid. The first is determining, without race conditions, which other CPUs may be running threads that belong to the same process. For example, you could build a list of CPUs that are running threads that belong to the same process and then send the IPIs, but a CPU might do a thread switch after you've built the list but before you sent the IPIs. The other problem is the IPIs themselves - rather than sending several separate IPIs (one for each CPU that's running a thread that belongs to the same process) it'd be nice if you could broadcast one IPI to just the CPUs that need it. This actually is possible in some situations (e.g. logical destination mode, with a different bit set in each CPU's "logical destination register") but isn't possible when there are more than 8 CPUs (unless you're using x2APIC, in which case it doesn't work for more than 32 CPUs).
There are a few easy ways to avoid this. If the process only has one thread then you know that no other CPU could be running threads that belong to the same process, and you can skip the TLB shootdown. Also, if the process has CPU affinity (where its threads may only run on certain CPUs) then you know you won't need to send "TLB shootdown IPIs" to CPUs that aren't in the process's CPU affinity (unless the process's CPU affinity may have changed, or may change while you're sending the IPIs).
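A sketch of those checks (struct process, cpu_mask_t and current_cpu_mask() are hypothetical names; cpu_mask_t is just a bitmask of CPUs):
Code:
/* Which CPUs (other than this one) need a "TLB shootdown" IPI for a
   user-space mapping change in this process? */
cpu_mask_t shootdown_targets(struct process *proc)
{
    if (proc->thread_count == 1)
        return 0;               /* no other CPU can be running the process */

    /* Only CPUs in the affinity mask can have stale entries (assuming the
       affinity mask can't change while we're doing this). */
    return proc->cpu_affinity & ~current_cpu_mask();
}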
For a major difference in TLB shootdown overhead, you can implement something called "lazy TLB invalidation". The theory here is that if a CPU using stale TLB information would generate a page fault, then you can skip the TLB shootdown and the page fault handler can do the TLB invalidation if and when the page fault occurs. In this case the page fault handler needs to check CR2 and walk the paging structures to see if the page fault was caused by a stale TLB entry (or something else), and if it was caused by a stale TLB entry it can INVLPG and IRET. This can avoid about half the IPIs, etc.
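In the page fault handler that check looks something like this (a sketch; translation_allows() stands for walking the paging structures for the faulting address and error code, and handle_real_page_fault() is a placeholder):
Code:
void page_fault_handler(unsigned long error_code)
{
    void *fault_addr;
    asm volatile("mov %%cr2, %0" : "=r"(fault_addr));

    /* If the current paging structures already allow this access, the fault
       was caused by a stale TLB entry: refresh it and retry the instruction. */
    if (translation_allows(fault_addr, error_code)) {
        invlpg(fault_addr);
        return;                         /* the IRET retries the access */
    }

    handle_real_page_fault(fault_addr, error_code);
}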
For an example, imagine if a page table entry says the page is "not present" and you allocate a page and change the page table entry to "present". In this case you'd do INVLPG on the current CPU like normal but you don't need to send any IPIs anywhere. It also works for a few other situations, like making a read-only page writable or executable. Lazy TLB invalidation can't be used for a lot of things though - freeing pages, clearing dirty and accessed flags, reducing permissions (e.g. making a previously read/write page into a read-only page), changing PAT fields, etc.
There's also one special case I didn't mention - if the "global" flag is clear and the only thing you do is set the "global" flag, then you don't need to invalidate any TLB entry (not even on the current CPU).
Of course it's usually safe to assume that the only really important thing is allocating and freeing pages, because usually the other page table flags are set correctly when a page is allocated and aren't changed after that (until the page is freed). This means other operations (like changing permission flags, PAT entries, etc) don't really happen (except for clearing the dirty and accessed flags, but that only happens for pages not page directories, etc).
If I haven't completely confused you, then you'll see that the most expensive operation is freeing pages (and page tables, etc) in kernel space, because you can't use lazy TLB invalidation for it and (unlike pages, etc in user-space) it affects all CPUs. There's a simple solution to avoid the overhead here - don't bother freeing stuff. If the kernel is constantly allocating and freeing pages, then not freeing pages means you don't need to do the TLB shootdown and also means you can "allocate" the pages again quickly. Of course whether this is a good idea or not depends on what the kernel is using pages for. For example, if the kernel occasionally allocates a large number of pages for temporary use, then it'd be a bad idea (lots of pages allocated for no reason that could be used for something more important); but even in this case you could postpone freeing the pages until idle time or until those pages are actually needed elsewhere.
Cheers,
Brendan
Re: Use of INVLPG
Posted: Wed Oct 22, 2008 4:17 am
by ineo
Very interesting reply Brendan.
Thanks for sharing your knowledge.
Re: Use of INVLPG
Posted: Wed Dec 31, 2008 11:22 am
by sawdust
Brendan wrote:Hi,
General notes...
You are meant to invalidate the TLB entries any time you change the paging structures. This includes changing the dirty or accessed flags (e.g. if you clear them as part of an algorithm to decide .......
Cheers,
Brendan
Have you ever considered compiling your postings and publishing a book?
Re: Use of INVLPG
Posted: Sat Jan 31, 2009 6:51 pm
by Colonel Kernel
Sorry to commit thread necromancy... I have finally re-started my OS project after a three-year hiatus and I'm working on designing my TLB shoot-down scheme. This caught my attention:
Brendan wrote:You are meant to invalidate the TLB entries any time you change the paging structures. This includes changing the dirty or accessed flags (e.g. if you clear them as part of an algorithm to decide which pages to send to swap space and the CPU still thinks the flags are set because you didn't invalidate the TLB entries, then the CPU won't set the accessed/dirty bits again and you'll think the pages aren't being used and send them to swap when they are being used).
Correct me if I'm wrong... I think a different scheme needs to be used for TLB shoot-down due to clearing accessed/dirty bits versus clearing the "present" flag or changing a formerly read-write page to be read-only. Otherwise, there is a race condition that would cause the following to happen:
- CPU 0 clears accessed/dirty bits in PTE.
- CPU 1, which has that PTE in its TLB with accessed and/or dirty bits set, writes to that page. CPU 1, seeing that "dirty" is set in the TLB, doesn't bother to set it in the PTE.
- CPU 0 initiates a TLB shootdown, but it's too late -- the page is "dirty", and CPU 0 doesn't know it.
Question 1: Is this a real problem, or am I imagining things?
Question 2: If it is a real problem, does this solve it:
CPU 0 sets the page as "not present" and writes a special signature into the "available" bits. This will cause a page fault on any CPU that tries to access the page and doesn't already have the PTE in its TLB. Next, CPU 0 triggers a TLB shoot-down. After the shootdown, every other CPU that may be trying to access the page enters the page fault handler. The PF handler has special logic in the "lazy TLB invalidation" case for the "not present with special signature" sub-case. This special logic will keep the CPUs stuck there spinning on the "present" bit in the PTE until CPU 0 sets it. Finally, CPU 0 clears the accessed/dirty bits and sets the present bit, causing the other CPUs to stop spinning, invalidate the TLB entry once again, and return. At this point, all CPUs will pick up the correct PTE.
I can post pseudo-code if it will help.
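Something like this, roughly (a sketch; lookup_pte() and the constants are made-up names, and the "available" bits are bits 9-11 of the PTE):
Code:
#include <stdint.h>

#define PTE_PRESENT        0x001u
#define PTE_AVAIL_MASK     0xE00u   /* bits 9-11, ignored by the MMU   */
#define AD_CLEAR_SIGNATURE 0x200u   /* "CPU 0 is clearing A/D bits"    */

/* Returns 1 if the fault was the "A/D clearing in progress" case and was
   handled, 0 if it's a normal page fault. */
int handle_ad_clear_fault(unsigned long fault_addr)
{
    volatile uint32_t *pte = lookup_pte(fault_addr);   /* hypothetical walk */

    if ((*pte & PTE_PRESENT) || (*pte & PTE_AVAIL_MASK) != AD_CLEAR_SIGNATURE)
        return 0;                       /* not our special case */

    while (!(*pte & PTE_PRESENT))
        ;                               /* spin until CPU 0 restores the PTE */

    invlpg((void *)fault_addr);         /* pick up the corrected entry */
    return 1;                           /* the IRET retries the access */
}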
BTW, does anyone else think that having the hardware set the A/D flags in the PTEs is the stupidest design decision Intel ever made...?
Re: Use of INVLPG
Posted: Sun Feb 01, 2009 12:29 am
by Brendan
Hi,
Colonel Kernel wrote:Brendan wrote:You are meant to invalidate the TLB entries any time you change the paging structures. This includes changing the dirty or accessed flags (e.g. if you clear them as part of an algorithm to decide which pages to send to swap space and the CPU still thinks the flags are set because you didn't invalidate the TLB entries, then the CPU won't set the accessed/dirty bits again and you'll think the pages aren't being used and send them to swap when they are being used).
Correct me if I'm wrong... I think a different scheme needs to be used for TLB shoot-down due to clearing accessed/dirty bits versus clearing the "present" flag or changing a formerly read-write page to be read-only. Otherwise, there is a race condition that would cause the following to happen:
- CPU 0 clears accessed/dirty bits in PTE.
- CPU 1, which has that PTE in its TLB with accessed and/or dirty bits set, writes to that page. CPU 1, seeing that "dirty" is set in the TLB, doesn't bother to set it in the PTE.
- CPU 0 initiates a TLB shootdown, but it's too late -- the page is "dirty", and CPU 0 doesn't know it.
Question 1: Is this a real problem, or am I imagining things?
It is a real problem. From Intel's "TLBs, Paging-Structure Caches, and Their Invalidation Application Note" (in Section 5.3 "Optional Invalidation"):
Intel wrote:- If a paging-structure entry is modified to transition the accessed bit from 1 to 0, failure to perform an invalidation may result in the processor not setting that bit in response to a subsequent access to a linear address whose translation may use the entry. Software cannot interpret the bit being clear as an indication that such an access has not occurred.
- If a PTE is modified to transition the dirty bit from 1 to 0, failure to perform an invalidation may result in the processor not setting that bit in response to a subsequent write to a linear address whose translation may use the entry. Software cannot interpret the bit being clear as an indication that such a write has not occurred.
I'd assume this is in the "Optional Invalidation" section because it is actually optional - if an OS doesn't use the accessed and dirty flags for anything then it won't matter, and if an OS does use the accessed and dirty flags the worst case might not be too bad anyway (for example, if these flags are being used to determine which pages are the best pages to send to swap space, then worst case might be making a bad decision occasionally, which isn't necessarily that bad considering any code that tries to determine which pages are the best pages to send to swap space will be making guesses anyway).
Colonel Kernel wrote:Question 2: If it is a real problem, does this solve it:
CPU 0 sets the page as "not present" and writes a special signature into the "available" bits. This will cause a page fault on any CPU that tries to access the page and doesn't already have the PTE in its TLB. Next, CPU 0 triggers a TLB shoot-down. After the shootdown, every other CPU that may be trying to access the page enters the page fault handler. The PF handler has special logic in the "lazy TLB invalidation" case for the "not present with special signature" sub-case. This special logic will keep the CPUs stuck there spinning on the "present" bit in the PTE until CPU 0 sets it. Finally, CPU 0 clears the accessed/dirty bits and sets the present bit, causing the other CPUs to stop spinning, invalidate the TLB entry once again, and return. At this point, all CPUs will pick up the correct PTE.
If you have to do TLB shootdown anyway, then I'd:
- Send an IPI to all CPUs
- Then, wait for all other CPUs to enter the IPI handler and start spinning
- Then, clear the accessed and/or dirty bits on one CPU
- Then, make all CPUs invalidate the page in their TLBs and tell other CPUs that are spinning that they can continue
This is the same method I'd use to do TLB shootdown for any other purpose.
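As a rough sketch (the APIC routines, the vector, cpu_count, and the PTE flag constants are placeholders, and a real version would serialize shootdowns with a lock):
Code:
#include <stdint.h>

static void *shootdown_page;
static volatile int cpus_waiting;
static volatile int release_cpus;

/* Initiating CPU: clear the accessed/dirty bits in *pte for 'page'. */
void clear_ad_bits(volatile uint32_t *pte, void *page)
{
    shootdown_page = page;
    cpus_waiting = 0;
    release_cpus = 0;
    apic_send_ipi_all_but_self(TLB_SHOOTDOWN_VECTOR);

    while (cpus_waiting != cpu_count - 1)
        ;                                   /* wait until everyone is spinning */

    *pte &= ~(PTE_ACCESSED | PTE_DIRTY);    /* no CPU can race the update now  */
    invlpg(page);                           /* local invalidation              */
    release_cpus = 1;                       /* let the other CPUs continue     */
}

/* IPI handler on the other CPUs. */
void ad_shootdown_handler(void)
{
    __sync_fetch_and_add(&cpus_waiting, 1);
    while (!release_cpus)
        ;                                   /* spin until the PTE is updated   */
    invlpg(shootdown_page);
    apic_send_eoi();
}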
The hard part is improving efficiency - doing what you can to reduce the number of CPUs that need to handle each specific "TLB shootdown" IPI, and reducing the time CPUs spend waiting before they can return from the "TLB shootdown" IPI.
Colonel Kernel wrote:BTW, does anyone else think that having the hardware set the A/D flags in the PTEs is the stupidest design decision Intel ever made...?
For CPUs that don't have A/D flags in hardware, some OSs set the pages to "not present" so they can emulate the A/D flags in software (in the page fault handler). Basically it costs a lot more overhead, and you'd still have to deal with race conditions.
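The usual trick looks roughly like this (a sketch: the page really has a frame behind it, but it's kept "not present" and read-only so the first access and the first write fault; the software A/D bits live in the PTE's available bits, and all the names here are made up):
Code:
#include <stdint.h>

/* Software A/D emulation in the page fault handler. Returns 1 if handled. */
int emulate_ad_fault(unsigned long fault_addr, unsigned long error_code)
{
    volatile uint32_t *pte = lookup_pte(fault_addr);

    if (!(*pte & SW_FRAME_VALID))
        return 0;                          /* a genuine "not present" fault  */

    *pte |= SW_ACCESSED | PTE_PRESENT;     /* first access: mark "accessed"  */
    if (error_code & PF_WRITE)
        *pte |= SW_DIRTY | PTE_WRITABLE;   /* first write: mark "dirty"      */

    invlpg((void *)fault_addr);
    return 1;                              /* the IRET retries the access    */
}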
Cheers,
Brendan
Re: Use of INVLPG
Posted: Sun Feb 01, 2009 10:21 am
by Colonel Kernel
Brendan wrote:I'd assume this is in the "Optional Invalidation" section because it is actually optional - if an OS doesn't use the accessed and dirty flags for anything then it won't matter, and if an OS does use the accessed and dirty flags the worst case might not be too bad anyway (for example, if these flags are being used to determine which pages are the best pages to send to swap space, then worst case might be making a bad decision occasionally, which isn't necessarily that bad considering any code that tries to determine which pages are the best pages to send to swap space will be making guesses anyway).
I'm more concerned about the "dirty" bit, since in the worst case the memory manager might discard a dirty page rather than write the most recent changes out to disk. That's data loss, and it's not good!
Brendan wrote:If you have to do TLB shootdown anyway, then I'd:
- Send an IPI to all CPUs
- Then, wait for all other CPUs to enter the IPI handler and start spinning
- Then, clear the accessed and/or dirty bits on one CPU
- Then, make all CPUs invalidate the page in their TLBs and tell other CPUs that are spinning that they can continue
This is the same method I'd use to do TLB shootdown for any other purpose.
That's simpler, although in my scheme only CPUs that actually try to access the page end up spinning. I suppose the spinning won't last very long, so the extra complexity of my scheme is not worth it.
Brendan wrote:Colonel Kernel wrote:BTW, does anyone else think that having the hardware set the A/D flags in the PTEs is the stupidest design decision Intel ever made...?
For CPUs that don't have A/D flags in hardware, some OSs set the pages to "not present" so they can emulate the A/D flags in software (in the page fault handler). Basically it costs a lot more overhead, and you'd still have to deal with race conditions.
There would be more overhead the first time you access a "not accessed" page and write to an "accessed but not dirty" page. However, the overhead wouldn't be too bad because you could use lazy TLB invalidation instead of using an IPI in those cases (not present -> present and read-only -> read-write).
The disadvantage of doing it in hardware is simple. The guys at Intel are not perfect. Take a look at the list of errata relating to the automatic management of the A/D bits! At least when there are bugs in software, you can actually fix them!
In the PowerPC architecture, the tlbie instruction sends a special "invalidate" message over the address bus to guarantee that all CPUs flush the right entry immediately.
This is the kind of thing that should be hardware-accelerated, not A/D stuff.
Re: Use of INVLPG
Posted: Sun Feb 01, 2009 10:59 am
by Brendan
Hi,
Colonel Kernel wrote:Brendan wrote:I'd assume this is in the "Optional Invalidation" section because it is actually optional - if an OS doesn't use the accessed and dirty flags for anything then it won't matter, and if an OS does use the accessed and dirty flags the worst case might not be too bad anyway (for example, if these flags are being used to determine which pages are the best pages to send to swap space, then worst case might be making a bad decision occasionally, which isn't necessarily that bad considering any code that tries to determine which pages are the best pages to send to swap space will be making guesses anyway).
I'm more concerned about the "dirty" bit, since in the worst case the memory manager might discard a dirty page rather than write the most recent changes out to disk. That's data loss, and it's not good!
Oops - yes, that would be bad.
Colonel Kernel wrote:Brendan wrote:If you have to do TLB shootdown anyway, then I'd:
- Send an IPI to all CPUs
- Then, wait for all other CPUs to enter the IPI handler and start spinning
- Then, clear the accessed and/or dirty bits on one CPU
- Then, make all CPUs invalidate the page in their TLBs and tell other CPUs that are spinning that they can continue
This is the same method I'd use to do TLB shootdown for any other purpose.
That's simpler, although in my scheme only CPUs that actually try to access the page end up spinning. I suppose the spinning won't last very long, so the extra complexity of my scheme is not worth it.
Using APICs you get decent interrupt priorities - I make IPIs the highest priority APIC interrupts to make sure the spinning doesn't last long. I also use interruptible IRQ handlers (otherwise CPUs could be spinning for longer because one CPU is handling an IRQ when the IPI occurs).
There are other ways though - for example, tell the scheduler that all threads that belong to the process shouldn't be scheduled and wait until none of the threads are running, then clear all the A/D bits while you know there aren't any other CPUs using them; or perhaps implement a multi-threaded A/D clearing algorithm (where all CPUs that the process is using do the A/D bits for part of the address space while they know that no other CPUs will be accessing the pages, then reload CR3 when all CPUs are done). I haven't thought about it too much, but sending an IPI for every page might be a lot more expensive than putting the process into some sort of maintenance mode.
Colonel Kernel wrote:Brendan wrote:Colonel Kernel wrote:BTW, does anyone else think that having the hardware set the A/D flags in the PTEs is the stupidest design decision Intel ever made...?
For CPUs that don't have A/D flags in hardware, some OSs set the pages to "not present" so they can emulate the A/D flags in software (in the page fault handler). Basically it costs a lot more overhead, and you'd still have to deal with race conditions.
There would be more overhead the first time you access a "not accessed" page and write to an "accessed but not dirty" page. However, the overhead wouldn't be too bad because you could use lazy TLB invalidation instead of using an IPI in those cases (not present -> present and read-only -> read-write).
The disadvantage of doing it in hardware is simple. The guys at Intel are not perfect. Take a look at the list of errata relating to the automatic management of the A/D bits! At least when there are bugs in software, you can actually fix them!
To be honest, I haven't noticed any errata for A/D bits (or at least none that I thought would affect my code at the time), although I haven't gone through the errata for recent Intel CPUs, for any AMD CPUs, or for any CPUs from other manufacturers (where errata are extremely hard to find) yet.
The total overhead of doing A/D in software would depend on how often you clear the A/D bits. Clearing the A/D bits less often would reduce the overhead, but you'd also get a less accurate idea of how often pages are used.
Colonel Kernel wrote:In the PowerPC architecture, the tlbie instruction sends a special "invalidate" message over the address bus to guarantee that all CPUs flush the right entry immediately.
This is the kind of thing that should be hardware-accelerated, not A/D stuff.
That sounds like a race condition to me - on 80x86 you'd need to atomically change the page table entry and invalidate at the same time (in which case you could probably send a TLB correction rather than a TLB invalidation).
Cheers,
Brendan
Re: Use of INVLPG
Posted: Sun Feb 01, 2009 11:14 am
by Colonel Kernel
Brendan wrote:There are other ways though - for example, tell the scheduler that all threads that belong to the process shouldn't be scheduled and wait until none of the threads are running, then clear all the A/D bits while you know there aren't any other CPUs using them; or perhaps implement a multi-threaded A/D clearing algorithm (where all CPUs that the process is using do the A/D bits for part of the address space while they know that no other CPUs will be accessing the pages, then reload CR3 when all CPUs are done). I haven't thought about it too much, but sending an IPI for every page might be a lot more expensive than putting the process into some sort of maintenance mode.
I like the idea of suspending the other threads in the process until the clearing is done. It doesn't work for kernel pages of course, but I won't be messing with the A/D bits of those, so it doesn't matter.
Brendan wrote:To be honest, I haven't noticed any errata for A/D bits (or at least none that I thought would affect my code at the time), although I haven't gone through the errata for recent Intel CPUs, for any AMD CPUs, or for any CPUs from other manufacturers (where errata are extremely hard to find) yet.
My favourite is AE5 here. Not fatal, but knowing that these little gotchas exist gives me the willies.
Brendan wrote:Colonel Kernel wrote:In the PowerPC architecture, the tlbie instruction sends a special "invalidate" message over the address bus to guarantee that all CPUs flush the right entry immediately.
This is the kind of thing that should be hardware-accelerated, not A/D stuff.
That sounds like a race condition to me - on 80x86 you'd need to atomically change the page table entry and invalidate at the same time (in which case you could probably send a TLB correction rather than a TLB invalidation).
Good point.