smp invlpg interrupt

FlashBurn

smp invlpg interrupt

Post by FlashBurn »

I've written an interrupt handler for invalidating pages on an SMP system.

Code: Select all

;----------------------------
PROC int_invld_tlb
;----------------------------
BEGIN_EXCPT
   mov ebx,[invld_cr3]
   mov eax,[invld_addr]

   cmp [fs:thread_t.pd],ebx
   je .process

   cmp ebx,0xc0000000
   jnge .end
;----------------------------
;   it's a kernel address, so we need to invalidate it every time
.kernel:
   invlpg [eax]

   jmp .end
;----------------------------
;   it's the same process, so we need to invalidate the tlb entry
align 4
.process:
   invlpg [eax]
;----------------------------
align 4
.end:
   mov eax,[num_cpus]
   lock add dword[invld_cpus],1

   cmp eax,[invld_cpus]
   jne code_return_excpt

   CALL spinlock_release, invld_spin

   jmp code_return_excpt
ENDP
;----------------------------
The problem is in the last few lines. I add 1 to the "invld_cpus" var and then test whether it has the same value as "num_cpus", so that the last cpu releases the spinlock. But something very strange could happen here: cpu0 adds 1, cpu1 also adds 1, then cpu0 compares the two vars and releases the spinlock, and cpu1 also releases the spinlock - but by then the spinlock may already have been taken on cpu2 for a new invld int!

How can I solve this problem w/o using one more spinlock?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:smp invlpg interrupt

Post by Brendan »

Hi,
FlashBurn wrote:How can I solve this problem w/o using one more spinlock?
The values for invld_cpus, invld_cr3 and invld_addr should all be protected by something to prevent them from being modified by multiple CPUs at the same time, and this lock should only be acquired and released by the CPU that triggers the IPI. This implies that you must have at least one lock (but one lock is all you need).

For example:

Code: Select all

;Input
; ebx = address to invalidate
; ecx = page directory for process (if not kernel space)

do_SMP_invlpg:
    push eax
    push edx
    mov edx,[num_cpus]                ;edx = number of CPUs
    invlpg [ebx]                      ;Invalidate TLB for this CPU
    sub edx,1                         ;edx = number of other CPUs
    jbe .done                         ;Skip it if there's only one CPU

.retry:
    mov eax,0xFFFFFFFF                ;eax = value for "lock free"
    lock cmpxchg [invld_cpus],ebx     ;Try to acquire lock and set number of CPUs
    cmp eax,0xFFFFFFFF                ;Was lock acquired?
    je .gotLock                       ; yes
    pause                             ; no, wait a little
    jmp .retry                        ;     and try again

.gotLock:
    mov [invld_addr],ebx
    mov [invld_cr3],ecx

    <send the IPI using "all except self">

.wait:
    pause                             ;Wait a little
    cmp dword [invld_cpus],0          ;Have the other CPUs finished?
    jne .wait                         ; no, keep waiting

    mov dword [invld_cpus],0xFFFFFFFF ;Release the lock

.done:
    pop edx
    pop eax
    ret



SMP_invlpg_handler:
    push eax
    push ebx
    mov eax,[invld_addr]            ;eax = address to invalidate
    cmp eax,0xC0000000              ;Is it in kernel space?
    jae .invalidate                 ; yes, invalidate the page
    mov ebx,[invld_cr3]             ;ebx = page directory for invalidation
    cmp [fs:thread_t.pd],ebx        ;Is this process affected?
    jne .done                       ; no, don't invalidate the page
.invalidate:
    invlpg [eax]                    ;Invalidate the TLB
.done:
    lock sub dword [invld_cpus],1   ;Decrease number of CPUs left
    pop ebx
    pop eax
    iret
Also, there's a way to cheat called "lazy TLB invalidation". The basic idea is that interrupting all of the CPUs will affect their performance, so avoiding this can help performance.

Consider what would happen if a page is marked "not present" and you changed it to a present page but didn't send any IPI. In this case another CPU might try to access this page, the stale TLB information would say it's not present and you'd get a page fault. The page fault handler can then check if the page is really present or not, and do an INVLPG and return if it is actually present. That way no IPI is needed, other CPUs aren't interrupted and the INVLPG is only done when it's absolutely necessary.

This doesn't work when you change something from "present" to "not present" though - in this case you wouldn't get a page fault and an IPI is required.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re:smp invlpg interrupt

Post by Candy »

FlashBurn wrote:

Code: Select all

   mov eax,[num_cpus]
   lock add dword[invld_cpus],1

   cmp eax,[invld_cpus]
   jne code_return_excpt

   CALL spinlock_release, invld_spin

   jmp code_return_excpt
ENDP
;----------------------------
The problem is in the last few lines. I add 1 to the "invld_cpus" var and then test whether it has the same value as "num_cpus", so that the last cpu releases the spinlock. But something very strange could happen here: cpu0 adds 1, cpu1 also adds 1, then cpu0 compares the two vars and releases the spinlock, and cpu1 also releases the spinlock - but by then the spinlock may already have been taken on cpu2 for a new invld int!

How can I solve this problem w/o using one more spinlock?
The flags set by the add don't help you here, because "reached num_cpus" isn't a flag condition - the separate cmp afterwards is where the race lives. If you make the counter count down to 0 instead, the decrement itself gives you the check for free: jz fires for exactly the CPU whose decrement hit 0 (note that dec doesn't touch the carry flag, so to detect going below 0 you'd test the sign flag with js, not jc). I would recommend using the jz method.

Code: Select all

lock dec dword [semaphore]
jz none_left_task_for_last
FlashBurn

Re:smp invlpg interrupt

Post by FlashBurn »

@candy

Thanks I was looking for such an algo!

@brendan

Also thanks! I think there is a small error in your code - it has to be "lock cmpxchg [invld_cpus],edx"!?
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re:smp invlpg interrupt

Post by Candy »

FlashBurn wrote: @candy

Thanks I was looking for such an algo!

@brendan

Also thanks! I think there is a small error in your code - it has to be "lock cmpxchg [invld_cpus],edx"!?
In terms of locking, why not abuse the semaphore itself as the lock? That's a more generic solution that I tend to like. It only sacrifices a single value from the range the counter could ever reach (and most semaphore counters are over the positive integers anyway, so you can reserve the maximum - a virtual -1 - to mean "locked"):

Code: Select all

semaphore_read:
mov eax, 0xFFFFFFFF      ; the sentinel value meaning "locked"
mov ebx, [ebp+4]         ; ebx = address of the semaphore
lock xchg [ebx], eax     ; atomically take the value, leave the sentinel
                         ; (the lock prefix is redundant: xchg with memory is always locked)
cmp eax, 0xFFFFFFFF      ; did we read the sentinel back?
je semaphore_read        ; yes - someone else holds it, retry
ret

semaphore_write:
mov ebx, [ebp+4]         ; ebx = address of the semaphore
mov eax, [ebp+8]         ; eax = new value (must never be 0xFFFFFFFF)
mov [ebx], eax           ; a plain store releases the "lock"
ret
Again, 100% thread safe but not entirely foolproof (you need to make sure you never write 0xFFFFFFFF as a value).

I just like to think up thread safe stuff in my free time, makes for a nice thought game...
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:smp invlpg interrupt

Post by Brendan »

Hi,
FlashBurn wrote:@brendan

Also thanks! I think there is a little failure in your code, it has to be "lock cmpxchg [invld_cpus],edx"!?
Doh! You are of course correct. :)

I guess I should also emphasise something I wrote earlier - this lock should only be acquired and released by the CPU that triggers the IPI.

If I understand your original code correctly, the CPU that triggers the IPI doesn't wait until all CPUs have dealt with that IPI. This can lead to subtle synchronization problems later on, which can be very difficult to track down.

For example, imagine if one CPU tries to add some data to some sort of structure and finds it needs more memory. It locks the structure, allocates a page, sends the IPI, adds its data to the structure and then unlocks the structure.

Now, imagine a second CPU that is also trying to add data to this structure at almost the same time. It'd try to add its data to the structure but the structure would be locked. The problem is that the first CPU may unlock the structure before the second CPU has received the IPI - in this case leading to an unexpected page fault.

For TLB invalidation, (if you're lucky and depending on a lot of things) this sort of problem might not occur in practice (but then it might occur with a specific application on a specific computer, or it might occur all the time).

IMHO it's a good idea to get in the habit of avoiding these sort of problems, whether they actually occur or not - for example, always wait for all CPUs to handle any IPI before allowing the CPU that sent the IPI to continue. This might be more important later on (e.g. when you start using IPIs within the scheduler's code)... ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
FlashBurn

Re:smp invlpg interrupt

Post by FlashBurn »

I don't really understand what you are trying to show me ???

This is my code for invalidating pages:

Code: Select all

;----------------------------
PROC paging_unmap_4kb_smp, page
;----------------------------
;   check if this is a 4KiB page
BEGIN
   mov eax,[page]
   test eax,0fffh
   jne .end_err
;----------------------------
;   check if this is a used page
;----------------------------
;   1st to go, the pde
   mov esi,PAGING_PD
   shr eax,22

   mov edx,[esi+4*eax]
   test edx,edx
   jz .end_err
;----------------------------
;   now check the pte
   mov esi,PAGING_PTS
   shl eax,12
   add esi,eax
   mov eax,[page]
   and eax,PAGE_ADDR_PT
   shr eax,12

   mov edx,[esi+4*eax]
   test edx,edx
   jz .end_err
;----------------------------
;   invalidate tlb entry
   xor ebx,ebx
   invlpg dword[page]
;----------------------------
;   remove page from pt
   mov [esi+4*eax],ebx

   push edx
;----------------------------
;   mov the needed vals into regs
   mov edi,[num_cpus]
   mov ecx,[page]
   mov edx,[fs:thread_t.pd]
   sub edi,1
;----------------------------
;   call all other cpus to invalidate this pg
   CALL spinlock_acquire, invld_spin

   mov [invld_cpus],edi
   mov [invld_addr],ecx
   mov [invld_cr3],edx

   CALL apic_write_icr, ebx, dword (ALL_EXCL_SELF or TRIGGER_EDGE or LEVEL_ASSERT or FIXED or INT_INVLD_TLB)
;----------------------------
.end:
   pop edx

   and edx,PAGE_ADDR
   RETURN edx
;----------------------------
align 4
.end_err:
   RETURN 0
ENDP
;----------------------------

;----------------------------
;   CALL paging_unmap_4mb_smp, addr
PROC paging_unmap_4mb_smp, page
;----------------------------
;   test if this is a 4MiB page
BEGIN
   mov eax,[page]
   test eax,3fffffh
   jne .end_err
;----------------------------
;   calc the pd entry
   mov esi,PAGING_PD
   shr eax,22

   mov edx,[esi+4*eax]
   test edx,edx
   jz .end_err
;----------------------------
;   invalidate tlb entry
.invalid:
   invlpg dword[page]
;----------------------------
;   remove page from pt and ...
   xor ebx,ebx
   mov [esi+4*eax],ebx
;----------------------------
;   its special place
   mov esi,PAGING_PTS + 1023 * 4096
   mov [esi+4*eax],ebx
;----------------------------
;   invalidate it
   shl eax,2
   add eax,esi
   invlpg [eax]

   push edx eax
;----------------------------
;   mov the needed vals into regs
   mov edi,[num_cpus]
   mov ecx,[page]
   mov edx,[fs:thread_t.pd]
   sub edi,1
;----------------------------
;   call all other cpus to invalidate this pg
   CALL spinlock_acquire, invld_spin

   mov [invld_cpus],edi
   mov [invld_addr],ecx
   mov [invld_cr3],edx

   CALL apic_write_icr, ebx, dword (ALL_EXCL_SELF or TRIGGER_EDGE or LEVEL_ASSERT or FIXED or INT_INVLD_TLB)
;----------------------------
;   mov the needed vals into regs
   pop ecx

   mov edi,[num_cpus]
   mov edx,[fs:thread_t.pd]
   or ecx,1
   sub edi,1
   xor ebx,ebx
;----------------------------
;   call all other cpus to invalidate this pg
   CALL spinlock_acquire, invld_spin

   mov [invld_cpus],edi
   mov [invld_addr],ecx
   mov [invld_cr3],edx

   CALL apic_write_icr, ebx, dword (ALL_EXCL_SELF or TRIGGER_EDGE or LEVEL_ASSERT or FIXED or INT_INVLD_TLB)
;----------------------------
.end:
   pop edx

   and edx,PAGE_ADDR
   RETURN edx
;----------------------------
align 4
.end_err:
   RETURN 0
ENDP
;----------------------------
to be continued ...
FlashBurn

Re:smp invlpg interrupt

Post by FlashBurn »

And this is the int handler:

Code: Select all

;----------------------------
PROC int_invld_tlb
;----------------------------
BEGIN_EXCPT
   mov ebx,[invld_cr3]
   mov eax,[invld_addr]

   cmp [fs:thread_t.pd],ebx
   je .process

   cmp ebx,0xc0000000
   jnge .end
;----------------------------
;   it's a kernel address, so we need to invalidate it every time
.kernel:
   invlpg [eax]

   jmp .end
;----------------------------
;   it's the same process, so we need to invalidate the tlb entry
align 4
.process:
   test eax,1
   jne .process@kernel

   invlpg [eax]

   jmp .end
;----------------------------
;   we need to invalidate a 4MiB pg in its special place
align 4
.process@kernel:
   and eax,PAGE_ADDR

   invlpg [eax]
;----------------------------
align 4
.end:
   lock sub dword[invld_cpus],1
   jnz code_return_excpt

   CALL spinlock_release, invld_spin

   jmp code_return_excpt
ENDP
;----------------------------
I couldn't imagine a situation where this could lead to a problem.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:smp invlpg interrupt

Post by Brendan »

Hi,
FlashBurn wrote:I couldn't imagine a situation where this could lead to a problem.
Ok, imagine you've got some code to add data to a queue like this:

Code: Select all

struc QUEUE_HEADER
   .lock:            resd 1
   .pages:           resd 1
   .queue_top:       resd 1
   .data:
endstruc



;Input
; ecx = number of bytes to add to queue
; esi = address of bytes to add to queue
; edi = address of queue header

add_some_data:
   lea eax,[edi+QUEUE_HEADER.lock]            ;eax = address of re-entrancy lock
   call acquire_spinlock                      ;Lock the queue

   mov edx,[edi+QUEUE_HEADER.queue_top]       ;edx = offset to top of queue
   add edx,ecx                                ;edx = offset for top of queue after new data is added
   test edx,0x0FFF                            ;Is the last page partially used?
   jz .l1                                     ; no
   add edx,0x1000                             ; yes, count the partial page as a whole page
.l1:
   shr edx,12                                 ;edx = total number of pages needed

   mov ebx,[edi+QUEUE_HEADER.pages]           ;ebx = current number of pages allocated
   cmp ebx,edx                                ;Are there enough pages already?
   jae .gotPages                              ; yes

   mov eax,edx                                ;eax = total number of pages needed
   sub eax,[edi+QUEUE_HEADER.pages]           ;eax = number of extra pages needed
   shl ebx,12                                 ;ebx = offset to end of allocated space
   add ebx,edi                                ;ebx = address of first un-allocated page
   call allocate_pages                        ;Allocate more pages at the end of the queue
.gotPages:

   push edi
   cld
   add edi,[edi+QUEUE_HEADER.queue_top]       ;edi = address to put new data
   rep movsb                                  ;Copy the new data to the end of the queue
   pop edi                                    ;edi = address of queue header again

   lea eax,[edi+QUEUE_HEADER.lock]            ;eax = address of re-entrancy lock
   call release_spinlock                      ;Unlock the queue
   ret
Now imagine what happens if 2 CPUs do "call add_some_data" at the same time. It goes like this:
  • CPU 1: Acquires the "QUEUE_HEADER.lock"
  • CPU 2: Tries to acquire the "QUEUE_HEADER.lock" but can't
  • CPU 1: Calls "allocate_pages"
  • CPU 2: Tries to acquire the "QUEUE_HEADER.lock" but can't
  • CPU 1: The "allocate_pages" code allocates the pages
  • CPU 2: Tries to acquire the "QUEUE_HEADER.lock" but can't
  • CPU 1: The "allocate_pages" code sends an "invalidate" IPI
  • CPU 2: Tries to acquire the "QUEUE_HEADER.lock" but can't
  • CPU 1: Copies its data and frees the lock
  • CPU 2: Acquires the "QUEUE_HEADER.lock"
  • CPU 1: Returns to caller
  • CPU 2: Tries to add data to the end of the queue
  • CPU 2: Generates a page fault because of stale TLB information
  • CPU 2: Page fault handler terminates the process due to the page fault
  • CPU 2: Finally receives the "invalidate" IPI (too late)
In this example, failure is possible when "QUEUE_HEADER.lock" is freed by CPU 1 before the "invalidate" IPI is received by CPU 2.

The only real testing I've done is a "send to self" IPI on a Pentium 4, which took around 50 cycles to arrive (I would assume a "send to a different CPU" IPI would take longer). This creates a "window of failure".

If you're unlucky, this code would work fine 99.9999% of the time and won't fail until CPU 2 gets too hot and starts running slowly because of thermal throttling, or because something else is flooding the system bus and the IPI takes much longer than normal, and then only when both CPUs are trying to use the "add_some_data" at the same time, and only when the first caller allocates new pages, and possibly only when everything is already in CPU 1's caches.

This is what makes the bug impossible to find later - because it might only crash once a year on a specific pair of CPUs under a specific load, you couldn't recreate the bug and you wouldn't know whether something you've done has fixed it or not. It'd be an OS that crashes occasionally for unknown reasons...

For this particular example, it's very likely that CPU 1 will take long enough adding its data to the queue before unlocking the lock that the problem never occurs. This is just one example though, and it's very difficult to predict what unwritten applications might do (or how long an IPI might take on future hardware).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
FlashBurn

Re:smp invlpg interrupt

Post by FlashBurn »

Now I see where I don't understand you! Do I need to do an invlpg when I map a new page? I thought that I only have to do this when I unmap a page or change its physical address?!
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:smp invlpg interrupt

Post by Brendan »

Hi,
FlashBurn wrote:Now I see where I don't understand you! Do I need to do an invlpg when I map a new page? I thought that I only have to do this when I unmap a page or change its physical address?!


If you're using "lazy TLB invalidation" where the page fault handler invalidates TLB entries, then you only need to send the IPI if the page table entry or page directory entry was originally marked as "present".

This has absolutely nothing to do with making sure that a CPU that sends an IPI waits until it knows the IPI was handled, though.

My previous example wasn't so good (it was intended for an OS that doesn't use "lazy TLB invalidation"), but it's the same principle for all IPIs regardless of what they're for.

Think of it like this. What if you send an email to your friend saying "Don't open the package I'm about to send because it'll explode!!", and then mailed a package full of explosives. 99.999% of the time your friend will receive the email first and won't open the package. Every now and then you'll be attending a funeral.

A better idea would be to send your friend an email saying "I'm about to send a package that will explode if it's opened - let me know when you get this email!!", and then wait until your friend replies to this email before you send the package. That way you can guarantee that they received the email before they receive the package.

The IPI thing is the same - always make sure that the receiving CPU/s have handled the IPI before allowing the sender of the IPI to continue, otherwise the sending CPU may do something that affects a receiver before the receiver is ready for it.

If you like I can think of an example (like the "adding data to a queue" example) where this sort of synchronization problem can cause your memory management to fail even though it does use lazy TLB invalidation... To be honest, I like the "sending explosives" example above better - it's much easier to understand. :)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
FlashBurn

Re:smp invlpg interrupt

Post by FlashBurn »

Ok, I changed my code so that the cpu who sends the IPI waits for the other cpus to finish. This also exposed another problem ;) I have to wait for all APs to reach a certain point and then keep them waiting until the BSP has finished with some needed things, like setting function ptrs to smp functions and things like that.
Pype.Clicker
Member
Posts: 5964
Joined: Wed Oct 18, 2006 2:31 am
Location: In a galaxy, far, far away

Re:smp invlpg interrupt

Post by Pype.Clicker »

Brendan wrote: To be honest, I like the "sending explosives" example above better - it's much easier to understand. :)
acknowledgements of IRQs, yep ... synchronous vs. asynchronous communication always comes back at one time or another :P

this thread definitely confirms for me that, when it comes to SMP, Brendan is _the_ guy to ask ;) Btw, dude, I suppose you send IPIs to other processors only if you know they're likely to share that TLB entry, right? I mean, if you're remapping a user-level page of a single-threaded program, there's no need to bother other processors, nor to wait for their feedback, right?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:smp invlpg interrupt

Post by Brendan »

Hi,
Pype.Clicker wrote:this thread definitely confirms for me that, when it comes to SMP, Brendan is _the_ guy to ask ;) Btw, dude, I suppose you send IPIs to other processors only if you know they're likely to share that TLB entry, right? I mean, if you're remapping a user-level page of a single-threaded program, there's no need to bother other processors, nor to wait for their feedback, right?
Yes. If only one CPU can be using the affected part of the affected address space, then no IPI is needed.

This means for an OS like Windows (where there's "process space" and "kernel space") an "invalidate" IPI is always needed for kernel space, and only needed for process space if the process has more than one thread. Of course Windows supports shared memory which would make this more complicated (and "device space" which is mostly the same as kernel space IIRC).

For an OS like Linux (where there's only processes and "child processes") it's probably faster to always send the IPI - trying to figure out if the IPI is actually needed or not would be messy... ::)

For my OS (where each address space is split into "process space", "thread space" and "kernel space") an "invalidate" IPI is always needed for kernel space, never needed for thread space and only needed for process space if the process has more than one thread.

Normally, my kernel only ever allocates pages and "lazy TLB invalidation" means no IPI is needed anyway. If the physical memory manager runs out of memory it'll round up all the spare kernel pages and free them all in one go, which results in one IPI for several freed pages (rather than one per page).

For processes, I encourage using "thread space" as much as possible (no IPIs), and allocating a page in "process space" is handled with "lazy TLB invalidation" (no IPIs). When a process is terminated, its threads are terminated first and process space is only freed when the last thread is being terminated - in this case no IPIs are needed.

If you combine all of this, you'll see that it's possible for my OS to run many multi-threaded processes without any "invalidate" IPIs at all (if the processes are designed well and the OS doesn't run out of physical memory). This is good for performance/scalability on "many-CPU" computers (i.e. much better than Windows or Linux)... ;D


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Rob

Re:smp invlpg interrupt

Post by Rob »

Why don't you need an IPI for an application with a single thread? Could that thread not be scheduled to run on a different CPU when its next slice comes up? Or is a thread always bound to a specific cpu?

(or am I just not understanding all of this ;))