Code patching and SMP

rdos · Post by **rdos** » Thu Mar 17, 2011 4:44 pm

The AMD manual gave me an idea. They wrote about coding a 2-byte jmp at the beginning. A better idea is to start by writing an int instruction at the first byte.

Something like this: (16 bits)

Code:

67 66 9A gg gg 00 00 01 00

CD 66 9A gg gg 00 00 01 00

And (32 bits):

Code:

3E 67 9A gg gg 00 00 01 00

CD 67 9A gg gg 00 00 01 00

This makes any core that reaches the gate execute int 0x66 (16-bit) or int 0x67 (32-bit). In those handlers, the core spins until the instruction is modified to start with a nop (0x90). An additional advantage is that the GPF handler could start with this:

Code:

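    mov al,instr            ; first byte of the faulting instruction
    cmp al,90h              ; already patched to nop?
    je reexecute
;
    cmp al,0CDh             ; another core has already claimed it (int nn)
    je wait_patch
;
    cmp al,67h              ; unpatched 16-bit gate
    je lock
;
    cmp al,3Eh              ; unpatched 32-bit gate
    je lock
;
    goto default GPF handler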

lock:
    setup alias selector that can modify code-segment
    mov al,0CDh
    xchg al,instr
    cmp al,0CDh
    je wait_patch
;
    check if instruction that should be patched. Goto patch if true
;
    xchg al,instr
    goto default GPF handler

patch:
    wait until prefetch queue is drained
    patch instruction
    reexecute

wait_patch:
    xor cx,cx

wait_patch_loop:
    mov al,instr
    cmp al,90h
    je reexecute
    pause
    loop wait_patch_loop
;
    goto default GPF handler

int_66:
int_67:
    save and setup address (back up EIP two steps)
    
wait_int:
    mov al,instr
    cmp al,90h
    je wait_done
    pause
    jmp wait_int

wait_done:
    restore
    iretd
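
For readers who find the flow easier to follow in C, here is a rough model of the first-byte dispatch and the xchg-based claim from the handler above. It is only a sketch: the names `gate_action`, `classify_first_byte` and `try_claim` are made up for illustration, and a C11 atomic stands in for the code byte that the real handler reaches through an alias selector.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names; the post only gives the byte values. */
enum gate_action {
    GATE_REEXECUTE,   /* 0x90: already patched, run the instruction again */
    GATE_WAIT_PATCH,  /* 0xCD: another core is patching, spin until done  */
    GATE_TRY_LOCK,    /* 0x67 or 0x3E: unpatched gate, try to claim it    */
    GATE_DEFAULT_GPF  /* anything else: not a gate, a genuine fault       */
};

/* Mirrors the compare chain at the top of the GPF handler. */
enum gate_action classify_first_byte(uint8_t first)
{
    switch (first) {
    case 0x90: return GATE_REEXECUTE;
    case 0xCD: return GATE_WAIT_PATCH;
    case 0x67:
    case 0x3E: return GATE_TRY_LOCK;
    default:   return GATE_DEFAULT_GPF;
    }
}

/* Mirrors the "lock:" step: atomically swap in 0xCD.
 * Returns 1 if this core won the race; *old receives the previous byte
 * so it can be swapped back if the site turns out not to be a gate. */
int try_claim(_Atomic uint8_t *instr, uint8_t *old)
{
    *old = atomic_exchange(instr, 0xCD);
    return *old != 0xCD;
}
```

Only one core can see a non-0xCD value come back from the exchange, which is what lets the single byte double as a lock.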

rdos · Post by **rdos** » Fri Mar 18, 2011 2:21 am

An alternative is to defer patching and keep a list of locations to be patched. Cores that execute the code before the gate is patched will set up a return frame and call the entry point with an iretd. A kernel thread then does the patching at regular intervals: it checks whether ample time has passed since the int instruction was inserted, and patches the code if it has. That would solve the problem of the first core needing to wait until all prefetches made without the int instruction are done. In this case, it would be possible to use something like a 1 ms timeout, which would guarantee that no new core could get inconsistent instruction data.
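
The bookkeeping for such a deferred patcher could look roughly like the C sketch below. Everything in it is an assumption made for illustration: the fixed-size table, the microsecond timestamps, and the names `defer_patch`, `run_patcher` and `patch_to_nop`. It also leaves out the locking that a real multi-core kernel would need around the table.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16      /* hypothetical table size */
#define GRACE_US    1000u   /* the 1 ms timeout suggested in the post */

struct pending_patch {
    uint8_t *site;        /* gate whose first byte is already 0xCD */
    uint64_t marked_us;   /* time at which the int byte was written */
    int      in_use;
};

static struct pending_patch pending[MAX_PENDING];

/* Record a gate for later patching. Returns 0 on success, -1 if full. */
int defer_patch(uint8_t *site, uint64_t now_us)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!pending[i].in_use) {
            pending[i].site = site;
            pending[i].marked_us = now_us;
            pending[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Stand-in for the real patch routine: only flips the first byte to nop. */
void patch_to_nop(uint8_t *site)
{
    site[0] = 0x90;
}

/* Body of the periodic kernel thread: patch every entry whose grace
 * period has expired. Returns the number of gates patched. */
size_t run_patcher(uint64_t now_us, void (*patch)(uint8_t *site))
{
    size_t done = 0;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (pending[i].in_use && now_us - pending[i].marked_us >= GRACE_US) {
            patch(pending[i].site);
            pending[i].in_use = 0;
            done++;
        }
    }
    return done;
}
```

A gate marked at t = 100 µs is left alone by a pass at t = 500 µs and patched by the first pass at or after t = 1100 µs.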

rdos · Post by **rdos** » Fri Mar 18, 2011 5:25 pm

I've just implemented and tested a third version. Rather than doing the patching directly, I decided to do the 0xCD (int nn) patch in the GPF handler, and then just reexecute the instruction. The int 0x66 / int 0x67 handler then uses a spinlock to synchronize the patch process: if it is the first core (the instruction still starts with 0xCD), it patches the code and reexecutes the instruction; otherwise it just reexecutes the already patched instruction. The time it takes to leave the GPF handler, invoke another int handler, take a spinlock and calculate the patch address must be longer than the time it takes for a core to prefetch the code, even if it spans cache lines.
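
The "first core patches, the rest just reexecute" logic of the int handler can be modeled in C as below. This is a sketch under assumptions, not the actual RDOS code: `patch_lock` and `int_handler_patch` are invented names, a C11 `atomic_flag` stands in for the kernel spinlock, and the rewrite of the remaining gate bytes is reduced to a comment.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the kernel spinlock shared by the int 0x66 / 0x67 handlers. */
static atomic_flag patch_lock = ATOMIC_FLAG_INIT;

/* Returns 1 if this caller performed the patch, 0 if the gate was
 * already patched when the lock was obtained. Either way the caller
 * would then reexecute the instruction. */
int int_handler_patch(_Atomic uint8_t *first_byte)
{
    int patched = 0;

    while (atomic_flag_test_and_set_explicit(&patch_lock,
                                             memory_order_acquire))
        ;  /* spin; real code would issue pause here */

    if (atomic_load(first_byte) == 0xCD) {
        /* The offset and selector bytes would be rewritten here,
         * and only then is the first byte turned into a nop. */
        atomic_store(first_byte, 0x90);
        patched = 1;
    }

    atomic_flag_clear_explicit(&patch_lock, memory_order_release);
    return patched;
}
```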

Owen · Post by **Owen** » Sat Mar 19, 2011 7:04 am

I personally quite like AMD's solution; in fact, it seems the perfect method. Rewrite the first two bytes as "1: jmp 1", patch the rest of the instruction, and then finally patch the first two bytes into the desired instruction.

Then you just need to synchronize the patching of the first two bytes of the instruction with a cmpxchg. If they're not an instruction which needs patching, just return from the GPF handler. You build an implicit spinlock around those two bytes, and therefore get maximum parallelism.
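
A minimal C model of that cmpxchg scheme is sketched below. The names `claim_site` and `publish_site` are hypothetical, and the sketch assumes the two leading bytes are naturally aligned so that one 16-bit atomic access covers both. `EB FE` is the encoding of a short jump to itself, read here as a little-endian 16-bit value.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Bytes EB FE ("1: jmp 1") as a little-endian 16-bit value. */
#define JMP_SELF 0xFEEBu

/* Try to turn the unpatched two-byte head into "jmp self".
 * Returns 1 if this core claimed the site, 0 if the head no longer
 * holds the unpatched value (someone else claimed or finished it). */
int claim_site(_Atomic uint16_t *head, uint16_t unpatched)
{
    uint16_t expected = unpatched;
    return atomic_compare_exchange_strong(head, &expected, JMP_SELF);
}

/* After the tail of the instruction has been rewritten, publish the
 * final two bytes, which releases any core spinning on "jmp self". */
void publish_site(_Atomic uint16_t *head, uint16_t final_head)
{
    atomic_store_explicit(head, final_head, memory_order_release);
}
```

The claimed head acts as the implicit spinlock: a core that loses the compare-exchange simply returns and ends up executing the self-jump until the final bytes are published.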

rdos · Post by **rdos** » Sat Mar 19, 2011 8:58 am

Owen wrote: I personally quite like AMD's solution; in fact, it seems the perfect method. Rewrite the first two bytes as "1: jmp 1", patch the rest of the instruction, and then finally patch the first two bytes into the desired instruction.

The only problem with it is that it doesn't work across cache lines (it is two bytes, and thus can span cache lines). They also suggest a solution using the only one-byte int instruction (int 3), but that is not a good solution either, as debuggers use it to insert breakpoints into code. I like my current solution best (patching the first byte to an int nn instruction, and keeping the original second byte, either 0x66 or 0x67, as the vector). It only occupies two int vectors (0x66 and 0x67), and the patcher can see whether the call should be 16 or 32 bits by looking at the int vector. That is a single-byte instruction patch, and as such it will work anywhere in the code and can never span cache lines.
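
To make the cache-line argument concrete, here is a small C sketch. The 64-byte line size is an assumption (it varies by CPU), and both function names are invented for illustration. The second helper shows how the int vector alone tells the patcher which gate width it is dealing with.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

/* Returns 1 if the len bytes starting at addr straddle a line boundary. */
int spans_cache_line(uintptr_t addr, size_t len)
{
    return len > 0 &&
           (addr / CACHE_LINE) != ((addr + len - 1) / CACHE_LINE);
}

/* Maps the int vector to the gate width in bits: 0x66 means a 16-bit
 * gate, 0x67 a 32-bit gate. Returns 0 for any other vector. */
int gate_width_from_vector(uint8_t vector)
{
    switch (vector) {
    case 0x66: return 16;
    case 0x67: return 32;
    default:   return 0;
    }
}
```

A two-byte write at address 63 crosses into the next line, while a one-byte write never can, whatever its address.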

And when patching instructions from ring 3, which requires allocating call-gate selectors, it is quite convenient to replace the spinlock in the interrupt handler with a semaphore/critical section. The patcher can then call other gate functions that might not yet be patched without the risk of reentering the same spinlock and creating a patch loop that could hang everything.

From ring 3 (call-gates), the code could look like this instead (8 bytes).

16-bit:

Code:

66 9A gg gg 00 00 03 00

CD 9A gg gg 00 00 03 00 ; changes to int 0x9A

90 9A 00 00 ss ss 90 90 ; final

32-bit:

Code:

67 9A gg gg 00 00 03 00

CD 9A gg gg 00 00 03 00 ; changes to int 0x9A

90 9A 00 00 00 00 ss ss ; final

Owen · Post by **Owen** » Sat Mar 19, 2011 4:24 pm

I personally would have taken the occasional hit of a NOP to bring the instruction to alignment over the cost of the INT, though I admit I have somewhat more complex hot-patching plans.

rdos · Post by **rdos** » Thu Apr 14, 2011 4:13 am

The patching logic is now more or less finalized. The gates look like this:

16-bit kernel API:

Code:

67 66 9A gg gg gg gg 02 00

CD 66 9A gg gg gg gg 02 00

90 66 9A oo oo oo oo ss ss

32-bit kernel API:

Code:

3E 67 9A gg gg gg gg 02 00

CD 67 9A gg gg gg gg 02 00

90 67 9A oo oo oo oo ss ss

16-bit application API (from kernel):

Code:

67 66 9A gg gg gg gg 01 00

CD 66 9A gg gg gg gg 01 00

90 66 9A oo oo oo oo ss ss

32-bit application API (from kernel):

Code:

3E 67 9A gg gg gg gg 03 00

CD 67 9A gg gg gg gg 03 00

90 67 9A oo oo oo oo ss ss

Using the 32-bit application API from a 16-bit kernel device-driver:

Code:

67 66 9A gg gg gg gg 03 00

CD 66 9A gg gg gg gg 03 00

90 66 9A oo oo oo oo ss ss

16-bit application API from application:

Code:

66 9A gg gg gg gg 01 00

CD 9A gg gg gg gg 01 00

66 9A 00 00 00 00 ss ss

32-bit application API from application:

Code:

67 9A gg gg gg gg 03 00

CD 9A gg gg gg gg 03 00

90 9A 00 00 00 00 ss ss
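
As an illustration of the final 9-byte kernel gate layout listed above (`90`, the retained `66` or `67` byte, `9A`, a 32-bit offset and a 16-bit selector), a C helper could build the byte image like this. The function name and parameters are hypothetical. The sketch only produces the bytes and says nothing about the order in which a live patch has to write them.

```c
#include <stdint.h>

/* Builds the patched kernel gate image:
 *   90 <second> 9A oo oo oo oo ss ss
 * second is the retained second byte (0x66 for 16-bit gates, 0x67 for
 * 32-bit gates). The offset and selector are stored little-endian, as
 * the operand of a far call (opcode 9A) is on x86. */
void encode_final_kernel_gate(uint8_t out[9], uint8_t second,
                              uint32_t offset, uint16_t selector)
{
    out[0] = 0x90;                          /* nop replaces the int byte */
    out[1] = second;
    out[2] = 0x9A;                          /* far call opcode */
    out[3] = (uint8_t)(offset & 0xFF);
    out[4] = (uint8_t)((offset >> 8) & 0xFF);
    out[5] = (uint8_t)((offset >> 16) & 0xFF);
    out[6] = (uint8_t)((offset >> 24) & 0xFF);
    out[7] = (uint8_t)(selector & 0xFF);
    out[8] = (uint8_t)((selector >> 8) & 0xFF);
}
```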

OSDev.org
