Re: Code patching and SMP
Posted: Thu Mar 17, 2011 4:44 pm
by rdos
The AMD manual gave me an idea. They describe coding a 2-byte jmp at the beginning of the instruction. A better idea is to start by writing an int opcode (0xCD) to the first byte.
Something like this: (16 bits)
Code: Select all
67 66 9A gg gg 00 00 01 00
CD 66 9A gg gg 00 00 01 00
And (32 bits):
Code: Select all
3E 67 9A gg gg 00 00 01 00
CD 67 9A gg gg 00 00 01 00
This will make all cores execute int 0x66 or int 0x67. In those handlers, they will spin until the instruction is modified to start with a "nop". An additional advantage is that the GPF-handler could start with this:
Code: Select all
mov al,instr
cmp al,90h
je reexecute
;
cmp al,0CDh
je wait_patch
;
cmp al,67h
je lock
;
cmp al,3Eh
je lock
;
goto default GPF handler
lock:
setup alias selector that can modify code-segment
mov al,0CDh
xchg al,instr
cmp al,0CDh
je wait_patch
;
check if instruction that should be patched. Goto patch if true
;
xchg al,instr
goto default GPF handler
patch:
wait until prefetch queue is drained
patch instruction
reexecute
wait_patch:
xor cx,cx
wait_patch_loop:
mov al,instr
cmp al,90h
je reexecute
pause
loop wait_patch_loop
;
goto default GPF handler
int_66:
int_67:
save and setup address (back up EIP two steps)
wait_int:
mov al,instr
cmp al,90h
je wait_done
pause
jmp wait_int
wait_done:
restore
iretd
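The first-byte dispatch and the xchg-based claim in the pseudocode above can be modelled in portable C. This is only a user-space sketch: the names `gpf_action`, `classify_first_byte` and `try_claim` are hypothetical, and a C11 atomic exchange stands in for the `xchg al,instr` on the alias selector. Only the byte values (0x90, 0xCD, 0x67, 0x3E) come from the post.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names; only the byte values come from the pseudocode above. */
enum gpf_action {
    GPF_REEXECUTE,  /* 0x90: gate already patched, run it again */
    GPF_WAIT_PATCH, /* 0xCD: another core owns the patch, spin */
    GPF_TRY_LOCK,   /* 0x67 or 0x3E: unpatched gate, try to claim it */
    GPF_DEFAULT     /* anything else: hand over to the default GPF handler */
};

enum gpf_action classify_first_byte(uint8_t first)
{
    switch (first) {
    case 0x90: return GPF_REEXECUTE;
    case 0xCD: return GPF_WAIT_PATCH;
    case 0x67:
    case 0x3E: return GPF_TRY_LOCK;
    default:   return GPF_DEFAULT;
    }
}

/* Models "mov al,0CDh / xchg al,instr": atomically mark the gate and
 * report the byte that was there before. Returns 1 if the old byte was
 * not 0xCD (this caller now owns the patch), 0 if another core had
 * already marked the gate. */
int try_claim(_Atomic uint8_t *instr, uint8_t *old_byte)
{
    assert(instr != 0 && old_byte != 0);
    *old_byte = atomic_exchange(instr, (uint8_t)0xCD);
    return *old_byte != 0xCD;
}
```

If the claimed instruction turns out not to be a gate, the caller would write `*old_byte` back, which corresponds to the second `xchg al,instr` in the pseudocode.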
Re: Code patching and SMP
Posted: Fri Mar 18, 2011 2:21 am
by rdos
An alternative is to defer patching and create a list of locations to be patched. Cores executing the code before the gate is patched will set up a return frame and call the entry point with an iretd. A kernel thread then does the patching at regular intervals: it checks whether ample time has passed since the int instruction was inserted, and patches the code if it has. That would solve the problem of the first core needing to wait until all prefetches made without the int instruction are done. In this case, it would be possible to use something like a 1 ms timeout, which would guarantee that no new core could get inconsistent instruction data.
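The deferred scheme could be sketched as a small table of pending locations that the patch thread scans. Everything here is an assumption for illustration: the names `patch_queue`, `queue_patch` and `run_due_patches`, the fixed table size, the microsecond clock passed in by the caller, and the `apply` callback that would do the real patch. Locking of the table is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PATCH_DELAY_US 1000u /* the 1 ms timeout suggested above */
#define MAX_PENDING    16    /* arbitrary size for this sketch */

struct pending_patch {
    uintptr_t addr;         /* location of the gate to patch */
    uint64_t  marked_at_us; /* when the int byte was written */
    int       in_use;
};

struct patch_queue {
    struct pending_patch slot[MAX_PENDING];
};

/* Record a gate that has just been marked with the int opcode.
 * Returns 1 on success, 0 if the table is full. */
int queue_patch(struct patch_queue *q, uintptr_t addr, uint64_t now_us)
{
    assert(q != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!q->slot[i].in_use) {
            q->slot[i].addr = addr;
            q->slot[i].marked_at_us = now_us;
            q->slot[i].in_use = 1;
            return 1;
        }
    }
    return 0;
}

/* Called periodically by the patch thread. Patches every entry that has
 * been marked for at least PATCH_DELAY_US and returns how many were done. */
size_t run_due_patches(struct patch_queue *q, uint64_t now_us,
                       void (*apply)(uintptr_t addr))
{
    size_t done = 0;

    assert(q != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending_patch *p = &q->slot[i];
        if (p->in_use && now_us - p->marked_at_us >= PATCH_DELAY_US) {
            if (apply)
                apply(p->addr);
            p->in_use = 0;
            done++;
        }
    }
    return done;
}
```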
Re: Code patching and SMP
Posted: Fri Mar 18, 2011 5:25 pm
by rdos
I've just implemented and tested a third version. Rather than doing the patching directly, I decided to do the 0xCD (int nn) patch in the GPF-handler, and then just reexecute the instruction. The int 0x66 / int 0x67 handler will then use a spin-lock to synchronize the patch process. If it is the first core (the instruction is still a 0xCD), it will patch the code and reexecute the instruction; otherwise it will just reexecute the already patched instruction. The time it takes to leave the GPF-handler, invoke another int-handler, access a spinlock and calculate the patch address must be longer than the time it takes for a core to prefetch the code, even if it spans cache-lines.
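A rough user-space model of the int-handler side of this version, using a C11 `atomic_flag` as the spinlock. The function name `int_handler_patch` and the `final` buffer are hypothetical. Writing the first byte last is an assumption of this sketch, not something the post states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_flag patch_lock = ATOMIC_FLAG_INIT;

/* Returns 1 if this caller performed the patch, 0 if the gate had
 * already been patched by another core. Either way the caller would
 * then reexecute the instruction. */
int int_handler_patch(volatile uint8_t *gate, const uint8_t *final,
                      unsigned len)
{
    int did_patch = 0;

    assert(len >= 1);
    while (atomic_flag_test_and_set_explicit(&patch_lock,
                                             memory_order_acquire))
        ; /* spin until the lock is free */

    if (gate[0] == 0xCD) {          /* still marked: we are the first core */
        for (unsigned i = 1; i < len; i++)
            gate[i] = final[i];     /* rest of the instruction first */
        gate[0] = final[0];         /* first byte last (assumption) */
        did_patch = 1;
    }

    atomic_flag_clear_explicit(&patch_lock, memory_order_release);
    return did_patch;
}
```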
Re: Code patching and SMP
Posted: Sat Mar 19, 2011 7:04 am
by Owen
I personally quite like AMD's solution; in fact, it seems the perfect method. Rewrite the first two bytes as "1: jmp 1", patch the rest of the instruction, and then finally patch the first two bytes into the desired instruction.
Then you just need to synchronize the patching of the first two bytes of the instruction with a cmpxchg. If they're not an instruction which needs patching, just return from the GPF handler. You build an implicit spinlock around those two bytes, and therefore get maximum parallelism.
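The cmpxchg idea can be modelled in C11 roughly as follows. This is a sketch under assumptions: `struct insn` and `patch_insn` are hypothetical, the struct keeps the two head bytes naturally aligned (which real code would not guarantee), and the 16-bit values are little-endian as on x86, so the bytes EB FE (a short jmp to itself) appear as 0xFEEB.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* EB FE is a short jmp to itself; as a little-endian 16-bit value: 0xFEEB. */
#define JMP_SELF ((uint16_t)0xFEEB)

/* User-space model of one patchable instruction (hypothetical layout). */
struct insn {
    _Atomic uint16_t head; /* first two bytes, updated atomically */
    uint8_t tail[7];       /* rest of the instruction */
};

/* Returns 1 if this caller patched the instruction, 0 if the head no
 * longer held the unpatched bytes (someone else is patching or has
 * patched it, or it is not a patchable instruction at all). */
int patch_insn(struct insn *in, uint16_t unpatched_head,
               uint16_t new_head, const uint8_t *new_tail, size_t tail_len)
{
    uint16_t expected = unpatched_head;

    assert(tail_len <= sizeof in->tail);
    if (!atomic_compare_exchange_strong(&in->head, &expected, JMP_SELF))
        return 0;
    memcpy(in->tail, new_tail, tail_len); /* other cores spin on jmp-self */
    atomic_store(&in->head, new_head);    /* publish the final head */
    return 1;
}
```

The compare-exchange acts as the implicit per-instruction lock: only the core that swaps in the jmp-self gets to write the tail.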
Re: Code patching and SMP
Posted: Sat Mar 19, 2011 8:58 am
by rdos
Owen wrote: I personally quite like AMD's solution; in fact, it seems the perfect method. Rewrite the first two bytes as "1: jmp 1", patch the rest of the instruction, and then finally patch the first two bytes into the desired instruction.
The only problem with it is that it doesn't work across cache-lines (it is two bytes, and thus can span cache-lines). They also suggest a solution using the only one-byte int instruction (int 3), but that is not a good solution either, as it is used by debuggers to insert breakpoints into code. I like my current solution best (patching the first byte to an int nn instruction, and keeping the second original byte as either 0x66 or 0x67). It only occupies two int-vectors (0x66 and 0x67), and the patcher can see whether the call should be 16 or 32 bits by looking at the int-vector. That is a single-byte instruction patch, and as such will work anywhere in the code and can never span cache-lines.
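The cache-line argument is simple arithmetic. A minimal sketch, assuming a 64-byte line size (typical on current x86 parts, but an assumption here) and a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed line size in bytes */

/* Returns 1 if a write of len bytes starting at addr touches two
 * different cache lines, 0 if it stays within one. */
int spans_cache_line(uintptr_t addr, unsigned len)
{
    assert(len > 0);
    return (addr / CACHE_LINE) != ((addr + len - 1) / CACHE_LINE);
}
```

A two-byte write at offset 63 of a line crosses into the next line, while a one-byte write cannot cross at any address.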
And when patching instructions from ring 3, which requires allocating call-gate selectors, it is quite convenient to replace the spinlock in the interrupt handler with a semaphore/critical section. The patcher can then call other gate-functions that might not yet be patched without the risk of re-entering the same spinlock and creating a patch-loop that could hang everything.
From ring 3 (call-gates), the code could look like this instead (8 bytes).
16-bit:
Code: Select all
66 9A gg gg 00 00 03 00
CD 9A gg gg 00 00 03 00 ; changes to int 0x9A
90 9A 00 00 ss ss 90 90 ; final
32-bit:
Code: Select all
67 9A gg gg 00 00 03 00
CD 9A gg gg 00 00 03 00 ; changes to int 0x9A
90 9A 00 00 00 00 ss ss ; final
Re: Code patching and SMP
Posted: Sat Mar 19, 2011 4:24 pm
by Owen
I personally would have taken the occasional hit of a NOP to bring the instruction to alignment over the cost of the INT, though I admit I have somewhat more complex hot-patching plans.
Re: Code patching and SMP
Posted: Thu Apr 14, 2011 4:13 am
by rdos
The patching-logic is now more or less finalized. The gates look like this:
16-bit kernel API:
Code: Select all
67 66 9A gg gg gg gg 02 00
CD 66 9A gg gg gg gg 02 00
90 66 9A oo oo oo oo ss ss
32-bit kernel API:
Code: Select all
3E 67 9A gg gg gg gg 02 00
CD 67 9A gg gg gg gg 02 00
90 67 9A oo oo oo oo ss ss
16-bit application API (from kernel):
Code: Select all
67 66 9A gg gg gg gg 01 00
CD 66 9A gg gg gg gg 01 00
90 66 9A oo oo oo oo ss ss
32-bit application API (from kernel):
Code: Select all
3E 67 9A gg gg gg gg 03 00
CD 67 9A gg gg gg gg 03 00
90 67 9A oo oo oo oo ss ss
Using the 32-bit application API from a 16-bit kernel device-driver:
Code: Select all
67 66 9A gg gg gg gg 03 00
CD 66 9A gg gg gg gg 03 00
90 66 9A oo oo oo oo ss ss
16-bit application API from application:
Code: Select all
66 9A gg gg gg gg 01 00
CD 9A gg gg gg gg 01 00
66 9A 00 00 00 00 ss ss
32-bit application API from application:
Code: Select all
67 9A gg gg gg gg 03 00
CD 9A gg gg gg gg 03 00
90 9A 00 00 00 00 ss ss
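As an illustration of the byte layouts, the final form of the 32-bit kernel API gate (90 67 9A oo oo oo oo ss ss) could be assembled like this. It is a sketch under assumptions: the function name is hypothetical, and reading "oo" as the 32-bit offset and "ss" as the 16-bit selector of the far call (opcode 9A) is an interpretation of the listing. Both fields are stored little-endian, as x86 expects.

```c
#include <assert.h>
#include <stdint.h>

/* Fill out[] with the final, patched form of a 32-bit kernel API gate:
 * nop, 0x67 prefix, far call opcode, 32-bit offset, 16-bit selector. */
void build_final_gate32(uint8_t out[9], uint32_t offset, uint16_t selector)
{
    assert(out != 0);
    out[0] = 0x90;                              /* nop replaces 0xCD */
    out[1] = 0x67;                              /* prefix kept from the gate */
    out[2] = 0x9A;                              /* far call opcode */
    out[3] = (uint8_t)(offset & 0xFF);          /* offset, low byte first */
    out[4] = (uint8_t)((offset >> 8) & 0xFF);
    out[5] = (uint8_t)((offset >> 16) & 0xFF);
    out[6] = (uint8_t)((offset >> 24) & 0xFF);
    out[7] = (uint8_t)(selector & 0xFF);        /* selector, low byte first */
    out[8] = (uint8_t)(selector >> 8);
}
```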