Code patching and SMP

rdos · Post by **rdos** » Sat Mar 05, 2011 10:55 am

Anybody implemented dynamic code-patching with SMP? The problem appears when more than one core executes the same (unpatched) code at the same time. I use an invalid far call to selector 1,2 or 3 to indicate code that needs patching. The offset part contains a gate number, which is defined in a table.

The following scenario is especially problematic:

Code: Select all

core 1: 
    execute unpatched code
    GPF fault

core 2:
    execute unpatched code
    GPF fault

core 1:
    check if it is a gate
    YES: patch the code so it is now valid
    re-execute patched code

core 2:
    check if it is a gate
    NO: execute default GPF handler

The only solution to the above is to let core 2 also re-execute the instruction. The problem then is that if it is a real GPF-fault, it will simply loop and hangup the core, which is very bad behavior. The x86 architecture has no counter for re-executing faulting instructions, and it is not obvious how to detect re-execution loops that are real faults. The only other solution I can come up with is to check if a call is valid, and if it is, always re-execute it.

EDIT: Another possibility strikes me as possible. By always having one extra byte avaliable in the unpatched instruction (nop padding after), it would be possible to patch the code so it always starts with a nop. The GPF handler could then re-execute all instructions that begin with nop. It is a simple solution, but it requires an extra byte.

Brendan · Post by **Brendan** » Sat Mar 05, 2011 11:56 am

Hi,

It's worse than that. For example:

Code: Select all

core 1: check if it is a gate
core 2: check if it is a gate
core 1: YES: patch the code so it is now valid
core 2: YES: patch the code a second time for no reason
core 1: re-execute patched code
core 2: re-execute patched code

core 3: execute "patched" code
core 3: GPF fault because unpatched code is still in pipeline or trace cache
core 3: check if it is a gate
core 3: NO: execute default GPF handler

In the Intel manuals there's some notes on both self-modifying code and cross-modifying code (code running on one CPU that is being modified by another CPU). For cross-modifying code Intel says to put a spinlock before the code that will be modified.

Basically, the first CPU does this:

Code: Select all

    flag = 0;
    /* modify the instruction here */
    flag = 1;

And the code being modified is to changed to:

Code: Select all

    while(flag == 0) { }
    /* modify the instruction here */
    serialising instruction (e.g. CPUID)

In general (e.g. if the code is modified several times by several CPUs) this still looks too racey to me. For your case (where the code is only meant to be modified once) it'd work.

That still leaves 2 problems.

The first problem is that you probably don't want a little spinlock (and serialising instruction) in user-space before each piece of code that may be modified. IMHO, in your case, that can safely be shifted into the GPF handler instead.

Basically you end up with something like this:

Code: Select all

GPFhandler:
    if( error_code says GPF may have been caused by unpatched code) {
        if( code at EIP is a JMP/CALL that may or may not need patching) {
             atomic{ set_flag and return old value of flag}
             if(old value of flag == 0) {
                 patch the instruction;
                 set flag to 0 again;
                 IRET (and let IRET do the serialising for self-modifying code);
             } else {
                 while(flag !=0) { /* wait for other CPU to finished modifying it */ }
                 do serialising instruction like CPUID (as IRET may not be enough for cross-modifying code)
                 IRET;
             }
        }
    }

The second problem is that performance will suck. Even for simple self-modifying code Intel warns against "severe performance degradation". By the time you add the overhead of exceptions and the spin-locking (and lock contention), it'll be "much worse that severe performance degradation". I'm not too sure if there's anything you can do to prevent that, other than using a different lock for every patched/unpatched instruction (to help reduce lock contention).

rdos wrote:The only solution to the above is to let core 2 also re-execute the instruction. The problem then is that if it is a real GPF-fault, real GPF-faults will simply loop and hangup the core, which is very bad behavior. The x86 architecture has no counter for re-executing faulting instructions, and it is not obvious how to detect re-execution loops that are real faults. The only other solution I can come up with is to check if a call is valid, and if it is, always re-execute it.

If your code can't tell the difference between a real GPF fault and an instruction that needs patching, then your code would "patch" the real GPF fault (so that the real GPF fault ends up being something that uses your gate without causing a GPF fault). You'd really want to make sure the code being patched actually is code that should be patched, but there's no "guaranteed foolproof" way of doing that.

I'd also be tempted to suggest investigating how other OSs handle things like dynamic linking. I'd be very surprised if it looked anything like this. Maybe the executable file format could have a standard "big table that contains the address of each instruction that needs to be patched", and then you could patch everything when the process is started so that you don't need to bother with any of the mess (and performance problems).

Cheers,

Brendan

rdos · Post by **rdos** » Sat Mar 05, 2011 12:44 pm

Brendan wrote:Hi,

It's worse than that. For example:

Code: Select all

core 1: check if it is a gate
core 2: check if it is a gate
core 1: YES: patch the code so it is now valid
core 2: YES: patch the code a second time for no reason
core 1: re-execute patched code
core 2: re-execute patched code

core 3: execute "patched" code
core 3: GPF fault because unpatched code is still in pipeline or trace cache
core 3: check if it is a gate
core 3: NO: execute default GPF handler

Yes, I left that part out for clarity. The GPF handler has a spinlock it needs to acquire before reading the instruction. This same spinlock is also used by the patch-handler as it modifies the instruction. In addition to that, there is another spinlock for the patch handler itself. I think those solve the reentrancy problems in the code, but it cannot solve the instruction reading problems.

Brendan wrote:The first problem is that you probably don't want a little spinlock (and serialising instruction) in user-space before each piece of code that may be modified. IMHO, in your case, that can safely be shifted into the GPF handler instead.

Right. I would not want that. The patched code should be as short as possible for efficiency reasons.

Brendan wrote:The second problem is that performance will suck. Even for simple self-modifying code Intel warns against "severe performance degradation". By the time you add the overhead of exceptions and the spin-locking (and lock contention), it'll be "much worse that severe performance degradation". I'm not too sure if there's anything you can do to prevent that, other than using a different lock for every patched/unpatched instruction (to help reduce lock contention).

But patch-time is not an issue. The instruction will only be patched once, and if the code is called a second time it will go directly to the destination, with no overhead. This is faster than the dynamic linking alternatives that usually make a jump to a patch-table or similar.

There is another issue as well. As the patching of the code is performed, it must be invalid at all times until it is in its final shape. This is not as simple as it might sound, as some patches will replace a far call with a push cs / call near, and the instruction itself is 8 bytes long, and there are no atomic instructions for 8 bytes.

Example of atomic patching to a far call:

Code: Select all

66 9A gg gg 00 00 03 00 (call far 0003:0000gggg from 16-bit segment which is always invalid)

66 9A gg gg 90 90 90 90 (place 90909090 (4 nops) at offset 4. This results in call far 9090:9090gggg which might be valid)

66 9A gg gg ss ss 90 90 (place new selector at offset 4. This results in call far 9090:ssssgggg which might be valid)

90 9A oo oo ss ss 90 90 (place nop, call far and offset at offset 0. This results in the final call to ssss:oooo)

Example of atomic patching to push cs + near call

Code: Select all

66 9A gg gg 00 00 03 00 (call far 0003:0000gggg from 16-bit segment which is always invalid)

66 9A gg gg 90 90 90 90 (place 90909090 (4 nops) at offset 4. This results in call far 9090:9090gggg which might be valid)

66 9A gg gg oo oo 90 90 (place new offset at offset 4. This results in call far 9090:oooogggg which might be valid)

90 90 0E E8 oo oo 90 90 (place nops and push cs / call near at offset 0. This results in the final push cs / call oooo)

The problem is that two nops (90 90) will translate to selector 9090h, which could be valid.

In both these examples, the GPF handler only needs to look at the first instruction byte, and if it's a nop, it can safely re-execute the instruction as nops can never GP fault. If the patched code faults, it will not point to the nop, but rather to the new call instruction, which the GPF handler can chain to the default protection-fault handler.

EDIT: Looking for better instructions to patch with, I found cmc (0F5h). Two cmcs will also be a nop. This results in a selector of 0F5F5h, which is garanteed to always be invalid in RDOS, as it is in the LDT, and RDOS currently does not use more than a few selectors in the LDT, and they are allocated from the button. Additionally, the RPL field points to ring 1, and RDOS does not use ring 1.

Updated patching logic:

Code: Select all

66 9A gg gg 00 00 03 00 (call far 0003:0000gggg from 16-bit segment which is always invalid)

66 9A gg gg ss ss F5 F5 (place new selector at offset 4 + 2 cmcs. This results in call far F5F5:ssssgggg which is always invalid)

90 9A oo oo ss ss F5 F5 (place nop, call far and offset at offset 0. This results in the final call to ssss:oooo)

and:

Code: Select all

66 9A gg gg 00 00 03 00 (call far 0003:0000gggg from 16-bit segment which is always invalid)

66 9A gg gg oo oo F5 F5 (place new offset at offset 4 + 2 cmcs. This results in call far F5F5:oooogggg which is always invalid)

90 90 0E E8 oo oo F5 F5 (place nops and push cs / call near at offset 0. This results in the final push cs / call oooo)

Optionally, the cmcs can be patched to nops in a final step if this is more effiicient with no adverse affects, as it is only a potential performance issue.

Brendan · Post by **Brendan** » Sun Mar 06, 2011 12:36 am

Hi,

rdos wrote:This is not as simple as it might sound, as some patches will replace a far call with a push cs / call near, and the instruction itself is 8 bytes long, and there are no atomic instructions for 8 bytes.

I wouldn't say "none" - there is CMPXCHG8B.

For single-CPU machines (including 80486 and older) you can use anything you like because it's single-CPU. For multi-CPU, you say "for multi-CPU the minimum requirement is Pentium or newer" and then you can use "LOCK CMPXCHG8B". In practice this is perfectly fine, as the number of 80486 machines that have multiple CPUs is so close to zero that it's not worth caring about; and for 80386 and older there's even fewer multi-CPU machines (and because they pre-date Intel's MultiProcessor Specification there's no standard way of doing anything with them anyway).

The only other problem is that "LOCK CMPXCHG8B" forms the basis of the F00F bug on Pentium so you'd want to implement a F00F bug workaround, but you want to do that anyway if you support Pentium CPUs.

rdos wrote:EDIT: Looking for better instructions to patch with, I found cmc (0F5h).

I'd be tempted to use "UD2". It's guaranteed to generate a undefined opcode exception (even on older CPUs), and it's only 2 bytes. For example, the first 2 bytes would be UD2, the next 2 bytes would determine the type of patch needed, and the last 4 bytes would be spare space. You'd write the last 4 bytes (which won't effect the UD2 for any other CPUs), then you'd write the first 4 bytes.

As a bonus, the undefined opcode handler (typically) needs to examine the opcode of the instruction that caused the fault (to emulate instructions that aren't supported by the CPU), so adding a check for UD2 doesn't add much extra overhead to that exception handler; while the GPF handler usually doesn't need to examine the opcode of the instruction that caused the fault, so adding a check for "call far" does add extra overhead to that exception handler.

rdos wrote:But patch-time is not an issue. The instruction will only be patched once, and if the code is called a second time it will go directly to the destination, with no overhead.

Let's do a rough "worst case" estimate. An exception is about 100 cycles. The exception handler needs to determine if the exception was caused by an instruction that needs patching or not; and has to take into account a few corner cases - e.g. first half of "instruction to be patched" placed so that the second half is on a different ("not present" or "supervisor") page, or beyond a segment limit; and checking if the instruction is in a valid code section (and the process isn't executing random data because it crashed). The checking, etc might cost another 50 cycles. Then there's the spinlock and lock contention - even without any contention that might add 20 cycles (and with lock contention it might be as bad as 50 cycles per CPU or something). A pipeline flush (caused by self/cross modifying code) is probably another 100 cycles. So, for an extremely rough "worst case" estimate, maybe it adds up to about 300 cycles.

Now imagine some initialisation code - no loops or anything, and the code is only run once. Maybe there's 100 instructions that need patching. Maybe that adds up to a total of 30000 cycles of overhead, and maybe it could run 10 times faster if the patching was done differently.

rdos wrote:This is faster than the dynamic linking alternatives that usually make a jump to a patch-table or similar.

For some OS's there's a jump table (e.g. ELF and the Global Offset Table). For other OSs there isn't - there's a table of relocations, and the "executable loader" patches all the offsets in the instructions themselves. There's advantages/disadvantages in both cases. The main disadvantage of the second approach is that it gets very messy once you start looking at memory mapped executable files, and it causes problems for code sharing (e.g. same pages of executable code shared by different processes, with different relocations).

All I'm saying is that if you had a table containing the offset of each instruction that needs patching, you could patch everything before the code is executed (e.g. in the executable loader), and get rid of the exception overhead and serialising, get rid of the need for spinlocks and lock contention, get rid of the "does this instruction need patching" testing (and get rid of the chance of false positives), get rid of the most of the "is it safe to patch" corner-cases, and improve startup times for executables.

Of course I'm only making suggestions (that may or may not be suitable for your specific case), based on very little knowledge about you specific case - I know you're patching code, but I don't know why and don't know much else about your OS (e.g. what your executable format looks like, if you're using paging, if you're planning to support memory mapped executables, etc).

Cheers,

Brendan

a5498828 · Post by **a5498828** » Sun Mar 06, 2011 1:26 am

Is there any reason other than rootkits to use self modyfing code?

rdos · Post by **rdos** » Sun Mar 06, 2011 3:04 am

Brendan wrote:I'd be tempted to use "UD2". It's guaranteed to generate a undefined opcode exception (even on older CPUs), and it's only 2 bytes. For example, the first 2 bytes would be UD2, the next 2 bytes would determine the type of patch needed, and the last 4 bytes would be spare space. You'd write the last 4 bytes (which won't effect the UD2 for any other CPUs), then you'd write the first 4 bytes.

Yes. In the beginning I used UD2, but the main problem with UD2 is that debuggers don't know how to trace/pace it, and don't even know how long the instruction is, and sometimes therefore will not even find the next instruction so it could produce a valid disassembly of the code. Additionally, the debugger don't even know where to set a breakpoint if it is not single-stepping it. This is why I changed to call to a know-invalid selector instead. It works perfectly well with the debugger, and the debugger knows how to trace it and pace it.

OTOH, I could possibly fix these issues in the OpenWatcom debugger, but then if I decide to switch to GCC for applications, for instance, this problem will re-appear.

It is not a attractive option to switch back to using UD2.

Brendan wrote:As a bonus, the undefined opcode handler (typically) needs to examine the opcode of the instruction that caused the fault (to emulate instructions that aren't supported by the CPU), so adding a check for UD2 doesn't add much extra overhead to that exception handler; while the GPF handler usually doesn't need to examine the opcode of the instruction that caused the fault, so adding a check for "call far" does add extra overhead to that exception handler.

The GPF handler will almost always abort execution of an application, except when an instruction needs to be patched, or when an instruction needs to be emulated. Emulation of instructions is not done on newer (SMP) processors, as these are configured for VME-mode, and then V86 code can run without GP faults. And V86 code is only run when doing mode-switching of the video adapter, so it is pretty uncommon.

Brendan wrote:For some OS's there's a jump table (e.g. ELF and the Global Offset Table). For other OSs there isn't - there's a table of relocations, and the "executable loader" patches all the offsets in the instructions themselves. There's advantages/disadvantages in both cases. The main disadvantage of the second approach is that it gets very messy once you start looking at memory mapped executable files, and it causes problems for code sharing (e.g. same pages of executable code shared by different processes, with different relocations).

I basically use PE format for applications, and a custom format for device-drivers.

The reason why I do not resolve these references in applications is because every call used will consume a GDT entry (a call gate), and therefore I want to resolve these as they are referenced.

The gates are used in device-drivers (and the kernel itself) as well. Doing relocations in device-drivers before load will work in most device-drivers with no major problems other than a need to provide a reference table. These will not consume call gate entries, but will simply insert the correct call into the code. The only things that will be broken is when the device-driver itself defines the gate, and then also uses it. Or if it uses some other call that is not yet exported because the device-driver is not yet initialized. IOW, some code will be broken in device-drivers if I do this.

Using a call-table might solve this issue, but it would require a lot of space if defined one-to-one (there are currently close to 500 oscalls and 500 usercalls). The relocations would still need to be done at runtime (faults), but a table might be easier to maintain.

The case for the kernel is worse. The kernel uses gates itself, and it uses them in a very early stage of booting the OS. The kernel also is a binary image with no load information. Any relocations in kernel need to be done a build time, and not at load time. However, since the gate entry points are defined at runtime, patching at build time will not work.

rdos · Post by **rdos** » Sun Mar 06, 2011 3:18 am

a5498828 wrote:Is there any reason other than rootkits to use self modyfing code?

The device-driver model used is very old (dates to the early 90s). At this time the 386 processor was new, and there where no appropriate tools for building DLLs or importing entry-points. I used Microsofts MASM and targeted DOS (ordinary exe-files). In order to link calls between device-drivers I therefore invented this method that is pretty smart on single-core processors. I let the provider of the function define the entry-point, which then is stored in a table in the gate resolving part of the kernel. Then the user inserts an UD2 (originally) or an invalid call (now) that includes a gate number. The invalid instruction or GPF handler then can patch the entry-point into the code and re-execute it. I even had an method that did not require patching the code that built a new stack-frame and returned that worked on read-only images (this still works because the fault handler first patches the code and then builds the stack-frame and irets to it). This even worked for V86 (DOS) applications, which again did not patch the code, but rather reflected it to the protected mode handler.

The alternatives I've seen depend on creating a mega-image with a linker, which is not an attractive option for me. All device-drivers in RDOS are separate files that are linked together with gate definitions. The gate definitions forms the basic API.

Another feature for applications is that all APIs will use the carry flag to signal success / failure, and if a gate is not defined (because the driver is not installed), code with be patched to a stub that returns with carry flag set.

rdos · Post by **rdos** » Mon Mar 07, 2011 4:04 pm

The device-driver gates now works (at least when executed by 3 cores at the same time). I had a bug that loaded the wrong offset into IP. Next will be to convert & check the user/application gates, which have more variants.

rdos · Post by **rdos** » Tue Mar 08, 2011 2:42 pm

Patching logic for user/application calls. As stated before, the rules are that every intermediate step must be garanteed to be invalid, and the final step must contain a nop at the first position in order for the GPF handler to know when to re-execute (nops are always valid to re-execute). Instructions used must be avaliable on 386 (because this code must run on all platforms, not only SMP platforms).

16-bit code segment (most device-drivers) to 16-bit handler procedure, far version: (this is the same as the previous device-driver gates, except for the need for 10 bytes, which is demonstrated below).

Code: Select all

66 9A gg gg 00 00 01 00 90 90 (call far 0001:0000gggg from 16-bit segment which is always invalid + nop)

66 9A gg gg ss ss F5 F5 90 90 (place new selector at offset 4 + 2 cmcs. This results in call far F5F5:ssssgggg which is always invalid)

90 9A oo oo ss ss F5 F5 90 90 (place nop, call far and offset at offset 0. This results in the final call to ssss:oooo)

90 9A oo oo ss ss 90 90 90 90 (replace cmcs with nops)

Near version: (also the same, except size):

Code: Select all

66 9A gg gg 00 00 01 00 90 90 (call far 0001:0000gggg from 16-bit segment which is always invalid)

66 9A gg gg oo oo F5 F5 90 90 (place new offset at offset 4 + 2 cmcs. This results in call far F5F5:oooogggg which is always invalid)

90 90 0E E8 oo oo F5 F5 90 90 (place nops and push cs / call near at offset 0. This results in the final push cs / call oooo)

90 90 0E E8 oo oo 90 90 90 90 (replace cmcs with nops)

16-bit code segment to 32-bit handler procedure, far version: (this is a pretty complex transformation that might have easier solutions)

Code: Select all

66 9A gg gg 00 00 01 00 90 90 (call far 0001:0000gggg from 16-bit segment which is always invalid)

66 9A gg gg 00 00 F5 F5 F5 F5 (place 4 cmcs are offset 6. This results in call far F5F5:0000gggg which is always invalid)

3E 67 66 9A 00 00 F5 F5 F5 F5 (change the opcode to ds-override, address size override and operand size override. This results in call far F5F5:F5F50000 which is always invalid)

3E 67 66 9A oo oo oo oo F5 F5 (place offset at offset 4. This results in call far F5F5:oooooooo which is always invalid)
 
3E 67 66 9A oo oo oo oo ss ss (place selector at offset 8. This results in call far ssss:oooooooo, which is the final code, but with extra operand size overrides)

90 90 66 9A oo oo oo oo ss ss (place 2 nops at offset 0. This results in call far ssss:oooooooo)

16-bit code segment to 32-bit handler procedure, near version:

Code: Select all

66 9A gg gg 00 00 01 00 90 90 (call far 0001:0000gggg from 16-bit segment which is always invalid)

66 9A 66 0E 66 E8 01 00 90 90 (place 66 0E 66 E8 at offset 2. This results in call far 0001:E8D60E66 which is always invalid)

66 9A 66 0E 66 E8 oo oo oo oo (place offset at offset 6. This results in call far oooo:E8660E66 which is always invalid  because no code segment in RDOS can be larger than E0000000)

90 90 66 0E 66 E8 oo oo oo oo (place two nops at offset 0. This results in push cs / call near oooooooo which is the final code)

For 32-bit code segments, it is sufficient to use 9 bytes for the opcode.

32-bit code segment to 32-bit handler procedure, far version:

Code: Select all

9A gg gg 00 00 02 00 90 90 (call far 0002:0000gggg from 32-bit segment which is always invalid)

9A gg gg 00 00 F5 F5 F5 F5 (place 4 cmcs at offset 5. This results in call F5F5:0000gggg which is always invalid)

3E 67 9A 00 00 F5 F5 F5 F5 (place ds-override and address size override (no effect) + new call opcode at offset 0. This results in call F5F5:F5F50000 which is always invalid)

3E 67 9A oo oo oo oo F5 F5 (place offset at offset 3. This results in call F5F5:oooooooo which is always invalid)

3E 67 9A oo oo oo oo ss ss (place selector at offset 7. This results in call ssss:oooooooo, which is the final opcode, but with two address size overrides)

90 90 9A oo oo oo oo ss ss (place 2 nops at offset 0. This results in call ssss:oooooooo)

32-bit code segment to 32-bit handler procedure, near version:

Code: Select all

9A gg gg 00 00 02 00 90 90 (call far 0002:0000gggg from 32-bit segment which is always invalid)

9A gg gg 00 00 F5 F5 F5 F5 (place 4 cmcs at offset 5. This results in call far F5F5:0000gggg which is always invalid)

3E 67 9A 00 00 F5 F5 F5 F5 (place ds-override and address size overrides + new call at offset 0. This results in call far F5F5:F5F50000 which is always invalid)

3E 67 9A oo oo oo oo F5 F5 (place offset at offset 3. This results in call far F5F5:ooooooo which is always invalid)

90 0E E8 oo oo oo oo F5 F5 (place nop, push cs and call near at offset 0. This results in push cs / call oooooooo + 2 cmcs, which is the final operation)

90 0E E8 oo oo oo oo 90 90 (place 2 nops at offset 7 in order to remove cmcs)

EDIT: Removed multiple prefixes of the same sort as Intel states the function of these are undefined. Better to use ds-override and address size override, which has no effect.

rdos · Post by **rdos** » Wed Mar 09, 2011 10:18 am

Flat application version (uses a call-gate, only needs 7 bytes):

Code: Select all

9A gg gg 00 00 02 00  (call far 0002:0000gggg from 32-bit segment which is always invalid)

67 66 9A 00 00 02 00 (change to a 16-bit call, and add an address override. This results in call far 0002:0000 which is always invalid)

67 66 9A 00 00 ss ss (put call-gate selector at offset 5. This results in call far ssss:0000, which is the final operation, but with an extra address override)

90 66 9A 00 00 ss ss (put nop at offset 0 to remove address override)

rdos · Post by **rdos** » Mon Mar 14, 2011 4:22 pm

Unfortunately, there is another complication, and that is if the opcode ends with nops, and those are patched, a debugger will still insert the breakpoints at the wrong location, and fail to step-over a call. Therefore, the initial opcode needs to start with extra overrides instead of ending with nops.

New 16-bit code segment to 32-bit handler procedure, far version: (interestingly, this leads to fewer patch steps)

Code: Select all

3E 67 66 9A gg gg 00 00 01 00 (call far 0001:0000gggg from 16-bit segment which is always invalid)

3E 67 66 9A oo oo oo oo 01 00 (place offset at offset 4. This results in call far 0001:oooooooo which is always invalid)

3E 67 66 9A oo oo oo oo ss ss (place selector at offset 8. This results in call far ssss:oooooooo, which is the final code, but with extra operand size overrides)

90 90 66 9A oo oo oo oo ss ss (place 2 nops at offset 0. This results in call far ssss:oooooooo)

16-bit code segment to 32-bit handler procedure, near version: (this patch sequence is longer)

Code: Select all

3E 67 66 9A gg gg 00 00 01 00 (call far 0001:0000gggg from 16-bit segment which is always invalid)

3E 67 66 9A 66 E8 00 00 01 00 (place 66 E8 at offset 4. This results in call 0001:0000E866 which is always invalid)

66 9A 66 0E 66 E8 00 00 01 00 (place 66 9A 66 0E at offset 0. This results in call 0000:E8660E66 which is always invalid)

66 9A 66 0E 66 E8 oo oo oo oo (place offset at offset 6. This results in call far oooo:E8660E66 which is always invalid  because no code segment in RDOS can be larger than E0000000)

90 90 66 0E 66 E8 oo oo oo oo (place two nops at offset 0. This results in push cs / call near oooooooo which is the final code)

While it is possible to reduce the far version to 9 bytes, it does not seem to be possible to construct a patch-sequence for the near version with less than 10 bytes. However, shorter code should be more important, so the solution might be to use 9 bytes, and always patch to the far version instead.

It looks like this:

Code: Select all

67 66 9A gg gg 00 00 01 00 (call far 0001:0000gggg from 16-bit segment which is always invalid)

67 66 9A oo oo oo oo 01 00 (place offset at offset 3. This results in call far 0001:oooooooo which is always invalid)

67 66 9A oo oo oo oo ss ss (place selector at offset 7. This results in call far ssss:oooooooo, which is the final code, but with extra operand size overrides)

90 66 9A oo oo oo oo ss ss (place a nop at offset 0. This results in call far ssss:oooooooo)

mikeee · Post by **mikeee** » Thu Mar 17, 2011 2:47 am

Another thing you have to consider is patching instructions that cross cache lines. The patching process must guarantee that other CPUs never fetch cache lines with "half-patched" instructions that aren't yet valid.

In this case it's important that the address bytes are on one cache line only or that you correctly handle the case where the CPUs see parts of the old address and parts of the new. It may for example get "gg gg oo oo" if the cache line end in the middle of the four address bytes.

rdos · Post by **rdos** » Thu Mar 17, 2011 6:06 am

mikeee wrote:Another thing you have to consider is patching instructions that cross cache lines. The patching process must guarantee that other CPUs never fetch cache lines with "half-patched" instructions that aren't yet valid.

In this case it's important that the address bytes are on one cache line only or that you correctly handle the case where the CPUs see parts of the old address and parts of the new. It may for example get "gg gg oo oo" if the cache line end in the middle of the four address bytes.

I think the real issue is if code-prefetch follow the rules of ordinary bus-locking or not. The patching process uses xchg to patch the code, and it will be atomic regardless of cache line crosses. Provided that code-prefetch obey the bus lock rules, it should not be possible for a core to execute something that is valid (will not protection fault), but still provides the wrong end-result, like you describe above with half of the offset mixed with half the gate number. This could only occur if code-prefetch would not obey the locking rules. Until otherwise is indicated (the code seems to work with 3 cores), I'll assume that code prefetch must obey lock semantics, and will not prefetch inconsistent data for instructions. After all, the processor must prefetch instructions from memory much the same way as data, and the locking algorithm must be implemented at a more basic level in the cache.

mikeee · Post by **mikeee** » Thu Mar 17, 2011 10:39 am

See 11.2 in this manual: http://support.amd.com/us/Embedded_TechDocs/40546.pdf

rdos · Post by **rdos** » Thu Mar 17, 2011 11:14 am

I see the problem now. The current patch logic is only garanteed to work when all the code is fetched at once, which the compiler cannot garantee. Even if the compiler could align the code to 4-byte / 8-byte boundraries, this would mean inserting more nops that slow down execution even after patching.

So, is it valid to assume that all SMP-enabled processors will use at least 8-byte aligned accesses to read at least 8 bytes at a time when prefetching code? That way it would be possible to construct 8 different cases, and make sure that the processor cannot execute mixed-stage code without protection-faulting, even if the prefetches are at different patching-stages. It should be valid to assume that prefetching always goes from lower address towards higher addresses, and always occurs at least 8-byte aligned in 8-byte chunks?

OSDev.org

Code patching and SMP

Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP

Re: Code patching and SMP