Hi,
rdos wrote:This is not as simple as it might sound, as some patches will replace a far call with a push cs / call near, and the instruction itself is 8 bytes long, and there are no atomic instructions for 8 bytes.
I wouldn't say "none" - there is CMPXCHG8B.
For single-CPU machines (including 80486 and older) you can use anything you like because it's single-CPU. For multi-CPU, you say "for multi-CPU the minimum requirement is Pentium or newer" and then you can use "LOCK CMPXCHG8B". In practice this is perfectly fine, as the number of 80486 machines that have multiple CPUs is so close to zero that it's not worth caring about; and for 80386 and older there are even fewer multi-CPU machines (and because they pre-date Intel's MultiProcessor Specification there's no standard way of doing anything with them anyway).
The only other problem is that "LOCK CMPXCHG8B" forms the basis of the F00F bug on Pentium (the bug is triggered by the invalid register-operand encoding, F0 0F C7 C8), so you'd want to implement an F00F bug workaround, but you want to do that anyway if you support Pentium CPUs.
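A minimal sketch of the 8-byte replacement, assuming GCC or Clang and an 8-byte aligned patch site. The name `patch8` and the alignment requirement are assumptions for illustration, not anything from rdos' code. On 32-bit x86 targets of Pentium class or later the builtin is typically compiled to LOCK CMPXCHG8B; on x86-64 it becomes a LOCK CMPXCHG on a 64-bit operand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: atomically replace the 8 bytes at 'site' with
 * 'new_bytes', but only if they still contain 'old_bytes'.
 * Returns true if this caller did the patch, false if the bytes had
 * already changed (e.g. another CPU patched the site first). */
static bool patch8(uint64_t *site,
                   const uint8_t old_bytes[8],
                   const uint8_t new_bytes[8])
{
    uint64_t expected, desired;

    /* Assumption: patch sites are 8-byte aligned, so the locked access
     * never straddles a cache line. */
    assert(((uintptr_t)site & 7u) == 0);

    memcpy(&expected, old_bytes, 8);
    memcpy(&desired, new_bytes, 8);

    /* Strong compare-and-swap with sequentially consistent ordering. */
    return __atomic_compare_exchange_n(site, &expected, desired, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

Because the swap only succeeds while the original bytes are still in place, two CPUs racing to patch the same site cannot both win, which may remove the need for a separate spinlock around the write itself.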
rdos wrote:EDIT: Looking for better instructions to patch with, I found cmc (0F5h).
I'd be tempted to use "UD2". It's guaranteed to generate an undefined opcode exception (even on older CPUs), and it's only 2 bytes. For example, the first 2 bytes would be UD2, the next 2 bytes would determine the type of patch needed, and the last 4 bytes would be spare space. You'd write the last 4 bytes (which won't affect the UD2 that any other CPU sees), then you'd write the first 4 bytes.
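The layout and write order described above could be sketched like this. The function names, the little-endian 16-bit patch type and the NOP filler are assumptions made for the example; it also assumes a 4-byte aligned site, and it leaves out whatever serialising the CPUs need after cross-modifying code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical placeholder layout (8 bytes):
 *   bytes 0-1: UD2 (0F 0B)
 *   bytes 2-3: patch type, little-endian
 *   bytes 4-7: spare space (filled with NOPs here, an arbitrary choice) */
static void make_placeholder(uint8_t *site, uint16_t patch_type)
{
    site[0] = 0x0F;
    site[1] = 0x0B;
    site[2] = (uint8_t)(patch_type & 0xFF);
    site[3] = (uint8_t)(patch_type >> 8);
    memset(site + 4, 0x90, 4);
}

/* Replace the placeholder with 8 bytes of final code using two aligned
 * 4-byte stores: the second half first, then the half that holds UD2. */
static void patch_two_step(uint32_t *site, const uint8_t final_code[8])
{
    uint32_t first_half, second_half;

    /* Assumption: the site is 4-byte aligned so each store is atomic. */
    assert(((uintptr_t)site & 3u) == 0);

    memcpy(&first_half, final_code, 4);
    memcpy(&second_half, final_code + 4, 4);

    /* Other CPUs still see UD2 at the start while this is written. */
    __atomic_store_n(&site[1], second_half, __ATOMIC_SEQ_CST);

    /* Overwriting UD2 makes the complete instruction visible in one store. */
    __atomic_store_n(&site[0], first_half, __ATOMIC_SEQ_CST);
}
```

At every point a CPU fetching from the site sees either UD2 followed by bytes it never executes, or the complete final code, provided the final instruction's first 4 bytes are the only ones that matter for where execution starts.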
As a bonus, the undefined opcode handler (typically) needs to examine the opcode of the instruction that caused the fault (to emulate instructions that aren't supported by the CPU), so adding a check for UD2 doesn't add much extra overhead to that exception handler; while the GPF handler usually doesn't need to examine the opcode of the instruction that caused the fault, so adding a check for "call far" does add extra overhead to that exception handler.
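As a rough illustration of how small that extra check could be, here is a sketch of the test an undefined opcode handler might run on the faulting bytes. The function name, the `readable` parameter and the little-endian patch type are hypothetical; a real handler would also confirm that the address lies in a region that is allowed to contain patch placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check for an undefined opcode handler.
 * 'ip' points at the faulting instruction and 'readable' is how many
 * bytes at 'ip' are known to be safely readable.
 * Returns the 16-bit patch type if the bytes look like a placeholder
 * (UD2 followed by a type field), otherwise -1. */
static int32_t placeholder_patch_type(const uint8_t *ip, size_t readable)
{
    assert(ip != NULL);

    if (readable < 4)
        return -1;               /* type field not safely readable */
    if (ip[0] != 0x0F || ip[1] != 0x0B)
        return -1;               /* not UD2: some other invalid opcode */

    return (int32_t)(ip[2] | (ip[3] << 8));
}
```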
rdos wrote:But patch-time is not an issue. The instruction will only be patched once, and if the code is called a second time it will go directly to the destination, with no overhead.
Let's do a rough "worst case" estimate. An exception is about 100 cycles. The exception handler needs to determine if the exception was caused by an instruction that needs patching or not; and has to take into account a few corner cases - e.g. first half of "instruction to be patched" placed so that the second half is on a different ("not present" or "supervisor") page, or beyond a segment limit; and checking if the instruction is in a valid code section (and the process isn't executing random data because it crashed). The checking, etc might cost another 50 cycles. Then there's the spinlock and lock contention - even without any contention that might add 20 cycles (and with lock contention it might be as bad as 50 cycles per CPU or something). A pipeline flush (caused by self/cross modifying code) is probably another 100 cycles. So, for an extremely rough "worst case" estimate, maybe it adds up to about 300 cycles.
Now imagine some initialisation code - no loops or anything, and the code is only run once. Maybe there are 100 instructions that need patching. Maybe that adds up to a total of 30000 cycles of overhead, and maybe it could run 10 times faster if the patching was done differently.
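The arithmetic behind that estimate, written out as a small sketch. The struct and function names are made up for the example, and the cycle counts are just the rough guesses from the text, not measurements.

```c
#include <assert.h>

/* Rough per-patch cost components, in cycles (guesses, not measurements). */
struct patch_cost {
    unsigned exception;       /* taking the exception */
    unsigned checks;          /* "does this need patching" and corner cases */
    unsigned spinlock;        /* uncontended lock */
    unsigned pipeline_flush;  /* self/cross modifying code */
};

/* Sum of the individual components for one patched instruction. */
static unsigned cost_per_patch(const struct patch_cost *c)
{
    return c->exception + c->checks + c->spinlock + c->pipeline_flush;
}

/* Total overhead for code that hits 'patches' patch sites once each. */
static unsigned total_overhead(unsigned patches, unsigned cycles_per_patch)
{
    return patches * cycles_per_patch;
}
```

With 100 + 50 + 20 + 100 the components sum to 270 cycles, which is where the "about 300" comes from; 100 patch sites at 300 cycles each gives the 30000 cycle total.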
rdos wrote:This is faster than the dynamic linking alternatives that usually make a jump to a patch-table or similar.
For some OSs there's a jump table (e.g. ELF with its Procedure Linkage Table and Global Offset Table). For other OSs there isn't - there's a table of relocations, and the "executable loader" patches all the offsets in the instructions themselves. There are advantages and disadvantages in both cases. The main disadvantage of the second approach is that it gets very messy once you start looking at memory mapped executable files, and it causes problems for code sharing (e.g. same pages of executable code shared by different processes, with different relocations).
All I'm saying is that if you had a table containing the offset of each instruction that needs patching, you could patch everything before the code is executed (e.g. in the executable loader), and get rid of the exception overhead and serialising, get rid of the need for spinlocks and lock contention, get rid of the "does this instruction need patching" testing (and get rid of the chance of false positives), get rid of most of the "is it safe to patch" corner-cases, and improve startup times for executables.
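A minimal sketch of what load-time patching from such a table might look like. The table format (`struct patch_entry` with an offset and 8 replacement bytes) and the function name are purely hypothetical, since the actual executable format isn't known here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical patch table entry stored in the executable file. */
struct patch_entry {
    uint32_t offset;          /* offset of the instruction in the code image */
    uint8_t  replacement[8];  /* final bytes for that 8-byte site */
};

/* Apply every patch before any of the code runs. No other CPU can be
 * executing the image yet, so plain copies are enough: no exceptions,
 * no locks and no guessing whether an instruction needs patching.
 * Entries that don't fit inside the image are skipped.
 * Returns the number of patches applied. */
static size_t apply_patches(uint8_t *image, size_t image_size,
                            const struct patch_entry *table, size_t count)
{
    size_t applied = 0;

    assert(image != NULL && (table != NULL || count == 0));

    for (size_t i = 0; i < count; i++) {
        size_t off = table[i].offset;

        if (off > image_size || image_size - off < 8)
            continue;         /* malformed entry: site is out of range */

        memcpy(image + off, table[i].replacement, 8);
        applied++;
    }
    return applied;
}
```

The executable loader would call this once per image, after the code has been loaded and before it is made executable or shared with anything else.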
Of course I'm only making suggestions (that may or may not be suitable for your specific case), based on very little knowledge about your specific case - I know you're patching code, but I don't know why and don't know much else about your OS (e.g. what your executable format looks like, if you're using paging, if you're planning to support memory mapped executables, etc).
Cheers,
Brendan