Hi,
Pype.Clicker wrote:Yes, that could be done. That'd look much like a kernel where you'd have e.g. a memman module compiled for i386, another one compiled for i686, etc. and having the loader figure out what to link with what (or having a meta-module sensing the CPU features and exporting a different set of functions).
Though, most of the time, your module has relocation info, right? So there's no need for an actual "jump table": you know in advance where all the function calls are and how to patch the "call xxxxxxx" with the actual location of the callee.
For the kernel API, user-level code needs to know how to access kernel functions. A common way to do this is to have an entry point (software interrupt, SYSCALL/SYSENTER, etc.) that does something like "call [kernelAPItable + eax * 4]". It's this call table that is wide open for boot-time configuration.
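As a rough sketch only (all the names here - kernelAPItable, kernel_api_install, kernel_api_dispatch - are invented for this post, not taken from any real kernel), the same idea in C is just a table of function pointers that boot code fills in, with the entry point dispatching through it:

Code:
/* Hypothetical sketch of a kernel API call table. The syscall entry
 * stub effectively does "call [kernelAPItable + eax * 4]"; which
 * functions end up in the table is decided once at boot time. */

typedef long (*kernel_api_fn)(long arg1, long arg2, long arg3);

#define KERNEL_API_MAX 256

static kernel_api_fn kernelAPItable[KERNEL_API_MAX];

/* Called during boot to (re)configure an entry in the table */
static void kernel_api_install(int number, kernel_api_fn fn)
{
    if (number >= 0 && number < KERNEL_API_MAX) {
        kernelAPItable[number] = fn;
    }
}

/* What the entry point effectively does after the software interrupt
 * or SYSENTER: dispatch through the table, so callers never know (or
 * care) which implementation was installed at boot. */
long kernel_api_dispatch(int number, long arg1, long arg2, long arg3)
{
    if (number < 0 || number >= KERNEL_API_MAX || kernelAPItable[number] == 0) {
        return -1;              /* "bad function number" */
    }
    return kernelAPItable[number](arg1, arg2, arg3);
}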
The IDT is also quite similar - e.g. it's not hard to have several "device not available" exception handlers and decide which one to install depending on whether or not FPU, FXSAVE, MMX/SSE, etc. is supported. None of this reduces performance, or requires any boot-time linker or auto-configure mess...
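For example (another rough sketch - cpu_has_fxsave, idt_set_handler and the handler names are assumptions made up for illustration), selecting the "device not available" handler at boot might look like:

Code:
/* Hypothetical sketch: pick which #NM ("device not available", vector 7)
 * handler to install at boot, based on feature flags gathered from CPUID. */

extern int cpu_has_fxsave;                    /* filled in from CPUID at boot */
extern void idt_set_handler(int vector, void (*handler)(void));

void nm_handler_fsave(void);                  /* saves FPU state with FSAVE/FRSTOR */
void nm_handler_fxsave(void);                 /* saves FPU state with FXSAVE/FXRSTOR */

void install_nm_handler(void)
{
    if (cpu_has_fxsave) {
        idt_set_handler(7, nm_handler_fxsave);
    } else {
        idt_set_handler(7, nm_handler_fsave);
    }
}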
For my OS, there are kernel modules which also use a call table to access each other's functions (the "internal kernel API"). On a larger scale this means I can have a "plain paging" module and a "PAE paging" module and still use the same "scheduler" module without caring which paging module is being used. On a smaller scale, it means that a module can select which versions of its functions to install into the internal kernel API depending on what features, etc. are present. Despite this, none of my binaries ever use run-time linking and there is no relocation information, etc. The only disadvantage is the additional cost of indirect calls, and that it requires the "internal kernel API" to be well documented (something I consider a good thing in any case).
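To give a rough idea of the shape of this (again, every name here is invented for illustration, not copied from my actual code), the paging module could install its own set of functions into a shared structure at boot, and the scheduler would only ever call through that structure:

Code:
/* Hypothetical sketch of an "internal kernel API" slot: the paging
 * module installs whichever versions of its functions suit the
 * hardware, and other modules (e.g. the scheduler) only call through
 * the table and never care which module is behind it. */

struct paging_api {
    void *(*map_page)(void *virt, unsigned long phys, unsigned flags);
    void  (*unmap_page)(void *virt);
    void  (*switch_address_space)(unsigned long top_level_phys);
};

static struct paging_api paging;              /* the shared API slot */

extern const struct paging_api plain_paging_api;   /* from the "plain paging" module */
extern const struct paging_api pae_paging_api;     /* from the "PAE paging" module */

extern int cpu_has_pae;                       /* from CPUID at boot */

void paging_module_init(void)
{
    /* Select which set of functions becomes "the" paging API */
    paging = cpu_has_pae ? pae_paging_api : plain_paging_api;
}

/* The scheduler doesn't know or care which paging module was installed */
void scheduler_switch_to(unsigned long next_address_space)
{
    paging.switch_address_space(next_address_space);
}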
Pype.Clicker wrote:I guess arguing more about "we could have something that patches 'bne $+4 mov eax, ebx' into 'cmoveq eax, ebx'" without actually coming up with a tool that produces "diffs" between memman-i386.o and memman-i686.o (so that we can add this into the .patch-cpu-i686 section of the final "memman.o") is just speaking into the void. So I'll be back on it when I have the "diff" tool -- and the accompanying patcher.
Hehehee - I'll expect your return somewhere near the end of the century...
Will your "memman.o" patch also replace occurances of "mov eax,cr3; mov cr3,eax" with "invlpg [???]" instructions, re-optimize register usage now that "eax" isn't used, change all of the relative and fixed offsets (for both code and data) to account for the differences in instruction sizes, and still not mess things up when presented with multi-CPU "TLB shootdown"?
After you've spent many years perfecting this method, will anyone be able to notice the performance difference compared to run-time decisions, like:
Code:
if (CPU.features.invlpg == 1) {
    /* CPU supports INVLPG - invalidate just this one TLB entry */
    invalidate(address);
} else {
    /* No INVLPG (e.g. an 80386) - reload CR3 to flush the whole TLB */
    reload_CR3();
}
And finally, if you had tools capable of doing this patching now, would you decide to use them, or would it increase maintenance and testing hassles too much to consider? I would assume it wouldn't take that long before you're applying several different patches to the same base code, and would need to find any incompatibilities and/or dependencies this causes...
The way I see it, you could spend a little time creating tools that allow very minor changes, with very minor performance differences and minor hassles, or you could spend a huge amount of time creating tools that allow large changes, with large performance differences and large hassles...
Cheers,
Brendan