Hi,
This may be a little more complex than it first seems.
First, different CPUs have different bugs which make CPUID unreliable. A CPU might report that it supports a certain feature when it doesn't (e.g. Pentium Pro says it supports SYSENTER when it doesn't), or a CPU might report that it doesn't support a feature when it actually does (e.g. lots of CPUs do support CMPXCHG8B but don't report it because of an old bug in Windows NT). Also, a CPU might correctly report that it supports a feature, but the feature might be so buggy that it's better to pretend that the CPU doesn't support it. There are also a few features that are supported by some CPUs in long mode but not in protected mode, or supported in protected mode but not long mode, depending on who made the CPU (mostly SYSENTER and SYSCALL). Finally, you might be interested in features that aren't reported by CPUID at all. This could include certain debugging and performance monitoring options (where software can't tell if they're supported or not without looking at MSRs).
For all of these reasons, the first step is writing code to examine CPUID (and any other information) to build a standardized set of feature flags, so that all software can check your feature flags to determine if a feature is supported or not (so that software can avoid all the messy parts of feature detection). I'd actually build 2 sets of feature flags - one set for features that normal processes might care about, and another set for features that only the kernel should care about.
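As a rough sketch of what that first step might look like (the FEAT_* names, bit positions and function name here are made up for illustration - only the CPUID leaf 1 bit positions and the SYSENTER check come from Intel's documentation):

```c
#include <stdint.h>

/* Hypothetical OS-defined feature bits (names and positions are illustrative). */
#define FEAT_CMPXCHG8B  (1u << 0)
#define FEAT_SYSENTER   (1u << 1)
#define FEAT_CMOV       (1u << 2)

/* Raw CPUID leaf 1 EDX bits. */
#define CPUID1_EDX_CX8  (1u << 8)
#define CPUID1_EDX_SEP  (1u << 11)
#define CPUID1_EDX_CMOV (1u << 15)

/* Convert raw CPUID leaf 1 output into the OS's own flags, applying quirks.
   The raw values are passed in so the quirk logic can be tested anywhere. */
uint32_t build_user_features(uint32_t cpuid1_eax, uint32_t cpuid1_edx)
{
    uint32_t stepping = cpuid1_eax & 0xF;
    uint32_t model    = (cpuid1_eax >> 4) & 0xF;
    uint32_t family   = (cpuid1_eax >> 8) & 0xF;
    uint32_t flags    = 0;

    if (cpuid1_edx & CPUID1_EDX_CX8)  flags |= FEAT_CMPXCHG8B;
    if (cpuid1_edx & CPUID1_EDX_CMOV) flags |= FEAT_CMOV;

    /* Quirk: family 6 CPUs with model < 3 and stepping < 3 (Pentium Pro)
       set the SEP bit but don't actually support SYSENTER/SYSEXIT. */
    if (cpuid1_edx & CPUID1_EDX_SEP) {
        if (!(family == 6 && model < 3 && stepping < 3)) {
            flags |= FEAT_SYSENTER;
        }
    }
    return flags;
}
```

This only handles the "reports it but doesn't support it" case. The "supports it but doesn't report it" case (e.g. hidden CMPXCHG8B) needs vendor-specific handling and isn't shown. A second function of the same shape would build the kernel-only set.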
Next, a kernel can emulate instructions that aren't supported by the CPU, and it's fairly easy to emulate some things (CMOVcc, MOVBE, SYSCALL, SYSRET, etc). Some things (e.g. FPU, MMX, SSE) are much harder to emulate, but it's entirely possible with enough work. Because of this I'd have an extra set of flags (with identical meanings for each bit as the first set of flags) that includes things that are supported by the CPU and things that are emulated. That way if a piece of software requires a certain feature then it can check the "supported or emulated" flags to see if it can be executed or not, and these flags can also be used when performance doesn't matter (e.g. initialization code that would like to use CMOVcc but doesn't care if it's emulated).
However, instruction emulation can involve splitting a feature flag in half. For example, if a CPU doesn't support CMPXCHG8B then you can easily emulate the CMPXCHG8B instruction, but you won't be able to emulate "LOCK CMPXCHG8B" on multi-CPU machines (on single CPU machines you can probably get away with disabling IRQs while doing a plain CMPXCHG8B). In this case you might have one flag for "non-atomic CMPXCHG8B is emulated" and another flag for "atomic CMPXCHG8B is emulated" so that it's possible for software to use (emulated) CMPXCHG8B when that software doesn't need it to be atomic. Another example might be FPU instructions. I'd be tempted to split them in half too - one flag for simple operations and another flag for complex operations (sin, cos, sqrt, etc) so that a kernel can emulate the easy stuff without emulating the hard stuff if it wants to.
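To make the two sets of flags (and the split CMPXCHG8B flag) a bit more concrete, here's a minimal sketch. It assumes a hypothetical kernel that can emulate CMOVcc and plain CMPXCHG8B everywhere, but can only emulate the LOCKed version on single-CPU machines; all names are invented for the example:

```c
#include <stdint.h>

/* Hypothetical feature bits; both sets use the same bit meanings. */
#define FEAT_CMOV        (1u << 0)
#define FEAT_CX8_PLAIN   (1u << 1)   /* CMPXCHG8B without LOCK */
#define FEAT_CX8_ATOMIC  (1u << 2)   /* LOCK CMPXCHG8B */

struct feature_sets {
    uint32_t supported;              /* done natively by the CPU */
    uint32_t supported_or_emulated;  /* native, or emulated by the kernel */
};

struct feature_sets make_feature_sets(uint32_t supported, unsigned cpu_count)
{
    /* What this (assumed) kernel is able to emulate. */
    uint32_t emulated = FEAT_CMOV | FEAT_CX8_PLAIN;

    /* Atomic emulation only works when no other CPU can interfere. */
    if (cpu_count == 1) {
        emulated |= FEAT_CX8_ATOMIC;
    }

    struct feature_sets sets = { supported, supported | emulated };
    return sets;
}
```

Software that needs a feature checks "supported_or_emulated"; software that only wants the feature when it's fast checks "supported".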
In some cases software might want to disable a CPU's feature and use your emulation instead. For example, some Pentium CPUs have problems with the FDIV instruction (where the result is wrong/inaccurate), and an accounting package might want to use your emulated FPU instead of the real FPU to avoid the problem.
Also note that in multi-CPU systems it's possible for different CPUs to support different features. At a minimum you could AND all the feature flags from each CPU together to find the set of features that are supported by all CPUs. A better idea might be to use CPU affinity to make sure that software isn't scheduled on CPUs that don't support features that the software is using.
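Both approaches are simple bit operations. A minimal sketch, assuming the per-CPU flags have already been collected into an array and that there are at most 64 CPUs (the function names are made up):

```c
#include <stddef.h>
#include <stdint.h>

/* Features supported by every CPU: AND all per-CPU flags together. */
uint32_t common_features(const uint32_t *per_cpu, size_t cpu_count)
{
    if (cpu_count == 0) {
        return 0;
    }
    uint32_t common = per_cpu[0];
    for (size_t i = 1; i < cpu_count; i++) {
        common &= per_cpu[i];
    }
    return common;
}

/* Affinity mask: bit i is set if CPU i supports everything in "required".
   Assumes cpu_count <= 64 so the mask fits in one uint64_t. */
uint64_t affinity_for(const uint32_t *per_cpu, size_t cpu_count, uint32_t required)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < cpu_count && i < 64; i++) {
        if ((per_cpu[i] & required) == required) {
            mask |= (uint64_t)1 << i;
        }
    }
    return mask;
}
```

The scheduler would then restrict a task to the CPUs in the returned mask; how that's done depends entirely on your scheduler.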
Some CPU features require kernel support. This includes MMX and SSE (where the kernel must be able to save/restore MMX and/or SSE state, and must be able to handle any SSE exceptions) and certain debugging and performance monitoring features. You should be able to find relevant information for what is required in the software developer's manual.
In all executable files I'd have a similar set of flags (e.g. in the header) that say which features the executable requires, so that the loader (and other software) can determine if the executable requires features that aren't supported or emulated. That way if someone tries to run a program that can't be executed, the GUI can pop up a nice "This program requires features that aren't supported on this computer" dialog box, and you won't need to add this initial check (or the dialog box) into every executable.
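The loader's check could be as small as this. The header layout is purely hypothetical (a real executable format would have a lot more in it), and the flags use the same bit meanings as the OS's feature flags:

```c
#include <stdint.h>

/* Hypothetical executable header; only the field that matters here is real. */
struct exe_header {
    uint32_t magic;
    uint32_t required_features;   /* same bit meanings as the OS's flags */
};

/* Returns the required features that are neither supported nor emulated.
   Zero means the executable can be run. */
uint32_t missing_features(const struct exe_header *hdr,
                          uint32_t supported_or_emulated)
{
    return hdr->required_features & ~supported_or_emulated;
}
```

Returning the missing bits (rather than just yes/no) lets the dialog box say which features are missing, if you want it to.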
That only leaves features that a piece of software doesn't require but can optionally use. In this case, during initialization I'd ask the OS for the "supported feature flags" (*not* the "supported or emulated feature flags"), and use them to set up function pointers during initialization and for conditional branches after initialization.
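For the function pointer case, a minimal sketch might look like this. FEAT_SSE2 and the function names are assumptions for the example, and the "fast" version is just an unrolled loop standing in for real SSE2 code:

```c
#include <stddef.h>
#include <stdint.h>

#define FEAT_SSE2 (1u << 0)   /* hypothetical bit in the OS's "supported" flags */

typedef uint32_t (*sum_fn)(const uint32_t *data, size_t count);

/* Generic version that works on any CPU. */
static uint32_t sum_generic(const uint32_t *data, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}

/* Stand-in for an optimized version (real code would use SSE2 here). */
static uint32_t sum_unrolled(const uint32_t *data, size_t count)
{
    uint32_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
        d += data[i + 3];
    }
    for (; i < count; i++) {
        a += data[i];
    }
    return a + b + c + d;
}

/* Defaults to the safe version until init_dispatch() is called. */
static sum_fn sum_impl = sum_generic;

/* Called once during initialization with the OS's "supported" flags. */
void init_dispatch(uint32_t supported)
{
    sum_impl = (supported & FEAT_SSE2) ? sum_unrolled : sum_generic;
}

uint32_t sum(const uint32_t *data, size_t count)
{
    return sum_impl(data, count);
}
```

Note that the selection uses the "supported" flags only - there's no point picking the optimized version if the kernel would end up emulating it.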
This can be extremely fast, or extremely slow, depending on where and how it's used, and how often it's used. For example, instead of using a function pointer or a conditional branch in the middle of a tight loop it'd be better to have 2 separate loops. Most CPUs remember branch targets (including indirect calls) so after the first time there's no penalty at all (unless there are too many branch targets for the CPU to remember and the least recently used branch targets get overwritten).
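The "2 separate loops" idea just means testing the flag once, outside the loop, rather than once per iteration. A small sketch (the flag and both step functions are invented for the example, and deliberately compute the same result):

```c
#include <stddef.h>
#include <stdint.h>

#define FEAT_FAST_PATH (1u << 0)   /* hypothetical feature bit */

/* Two implementations of the same per-element step. */
static inline uint32_t step_plain(uint32_t x) { return x * 3u + 1u; }
static inline uint32_t step_fast(uint32_t x)  { return (x << 1) + x + 1u; }

void transform(uint32_t *data, size_t count, uint32_t supported)
{
    /* One feature test, then a tight loop with no feature test inside it. */
    if (supported & FEAT_FAST_PATH) {
        for (size_t i = 0; i < count; i++) {
            data[i] = step_fast(data[i]);
        }
    } else {
        for (size_t i = 0; i < count; i++) {
            data[i] = step_plain(data[i]);
        }
    }
}
```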
IMHO using self-modifying code (e.g. copying the best version of some code into a fixed location) is usually a bad idea - it means that your code needs to be writable and can be modified by bugs (e.g. uninitialized pointer) or malicious code (e.g. bad plugins or libraries). The overhead of modifying the code is often larger than the overhead of using function pointers and/or conditional branches (assuming function pointers and/or conditional branches are used sensibly). Worst case is if the OS supports memory mapped executable files, where unmodified pages are loaded from the file system if/when needed and can be freed at any time to save RAM. In this case modified pages would need to be loaded from disk when they're modified, then sent to swap to reclaim the RAM, and then reloaded from swap if/when needed; which either means a lot more file I/O, or means that less RAM is left free for more important things.
Cheers,
Brendan