Make sure of new instructions, support old processors.

Stachelsk
Posts: 5
Joined: Thu Jul 31, 2008 2:10 pm

Make sure of new instructions, support old processors.

Post by Stachelsk »

How does one go about supporting new instructions, like those provided with the release of SSE4.1/SSE4.2, without killing support for older processors...? I'm thinking really basic right now, and I can't think of an efficient way to go about doing this without building separate binaries... which would be a huge waste of space, and really cumbersome.

The best thing that I can think of is to use cpuid to get the processor's features, and store a bit somewhere if the "new" instruction is supported, whatever that may be. Then later on, use a conditional...

Code:

if (new_instruction_is_supported) {
    use_new_instruction
} else {
    software_emulation_for_instruction
}
but this is rather inefficient, especially if called multiple times...
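For what it's worth, the CPUID check itself only has to run once at startup, and the result can be cached in flags like the post describes. A minimal sketch in C using GCC/Clang's <cpuid.h> helper — the flag names are made up, but the bit positions (SSE4.1 in ECX bit 19, SSE4.2 in ECX bit 20 of leaf 1) are the documented ones:

```c
#include <cpuid.h>

/* Feature flags cached once at startup (hypothetical names). */
static int has_sse41, has_sse42;

static void detect_features(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1 returns feature bits in ECX/EDX:
       SSE4.1 is ECX bit 19, SSE4.2 is ECX bit 20. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        has_sse41 = (ecx >> 19) & 1;
        has_sse42 = (ecx >> 20) & 1;
    }
}
```

After this runs, the flags are plain ints, so the per-call cost is just the conditional the post is worried about — which is why the replies below move the decision out of the call path entirely.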
JohnnyTheDon
Member
Posts: 524
Joined: Sun Nov 09, 2008 2:55 am
Location: Pennsylvania, USA

Re: Make sure of new instructions, support old processors.

Post by JohnnyTheDon »

There are a couple of methods that I can think of. One is building separate binaries, like you said.

Any code that goes through dynamic translation (including but not limited to Java and .NET) can be, in theory, conditionally compiled to use/not use SSE instructions based on the target computer's abilities. However, I'm not sure that either of these actually use any of the newer SSE instructions.

You could also use function pointers. The program can check if the instructions it wants to use are supported, and change function pointers accordingly. You might want to do this with a math function like cos.

It also depends on what level of SSE you plan on using. If you're not going to use anything past SSE2 and the program is 64-bit, you don't have to worry about breaking compatibility because SSE2 is part of AMD64 (and EM64T and Intel 64 and x86-64 ...).
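The function-pointer approach can be set up once at initialization so the per-call cost is a single indirection. A sketch in C — the names are hypothetical, and the "SSE4.1" body is a plain-C stand-in for real intrinsics (a dot product here instead of cos, just for illustration):

```c
#include <stddef.h>

/* Two interchangeable implementations; in a real program the second
   would be built around the newer instructions (stand-in body here). */
static float dot_generic(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

static float dot_sse41(const float *a, const float *b, size_t n)
{
    /* Stand-in: would use SSE4.1 (e.g. DPPS) on a capable CPU. */
    return dot_generic(a, b, n);
}

/* One pointer, set once at startup; callers just call dot(...). */
static float (*dot)(const float *, const float *, size_t) = dot_generic;

static void pick_implementation(int cpu_has_sse41)
{
    dot = cpu_has_sse41 ? dot_sse41 : dot_generic;
}
```

Callers never test the feature flag again — the choice is baked into the pointer.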
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom

Re: Make sure of new instructions, support old processors.

Post by JamesM »

JohnnyTheDon wrote:You could also use function pointers. The program can check if the instructions it wants to use are supported, and change function pointers accordingly. You might want to do this with a math function like cos.
A quicker way of doing this might actually be to have "cos" (for example) defined as a label to a blank area of memory and to memcpy into it the best version available on the given processor. That would remove any cost associated with that extra indirection.
JohnnyTheDon
Member
Posts: 524
Joined: Sun Nov 09, 2008 2:55 am
Location: Pennsylvania, USA

Re: Make sure of new instructions, support old processors.

Post by JohnnyTheDon »

JamesM wrote:
You could also use function pointers. The program can check if the instructions it wants to use are supported, and change function pointers accordingly. You might want to do this with a math function like cos.
A quicker way of doing this might actually be to have "cos" (for example) defined as a label to a blank area of memory and to memcpy into it the best version available on the given processor. That would remove any cost associated with that extra indirection.
Excellent idea. You would have to be careful about making sure the memory that you copy the function to is executable, so it will run on processors that support the Execute Disable bit. On Linux, I believe this is done using the mprotect() function, and I'm sure Windows has a similar construct. You could also change the headers of your executable to make the data section where your functions live executable.
frank
Member
Posts: 729
Joined: Sat Dec 30, 2006 2:31 pm
Location: East Coast, USA

Re: Make sure of new instructions, support old processors.

Post by frank »

JamesM wrote:A quicker way of doing this might actually be to have "cos" (for example) defined as a label to a blank area of memory and to memcpy into it the best version available on the given processor. That would remove any cost associated with that extra indirection.
But then you would have to reserve enough memory for all of the implementations, which might not be too hard when you have 2 or 3, but if you end up with 10 different versions of the same function it might get a little complicated.

Does anyone have any hard data to support the idea that using function pointers really affects performance all that much? I would guess the processor could in theory perform indirect calls just as fast as direct ones; isn't the only real difference a memory access, and with all the out-of-order execution and prefetching wouldn't that cost be virtually eliminated? Or does the processor treat an indirect call differently?

Edit: According to the Intel Instruction Set Manual, indirect and direct calls are treated the same; however, according to the Optimization Manual, the processor may use different branch prediction algorithms for indirect and direct branches. So an indirect branch may have a negative effect on branch prediction, but both types of calls are about the same if the address is in the cache.
Last edited by frank on Mon Jun 29, 2009 9:05 pm, edited 1 time in total.
NickJohnson
Member
Posts: 1249
Joined: Tue Mar 24, 2009 8:11 pm
Location: Sunnyvale, California

Re: Make sure of new instructions, support old processors.

Post by NickJohnson »

Afaik (through Gentoo), the way this is usually "solved" on an entire-program basis (i.e. not just for key functions) is just by having a source-based installation system. You compile with optimizations for the instructions on your specific processor, then only have those binaries on your machine. No extra features or programs needed, and no overhead except on install. Of course, you need to *have* the source of the programs you're installing, which may not be an option.
frank wrote:... isn't the only real difference a memory access ...
That's a pretty big difference - (non-cached) memory is 10-100 times slower than the processor's registers.
frank
Member
Posts: 729
Joined: Sat Dec 30, 2006 2:31 pm
Location: East Coast, USA

Re: Make sure of new instructions, support old processors.

Post by frank »

@ OP Well, the three methods we've come up with so far are:
  • Conditional compilation - have #ifdef for different architectures
  • Copy the optimized function to a region of memory and run it from there
  • Function pointers
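For the first of these, GCC and Clang predefine macros such as __SSE4_1__ when the build targets a CPU that has those instructions (e.g. -msse4.1 or -march=native), so the decision is made once per binary rather than at run time. A sketch — both branches here are plain-C stand-ins for the real optimized and fallback bodies:

```c
/* Build-time switch: __SSE4_1__ is predefined by GCC/Clang when the
   target architecture supports SSE4.1. */
#if defined(__SSE4_1__)
#define FAST_PATH 1
#else
#define FAST_PATH 0
#endif

/* Same interface in every build; only the chosen body differs. */
float add_one(float x)
{
#if FAST_PATH
    return x + 1.0f;   /* stand-in for the SSE4.1-capable build */
#else
    return x + 1.0f;   /* stand-in for the older-processor build */
#endif
}
```

The cost is exactly what the OP wanted to avoid: one binary per processor family.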
NickJohnson wrote:That's a pretty big difference - (non-cached) memory is 10-100 times slower than the processor's registers.
Yes, if you can make the call into a direct call by compiling for a specific architecture ahead of time then you would get huge benefits, but what if you can't? What if you want a program, or maybe a kernel, to run on everything from Pentiums to Core 2s out of the box and not be limited to the functionality of the Pentium? On some architectures it may be impossible to modify the code sections to copy the new function in, and they most certainly won't run code from a data section. That, and the No eXecute protection built into new CPUs would certainly raise a fault if you tried to run code from a data area.

About the not-in-cache penalty: assuming all of the data for a program is nicely packed, any memory access would populate the cache and the indirect jump would have almost zero cost. Hopefully, if the address wasn't in cache, the prefetcher would grab it even before the CPU made it to the indirect jump.
NickJohnson
Member
Posts: 1249
Joined: Tue Mar 24, 2009 8:11 pm
Location: Sunnyvale, California

Re: Make sure of new instructions, support old processors.

Post by NickJohnson »

frank wrote:Yes if you can make the call into a direct call by compiling for a specific architecture ahead of time then you would have huge benefits, but what if you can't? What if you want a program, or maybe a kernel, run on everything from Pentiums to Core 2s out of the box and not be limited to the functionality of the Pentium?
As long as you have the source and a fast compiler, you can get the same out-of-the-box functionality you're talking about on all architectures. It all depends on whether you consider compilation to be part of a normal installation; having spent time with portage/ports/slackbuilds etc. gives you a very different perspective. And because this is your OS, it's usually the case that you have the source code for the program you're building. A kernel is usually not optimized anyway, and if it is, it's only in a few specific functions — a problem solved by the other proposed methods, which all work much more easily in kernelspace than userspace.

If you don't have the source, I also have a solution. I'm developing a very simple bytecode that can be the output of any normal compiler (i.e. it's not managed), but is then able to be statically recompiled to optimized machine code. It's essentially a reified version of the connection between a compiler front end and back end. But the point is that you can get the benefits of optimized code without having to parse C while compiling (so it's faster), and also can satisfy the needs of closed-source programs that need an opaque binary format for distribution.
JohnnyTheDon
Member
Posts: 524
Joined: Sun Nov 09, 2008 2:55 am
Location: Pennsylvania, USA

Re: Make sure of new instructions, support old processors.

Post by JohnnyTheDon »

NickJohnson wrote:If you don't have the source, I also have a solution. I'm developing a very simple bytecode that can be the output of any normal compiler (i.e. it's not managed), but is then able to be statically recompiled to optimized machine code. It's essentially a reified version of the connection between a compiler front end and back end. But the point is that you can get the benefits of optimized code without having to parse C while compiling (so it's faster), and also can satisfy the needs of closed-source programs that need an opaque binary format for distribution.
Sounds much like LLVM, but still a great idea. One thing that I really wish LLVM could do is configuration options. However, I think (though I haven't looked at the code) that LLVM uses SSE optimization when possible for mathematical operations. I really can't imagine them not doing so (SSE is SO much faster and easier than x87).

Another option that I just thought of is jump gates. Windows uses these for its DLLs, so I imagine Intel and AMD have implemented some kind of optimization for this type of call. The jump gate could be rewritten when the program figures out what features the processor supports.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Make sure of new instructions, support old processors.

Post by Brendan »

Hi,

This may be a little more complex than it first seems.

First, different CPUs have different bugs which make CPUID unreliable. A CPU might report that it supports a certain feature when it doesn't (e.g. Pentium Pro says it supports SYSENTER when it doesn't), or a CPU might report that it doesn't support a feature when it actually does (e.g. lots of CPUs do support CMPXCHG8B but don't report it because of an old bug in Windows NT). Also, a CPU might correctly report that it supports a feature, but the feature might be so buggy that it's better to pretend the CPU doesn't support it. There are also a few features that are supported by some CPUs in long mode but not in protected mode, or supported in protected mode but not long mode, depending on who made the CPU (mostly SYSENTER and SYSCALL). Finally, you might be interested in features that aren't reported by CPUID at all. This could include certain debugging and performance monitoring options (where software can't tell if they're supported or not without looking at MSRs).

For all of these reasons, the first step is writing code to examine CPUID (and any other information) to build a standardized set of feature flags, so that all software can check your feature flags to determine if a feature is supported or not (so that software can avoid all the messy parts of feature detection). I'd actually build 2 sets of feature flags - one set for features that normal processes might care about, and another set for features that only the kernel should care about.
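A sketch of that first step in C — folding raw CPUID output (plus any quirk fixups) into one standardized flag word. The flag names and the `emulated` parameter are hypothetical, but the raw bit positions are the documented CPUID leaf-1 ones (CMOV: EDX bit 15, SSE2: EDX bit 26, SSE4.1: ECX bit 19):

```c
#include <stdint.h>

/* Hypothetical standardized feature bits -- one stable namespace for
   all software, decoupled from the raw CPUID layout. */
enum {
    FEAT_CMOV  = 1u << 0,
    FEAT_SSE2  = 1u << 1,
    FEAT_SSE41 = 1u << 2,
};

struct cpu_features {
    uint32_t supported;             /* really present (after quirk fixups) */
    uint32_t supported_or_emulated; /* present, or faked by the kernel */
};

/* raw_ecx/raw_edx would come from CPUID leaf 1; per-CPU quirk fixups
   (known-buggy or unreported features) would be applied on top. */
static struct cpu_features build_flags(uint32_t raw_ecx, uint32_t raw_edx,
                                       uint32_t emulated)
{
    struct cpu_features f = { 0, 0 };
    if (raw_edx & (1u << 15)) f.supported |= FEAT_CMOV;  /* EDX bit 15 */
    if (raw_edx & (1u << 26)) f.supported |= FEAT_SSE2;  /* EDX bit 26 */
    if (raw_ecx & (1u << 19)) f.supported |= FEAT_SSE41; /* ECX bit 19 */
    f.supported_or_emulated = f.supported | emulated;
    return f;
}
```

Software that cares about speed checks `supported`; software that just needs the instruction to work checks `supported_or_emulated`, exactly as described above.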

Next, a kernel can emulate instructions that aren't supported by the CPU, and it's fairly easy to emulate some things (CMOVcc, MOVBE, SYSCALL, SYSRET, etc). Some things (e.g. FPU, MMX, SSE) are much harder to emulate, but it's entirely possible with enough work. Because of this I'd have an extra set of flags (with identical meanings for each bit as the first set of flags) that includes things that are supported by the CPU and things that are emulated. That way if a piece of software requires a certain feature then it can check the "supported or emulated" flags to see if it can be executed or not, and these flags can also be used when performance doesn't matter (e.g. initialization code that would like to use CMOVcc but doesn't care if it's emulated).

However, instruction emulation can involve splitting a feature flag in half. For example, if a CPU doesn't support CMPXCHG8B then you can easily emulate the CMPXCHG8B instruction, but you won't be able to emulate "LOCK CMPXCHG8B" on multi-CPU machines (on single CPU machines you can probably get away with disabling IRQs while doing a plain CMPXCHG8B). In this case you might have one flag for "non-atomic CMPXCHG8B is emulated" and another flag for "atomic CMPXCHG8B is emulated" so that it's possible for software to use (emulated) CMPXCHG8B when that software doesn't need it to be atomic. Another example might be FPU instructions. I'd be tempted to split them in half too - one flag for simple operations and another flag for complex operations (sin, cos, sqrt, etc) so that a kernel can emulate the easy stuff without emulating the hard stuff if it wants to.

In some cases software might want to disable a CPU's feature and use your emulation instead. For example, some Pentium CPUs have problems with the FDIV instruction (where the result is wrong/inaccurate), and an accounting package might want to use your emulated FPU instead of the real FPU to avoid the problem.

Also note that in multi-CPU systems it's possible for different CPUs to support different features. At a minimum you could AND all the feature flags from each CPU together to find the set of features that are supported by all CPUs. A better idea might be to use CPU affinity to make sure that software isn't scheduled on CPUs that don't support features that the software is using.

Some CPU features require kernel support. This includes MMX and SSE (where the kernel must be able to save/restore MMX and/or SSE state, and must be able to handle any SSE exceptions) and certain debugging and performance monitoring features. You should be able to find relevant information for what is required in the software developer's manual.

In all executable files I'd have a similar set of flags (e.g. in the header) that say which features the executable requires, so that the loader (and other software) can determine if the executable requires features that aren't supported or emulated. That way if someone tries to run a program that can't be executed, the GUI can pop up a nice "This program requires features that aren't supported on this computer" dialog box; and you won't need to add this initial check (or the dialog box) into every executable.

That only leaves features that a piece of software doesn't require but can optionally use. In this case, during initialization I'd ask the OS for the "supported feature flags" (*not* the "supported or emulated feature flags"), and use them to set up function pointers during initialization and for conditional branches after initialization.

This can be extremely fast, or extremely slow, depending on where and how it's used, and how often it's used. For example, instead of using a function pointer or a conditional branch in the middle of a tight loop it'd be better to have 2 separate loops. Most CPUs remember branch targets (including indirect calls), so after the first time there's no penalty at all (unless there are too many branch targets for the CPU to remember and least recently used branch targets get overwritten).
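The "2 separate loops" point is just hoisting the feature check out of the loop body so it is taken once per call instead of once per element. A trivial C sketch (the fast branch is a plain-C stand-in for a vectorized loop):

```c
#include <stddef.h>

/* One check per call, then two straight-line loops -- rather than a
   conditional (or indirect call) inside the loop body. */
static void scale(float *v, size_t n, float k, int cpu_has_sse41)
{
    if (cpu_has_sse41) {
        /* Stand-in for a loop built on the newer instructions. */
        for (size_t i = 0; i < n; i++)
            v[i] *= k;
    } else {
        /* Plain fallback loop for older processors. */
        for (size_t i = 0; i < n; i++)
            v[i] *= k;
    }
}
```

The duplicated loop costs a little code size but keeps the hot path free of per-iteration branching on a value that never changes.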

IMHO using self modifying code (e.g. copying the best version of some code into a fixed location) is usually a bad idea - it means that your code needs to be writable and can be modified by bugs (e.g. uninitialized pointer) or malicious code (e.g. bad plugins or libraries). The overhead of modifying the code is often larger than the overhead of using function pointers and/or conditional branches (assuming function pointers and/or conditional branches are used sensibly). Worst case is if the OS supports memory mapped executable files; where unmodified pages are loaded from the file system if/when needed and can be freed at any time to save RAM. In this case modified pages would need to loaded from disk when they're modified, then sent to swap to reclaim the RAM, and then reloaded from swap if/when needed; which either means a lot more file I/O, or means that less RAM is left free for more important things.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Stachelsk
Posts: 5
Joined: Thu Jul 31, 2008 2:10 pm

Re: Make sure of new instructions, support old processors.

Post by Stachelsk »

Thanks for all the responses... I really appreciate them. This is going to be more of a brainstorming problem than I thought. I'll definitely be experimenting to see what works best for me... :)
JohnnyTheDon
Member
Posts: 524
Joined: Sun Nov 09, 2008 2:55 am
Location: Pennsylvania, USA

Re: Make sure of new instructions, support old processors.

Post by JohnnyTheDon »

Brendan wrote:IMHO using self modifying code (e.g. copying the best version of some code into a fixed location) is usually a bad idea - it means that your code needs to be writable and can be modified by bugs (e.g. uninitialized pointer) or malicious code (e.g. bad plugins or libraries).
If you write the function during initialization, and then switch the permissions to read-only and executable, you could avoid this problem.
Brendan wrote:Worst case is if the OS supports memory mapped executable files; where unmodified pages are loaded from the file system if/when needed and can be freed at any time to save RAM. In this case modified pages would need to loaded from disk when they're modified, then sent to swap to reclaim the RAM, and then reloaded from swap if/when needed; which either means a lot more file I/O, or means that less RAM is left free for more important things.
If the best function is chosen and written during initialization, then the page will never be modified again, and after being paged out it does not need to be written to disk in the future. It would be best to keep these functions on pages separate from normal data in this case, to prevent them from having to be written to disk because a global/static variable was changed.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Make sure of new instructions, support old processors.

Post by Brendan »

Hi,
JohnnyTheDon wrote:
IMHO using self modifying code (e.g. copying the best version of some code into a fixed location) is usually a bad idea - it means that your code needs to be writable and can be modified by bugs (e.g. uninitialized pointer) or malicious code (e.g. bad plugins or libraries).
If you write the function during initialization, and then switch the permissions to read-only and executable, you could avoid this problem.
That'd work, as long as a program can only reduce its own permissions - e.g. a process (or malicious code running in the context of that process) can't tell the kernel to make a read-only code page writable again. However, that'd also mean the kernel would need to make the code writable to begin with, and there'd always be the chance of a process forgetting to lock its code pages, and also a "window of opportunity" where malicious code could modify the process before it locks the code.
JohnnyTheDon wrote:
Worst case is if the OS supports memory mapped executable files; where unmodified pages are loaded from the file system if/when needed and can be freed at any time to save RAM. In this case modified pages would need to loaded from disk when they're modified, then sent to swap to reclaim the RAM, and then reloaded from swap if/when needed; which either means a lot more file I/O, or means that less RAM is left free for more important things.
If the best function is chosen and written during initialization, then the page will never be modified again and after being paged out does not need to be written to disk in the future. It would be best to keep these functions on pages separate from normal data in this case, to prevent them from having to be written to disk because a global/static variable was changed.
Yes.

For function pointers and/or conditional branches, code that isn't used would never be loaded at all, and code that is used could be freed and reloaded from the original file; while for self modifying code it'd be loaded, then modified, then (potentially) sent to swap once (where it could be kept in swap and reloaded from swap as many times as necessary). Worst case difference is 2 extra transfers to/from disk per page.

Of course this also depends on how you do dynamic linking (if you do it at all) - maybe the code is modified during linking anyway, and self modifying code wouldn't make any difference.

The other problem is finding tools that will support it. I'm not too sure how you'd convince something like GNU LD to link properly when several pieces of code are stored at different addresses in the file but need to be linked to run at the same virtual address...


Cheers,

Brendan
frank
Member
Posts: 729
Joined: Sat Dec 30, 2006 2:31 pm
Location: East Coast, USA

Re: Make sure of new instructions, support old processors.

Post by frank »

LD has something called overlays which allow different code to be linked to run at the same address:
http://os.cqu.edu.au/cgi-bin/info/info2 ... escription
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Make sure of new instructions, support old processors.

Post by Brendan »

Hi,
frank wrote:LD has something called overlays which allow different code to be linked to run at the same address:
http://os.cqu.edu.au/cgi-bin/info/info2html.cgi?(ld)Overlay Description
Wow - you're right. I've never seen anything like it before (although I guess I've never read everything in LD's manual either). :)


Cheers,

Brendan