making my program faster...
hi, I have been making an x86 emulator in C (not ++ ) and I'm attempting to make it faster(don't want a slow emulator like bochs)
so here are my few questions
1. does the -O3 optimization perform any speed optimizations or should I use -Os (in gcc)
2. which is faster, using an array of pointers to functions or using a switch and then calling the functions from the cases of the switch
3. does gcc use the mmx stuff to make it faster? like if I choose a -march= something with mmx will it make it faster?
4. what is a fairly safe limit for the total size of an exe, like I know a 40mb exe is really big and would have speed issues from paging
edit:
5. what is expensive optimizations?
edit2:
6. does the inline modifier on functions make them faster? or what exactly does it do?
Re: making my program faster...
hckr83 wrote: 1. does the -O3 optimization perform any speed optimizations or should I use -Os (in gcc)
-O3 makes it faster, -Os makes it smaller. You want the first.
hckr83 wrote: 2. which is faster, using an array of pointers to functions or using a switch and then calling the functions from the cases of the switch
The first, except when the first case of the switch handles nearly all hits. With a switch the instructions are a bit too spread out, so I'd go for the array of function pointers.
hckr83 wrote: 3. does gcc use the mmx stuff to make it faster? like if I choose a -march= something with mmx will it make it faster?
AFAIK not, but it might in very recent versions. You should probably use some explicit form (intrinsics or inline assembly) to be sure it's used. Your libraries might include those kinds of functions though, so you might inadvertently be using MMX anyway when calling library functions.
hckr83 wrote: 4. what is a fairly safe limit for the total size of an exe, like I know a 40mb exe is really big and would have speed issues from paging
There's no real limit to it; it won't even have much impact from paging. There are, however, a number of boundaries past which things get a bit slower again. There's each level of the processor cache: L1 cache (on the processor, separate for instructions and data) is around 8k each - AMD caches x86 instructions, Intel caches uops; L2 cache would be around 256k or so (but that's a pretty wild guess, since I'm not sure whether the P4s have the 2M as L2 or L3); and L3 cache can be up to a few megabytes. Then there's a limit with the TLB, which you might be able to avoid by telling the OS to use 2MB/4MB pages (which use far fewer TLB entries).
AFAIK, Windows pages out the executable more aggressively, preferring free memory over keeping the running exe in memory. In most cases that's just plain stupid, so if you could tell it not to swap it out, that'd be great. The EXE size isn't everything though: there can be loads of debug info (which isn't used when running), and if you allocate huge amounts of memory that'll slow things down more.
hckr83 wrote: 5. what is expensive optimizations?
Loop unrolling and inlining - stuff that takes your code and puts it in the output file a number of times to avoid jumps and calls. Those are the two most expensive optimizations, but there are more. There are also doubtful optimizations, which might make the code faster but then again might make it slower.
hckr83 wrote: 6. does the inline modifier on functions make them faster? or what exactly does it do?
That depends on how big the function is. It usually makes the call faster, since the entire contents of the function are copy-pasted into the calling function, so you don't get another stack frame and you don't get another call. On large, often-called functions this generates quite a lot of code, and recursive functions can't even be fully inlined (since you'd get an infinitely large executable). It also requires you to put the entire body of the function in your header, and possibly a flag to tell your compiler that it shouldn't emit a standalone copy or that it should ignore multiple definitions.
Re: making my program faster...
Hi,
hckr83 wrote: hi, I have been making an x86 emulator in C (not ++ ) and I'm attempting to make it faster(don't want a slow emulator like bochs)
For performance, Bochs is one of the fastest instruction interpreters I've ever seen, and there's probably nothing I can think of to make it faster that hasn't already been done. It's slow compared to dynamic translation, and even slower compared to virtualisers/hypervisors, but it's still one of the fastest emulators of its type.
Consider the minimum work an instruction interpreter needs to do to emulate a simple, fast instruction. For an example, consider the "stc" instruction, as it's one opcode and does almost nothing (set the carry flag - how hard can it be?).
First you need to fetch the opcode, which includes checking whether EIP reached the code segment limit, and then checking whether paging is enabled (and possibly checking whether EIP moved on to the next 4 KB page if paging is enabled). I'd guess these checks add up to 6 cycles (best case).
Then there's reading the opcode byte itself. For this I'd probably do something like "movzx eax, byte [ <foo> ]". The reason for making sure the highest bits are clear will be obvious soon. Anyway, add another 3 cycles for this.
Next, I'd do something like "jmp near [mainOpcodeTable + eax * 4]". You could use an indirect call (slightly worse) or a "switch/case" (much worse). Regardless of what you do, the CPU can't guess what's going on and can't prefetch anything, so you're looking at a full pipeline stall. The cost for this is around 20 cycles (but it depends on a lot of things).
Now you've reached the "STC" emulation code. There's no need for any checks, and you can do something like "or [CPU.eflags],CARRYFLAG". Very roughly, another 3 cycles.
Lastly, you'd need to update EIP (and possibly the time stamp counter if you're being accurate), then do either a "ret" or a "jmp mainloop" depending on how you got here. Regardless of how you look at it, I'd estimate at least 6 cycles.
Add all of this up and you get roughly 38 cycles to emulate the simplest instruction I can think of. Of course this is "best case" - there's a large number of things that can make it much worse (doing a page table walk, checking for a "no-execute" flag, updating an "accessed" page table flag, any form of cache miss on the host CPU, etc).
Now consider an instruction like "push dword [foo]" - how many checks are you going to need to do to find out whether an exception needs to be generated? Then there's emulating the hardware (working out when an IRQ is about to occur, updating the video, etc). This costs more overhead.
Then there's the host OS's overhead - task switches, memory management, GUI, etc.
For Bochs, the emulated computer is (roughly) 100 times slower than the host computer. I really don't think it's possible to get much improvement on that using the same techniques.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
I skipped over the "faster than bochs" line pretty quickly. Bochs is incredibly fast for a pure-software emulation. You are going to have one hell of a time beating that project, in no small part because of the great people working on it.
You can make a faster emulator, but you'll have to use a different approach to make a difference. Might I recommend recompiling code sequences at runtime to run on the native machine? It's still awfully hard to get right, but easier than beating bochs.
Brendan wrote: For Bochs, the emulated computer is (roughly) 100 times slower than the host computer. I really don't think it's possible to get much improvement on that using the same techniques.
hckr83 wrote: how is it done faster in VPC and VMWARE then?
They cheat: some combination of letting the processor run the code directly, recompiling or translating at runtime, or running it natively with a lot of catches that fix things up at runtime. VPC might cheat even more when you're running Windows in Windows, since it can just pass the calls on.
The point of bochs is that you can run WinXP on a Sparc if you like. It doesn't rely on the host and target having ANYTHING in common, so it doesn't assume anything. You can specify at compile time that you do want it to take some host-specific shortcuts, but that's discouraged as it's against the spirit. I'm still amazed that it manages 10k cycles/sec here with full tracing on.
Hi,
hckr83 wrote: how is it done faster in VPC and VMWARE then?
I'm not sure exactly how these (commercial) products work. There are only about 5 common techniques though.
The fastest method is "para-virtualisation", where the guest OS is modified to suit the environment, but runs directly on the CPU. The benefit here is pure speed. The disadvantages are that the host CPU must be the same as the guest CPU and the guest OS needs to be modified. Xen is an example of this.
The next fastest is hardware virtualisation, where the CPU has special support for running guest OSs. For 80x86 this includes Intel's "Vanderpool" and AMD's "Pacifica" extensions. I'd also put virtual8086 mode in this category (i.e. emulating an 8086 CPU in protected mode using hardware support), although in that case it's not a perfect emulation. The benefit here is pure speed (although it's slower than para-virtualisation because a lot of "traps" are used). The disadvantages are that the host CPU must be the same as the guest CPU, and it only works if the host CPU supports it. Xen is an example of this too (support for both Intel's and AMD's extensions was added in Xen version 3.0).
The next fastest method is to scan the instruction stream and copy most instructions "as is" to a cache of translated instructions. Some (privileged) instructions aren't copied "as is", but are replaced with special code sequences. The end result is that the cache of translated instructions can be run on the CPU directly. The benefit is that the CPU doesn't need special support. The downside is that the host CPU must be the same as the guest CPU.
The next method is dynamic translation, where all instructions are parsed and converted into instructions suitable for the host CPU to run. This is similar to the last option, but allows the host CPU to be completely different from the guest CPU. Qemu is an example of this.
Lastly, there's "instruction interpreting", which is what Bochs does....
Cheers,
Brendan
Also, bochs doesn't just emulate your CPU. It emulates the video card, timers, interrupt handling, APICs, ACPI stuff, buttons, mouse, keyboard, network card, sound card etc. as well. You get a full emulated computer that is in no way related to your current computer. That's how I've been developing AMD64 stuff for a while now, although for a few months I've also had a physical computer that can do that.
ahh.. well it's not just the speed I don't like about bochs but also things like it not being straightforward for most users (with all the config file stuff)
one thing I like about my system is how it will be much more modular than bochs, as it will use external libraries for devices (I can't say DLL because it's going to be cross-OS and cross-platform)
but anyway...
how would the dynamic compiling work!? what about self-modifying code?
Hi,
Candy wrote: Also, bochs doesn't just emulate your CPU. It emulates the video card, timers, interrupt handling, APICs, ACPI stuff, buttons, mouse, keyboard, network card, sound card etc. as well. You get a full emulated computer that is in no way related to your current computer.
For all emulators (using any method), each piece of hardware could be either a real piece of hardware or an emulated piece of hardware, except for the chipset. What is actually supported is a different matter. AFAIK Qemu allows some real PCI devices to be assigned to the guest OS if you run Linux, and Xen allows some devices to be emulated in software.
Now what I'd like to see is a "mega-beast" emulator that uses interchangeable modules to support hardware virtualisation, dynamic translation and instruction interpreting, and that also has support for emulating common virtual hardware and using real hardware devices (where possible).
Cheers,
Brendan
Brendan wrote: Now what I'd like to see is a "mega-beast" emulator that uses interchangeable modules to support hardware virtualisation, dynamic translation and instruction interpreting, and that also has support for emulating common virtual hardware and using real hardware devices (where possible).
Isn't "hardware virtualisation" the same thing as "emulating common virtual hardware", or did I misunderstand something?
Hi,
hckr83 wrote: how would the dynamic compiling work!? what about self-modifying code?
The general idea is to parse the instructions (just like an interpreter), but instead of executing each instruction you add translated code to the current "block of translated code".
Each block of translated code usually ends when a jump, branch, call, ret or software interrupt occurs, so that each block is sequential code. Each block is "tagged" with certain properties - what CR3 and CR0 should be, the CS, the privilege level, the original address of the block, etc.
When the translated code is being run, the emulator searches for an existing block, compares the current CPU state with the block's "tag", and executes that block if everything matches. If a matching block isn't found it creates a new block.
Each block of translated code may have other information associated with it, like what sort of instruction caused the block to end, the address of the next block in the chain (if known), how many "virtual cycles" the block adds up to, etc.
For example, something like this:
Code: Select all
mov esi,foo
mov ecx,100
xor eax,eax
.here:
add eax,[esi]
loop .here
mov [edi],eax
ret
Could be translated into 3 blocks, like this:
Code: Select all
BLOCK0:
mov esi,foo
mov ecx,100
xor eax,eax
add eax,[esi]
ENDBLOCK, condition "LOOP"
NEXTBLOCKA = BLOCK1 (used if ECX = 1)
NEXTBLOCKB = BLOCK2 (used if ECX = 0)
CYCLES = 12
Code: Select all
BLOCK1:
add eax,[esi]
ENDBLOCK, condition "LOOP"
NEXTBLOCKA = BLOCK1 (used if ECX = 1)
NEXTBLOCKB = BLOCK2 (used if ECX = 0)
CYCLES = 3
Code: Select all
BLOCK2:
mov [edi],eax
ENDBLOCK, condition "RET"
NEXTBLOCKA = NONE
NEXTBLOCKB = NONE
CYCLES = 3
Of course this is a very basic example. Instructions could be optimised, and the dynamic translator can do part of its work in the translated code. For example, that last block might end up more like this:
Code: Select all
BLOCK2:
call MY_EMULATOR_check_dword_write_at_EDI
mov [edi],eax
pushfd
add dword [MY_EMULATOR_time_stamp_counter],3
adc dword [MY_EMULATOR_time_stamp_counter+4],0
jmp MY_EMULATOR_end_block_via_RET_stub
Some things invalidate existing blocks, including writes to the same area of memory, changes in page tables/directories, chipset changes (e.g. the A20 gate), etc. - which is how self-modifying code is handled.
Emulated IRQs have to occur at "translated block boundaries", and things like single stepping are hard to implement, because the real CPU executes entire blocks at once.
Cheers,
Brendan
Hi,
hckr83 wrote: Isn't "hardware virtualisation" the same thing as "emulating common virtual hardware", or did I misunderstand something?
Hardware virtualisation = special stuff built into the CPU to make it possible to run a guest OS's code directly on the CPU and trap into a "hyper-visor" when something needs to be emulated (e.g. the "virtualization" extensions Intel and AMD recently added to their CPUs). Emulating common virtual hardware is about the devices (video card, network card, etc.) rather than the CPU - so no, they're different things.
Cheers,
Brendan