making my program faster...
hi, I have been making an x86 emulator in C (not ++ ) and I'm attempting to make it faster(don't want a slow emulator like bochs)
so here are my few questions
1. does the -O3 optimization perform any speed optimizations or should I use -Os (in gcc)
2. which is faster, using an array of pointers to functions or using a switch and then calling the functions from the cases of the switch
3. does gcc use the mmx stuff to make it faster? like if I choose a -march= something with mmx will it make it faster?
4. what is a fairly safe limit for the total size of an exe, like I know a 40mb exe is really big and would have speed issues from paging
edit:
5. what is expensive optimizations?
edit2:
6. does the inline modifier on functions make them faster? or what exactly does it do?
Re: making my program faster...
hckr83 wrote: 1. does the -O3 optimization perform any speed optimizations or should I use -Os (in gcc)
-O3 makes it faster, -Os makes it smaller. You want the first.
hckr83 wrote: 2. which is faster, using an array of pointers to functions or using a switch and then calling the functions from the cases of the switch
The first, except when the first case of the switch handles nearly all hits. With a switch the instructions are a bit too spread out, so I'd go for the array of function pointers.
hckr83 wrote: 3. does gcc use the mmx stuff to make it faster? like if I choose a -march= something with mmx will it make it faster?
AFAIK not, but it might in very recent versions. You should probably use some explicit form (intrinsics or inline assembly) to be sure it's used. Your libraries might include those kinds of functions though, so you might inadvertently be using MMX anyway when calling library functions.
hckr83 wrote: 4. what is a fairly safe limit for the total size of an exe, like I know a 40mb exe is really big and would have speed issues from paging
There's no real limit to it; it won't even have much impact from paging. There are, however, a number of boundaries past which things get a bit slower again. There's each level of the processor cache: L1 cache (on the processor, separate for instructions and data) is around 8k each - AMD caches x86 instructions, Intel caches uops; L2 cache would be around 256k or so (but that's a pretty wild guess, since I'm not sure whether the P4s have the 2M as L2 or L3); and L3 cache can be up to a few megabytes. Then there's a limit with the TLB, which you might be able to avoid by telling the OS to use 2MB/4MB pages (which use far fewer TLB entries).
AFAIK, Windows pages out the executable more aggressively, preferring free memory over keeping the running exe in memory. In most cases that's just plain stupid, so if you could tell it not to swap it out, that'd be great. The EXE size isn't everything though: there can be loads of debug info (which isn't used when running), and if you allocate huge amounts of memory that'll slow things down more.
hckr83 wrote: 5. what is expensive optimizations?
Loop unrolling and inlining - stuff that takes your code and puts it in the output file a number of times to avoid jumps and calls. Those are the two most expensive optimizations, but there are more. There are also doubtful optimizations, which might make the code faster but then again might make it slower.
hckr83 wrote: 6. does the inline modifier on functions make them faster? or what exactly does it do?
That depends on how big the function is. It usually makes the call faster, since the entire contents of the function are copy-pasted into the calling function, so you don't get another stack frame and you don't get another call. On large, often-called functions this generates quite a lot of code, and recursive functions can't even be fully inlined (since you'd get an infinitely large executable). It also requires you to put the entire body of the function in your header, and possibly a flag to tell your compiler that it shouldn't emit a standalone copy or that it should ignore multiple definitions.
Re: making my program faster...
Hi,
hckr83 wrote: hi, I have been making an x86 emulator in C (not ++ ) and I'm attempting to make it faster(don't want a slow emulator like bochs)
For performance, Bochs is one of the fastest instruction interpreters I've ever seen, and there's probably nothing I can think of to make it faster that hasn't already been done. It's slow compared to dynamic translation, and even slower compared to virtualisers/hypervisors, but it's still one of the fastest emulators of its type.
Consider the minimum work an instruction interpreter needs to do to emulate a simple, fast instruction. For an example, consider the "stc" instruction, as it's one opcode and does almost nothing (set the carry flag - how hard can it be?).
First you need to fetch the opcode, which includes checking whether EIP reached the code segment limit, and then checking whether paging is enabled (and possibly checking whether EIP moved on to the next 4 KB page if paging is enabled). I'd guess these checks add up to 6 cycles (best case).
Then there's reading the opcode byte itself. For this I'd probably do something like "movzx eax, byte [ <foo> ]". The reason for making sure the highest bits are clear will be obvious soon. Anyway, add another 3 cycles for this.
Next, I'd do something like "jmp near [mainOpcodeTable + eax * 4]". You could use an indirect call (slightly worse) or a "switch/case" (much worse). Regardless of what you do, the CPU can't guess what's going on and can't prefetch anything, so you're looking at a full pipeline stall. The cost for this is around 20 cycles (but it depends on a lot of things).
Now you've reached the "STC" emulation code. There's no need for any checks, and you can do something like "or [CPU.eflags],CARRYFLAG". Very roughly, another 3 cycles.
Lastly, you'd need to update EIP (and possibly the time stamp counter if you're being accurate), then do either a "ret" or a "jmp mainloop" depending on how you got here. Regardless of how you look at it, I'd estimate at least 6 cycles.
Add all of this up and you get roughly 38 cycles to emulate the simplest instruction I can think of. Of course this is "best case" - there's a large number of things that can make it much worse (doing a page table walk, checking for a "no-execute" flag, updating an "accessed" page table flag, any form of cache miss on the host CPU, etc).
Now consider an instruction like "push dword [foo]" - how many checks are you going to need to do to find out whether an exception needs to be generated? Then there's emulating the hardware (working out when an IRQ is about to occur, updating the video, etc). This costs more overhead.
Then there's the host OS's overhead - task switches, memory management, GUI, etc.
For Bochs, the emulated computer is (roughly) 100 times slower than the host computer. I really don't think it's possible to get much improvement on that using the same techniques.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
I skipped over the "faster than bochs" line pretty quickly. Bochs is incredibly fast for a pure-software emulation. You are going to have one hell of a time beating that project, in no small part because of the great people working on it.
You can make a faster emulator, but you'll have to use a different approach to make a difference. Might I recommend recompiling code sequences at runtime to run on the native machine? It's still awfully hard to get right, but easier than beating bochs.
Brendan wrote: For Bochs, the emulated computer is (roughly) 100 times slower than the host computer. I really don't think it's possible to get much improvement on that using the same techniques.
hckr83 wrote: how is it done faster in VPC and VMWARE then?
They cheat: some combination of letting the processor run the code directly, recompiling or translating at runtime, or running it natively with a lot of catches that fix things up at runtime. VPC might cheat even more when you're running Windows in Windows, since it can just pass the calls on.
The point of bochs is that you can run WinXP on a Sparc if you like. It doesn't rely on the host and target having ANYTHING in common, so it doesn't assume anything. You can specify at compile time that you do want it to take some host-specific shortcuts, but that's discouraged as it's against the spirit. I'm still amazed that it manages 10k cycles/sec here with full tracing on.
Hi,
hckr83 wrote: how is it done faster in VPC and VMWARE then?
I'm not sure exactly how these (commercial) products work. There are only about 5 common techniques though.
The fastest method is "para-virtualisation", where the guest OS is modified to suit the environment, but runs directly on the CPU. The benefit here is pure speed. The disadvantages are that the host CPU must be the same as the guest CPU and the guest OS needs to be modified. Xen is an example of this.
The next fastest is hardware virtualisation, where the CPU has special support for running guest OSs. For 80x86 this includes Intel's "Vanderpool" and AMD's "Pacifica" extensions. I'd also put virtual8086 mode in this category (i.e. emulating an 8086 CPU in protected mode using hardware support), although in that case it's not a perfect emulation. The benefit here is pure speed (although it's slower than para-virtualisation because a lot of "traps" are used). The disadvantages are that the host CPU must be the same as the guest CPU, and it only works if the host CPU supports it. Xen is an example of this too (support for both Intel's and AMD's extensions was added in Xen version 3.0).
The next fastest method is to scan the instruction stream and copy most instructions "as is" to a cache of translated instructions. Some (privileged) instructions aren't copied "as is", but are replaced with special code sequences. The end result is that the cache of translated instructions can be run on the CPU directly. The benefit is that the CPU doesn't need special support. The downside is that the host CPU must be the same as the guest CPU.
The next method is dynamic translation, where all instructions are parsed and converted into instructions suitable for the host CPU to run. This is similar to the last option, but allows the host CPU to be completely different from the guest CPU. Qemu is an example of this.
Lastly, there's "instruction interpreting", which is what Bochs does....
Cheers,
Brendan
Also, bochs doesn't just emulate your CPU. It emulates the video card, timers, interrupt handling, APICs, ACPI stuff, buttons, mouse, keyboard, network card, sound card etc. as well. You get a full emulated computer that is in no way related to your current computer. That's how I've been developing AMD64 stuff for a while now, although for a few months I've also had a physical computer that can do that.
ahh.. well it's not just the speed I don't like about bochs but also things like it not being straightforward for most users (with all the config file stuff)
one thing I like about my system is how it will be much more modular than bochs, as it will use external libraries for devices (I can't say DLL because it's going to be cross-OS and cross-platform)
but anyway...
how would the dynamic compiling work!? what about self-modifying code?
Hi,
Candy wrote: Also, bochs doesn't just emulate your CPU. It emulates the video card, timers, interrupt handling, APICs, ACPI stuff, buttons, mouse, keyboard, network card, sound card etc. as well. You get a full emulated computer that is in no way related to your current computer.
For all emulators (using any method), each piece of hardware could be either a real piece of hardware or an emulated piece of hardware, except for the chipset. What is actually supported is a different matter. AFAIK Qemu allows some real PCI devices to be assigned to the guest OS if you run Linux, and Xen allows some devices to be emulated in software.
Now what I'd like to see is a "mega-beast" emulator that uses interchangeable modules to support hardware virtualisation, dynamic translation and instruction interpreting, and that also has support for emulating common virtual hardware and using real hardware devices (where possible).
Cheers,
Brendan
Brendan wrote: Now what I'd like to see is a "mega-beast" emulator that uses interchangeable modules to support hardware virtualisation, dynamic translation and instruction interpreting, and that also has support for emulating common virtual hardware and using real hardware devices (where possible).
Isn't "hardware virtualisation" the same thing as "emulating common virtual hardware", or did I misunderstand something?
Hi,
hckr83 wrote: how would the dynamic compiling work!? what about self-modifying code?
The general idea is to parse the instructions (just like an interpreter), but instead of executing each instruction you add translated code to the current "block of translated code".
Each block of translated code usually ends when a jump, branch, call, ret or software interrupt occurs, so that each block is sequential code. Each block is "tagged" with certain properties - what CR3 and CR0 should be, the CS, the privilege level, the original address of the block, etc.
When the translated code is being run, the emulator searches for an existing block, compares the current CPU state with the block's "tag", and executes that block if everything matches. If a matching block isn't found it creates a new block.
Each block of translated code may have other information associated with it, like what sort of instruction caused the block to end, the address of the next block in the chain (if known), how many "virtual cycles" the block adds up to, etc.
For example, something like this:
Code: Select all
mov esi,foo
mov ecx,100
xor eax,eax
.here:
add eax,[esi]
loop .here
mov [edi],eax
ret
Could be translated into 3 blocks, like this:
Code: Select all
BLOCK0:
mov esi,foo
mov ecx,100
xor eax,eax
add eax,[esi]
ENDBLOCK, condition "LOOP"
NEXTBLOCKA = BLOCK1 (used if ECX = 1)
NEXTBLOCKB = BLOCK2 (used if ECX = 0)
CYCLES = 12
Code: Select all
BLOCK1:
add eax,[esi]
ENDBLOCK, condition "LOOP"
NEXTBLOCKA = BLOCK1 (used if ECX = 1)
NEXTBLOCKB = BLOCK2 (used if ECX = 0)
CYCLES = 3
Code: Select all
BLOCK2:
mov [edi],eax
ENDBLOCK, condition "RET"
NEXTBLOCKA = NONE
NEXTBLOCKB = NONE
CYCLES = 3
Of course this is a very basic example. Instructions could be optimised, and the dynamic translator can do part of its work in the translated code. For example, that last block might end up more like this:
Code: Select all
BLOCK2:
call MY_EMULATOR_check_dword_write_at_EDI
mov [edi],eax
pushfd
add dword [MY_EMULATOR_time_stamp_counter],3
adc dword [MY_EMULATOR_time_stamp_counter+4],0
jmp MY_EMULATOR_end_block_via_RET_stub
Some things invalidate existing blocks, including writes to the same area of memory, changes in page tables/directories, chipset changes (e.g. the A20 gate), etc. - which is how self-modifying code is handled.
Emulated IRQs have to occur at "translated block boundaries", and things like single stepping are hard to implement, because the real CPU executes entire blocks at once.
Cheers,
Brendan
Hi,
hckr83 wrote: Isn't "hardware virtualisation" the same thing as "emulating common virtual hardware", or did I misunderstand something?
Hardware virtualisation = special stuff built into the CPU to make it possible to run a guest OS's code directly on the CPU and trap into a "hyper-visor" when something needs to be emulated (e.g. the "virtualization" extensions Intel and AMD recently added to their CPUs). Emulating common virtual hardware is about the devices (video card, network card, etc.) rather than the CPU - so no, they're different things.
Cheers,
Brendan