developers for emulator

stlw · Post by **stlw** » Tue Jan 13, 2009 12:32 pm

JohnnyTheDon wrote: Try writing a VT-x (Intel) or AMD-V (AMD obviously) emulator. The processor does all the actual execution of code, you just have to point it in the right direction. It can also give you experience for developing an exokernel OS.

VT emulation in Bochs is in progress now. I am already passing a few nice tests which do a VM entry, a VM exit on various events, and a VMRESUME back.
It is not yet at the stage where you could test it, but I hope to reach that stage in 2-3 weeks or so.
Is anybody interested in being a tester?

Stanislav

stlw · Post by **stlw** » Tue Jan 13, 2009 12:51 pm

bewing wrote:
berkus wrote: bewing, why not slowly rewrite bochs to better shape? you can do it step by step without major breakage and in a controllable manner, can't you?
If I were to take all the bochs source code, rename it "rebochs" (with a shoe for an icon), and begin a rewrite -- some of it could be done in that way, yes. The configure script, the param_tree, the main cpu_loop that runs the sim, the breakpoints, the "BX_Events", and making the "devices" threadsafe.

The memory code is already good, and so is the disassembler (and the gui debugger, of course).

However, the main thing that needs the most help is that the model of the CPU is all stored in C structures. Structures (and C++ objects) are SLOW. I know that programmers hate to hear that, but it is true. Arrays are MUCH FASTER. So, the basis of the rewrite would be to recode the entire CPU model using arrays instead of structures. Sadly, this would completely and utterly break every bit of the code in the entire program. And that is such a major design change that it would need to be done FIRST. Which means that the program WOULD be broken for the first 3 months.

But, overall, the answer to your question is:
Yes, I could probably get a really good rewrite started on the thing. I have spent 4 months on the user interface already. If I were to spend another 6 on bochs itself, I could probably get it to the point where it was significantly simplified, and significantly faster -- with only easy incremental changes required to further improve it.

And I am tempted to do it, too. But I am also tempted to get back to working on my assembler, and my OS. And I am also tempted to spend some time trying to make some money, too.

bewing, I hope this is not just a FLAME ON mode

Bochs is short of developers; currently we are just two, sometimes three, active. We have our own vision of how things should look, and we need a push from outside to change it.

It is the same way I pushed Bochs 2-3 years ago by rewriting the FPU code.
Or the same way Darek Mihocka pushed me last year, so that we rewrote most of the Bochs CPU code and made it more than 3x faster than older versions.
> The configure script, the param_tree, the main cpu_loop that runs the sim, the breakpoints, the "BX_Events", and making the "devices" threadsafe.

-> Looks fine. You chose a list of truly independent modules which could be rewritten separately without touching everything else!

I, personally, have nothing against the configure script, but if you could suggest any alternatives ...
The param tree I hate too, but I am not sure I can make it a lot better; mine would be ... just another version of the same param tree. I hope you can do better.
Main CPU loop - I will accept any ideas!
The breakpoints - it is a matter of a few days of work, isn't it? You are welcome to do it, or explain your ideas to me and it will be in CVS soon.

Making devices threadsafe - this is my top priority. I dream of doing this!
But my point is: do not start anything before you know how you will finish it. Currently I don't ...
So there are two ways for you: do it yourself or push me.

> The memory code is already good, and so is the disassembler (and the gui debugger, of course).

The memory code will be changed very soon. I want to be able to emulate more physical RAM than the host actually has.
So I want to allocate memory in blocks of 1M or so, and only after a block has first been touched. This way I could get a 40-bit physical address space on my WinXP host.
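
A minimal sketch of the on-demand allocation described above, assuming a flat table of 1 MiB block pointers that stay NULL until the guest first touches them. The names `guest_ram`, `guest_ram_init`, `guest_ram_ptr` and `guest_ram_free` are hypothetical and are not taken from the Bochs source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SHIFT 20u                          /* 1 MiB blocks, as in the post */
#define BLOCK_SIZE  ((size_t)1 << BLOCK_SHIFT)

/* Hypothetical guest-RAM model: one pointer per block, NULL until touched. */
typedef struct {
    uint8_t **blocks;      /* one slot per 1 MiB of guest physical space */
    size_t    num_blocks;
    size_t    allocated;   /* number of blocks that have host memory behind them */
} guest_ram;

/* Reserve only the pointer table; no guest memory is allocated yet. */
int guest_ram_init(guest_ram *ram, uint64_t guest_bytes)
{
    ram->num_blocks = (size_t)((guest_bytes + BLOCK_SIZE - 1) >> BLOCK_SHIFT);
    ram->allocated  = 0;
    ram->blocks     = calloc(ram->num_blocks, sizeof *ram->blocks);
    return ram->blocks != NULL ? 0 : -1;
}

/* Translate a guest physical address to a host pointer, allocating the
 * containing block (zero-filled) on first touch. Returns NULL if the host
 * is out of memory. */
uint8_t *guest_ram_ptr(guest_ram *ram, uint64_t paddr)
{
    size_t idx = (size_t)(paddr >> BLOCK_SHIFT);

    assert(idx < ram->num_blocks);
    if (ram->blocks[idx] == NULL) {
        ram->blocks[idx] = calloc(1, BLOCK_SIZE);
        if (ram->blocks[idx] == NULL)
            return NULL;
        ram->allocated++;
    }
    return ram->blocks[idx] + (size_t)(paddr & (BLOCK_SIZE - 1));
}

/* Release every touched block and the pointer table. */
void guest_ram_free(guest_ram *ram)
{
    size_t i;

    for (i = 0; i < ram->num_blocks; i++)
        free(ram->blocks[i]);
    free(ram->blocks);
    ram->blocks = NULL;
    ram->num_blocks = ram->allocated = 0;
}
```

With this layout a 40-bit guest address space needs about one million table slots, and host memory is consumed only for the blocks the guest actually uses. An access that straddles a block boundary would have to be split by the caller, which the sketch does not do.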

BTW, thanks for the praise of the disassembler; it was my rewrite.

About the rest - looks like you have no expertise in CPU emulation.
But you have enough expertise in everything else so I'd like to accept your help !

Stanislav

stlw · Post by **stlw** » Tue Jan 13, 2009 12:55 pm

Creature wrote: Why not organize some sort of community project where there is one main developer (or a few main developers) and others can contribute code? I know community projects usually aren't a good idea, but if it works you'd have a nice emulator. You could use parts of the Bochs/QEMU source code (after they've been rewritten to be more efficient).

I would pay money to the man (or woman) who finds a way to share device code automatically between Bochs and QEMU.

My personal dream is a UNIFIED device model shared by all emulators across the world.
I once suggested this to the QEMU team, but they did not want to change the current situation ...

Stanislav

Love4Boobies · Post by **Love4Boobies** » Tue Jan 13, 2009 1:33 pm

Stanislav, you were saying (in a different thread) that a new version of Bochs was scheduled for this month a while ago. I'm not sure how development is going but is that still to happen now or will it be delayed?

Ready4Dis · Post by **Ready4Dis** » Tue Jan 13, 2009 2:37 pm

Well, I recently started work on an emulator, but it isn't meant to emulate an entire computer, only real mode inside my OS, so that I don't have to play around with ugly hacks to use int 10h (even though my pmode -> rmode and back code is already complete and working). I plan on running 64-bit in the future, and I am writing a real-mode emulator so I can use a generic int 10h VBE driver straight from 64-bit code without going back to 32-bit and using v86 or, even worse, dropping all the way back to 16-bit real mode. This will make my video driver thread safe and preemptively interruptible, and it can be located at any memory address, so there are no ugly sub-1mb hacks going on in my memory manager.

So far it decodes about 80% of the one-byte opcodes and only a handful of two-byte opcodes (no vector operations or SSE/3DNow! instructions are supported). It runs the program and outputs the opcodes to the screen while running (which can be turned on/off for obvious reasons). Once I write my video driver, it will be given access to the ACTUAL ports of the hardware, so inp and outp will work without issue during the 16-bit emulation. Also, I plan on implementing the clock interrupt to run at 18.2hz, just like the default, so anything relying on timing will still operate correctly.

All opcodes are dispatched through a function call table, the CPU registers are arrays of dwords, and it supports 16- and 32-bit operands and addressing (although more testing will be required to see how much address space it requires in real mode!). I have not benchmarked it: since it is only going to be used to switch video modes and will not affect other processes being run, speed isn't my primary concern; size and operability (stability) are.

It is all written in ANSI C (since my OS is written in ANSI C), so it should be easily portable to other platforms (for my use this is unnecessary, but if I ever find another reason to have an emulator it would be nice to be able to run it on other hardware). Absolutely no inline assembly of any kind is used, because I want to keep it portable. It allocates 1mb (the real-mode limit) on startup, loads a specified file and begins processing at the CS:EIP that you tell it to. I can currently set a breakpoint at a specific address or use step mode (that's how I find missing opcodes: I fill them in as I find them while running actual BIOS code).
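
The design described in this post (a function call table indexed by opcode, registers held in an array of dwords, 1 MB of emulated memory) can be sketched roughly as follows. This is only an illustration under assumptions: the handful of opcodes, the names `cpu`, `op_table`, `cpu_load` and `cpu_run` are invented here, flags are not modelled, unknown opcodes simply halt, and C99 `<stdint.h>` is used for brevity even though the post mentions ANSI C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Register indices follow the x86 encoding order. */
enum { REG_AX, REG_CX, REG_DX, REG_BX, REG_SP, REG_BP, REG_SI, REG_DI, REG_COUNT };

typedef struct cpu {
    uint32_t regs[REG_COUNT];   /* general registers as an array of dwords */
    uint16_t cs;
    uint32_t eip;
    int      halted;
    uint8_t  mem[1u << 20];     /* 1 MB real-mode address space */
} cpu;

typedef void (*op_handler)(cpu *c, uint8_t opcode);

static op_handler op_table[256];   /* one handler per first opcode byte */

/* Fetch the byte at CS:IP and advance IP. */
static uint8_t fetch8(cpu *c)
{
    uint32_t linear = (((uint32_t)c->cs << 4) + (c->eip & 0xFFFFu)) & 0xFFFFFu;
    c->eip++;
    return c->mem[linear];
}

/* A real emulator would raise #UD; this sketch just stops. */
static void op_unknown(cpu *c, uint8_t opcode) { (void)opcode; c->halted = 1; }
static void op_nop(cpu *c, uint8_t opcode)     { (void)c; (void)opcode; }
static void op_hlt(cpu *c, uint8_t opcode)     { (void)opcode; c->halted = 1; }

/* 0x40..0x47: INC r16 (flags are not modelled here). */
static void op_inc_r16(cpu *c, uint8_t opcode)
{
    uint32_t *r = &c->regs[opcode & 7];
    *r = (*r & 0xFFFF0000u) | ((*r + 1) & 0xFFFFu);
}

/* 0xB8..0xBF: MOV r16, imm16. */
static void op_mov_r16_imm16(cpu *c, uint8_t opcode)
{
    uint32_t lo = fetch8(c);
    uint32_t hi = fetch8(c);
    uint32_t *r = &c->regs[opcode & 7];
    *r = (*r & 0xFFFF0000u) | lo | (hi << 8);
}

/* Fill the dispatch table, reset the CPU and copy an image to cs:ip. */
void cpu_load(cpu *c, uint16_t cs, uint16_t ip, const uint8_t *image, size_t len)
{
    int i;

    for (i = 0; i < 256; i++)
        op_table[i] = op_unknown;
    op_table[0x90] = op_nop;
    op_table[0xF4] = op_hlt;
    for (i = 0; i < 8; i++) {
        op_table[0x40 + i] = op_inc_r16;
        op_table[0xB8 + i] = op_mov_r16_imm16;
    }
    memset(c, 0, sizeof *c);
    c->cs  = cs;
    c->eip = ip;
    assert(((size_t)cs << 4) + ip + len <= sizeof c->mem);
    memcpy(&c->mem[((uint32_t)cs << 4) + ip], image, len);
}

/* Run until HLT, an unknown opcode, or max_steps instructions. */
unsigned cpu_run(cpu *c, unsigned max_steps)
{
    unsigned steps = 0;

    while (!c->halted && steps < max_steps) {
        uint8_t opcode = fetch8(c);
        op_table[opcode](c, opcode);
        steps++;
    }
    return steps;
}
```

Because the register number is encoded in the low three bits of these opcodes, one handler can serve eight table slots by indexing `regs[opcode & 7]`, which is the kind of saving that array-based registers make easy.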

stlw · Post by **stlw** » Tue Jan 13, 2009 4:43 pm

Love4Boobies wrote: Stanislav, you were saying (in a different thread) that a new version of Bochs was scheduled for this month a while ago. I'm not sure how development is going but is that still to happen now or will it be delayed?

Actually, we missed the release window for MacWorldExpo2009, and now there is no point in releasing anything before the Gtk+ GUI debugger frontend is in CVS.
I very much hope it will happen soon.

Stanislav

jal · Post by **jal** » Thu Jan 15, 2009 6:11 am

Let's fork DOSbox and make it into a real PC emulator :). Seriously though, there's enough emulators already, and the internal structure of things doesn't matter to me, as long as I don't have to change code myself.

JAL

Brendan · Post by **Brendan** » Thu Jan 15, 2009 8:45 am

Hi,

bewing wrote: However, the main thing that needs the most help is that the model of the CPU is all stored in C structures. Structures (and C++ objects) are SLOW. I know that programmers hate to hear that, but it is true. Arrays are MUCH FASTER. So, the basis of the rewrite would be to recode the entire CPU model using arrays instead of structures.

AFAIK the fastest way to decode instructions is with jump tables. For example:

decodeNextInstruction:
    mov [CPU.currentInstruction],esi
    movzx eax, byte [esi]                 ;eax = next byte of opcode
    inc esi
    jmp [firstOpcodeByteTable + eax * 4]

handleFSprefix:
    mov [CPU.defaultSegReg],FSseg
    movzx eax, byte [esi]                 ;eax = next byte of opcode
    inc esi
    jmp [firstOpcodeByteTable + eax * 4]

handleOpSizePrefix:
    xor [CPU.decodeFlags],opcodeSize
    movzx eax, byte [esi]                 ;eax = next byte of opcode
    inc esi
    jmp [firstOpcodeByteTable + eax * 4]

handleFPUopcode:
    movzx eax, byte [esi]                 ;eax = next byte of opcode
    inc esi
    jmp [firstFPUOpcodeByteTable + eax * 4]

handleCLD:
    mov eax,.doInstruction
    call put_this_into_the_translation_cache
.doInstruction:
    and [CPU.eflags],~FLAG_DIRECTION
    jmp decodeNextInstruction

bewing wrote: Sadly, this would completely and utterly break every bit of the code in the entire program.

That's why (IMHO) the CPU emulation code, the memory controller and the RAM connected to the memory controller should be combined into a separate plugin (just like devices are plugins). That would allow several different "CPU plugins" (one for interpreted, one for dynamic translation, one for VMX, one for Itanium, one for PowerPC, etc) to all use the same virtual PCI bus and virtual devices.
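
As a rough illustration of what such a "CPU plugin" boundary might look like in C, here is a sketch built on a table of function pointers. Everything in it is hypothetical: `host_services`, `cpu_plugin`, the stub interpreter and the debug port device are invented for this example and do not correspond to any existing Bochs or QEMU interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Services the host emulator offers to any CPU plugin (hypothetical). */
typedef struct {
    void (*io_write)(void *bus, uint16_t port, uint32_t value, unsigned size);
    void *bus;                         /* opaque handle to the virtual bus */
} host_services;

/* The interface every CPU plugin would implement (hypothetical). */
typedef struct cpu_plugin {
    const char   *name;
    int           (*init)(struct cpu_plugin *self, const host_services *host,
                          size_t ram_bytes);
    unsigned long (*run)(struct cpu_plugin *self, unsigned long max_insns);
    void          (*raise_irq)(struct cpu_plugin *self, unsigned vector);
    void         *priv;                /* plugin-private state, RAM included */
} cpu_plugin;

/* A stub "interpreter" plugin that only exercises the interface. */
typedef struct {
    const host_services *host;
    size_t        ram_bytes;
    unsigned long executed;
    unsigned      pending_irq;
} stub_state;

static int stub_init(cpu_plugin *self, const host_services *host, size_t ram_bytes)
{
    stub_state *s = self->priv;

    assert(host != NULL && host->io_write != NULL);
    s->host = host;
    s->ram_bytes = ram_bytes;
    s->executed = 0;
    s->pending_irq = 0;
    return 0;
}

static unsigned long stub_run(cpu_plugin *self, unsigned long max_insns)
{
    stub_state *s = self->priv;
    unsigned long i;

    for (i = 0; i < max_insns; i++) {
        s->executed++;
        /* Pretend every instruction is a one-byte write to port 0x80. */
        s->host->io_write(s->host->bus, 0x80, (uint32_t)(s->executed & 0xFF), 1);
    }
    return max_insns;
}

static void stub_raise_irq(cpu_plugin *self, unsigned vector)
{
    stub_state *s = self->priv;
    s->pending_irq = vector;           /* a real CPU would queue and deliver it */
}

static stub_state stub_private;
cpu_plugin stub_cpu_plugin = {
    "stub-interpreter", stub_init, stub_run, stub_raise_irq, &stub_private
};

/* Example host-side device: counts writes to port 0x80. */
typedef struct { unsigned writes; uint32_t last_value; } debug_port;

void debug_port_write(void *bus, uint16_t port, uint32_t value, unsigned size)
{
    debug_port *dev = bus;

    (void)size;
    if (port == 0x80) {
        dev->writes++;
        dev->last_value = value;
    }
}
```

The host only ever calls through the `cpu_plugin` pointers, so an interpreter, a dynamic translator or a VMX backend could be swapped in without the devices noticing. A real interface would also need memory-mapped I/O, DMA and reset hooks, which are left out here.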

Cheers,

Brendan

stlw · Post by **stlw** » Thu Jan 15, 2009 11:00 am

Brendan wrote: AFAIK the fastest way to decode instructions is with jump tables.

The fastest way is to cache and NOT decode. A hardware CPU has a 99% hit rate in its code caches, so with a cache of already decoded instructions the decode step can be avoided almost completely, and the decoder itself can be almost as slow as you like.
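
To make the "cache and NOT decode" idea concrete, here is a small sketch of a direct-mapped cache of decoded instructions keyed by guest address. It is an assumption-laden toy and not the Bochs implementation: the `decoded_insn` layout, the two-case decoder and the names `icache_fetch` and `icache_flush` are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICACHE_ENTRIES 1024u       /* must be a power of two */

/* Result of decoding one instruction (hypothetical layout). */
typedef struct {
    uint32_t addr;                 /* guest address the entry was decoded from */
    uint8_t  valid;
    uint8_t  opcode;
    uint8_t  length;               /* instruction length in bytes */
    uint16_t imm;
} decoded_insn;

typedef struct {
    decoded_insn  entry[ICACHE_ENTRIES];
    unsigned long decodes;         /* how often the slow decoder ran */
    unsigned long hits;            /* how often it was skipped */
} icache;

/* Toy "slow" decoder: MOV r16, imm16 (0xB8..0xBF) is 3 bytes long,
 * everything else is treated as a 1-byte instruction. */
static void decode_slow(const uint8_t *mem, uint32_t addr, decoded_insn *out)
{
    out->addr   = addr;
    out->valid  = 1;
    out->opcode = mem[addr];
    if ((out->opcode & 0xF8) == 0xB8) {
        out->length = 3;
        out->imm    = (uint16_t)(mem[addr + 1] | (mem[addr + 2] << 8));
    } else {
        out->length = 1;
        out->imm    = 0;
    }
}

/* Return the decoded form of the instruction at addr, decoding only on a miss. */
const decoded_insn *icache_fetch(icache *ic, const uint8_t *mem, uint32_t addr)
{
    decoded_insn *e;

    assert(ic != NULL && mem != NULL);
    e = &ic->entry[addr & (ICACHE_ENTRIES - 1)];
    if (e->valid && e->addr == addr) {
        ic->hits++;
        return e;
    }
    decode_slow(mem, addr, e);
    ic->decodes++;
    return e;
}

/* Drop all entries. A real emulator must do this (or something finer
 * grained) whenever the guest writes to memory that holds cached code. */
void icache_flush(icache *ic)
{
    size_t i;

    for (i = 0; i < ICACHE_ENTRIES; i++)
        ic->entry[i].valid = 0;
}
```

Running a three-instruction loop three times through `icache_fetch` decodes each instruction once and hits the cache on every later pass, which is why the cost of the decoder matters much less than the cost of the lookup.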

Brendan wrote: That's why (IMHO) the CPU emulation code, the memory controller and the RAM connected to the memory controller should be combined into a separate plugin (just like devices are plugins). That would allow several different "CPU plugins" (one for interpreted, one for dynamic translation, one for VMX, one for Itanium, one for PowerPC, etc) to all use the same virtual PCI bus and virtual devices.

I like the idea. I am thinking about the interface of a CPU plugin in Bochs and what it needs in order to connect to the outside world ...

Stanislav

bewing · Post by **bewing** » Tue Jan 20, 2009 8:30 am

Brendan wrote: AFAIK the fastest way to decode instructions is with jump tables. ...

That is interesting code, and your point may well be true -- but part of the point of the bochs code is that none of it is in CPU-specific ASM.
So I'd think the question would be "what is the fastest way to decode instructions that can be written (somewhat cleanly) in pure C?" Which is probably a translation of those jump tables into "switch" statements. I am also unsure whether instruction decoding, or instruction simulation, or the per-instruction cpu_loop overhead takes the most per-instruction time. I'm inclined to believe that it's the cpu_loop. If you singlestep through a single instruction loop, there are at least 500 instructions of overhead for every emulated opcode. I haven't bothered to count precisely, but it's a lot.
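
A translation of the jump-table idea into a "switch" statement might look like the sketch below. It mirrors the prefix handling from Brendan's assembly (FS override, operand-size prefix, CLD) in portable C; the `cpu_state` layout and the `step` function are assumptions made for this example, and only a few opcodes are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SEG_DS = 3, SEG_FS = 4 };   /* x86 segment register encodings */
#define DECODE_OPSIZE  0x01u       /* operand-size prefix seen */
#define FLAG_DIRECTION 0x0400u     /* EFLAGS.DF is bit 10 */

typedef struct {
    uint32_t       eflags;
    uint32_t       decode_flags;
    int            default_seg;
    const uint8_t *ip;             /* host pointer to the next guest code byte */
} cpu_state;

/* Decode and execute one instruction.
 * Returns 0 on success, -1 on an opcode this sketch does not handle. */
int step(cpu_state *cpu)
{
    assert(cpu != NULL && cpu->ip != NULL);
    cpu->decode_flags = 0;
    cpu->default_seg  = SEG_DS;

    for (;;) {
        uint8_t op = *cpu->ip++;

        switch (op) {
        case 0x64:                          /* FS segment override prefix */
            cpu->default_seg = SEG_FS;
            continue;                       /* go fetch the next byte */
        case 0x66:                          /* operand-size prefix */
            cpu->decode_flags ^= DECODE_OPSIZE;
            continue;
        case 0x90:                          /* NOP */
            return 0;
        case 0xFC:                          /* CLD */
            cpu->eflags &= ~FLAG_DIRECTION;
            return 0;
        case 0xFD:                          /* STD */
            cpu->eflags |= FLAG_DIRECTION;
            return 0;
        default:
            return -1;
        }
    }
}
```

Compilers usually turn a dense `switch` like this into a jump table, so the generated code can end up close to the hand-written assembly, although that depends on the compiler and its optimisation settings.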

Brendan wrote: That's why (IMHO) the CPU emulation code, the memory controller and the RAM connected to the memory controller should be combined into a separate plugin (just like devices are plugins).

Well, once you take out all the CPU emulation code, there's not much left to bochs. If you just delete all the code from the "cpu" directory, and replace it with drop-in replacement code for a different CPU, I think that would work just as well. I don't see that you'd buy anything for going to all the extra trouble to turn the code into an actual plugin.

stlw wrote: The fastest way is to cache and NOT decode.

Except that requires that you be running on the same type of CPU that you are emulating. And there are some instructions that will GPF if you do that, and you can't really know which ones they are until you decode, so you have to partially decode them all anyway.

JohnnyTheDon wrote: I had both do 1 billion random accesses of the same size (long).

The problem is with the words "random accesses".

Arrays are not magic bullets. They merely give you the opportunity to write more optimized code. The keywords being "locality" and "prefetch".
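
One way to picture the "locality" point is to compare two layouts for the same data when only one small field is read in a loop. The sketch below is an illustration under assumptions (64-byte cache lines, a made-up `hits` counter with 60 bytes of rarely used data next to it); it does not measure anything and does not settle the structures-versus-arrays argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 4096

/* Array of structures: each hot 4-byte counter sits next to 60 cold bytes,
 * so consecutive counters are 64 bytes apart. */
typedef struct {
    uint32_t hits;
    uint8_t  cold[60];
} entry_aos;

/* Structure of arrays: the hot counters are packed together, so 16 of them
 * share one 64-byte cache line (assuming that line size). */
typedef struct {
    uint32_t hits[N];
    uint8_t  cold[N][60];
} table_soa;

/* Walks memory with a 64-byte stride: roughly one cache line per element. */
uint64_t sum_hits_aos(const entry_aos *e, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    assert(e != NULL);
    for (i = 0; i < n; i++)
        sum += e[i].hits;
    return sum;
}

/* Walks memory with a 4-byte stride: roughly one cache line per 16 elements,
 * a pattern that hardware prefetchers generally handle well. */
uint64_t sum_hits_soa(const table_soa *t, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    assert(t != NULL && n <= N);
    for (i = 0; i < n; i++)
        sum += t->hits[i];
    return sum;
}
```

Both functions compute the same result; the difference is only in how many cache lines the loop has to pull in. With purely random accesses neither layout has an advantage, which is the objection to the benchmark quoted above.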

Brendan · Post by **Brendan** » Tue Jan 20, 2009 8:56 am

Hi,

bewing wrote:
Brendan wrote: AFAIK the fastest way to decode instructions is with jump tables. ...
That is interesting code, and your point may well be true -- but part of the point of the bochs code is that none of it is in CPU-specific ASM.

But if there were "CPU plugins"...

bewing wrote: So I'd think the question would be "what is the fastest way to decode instructions that can be written (somewhat cleanly) in pure C?" Which is probably a translation of those jump tables into "switch" statements. I am also unsure whether instruction decoding, or instruction simulation, or the per-instruction cpu_loop overhead takes the most per-instruction time. I'm inclined to believe that it's the cpu_loop.

I profiled Bochs ages ago, and (from memory) I think it was the CPU_loop that consumed close to half the execution time. The rest of the execution time mostly came from a large number of infrequently used functions. Also, Bochs has been improved a lot since then, and old results don't work so well on new code...

bewing wrote: If you singlestep through a single instruction loop, there are at least 500 instructions of overhead for every emulated opcode. I haven't bothered to count precisely, but it's a lot.

Single-stepping might be a lot, but the performance of single-stepping itself doesn't matter and I'm not too sure how much of that overhead is from single-stepping (and how much is from decoding and emulation).

bewing wrote:
Brendan wrote: That's why (IMHO) the CPU emulation code, the memory controller and the RAM connected to the memory controller should be combined into a separate plugin (just like devices are plugins).
Well, once you take out all the CPU emulation code, there's not much left to bochs. If you just delete all the code from the "cpu" directory, and replace it with drop-in replacement code for a different CPU, I think that would work just as well. I don't see that you'd buy anything for going to all the extra trouble to turn the code into an actual plugin.

That's the plan - rather than implementing one emulator, create a set of standards that ensure compatibility between a loose collection of modules, where anyone can write these modules and any existing emulator can support these modules (and then build it all on top of an OS where the host OS's device drivers allow real hardware to act as "device modules" within a compatible emulator or virtual machine).

bewing wrote:
stlw wrote: The fastest way is to cache and NOT decode.
Except that requires that you be running on the same type of CPU that you are emulating. And there are some instructions that will GPF if you do that, and you can't really know which ones they are until you decode, so you have to partially decode them all anyway.

It doesn't require the same type of CPU - it's sort of like dynamic translation (only less so). Also note that in my example code I hinted at this - there's a "call put_this_into_the_translation_cache" line (near the bottom).

Cheers,

Brendan

AJ · Post by **AJ** » Tue Jan 20, 2009 9:09 am

Hi,

First, I should point out that my knowledge of VM/emulator implementation is little better than useless, but I use emulators and VMs regularly (as do most hobbyist OS devvers, I guess...).

My impression is that VirtualBox has a pretty modular design with well-designed interfaces between modules. It's also open source. Assuming that the above is correct, wouldn't the best thing be to create an emulated CPU plugin to replace the virtualisation engine in VirtualBox? You would get hardware modules and a nice interface for free.

Cheers,
Adam

JohnnyTheDon · Post by **JohnnyTheDon** » Tue Jan 20, 2009 9:30 am

bewing wrote:
JohnnyTheDon wrote: I had both do 1 billion random accesses of the same size (long).
The problem is with the words "random accesses". Arrays are not magic bullets. They merely give you the opportunity to write more optimized code. The keywords being "locality" and "prefetch".

? Could you give me an example where an array would make a difference with prefetching? When writing C code, do you prefetch array members before you access them?

Ready4Dis · Post by **Ready4Dis** » Tue Jan 20, 2009 11:08 am

I had this discussion a long time ago on Gamedev when talking about emulators, so I took the liberty of writing an assembler, emulator, and virtual machine just for testing. It supports a very limited set of opcodes that aren't meant to simulate any real hardware, but it gives you several ways of doing system calls (function tables, switch, inlined, non-inlined, etc). It has a high-resolution timer (the timing uses inline asm, but none of the other code is asm). It is very plain C code and should be pretty easy to understand. It is located at http://ready4dis.8m.com, the file is gdev_vm.zip. It was written and compiled in June of 2003. I did notice that the compiler used had a large impact on the difference in speeds (make sure you compile as a release build). Feel free to check it out and comment, add other methods, report results, etc.

jal · Post by **jal** » Tue Jan 20, 2009 3:12 pm

bewing wrote: So I'd think the question would be "what is the fastest way to decode instructions that can be written (somewhat cleanly) in pure C?" Which is probably a translation of those jump tables into "switch" statements.

Or, if you stick to gcc, use &&label stuff...

JAL
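
The "&&label stuff" refers to the GCC "labels as values" extension (also accepted by Clang), which lets C code jump through a table of label addresses much like the assembly jump tables earlier in the thread. The sketch below uses a made-up four-opcode bytecode purely to show the mechanism; it is not standard C and will not build with compilers that lack the extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy bytecode, for illustration only. */
enum { OP_HALT, OP_INC, OP_DEC, OP_ADD5, OP_COUNT };

/* Interpret the bytecode with computed goto and return the accumulator.
 * Requires the GCC/Clang "labels as values" extension. */
long run_bytecode(const uint8_t *code)
{
    /* One label address per opcode, indexed by the opcode value. */
    static void *const dispatch[OP_COUNT] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_add5
    };
    const uint8_t *pc = code;
    long acc = 0;

/* Fetch the next opcode and jump straight to its handler. */
#define NEXT() do { assert(*pc < OP_COUNT); goto *dispatch[*pc++]; } while (0)

    NEXT();

do_inc:
    acc++;
    NEXT();
do_dec:
    acc--;
    NEXT();
do_add5:
    acc += 5;
    NEXT();
do_halt:
    return acc;

#undef NEXT
}
```

Each handler ends with its own indirect jump instead of returning to a shared `switch`, which is the property people usually want from this technique; whether it is actually faster depends on the compiler and the CPU's branch predictor.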
