
Hypervisors, emulators & device drivers

Posted: Sun Jul 03, 2005 9:19 am
by Brendan
Hi,

I've been thinking (not necessarily a good thing)...

Intel and AMD have both released full details of their virtualization support - "LaGrande" and "Pacifica" respectively.

Microsoft have announced their intention of building a hypervisor into a "stripped down" version of Windows (possibly released as a service pack for Longhorn). IMHO this will lead to full-featured versions of Windows with the ability to run other OSs under the built-in hypervisor. Then there are the traditional emulators - Bochs, Qemu, VirtualPC, VMWare, etc.

Now, my OS is intended to (eventually) be ported to other platforms so that a cluster can consist of a variety of architectures. The things above make me wonder if it'd be best to build an "emulation API" into the OS design, where the user (or OS) could start up an emulation of any other supported architecture in order to run another OS in a window, or an application designed for my OS on a different architecture than what it was compiled for.

That way the "emulation API" would be able to make use of virtualization hardware if present, while using traditional emulation if it's not. My normal "thread is an object" approach would be used to split the emulation into many modules, so that the "core" of the emulator could be changed (e.g. from Bochs/Qemu style to hardware-supported hypervisor style) without affecting the other modules (virtual devices, etc).

I'd be able to write the code for the "virtual devices" at the same time as (or soon after) writing the device drivers for my OS - reducing the amount of research required to get it all working. I'd also be able to build emulation support into the device drivers themselves, so that real devices could be assigned to the emulator instead of being used by the host OS (i.e. the guest OS would be able to use real hardware).

IMHO there are 3 types of projects that are absolutely huge - OSs, compilers and emulators. I've always intended to tackle each of these one at a time, but I'm not sure if tackling 2 at once is a good idea. While I'd find it a bit overwhelming, the more I think about it the more I think it'd be easier to do the emulator/hypervisor and OS at the same time.

Now for the questions :).

Has anyone else considered this approach?

Should I do it (or more realistically, should I attempt it)?

Any other comments, queries or suggestions?


Thanks,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Sun Jul 03, 2005 3:25 pm
by RetainSoftware
Hi,

I've also considered this approach... well, without LaGrande and Pacifica virtualization back then, solely with the emulation approach. I've written a PlayStation emulator, 4EverPSX, and EternalNGE; they are not finished, but they gave me enough insight into the amount of effort an emulator takes.

The reason for doing this in the OS is that you have control over the memory space. With a 64-bit address space you can emulate 32-bit (and smaller) platforms considerably faster, because the guest's whole address space can be mapped directly and memory fetches become cheap; even dynamically recompiled code runs faster, so there are a lot of low-level gains. However, if you want a higher level of emulation, then the OS approach is less beneficial, but it still works.
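To make the memory point concrete, here is a minimal sketch, scaled down so it runs anywhere: a hypothetical guest with a 16-bit address space backed by one 64 KiB host buffer. Every possible guest address is a valid index, so loads and stores need no bounds checks, and wrap-around comes free from the integer type. The same reasoning would apply to a 32-bit guest on a 64-bit host that reserves a 4 GiB region. The names (`guest_mem`, `guest_read16`, etc.) are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One host buffer covers every possible 16-bit guest address. */
#define GUEST_SPACE (1u << 16)

typedef struct {
    uint8_t *base;
} guest_mem;

static int guest_mem_init(guest_mem *m) {
    m->base = calloc(GUEST_SPACE, 1);
    return m->base != NULL;
}

/* No bounds checks needed: a uint16_t index can never leave the buffer. */
static uint8_t guest_read8(const guest_mem *m, uint16_t addr) {
    return m->base[addr];
}

static void guest_write8(guest_mem *m, uint16_t addr, uint8_t v) {
    m->base[addr] = v;
}

/* Little-endian 16-bit read; the second byte wraps from 0xFFFF to 0x0000. */
static uint16_t guest_read16(const guest_mem *m, uint16_t addr) {
    uint8_t lo = guest_read8(m, addr);
    uint8_t hi = guest_read8(m, (uint16_t)(addr + 1));
    return (uint16_t)(lo | (hi << 8));
}
```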

Other problems will lie in graphics display, sound, etc. These need a very transparent API, which takes a lot of effort. In the end you want an API that chooses OpenGL, SDL, bitmaps or whatever at the OS level, so that an emulator author doesn't have to worry about those issues.
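One possible shape for such an API, assuming a plain table of function pointers: the emulator core renders into its own framebuffer and calls `display_present`, while the OS decides which backend (OpenGL, SDL, bitmaps, ...) sits behind it. All names here are hypothetical, and the "null" backend only counts frames so the sketch stays self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* What a backend must provide; the emulator never sees which one is in use. */
typedef struct display_ops {
    const char *name;
    /* Push a width x height framebuffer of 32-bit pixels to the screen. */
    void (*present)(void *ctx, const uint32_t *pixels, int width, int height);
} display_ops;

typedef struct display {
    const display_ops *ops;
    void *ctx;                  /* backend-private state */
} display;

/* The only call the emulator author makes. */
static void display_present(display *d, const uint32_t *pixels, int w, int h) {
    d->ops->present(d->ctx, pixels, w, h);
}

/* A trivial backend that just counts frames (stands in for OpenGL/SDL/etc.). */
typedef struct {
    int frames;
    long last_pixels;
} null_ctx;

static void null_present(void *ctx, const uint32_t *pixels, int w, int h) {
    null_ctx *c = ctx;
    (void)pixels;
    c->frames++;
    c->last_pixels = (long)w * h;
}

static const display_ops null_backend = { "null", null_present };
```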

But coming back to your question, I've considered this and am still considering it. My OS, COBOS, stands for COntent Based OS, with content meaning video, audio and games (emulator-wise),
so I'll have a go at it, but I know it will be extremely difficult and time-consuming.

Should you do it? Yes, of course!
Will you succeed? I think so.
Within five years? I guess not.

Same goes for me. Though it would be very nice to see in the end. Maybe it would be nice to do some brainstorming about this.

Greets,

Rene

Re:Hypervisors, emulators & device drivers

Posted: Sun Jul 03, 2005 9:52 pm
by Brendan
Hi,

I've been doing more thinking!

The idea would be to have a common interface for all emulation (including hypervisors) so that any software from any architecture could be run. For LaGrande and Pacifica the hypervisor needs to run at CPL=0, which for my OS (a "micro-kernel-ish" thing where device drivers, etc. run at CPL=3) implies that it would need to be built as a kernel module rather than as normal software.

For simplicity, my OS design would be extended such that the native emulation would run as a kernel module, and emulation of different architectures would run at CPL=3 through the same interface.

As for device driver support, I'm not too sure here. My original idea was to allow devices to be shared at the device driver level, but this seems less applicable after some thought. My device manager (with its auto-detection) should suffice, as the resources used by a given device can be identified and all of those resources assigned to a virtual computer. This leaves me with raw device support for the emulator (which is more useful in a hypervisor setting - the ability to split a large computer into independent smaller computers, mostly at the hardware level) and a set of virtual devices. A virtual device would be a plug-in module for the emulator/hypervisor (and possibly for the host OS, such that the host OS can use a matching device driver and virtual device without the emulator).
RetainSoft wrote:Should you do it? Yes, of course!
Will you succeed? I think so.
Within five years? I guess not.
Thanks!

As for the amount of time it'll take, I'd be very surprised if I can get anything useful (e.g. able to run Windows) working well in less than 10 years, but the sooner I start the sooner it'd be done :). In the short term I guess the main goal is to extend the OS design rather than having a complete implementation of it.
RetainSoft wrote:Maybe it would be nice to do some brainstorming about this.
Indeed it would :).

For me, I'm looking at several interfaces:
- the common interface to the hypervisors/emulators used by other software to start and monitor a virtual computer
- the interface between my device manager and the hypervisors/emulators that allows raw device assignment
- the interface between virtual device plug-ins and the hypervisors/emulators
- the interface between virtual networking devices and any network the host computer is connected to

Then there's the core of each hypervisor/emulator (one for each architecture), which would have 4 modes:
- fast, whole computer, using dynamic translation and any other tricks to get performance out of it
- slow, whole computer, using strict emulation for debugging purposes (not necessarily real time)
- fast, user mode, using dynamic translation and any other tricks to run applications written for my OS only
- slow, user mode, using strict emulation for debugging applications written for my OS only

Of these modes, the hypervisor/emulator must be able to switch between fast and slow modes at the user's request. The "user mode" versions aren't needed for the native architecture, as the applications would run natively :).
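The four modes, and the rule that user mode isn't needed for the native architecture, could be captured in a small check like the following. This is only a sketch: `vm`, `vm_set_mode` and `vm_toggle_speed` are hypothetical names, and a real core would also need to flush or rebuild translated code when switching speed.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODE_FAST, MODE_SLOW } core_speed;      /* translated vs strict emulation */
typedef enum { SCOPE_WHOLE, SCOPE_USER } core_scope;   /* whole computer vs one application */

typedef struct vm {
    core_speed speed;
    core_scope scope;
    bool guest_is_native;   /* guest architecture == host architecture */
} vm;

/* Returns false for combinations the design says are not needed. */
static bool vm_set_mode(vm *v, core_speed speed, core_scope scope) {
    if (scope == SCOPE_USER && v->guest_is_native)
        return false;       /* native applications run directly, no emulation */
    v->speed = speed;
    v->scope = scope;
    return true;
}

/* Fast <-> slow switch at the user's request, keeping the current scope. */
static void vm_toggle_speed(vm *v) {
    v->speed = (v->speed == MODE_FAST) ? MODE_SLOW : MODE_FAST;
}
```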

To begin, I'd start with emulating an 8086 and then work up towards more recent CPUs. I'd also focus my efforts on getting the device manager and emulator to handle raw device access. For example, on a computer with a floppy drive, hard drive and 2 video cards, my OS would be able to boot from the floppy and use one video card while the emulator has direct use of the hard drive and second video card. This would save me from emulating the video card and hard drive to begin with :)...

Then there's the BIOS itself - I want to do this twice! The first version will be the actual BIOS ROM image, which will be created dynamically so that things like the MP specification table can be automatically generated, rather than having separate BIOS images for each combination. On top of this the emulator should have "hard-coded" BIOS functions, so that if the guest OS calls a BIOS function it can be performed by the emulator without emulating the code in the ROM on an instruction-by-instruction basis.
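As an example of something the dynamically generated ROM would have to get right: the MP specification's floating pointer structure is 16 bytes long, with a checksum byte chosen so that all 16 bytes sum to zero, so it must be recomputed whenever its contents change. A sketch of building one; the function name is hypothetical, the byte layout follows the Intel MultiProcessor Specification 1.4, and the MP configuration table it points to (with one entry per emulated CPU) is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the 16-byte MP Floating Pointer Structure (MP spec 1.4 layout). */
static void mp_build_floating_pointer(uint8_t out[16], uint32_t config_table_addr) {
    memset(out, 0, 16);
    memcpy(out, "_MP_", 4);                       /* signature */
    out[4] = (uint8_t)(config_table_addr);        /* physical address of the  */
    out[5] = (uint8_t)(config_table_addr >> 8);   /* MP configuration table,  */
    out[6] = (uint8_t)(config_table_addr >> 16);  /* little-endian            */
    out[7] = (uint8_t)(config_table_addr >> 24);
    out[8] = 1;                                   /* length in 16-byte units */
    out[9] = 4;                                   /* specification revision 1.4 */
    /* out[11] (feature byte 1) stays 0: a configuration table is present. */
    uint8_t sum = 0;
    for (int i = 0; i < 16; i++)
        sum = (uint8_t)(sum + out[i]);
    out[10] = (uint8_t)(0u - sum);                /* make all 16 bytes sum to 0 */
}
```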

I'm meant to be elsewhere, so a few quick questions:

Can anyone think of a free/downloadable OS that will still run on an 8086?

Has anyone got any advice on doing dynamic translation?


Cheers,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Mon Jul 04, 2005 12:23 am
by RetainSoftware
Hi,
Brendan wrote:To begin, I'd start with emulating an 8086 and then work up towards more recent CPUs. I'd also focus my efforts on getting the device manager and emulator to handle raw device access. For example, on a computer with a floppy drive, hard drive and 2 video cards, my OS would be able to boot from the floppy and use one video card while the emulator has direct use of the hard drive and second video card. This would save me from emulating the video card and hard drive to begin with ...
Emulation of the video card should be part of the emulator, not the OS, because the 8086 PC, Game Boy, SNES, PlayStation, etc. all do graphics differently, so I think emulator authors want more of an API for drawing primitives. If the API uses the hardware directly you would get maximum performance, so it's a kind of HAL, like DirectX.

Same goes for sound, of course.
Brendan wrote:Then there's the BIOS itself - I want to do this twice! The first version will be the actual BIOS ROM image, which will be created dynamically so that things like the MP specification table can be automatically generated, rather than having separate BIOS images for each combination. On top of this the emulator should have "hard-coded" BIOS functions, so that if the guest OS calls a BIOS function it can be performed by the emulator without emulating the code in the ROM on an instruction-by-instruction basis.
The first is called low-level emulation (LLE) and the latter high-level emulation (HLE); in general HLE gives the best performance and LLE the highest accuracy.
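A minimal sketch of how HLE and LLE could coexist for BIOS calls: when the guest executes `INT n`, the emulator uses a native handler if one exists (and HLE is enabled), and otherwise falls back to running the ROM code. `cpu_state`, `hle_table` and `lle_enter_rom` are hypothetical; the INT 12h handler just returns the conventional memory size in KB in AX, as the real BIOS service does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct cpu_state {
    uint16_t ax;
    uint32_t base_mem_kb;   /* emulated conventional memory size */
    int lle_calls;          /* how often we fell back to the ROM code */
} cpu_state;

typedef void (*hle_handler)(cpu_state *);

/* INT 12h: return conventional memory size in KB in AX. */
static void hle_int12(cpu_state *c) { c->ax = (uint16_t)c->base_mem_kb; }

static hle_handler hle_table[256] = { [0x12] = hle_int12 };

/* Stand-in for "push FLAGS/CS/IP and jump to the ROM's handler". */
static void lle_enter_rom(cpu_state *c, uint8_t vector) {
    (void)vector;
    c->lle_calls++;
}

static void do_software_interrupt(cpu_state *c, uint8_t vector, bool hle_enabled) {
    if (hle_enabled && hle_table[vector] != NULL)
        hle_table[vector](c);        /* HLE: done natively, fast */
    else
        lle_enter_rom(c, vector);    /* LLE: emulate the ROM code instruction by instruction */
}
```

One caveat: the guest (or a DOS TSR) may hook an interrupt vector, so a real implementation would probably only take the HLE path while the interrupt vector table entry still points into the emulator's own ROM.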
Brendan wrote:Can anyone think of a free/downloadable OS that will still run on an 8086?
Only DOS comes to mind, and GEOS, but I thought that was just a graphical shell over DOS.
Brendan wrote:Has anyone got any advice on doing dynamic translation?
Check some open-source emulators for the PlayStation or Nintendo 64, for instance; most of them use dynamic translation and some come with documentation about it. This should provide enough information.

For the next two days I'm away for work, but I'll be thinking more about this subject.
The inner flame has kind of rekindled :D

Greets,

Rene

Re:Hypervisors, emulators & device drivers

Posted: Mon Jul 04, 2005 12:51 am
by AR
I could be wrong as I only glanced through the specification, but don't Intel's virtualization instructions only support 80386+, with protected mode and paging compulsory in the emulated environment?

Edit:
VMX operation places restrictions on processor operation. These are detailed below:
VMX operation restricts the values that may be loaded in registers CR0 and CR4. The following bits must be 1: CR0.PE, CR0.NE, CR0.PG, and CR4.VMXE. VMXON fails if any of these bits are clear (see "VMXON - Enter VMX Operation" on page 7-26). Any attempt to clear these bits during VMX operation (including VMX root operation) using the MOV CR instruction causes a general-protection exception. These bits cannot be cleared by VM entry or VM exit.

CR0.PE and CR0.PG restrictions imply that VMX operation is supported only in paged protected mode (including IA-32e mode). Therefore, guest software cannot be run in unpaged protected mode or in real-address mode. If a VMM is to support guest software that expects to run in unpaged protected mode or in real-address mode, the VMM must support emulation of these modes. A VMM can use "identity" page tables to emulate unpaged protected mode and can use virtual-8086 mode as part of a strategy to emulate real-address mode.
(Intel Virtualization Technology Specification for the IA-32 Intel Architecture, Pg 11)
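The restrictions quoted above boil down to a check on the CR0 value the guest wants. A sketch with hypothetical names: real-mode guests need V86 or full emulation, unpaged protected mode can run on "identity" page tables supplied by the VMM, and only paged protected mode runs as-is. The bit positions (PE = bit 0, NE = bit 5, PG = bit 31) are the architectural ones; `loaded_cr0` assumes the usual approach of forcing the fixed bits on while showing the guest its own value through the VMCS CR0 read shadow.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE (1u << 0)    /* protection enable */
#define CR0_NE (1u << 5)    /* numeric error */
#define CR0_PG (1u << 31)   /* paging */

typedef enum {
    RUN_DIRECT,              /* paged protected mode: can run under VMX as-is */
    RUN_IDENTITY_PAGING,     /* unpaged protected mode: VMM supplies identity page tables */
    RUN_REAL_MODE_EMULATION  /* real-address mode: V86 monitor or full emulation */
} guest_strategy;

/* Classify the CR0 value the guest asks for, per the VMX restrictions. */
static guest_strategy classify_guest_cr0(uint32_t guest_cr0) {
    if (!(guest_cr0 & CR0_PE))
        return RUN_REAL_MODE_EMULATION;
    if (!(guest_cr0 & CR0_PG))
        return RUN_IDENTITY_PAGING;
    return RUN_DIRECT;
}

/* Value actually loaded into CR0 while the guest runs: fixed bits forced on. */
static uint32_t loaded_cr0(uint32_t guest_cr0) {
    return guest_cr0 | CR0_PE | CR0_NE | CR0_PG;
}
```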

Re:Hypervisors, emulators & device drivers

Posted: Mon Jul 04, 2005 1:38 am
by Brendan
Hi,
AR wrote:I could be wrong as I only glanced through the specification, but don't Intel's virtualization instructions only support 80486+, with protected mode and paging compulsory in the emulated environment?
This is correct!

I think this implies that to run real mode code you've got 3 choices:
- have a V86 monitor built into the host hypervisor, used instead of Intel's virtualization (not an option if the host OS uses long mode)
- have a working V86 monitor within the guest's environment
- use full emulation (e.g. like Bochs) instead

Using virtual 8086 mode within the host OS is not an option for me. Running a V86 monitor in the guest's environment would solve the problem; even though it would be like running a guest inside a guest, you'd still get almost full host CPU performance. The third option (full emulation) I need to support anyway, to allow the guest to be single-stepped, debugged and manipulated.

For the current 32 bit version of my OS I'll be planning 3 separate 80x86 "hypervisor/emulator" cores. The first will be full emulation, the second will be an Intel-specific core and the third will be an AMD-specific core (LaGrande and Pacifica are incompatible technologies that do the same thing :(). Both of the latter 2 versions will include most of the code from the "full emulation" version, so at any time I can switch between "hypervisor" and "emulator" modes.

For the time being I'll be focusing on the emulation core only - I can't test code written for a CPU that isn't available. I will be allowing for the hypervisor stuff in the design though, in the same way that I'm allowing for a 64 bit version of the OS without actually implementing any of it :).


Cheers,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Mon Jul 04, 2005 2:46 am
by Brendan
Hi,
RetainSoft wrote:Emulation of the video card should be part of the emulator, not the OS, because the 8086 PC, Game Boy, SNES, PlayStation, etc. all do graphics differently, so I think emulator authors want more of an API for drawing primitives. If the API uses the hardware directly you would get maximum performance, so it's a kind of HAL, like DirectX.

Same goes for sound of course.
Part of the advantage of a hypervisor is that a real computer can be sub-divided into separate virtual computers. Mainly this includes the devices themselves. For example, a dual CPU computer with 2 video cards, 2 USB controllers and 2 SCSI controllers could run 2 completely separate OSs (1 CPU, 1 video card and 1 SCSI controller per virtual computer) with minimal hypervisor overhead. The idea is that a company can buy a single large server rather than multiple smaller servers.

I'm also hoping that the distributed nature of my OS could be used to do the reverse - multiple smaller computers being used to emulate a single large server.

In any case, allowing the guest OS exclusive use of an entire device does make sense when the host architecture is the same as the guest architecture (and may even make sense when the architectures are different, especially for PCI and USB devices which are intended to be cross-platform).

I was, however, wrong - support for this "exclusive guest OS access" must be built into the host OS's device drivers (I can't just let the device manager assign the device's resources to the hypervisor/emulator). The problem (at least when Intel's and AMD's new virtualization instructions aren't used) is DMA and bus-mastering devices, where conversion between physical addresses and emulated physical addresses would be a minimum requirement.
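The minimum requirement described here, converting an emulated physical address into a real one before a bus-master device sees it, might look roughly like this. Assumptions: 4 KiB pages, a flat `guest_to_host` table filled in by the hypervisor (with host frame 0 reserved to mean "unmapped"), and transfers that must stay within host-contiguous pages (a real driver might instead split the transfer or use scatter-gather). All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define GUEST_PAGES 256                 /* 1 MiB of emulated RAM */

/* guest_to_host[gfn] = host physical frame number (0 = not mapped). */
static uint64_t guest_to_host[GUEST_PAGES];

/* Translate a guest-physical DMA range into one host-physical address.
   Fails if a page is unmapped or the range isn't contiguous on the host. */
static bool dma_translate(uint64_t guest_pa, uint32_t len, uint64_t *host_pa) {
    if (len == 0)
        return false;
    uint64_t first = guest_pa >> PAGE_SHIFT;
    uint64_t last = (guest_pa + len - 1) >> PAGE_SHIFT;
    if (last >= GUEST_PAGES)
        return false;
    for (uint64_t gfn = first; gfn <= last; gfn++) {
        if (guest_to_host[gfn] == 0)
            return false;
        if (gfn > first && guest_to_host[gfn] != guest_to_host[gfn - 1] + 1)
            return false;               /* would need splitting / scatter-gather */
    }
    *host_pa = (guest_to_host[first] << PAGE_SHIFT) | (guest_pa & (PAGE_SIZE - 1));
    return true;
}
```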
RetainSoft wrote:
Can anyone think of a free/downloadable OS that will still run on an 8086?
Only DOS comes to mind, and GEOS, but I thought that was just a graphical shell over DOS.
I've found FreeDOS, which claims to be able to run on pure 8086 (without EMM386.exe, etc). Currently checking it out (although it won't matter for a while yet :).
RetainSoft wrote:
Has anyone got any advice on doing dynamic translation?
Check some open-source emulators for the PlayStation or Nintendo 64, for instance; most of them use dynamic translation and some come with documentation about it. This should provide enough information.

For the next two days I'm away for work, but I'll be thinking more about this subject.
The inner flame has kind of rekindled :D


I've read through a few whitepapers, and have begun searching for playstation, nintendo, etc documentation as suggested. The basic idea seems to be to convert "basic blocks" of instructions into native instructions, and then execute the native instruction block. I'm missing a huge number of details - I figure it's probably going to take me a few weeks to get a better understanding of it (and shall begin with the basic emulator's framework and BIOS LLE code in the meantime :) ).
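To illustrate the basic-block idea, here is a sketch for a hypothetical toy guest ISA (one accumulator, 8-bit addresses, 2-byte instructions). It decodes each basic block once, up to the next branch or halt, into an array of host function pointers, caches it by guest PC, and reuses it on later visits. This is the simpler "pre-decoded" or threaded-code flavour: a real dynamic translator would emit native machine code into an executable buffer instead of calling a function per instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy guest ISA: 2 bytes per instruction (opcode, operand). */
enum { OP_HALT = 0x00, OP_ADDI = 0x01, OP_DEC = 0x02, OP_JNZ = 0x03 };

typedef struct cpu { uint8_t acc, pc; int halted; } cpu;

/* A translated instruction: a host function plus its pre-decoded operand. */
typedef struct uop { void (*fn)(cpu *, uint8_t); uint8_t arg; } uop;

#define MAX_UOPS 64
typedef struct block {
    int n;
    uop ops[MAX_UOPS];       /* straight-line part of the basic block */
    uint8_t term;            /* OP_JNZ or OP_HALT: the instruction ending the block */
    uint8_t target, fallthrough;
} block;

static block *cache[256];    /* translation cache, indexed by guest PC */
static int translations;     /* number of blocks translated so far */

static void do_addi(cpu *c, uint8_t a) { c->acc = (uint8_t)(c->acc + a); }
static void do_dec(cpu *c, uint8_t a)  { (void)a; c->acc = (uint8_t)(c->acc - 1); }

/* Decode guest code from pc up to (and including) the next branch or halt. */
static block *translate(const uint8_t *mem, uint8_t pc) {
    block *b = calloc(1, sizeof *b);
    assert(b != NULL);
    translations++;
    for (;;) {
        uint8_t op = mem[pc], arg = mem[(uint8_t)(pc + 1)];
        pc = (uint8_t)(pc + 2);
        if (op == OP_ADDI || op == OP_DEC) {
            assert(b->n < MAX_UOPS);          /* sketch: long blocks are not split */
            b->ops[b->n++] = (uop){ op == OP_ADDI ? do_addi : do_dec, arg };
        } else {                              /* unknown opcodes are treated as HALT */
            b->term = (op == OP_JNZ) ? OP_JNZ : OP_HALT;
            b->target = arg;
            b->fallthrough = pc;
            return b;
        }
    }
}

/* mem must hold 256 bytes (the whole 8-bit guest address space). */
static void run(cpu *c, const uint8_t *mem) {
    while (!c->halted) {
        block *b = cache[c->pc];
        if (b == NULL)
            b = cache[c->pc] = translate(mem, c->pc);
        for (int i = 0; i < b->n; i++)        /* execute the whole block */
            b->ops[i].fn(c, b->ops[i].arg);
        if (b->term == OP_HALT)
            c->halted = 1;
        else
            c->pc = c->acc ? b->target : b->fallthrough;
    }
}
```

Blocks starting at different addresses can overlap (with a program that loops back to its second instruction, the blocks at 0 and 2 both contain that instruction), and any cached block must be discarded if the guest writes to the bytes it was translated from.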


Thanks,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Tue Jul 05, 2005 8:16 pm
by Brendan
Hi,

Ok, my emulator has gone in a completely unexpected direction already...

In order to make it multi-threaded, I've got a "master IO" thread (for physical RAM and devices) and a separate thread for each virtual CPU. For a CPU to access RAM it needs to send a message to the master IO thread, which replies with the needed data. This led to emulating the CPU's cache to minimize messaging overhead.

Now I've got N-way set associative caching, implemented so that the cache parameters are set dynamically before virtual boot (eventually I'll do a pretty user interface for it).
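A sketch of an N-way set associative cache whose geometry is chosen at run time, with least-recently-used replacement. The `line_fetch_fn` callback stands in for the message to the master IO thread; `vcache`, `demo_fetch` and the rest are hypothetical names, and only byte reads are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fetches one line of emulated RAM; stands in for a message to the master IO thread. */
typedef void (*line_fetch_fn)(uint32_t line_start, uint8_t *dst, uint32_t line_size);

typedef struct cache_line {
    bool valid;
    uint32_t tag;
    uint64_t last_used;               /* for LRU replacement */
    uint8_t *data;
} cache_line;

typedef struct vcache {
    uint32_t line_size, sets, ways;   /* chosen at run time, before virtual boot */
    uint64_t clock, misses;
    cache_line *lines;                /* sets * ways entries, grouped by set */
    line_fetch_fn fetch;
} vcache;

static bool vcache_init(vcache *c, uint32_t line_size, uint32_t sets,
                        uint32_t ways, line_fetch_fn fetch) {
    memset(c, 0, sizeof *c);
    c->line_size = line_size; c->sets = sets; c->ways = ways; c->fetch = fetch;
    c->lines = calloc((size_t)sets * ways, sizeof *c->lines);
    if (c->lines == NULL) return false;
    for (size_t i = 0; i < (size_t)sets * ways; i++)
        if ((c->lines[i].data = malloc(line_size)) == NULL) return false;
    return true;
}

/* Read one byte through the cache. Division and modulo allow any geometry;
   with power-of-two sizes they could become shifts and masks. */
static uint8_t vcache_read8(vcache *c, uint32_t addr) {
    uint32_t line_no = addr / c->line_size;
    uint32_t set = line_no % c->sets, tag = line_no / c->sets;
    cache_line *way = &c->lines[(size_t)set * c->ways];
    cache_line *victim = &way[0];
    c->clock++;
    for (uint32_t i = 0; i < c->ways; i++) {
        if (way[i].valid && way[i].tag == tag) {          /* hit */
            way[i].last_used = c->clock;
            return way[i].data[addr % c->line_size];
        }
        if (!way[i].valid || way[i].last_used < victim->last_used)
            victim = &way[i];                              /* empty, or least recently used */
    }
    c->misses++;                                           /* miss: refill the victim line */
    c->fetch(line_no * c->line_size, victim->data, c->line_size);
    victim->valid = true;
    victim->tag = tag;
    victim->last_used = c->clock;
    return victim->data[addr % c->line_size];
}

/* Demo backing store: the byte at address a holds the low 8 bits of a. */
static void demo_fetch(uint32_t line_start, uint8_t *dst, uint32_t line_size) {
    for (uint32_t i = 0; i < line_size; i++)
        dst[i] = (uint8_t)(line_start + i);
}
```

With 64-byte lines, 4 sets and 2 ways, addresses 0x000, 0x100 and 0x200 all map to set 0, so touching all three evicts whichever line was used least recently.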

I'm hoping that using separate threads will be better for multi-CPU hosts (and will allow the emulator to be distributed across multiple host computers later). It also makes the emulation more accurate (e.g. it can be used to monitor cache misses). In addition, when I've got dynamic instruction translation working, it will become the "L1 instruction cache".

It also means that each CPU thread works from data and instructions in cache lines, which have little to do with physical addresses. I'll be able to emulate a computer with a huge amount of RAM, and emulating NUMA can be done by adding "slave IO" threads. Both master IO and slave IO threads are limited to emulating less than 2 GB of RAM, but with NUMA emulation this becomes a "per emulated CPU" limit rather than a "per emulated computer" limit.
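Routing an emulated physical address to the right IO thread could be as simple as giving each node one contiguous slice of the address space, each slice below the 2 GB per-thread limit. A sketch with a hypothetical `numa_layout`; an emulated NUMA topology need not be this regular.

```c
#include <assert.h>
#include <stdint.h>

#define NODE_RAM_LIMIT (2ull << 30)   /* each IO thread emulates < 2 GiB */

typedef struct numa_layout {
    unsigned nodes;
    uint64_t bytes_per_node;          /* must be below NODE_RAM_LIMIT */
} numa_layout;

/* Which IO thread (node) owns this emulated physical address? -1 if none. */
static int numa_owner(const numa_layout *l, uint64_t phys, uint64_t *offset_in_node) {
    assert(l->bytes_per_node > 0 && l->bytes_per_node < NODE_RAM_LIMIT);
    uint64_t node = phys / l->bytes_per_node;
    if (node >= l->nodes)
        return -1;
    *offset_in_node = phys % l->bytes_per_node;
    return (int)node;
}
```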

The problem is that every time I read or write from physical addresses I need to check for a cache miss, and any cache misses are expensive - it's going to have a heavy impact on performance.

While executing translated code I won't need to worry about cache misses at IP, and (while not running translated code) as long as IP is incremented sequentially (e.g. no calls, jumps or interrupts) I can avoid checking for cache misses unless the instruction crosses a cache line boundary.
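A sketch of the instruction-fetch shortcut, with one subtlety made explicit: an instruction that starts exactly at the beginning of the next cache line also needs a check, even though it crosses no boundary itself. So this version remembers which line was last checked instead of only testing for a crossing. The 64-byte line size and the names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed cache line size */

typedef struct fetch_state {
    bool line_valid;        /* cleared after a jump, call or interrupt */
    uint32_t checked_line;  /* line number last checked for a miss */
} fetch_state;

/* Returns how many cache-miss checks this instruction fetch needs (0, 1 or 2). */
static int fetch_checks(fetch_state *s, uint32_t ip, uint32_t len) {
    assert(len > 0);
    uint32_t first = ip / LINE_SIZE, last = (ip + len - 1) / LINE_SIZE;
    int checks = 0;
    if (!s->line_valid || first != s->checked_line)
        checks++;           /* the line holding the first byte */
    if (last != first)
        checks++;           /* instruction spills into the next line */
    s->line_valid = true;
    s->checked_line = last;
    return checks;
}
```

The same `first != last` comparison is what tells a data access that it is misaligned across two lines and so needs two miss checks.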

For reading and writing data I can't figure out how to minimize the cache miss checks (except perhaps for the stack). Worse, a misaligned read or write can cause 2 cache misses at once. Further, for a write I'm going to need to check for cache misses, then check for self-modifying code and evict any translated code that may have been affected.
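For the self-modifying code check, one common way to keep ordinary writes cheap (a general technique, not something described above) is a per-page count of translated blocks: a write to a page with no translated code costs a single array lookup, and only writes to pages holding code pay for a search. A sketch with hypothetical names and a linear search over blocks; a real emulator would want a better index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CODE_PAGE_SHIFT 12
#define GUEST_RAM_PAGES 256          /* 1 MiB of emulated RAM */
#define MAX_BLOCKS 128

typedef struct tblock { bool live; uint32_t start, end; } tblock;  /* guest range [start, end) */

static tblock blocks[MAX_BLOCKS];
static uint16_t code_count[GUEST_RAM_PAGES];   /* live translated blocks touching each page */

static void mark(uint32_t start, uint32_t end, int delta) {
    for (uint32_t p = start >> CODE_PAGE_SHIFT; p <= (end - 1) >> CODE_PAGE_SHIFT; p++)
        code_count[p] = (uint16_t)(code_count[p] + delta);
}

/* Record a newly translated block; returns its slot or -1 if full. */
static int register_block(uint32_t start, uint32_t end) {
    assert(end > start && ((end - 1) >> CODE_PAGE_SHIFT) < GUEST_RAM_PAGES);
    for (int i = 0; i < MAX_BLOCKS; i++)
        if (!blocks[i].live) {
            blocks[i] = (tblock){ true, start, end };
            mark(start, end, +1);
            return i;
        }
    return -1;
}

/* Called on every guest write; returns how many translated blocks were evicted. */
static int on_guest_write(uint32_t addr, uint32_t len) {
    assert(len > 0 && ((addr + len - 1) >> CODE_PAGE_SHIFT) < GUEST_RAM_PAGES);
    bool any = false;
    for (uint32_t p = addr >> CODE_PAGE_SHIFT; p <= (addr + len - 1) >> CODE_PAGE_SHIFT; p++)
        if (code_count[p] != 0) any = true;
    if (!any)
        return 0;                    /* fast path: no translated code on these pages */
    int evicted = 0;
    for (int i = 0; i < MAX_BLOCKS; i++)
        if (blocks[i].live && addr < blocks[i].end && blocks[i].start < addr + len) {
            blocks[i].live = false;  /* the write overlaps this block's source bytes */
            mark(blocks[i].start, blocks[i].end, -1);
            evicted++;
        }
    return evicted;
}
```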

This leads to a few questions:
a) are the benefits worth the performance problems?
b) what can I do to minimize the performance problems?
c) is there another way to do a multi-threaded emulator?


Cheers,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Tue Jul 05, 2005 11:00 pm
by smiddy
Brendan wrote: This leads to a few questions:
a) are the benefits worth the performance problems?
b) what can I do to minimize the performance problems?
c) is there another way to do a multi-threaded emulator?
For 'A' I suggest weighing the 'best valued' items and do an informal trade study, by doing a pair-wise comparison.

For 'B' and 'C' I'm no help...you've surpassed my comprehension of this subject with your discussion. I'm learning from your posts, which is most exciting.

Perhaps I can walk you through a trade study so that you can make a decision on moving forward? How's your linear algebra?

Re:Hypervisors, emulators & device drivers

Posted: Wed Jul 06, 2005 1:07 am
by Brendan
Hi,
smiddy wrote: For 'A' I suggest weighing the 'best valued' items and do an informal trade study, by doing a pair-wise comparison.

For 'B' and 'C' I'm no help...you've surpassed my comprehension of this subject with your discussion. I'm learning from your posts, which is most exciting.

Perhaps I can walk you through a trade study so that you can make a decision on moving forward? How's your linear algebra?
I'm willing and my linear algebra should suffice, but I'm also unsure what would be involved in an informal trade study. I assume it'd be a comparison of the total cost of using multi-threading for the emulator versus the total gain, including effects on performance, features, development time, etc?


Thanks,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Wed Jul 06, 2005 5:53 am
by smiddy
Brendan wrote:I'm willing and my linear algebra should suffice, but I'm also unsure what would be involved in an informal trade study. I assume it'd be a comparison of the total cost of using multi-threading for the emulator versus the total gain, including effects on performance, features, development time, etc?
Yes, for the most part. It also involves determining your alternatives. Your schedule plays a part in your decision too, and it can be converted into a cost (value). I'll send you some info this evening when I get home. I assume you have a spreadsheet of some type? For the calculations it is nice to have a copy of MATLAB, as it does matrix calculations quite well (hence the name).

Re:Hypervisors, emulators & device drivers

Posted: Wed Jul 06, 2005 7:37 am
by Brendan
Hi,
smiddy wrote:Yes, for the most part. It also involves determining your alternatives. Your schedule plays a part in your decision too, and it can be converted into a cost (value). I'll send you some info this evening when I get home. I assume you have a spreadsheet of some type? For the calculations it is nice to have a copy of MATLAB, as it does matrix calculations quite well (hence the name).
Thanks! It sounds like the sort of thing that would be quite useful for many decisions :).

For spreadsheets, I've got OpenOffice, StarOffice, MS Excel & Works - no MATLAB.


Cheers,

Brendan

Re:Hypervisors, emulators & device drivers

Posted: Sat Jul 09, 2005 12:18 pm
by smiddy
Sorry about my delay in getting back to you. I've been busy trying to get a job across country. Suffice it to say, I've made the next round and am being flown into their HQ for a meet and greet next week. Keep your fingers crossed. ;)

OK, yes, you are correct. There is a book out there on decision-making analysis. Unfortunately, mine is boxed up in preparation for moving. In it there are different ways to make 'best value' decisions or trade studies. I was going to grab an example out of it to share, but it is boxed up. (My wife rocks when it comes to getting things done around the house, but it can be a burden too ;)) So, I'm going to wing it a bit here, but this should get you in the ballpark (polo grounds, uhm, rugby pitch?).

First determine your alternatives.

Second determine the performance factors, or specifications.

In other words, say you have two cameras you would like to buy. One has a smaller sensor, but has 5 megapixels. The other has a huge sensor and also has 5 megapixels. The first has a fixed lens, but has 5x optical zoom as well as 6x digital zoom. The second comes with a kit of removable SLR autofocus lenses, though the zoom factor is nowhere near 5x. One uses miniature secure flash ROM. Two uses composite Ultra-II flash ROM. Both use a USB interface, and both have exciting software to manage your pictures. One is only $500 and two is $900. One has software-based shutter control; two has mechanical shutter control. One takes a second or two to focus and then shoot; two is immediate and can do 3.5 frames per second.

Now, line up all your options along the edges of a square grid so that each can be compared with every other (rows 1 to n and columns 1 to n). The entries down the diagonal are 1, meaning an item is equal to itself. When comparing item i with item j in the upper triangle (row i, column j), if the row item is the more significant one, rate the strength of its significance from 2 to 9; if the column item is more significant, use the inverse (1/2 to 1/9). The lower triangle is then the inverse of the upper triangle: entry (j, i) = 1 / entry (i, j). Do this for all the items...for a pairwise comparison.

Next, compute the eigenvector of the matrix, repeating until your most significant digits stop changing. In other words, square the matrix, then sum each row and divide each row sum by the total of all the row sums to get the percentage for each performance factor. Continue squaring the matrix until you are satisfied that the result is stable. For instance, if you want to be confident out to the fourth digit, keep going until the percentages from successive powers B^x and B^y differ by less than 0.0004 for every performance value.
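This is essentially the eigenvector step of what is usually called the Analytic Hierarchy Process (AHP). A sketch in C, assuming a fixed 3x3 matrix for brevity; the names are hypothetical. Each pass squares the matrix, rescales it so the numbers don't overflow (which doesn't change the normalised row sums), and stops once no weight moves by more than `tol`.

```c
#include <assert.h>
#include <string.h>

#define N 3   /* number of performance factors in this example */

/* out = a * b (out must not alias a or b) */
static void mat_mul(double a[N][N], double b[N][N], double out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            out[i][j] = 0;
            for (int k = 0; k < N; k++)
                out[i][j] += a[i][k] * b[k][j];
        }
}

/* Row sums divided by their total: the current estimate of the weights. */
static void row_weights(double m[N][N], double w[N]) {
    double total = 0;
    for (int i = 0; i < N; i++) {
        w[i] = 0;
        for (int j = 0; j < N; j++) w[i] += m[i][j];
        total += w[i];
    }
    for (int i = 0; i < N; i++) w[i] /= total;
}

/* Square the pairwise matrix until the weights settle; returns passes used, or -1. */
static int ahp_weights(double pairwise[N][N], double w[N], double tol) {
    double m[N][N], sq[N][N], prev[N];
    memcpy(m, pairwise, sizeof m);
    row_weights(m, prev);
    for (int pass = 1; pass <= 50; pass++) {
        mat_mul(m, m, sq);
        double max = 0;                        /* rescale to avoid overflow */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (sq[i][j] > max) max = sq[i][j];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = sq[i][j] / max;
        row_weights(m, w);
        double diff = 0;
        for (int i = 0; i < N; i++) {
            double d = w[i] - prev[i];
            if (d < 0) d = -d;
            if (d > diff) diff = d;
            prev[i] = w[i];
        }
        if (diff < tol)
            return pass;
    }
    return -1;
}
```

For a perfectly consistent matrix (entry (i, j) equal to w_i / w_j) the row sums are already proportional to the weights, so it settles immediately; the matrix with rows {1, 2, 6}, {1/2, 1, 3}, {1/6, 1/3, 1} gives weights of 0.6, 0.3 and 0.1.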

Next, rate each alternative on each performance factor (based on its real-world numbers), multiply those ratings by the weights from the stable eigenvector, and add up the totals; the alternative with the highest total is the winner.

From this you can wiggle the values a bit to see if there is a statistical significance for any one performance item and see if it is a huge factor or not.

From this you should be able to make a quantified decision...
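The scoring and "wiggling" steps might look like this. Assumptions: each alternative has already been rated on every factor on a common 0 to 1 scale where higher is better (raw numbers such as price need converting first), and `winner_changes` is a crude one-factor-at-a-time sensitivity check rather than a statistical test. All names are hypothetical.

```c
#include <assert.h>

#define FACTORS 3
#define ALTS 2

/* Total score of each alternative: sum over factors of weight * rating. */
static void score_alternatives(const double w[FACTORS], double rating[ALTS][FACTORS],
                               double total[ALTS]) {
    for (int a = 0; a < ALTS; a++) {
        total[a] = 0;
        for (int f = 0; f < FACTORS; f++)
            total[a] += w[f] * rating[a][f];
    }
}

static int best_alternative(const double w[FACTORS], double rating[ALTS][FACTORS]) {
    double total[ALTS];
    score_alternatives(w, rating, total);
    int best = 0;
    for (int a = 1; a < ALTS; a++)
        if (total[a] > total[best]) best = a;
    return best;
}

/* Nudge one factor's weight by delta, renormalise, and report whether the winner changes. */
static int winner_changes(const double w[FACTORS], double rating[ALTS][FACTORS],
                          int factor, double delta) {
    double w2[FACTORS], sum = 0;
    for (int f = 0; f < FACTORS; f++) {
        w2[f] = w[f] + (f == factor ? delta : 0);
        if (w2[f] < 0) w2[f] = 0;
        sum += w2[f];
    }
    assert(sum > 0);
    for (int f = 0; f < FACTORS; f++) w2[f] /= sum;
    return best_alternative(w2, rating) != best_alternative(w, rating);
}
```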

Let me know if this makes any sense. I will see if I can get an example for you that may provide some better insight.

Again, sorry about my delay, but I have a life too, or at least I try to. :D

Re:Hypervisors, emulators & device drivers

Posted: Mon Jul 11, 2005 4:44 am
by Brendan
Hi,
smiddy wrote:Sorry about my delay in getting back to you. I've been busy trying to get a job across country. Suffice it to say, I've made the next round and am being flown into their HQ for a meet and greet next week. Keep your fingers crossed. ;)
Good luck! :)
smiddy wrote:First determine your alternatives.

Second determine the performance factors, or specifications.
Ok, I figure there are only 4 alternatives:

a) With multi-threading, with one thread per virtual CPU and one thread per virtual memory/IO controller
b) With multi-threading, with one thread per virtual CPU only
c) A single-threaded emulator
d) No emulation

Everything else depends on choosing one of these alternatives. Performance factors, specification and features I'll discuss in the next post (it's where things get confusing).
smiddy wrote:Now, line up all your options along the edges of a square grid so that each can be compared with every other (rows 1 to n and columns 1 to n). The entries down the diagonal are 1, meaning an item is equal to itself. When comparing item i with item j in the upper triangle (row i, column j), if the row item is the more significant one, rate the strength of its significance from 2 to 9; if the column item is more significant, use the inverse (1/2 to 1/9). The lower triangle is then the inverse of the upper triangle: entry (j, i) = 1 / entry (i, j). Do this for all the items...for a pairwise comparison.
Ok - you lost me :). With the alternatives above, would I be after a set of matrices where each matrix looks like:

[tt]   a  b  c  d
a  1  -  -  -
b  -  1  -  -
c  -  -  1  -
d  -  -  -  1[/tt]

And then have a different matrix for each factor being considered?

Unfortunately eigenvalues and eigenvectors also elude me. I've tried finding web sites that explain how to calculate them, but only found sites explaining what they are used for (e.g. http://149.170.199.144/multivar/eigen.htm) and sites that require a degree in mathematics to understand (e.g. http://mathworld.wolfram.com/Eigenvector.html) - I was hoping for something aimed at games programmers (the typical matrix operations for 3D rotation, scaling, normalization, etc. were much easier to find & understand) :).
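For a games-programmer view: an eigenvector of a matrix is a vector that the matrix only scales and never changes the direction of, and the scale factor is the eigenvalue. The repeated squaring described in the previous post is a variant of power iteration, which finds the eigenvector with the largest eigenvalue. Below, A is the n x n pairwise-comparison matrix with entries a_ij, and w is the weight vector.

```latex
% Eigenvector w of the pairwise-comparison matrix A:
A\,w = \lambda_{\max}\,w, \qquad a_{ii} = 1, \qquad a_{ji} = \frac{1}{a_{ij}}

% Power iteration: multiply, then renormalise so the weights sum to 1
w^{(k+1)} = \frac{A\,w^{(k)}}{\sum_{i=1}^{n} \bigl(A\,w^{(k)}\bigr)_i},
\qquad w^{(0)} = (1, \dots, 1)^{\mathsf T}

% The row sums of A^{2^k} are A^{2^k} w^{(0)}, so squaring k times and
% normalising the row sums jumps straight to iteration 2^k.

% Perfectly consistent judgements, a_{ij} = w_i / w_j:
(A\,w)_i = \sum_{j=1}^{n} \frac{w_i}{w_j}\, w_j = n\, w_i
\quad\Longrightarrow\quad \lambda_{\max} = n
```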
smiddy wrote:Again, sorry about my delay, but I have a life too, or at least I try to. :D
I'm also responding later than I'd hoped (despite my continued attempts at avoiding a real life ;) ). For the last 4 days I've been spending what time I can trying to get the framework for "option A" working, in the hope of measuring some actual performance values (more on this in my next post).

[continued in next post]

Re:Hypervisors, emulators & device drivers

Posted: Mon Jul 11, 2005 4:46 am
by Brendan
[continued from previous post]

Ok - the performance factors or specifications. Generally, everything resolves to some form of compromise between features and performance, or single-CPU performance and multi-CPU performance, or performance and other factors.

The major compromise is between the features of multi-threading (making use of multi-CPU computers and the possibility of distributing the emulator across many computers) and the performance problems of doing this (messaging overheads, context switches, etc).

The second compromise, between single-CPU performance and multi-CPU performance, revolves around the first compromise. Basically, if the emulator is multi-threaded then different threads can run on different CPUs to get things done in parallel. Single-threaded emulators (like most emulators) gain little or no benefit from additional CPUs. This partly ties in with the design of my OS, where as much as possible is meant to be done in parallel - in fact the major advantage of all distributed OSs (IMHO) is that of performing a large amount of work on a large number of computers in a small amount of time (rather than doing a large amount of work on a single computer in a large amount of time while other computers are mostly idle).

Thirdly, the compromise between performance and other factors. These other factors include debugging, where performance isn't as important, but an increase in the accuracy of the emulation can increase the number of things it can be used for. An example here would be trying to find cache thrashing problems, or trying to reduce re-entrancy lock contention on many-CPU computers. Another factor is that a multi-threaded emulator (if it works well) could highlight the features of my OS (distributed, asynchronous messaging, etc.). However, if the multi-threaded emulator doesn't work well (i.e. performance is horrible), or if I decide to ignore the goals of my OS (everything in parallel) and do a single-threaded emulator, then it could make my OS design look impractical to people who don't understand the internals of the emulator. I guess I should state clearly that the main problem with a multi-threaded emulator is that it requires a lot of interaction between the different threads (regardless of how you assign the work to the threads).

[continued in next post]