OSDev.org

Posted: **Mon Jul 11, 2005 4:49 am**

[continued from previous post]

PERFORMANCE

Firstly, I chose to investigate "option A" (with multi-threading, with one thread per CPU and one thread per memory controller) because it offers the most multi-threading and the worst single-CPU performance. For single-threaded emulators there's plenty of examples (Bochs, QEMU, etc).

For test code I implemented a very basic emulator, capable of emulating 2 instruction prefixes, 16 "mov reg, immediate" instructions and 3 jump instructions ("jmp far", "jmp rel8" and "jmp rel16/32") in real mode only. The emulator consists of a "CPU thread" which emulates the CPU itself, and a "Master IO thread" which is meant to emulate everything else (physical RAM, ROM, PICs, PIT, devices, etc). The idea here is that additional CPU threads can be created to emulate multi-CPU computers. The emulator is currently a pure emulator in that it doesn't do any dynamic translation.

The most obvious problem here is that for every memory reference the CPU thread needs data that is controlled by the Master IO thread. Even for option B (multi-threaded with one thread per emulated CPU only) there's similar messaging overhead involved with synchronizing access to RAM. To reduce the overhead of messaging I chose to implement a full model of a CPUs L2 cache, complete with adjustable cache size, associativity and cache line size. To simplify the description of this emulated cache, the entire emulated physical address space is split into N "cache lines" that are M bytes each (total physical address size is N * M bytes) and the emulated cache itself only holds P cache lines (total cache size is N * P). Associativity can be ignored for this discussion.

When a emulated CPU needs to access memory it looks for it in it's emulated cache. If it's not there you get a emulated cache miss, and the CPU thread sends a message to the Master IO thread asking for the cache line. The Master IO thread retrieves the cache line from it's emulated physical address space and sends a return message containing the data in that cache line back to the CPU thread. As you can hopefully see, the main concern here is the cost of an emulated cache miss, which is going to be the single largest performance problem of "Option A".

[continued in next post]

Posted: **Mon Jul 11, 2005 4:51 am**

[continued from previous post]

From tests and measurements I've determined that the cost of a virtual cache miss is about 600000 cycles (measured on the host CPU, a 300MHz AMD-K6, using RDTSC). This cost is composed of some emulator overhead, 2 task switches, 2 "send messages", 2 "receive messages", plus the cost of my OS's "allocation on demand" (page fault, fill a page with zeros and map it into the address space). If an emulated cache miss causes existing data to be evicted from the emulated cache then the same host memory is recycled - the allocation on demand cost only occurs once for each 4 KB of the emulated cache.

This (plus a few other factors) makes the emulator roughly 1500 times slower than the host CPU, depending on how many emulated cache misses there are (which depends on the guest code, cache size, etc). Eventually it'll be doing dynamic translation which will significantly improve this - besides the obvious performance benefits, when it's "replaying" a pre-translated block of instructions there will be no cache misses caused by fetching instructions at RIP/EIP/IP (and no checking for cache misses). It also depends on the size and configuration of the emulated cache (mostly the cache line size), and eventually the size of the buffer used for translated code. Once dynamic translation is added I'm expecting it to be about 2 or 3 times as slow as QEMU on a single CPU computer (depending on a whole pile of things, like the number of emulated cache misses).

Emulating single-CPU virtual computers using a multi-CPU host computer is a bit of a different story, as the cost of a cache miss will be halved and all real CPUs will be able to contribute to the emulation. The only test I've done here is on a Compaq Dual 400 MHz Intel PII emulating a single CPU, where the results indicated that the emulator's CPU thread took almost one third of the cycles of the previous single-CPU tests. To be honest I find this difference unusual, as the majority of the hosts CPU time goes to emulating the single virtual CPU in both cases, and halving the cache miss time doesn't explain the improvement. This has led me to wonder if my own OS's memory manager is playing tricks on me - on a single CPU it doesn't get time to clean any physical pages before the emulation starts, while on a dual-CPU computer the second CPU will be cleaning pages while the emulator is running, which reduces the cost of "allocation on demand" for the emulator's CPU thread (it gets "pre-cleaned" pages instead of dirty pages).

Emulating multi-CPU virtual computers using a multi-CPU host computer is where I'm expecting the largest difference, but my little emulator needs more work before I can do any real tests for this - give me another 4 days!

.

Cheers,

Brendan

Posted: **Mon Jul 11, 2005 5:39 am**

I'm at work now so I can not address everything now, but I will when I am able this evening.

The reason I asked how your Linear Algebra was is because of the terms Eigen Values and Eigen Vectors. Matrix multiplication being the crux of either of those, as well as normalization (scaled down so the sum of all parts equals 1).

You have this part correct:

Code: Select all

 A B C D
A1
B   1
C     1
D       1

Though it should be for your performance factors. I had a power point presentation that explains this better. I'll see if I can find it tonight.

I will also get the opportunity to read through the rest of your posts too and ask more poignant question so we can get the big picture. This exercise allows you to see the value from a big picture too.

Posted: **Mon Jul 11, 2005 8:01 am**

A crash course in eigenvectors and eigenvalues:

Eigenvectors are vectors which, when multiplied by a matrix, yield a constant multiple of themselves. If A is a matrix, x is a vector, and c is a scalar, and Ax=cx, then x is an eigenvector of A and c is an eigenvalue (c is usually labeled lambda).

In graphics-like terms, if you have a certain matrix, its eigenvectors are vectors which will be scaled by it.

For example, multiplying the matrix
[tt]
|1 5|
|2 4|
[/tt]
by the vector
[tt]
|1|
|1|
[/tt]
yields the vector
[tt]
|6|
|6|
[/tt]
which is 6 times the original vector. Here, <1,1> is an eigenvector and 6 is an eigenvalue. In fact, all constant multiples of <1,1> are also eigenvectors with eigenvalue 6 for this matrix.

If Ax=cx holds true for eigenvectors and eigenvalues, then
Ax-cx=0 (0 here is the zero vector),
(A-cI)x=0 (I is the identity matrix, allowing the subtraction of scalar c from matrix A).

An easy way to evaluate c is by taking the determinant of A-cI and setting it equal to 0: |A-cI|=0.

With the matrix above, that expands to:
(1-c)(4-c)-(2)(5)=0, which simplifies to
c^2-5c-6=0,
(c-6)(c+1)=0, and solving for c:
(c-6)=0 or (c+1)=0
c1=6, c2=-1
gives you your two eigenvalues (note that one of them is 6, as stated earlier).

Once you have the eigenvalues, (A-cI)x=0 can be used to find the eigenvectors in a rather straightforward way, by replacing c with each eigenvalue:
(A-6I)x=0, (A+I)x=0

Solving the first will yield vectors k1*<1,1>, solving the second (by Gaussian elimination, for example) will yield vectors k2*<-5,2>. Vectors that are proportional to <-5,2> should be scaled by -1 when right-multiplying the matrix:
[tt]
|1 5||-5|
|2 4|| 2|
[/tt]
yields
[tt]
| 5|
|-2|
[/tt]
so I hope that statement is believable.

Note that eigenvalues yield sets of eigenvectors (for example, 6 here yields all vectors proportional to <1,1>).

Note that eigenvalues can be repeated or imaginary as well as unique and real, leading to multiple sets of eigenvectors per values and various other "irregularities". This was an example of right eigenvectors (the eigenvector was right-multipliying the matrix).

However, I am not sure, from smiddy's description, of how eigenvectors are being used in the process. Nevertheless, this will be useful one way or another (if my explanation is understandable).

Posted: **Mon Jul 11, 2005 8:40 am**

Nicely done Phugoid!

Just a comment:

Note that eigenvalues can be repeated or imaginary as well as unique and real, leading to multiple sets of eigenvectors per values and various other "irregularities". This was an example of right eigenvectors (the eigenvector was right-multipliying the matrix).

Irregularities are only that if they are unexpected. For example, in Digital Signal Processing (DSP) and dealing with the Frequency domain, it is expected to have pairings of real and complex numbers.

In this situation, there should not be any complex values.

Hey thanks for removing some of my cobwebs, it has been a while since I did work with matrices.

I will give a better explanation this evening.

Posted: **Mon Jul 11, 2005 9:06 am**

Agreed. They tried to make me into an electrical engineer at one point, and what I called irregularities were more like regularities to me

However, here these would indeed be irregularities to be generally avoided.

Posted: **Tue Jul 12, 2005 9:04 am**

Hi,

Thanks Smiddy & Phugoid - the eigans make much more sense

.

As you know I've been working on a prototype multi-threaded emulator, so that I can figure out if the performance is very badly effected by the syncronization/communication needed between the threads. My thinking here is that if the multi-threaded prototype emulator doesn't run faster than a single-threaded emulator on a multi-CPU computer then there's no point worrying about the multi-threaded options at all.

I fully expect that emulating a multi-CPU computer on a multi-CPU host computer is where most of the benefits are, so I've been going flat out trying to get that working. The good news is that I've implemented this now, and the emulated multi-CPU computer is working nicely on a real single-CPU computer.

The bad news is that my OS is having problems under load when run on a multi-CPU computer - it's failing to send an EOI in response to an IRQ somehow (the same bug mentioned in the "OS testing forums" that I thought was fixed). Because of this I can't get decent performance stastics from the emulator.

Anyway, I'm postponing emulator research until I find the problem with my OS. The last attempt at trying to find this problem took almost 2 weeks (and it seems this only reduced the chance of problems rather than fixing them).

As I've previously exhausted most possible causes of the problem I may end up rewriting half of the scheduler instead of doing another endless search (there's a few changes I want to make anyway, and I can't think of anywhere else the problem may be).

Thanks,

Brendan

Posted: **Thu Jul 14, 2005 5:14 pm**

Hi,

Ok, I got past my scheduler's "interrupts disabled too long" bug by reducing the frequency of IRQ 8 again (a temporary fix that allows me to get emulator performance results). Now I've got more results from the emulator.

To save everyone from in-depth details I'll summerize. Basically the proto-type emulator is running between 500 to 1500 times slower than the host CPU/s (depending partly on how I configure the emulator). The key for a multi-threaded emulator seems to be to keep all available host CPUs busy (I get my best results from emulating a 4 CPU computer on a dual CPU host computer) and to minimize the number messages where possible - this is a tricky balancing act considering you often need more messaging to keep the CPUs busy.

Considering that I'm getting performance close to Bochs one single-CPU computers (and exceeding Bochs on multi-CPU computers, where Bochs can't make use of the extra CPU/s) I figure it's worth the time it'd take to add dynamic translation to the proto-type emulator.

As for my pesky "interrupts disabled too long" bug, the part of my scheduler that looks for the best thread to run will need to be interruptable and partially rewritten - "variable frequency" scheduling is nice but "variable qTime" is more efficient. Those changes can wait until after I've experimented with dynamic translation - I'm just glad it doesn't appear to be a more difficult problem

.

Cheers,

Brendan

OSDev.org

Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers

Re:Hypervisors, emulators & device drivers