virtualization - single guest on multiple hosts
Hey!
-- everything written below is my understanding of the industry and virtualization today --
We typically run multiple guests on a single, powerful host. If a guest happens to host an important server, the host needs to run on a fairly high-end configuration. This may not always be possible or affordable for small organizations.
So I was thinking: if we could run one (or more) guests across more than one host, we'd still get to deploy an important and huge server (which itself is not distributed) in a guest which happens to be distributed.
Any comments/inputs/criticism?
Not sure if it has already been implemented. I am still thinking about this; I just wanted to toss the idea out and see what gives.
Cheers
prashant
Re: virtualization - single guest on multiple hosts
Yes, this is already being implemented.
I am not really sure whether this is clustering or SMP; there are also the concepts of HA and DRS, which are implemented.
I have also heard about server farms.
I may be wrong, but I have heard of something like this.
Learning a lot these days THANKS to OSdev users
Re: virtualization - single guest on multiple hosts
Hi,
I think the person who is doing something closest to this on the OS Dev boards is probably Brendan at This Web Page. Unfortunately, he seems to be doing a site update at the moment and I can't access his specifications. I'm sure he was doing something similar to distributed computing that isn't quite traditional distributed computing - correct me if I'm wrong, Brendan
Cheers,
Adam
Re: virtualization - single guest on multiple hosts
Hi,
AJ wrote: I think the person who is doing something closest to this on the OS Dev boards is probably Brendan at This Web Page. Unfortunately, he seems to be doing a site update at the moment and I can't access his specifications. I'm sure he was doing something similar to distributed computing that isn't quite traditional distributed computing - correct me if I'm wrong, Brendan
For me, processes run anywhere (on any computer within the cluster) and communicate with each other using messaging; the kernel routes messages to the receiver (regardless of which computer the receiver is running on) and processes don't need to care whether they're talking to something on the same computer or something on a remote computer. On top of this there's a "peer to peer" distributed virtual file system (where any file can be on any disk/s on any computer). However, I know that a good idea implemented poorly is useless, and I have spent ages making sure I've got a good foundation to build everything else on (which is just another way of saying it doesn't work yet).
I have experimented with the idea of a distributed emulator, though; and sadly it mostly doesn't work well. The problem is finding a way to share the work involved without causing a massive amount of overhead trying to keep everything synchronized.
What I tried was one process that emulated RAM, with more separate processes that emulated CPUs (one process per emulated CPU). To avoid the need to use IPC for every emulated RAM access I also implemented emulated caches and my equivalent of MESI cache states, so that a process that emulates a CPU could (mostly) run without any IPC except for emulated cache misses. Despite this (and despite the fact that I was using processes running on the same computer without the additional overhead/latency of ethernet/networking hardware) it was slow. Also note that keeping emulated RAM in sync isn't the only problem - you need to keep the emulated CPUs and other emulated hardware (roughly) in time too, which for me meant time control messages that increased IPC (and reduced performance more).
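To make the idea concrete, here is a rough sketch (illustrative only, not the actual code) of how an emulated cache with MESI-like line states lets a CPU-emulating process avoid IPC on most guest memory accesses; ram_fetch_shared/ram_fetch_exclusive/ram_write_back stand in for whatever synchronous IPC the RAM-emulating process would provide:

/* Sketch: each emulated CPU keeps a small direct-mapped cache of guest RAM
   lines, tagged with MESI-like states, and only talks to the RAM process
   (via IPC) on a miss, an ownership upgrade, or a dirty eviction. */

#include <stdint.h>

#define LINE_SIZE   64
#define NUM_LINES   512

enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct cache_line {
    uint64_t        tag;                /* guest physical address of the line */
    enum line_state state;
    uint8_t         data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];

/* Assumed helpers: synchronous IPC to the process that emulates RAM. */
extern void ram_fetch_shared(uint64_t addr, void *buf);      /* read a line  */
extern void ram_fetch_exclusive(uint64_t addr, void *buf);   /* own a line   */
extern void ram_write_back(uint64_t addr, const void *buf);  /* flush a line */

static struct cache_line *lookup(uint64_t addr)
{
    uint64_t line_addr = addr & ~(uint64_t)(LINE_SIZE - 1);
    struct cache_line *line = &cache[(line_addr / LINE_SIZE) % NUM_LINES];

    if (line->state != INVALID && line->tag == line_addr)
        return line;                    /* hit: no IPC needed */

    if (line->state == MODIFIED)        /* evict a dirty line back to the RAM process */
        ram_write_back(line->tag, line->data);

    line->tag = line_addr;
    line->state = INVALID;
    return line;
}

uint8_t guest_read8(uint64_t addr)
{
    struct cache_line *line = lookup(addr);

    if (line->state == INVALID) {       /* miss: one IPC round trip */
        ram_fetch_shared(line->tag, line->data);
        line->state = SHARED;
    }
    return line->data[addr & (LINE_SIZE - 1)];
}

void guest_write8(uint64_t addr, uint8_t val)
{
    struct cache_line *line = lookup(addr);

    if (line->state != EXCLUSIVE && line->state != MODIFIED)
        ram_fetch_exclusive(line->tag, line->data);   /* other CPUs' copies get invalidated */

    line->data[addr & (LINE_SIZE - 1)] = val;
    line->state = MODIFIED;
}

Only misses, ownership upgrades and dirty evictions cross the process boundary - and even then there turn out to be far too many of them.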
There are other ways to share the work, but they either suffer from the same problem (IPC overhead costing more than potential gains) or don't share the work well (e.g. one process that emulates all CPUs and all RAM with a separate process that only emulates the I/O hub/devices, where one process does a huge amount of work while the other process does very little work, and where it won't scale to more than 2 processes).
Basically what I'm saying is that for distributed systems (and SMP for that matter) you get the best performance when the work done on one computer doesn't depend much on the work being done on another computer.
One way of doing this is "pipelining", where each computer does some stuff and sends the results to the next computer (which does more stuff and sends the results to the next computer, and so on). For an example of this, imagine a C compiler where the first computer parses the source code and compiles it into "intermediate language", the second computer optimizes the intermediate language, a third computer converts the intermediate language into assembly language, and the fourth computer creates the final binary. In this case each computer does a reasonable amount of work but there's very little communication between the computers (e.g. one "here's my output" message per computer/stage).
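A minimal sketch of one such pipeline stage (recv_from_prev_stage/send_to_next_stage/do_stage_work are assumed placeholders, e.g. thin wrappers around sockets) might look like this:

/* Sketch: each computer runs one stage in a loop; the only communication
   is one "here's my output" message per job. */

#include <stddef.h>

struct job { void *data; size_t len; };

extern int  recv_from_prev_stage(struct job *in);            /* blocks until input arrives */
extern void send_to_next_stage(const struct job *out);
extern void do_stage_work(const struct job *in, struct job *out);  /* e.g. optimize the IL */

void pipeline_stage_loop(void)
{
    struct job in, out;

    while (recv_from_prev_stage(&in) == 0) {
        do_stage_work(&in, &out);       /* the CPU-heavy part, entirely local */
        send_to_next_stage(&out);
    }
}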
Another method is "farming", where you've got a controller that splits a huge job into smaller pieces and sends each piece to other computers to be processed, and then combines the results from these other computers. An example of this is video rendering farms, where a master computer asks slave computers to generate one frame each and combines these frames into a movie.
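A rough sketch of the farming pattern (illustrative only; send_frame_request/recv_rendered_frame/store_frame are assumed helpers) could look like this:

/* Sketch: the master hands out frame numbers, collects rendered frames,
   and keeps every worker busy until the whole movie is done. */

#include <stddef.h>

#define NUM_WORKERS 8

extern void send_frame_request(int worker, int frame_number);
extern int  recv_rendered_frame(int *worker, void *frame_buf, size_t buf_len); /* returns frame number */
extern void store_frame(int frame_number, const void *frame_buf, size_t len);

void render_movie(int total_frames)
{
    static unsigned char frame_buf[1 << 20];
    int next_frame = 0, done = 0;

    /* Prime every worker with one frame each. */
    for (int w = 0; w < NUM_WORKERS && next_frame < total_frames; w++)
        send_frame_request(w, next_frame++);

    /* As each result comes back, give the now-idle worker the next frame. */
    while (done < total_frames) {
        int worker;
        int frame = recv_rendered_frame(&worker, frame_buf, sizeof(frame_buf));
        store_frame(frame, frame_buf, sizeof(frame_buf));
        done++;
        if (next_frame < total_frames)
            send_frame_request(worker, next_frame++);
    }
}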
For SMP there are similar problems, but you can use shared memory to minimize the communication costs. For distributed systems it is possible to simulate shared memory (e.g. fetch a page from a central "page manager" during page faults) but that needs to be implemented on top of lower level communication systems and therefore isn't a way to avoid the overhead of these lower level communication systems (it's still slow, and is typically even slower because processes have less control over it).
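As an illustration of how simulated shared memory sits on top of messaging, a very rough sketch of a page fault handler that fetches pages from a central "page manager" might look like this (request_page_from_manager and map_local_page are assumed helpers, not a real API):

/* Sketch: on a page fault, ask the page manager node for the page,
   map it locally, and let the faulting instruction retry. */

#include <stdint.h>

#define PAGE_SIZE 4096

extern void request_page_from_manager(uint64_t vaddr, void *page_buf); /* blocking network IPC */
extern void map_local_page(uint64_t vaddr, const void *page_buf, int writable);

void dsm_page_fault_handler(uint64_t fault_addr, int is_write)
{
    static uint8_t page_buf[PAGE_SIZE];
    uint64_t page_addr = fault_addr & ~(uint64_t)(PAGE_SIZE - 1);

    /* Every miss costs at least one network round trip - which is why
       simulated shared memory can't be faster than the messaging it sits on. */
    request_page_from_manager(page_addr, page_buf);
    map_local_page(page_addr, page_buf, is_write);
    /* On return, the faulting instruction is restarted and now succeeds. */
}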
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: virtualization - single guest on multiple hosts
Brendan wrote: What I tried was one process that emulated RAM, with more separate processes that emulated CPUs (one process per emulated CPU). To avoid the need to use IPC for every emulated RAM access I also implemented emulated caches and my equivalent of MESI cache states, so that a process that emulates a CPU could (mostly) run without any IPC except for emulated cache misses. Despite this (and despite the fact that I was using processes running on the same computer without the additional overhead/latency of ethernet/networking hardware) it was slow. Also note that keeping emulated RAM in sync isn't the only problem - you need to keep the emulated CPUs and other emulated hardware (roughly) in time too, which for me meant time control messages that increased IPC (and reduced performance more).
IMO, in the case of a non-paravirtualized guest OS, the RAM is not exactly emulated (we'd use page faults); nor is the CPU. I imagine something like NUMA. Also, exposing each host as a single CPU participating in this NUMA architecture would make sense for the guest too.
Keeping CPUs and hardware reasonably in sync is undoubtedly a hard problem...
--
prashant
Re: virtualization - single guest on multiple hosts
prashant wrote: IMO, in the case of a non-paravirtualized guest OS, the RAM is not exactly emulated (we'd use page faults); nor is the CPU. I imagine something like NUMA. Also, exposing each host as a single CPU participating in this NUMA architecture would make sense for the guest too.
RAM can be "unemulated" (where guest linear address space = host linear address space) in some situations, but not others. Situations where it doesn't work include "guest in long mode, host in protected mode", and situations where the guest can't have the entire linear address space (e.g. emulator code and data in the same address space causing conflicts when the guest needs to use the same addresses). Putting the emulator's code in a different address space would cost dearly (TLB flushes every time the emulator needs to do something). The other problem here is that it's not distributable - you can't really have "local guest linear address space = remote host computer's linear address space".
Using one process per NUMA domain is something I considered (e.g. several processes emulating a computer with several NUMA domains, where each process emulates the memory and CPUs that belong to its NUMA domain, and possibly where each of these processes is multi-threaded with one thread per CPU core). The end result of this would be a very high NUMA ratio (e.g. a large performance difference between accessing "close" RAM and accessing "distant" RAM), so it would probably still give bad performance because most OSs aren't very well optimized for NUMA (mostly because the most common form of NUMA is AMD platforms, where the NUMA ratio is very low). Basically you'd still need to emulate MESI cache states and you'd still have a lot of IPC overhead.
IMHO the only option for decent performance is to have emulated RAM and CPUs using shared memory to communicate (e.g. emulate RAM and CPUs with a single multi-threaded process). This doesn't mean that you couldn't have a distributed emulator though. For e.g. one computer could emulate RAM and CPUs, while other computers emulate some or all devices. It would be unbalanced though (lots of CPU load on one computer with other computers doing very little).
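A minimal sketch of that "single multi-threaded process" layout (illustrative only; run_emulated_cpu stands in for the actual interpreter or translator loop) might be:

/* Sketch: all emulated CPUs are host threads sharing one block of emulated
   RAM, so guest memory accesses are plain loads/stores (plus atomics where
   needed) instead of IPC. */

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define GUEST_RAM_SIZE  (256u * 1024 * 1024)
#define NUM_GUEST_CPUS  4

static uint8_t *guest_ram;                       /* shared by every emulated CPU */

extern void run_emulated_cpu(int cpu_number, uint8_t *ram, size_t ram_size);

static void *cpu_thread(void *arg)
{
    int cpu_number = (int)(intptr_t)arg;
    run_emulated_cpu(cpu_number, guest_ram, GUEST_RAM_SIZE);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_GUEST_CPUS];

    guest_ram = calloc(GUEST_RAM_SIZE, 1);
    if (!guest_ram)
        return 1;

    for (int i = 0; i < NUM_GUEST_CPUS; i++)
        pthread_create(&threads[i], NULL, cpu_thread, (void *)(intptr_t)i);
    for (int i = 0; i < NUM_GUEST_CPUS; i++)
        pthread_join(&threads[i], NULL);
    return 0;
}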
Also note that I'm talking about emulation/virtualization here, and not simulation. For simulation, accuracy matters much more than performance. If you accurately simulate things like caches and performance monitoring counters then the overhead of the simulation will increase, which would make distributing the load more worthwhile (but still not necessarily worthwhile).
Now for my conclusion: *don't* write an emulator. Instead write a modular system with several "engines", where some engines might be designed for speed (SVM, VMX, V86 and/or dynamic translation) while other engines might be designed for accuracy/simulation (interpreted and/or dynamic translation); some engines might be "single process" and some might be multi-process/distributable; etc. Then you'd have a set of modules for emulated devices that these engines can use, where each "device module" is a process that could be run on a remote computer. In addition, you could add support for this in your OS's device drivers, so that a "device module" could also be a real device (for e.g. so that the emulator could use a real network card, video card or disk controller, instead of using an emulated network card, video card or disk controller). If it's done right it should also be possible to make it cross-platform and do "mix and match" - for e.g. you could write a "PowerPC engine" for 80x86 and write an "80x86 engine" for PowerPC, then use "device modules" written for both architectures with either engine.
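To show roughly what such a modular split could look like (all names here are made up for illustration, not an existing API), each engine and device module might expose a small function-pointer interface:

/* Sketch: an interpreted engine, a VMX-based engine, an emulated network
   card and a pass-through real network card are all interchangeable behind
   the same two structs. */

#include <stdint.h>
#include <stddef.h>

struct vm;   /* opaque per-virtual-machine state */

struct engine_ops {
    const char *name;                               /* e.g. "x86-interpreter", "x86-vmx" */
    int  (*init)(struct vm *vm);
    int  (*run)(struct vm *vm);                     /* run until the guest touches a device */
    void (*shutdown)(struct vm *vm);
};

struct device_ops {
    const char *name;                               /* e.g. "emulated-nic", "real-nic" */
    int      (*init)(struct vm *vm);
    uint32_t (*io_read)(struct vm *vm, uint16_t port, int size);
    void     (*io_write)(struct vm *vm, uint16_t port, uint32_t value, int size);
    void     (*shutdown)(struct vm *vm);
};

/* A virtual machine is then just one engine plus a list of device modules;
   any device module could run as a separate (possibly remote) process that
   the engine reaches through the OS's normal messaging. */
struct vm_config {
    const struct engine_ops  *engine;
    const struct device_ops **devices;
    size_t                    num_devices;
};

The point of the split is that the expensive part (the engine) stays local and fast, while the parts that tolerate latency (devices) are the ones that get distributed.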
The obvious next step would be a cross-platform distributed OS, where any platform can be emulated on any other platform, and where any virtual machine can use emulated devices running on any platform and any real devices on any platform. Of course you'd need something like this to write something like this (and a few thousand spare programmers wouldn't hurt either)....
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: virtualization - single guest on multiple hosts
As Brendan said, it takes a lot of network traffic to keep the emulators in sync. Even then, I could not convince myself about correctness. It just doesn't seem practical.
Thanks.