
[Random idea] An OS for HPC and/or cloud computing

Posted: Wed Sep 16, 2009 1:41 am
by TSeeker
I've been thinking (yes, that sometimes happens) lately about trying to write an OS aimed at high-performance computing and/or so-called "cloud" computing.

What got me started is that, in the case of clusters which basically do nothing but execute a lot of chunks of code in a massively parallel manner, there is just no use for most common OS features: filesystems (arguably, a hash-based, distributed database is a filesystem of sorts, but nothing like a "true" FS), multitasking (with the exception of a few core tasks), etc.

A single node should just execute whatever computations it has to run while avoiding context switches as much as possible. User tasks would be serialised on a node. If something needs to be sent/retrieved from the network, fine - the user task blocks until it is done. The same goes for any disk-related action.

What the kernel (or some high-level servers, depending on whether a microkernel is used or not - given the goals, I have to say I'd rather not go for that) needs to handle is:
[*]networking, which is the most important part,
[*]"raw" hard disk storage,
[*]interface with e.g. GPGPUs.

The intermediate layer would implement:
[*]a job queue,
[*]a distributed, hash-based database,
[*]an administration interface (either remote or through a console).

Finally, at the same level as user tasks, the system would include a compiler. Source code would be fed through the administration interface, signed using e.g. an x509 certificate. Of course, general and custom libraries would be made available to user tasks.

As the title said - this is a random idea, and it's still sketchy, even for me... so please discuss, flame, etc.

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Wed Sep 16, 2009 4:33 am
by NickJohnson
Interesting idea. You seem to be describing what is essentially a batch system, which has been done, but one that is used in a massively parallel formation with others. It could work, but it would have to be done exclusively at a large scale. If you have even slightly more tasks than you have nodes, you will get major latency on a few of them, because you can run exactly as many tasks in parallel as you have nodes. You're (most likely) not going to get funds to build a massive cluster yourself, and it's pretty hard to develop something that people with massive clusters will actually want to use, as a hobby.

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Wed Sep 16, 2009 4:45 am
by TSeeker
NickJohnson wrote:Interesting idea. You seem to be describing what is essentially a batch system, which has been done, but one that is used in a massively parallel formation with others.
Yes, that is the idea.
NickJohnson wrote:It could work, but it would have to be done exclusively at a large scale. If you have even slightly more tasks than you have nodes, you will get major latency on a few of them, because you can run exactly as many tasks in parallel as you have nodes.
True again. To reduce the latency, one could switch to the next task in the queue whenever the current task starts waiting for something, usually network resources (the waiting task would be put back in the queue, only starting to run again if the "new current task" is done or entered a waiting state).
NickJohnson wrote:You're (most likely) not going to get funds to build a massive cluster yourself,
While one can dream, this is probably never going to happen ;->
NickJohnson wrote:and it's pretty hard to develop something people with massive clusters will actually want to use, as a hobby.
Actually, if I have something that works, I can probably get access to a medium-sized cluster. While it is not a really massive cluster, it's still a few hundred heterogeneous nodes.

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Wed Sep 16, 2009 1:39 pm
by Brendan
Hi,

Probably the biggest problems here are fault tolerance and scalability.

If you've got 100 computers working on a large problem for a few weeks, and one of those computers crashes after 20 days, then you don't want to discard all the work already done and restart the entire system.

For scalability, it's not too different to a good server OS. In a server with 4 CPUs, you don't want 3 CPUs to sit around doing nothing while they wait for something to come from the fourth CPU. Multiply that by several orders of magnitude and throw networking latency on top.

For file systems both of these problems are major. For example, if you've got 500 computers with 4 CPUs each, and all of these CPUs are trying to use the same file system at the same time, then a single expensive file server (otherwise known as a single point of failure) probably won't be able to handle the load - you might need a distributed file system (for fault tolerance and scalability).
TSeeker wrote:
NickJohnson wrote:You're (most likely) not going to get funds to build a massive cluster yourself,
While one can dream, this is probably never going to happen ;->
If you can get something working on a tiny cluster (e.g. 2 computers), then I could test it on a small cluster (e.g. 25 computers), and this might give you enough information to predict how it'd behave on a much larger cluster.

Also, don't dismiss thin clients - all you need is "CPU, RAM and networking" (for testing purposes, you don't need the fastest CPUs possible or lots of GiB of RAM per computer). When companies upgrade you can get batches of 20 to 30 second-hand thin clients going cheap (like $10 each if you buy in bulk), and you'd probably be able to build a cluster of 50 thin clients for less than you'd pay for one good new computer without much problem. In this case, the biggest cost isn't the computers - it's all of the "smaller" things (e.g. network and power cables, and Ethernet switches) that tend to add up. If you work it out at an average of $25 per node ($10 for a second-hand thin client and $15 for cabling and Ethernet switches), then $5000 is very cheap for a 200-node cluster.


Cheers,

Brendan

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Wed Sep 16, 2009 2:10 pm
by TSeeker
Brendan wrote:Hi,

Probably the biggest problems here are fault tolerance and scalability.
Agreed. A task (that is, a part of a larger job) crashing shouldn't cause the whole job to abort, as the crash might be related to hardware or networking issues. However, if the exact same task crashes on different nodes, then after quite a few attempts it should be dismissed, and the job along with it. Then there is also the problem of managing the jobs themselves, which should probably be replicated on multiple nodes.
Brendan wrote:For scalability, it's not too different to a good server OS. In a server with 4 CPUs, you don't want 3 CPUs to sit around doing nothing while they wait for something to come from the fourth CPU. Multiply that by several orders of magnitude and throw networking latency on top.
Yes, although I was considering testing using a few relatively old PCs (with different speeds and capabilities) instead, as they would be somewhat more heterogeneous, not to mention "failure-prone".
Brendan wrote:For file systems both of these problems are major. For example, if you've got 500 computers with 4 CPUs each, and all of these CPUs are trying to use the same file system at the same time, then a single expensive file server (otherwise known as a single point of failure) probably won't be able to handle the load - you might need a distributed file system (for fault tolerance and scalability).
That's the reason why I wasn't even considering a filesystem per se but rather a distributed, hash-based DB in the style of Cassandra or BigTable. I don't think that having a hierarchical structure really matters in this case; as for access management, a job would be associated with a single DB, and any task in the job would have access to it.
Brendan wrote:If you can get something working on a tiny cluster (e.g. 2 computers), then I could test it on a small cluster (e.g. 25 computers), and this might give you enough information to predict how it'd behave on a much larger cluster.
I still need to write the thing (or, for that matter, give it a lot more thought) - but thanks for the offer ;-)
Brendan wrote:Also, don't dismiss thin clients - all you need is "CPU, RAM and networking" (for testing purposes, you don't need the fastest CPUs possible or lots of GiB of RAM per computer). When companies upgrade you can get batches of 20 to 30 second-hand thin clients going cheap (like $10 each if you buy in bulk), and you'd probably be able to build a cluster of 50 thin clients for less than you'd pay for one good new computer without much problem. In this case, the biggest cost isn't the computers - it's all of the "smaller" things (e.g. network and power cables, and Ethernet switches) that tend to add up. If you work it out at an average of $25 per node ($10 for a second-hand thin client and $15 for cabling and Ethernet switches), then $5000 is very cheap for a 200-node cluster.
I had considered a similar solution for small-scale tests... but yeah, you're right - although, having had 20 computers running here at one point, I don't want to know the size of the electricity bill with 200 thin clients and the appropriate number of switches ;->

Thanks!

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Wed Sep 16, 2009 5:17 pm
by Brendan
Hi,
TSeeker wrote:
Brendan wrote:For file systems both of these problems are major. For example, if you've got 500 computers with 4 CPUs each, and all of these CPUs are trying to use the same file system at the same time, then a single expensive file server (otherwise known as a single point of failure) probably won't be able to handle the load - you might need a distributed file system (for fault tolerance and scalability).
That's the reason why I wasn't even considering a filesystem per se but rather a distributed, hash-based DB in the style of Cassandra or BigTable. I don't think that having a hierarchical structure really matters in this case; as for access management, a job would be associated with a single DB, and any task in the job would have access to it.
IMHO anything that is used to store "groups of bytes" on storage devices, and that allows those "groups of bytes" to be found again later, is a file system, regardless of all the other implementation details, and regardless of what terminology people use to describe it... :)
TSeeker wrote:
Brendan wrote:Also, don't dismiss thin clients - all you need is "CPU, RAM and networking" (for testing purposes, you don't need the fastest CPUs possible or lots of GiB of RAM per computer). When companies upgrade you can get batches of 20 to 30 second-hand thin clients going cheap (like $10 each if you buy in bulk), and you'd probably be able to build a cluster of 50 thin clients for less than you'd pay for one good new computer without much problem. In this case, the biggest cost isn't the computers - it's all of the "smaller" things (e.g. network and power cables, and Ethernet switches) that tend to add up. If you work it out at an average of $25 per node ($10 for a second-hand thin client and $15 for cabling and Ethernet switches), then $5000 is very cheap for a 200-node cluster.
I had considered a similar solution for small-scale tests... but yeah, you're right - although, having had 20 computers running here at one point, I don't want to know the size of the electricity bill with 200 thin clients and the appropriate number of switches ;->
Average power consumption for one thin client would probably be around 30 watts. 240 of them (or 200 thin clients, a few desktop/servers and about 20 ethernet switches) would work out to about 7200 watts. For comparison, in Australia 7200 watts is the equivalent of having a typical toaster, kettle and blow heater running at the same time.


Cheers,

Brendan

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Thu Sep 17, 2009 9:36 am
by IanSeyler
This is exactly where I intend to take BareMetal OS.

I would like to see a cluster of minimal (just CPU, RAM, motherboard, and network/InfiniBand) 64-bit computers with no multitasking, booting via PXE and running whatever task is thrown at each node. A "controller" computer would also be on that network to manage the nodes, send the jobs, receive the results, etc.

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Thu Sep 17, 2009 9:58 am
by TSeeker
iseyler wrote:A "controller" computer would also be on that network to manage the nodes, send the jobs, receive the results, etc.
IMHO it would be safer to have multiple such computers. As Brendan said, you don't want to lose the results of weeks of computation - which could (and therefore, would) happen if your controller went down.

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Thu Sep 17, 2009 11:03 am
by IanSeyler
Agreed. The "controller" would probably be running Linux, so HA won't be so bad to set up. The nodes should also hold on to the results until they are able to send them out successfully.

Also, the nodes would talk to each other and the controller via raw Ethernet frames. No TCP/IP.

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Fri Sep 18, 2009 3:18 am
by polas
An interesting topic! I have just completed developing a parallel programming language; as part of that, I had access to a (smallish) cluster of around 1000 nodes, and I also ported quite a few parallel applications to this language, so I have a couple of points which might make for an interesting discussion on the OS topic.
Brendan wrote: Probably the biggest problems here are fault tolerance

If you've got 100 computers working on a large problem for a few weeks, and one of those computers crashes after 20 days, then you don't want to discard all the work already done and restart the entire system.
I am not sure this is so much an issue - all the non-trivial parallel applications I have worked with have some mechanism to periodically store (and, if required, restart from) their current state; not only is this a good backup mechanism, but it also allows people to analyse the current state of the system. Of course, if your OS could provide services which improved reliability then that would probably be advantageous (as long as the performance hit was not too great), but I personally think this is more the job of the end parallel programmer than of the OS developer.
TSeeker wrote: User tasks would be serialised on a node
I don't think that you can get away from multitasking on this one - there are a couple of reasons why. Firstly (and this does happen), you may have an app which contains 100 processes, but your cluster is pretty busy and you only have 50 processors free. Instead of waiting hours for all 100 to become available, it is often best to run the app on 50 processors, each with 2 processes. Additionally, if you are going to take advantage of SMPs (a must in this case), I would guess you would need to consider many multitasking issues anyway.

Portability is another issue here, I think - I have seen clusters which contain many different types of machine (which can be an absolute nightmare from the end user's point of view), but even small differences (such as some of our machines being connected via gigabit Ethernet and some by Myrinet) will require extra drivers and development time to support.

A current, popular solution is to run some *nix-based system (we use SUSE Linux on our cluster, but an older one I used to use ran Solaris) with a queue submission system. These queue submission systems are actually quite advanced, allowing jobs to "wrap" around the nodes, etc. There are quite a few tools and libraries out there which are absolutely essential to parallel computing. You have mentioned a compiler; you will also need to port over libraries implementing standards such as MPI, and (annoyingly, IMHO) support languages like Fortran, HPF, etc. for the scientists.

I myself have actually started work on a hosted parallel OS, but my aims and objectives are quite different. My "OS" runs on top of the existing machine's OS, and is aimed at research. The fast development time associated with this means that it is quite easy to experiment with different ideas (such as parallel filesystems) at this level, with the view that the lessons learnt from my "OS" could at a later date be incorporated into a real OS. A secondary aim of my project (which is some way from being met) is that it could run on many machines and act as a "sandbox" in which parallel codes could be deployed and run without the machine owner's interaction (but obviously with their permission).

As I say, I think you have a different aim - an increase in performance and a saving of memory (both of which are equally important when considering parallel computing). For your idea, I suspect many cluster owners would be hard pressed to get rid of their existing OSes in favour of yours (unless yours had a great deal of support for things). Instead, I think it would be very interesting if it could answer a research question: what is the saving, in performance and memory footprint, of deploying a specifically tailored parallel OS over a more general-purpose one? If you can answer that question convincingly then I think it would be quite an attractive solution to some people. Making your OS flexible enough that it is easy to add and remove things (such as changing the filesystem, as already discussed) would be a good point too, I think.

A hugely interesting project anyway! I think that is all I have to say for the moment, sorry it is so long!

Nick

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Fri Sep 18, 2009 4:24 am
by TSeeker
I'm a bit sick - sorry if the quality of my reply is not on par with yours; I'm running at 5% capacity.
polas wrote:Of course, if your OS could provide services which improved reliability then that would probably be advantageous (as long as the performance hit was not too great), but I personally think this is more the job of the end parallel programmer than of the OS developer.
Writing an OS specifically for such applications means, IMO, providing as much support as possible. I think having the OS provide additional reliability and redundancy, these capabilities being set through configuration, is a requirement - although maybe not for a first prototype.
polas wrote:
TSeeker wrote:User tasks would be serialised on a node
I don't think that you can get away from multitasking on this one - there are a couple of reasons why. Firstly (and this does happen), you may have an app which contains 100 processes, but your cluster is pretty busy and you only have 50 processors free. Instead of waiting hours for all 100 to become available, it is often best to run the app on 50 processors, each with 2 processes. Additionally, if you are going to take advantage of SMPs (a must in this case), I would guess you would need to consider many multitasking issues anyway.
I misused the term "node", as I actually meant a CPU or core. I see your point w.r.t. busy clusters - however, instead of running multiple tasks on the same CPU, the cluster could "self-balance", moving tasks between CPU queues. In addition, most tasks will need to wait for storage or communications at one point or another, in which case they will be requeued in favour of another task. I think I failed to explain my idea correctly - it is scheduler-driven multitasking that I want to avoid.
polas wrote:Portability is another issue I think here - I have seen clusters which contain many different types of machine (which can be an absolute nightmare from the end user's point of view), but even small differences (such as some of our machines are connected via gigabit ethernet and some by myrinet) will require extra drivers and development time to support.
Agreed. On the other hand, there is no need to support exotic laptop hardware or video hardware. I don't see this as being too different from any relatively portable OS. The only real constraint is that the system should not rely too much on CPU-specific capabilities.
polas wrote:There are quite a few tools and libraries out there which are absolutely essential to parallel computing. You have mentioned a compiler; you will also need to port over libraries implementing standards such as MPI, and (annoyingly, IMHO) support languages like Fortran, HPF, etc. for the scientists.
As far as communication standards are concerned - agreed. Regarding support for various languages, I'd rather avoid that by having the OS-provided compiler support only one language, and providing separate programs, run on client systems, that "translate" other languages into the OS's language.
polas wrote:I myself have actually started work on a hosted parallel OS, but my aims and objectives are quite different. My "OS" runs on top of the existing machine's OS, and is aimed at research. The fast development time associated with this means that it is quite easy to experiment with different ideas (such as parallel filesystems) at this level, with the view that the lessons learnt from my "OS" could at a later date be incorporated into a real OS. A secondary aim of my project (which is some way from being met) is that it could run on many machines and act as a "sandbox" in which parallel codes could be deployed and run without the machine owner's interaction (but obviously with their permission).
I would be quite interested in learning more about your research project. Any papers about it yet?
polas wrote:As I say, I think you have a different aim - [...] (unless yours had a great deal of support for things.)
I agree.
polas wrote:Instead, I think it would be very interesting if it could answer a research question: what is the saving, in performance and memory footprint, of deploying a specifically tailored parallel OS over a more general-purpose one?
That clearly requires more brainpower than I have right now... and a working prototype.
polas wrote:Making your OS flexible enough so that it is easy to add and remove things (such as change the filesystem as already discussed) I think would be a good point too.
Hadn't given that part much thought... It'd be good if the OS came as a kit - you'd "prepare" it for a set of nodes, and install it the way you prepared it, rather than having a general blob that loads stuff/checks for stuff on the fly.
polas wrote:A hugely interesting project anyway!
I'm quite interested in your input on this one, as you are a specialist in that specific domain! Thanks!
polas wrote:sorry it is so long!
Not a problem - quite the contrary :)

Re: [Random idea] An OS for HPC and/or cloud computing

Posted: Fri Sep 18, 2009 6:39 am
by Brendan
Hi,
polas wrote:
Brendan wrote:Probably the biggest problems here are fault tolerance

If you've got 100 computers working on a large problem for a few weeks, and one of those computers crashes after 20 days, then you don't want to discard all the work already done and restart the entire system.
I am not sure this is so much an issue - all the non-trivial parallel applications I have worked with have some mechanism to periodically store (and, if required, restart from) their current state; not only is this a good backup mechanism, but it also allows people to analyse the current state of the system. Of course, if your OS could provide services which improved reliability then that would probably be advantageous (as long as the performance hit was not too great), but I personally think this is more the job of the end parallel programmer than of the OS developer.
Consider something like file sharing, and file locking in a distributed OS (with a distributed file system). Avoiding single points of failure is hard (unless you don't do it and push the complexity up to a higher level - e.g. into middleware or into the applications).
TSeeker wrote:
polas wrote:
TSeeker wrote:User tasks would be serialised on a node
I don't think that you can get away from multitasking on this one - there are a couple of reasons why. Firstly (and this does happen), you may have an app which contains 100 processes, but your cluster is pretty busy and you only have 50 processors free. Instead of waiting hours for all 100 to become available, it is often best to run the app on 50 processors, each with 2 processes. Additionally, if you are going to take advantage of SMPs (a must in this case), I would guess you would need to consider many multitasking issues anyway.
I misused the term "node", as I actually meant a CPU or core. I see your point w.r.t. busy clusters - however, instead of running multiple tasks on the same CPU, the cluster could "self-balance", moving tasks between CPU queues. In addition, most tasks will need to wait for storage or communications at one point or another, in which case they will be requeued in favour of another task. I think I failed to explain my idea correctly - it is scheduler-driven multitasking that I want to avoid.
I think the words you might be looking for are "batch processing"... ;)


Cheers,

Brendan