[Random idea] An OS for HPC and/or cloud computing
Posted: Wed Sep 16, 2009 1:41 am
I've been thinking (yes, that sometimes happens) lately about trying to write an OS aimed at high-performance computing and/or so-called "cloud" computing.
The reason that got me started is that clusters which basically do nothing but execute a lot of chunks of code in a massively parallel manner have no use for most common OS features: filesystems (arguably, a hash-based, distributed database is a filesystem of sorts, but nothing like a "true" FS), multitasking (with the exception of a few core tasks), and so on.
A single node should just execute whatever computations it has to run while avoiding context switches as much as possible. User tasks would be serialised on a node. If something needs to be sent to or retrieved over the network, fine - the user task blocks until the transfer is done. The same goes for any disk-related action.
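To make that execution model concrete, here is a rough C sketch of what a node's main loop could look like. All the names (job_queue_pop(), net_send_result(), struct job, ...) are made up for illustration, not an actual interface - the point is just that tasks run to completion and block on I/O, so the node needs no general-purpose scheduler:

[code]
#include <stddef.h>

/* One unit of user work: a compiled entry point plus its input;
 * the task fills in its result before returning. */
struct job {
    void (*entry)(void *arg);
    void *arg;
    void *result;
    size_t result_len;
};

/* Hypothetical primitives the node runtime would provide. */
struct job *job_queue_pop(void);                   /* blocks until a job arrives    */
void net_send_result(const void *buf, size_t len); /* blocks until the send is done */

static void node_main_loop(void)
{
    for (;;) {
        struct job *j = job_queue_pop();  /* wait for the next unit of work      */
        j->entry(j->arg);                 /* run it to completion, no preemption */
        net_send_result(j->result, j->result_len);
    }
}
[/code]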
What the kernel (or some high-level servers, depending on whether a microkernel is used - given the goals, I have to say I'd rather not go for one) needs to handle (a rough sketch of this interface follows the list):
[*]networking, which is the most important part,
[*]"raw" hard disk storage,
[*]an interface to accelerators, e.g. GPGPUs.
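Something like this could be the entire kernel-facing interface (again, every name here is hypothetical). Note how small it is compared with a general-purpose OS: no filesystem calls, no process management, just networking, raw block storage and accelerator access:

[code]
#include <stddef.h>
#include <stdint.h>

/* Networking - the most important part: blocking message passing
 * between nodes, addressed by node id. */
int net_send(uint32_t node_id, const void *buf, size_t len);
int net_recv(uint32_t *node_id, void *buf, size_t max_len, size_t *len);

/* "Raw" disk storage - no filesystem, just block I/O for the
 * distributed database layer to build on. */
int blk_read(uint64_t lba, void *buf, size_t blocks);
int blk_write(uint64_t lba, const void *buf, size_t blocks);

/* Accelerator (e.g. GPGPU) interface: load a kernel image, launch it, wait. */
int accel_load(const void *image, size_t len, uint32_t *handle);
int accel_launch(uint32_t handle, const void *args, size_t args_len);
int accel_wait(uint32_t handle);
[/code]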
The intermediate layer would implement (a sketch of the key-to-node routing follows the list):
[*]a job queue,
[*]a distributed, hash-based database,
[*]an administration interface (either remote or through a console).
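For the distributed, hash-based database, the core idea is just "hash the key, map it to a node, then use the raw block and network primitives underneath". A minimal sketch, using 64-bit FNV-1a only because it is short and well known:

[code]
#include <stdint.h>

/* 64-bit FNV-1a hash of a NUL-terminated key. */
static uint64_t fnv1a(const char *key)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (; *key; key++) {
        h ^= (unsigned char)*key;
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Which node is responsible for storing/serving this key? */
static uint32_t key_to_node(const char *key, uint32_t node_count)
{
    return (uint32_t)(fnv1a(key) % node_count);
}
[/code]

In practice something like consistent hashing would probably be preferable, so that nodes can join or leave the cluster without remapping every key, but the simple modulo version shows the idea.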
Finally, at the same level as user tasks, the system would include a compiler. Source code would be submitted through the administration interface, signed using e.g. an X.509 certificate, and only compiled once the signature checks out. Of course, general and custom libraries would be made available to user tasks.
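Just to illustrate the kind of check the administration interface would do before handing anything to the compiler, here is a userspace-style sketch using OpenSSL - purely for illustration, since the OS itself would need its own or a ported crypto library, and verify_submission() is a made-up name:

[code]
#include <stdio.h>
#include <openssl/pem.h>
#include <openssl/x509.h>
#include <openssl/evp.h>

/* Returns 1 if `sig` is a valid signature over `src` under the public key
 * in the submitter's certificate, 0 otherwise (error handling kept minimal). */
static int verify_submission(const char *cert_path,
                             const unsigned char *src, size_t src_len,
                             const unsigned char *sig, size_t sig_len)
{
    int ok = 0;
    FILE *fp = fopen(cert_path, "r");
    if (!fp)
        return 0;

    X509 *cert = PEM_read_X509(fp, NULL, NULL, NULL);
    fclose(fp);
    if (!cert)
        return 0;

    EVP_PKEY *pkey = X509_get_pubkey(cert);   /* submitter's public key */
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (pkey && ctx &&
        EVP_DigestVerifyInit(ctx, NULL, EVP_sha256(), NULL, pkey) == 1 &&
        EVP_DigestVerifyUpdate(ctx, src, src_len) == 1 &&
        EVP_DigestVerifyFinal(ctx, sig, sig_len) == 1)
        ok = 1;

    EVP_MD_CTX_free(ctx);
    EVP_PKEY_free(pkey);
    X509_free(cert);
    return ok;
}
[/code]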
As the title says - this is a random idea, and it's still sketchy, even for me... so please discuss, flame, etc.