TryHarder wrote:I don't think that this is a promising approach to deal with the problem. How would you run server applications that are supposed to perform some kind of infinite loop?
A server application would be decomposed into event handlers (which, in my case, would be the snippets). A connect() is one event, a read() from a socket is another, a send() another, and so on. Since clients all have different connection speeds, you can have thousands of connections each transferring data a little bit at a time. There is a trend right now to rewrite networking applications using an event-based model (with the libevent or libev libraries). One example is the nginx webserver: it is about 2-3x faster than apache, largely because it uses event-based networking.
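To make the decomposition concrete, here is a minimal sketch in C of what "one event, one handler" looks like: each connection-lifecycle step (connect/read/send) is an independent short-lived handler, and a single loop dispatches queued events to them. The names (ev_queue, dispatch_all, the counters) are illustrative, not any real library's API, and real handlers would of course do socket work instead of bumping counters.

```c
#include <assert.h>
#include <stddef.h>

/* Event-based decomposition sketch: each step of a connection's life
 * is its own handler ("snippet"); one loop dispatches queued events. */

typedef enum { EV_CONNECT, EV_READ, EV_SEND } ev_type;

typedef struct {
    ev_type type;
    int fd;              /* the connection this event belongs to */
} event;

#define QCAP 64
static event ev_queue[QCAP];
static size_t ev_head, ev_tail;

static int connects, reads, sends;   /* counters standing in for real work */

static void on_connect(event *e) { (void)e; connects++; }
static void on_read(event *e)    { (void)e; reads++; }
static void on_send(event *e)    { (void)e; sends++; }

static void push_event(ev_type t, int fd) {
    ev_queue[ev_tail % QCAP] = (event){ t, fd };
    ev_tail++;
}

/* Run every queued event to completion; each handler is short-lived,
 * like the "snippets" in the proposal, so no preemption is needed. */
static int dispatch_all(void) {
    int handled = 0;
    while (ev_head != ev_tail) {
        event *e = &ev_queue[ev_head % QCAP];
        ev_head++;
        switch (e->type) {
        case EV_CONNECT: on_connect(e); break;
        case EV_READ:    on_read(e);    break;
        case EV_SEND:    on_send(e);    break;
        }
        handled++;
    }
    return handled;
}
```

The point is that the dispatch loop itself never blocks on any one connection, which is what lets thousands of slow clients share one thread.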
And to perform an infinite loop: at the end of main(), you would issue a syscall that re-runs main() again in XX microseconds. I still don't have a clear idea of how it would work in detail, but it definitely has to avoid context switching.
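The "re-arm main()" idea can be sketched like this: instead of looping forever, the snippet does one unit of work and asks the kernel to run it again later. sys_run_after() is a hypothetical syscall; here it is simulated by a trivial host-side scheduler loop so the control flow can be seen without a kernel.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*snippet_fn)(void);

static snippet_fn pending;           /* at most one re-armed snippet */

/* Hypothetical syscall: ask the kernel to run fn again after usec.
 * Simulated here by just remembering the function pointer. */
static void sys_run_after(snippet_fn fn, unsigned usec) {
    (void)usec;                      /* a real kernel would start a timer */
    pending = fn;
}

static int iterations;

static void snippet_main(void) {
    iterations++;
    if (iterations < 5)              /* keep re-arming until done */
        sys_run_after(snippet_main, 100);
}

/* Stand-in for the kernel's timer-expiry loop. */
static void run_scheduler(void) {
    pending = snippet_main;
    while (pending) {
        snippet_fn fn = pending;
        pending = NULL;
        fn();                        /* no context switch: a plain call */
    }
}
```

Note that each re-run is an ordinary function call from the scheduler's point of view, which is exactly why no register/TLB context switch is needed.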
What about spinlocks?
Each snippet will have local in and out queues, and the kernel will handle its own in/out queue the same way as a normal process, but with privileges. Totally asynchronous.
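A per-snippet in/out queue pair could be as simple as two fixed-size single-producer/single-consumer rings, so the snippet and the kernel exchange messages without ever blocking each other. This is a sketch under assumed sizes and message types; a real SMP version would need the memory barriers noted in the comments.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAP 16   /* must be a power of two for the mask trick */

typedef struct {
    int buf[RING_CAP];
    size_t head;   /* consumer position */
    size_t tail;   /* producer position */
} ring;

static int ring_push(ring *r, int v) {
    if (r->tail - r->head == RING_CAP) return 0;  /* full */
    r->buf[r->tail & (RING_CAP - 1)] = v;
    r->tail++;   /* a real SMP version needs a release barrier here */
    return 1;
}

static int ring_pop(ring *r, int *out) {
    if (r->head == r->tail) return 0;             /* empty */
    *out = r->buf[r->head & (RING_CAP - 1)];
    r->head++;   /* and an acquire barrier here */
    return 1;
}

/* Each snippet owns two rings: */
typedef struct {
    ring in;    /* kernel -> snippet */
    ring out;   /* snippet -> kernel */
} snippet_queues;
```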
And what is considered "small" code?
You would define that when you launch the process. So, if you are running a genetic algorithm application, you might use a single process to mutate the whole generation (which is considered a time-consuming job) and give it a maximum execution time of 5 minutes. If you are running network driver code, it would be much smaller and run for milliseconds, only checking whether there is a packet available on the ring and, if so, marking a corresponding variable so that a higher-level app (a TCP or UDP stack) will run to process it. You would have to explicitly launch processes on different cores so they don't monopolize all the cores. This may be a problem on today's 4-core systems, but in a few years we could have 16- or 32-core systems for the price of today's 4-core ones. AMD has this vision; that's why they redesigned their microarchitecture with Bulldozer.
You'll have to measure the time that your code will be running, and that might depend on some external factors.
With this approach you will have to decompose the problem into independent tasks. This is required if you want to achieve high parallelism. For example, handling of one packet might happen on core 2 and another on core 8, because a previous task read the network driver, stored the number of incoming packets, and issued, say, 2 syscalls to the kernel to process the 2 incoming packets.
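That driver-then-workers decomposition can be sketched as follows: the driver snippet only counts packets on the RX ring and issues one "process this packet" request per packet; the requests are independent, so any free core can pick them up. sys_spawn_task() and the global task list are hypothetical stand-ins for the proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 32

typedef struct { int pkt_id; int done; } pkt_task;

static pkt_task tasks[MAX_TASKS];
static size_t ntasks;

/* Hypothetical syscall: ask the kernel to run packet processing later,
 * on whichever core is free. */
static void sys_spawn_task(int pkt_id) {
    tasks[ntasks++] = (pkt_task){ pkt_id, 0 };
}

/* Driver snippet: check the RX ring, spawn one task per packet,
 * then finish immediately. */
static void driver_snippet(int packets_on_ring) {
    for (int i = 0; i < packets_on_ring; i++)
        sys_spawn_task(i);
}

/* Worker snippet: processes one task; since tasks share no state,
 * they could run on any core in any order. */
static size_t run_workers(void) {
    size_t done = 0;
    for (size_t i = 0; i < ntasks; i++) {
        tasks[i].done = 1;   /* stand-in for TCP/UDP stack work */
        done++;
    }
    return done;
}
```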
The solution that comes to my mind is to keep the number of "heavy" processes (those that make use of a big set of registers) moderate per processor. "Light" processes (which use only general-purpose registers, for example) may be switched efficiently, without touching the rest of the registers. If you have only 1 "heavy" process per processor, saving/restoring a big part of the register set is unnecessary.
Well, then it is similar to the cooperative multitasking that rdos mentioned. It's just that you are doing it at a finer granularity. Actually, I will have to group the processes so that small snippets run on some cores and the others run on other cores.
However, IMO the biggest penalty of a context switch is the TLB flush, so saving/restoring even the large register sets is not the bottleneck here.
Good point. Another reason to use cooperative multitasking.