TryHarder wrote:I don't think that this is a promising approach to deal with the problem. How would you run server applications that are supposed to perform some kind of infinite loop?
A server application would be decomposed into event handlers (which, in my case, would be the snippets). A connect() is one event, a read() from a socket is another, a send() another, and so on. Since clients all have different connection speeds, you can have thousands of connections each transferring data a little bit at a time. There is a trend right now to rewrite networking applications using an event-based model (with the libevent or libev libraries). One example is the nginx webserver: it is about 2-3x faster than apache, largely because it uses event-based networking.
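To make the decomposition concrete, here is a minimal sketch in C of what "one event, one handler" looks like: each connection-lifecycle step (connect/read/send) is an independent short-lived handler, and a single loop dispatches queued events to them. The names (ev_queue, dispatch_all, the counters) are illustrative, not any real library's API, and real handlers would of course do socket work instead of bumping counters.

```c
#include <assert.h>
#include <stddef.h>

/* Event-based decomposition sketch: each step of a connection's life
 * is its own handler ("snippet"); one loop dispatches queued events. */

typedef enum { EV_CONNECT, EV_READ, EV_SEND } ev_type;

typedef struct {
    ev_type type;
    int fd;              /* the connection this event belongs to */
} event;

#define QCAP 64
static event ev_queue[QCAP];
static size_t ev_head, ev_tail;

static int connects, reads, sends;   /* counters standing in for real work */

static void on_connect(event *e) { (void)e; connects++; }
static void on_read(event *e)    { (void)e; reads++; }
static void on_send(event *e)    { (void)e; sends++; }

static void push_event(ev_type t, int fd) {
    ev_queue[ev_tail % QCAP] = (event){ t, fd };
    ev_tail++;
}

/* Run every queued event to completion; each handler is short-lived,
 * like the "snippets" in the proposal, so no preemption is needed. */
static int dispatch_all(void) {
    int handled = 0;
    while (ev_head != ev_tail) {
        event *e = &ev_queue[ev_head % QCAP];
        ev_head++;
        switch (e->type) {
        case EV_CONNECT: on_connect(e); break;
        case EV_READ:    on_read(e);    break;
        case EV_SEND:    on_send(e);    break;
        }
        handled++;
    }
    return handled;
}
```

The point is that the dispatch loop itself never blocks on any one connection, which is what lets thousands of slow clients share one thread.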
And to perform an infinite loop: at the end of main(), you would issue a syscall that re-runs main() again in XX microseconds. I still don't have a clear idea of how it would work in detail, but it definitely has to avoid context switching.
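The "re-arm main()" idea can be sketched like this: instead of looping forever, the snippet does one unit of work and asks the kernel to run it again later. sys_run_after() is a hypothetical syscall; here it is simulated by a trivial host-side scheduler loop so the control flow can be seen without a kernel.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*snippet_fn)(void);

static snippet_fn pending;           /* at most one re-armed snippet */

/* Hypothetical syscall: ask the kernel to run fn again after usec.
 * Simulated here by just remembering the function pointer. */
static void sys_run_after(snippet_fn fn, unsigned usec) {
    (void)usec;                      /* a real kernel would start a timer */
    pending = fn;
}

static int iterations;

static void snippet_main(void) {
    iterations++;
    if (iterations < 5)              /* keep re-arming until done */
        sys_run_after(snippet_main, 100);
}

/* Stand-in for the kernel's timer-expiry loop. */
static void run_scheduler(void) {
    pending = snippet_main;
    while (pending) {
        snippet_fn fn = pending;
        pending = NULL;
        fn();                        /* no context switch: a plain call */
    }
}
```

Note that each re-run is an ordinary function call from the scheduler's point of view, which is exactly why no register/TLB context switch is needed.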
What about spinlocks?
Each snippet will have local in and out queues, and the kernel will handle its own in/out queue the same way as a normal process, but with privileges. Totally asynchronous.
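A per-snippet in/out queue pair could be as simple as two fixed-size single-producer/single-consumer rings, so the snippet and the kernel exchange messages without ever blocking each other. This is a sketch under assumed sizes and message types; a real SMP version would need the memory barriers noted in the comments.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAP 16   /* must be a power of two for the mask trick */

typedef struct {
    int buf[RING_CAP];
    size_t head;   /* consumer position */
    size_t tail;   /* producer position */
} ring;

static int ring_push(ring *r, int v) {
    if (r->tail - r->head == RING_CAP) return 0;  /* full */
    r->buf[r->tail & (RING_CAP - 1)] = v;
    r->tail++;   /* a real SMP version needs a release barrier here */
    return 1;
}

static int ring_pop(ring *r, int *out) {
    if (r->head == r->tail) return 0;             /* empty */
    *out = r->buf[r->head & (RING_CAP - 1)];
    r->head++;   /* and an acquire barrier here */
    return 1;
}

/* Each snippet owns two rings: */
typedef struct {
    ring in;    /* kernel -> snippet */
    ring out;   /* snippet -> kernel */
} snippet_queues;
```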
And what is considered "small" code?
You would define that when you launch the process. So, if you are running a genetic algorithm application, you might use a single process to mutate the whole generation (which is considered a time-consuming job) and give it a maximum execution time of 5 minutes. If you are running network driver code, it would be much smaller and run for milliseconds, only checking whether there is a packet available on the ring and, if so, marking a corresponding variable so that a higher-level app (a TCP or UDP stack) will run to process it. You would have to explicitly launch processes on different cores so they don't monopolize all the cores. This may be a problem on today's 4-core systems, but in a few years we could have 16- or 32-core systems for the price of today's 4-core ones. AMD has this vision; that's why they redesigned their microarchitecture with Bulldozer.
You'll have to measure the time that your code will be running, and that might depend on some external factors.
With this approach you will have to decompose the problem into independent tasks. This is required if you want to achieve high parallelism. For example, handling of one packet might happen on core 2 and another on core 8, because a previous task read the network driver, stored the number of incoming packets, and issued, say, 2 syscalls to the kernel to process the 2 incoming packets.
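That driver-then-workers decomposition can be sketched as follows: the driver snippet only counts packets on the RX ring and issues one "process this packet" request per packet; the requests are independent, so any free core can pick them up. sys_spawn_task() and the global task list are hypothetical stand-ins for the proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 32

typedef struct { int pkt_id; int done; } pkt_task;

static pkt_task tasks[MAX_TASKS];
static size_t ntasks;

/* Hypothetical syscall: ask the kernel to run packet processing later,
 * on whichever core is free. */
static void sys_spawn_task(int pkt_id) {
    tasks[ntasks++] = (pkt_task){ pkt_id, 0 };
}

/* Driver snippet: check the RX ring, spawn one task per packet,
 * then finish immediately. */
static void driver_snippet(int packets_on_ring) {
    for (int i = 0; i < packets_on_ring; i++)
        sys_spawn_task(i);
}

/* Worker snippet: processes one task; since tasks share no state,
 * they could run on any core in any order. */
static size_t run_workers(void) {
    size_t done = 0;
    for (size_t i = 0; i < ntasks; i++) {
        tasks[i].done = 1;   /* stand-in for TCP/UDP stack work */
        done++;
    }
    return done;
}
```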
The solution that comes to my mind is to keep the number of "heavy" processes (those that make use of a big set of registers) moderate per processor. "Light" processes (which use only general-purpose registers, for example) may be switched efficiently, without touching the rest of the registers. If you have only 1 "heavy" process per processor, saving/restoring a big part of the register set is unnecessary.
Well, then it is similar to the cooperative multitasking that rdos mentioned. It's just that you are doing it at a finer granularity. Actually, I will have to group the processes so that small snippets run on some cores and the others run on other cores.
However, IMO the biggest penalty of a context switch is the TLB flush, so saving/restoring even the large register sets is not the bottleneck here.
Good point. Another reason to use cooperative multitasking.