Do Threads Remain the Best Concurrency Abstraction?
In my humble opinion, no.
In the beginning, we used processes -- address spaces, protection domains and units of concurrent execution packaged into one. Then we separated the concurrent execution from the address space and protection domain for performance reasons, yielding threads.
But processes and threads come with their costs: elaborate scheduling systems. No two operating systems have yet employed exactly the same scheduler, and researchers invent newer, "better" scheduling systems all the time. They have to invent new ones because no scheduler schedules perfectly for fairness, throughput and interactivity all at once. It doesn't help that we often lay upon our schedulers the responsibility of knowing the status of I/O operations and performing various kinds of optimized inter-process communication (such as synchronous message-passing with time-slice donation).
Finally, processes/threads lack elegance. Because such systems either have or simulate parallelism, operations on processes/threads must fall into one of two groups: those valid for suspended threads and those valid for running threads. However, the operating system must often try to present the set of all threads as homogeneous, with a single fixed set of operations valid at all times.
We have used kernel-level processes and threads for decades because it seemed the most obvious way to run a time-sharing system and even worked for truly parallel systems, and damn programmatic or theoretical elegance.
However, higher-level programming languages that continue to grow in popularity for their expressiveness suggest another possible solution. Time-sharing, after all, works by interrupting the running program and saving its run-time state (such as program counter and registers) into a data structure (often the kernel-level stack). Theoretically speaking, that saved state constitutes a one-shot continuation of the interrupted program. Indeed, switching threads in a standard system involves saving the continuation of the old thread into that thread's data structure, and then loading and throwing to a new continuation from the new thread.
And if a suspended thread becomes a saved continuation, what does one call the running thread? I propose that we create entities called virtual processors that abstract over a logical processor (i.e., whatever the CPU says is a processor) and the data structures describing the currently-running program. Short routines can convert between saved continuations and the contents of a logical processor by saving and loading machine state. Also, note that the kernel can create any number of virtual processors to represent any number of logical machine processors while adding only optional complexity to the kernel-level scheduler (see below, sorry).
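To make this concrete, here is a minimal sketch of the two structures in C. Every name in it (machine_state, continuation, vcpu) is hypothetical, invented for illustration rather than taken from any existing kernel:

Code: Select all
struct process;               /* address space + protection domain */

/* A saved continuation is just captured machine state plus an owner. */
struct machine_state {
    void *pc;                 /* program counter at capture time */
    void *sp;                 /* stack pointer at capture time   */
    unsigned long regs[16];   /* general-purpose registers       */
};

struct continuation {
    struct process *owner;
    struct machine_state state;
};

/* A virtual processor pairs one logical processor with whatever it is
 * currently running; short routines save/load machine_state to convert
 * between a live vcpu and a saved continuation. */
struct vcpu {
    int logical_id;               /* whatever the CPU says is a processor */
    struct machine_state current; /* live register contents               */
};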
We can also simply add some kind of value passed through on a continuation invocation, possibly the calling continuation, à la Scheme. After that, we just need an operation to construct continuations of arbitrary states from scratch, and to copy continuations to simulate multi-shot continuations.
Scheme programmers have demonstrated time and time again, after all, that one can use continuations to build a threading system in about a page of code in which each thread acts as a coroutine of the scheduler. Operating-system developers haven't used this idea because we haven't yet found a way to implement a preemptive concurrency system using continuations.
To implement concurrency, we add a type of object called events. These contain any number of saved continuations as handlers, and support an operation to "fire" or "trigger" the event. When something fires an event, it dequeues all of its handlers and hands them to a scheduler that only needs to decide what kind of event-handler continuations can preempt what kind of running virtual processors. When the scheduler decides to preempt, it simply invokes ThrowCC(PREEMPTING CONTINUATION) and possibly checks the return value for errors. Giving the handler the preempted continuation allows it to function as a scheduler.
Note that events can serve as a convenient abstraction over machine interrupts. Making them such gives us a very convenient translation from machine-level preemption to preemption in our concurrency model.
However, adding a continuation as an event handler with the currently given operations will run far too slowly for real use. Instead, we introduce an operation HandleEvent that adds the current continuation to the given event's set of handlers and then invokes a given continuation with a given value. This primitive can run much more efficiently.
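A rough sketch of the event object this implies, again with invented names; the scheduling policy hides behind a hypothetical schedule_handler() hook:

Code: Select all
#define MAX_HANDLERS 16

struct continuation;                                  /* as sketched above */
extern void schedule_handler(struct continuation *);  /* hypothetical hook */

/* An event owns any number of saved handler continuations. */
struct event {
    struct continuation *handlers[MAX_HANDLERS];
    int nhandlers;
};

/* Fire: dequeue every handler and let the scheduler decide which, if
 * any, may preempt the currently running virtual processor. */
void fire(struct event *e)
{
    while (e->nhandlers > 0)
        schedule_handler(e->handlers[--e->nhandlers]);
}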
So we end up with these primitive operations:
Throw(continuation,value) --> Throw never returns; it destroys the calling continuation.
ThrowCC(continuation) --> ThrowCC saves the current virtual-processor contents as a continuation and performs a Throw to the given continuation, passing the freshly saved continuation as the value. If that saved continuation is ever invoked, ThrowCC() returns the value passed to Throw() for that continuation.
Construct(machine state) --> Construct returns a saved continuation of the given machine state.
Copy(continuation) --> Copy returns a saved continuation with machine state identical to that of the given continuation; the kernel should only copy continuations when necessary, rather than on every Throw.
Fire(event) --> Fire triggers the given event, which invokes the kernel scheduler for the handlers and the current computation. Fire returns (without a value) if and when the computation that called it runs again.
HandleEvent(event,continuation,value) --> HandleEvent adds the current continuation to the given event's set of handlers, then invokes the given continuation with the given value. It returns the value passed to the handler when it runs (read: the preempted continuation).
These objects and operations, implemented on a single-core system, will allow the creation of nearly-arbitrary concurrency systems in user-space, removing unneeded complexity from the kernel.
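As a user-space demonstration of how little machinery this takes, here is a complete cooperative round-robin system built on one-shot continuations. It uses POSIX ucontext (getcontext/makecontext/swapcontext) in place of the proposed Throw/ThrowCC primitives, so it illustrates the idea rather than the kernel design itself:

Code: Select all
#include <stdio.h>
#include <ucontext.h>

#define NTASKS  2
#define STACKSZ 65536
#define NSTEPS  3

static ucontext_t scheduler_ctx, task_ctx[NTASKS];
static char stacks[NTASKS][STACKSZ];
static int current;

/* Save our continuation and throw to the scheduler's (cf. ThrowCC). */
static void yield_to_scheduler(void)
{
    swapcontext(&task_ctx[current], &scheduler_ctx);
}

static void task(void)
{
    for (int step = 0; step < NSTEPS; step++) {
        printf("task %d, step %d\n", current, step);
        yield_to_scheduler();  /* we become a saved continuation here */
    }
}

int main(void)
{
    /* Construct() analogue: build a saved continuation per task. */
    for (int t = 0; t < NTASKS; t++) {
        getcontext(&task_ctx[t]);
        task_ctx[t].uc_stack.ss_sp = stacks[t];
        task_ctx[t].uc_stack.ss_size = sizeof stacks[t];
        task_ctx[t].uc_link = &scheduler_ctx; /* where a returning task goes */
        makecontext(&task_ctx[t], task, 0);
    }
    /* The entire "scheduler": round-robin throws to saved continuations. */
    for (int round = 0; round < NSTEPS; round++)
        for (current = 0; current < NTASKS; current++)
            swapcontext(&scheduler_ctx, &task_ctx[current]);
    return 0;
}

Each swapcontext() is exactly the save-then-throw pair described above; the whole "scheduler" is two nested loops.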
Finally, adding a single additional primitive operation will serve to bring this concurrency system into the truly parallel world of multi-core and multi-processor machines. Since a continuation cannot terminate itself without throwing to another continuation, a virtual processor can even run itself forever once started. One only needs a primitive to start an inactive virtual processor with a given continuation.
StartProcessor(virtual processor,continuation,value) --> StartProcessor returns a success/error code indicating whether the given virtual processor was inactive or not. When the virtual processor was inactive, StartProcessor performs the internal operations of the kernel and machine to make the logical processor corresponding to the inactive virtual processor start computing with the given continuation, passing it the given value.
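A hypothetical usage sketch; construct(), start_processor(), and the vcpu argument are invented stand-ins for the primitives above:

Code: Select all
struct machine_state;
struct continuation;
struct vcpu;

/* Hypothetical primitives from the list above. */
extern struct continuation *construct(struct machine_state *);
extern int start_processor(struct vcpu *, struct continuation *, void *);

/* Bring a second virtual processor online. If it is already active,
 * start_processor reports an error and nothing happens. */
int boot_second_processor(struct vcpu *vcpu1, struct machine_state *entry)
{
    struct continuation *k = construct(entry);
    return start_processor(vcpu1, k, (void *)0 /* initial value */);
}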
I hope this sorta-kinda essay on threads, continuations, and concurrency has provided everyone with some new, interesting seeds of thought. Flames about my opinion on threads, however, can go to Gehenna.
- Colonel Kernel
Re: Do Threads Remain the Best Concurrency Abstraction?
Crazed123 wrote: In my humble opinion, no.
I sort of agree with what you say, but not what you mean.
Crazed123 wrote: But processes and threads come with their costs: elaborate scheduling systems. No two operating systems have yet employed exactly the same scheduler, and researchers invent newer, "better" scheduling systems all the time. They have to invent new ones because no scheduler schedules perfectly for fairness, throughput and interactivity all at once.
How does your idea solve this problem? It seems to me you're just pushing the scheduling logic somewhere else (e.g. -- user space).
Crazed123 wrote: Finally, processes/threads lack elegance.
This is never a good reason to do anything, because elegance is subjective.
Crazed123 wrote: Because such systems either have or simulate parallelism, operations on processes/threads must fall into one of two groups: those valid for suspended threads and those valid for running threads. However, the operating system must often try to present the set of all threads as homogeneous, with a single fixed set of operations valid at all times.
There have been a lot of problems identified with threads (the intractability of lock-based programming being the #1 issue that leaps to mind), but I've never seen this presented as being a major problem in need of a solution before. Can you give some examples?
Crazed123 wrote: We have used kernel-level processes and threads for decades because it seemed the most obvious way to run a time-sharing system and even worked for truly parallel systems, and damn programmatic or theoretical elegance.
I think the real reason is that CPU instruction sets are basically procedural (they could be data-flow oriented instead, but aren't), and mainstream programming languages have always been procedural. Threads are just a way to capture procedural state and time- or place-shift it.
The reason I said that I agree with what you say (but not what you mean), is that I actually think threads are here to stay (because they are firmly entrenched at the hardware level), but they don't make particularly nice high-level abstractions for programmers. Threads "in the raw" lead inexorably to locks and all their problems, trying to avoid the cost of spawning too many threads by pooling, etc. These are all things that IMO the system should take care of for the programmer.
Here are some abstractions that I think will reduce the need for programmers to think about threads:
- Asynchronous message passing
- Futures
- Transactional memory
- Automatic data parallelism
Crazed123 wrote: However, higher-level programming languages that continue to grow in popularity for their expressiveness suggest another possible solution.
I honestly don't see how switching from a procedural concurrency abstraction (threads) to a functional concurrency abstraction (continuations) is going to solve any big problems. In fact, I think it creates some:
- Continuations are really hard for mere mortal programmers to understand.
- Continuations require some pretty sophisticated run-time support (does this mean that all code running on your OS would need to set up call frames in a garbage collected heap instead of on a stack?)
- Hardware still deals in threads -- something somewhere has to map between the two.
- carbonBased
Re: Do Threads Remain the Best Concurrency Abstraction?
Colonel Kernel wrote: Here are some abstractions that I think will reduce the need for programmers to think about threads:
- Asynchronous message passing
- Futures
- Transactional memory
- Automatic data parallelism
I'm sure there are more, but it's 1:00 AM.

Mr. Kernel, it must be past 1 AM now, so I'm hoping you might elaborate on the above.
Async message passing; I completely agree, and spent some time developing optimal APIs for this in my OS. I'm vaguely aware of transactional memory, and am curious to know if anyone knows of any software implementations of this (or if it's even practical without hardware support)?
I'm not so familiar with the other concepts you mention, though. What is meant by "futures" and "automatic data parallelism"?
Thanks,
Jeff
- Brendan
Re: Do Threads Remain the Best Concurrency Abstraction?
Hi,
carbonBased wrote: Async message passing; I completely agree, and spent some time developing optimal APIs for this in my OS. I'm vaguely aware of transactional memory, and am curious to know if anyone knows of any software implementations of this (or if it's even practical without hardware support)?
There's a list of STM implementations on Wikipedia. Most research papers include benchmarks showing that STM is faster than something else (traditional locking techniques and/or a different implementation of STM) for some algorithms and some level of concurrency. I'm not too sure how these benchmarks translate to full processes under real world conditions, but it is practical without hardware support.
carbonBased wrote: I'm not so familiar with the other concepts you mention, though. What is meant by "futures" and "automatic data parallelism"?
I'm not too sure about futures, but the idea seems to be to delay processing that isn't needed immediately. For example, if you do "X = Y * A + B ^ C", then the language might use another thread to calculate the result, and set X to "not known yet" until the result is calculated. If (later on) you need the value of X you'd wait until it's available (if it's not already).
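To make the idea concrete, here is a minimal future sketched in C over POSIX threads; the future type and helper names are invented for illustration:

Code: Select all
#include <pthread.h>
#include <stdio.h>

/* A future: a value computed on another thread, claimed later. */
struct future {
    pthread_t worker;
    int value;               /* valid only after future_get() returns  */
};

static void *compute(void *arg)
{
    struct future *f = arg;
    f->value = 6 * 7;        /* stand-in for the expensive computation */
    return NULL;
}

/* Start the computation; X is "not known yet" until we wait for it. */
static void future_start(struct future *f)
{
    pthread_create(&f->worker, NULL, compute, f);
}

/* Wait until the value is available (if it isn't already). */
static int future_get(struct future *f)
{
    pthread_join(f->worker, NULL);
    return f->value;
}

int main(void)
{
    struct future x;
    future_start(&x);        /* runs concurrently with the code below  */
    printf("X = %d\n", future_get(&x));
    return 0;
}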
Automatic parallelization is where the compiler figures everything out on its own, and programmers just write code as if it's single-threaded.
Cheers,
Brendan
- Colonel Kernel
Re: Do Threads Remain the Best Concurrency Abstraction?
It's always nice when someone steps in and does your job for you.
Brendan wrote: Automatic parallelization is where the compiler figures everything out on its own, and programmers just write code as if it's single-threaded.
Just to add a bit more, this kind of automatic parallelization is much easier on algorithms that apply the same operations over and over to multiple sets of data. The typical example of such an optimization is loop unrolling -- if each iteration of a for loop doesn't have any data dependencies on previous iterations, then the compiler can split it into several parallel loops. That's why I called it automatic data parallelism. This technique works well for a certain class of problems, but is extremely difficult (if not impossible) in the general case.
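For example, here is a hedged sketch using OpenMP, which performs exactly this kind of split when the programmer asserts that iterations are independent (so it is programmer-assisted rather than fully automatic):

Code: Select all
#include <omp.h>   /* compile with -fopenmp */

/* Each iteration touches only a[i], so there are no cross-iteration
 * dependencies and the loop may be split across processors. */
void scale(float *a, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= k;
}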
- Crazed123
Colonel Kernel wrote: Continuations are really hard for mere mortal programmers to understand.
Harder than raw threads?
Colonel Kernel wrote: Continuations require some pretty sophisticated run-time support (does this mean that all code running on your OS would need to set up call frames in a garbage collected heap instead of on a stack?)
To your parenthetic question: Dear Lord, no! Doesn't the operating system already keep track of which pages contain a thread's stack segment? By tracking the stack-segment pages and stack pointer, a kernel can sufficiently implement continuations.
Colonel Kernel wrote: Hardware still deals in threads -- something somewhere has to map between the two.
I don't believe it does. What goes onto the stack when an interrupt fires?
* The program counter of the running program.
* The general registers of the current program.
* The protection registers of the current program.
* The original stack pointer (at time of interruption) of the current program.
* Any other random registers the processor happens to have (esi, edi, ebp).
You look at these and call them the state of the current thread. I look at these and call them the reified current continuation. Both interpretations work equally well depending on how you want to use that processor state.
Threads are persistent entities that, when in the "suspended" state, hold a saved continuation. When in the "running" state, threads generally only hold their scheduling information.
For example, one can only suspend a running thread, and one can only run a suspended thread. But both kinds of threads are still threads, so the operations are partial over the domain of threads. This happens because running and suspending really operate on the continuation (or lack thereof) stored in the thread rather than the thread itself.
Colonel Kernel wrote: So besides scratching some personal itch, what justifies the complexity of using continuations instead of threads?
Mu. Your question depends on the (IMNSHO) incorrect assumption that continuations have more complexity than threads.
I contend that the processor, continuation, and event abstractions map more closely to the hardware of cores, interrupts, and hardware IRQs than OS-multiplexed threads.
In fact, continuations can also make IPC simpler. Message passing becomes easy when one sets up a receive using ThrowCC() and sends a message using either form of Throw, because...
* It's synchronous, so if your call to ThrowCC() ever returns you know that your message was passed, and you know that the target continuation receives your message as soon as you send it.
* Because any valid saved continuation is, by definition, waiting to receive a message so it can return from ThrowCC, neither messages nor processes queue to perform IPC.
Edit: First IPC benefit of continuations was a consequence of the third.
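A sketch of what such a send could look like in C. The throw_cc() entry point is hypothetical, shaped like the call that appears later in this thread, and the request/reply types are invented:

Code: Select all
struct continuation;

/* Hypothetical kernel entry point: suspend the caller, run the target
 * continuation, hand it the saved caller plus a message, and resume
 * here when something throws a reply back. */
extern void throw_cc(struct continuation *target,
                     void *send_buf, unsigned long send_len,
                     void *recv_buf, unsigned long recv_len);

struct request { int op; };
struct reply   { int status; };

/* Synchronous call: nothing queues, because a valid target is by
 * definition a saved continuation already waiting for a message. */
int client_call(struct continuation *server)
{
    struct request req = { 42 };
    struct reply   rep;

    throw_cc(server, &req, sizeof req, &rep, sizeof rep);
    return rep.status;   /* we only resume once the server has replied */
}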
- Colonel Kernel
Crazed123 wrote: Harder than raw threads?
Yes! The hard thing to understand about threads is the consequences of lock-based programming. The mechanism itself is not that complicated. It's a perfect example of "simplexity". I have yet to see anything about continuations that convinces me that they don't suffer from the same problems if they're used in a programming environment with shared mutable state (the default for threads, but not for continuations).
I'm beginning to suspect that your idea of "continuation" and mine are different...
Crazed123 wrote: To your parenthetic question: Dear Lord, no! Doesn't the operating system already keep track of which pages contain a thread's stack segment? By tracking the stack-segment pages and stack pointer, a kernel can sufficiently implement continuations.
So what if I wanted to take the current continuation at some arbitrary point in my code, then execute it repeatedly later, without incurring the overhead of switching machine state each time? AFAIK that's how continuations usually work. They're like closures that you can create whenever you want and pass around.
IMO you're just using the word "continuation" to describe thread state.
Crazed123 wrote: I don't believe it does. What goes onto the stack when an interrupt fires?
* The program counter of the running program.
* The general registers of the current program.
* The protection registers of the current program.
* The original stack pointer (at time of interruption) of the current program.
* Any other random registers the processor happens to have (esi, edi, ebp).
You look at these and call them the state of the current thread. I look at these and call them the reified current continuation. Both interpretations work equally well depending on how you want to use that processor state.

What about terms like hyper-threading? Thread-level parallelism? Simultaneous chip multi-threading? The hardware guys are all over threads like white on rice. It's too late to turn back that terminological clock.
Crazed123 wrote: Threads are persistent entities that, when in the "suspended" state, hold a saved continuation. When in the "running" state, threads generally only hold their scheduling information. For example, one can only suspend a running thread, and one can only run a suspended thread. But both kinds of threads are still threads, so the operations are partial over the domain of threads. This happens because running and suspending really operate on the continuation (or lack thereof) stored in the thread rather than the thread itself.

Do you have something against abstractions that involve state transitions? I don't see what big problem this is causing.
Crazed123 wrote: I contend that the processor, continuation, and event abstractions map more closely to the hardware of cores, interrupts, and hardware IRQs than OS-multiplexed threads.
Ok, I'll buy that. I still don't see why the current kernel-level abstractions we have are so terrible. Like I said in my post, I think the big problems are in how threads are used by user-space programs, not so much in how they're implemented.
Crazed123 wrote: In fact, continuations can also make IPC simpler. Message passing becomes easy when one sets up a receive using ThrowCC() and sends a message using either form of Throw, because...
* It's synchronous, so if your call to ThrowCC() ever returns you know that your message was passed, and you know that the target continuation receives your message as soon as you send it.
* Because any valid saved continuation is, by definition, waiting to receive a message so it can return from ThrowCC, neither messages nor processes queue to perform IPC.
Edit: First IPC benefit of continuations was a consequence of the third.

What you've described sounds to me like a typical implementation of synchronous rendezvous-style IPC in a kernel like QNX or L4. ThrowCC() sounds a lot like taking a snapshot of the current thread's state and message, then queuing it up for later delivery and resumption. I think you're just choosing different words for things that already exist in other kernels.
Whether synchronous IPC is a good thing is a whole other discussion (maybe Brendan can put his $0.02 in again). BTW, it is possible to have reliable asynchronous IPC with compile-time bounded queue sizes.
- Crazed123
To beat a dead horse: http://www.osdev.org/wiki/Continuation_Systems
Colonel Kernel wrote: IMO you're just using the word "continuation" to describe thread state.
And I keep trying to explain that they are in fact the same exact thing. They Are One. Calling it different names only changes how you're using it.
Colonel Kernel wrote: What about terms like hyper-threading? Thread-level parallelism? Simultaneous chip multi-threading? The hardware guys are all over threads like white on rice. It's too late to turn back that terminological clock.
The hardware guys are simply coming up with ways to make one X of hardware look like multiple Xs to software, i.e., one processor with multiple cores presents itself as multiple logical processors. No matter what term anyone uses, the software just sees additional logical processors to use. The threading model will schedule threads onto these logical processors, and my model will abstract over them as additional virtual processors and start a continuation on them.
Colonel Kernel wrote: Do you have something against abstractions that involve state transitions? I don't see what big problem this is causing.
When the state transition means that certain operations only work when the object has a certain state, yes.
Colonel Kernel wrote: Ok, I'll buy that. I still don't see why the current kernel-level abstractions we have are so terrible. Like I said in my post, I think the big problems are in how threads are used by user-space programs, not so much in how they're implemented.
Because current kernel designs make those abstractions mandatory. More specifically, Kernel X makes kernel-level threads/kernel-level processes preempted for the illusion of concurrency by Timer A (or possibly multiplexed across multiple logical processors) and scheduled by Algorithm B with time-slices of general range C-D milliseconds with user-level parameters E, F, and G mandatory.
Each kernel enforces its own design decisions about scheduling, concurrency and parallelism on its user-land, and if I have technology to make that unnecessary and restore choice to user applications, I think that justifies making the change.
Colonel Kernel wrote: What you've described sounds to me like a typical implementation of synchronous rendezvous-style IPC in a kernel like QNX or L4. ThrowCC() sounds a lot like taking a snapshot of the current thread's state and message, then queuing it up for later delivery and resumption. I think you're just choosing different words for things that already exist in other kernels.
I think you misunderstood. ThrowCC() does not involve any queuing of anything by anyone, ever. ThrowCC() can only refer to (read: receive as an argument) saved continuations, what you would demand be called suspended threads. Any continuation to which a user program can generate a valid reference is already suspended and waiting for a message. Throw() and ThrowCC() just hand the suspended continuation a message and run it. ThrowCC() suspends the current continuation and passes it as part of the message. Nothing ever queues.
- Colonel Kernel
Crazed123 wrote: And I keep trying to explain that they are in fact the same exact thing. They Are One. Calling it different names only changes how you're using it.
Ok, we violently agree. I re-read your original post and I think I see the difference in your emphasis now.
Crazed123 wrote: When the state transition means that certain operations only work when the object has a certain state, yes.
But this happens all the time. Try reading from a closed file, or executing a database query on a closed connection. Statefulness is here to stay because in limited cases, it's easier to understand. I can sympathize with your objection, because I've seen how large-scale statefulness leads to simplexity. However, I think it's a bit dogmatic and impractical to think of a two-state abstraction as being overly complex.
Crazed123 wrote: Because current kernel designs make those abstractions mandatory. More specifically, Kernel X makes kernel-level threads/kernel-level processes preempted for the illusion of concurrency by Timer A (or possibly multiplexed across multiple logical processors) and scheduled by Algorithm B with time-slices of general range C-D milliseconds with user-level parameters E, F, and G mandatory. Each kernel enforces its own design decisions about scheduling, concurrency and parallelism on its user-land, and if I have technology to make that unnecessary and restore choice to user applications, I think that justifies making the change.

Aha! The magic word you haven't been uttering is: Exokernel. I have my own thoughts on the practicality and need for pushing everything into user-space, but they're mainly product-focused economic arguments. If you're into exokernel research, more power to you. I just won't go there.
Crazed123 wrote: I think you misunderstood. ThrowCC() does not involve any queuing of anything by anyone, ever. ThrowCC() can only refer to (read: receive as an argument) saved continuations, what you would demand be called suspended threads. Any continuation to which a user program can generate a valid reference is already suspended and waiting for a message. Throw() and ThrowCC() just hand the suspended continuation a message and run it. ThrowCC() suspends the current continuation and passes it as part of the message. Nothing ever queues.
Ah, I get it now. In that case, how does a... program fragment? (I can't call them threads) suspend itself until a message arrives? Does it have to ThrowCC() to a scheduler? Do you have any examples of what this would look like in code (or even what language your user-land would use to express this)?
I have another question. I noticed on my second reading that you've been making a distinction between "one-shot" and "multi-shot" continuations. I asked earlier how you would capture a continuation and invoke it repeatedly. Is this what you mean by a "multi-shot" continuation? If so, do I understand correctly that you would use copying to implement it? If so, how would you deal with a multi-shot continuation where each invocation modifies some shared mutable state?
For that matter, you haven't addressed the issue of shared mutable state much at all... how do you deal with it in a continuation-based system?
- AndrewAPrice
Yes and no.
It is a better option to use the processor threads over software threads. Then, you can take advantage of things such as simultaneous thread execution just like Windows and Linux do.
However, I have run into a problem with hardware threads. Say the Sync() function synchronized the two threads (waiting for the other one to catch up). If thread one contained the code:
and thread two contained the code:
After the above code executes, 0xABCDEF will have 15. I will explain why. We find out that 0xABCDEF cannot contain both values, so we'll have to examine this store (*acoolpointer = 15) closely.
The processor would read and tunnel the instructions (to move 10/15 into a register then into [0xABCDEF]) at the same time. Of course, the location can't contain both values, so we'll have to split this instruction up.
We know computers are made up of bits, and to move a value takes 8 cycles (4 to move into a register, 4 to move into memory). Thus we observe the following:
- cycle one: send the instruction to the processor
- cycle two: seek the memory location of values
- cycle three: read the memory location
- cycle four: send to register
- cycle five: send the next instruction to the processor
- cycle six: read the register
- cycle seven: seek the new memory location
- cycle eight: write value to the new memory location
But wait!! We are trying to write two values simultaneously. To set the memory, the processor goes through on a byte-by-byte basis. It first clears the byte, then sets it. We see that we do a binary OR operation.
An int takes up four bytes (or 32 bits). This is the new value of 0xABCDEF:
clear byte 1: 0000 0000
after thread 1: 0000 0000
after thread 2: 0000 0000
clear byte 2: 0000 0000
after thread 1: 0000 0000
after thread 2: 0000 0000
clear byte 3: 0000 0000
after thread 1: 0000 0000
after thread 2: 0000 0000
clear byte 4: 0000 0000
after thread 1: 0000 1010
after thread 2: 0000 1111 (OR of 1111 (15) with 1010 (10))
Therefore, after the memory has been written, 0xABCDEF will contain an int value of 15.
This can cause complications, say, if a network program contains states 0 (0 in binary), 1 (1 in binary), and 2 (10 in binary). A simple binary OR on 01 and 10 will result in 11 (3 in decimal). However there is no program state 3.
There are two solutions to this problem;
- Use a stack, so state changes are only parsed one at a time.
- Place one program into a loop, until the other program has finished changing the state.
Solution A can cause difficulties within itself: what if two threads try to simultaneously allocate memory, create a structure, and add it to the linked list? This could result in an overflow, not to mention the *next pointer in the linked list becoming corrupt. The only solution to this problem is the second solution.
The second solution is to call the scheduler, telling it to place the program in an infinite loop until the other process has finished. The dilemma over which program should wait for the other is removed by the thread manager keeping a list of threads and their states. When a thread calls the loop function, a well designed thread manager should know which thread called the function, and set its state to paused. The error of two threads writing to the same memory location is eliminated since each thread's state should be stored in a different location. The thread itself will be placed in a loop that constantly spins while the thread's state is set to paused. The thread manager will have another loop that goes through each thread, unpausing them one by one. This guarantees that two threads will unsynchronise and not try to write to the same memory location (the process of comparing a state, setting the state, comparing another state, setting the state, will cause one thread to fall a few cycles behind the other thread, enough to eliminate the error).
Most processors, however, will generate an overwrite interrupt, giving the operating system a sufficient warning to place the offending thread into the unsync'ing loops detailed above.
This seems like a lot of work, but it is EXTREMELY unlikely two threads will try to write to the same memory location at once, so your operating system will rarely have to do anything. However, if you are using hardware threads, then there is a chance that it could happen, and you should always have a handler for these events. I find that this is a feature often overlooked.
- Combuster
MessiahAndrw wrote: After the above code executes, 0xABCDEF will have 15. I will explain why. We find out that 0xABCDEF cannot contain both values, so we'll have to examine this store (*acoolpointer = 15) closely.
By definition, it's indeterminate. It can have either 10 or 15; it just depends on which thread completes last. Sync as provided only guarantees that it will not return until the other thread has reached the Sync statement.
Next, there is read-write atomicity; 10 is written as a whole, then 15 is written as whole or vice-versa. There are no interleaved memory accesses possible.
Then, 8 cycles for a write does not sound like a desktop PC, and for most other processors the same holds. What kind of hardware are you referring to? Probably not the hyperthreading P4.
- Colonel Kernel
MessiahAndrw wrote: Yes and no.
There are a lot of misconceptions in your post. I will try to (briefly) address some of the big ones.
When a "software" thread runs on a processor, it is already a "processor thread" as well. It's not really a choice between the two. Taking advantage of things like hyperthreading and multi-core usually means that the scheduler has to be aware of when logical processors share certain physical resources such as the L2 cache, otherwise it will make sub-optimal scheduling decisions.It is a better option to use the processor threads over software threads. Then, you can take advantage of things such as simultaneous thread execution just like Windows and Linux do.
MessiahAndrw wrote: We know computers are made up of bits, and to move a value takes 8 cycles (4 to move into a register, 4 to move into memory).
Nope. This is not even close to the way that modern CPUs work. To move a value to memory as in your example, the processor needs to read the target address out of a register, load the value to store from the immediate operand field of the instruction into another register, and send the address-value pair down to the store buffer to be scheduled for writing back to the L1 cache. Eventually it will get written to the L2 cache when it gets kicked out of the L1 cache, or when another processor tries to read from the same address. The same thing goes for it being written back to L3 (if it exists) and to main memory.
MessiahAndrw wrote: But wait!! We are trying to write two values simultaneously. To set the memory, the processor goes through on a byte-by-byte basis. It first clears the byte, then sets it. We see that we do a binary OR operation.
Writes from the processor's store buffer back to the L1 cache are on a machine word granularity, not a byte granularity. In your example that means that 10 and 15 will be sent to the store buffer all at once, not one byte at a time. Also, the write is just a write. It is not a read-OR-write operation, which would be extremely slow (reading/writing memory is orders of magnitude slower than reading/writing from the L1 cache).
What this means is that whether 10 or 15 becomes the final value at 0xABCDEF is non-deterministic.
Find yourself a good book on CPU architecture or read some of the articles by Jon Stokes at Ars Technica.
MessiahAndrw wrote: There are two solutions to this problem;
- Use a stack, so state changes are only parsed one at a time.
- Place one program into a loop, until the other program has finished changing the state.

I don't know what you mean by the first solution. The second solution sounds like spinlocks, which are a tried-and-true way to implement mutual exclusion in the kernel. Your example is a bit weird though, because it isn't clear what the correct result should be -- 10, or 15? Depending on what you're trying to do, you might be able to accomplish it with an atomic exchange (or compare-and-swap) rather than spinlocks. This sort of thing is done with the LOCK prefix and usually the CMPXCHG instruction on x86 machines.
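For illustration, a minimal sketch of both techniques using GCC's __sync builtins, which compile to LOCK-prefixed instructions (XCHG, CMPXCHG) on x86; the names and the atomic_max example are invented:

Code: Select all
/* A spinlock via atomic test-and-set (x86: LOCK XCHG). */
static volatile int lock_word = 0;

static void spin_lock(void)
{
    while (__sync_lock_test_and_set(&lock_word, 1))
        ;   /* spin until the previous holder releases */
}

static void spin_unlock(void)
{
    __sync_lock_release(&lock_word);   /* atomically store 0 */
}

/* A lock-free update via compare-and-swap (x86: LOCK CMPXCHG):
 * atomically raise *p to at least v, retrying if another thread
 * changed *p between our read and our swap. */
static void atomic_max(volatile int *p, int v)
{
    int old;
    do {
        old = *p;
        if (old >= v)
            return;            /* already large enough; nothing to do */
    } while (!__sync_bool_compare_and_swap(p, old, v));
}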
Colonel Kernel wrote: I don't know what you mean by the first solution.
I believe he is describing transaction-based operation in his first point.
Most of these (ideas, concepts, erm, blabblings?) are simply redescribing the problems we have with concurrently running threads, and mostly things that will not change even when that abstraction does.
- Crazed123
Colonel Kernel wrote: But this happens all the time. Try reading from a closed file, or executing a database query on a closed connection. Statefulness is here to stay because in limited cases, it's easier to understand. I can sympathize with your objection, because I've seen how large-scale statefulness leads to simplexity. However, I think it's a bit dogmatic and impractical to think of a two-state abstraction as being overly complex.
It's not really overly complex; it's just that there's now something simpler.
Colonel Kernel wrote: Aha! The magic word you haven't been uttering is: Exokernel. I have my own thoughts on the practicality and need for pushing everything into user-space, but they're mainly product-focused economic arguments. If you're into exokernel research, more power to you. I just won't go there.
It doesn't really need to go into an exokernel. Current Linux kernels, for example, are rather un-modular with regards to scheduling algorithms: one must manually patch the kernel to change the scheduler. Wouldn't it be nice if you could choose from several different scheduling algorithms and concurrency styles (real-time vs. preemptive round-robin vs. cooperative) when you compiled your kernel? Building continuation-based concurrency in at the lowest level (possibly even implementing as much as possible in macros) would support that in perfectly normal macro-kernels.
Colonel Kernel wrote: Ah, I get it now. In that case, how does a... program fragment? (I can't call them threads) suspend itself until a message arrives? Does it have to ThrowCC() to a scheduler? Do you have any examples of what this would look like in code (or even what language your user-land would use to express this)?
ThrowCC() is the primitive for a running program (Scheme and denotational semantics just call these the Current Continuation, but I digress. It's a CC.) to suspend itself waiting for a message, handing control of the processor to a suspended continuation. In practice, it could look like this...
Code: Select all
/* Suspend the current continuation, hand it (plus a message) to the scheduler continuation, and resume here when a reply arrives. */
throw_cc(scheduler, &message_to_scheduler, sizeof(message_to_scheduler), &received_message, sizeof(received_message));
Since ThrowCC() was invoked instead of Throw(), the kernel transforms the running CC into a suspended continuation and prepends the identifier to that continuation to the intended message, creating the message it finally passes to the invoked scheduler continuation. In the case of plain Throw(), the kernel would just pass the message along directly as it switches control of the processor, destroying the CC that called Throw().
Colonel Kernel wrote: I have another question. I noticed on my second reading that you've been making a distinction between "one-shot" and "multi-shot" continuations. I asked earlier how you would capture a continuation and invoke it repeatedly. Is this what you mean by a "multi-shot" continuation? If so, do I understand correctly that you would use copying to implement it? If so, how would you deal with a multi-shot continuation where each invocation modifies some shared mutable state? For that matter, you haven't addressed the issue of shared mutable state much at all... how do you deal with it in a continuation-based system?

I'll address these issues in reverse order.
In a thread-based system, many threads run in one shared context, typically called a process. The process contains, at minimum, a virtual address space and often houses other resources or security properties. Threads running in the same process can treat the shared code and data sections of the process as shared mutable state. A continuation-based system works the same way in this regard, with many continuations of one process sharing the process as mutable state.
I distinguish between one-shot and multi-shot continuations purely for implementation reasons. Saving a continuation involves saving which process owns it, its machine state, and its stack at minimum. Once this state has been restored, the continuation is running, and the stack in use again, references to the suspended continuation no longer make sense.
However, copying the saved continuation before invoking it creates an identical clone that can itself be invoked or copied. But copying every continuation on every invocation creates the need for a garbage collector to identify continuations that will never be used again. So to optimize for memory usage and work without G.C., a continuation system uses single-use ("one-shot") continuations that can be cloned when needed, to simulate continuations that can be invoked any number of times, or "left to die" when no longer needed.
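In code, the simulation could look like the following sketch; the lower-case wrapper names are invented stand-ins for the Copy and Throw primitives:

Code: Select all
struct continuation;

/* Hypothetical wrappers for the Copy and Throw primitives above. */
extern struct continuation *copy_continuation(struct continuation *);
extern void throw_to(struct continuation *k, void *value); /* never returns */

/* Simulate a multi-shot continuation: clone k before each invocation,
 * so the original stays a valid saved continuation afterwards. */
void invoke_multishot(struct continuation *k, void *value)
{
    struct continuation *clone = copy_continuation(k);
    throw_to(clone, value);   /* consumes only the clone; never returns */
}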
If you have a multi-shot continuation in which each invocation changes a piece of mutable state, the result you obtain depends on where the mutable state lies: in the process or in the continuation. Practically speaking, changes to global variables are visible between continuations (just like threads) while changes to local stack frames are only visible within one continuation (just like threads).
But I don't think that was your question... If I copy a continuation A to create B, invoke B, and then change some mutable state on the stack of B, A does not see the change. This is because continuations come from functional programming languages, in which they are immutable objects.
- Colonel Kernel
Crazed123 wrote: It doesn't really need to go into an exokernel. Current Linux kernels, for example, are rather un-modular with regards to scheduling algorithms: one must manually patch the kernel to change the scheduler. Wouldn't it be nice if you could choose from several different scheduling algorithms and concurrency styles (real-time vs. preemptive round-robin vs. cooperative) when you compiled your kernel?
Sure, but it could probably be done without throwing away threads as an abstraction... It probably doesn't matter though, because at that low level the implementation would look almost the same either way. It becomes mostly an argument about terminology at that point.
Crazed123 wrote: But I don't think that was your question... If I copy a continuation A to create B, invoke B, and then change some mutable state on the stack of B, A does not see the change. This is because continuations come from functional programming languages, in which they are immutable objects.
No, my question was: If you copy continuation A to create B, invoke B, then change some mutable state in the same process that owns A and B, then suspend B and invoke A, and A tries to work with that same global mutable state, what happens? I guess the answer is, it behaves a lot more like thread-based concurrency than like a continuation-based system would work in a "pure" functional language run-time environment.