Tasks, concurrency, and inter-process communication

AndrewAPrice · Post by **AndrewAPrice** » Tue Aug 18, 2015 3:23 pm

I've written parallel software before, and I really like the thread pool pattern. There is one execution thread per processor core and a queue of tasks, and each thread processes away at the queue of tasks. I feel that it is a superior model for parallel processing than the alternative, which is to create one thread per component (what if you have more cores than components? what about threads that block?)

A task is a piece of work - it doesn't need to be run continuously. It does something and then quits. Examples of tasks:
- Handle a mouse click or key press.
- One one update of the game loop (e.g. the timer could trigger this 60 times a second), which could then spawn a subtask to update each of the dynamic objects in the game world.
- Write some bytes to a file.
- Ray trace a tile.
- One iteration of a "parallel for" loop.

It's a great when you have a single application in focus that is trying to use all CPU resources (a ray tracer, a video game, etc.) On existing operating systems when you try to mix multitasking and the thread pool pattern, you end up with multiple thread pools. So, I've been thinking for a while - would it be possible to have a thread pool at the OS level, and a global queue of tasks?

The more I've been thinking about this, I've wondered if it would be possible to eliminate the abstraction of threads completely, and describe all code as a series of tasks. Most programs are event based, so tasks will be spawned when events occur, and that is how programs will execute code. Starting a program can be considered an event, and so the initialisation logic runs as a task. A timer firing x times a second in a real time application can be considered event which runs a task. You can describe everything worth using processor time as a task.

Tasks are also much more lightweight than threads - creating a task is as simple as adding a message to the end of a queue rather than creating a thread by allocating a page and setting up a stack and registers.

Each processor core essentially runs a loop:

Code: Select all

while(true) {
   if(Task task = task_queue->get_next())
         task.run();
   else
      __asm("hlt");
}

task.run() is essentially a function call into the body of the task. (In reality, task.run() is slightly more complicated - it will skip the task if the process is no longer running, switch to the address space, set up the parameter list - but still very light weight.)

For such a system to work we can't have blocking tasks, which means that all IO has to be done asynchronously. I feel that Node.js has done a great job at bringing asynchronous IO mainstream (example.) In my system, readFile() spawns a task to read the file, and once the file is loaded, another task is spawned with the loaded file passed in as a parameter. There should be no reason a task should consume CPU processing time just idling and waiting.

We also have to handle long running tasks which could potentially make the system unresponsive. There are two reasons a task takes a long time to run: 1) A programming bug and we get stuck in a loop, or 2) A long running algorithm that isn't divided up into sufficiently fine grained sub tasks. It would be very easy to detect if a task is taking more than one time slice (set a flag every time the timer interrupt is called, clear it every time you start processing another task, and if the flag already set when you enter the interrupt timer you know the currently running task has taken more than one time slice), and I can think of two ways to handle this:

1. Simply kill the task. Tasks are supposed to be light weight and it forces good programming practice to increase the granularity of your tasks.
2. Turn the long running task into a thread and preemptively schedule it like a traditional thread.

There's also the matter of forcing developers to write concurrency-safe code by at least being aware that tasks may be executed concurrently. I don't think this is an unreasonable request - I'm making my own applications programming language, so if they're going to have to learn a new language to use my system, they may as well pick up good habits from the start by forcing them to write concurrent, asynchronous, event-based code. As long as I provide a language that makes it very natural to do this.

In my standard library I'll provide developers with, I think it'll be useful to provide some types to simplify their lives. Such as:

- A consecutive queue. This will be a queue that will have a handler that is called when an item is added to the queue, however only one copy of the handler will be executed at once in sequence, even though you can add items to the queue concurrently. This will make it possible to write algorithms that will handle events that may occur concurrently, but you only want to handle them one at a time.

- A barrier. A handler is called when the barrier reaches 0. Developers can use this when they want to process many items in parallel, but then want to run a piece of code once all items have finished executing.

Tasks can also be the basis of inter-process communication.

Very much like how a mouse event will create a task in your process to handle the event, you can also create a task to run in another process. This will be the primary means of inter-process communication. Every running process will have a list of handlers. This list of handlers will essentially be a list of functions and their parameters. The parameters to these functions will be in a binary format similar to a protocol buffer message.

In my language, I will have a special kind of interface that you can cast processes to, and all you can add to these interfaces are function calls which take a protocol buffer message as an argument. Then at run time, you can attempt to cast a process to an interface:

Code: Select all

IVideoDriver driver = get_process(_some_pid) as IVideoDriver;
if(driver != null)
   driver.set_screen_resolution(SetResolutionParams(1024, 768));

This will queue a task in the other process:

Code: Select all

void setResolution(SetResolutionParams p) {
  // task!
}

You can also use a consecutive queue on top of this if you want to process messages one at a time.

What are your thoughts?

tom9876543 · Post by **tom9876543** » Thu Aug 20, 2015 3:33 am

If you are going to invent a new programming language that is task based / asynchronous, your new language MUST run on Windows / OSX / Linux before trying to implement it anywhere else. If it is not available on the 3 most popular platforms, it will never have any success.

Also you need to thoroughly test your new language and benchmark it. You need to write equivalents to existing complex applications and confirm your language has performance similar to existing C code.
- transcoding video files: ffmpeg
- database: postgresql
- http server: apache
- web browser: firefox or chromium

If you achieve all of the above, that would be an outstanding major achievement.

With all due respect, implementing a new programming language on your niche OS, IMHO, is a waste of time.

AndrewAPrice · Post by **AndrewAPrice** » Thu Aug 20, 2015 8:27 am

tom9876543 wrote:If you are going to invent a new programming language that is task based / asynchronous, your new language MUST run on Windows / OSX / Linux before trying to implement it anywhere else.

Already working on that. My goal is to have an IDE (editor and compiler) that runs on my operating system, but all programs that run on my operating system must be implemented in my language, and so I am writing the first version of my IDE to run on my existing desktop OS.

tom9876543 wrote:If it is not available on the 3 most popular platforms, it will never have any success.

I'm not looking for a job, to gain a huge user base, or to get rich. I already work on software that impacts millions of people with a good salary. Success to me, in the context of the work I talk about on osdev.org, is giving myself a challenge and having something that I'm proud to represent my abilities with.

tom9876543 wrote:With all due respect, implementing a new programming language on your niche OS, IMHO, is a waste of time.

With all respect, so is implementing your own operating system.

AndrewAPrice · Post by **AndrewAPrice** » Fri Aug 21, 2015 2:11 pm

I've been thinking about scheduling. Scheduling being defined as the problem of determining which task to execute next.

My initial design was a global task queue that I process through on a first-come-first-serve basis, but I've been thinking that it might be better to have a per-process task queue. The scheduler would cycle through each process with queued tasks and only run 1 task from each at a time.

1. Execution would be split between running applications at a finer granularity (rather than tasks being clustered together based on when they were queued), giving a better illusion of multitasking.

2. It stops one application from holding up the system by queuing millions of tasks, as other applications will still get their tasks executed.

3. It would allow prioritisation - only execute application tasks if there are no driver tasks queued.

4. Singletasking applications could opt out of concurrent execution - and have their tasks executed serially instead.

5. Easier cleanup - if a process is killed, there's no need to scan the task queue for any references to it, as we could just release that process's task queue with interfering with the rest of the system.

Potential downsides:
1. More overhead from more context switching than if we allowed tasks to cluster.

dschatz · Post by **dschatz** » Mon Aug 24, 2015 4:30 pm

I'm trying to understand so correct me if any of these assumptions are incorrect:
1) Tasks and Threads both have state (registers, stack, etc.)
2) Tasks and Threads should not have access to the state of other Tasks and Threads in other processes, likewise you must preempt (in the worst-case to prevent a long running task/thread).
3) The distinction between task and thread seems to be whether the state is long-lived (thread) or started from a well-known point (task) each time it executes

If 2 and 3 are true, then what is the advantage of mandating the task model vs letting userspace choose how it manages the processor time it receives?

AndrewAPrice · Post by **AndrewAPrice** » Mon Aug 24, 2015 5:44 pm

The differences between a task and a thread are in it's implementation and how you think about them.

Implementation:
Tasks are not preempted or interrupted (the exception is what to do with long running tasks that hold up the system.) They do not have their own stack or registers, they share it with the worker that's running the task. As such, they'll be implemented as a queue of function calls - an entry point address and any parameters. It should be very light weight to queue up thousands of tasks.

How you think about them:
I generally think of a thread as execution agents - they have state, they can sleep, they are generally long running. Tasks are short, they can't block, and they don't have state other than being a function call.

It's possible to emulate one with the other, but in my mind tasks feel superior because they're a natural way to write highly parallel code, while also being easy and intuitive, and incredibly light weight to implement. You can have millions of tasks queued with little penalty on memory.

Can you think of a use case where tasks would be inferior to threads?

Rusky · Post by **Rusky** » Mon Aug 24, 2015 10:04 pm

If tasks can't block then it sounds like you'll hit some of the same problems as cooperative scheduling, where a long-running computation can block other tasks. But once they can block, things get a lot less light-weight, since you can't just drop the old stack and processor state (possibly still better than threads though, as I imagine fewer tasks would be preempted).

AndrewAPrice · Post by **AndrewAPrice** » Tue Aug 25, 2015 8:23 am

Rusky wrote:If tasks can't block then it sounds like you'll hit some of the same problems as cooperative scheduling, where a long-running computation can block other tasks.

I'm anticipating a few of these things. For example, copying the screen buffer to the video card. What if it takes more than one time slice? What sort of granularity should I split the copy functionality into it's own task? (How do I know how many pixels I can copy <10ms?) If I make it too fine grained (e.g. a task per line), then the overhead of many function calls will start to outweigh the benefits. On many algorithms you could ask the question - "How much can I do in one task?"

That's why I need a solution to long running tasks. Continued below, but first I'll address this:

Rusky wrote:But once they can block, things get a lot less light-weight, since you can't just drop the old stack and processor state (possibly still better than threads though, as I imagine fewer tasks would be preempted).

That's why I want to make them non-blocking by making all IO asynchronous. It's possible you can still block by implementing your own spinlocks, but I feel that even when you want to write non-concurrent code it should be possible to do so without mutual exclusion. Rather than a spinlock, you could have mutual exclusion object that provides a lock functionality (by queuing a task when no handler is either queued or executed, and upon finishing, queue another task to run the next handler - it should loop through all handlers sequentially), and the handlers are executed one at a time.

It might look something like this:

Code: Select all

Mutex mutex;

someTask() {
   mutex.lock(function() {
       // exclusive use
   });
}

someOtherTask() {
   mutex.lock(function() {
       // exclusive use
   });
}

Back to talking about long running tasks. If we had no mechanism to handle long running tasks, my system would have the same downfalls as cooperative scheduling. But, we have the ability to detect when this happens and act upon it. It's possible to detect a long running task by setting a flag in the timer interrupt, clear it when we change tasks, and if it's still set next time we enter the timer interrupt we know a task has taken more than on time slice. The question is what do we do with this?

Could we be strict, and simply not tolerate it, blame the programmer for making their algorithms too coarse grained, and kill the task? In my opinion, this would be unacceptable because it could result in the loss of user data or diminish the functionality of applications.

The other mechanism, which I'm leaning towards (my mind is not made up if you can think of a better solution), is inside of the second timer interrupt, we promote the task into a thread. This will naturally be heavier as we will then have to allocate and copy the registers and stack over. But I'm hoping that this will be the odd case (the user runs a calculation, applies a filter in a photo editor, etc.) and most events can be handled quickly without taking up more than 1 time slice. (Even something that seems heavy like saving a document to disk would actually be very fine grained because you only write small chunks to disk at once.)

What about for something like optimizing a copy routine - like in a video card driver? Maybe a simple variable that says how long a task has been executing for and another variable that says if a task was pre-emptied (i.e. is a thread.) If we can detect that programmatically then algorithms can increase their granularity (do smaller blocks at once.)

There will always be the case of the naive or malicious programmer that writes slow software where every task is long running and becomes a thread. But, slow software is slow software and anyone can ignore any platform's best practices and write slow software. No operating system is immune, and users will realise that software is slow and switch to an alternative or complain to the developer to optimise it, just as you do with any software you use now.

I'm not concerned about too many tasks being promoted to threads, because:
- Very few tasks are going to be computationally heavy. Since all of the IO and IPC APIs will be asynchronous, most algorithms will naturally be divided up into tasks just by using the provided APIs.
- Since developers are learning a new language to write code for my OS, we can instill good behaviour of dividing algorithms up into tasks from the start.

Rusky · Post by **Rusky** » Tue Aug 25, 2015 10:23 am

Async IO is a really good solution for most uses of blocking, especially if your language includes some sort of continuation-passing style transformation (e.g. yield, C#'s async/await, although perhaps more implicit here) to avoid deeply-nested callbacks.

I'm not sure about automatically promoting tasks to threads- a well-written program will already know whether a task or a thread will be optimal. Perhaps the lowest-level interface could include both, and any kind of automatic promotion could be moved up into the language runtime so the ease of use isn't lost.

It might even make sense to let the language runtime or even some higher level library do any necessary context switching, as with scheduler activations- with a per-process task queue and the right kernel interface, it would be possible to implement threads entirely in the language, which would leave room to experiment with other concurrency primitives in the future.

willedwards · Post by **willedwards** » Thu Aug 27, 2015 1:04 am

(Just a ref about existing task/thread systems)

Many popular scripting languages have popular frameworks to do async in a single thread e.g. python, javascript etc. These stay single-threaded because the language implementations have Global Interpreter Locks (GIL). The programmer has to split long computations up into a sequence of subtasks to avoid starvation of other tasks; they are cooperative multitasking.

Other languages with multi-threading also have async frameworks, e.g. Java and C. You can use multi-threading, and in each thread use async cooperative multi-tasking.

Recently, Go (which compiles) has been making waves by being async but with a runtime that automatically does the tasks across some number of threads (GO_MAX_PROCS). Early on Go added preemption, using the checkpoints used to coordinate GC between threads to also do task accounting. That is, these are cooperative tasks but the compiler litters the code - including inner loops - with checks to see if it is time to yield. (You can balk at this, but those sync checks are in all mainstream precise multi-threaded GC languages already.)

Classically, when you have some number of tasks and some other number of threads, you use a "work stealing queue" system to consume tasks.

When you have worker threads, the classic 'trick' is to have a 'sentinel' thread that is running at 'idle' priority. The OS will only schedule this when there is free CPU. The runtime can use this to spin up a new worker thread.

I've worked on frameworks that have spread computationally expensive tasks across a large cluster of disparate hardware with dynamic joining and leaving of the cluster. The master server tracked the tasks with some estimate of the cost of each task, and tried to benchmark each node as results came in. As the list of outstanding tasks grew shorter, the final tasks might be executed on more than one node and the results from the first to return being used.

OSDev.org

Tasks, concurrency, and inter-process communication

Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication

Re: Tasks, concurrency, and inter-process communication