
Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 3:25 pm
by OSwhatever
The trend in computing for quite a while now has been towards more fine-grained parallelism. Systems ship with more and more CPU cores, and this trend seems set to continue; 16 or more cores will be common sooner rather than later. Still, most operating systems use the usual process/thread model, meaning that the application manages its own threads in whatever way it thinks is best. Yet the application doesn't really have much knowledge about the system it's running on or what CPU resources it has. Let's say you have a server that creates a pop-up thread for each request so that it can handle them in parallel. At one time it might get 200 requests, creating 200 threads, though the system has only 16 cores. It's as if you ran a company that hired 200 employees when only 16 can work at the same time; the others are on standby.

A "job" is a little like a thread, but with no thread resource allocated to it. Associated with the job is an entry point, just as with a thread, and arguments are passed to the entry point so that it can begin its task immediately. A job has a queue where requests can be stacked. As soon as a job gets a request, the kernel gives the job a thread, which is eventually scheduled onto a physical CPU. A job can receive several requests, but the kernel decides how many jobs are assigned a thread and queued for execution at any one time. The kernel knows exactly how many cores it has and can pick jobs so that it keeps the CPUs busy without over-allocating thread resources. When the job is done, the thread resource is given back to the kernel and cached for later use.
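To make the description above concrete, here is a minimal user-space sketch of such a job object. The names (job, job_post, job_drain) and the fixed-size ring buffer are illustrative inventions, not an existing API; job_drain stands in for the kernel attaching a cached thread to the job when cores are free:

```c
#include <stddef.h>

/* Hypothetical kernel-side view of a job: an entry point plus a
 * queue of pending requests. No thread is reserved until a request
 * actually arrives. */
typedef void (*job_entry)(void *request);

#define MAX_PENDING 16

typedef struct job {
    job_entry entry;                 /* runs once per request   */
    void     *pending[MAX_PENDING];  /* queued requests         */
    size_t    head, tail, count;     /* ring-buffer bookkeeping */
} job;

void job_init(job *j, job_entry entry) {
    j->entry = entry;
    j->head = j->tail = j->count = 0;
}

/* Client side: post a request; the kernel would later attach a
 * cached thread to drain the queue. Returns 0 on success. */
int job_post(job *j, void *request) {
    if (j->count == MAX_PENDING)
        return -1;                   /* queue full */
    j->pending[j->tail] = request;
    j->tail = (j->tail + 1) % MAX_PENDING;
    j->count++;
    return 0;
}

/* Kernel side: run up to `budget` requests, standing in for the
 * scheduler handing the job a thread while cores are free. */
void job_drain(job *j, size_t budget) {
    while (j->count > 0 && budget-- > 0) {
        void *req = j->pending[j->head];
        j->head = (j->head + 1) % MAX_PENDING;
        j->count--;
        j->entry(req);
    }
}
```

The point of the shape is that creating a job allocates only this small struct; stacks and register context come and go with the borrowed thread.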

The motivation for jobs rather than user-space-managed threads:

* Applications don't know anything about the system they're running on; the kernel can optimize pop-up threads better, and apps don't need to deal with threads just to get parallelism.
* In conventional OSes, most threads are allocated to do *nothing* most of the time. Most jobs are really short.
* Preemption happens rarely; requests are what usually drive task switching, so there is usually no need to remember thread state.
* The job model encourages programmers to design concurrency-friendly programs. Programmers simply create a job by passing a function pointer in user space to the kernel interface.
* The job entity can be extended to delegate jobs transparently to alien processing units such as DSPs or CPUs of other architectures.

This model is quite simple, but strangely enough I haven't seen it as a central part of a kernel. I see it more in run-time systems for high-level languages and in DSP operating systems. The question is why the job model isn't more widely adopted in operating systems; there is usually a catch I never thought of...

Re: Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 5:36 pm
by NickJohnson
I don't see the major difference between jobs and properly kernel-scheduled threads. Are you just saying (jobs = kernel threads) > user threads?

Re: Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 6:37 pm
by OSwhatever
NickJohnson wrote:I don't see the major difference between jobs and properly kernel-scheduled threads. Are you just saying (jobs = kernel threads) > user threads?
They are quite similar. The difference is how resources are handled.

Basically, normal threads can be used to emulate the job model. However, the proposal is to move this behaviour into the kernel and make it a fundamental part of the system design. The motivation is that we want a system that encourages fine-grained parallelism and is optimized for the hardware it is running on, without applications having to worry about resource allocation.

One way of looking at it is "functions that can run in parallel": the kernel handles the rest (well, almost; user programs still have to deal with reentrancy issues). It's a way to provide a convenient interface for the programmer.

Jobs could be executed in kernel mode or in user mode, depending on who the creator is.

Re: Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 6:48 pm
by gerryg400
I don't really see the difference either. If the kernel implements something like the pthread interface then is that the same as a job? The pthread interface allows each thread to have an entry point and arguments.

Pthreads don't usually have any resources (files, devices, memory) apart from a few hundred bytes of state information. This information is usually the kernel state of the thread.

I see that a 'job' has a queue associated with it. The problem with this is that the queues may become imbalanced and some threads may have a long queue while others have nothing to do. In my OS I associate queues with processes so that threads that aren't busy can get work from a single process queue. The queue is managed by the kernel.
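The single process-wide queue described above can be sketched with ordinary pthreads. The names and the fixed-capacity ring buffer are illustrative; a real kernel-managed queue would live behind a system-call boundary rather than in the process:

```c
#include <pthread.h>
#include <stddef.h>

/* One shared work queue per process; idle worker threads all pull
 * from it, so no per-job queue can sit full while another worker
 * has nothing to do. */

#define QCAP 128

typedef void (*work_fn)(void *arg);

typedef struct {
    work_fn fn[QCAP];
    void   *arg[QCAP];
    size_t  head, tail, count;
    int     closed;                 /* no more work will arrive */
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} work_queue;

void wq_init(work_queue *q) {
    q->head = q->tail = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

void wq_push(work_queue *q, work_fn fn, void *arg) {
    pthread_mutex_lock(&q->lock);
    q->fn[q->tail] = fn;
    q->arg[q->tail] = arg;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

void wq_close(work_queue *q) {
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Each worker thread loops: take the next item, or exit when the
 * queue is closed and drained. */
void *worker(void *p) {
    work_queue *q = p;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->closed)
            pthread_cond_wait(&q->nonempty, &q->lock);
        if (q->count == 0) {        /* closed and drained */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        work_fn fn = q->fn[q->head];
        void *arg = q->arg[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        fn(arg);                    /* run outside the lock */
    }
}
```

Because every idle worker waits on the same queue, the imbalance problem cannot occur; the trade-off is contention on the single lock.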

Re: Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 7:44 pm
by OSwhatever
gerryg400 wrote:I see that a 'job' has a queue associated with it. The problem with this is that the queues may become imbalanced and some threads may have a long queue while others have nothing to do. In my OS I associate queues with processes so that threads that aren't busy can get work from a single process queue. The queue is managed by the kernel.
In what you described, you have already allocated a set of threads, right? They pick from a common queue of job messages or something and get underway with the job. However, how do you know how many threads to create, so that the number is appropriate for your system?

With jobs, a thread is assigned only when the job is about to execute, and it is released when the job is done. A job always starts on a fresh thread.

I got this idea from a simple DSP system where the DSP is assigned simple, short jobs taken from a queue. The DSP just executes the jobs in order, without preemption. When newer hardware suddenly had several DSP cores, adapting was very simple, since each core just picks the next job in the queue; the scaling to several cores was basically already done. There is no concept of threads, since the thread is essentially the hardware itself, given that preemption isn't supported.

Now I wanted to see if this concept can be used in a kernel for general-purpose CPUs, and whether we can scale to several cores just as nicely as the DSP system did. I can see that many tasks are just small jobs that need to be executed; when one is done, you don't need the thread anymore, just as in the DSP case.

In reality this becomes a bit more complicated for general-purpose CPUs, since you might want preemption and separate stacks, which is where the thread context comes back in compared to the simple DSP system. However, only in-flight jobs need a thread, so on average fewer thread instances need to be created, since you reuse the existing ones.

Re: Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 8:37 pm
by gerryg400
Okay, I think I understand what you're saying now. The thread is the execution unit, so you actually only need one thread per core? And the queue of work to be done is really a queue of jobs?

BTW, have you seen QNX Neutrino? It has the idea of a thread pool. The thread pool manager (in user space) looks at the queue (in Neutrino it holds synchronous messages from clients) and at how many cores are available, and creates and destroys threads as required. If there is work to be done and there are spare cores (perhaps because some of the running threads have blocked waiting for IO or something), it creates more threads; as the work completes and the queue shrinks, it allows the excess threads to exit. It is built on top of the kernel thread interface. All that's required is a fast implementation of pthread_create, pthread_exit and pthread_join.

So the extra thing to think about is that many jobs on a general-purpose OS cannot run to completion, because they need to wait for other things to happen. You must use that waiting time to run other jobs, even if you don't have preemption. This probably happens less on a DSP system, where jobs are more likely to be computational.
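The grow/shrink decision such a pool manager makes can be reduced to a small policy function. The field names and thresholds here are illustrative, not QNX's actual algorithm:

```c
/* Illustrative sizing policy in the spirit of a thread-pool
 * manager: grow while there is queued work and a spare core,
 * shrink when idle threads outnumber the cores. */
typedef struct {
    int queued;    /* jobs waiting in the queue       */
    int running;   /* threads currently on a core     */
    int blocked;   /* threads waiting on IO/IPC       */
    int cores;     /* cores available to this process */
} pool_state;

/* Positive result: create that many threads. Negative: let that
 * many threads exit. Zero: leave the pool alone. */
int pool_resize(const pool_state *s) {
    int threads     = s->running + s->blocked;
    int spare_cores = s->cores - s->running;
    if (s->queued > 0 && spare_cores > 0)
        return s->queued < spare_cores ? s->queued : spare_cores;
    if (s->queued == 0 && threads > s->cores)
        return -(threads - s->cores);
    return 0;
}
```

The interesting case is the first branch: blocked threads free up cores, so the pool deliberately runs more threads than cores to keep the hardware busy during IO waits.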

Re: Thread vs job model for fine grained parallelism

Posted: Thu Dec 23, 2010 10:00 pm
by OSwhatever
gerryg400 wrote:Okay, I think I understand what you're saying now. The thread is the execution unit, so you actually only need one thread per core? And the queue of work to be done is really a queue of jobs?

BTW, have you seen QNX Neutrino? It has the idea of a thread pool. The thread pool manager (in user space) looks at the queue (in Neutrino it holds synchronous messages from clients) and at how many cores are available, and creates and destroys threads as required. If there is work to be done and there are spare cores (perhaps because some of the running threads have blocked waiting for IO or something), it creates more threads; as the work completes and the queue shrinks, it allows the excess threads to exit. It is built on top of the kernel thread interface. All that's required is a fast implementation of pthread_create, pthread_exit and pthread_join.

So the extra thing to think about is that many jobs on a general-purpose OS cannot run to completion, because they need to wait for other things to happen. You must use that waiting time to run other jobs, even if you don't have preemption. This probably happens less on a DSP system, where jobs are more likely to be computational.
The thread is needed for saving the state of the job if the job is preempted or blocks waiting for IPC. Stacks also need to be assigned to the thread. As soon as the job is scheduled to execute, the job effectively is a thread.

In practice, the queue of jobs can be a normal message queue, where each message is "the initial message" for the job. Rather than doing a receive at the beginning of the job function, the job gets the message as a function-call parameter, saving kernel calls.

The use case is often a server call: get a request, do something, respond with the result, and the job is done. During the lifetime of the job there is likely to be some IPC blocking, waiting on requests to other servers; during that time other jobs will execute. The number of thread objects allocated can be roughly estimated as the number of cores plus the number of in-flight jobs that are blocked or preempted.

The QNX thread pool is similar to the job model I'm talking about, "emulated" using the existing, well-known thread interface.

The job model also works quite well for processors like the Cell found in the PS3. Jobs don't need to run on the PowerPC core but can also be scheduled onto an SPE. The possibility arises that both the PowerPC and the SPEs have code for the job, and the kernel can then choose which core to use at the time. GPUs have ridiculous amounts of processing units, and the job concept would be quite suitable for that kind of hardware. As these systems are used more and more for general computing, the DSP-like concept becomes more and more interesting.

Re: Thread vs job model for fine grained parallelism

Posted: Fri Dec 24, 2010 4:42 am
by skyking
I don't think this is a matter of how to program user-space programs. IIRC, a thread only consumes the context struct (saved register set) as far as resources go; there should be no requirement that a thread has its own stack space, since you are allowed to change the stack pointer arbitrarily. That means you can, in user space, postpone the thread's memory consumption until the actual job is about to start.

Re: Thread vs job model for fine grained parallelism

Posted: Fri Dec 24, 2010 9:14 am
by NickJohnson
It really seems that this could be implemented as a particular scheduler policy if you have kernel threads. If the kernel scheduler creates a set of running threads and only pulls in new ones when those threads exit (which would be terrible for response time, btw), that is essentially the same as having a set of jobs in a queue waiting to be assigned to threads. If you implement threads properly (i.e. with demand-allocated stack space), they are pretty inexpensive to create, so the memory/time overhead would be insignificant.

Re: Thread vs job model for fine grained parallelism

Posted: Fri Dec 24, 2010 10:28 pm
by OSwhatever
I found out that Apple has implemented something similar called Grand Central Dispatch in Mac OS X.

http://en.wikipedia.org/wiki/Grand_Central_Dispatch

Re: Thread vs job model for fine grained parallelism

Posted: Sat Dec 25, 2010 3:54 am
by rdos
In reality, the job model is just a special case of kernel threads. With threads, the job model is basically implemented like this:

for (;;)
{
    WaitForJob();       /* block until the queue is non-empty */
    job = GetJob();     /* dequeue the next job               */
    ExecuteJob(job);    /* run it to completion               */
}

When there are multiple cores, the thread model will just create more threads in order to keep the cores busy when there is enough load.

In order for a job-model to be (potentially) more efficient, the following must hold:
* There must be multiple cores
* Jobs must be independent of each other

In my kernel, there are lots of examples of the above thread code. For instance, in the IDE driver there is one thread per physical disc. This driver will not benefit from the job model, simply because the physical disc does not allow multiple operations on different cores.

The best examples of where the job model would be useful are servers, such as database servers or web servers running on multicore processors.

As for the implementation, I could see different ways.
* It could be implemented in a user library. This implementation would try to use all cores within the current process only, which is a fair approximation. The server threads would be created in user space. Light-weight threads could be used.

* It could be implemented in the kernel, and then threads could potentially be reused between processes. Reusing threads between processes could be a costly operation, as these threads must be created in the kernel, and job scheduling would need to start in the kernel with a user-ret-frame. CR3 and other info in kernel threads might also need to be updated, resulting in TLB flushes. There is also a possibility that different processes would contain widely different jobs; prioritization could solve that, though.

Re: Thread vs job model for fine grained parallelism

Posted: Sat Dec 25, 2010 10:28 am
by bewing
The main reason for the existence of threading is not parallelism, exactly. Many APIs (e.g. POSIX) contain blocking function calls. If the basic logic of your app requires that it block on multiple API calls at the same time, or that it block and yet somehow still continue to run, then you need threads. The main purpose of threads is to provide async capabilities: to manage or avoid blocking, or to run calculations that would reduce responsiveness and need to be processed "in the background". You are correct that if you have more unblocked threads than cores, you are not buying any efficiency benefits. In fact, this will be true on almost any real-world system; there will usually be many more threads than cores.

But the point is that the system's kernel cannot know how many threads the basic logic of an app demands, in basically the same way that an app cannot really know how many cores are available to migrate threads to. And all APIs (AFAIK) need to provide some sort of blocking mutex, which means that some logic in some apps will eventually need local threading capabilities. Of course, it's possible to artificially separate an app into autonomous pieces that communicate via IPC, but that sort of thing gets ridiculous very quickly if you only need to thread one line of code.

Re: Thread vs job model for fine grained parallelism

Posted: Sun Dec 26, 2010 8:28 am
by OSwhatever
rdos wrote:* It could be implemented in the kernel, and then threads could potentially be reused between processes. Reusing threads between processes could be a costly operation, as these threads must be created in the kernel, and job scheduling would need to start in the kernel with a user-ret-frame. CR3 and other info in kernel threads might also need to be updated, resulting in TLB flushes. There is also a possibility that different processes would contain widely different jobs; prioritization could solve that, though.
Not necessarily; you could force the programmer to issue an "ExitJob" supervisor call when they are done with the job. If you also want to support the job function simply returning, the kernel could forge the return address to an impossible address, causing a trap: on the instruction-fetch fault at that impossible address, the page-fault handler can invoke the ExitJob call.

Finding a thread for the job can potentially be costly if you're out of threads and a new one needs to be created. For time-critical jobs you could make the thread "sticky", meaning that once the thread has been allocated it sticks to the job forever, so a new thread never has to be found for it.

Re: Thread vs job model for fine grained parallelism

Posted: Sun Dec 26, 2010 1:34 pm
by Owen
Moving the job system into the kernel would, to me, seem counterproductive in terms of increasing performance. After all, it implies a kernel round trip per job, an expensive proposition.

In fact, I'm working on a system where even most of the thread support is moved into user space. The user-space scheduler is responsible for creating up to N threads per priority level per process (where N = core count) and then scheduling user-space (i.e. app-level) threads between them. It must be noted that I am working on an aggressively asynchronous microkernel, where the goal is to reduce the number of system calls an app makes to an absolute minimum (for a CPU-bound app without latency requirements, this could quite easily be none in an average time slice), and anything which requires additional kernel involvement is the antithesis of that design.

Re: Thread vs job model for fine grained parallelism

Posted: Wed Jan 05, 2011 1:57 pm
by JamesM
Hi,
OSwhatever wrote:
rdos wrote:* It could be implemented in the kernel, and then threads could potentially be reused between processes. Reusing threads between processes could be a costly operation, as these threads must be created in the kernel, and job scheduling would need to start in the kernel with a user-ret-frame. CR3 and other info in kernel threads might also need to be updated, resulting in TLB flushes. There is also a possibility that different processes would contain widely different jobs; prioritization could solve that, though.
Not necessarily; you could force the programmer to issue an "ExitJob" supervisor call when they are done with the job. If you also want to support the job function simply returning, the kernel could forge the return address to an impossible address, causing a trap: on the instruction-fetch fault at that impossible address, the page-fault handler can invoke the ExitJob call.

Finding a thread for the job can potentially be costly if you're out of threads and a new one needs to be created. For time-critical jobs you could make the thread "sticky", meaning that once the thread has been allocated it sticks to the job forever, so a new thread never has to be found for it.
I researched this for my master's thesis. Instead of rehashing the ideas here in a subpar way, reading the literature review and hypothesis of said thesis might help either cement your own ideas or show you some new ones:

http://www.jamesmolloy.co.uk/rubbish-bin/project.pdf

The general idea is that fine-grained parallelism is best exploited by use of a safe language, i.e. a bytecode or virtual machine. That way all tasks can run in kernel mode, and the creation, switching and destruction of microthreads can be extremely fast.

The link in my signature below links to the compiler and language I designed for it; again, if you're interested.