Finished new scheduler

mystran

Finished new scheduler

Post by mystran »

So... I finished writing a new scheduler for my kernel, and thought I'd share the "final" design with you, as it's (IMHO) about as simple as you can get without sacrificing functionality. Comments are welcome.

The basic unit of scheduling is a thread. A thread is really nothing more than a kernel stack: the thread's data block lives at the beginning of the stack space (like in Linux), and the stack pointer of an inactive thread is stored as the first field of the thread data. Stacks are switched directly with a cpu_switch_task(old, new) function. I initially thought about setjmp()/longjmp(), but decided that was just extra complexity.
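A minimal sketch of that layout in C (the field names here are illustrative, not from the actual kernel; only cpu_switch_task(old, new) is the real name):

```c
struct process;                      /* user-level state lives elsewhere */

/* The thread block occupies the first bytes of the stack allocation.
 * The saved stack pointer must be the FIRST field, so the assembly
 * switch routine can find it at offset 0. */
struct thread {
    void           *saved_sp;        /* SP of an inactive thread */
    int             status;          /* wakeup status, returned by scheduler_run() */
    struct process *proc;            /* the only user-level reference kept */
    struct thread  *queue_next, *queue_prev;   /* wait/run-queue links */
    struct thread  *timer_next, *timer_prev;   /* timer-queue links */
    unsigned long   fire_time;       /* tick at which the timer fires */
};

#define STACK_SIZE 4096              /* 4kb default, possibly 2kb later */

/* The stack grows down from the top of the allocation toward the block. */
static inline void *stack_top(struct thread *t)
{
    return (char *)t + STACK_SIZE;
}

/* Implemented in assembly: push callee-saved registers, store SP into
 * old->saved_sp, load new->saved_sp, pop, and return into the new thread. */
void cpu_switch_task(struct thread *old, struct thread *new);
```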

The threads are completely kernel-level things; the data block contains no fields for any user-level state (except a pointer to the process). When a trap/interrupt occurs from userspace, ESP is set (by hardware) to the end of the kernel thread's stack space, the entry code allocates a user_registers structure on the stack by pushing the registers, and passes a pointer to it as the argument to whatever C function was registered for the trap.
On return to userspace, the registers are popped from the stack again, or from a specially requested area; the initial user_registers structure can be anywhere in the stack when the thread first enters userspace.
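The register frame, sketched in C (this particular field order is hypothetical; it has to match whatever the assembly entry stub actually pushes):

```c
#include <stdint.h>

/* Hypothetical x86 frame: pusha order, then the stub's extras, then
 * what the CPU pushed on the way in from userspace. */
struct user_registers {
    uint32_t edi, esi, ebp, esp_unused, ebx, edx, ecx, eax;  /* pusha */
    uint32_t trap_no, err_code;                              /* pushed by stub */
    uint32_t eip, cs, eflags, user_esp, ss;                  /* pushed by CPU */
};

/* A trap handler registered with the kernel gets a pointer to the frame
 * the entry stub just built on the kernel stack. */
typedef void (*trap_handler_t)(struct user_registers *regs);
```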

Thread data blocks are linked directly into queues, and the queues themselves are basic doubly-linked lists. There are two types of queues: the general thread_queue type, and the timer's internal queues. Each thread can be queued (at any one time) in one normal queue and one timer queue. More on the timers later.
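Since the links are embedded in the thread block, queue operations need no allocation, and unlinking from the middle is constant time. A sketch, continuing the struct above (helper names are mine):

```c
#include <stddef.h>

struct thread_queue {
    struct thread *head, *tail;
};

void queue_append(struct thread_queue *q, struct thread *t)
{
    t->queue_next = NULL;
    t->queue_prev = q->tail;
    if (q->tail) q->tail->queue_next = t;
    else         q->head = t;
    q->tail = t;
}

/* Constant time: the doubly-linked list lets us unlink from anywhere. */
void queue_remove(struct thread_queue *q, struct thread *t)
{
    if (t->queue_prev) t->queue_prev->queue_next = t->queue_next;
    else               q->head = t->queue_next;
    if (t->queue_next) t->queue_next->queue_prev = t->queue_prev;
    else               q->tail = t->queue_prev;
}
```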

The scheduler itself is a basic FIFO priority scheduler: there is one thread_queue per priority level, and on scheduler_run(), the first thread from the highest-priority non-empty queue is removed from its queue and its stack is switched in. The scheduler keeps track of the current threads with an array (one pointer per CPU). After the stack switch, the (new) thread's status field is read and returned as the result of the scheduler_run() call.
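In sketch form (NUM_PRIORITIES, MAX_CPUS, this_cpu() and queue_remove_head() are all assumed helpers, not names from the actual kernel):

```c
#define NUM_PRIORITIES 8     /* assumed; higher index = higher priority */
#define MAX_CPUS       4

static struct thread_queue runqueue[NUM_PRIORITIES];
static struct thread *current[MAX_CPUS];     /* one pointer per CPU */

int this_cpu(void);                                        /* index of running CPU */
struct thread *queue_remove_head(struct thread_queue *q);  /* pop front, NULL if empty */

int scheduler_run(void)
{
    struct thread *self = current[this_cpu()];

    for (int p = NUM_PRIORITIES - 1; p >= 0; p--) {
        struct thread *next = queue_remove_head(&runqueue[p]);
        if (next == NULL)
            continue;
        current[this_cpu()] = next;
        cpu_switch_task(self, next);
        /* Execution resumes here when another thread switches back to
         * us; status records why we were made runnable again. */
        return self->status;
    }
    return 0;   /* idle handling omitted from this sketch */
}
```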

Similarly, semaphores contain a queue (this could be extended to have priorities too, but I thought that unnecessary) into which threads are inserted before calling scheduler_run(). Semaphores are the only thing (internally, anyway) you can wait on. If one wants to wait for multiple objects, those objects can all be redirected to post() on a single semaphore.
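A semaphore built on these primitives might look like this (again a sketch; sem_wait()/sem_post(), current_thread() and runqueue_insert() are assumed names, and locking/interrupt masking is omitted):

```c
struct semaphore {
    int                 count;
    struct thread_queue waiters;
};

struct thread *current_thread(void);     /* assumed helper */
void runqueue_insert(struct thread *t);  /* append to its priority queue */

/* Returns 0 on a normal post(), or an error status such as
 * ERROR_INTERRUPTED / ERROR_TIMER if the wait was broken. */
int sem_wait(struct semaphore *s)
{
    if (--s->count < 0) {
        queue_append(&s->waiters, current_thread());
        return scheduler_run();          /* blocks until woken */
    }
    return 0;
}

void sem_post(struct semaphore *s)
{
    if (++s->count <= 0) {
        struct thread *t = queue_remove_head(&s->waiters);
        t->status = 0;                   /* normal wakeup */
        runqueue_insert(t);
    }
}
```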

There are two complications to this basic model: interruptions and timers. Interruptions are simple: check whether the thread is queued somewhere (in a wait queue). If so, the thread is removed from that queue and inserted into the correct runqueue, and its status is set to ERROR_INTERRUPTED.
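Something like this, assuming the thread block also records which wait queue (if any) it currently sits in, via a wait_queue back-pointer I'm inventing for the sketch:

```c
#define ERROR_INTERRUPTED (-1)   /* actual value is an assumption */

void thread_interrupt(struct thread *t, int error)
{
    if (t->wait_queue != NULL) {           /* blocked on a semaphore? */
        queue_remove(t->wait_queue, t);    /* constant-time unlink */
        t->wait_queue = NULL;
        t->status = error;                 /* e.g. ERROR_INTERRUPTED */
        runqueue_insert(t);                /* make it runnable again */
    }
}
```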

Timers act the same way, but they are managed from timer interrupts. Each thread has a "fire time" (if its timer is active), and when the current "ticktime" equals the fire time, the thread is interrupted (with ERROR_TIMER instead). A reschedule is forced if the woken thread's priority is higher.

Now, by using doubly-linked lists for the queues (no malloc needed, because a thread is only ever in 1+1 queues), all operations are constant time except the timer interrupt handler. To make timer interrupt handling fast, there is a circular buffer of timer queues (currently 256). When a timer is set, the thread is put in timer_queue[fireoff % no_queues]. On each timer interrupt, ticktime is incremented, queue[ticktime % no_queues] is checked, and all threads with fireoff == ticktime are interrupted. The rest are simply skipped and left in the queue.
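This is essentially a classic timing wheel. A sketch of the two operations, using the timer links from the thread block (timer_queue_append/remove are the same list operations as above, just over the timer_next/timer_prev links):

```c
#define NO_QUEUES   256
#define ERROR_TIMER (-2)     /* actual value is an assumption */

static struct thread_queue timer_queue[NO_QUEUES];
static unsigned long ticktime;

void timer_queue_append(struct thread_queue *q, struct thread *t);
void timer_queue_remove(struct thread_queue *q, struct thread *t);

void timer_set(struct thread *t, unsigned long fireoff)
{
    t->fire_time = fireoff;
    timer_queue_append(&timer_queue[fireoff % NO_QUEUES], t);
}

/* Called from the timer interrupt. */
void timer_tick(void)
{
    struct thread *t, *next;
    struct thread_queue *q;

    ticktime++;
    q = &timer_queue[ticktime % NO_QUEUES];
    for (t = q->head; t != NULL; t = next) {
        next = t->timer_next;            /* t may be unlinked below */
        if (t->fire_time == ticktime) {
            timer_queue_remove(q, t);
            t->status = ERROR_TIMER;
            runqueue_insert(t);          /* may force a reschedule */
        }
        /* Timers further in the future stay queued and are skipped. */
    }
}
```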

Now, while the above scheme does not make the timer handler truly constant time, it is quite efficient in any case. As long as all timers are shorter than no_queues * length_of_tick, the handler is linear in the number of threads to be released! Obviously you can't do better than that. For longer timers we take a hit, because we need to skip them, but at least they are distributed across the queues. This might not sound that great, but consider: my tick length is 10 ms, so with 256 queues, timers of less than 2.56 seconds cause no penalty!

I think I'm quite happy with this one. Oh, and sorry for the long post. ;)
mystran

Re:Finished new scheduler

Post by mystran »

Showing that one can construct all the fun stuff from this set of primitives is left as an exercise for the reader. ;D
Pype.Clicker

Re:Finished new scheduler

Post by Pype.Clicker »

Is there a specific difference between your "thread datablocks" and the usual "thread control blocks", or is that just a term you preferred?
mystran

Re:Finished new scheduler

Post by mystran »

I probably should have called it a thread control block.
Pype.Clicker

Re:Finished new scheduler

Post by Pype.Clicker »

There's something that troubles me with TCBs "living in the thread's stack":
1. Couldn't VeryBadThings(tm) occur if user code messes up its stack, or did I miss some protection mechanism?
2. When linking threads into queues, how can you access thread X in another address space if the TCB is not in a shared, system-global memory area?
Candy
Re:Finished new scheduler

Post by Candy »

Pype.Clicker wrote: There's something that troubles me with TCBs "living in the thread's stack":
1. Couldn't VeryBadThings(tm) occur if user code messes up its stack, or did I miss some protection mechanism?
2. When linking threads into queues, how can you access thread X in another address space if the TCB is not in a shared, system-global memory area?
1. I'm going to guess he's making the page with the TCB supervisor-only, hence causing page-faults when the userlevel code overwrites it. Plus the page above it, to guard against evil kernel calls.
mystran

Re:Finished new scheduler

Post by mystran »

1. Don't confuse kernel stacks and user stacks. The TCB lives in the KERNEL stack. In fact, as far as the scheduler is concerned, there is no userspace; if a KERNEL thread wants to go into userspace, that's entirely its own problem (from the POV of the scheduler, anyway). The kernel won't do unbounded recursion, so kernel stack use is bounded. In fact, kernel stack use (for normal threads) is so small that I'm considering dropping the default stack size from 4kb to 2kb.

If a thread needs more stack, it can use a temporary stack anyway. Finally, specific kernel tasks can be given bigger stacks from the beginning.

2. Again, all the threads I'm talking about are kernel-level (as I said in the original description). All operating systems really allocate kernel stacks from system-global, non-pageable memory, because you don't want to switch VM contexts just to schedule a thread that shortly afterwards blocks again without ever going to userspace.

In my case, I allocate thread stacks with a normal malloc() since there aren't really any specific alignment requirements.


Hmmh.. anyway, I guess I'll have to write a small document about this thing, as I might have to explain it to someone sometime.