Normally I would apologize for such egregious thread necromancy, but since this was my very first thread on the OS Dev forum (back when it was at Mega-Tokyo), I don't feel too bad about it.
beyond infinity wrote:Anyway - it would be very interesting to have a sniff at QNX source code. But alas, ere this thing gonna happen the sun shines at midnight over the aequator.
And lo, the sun shone at midnight over the equator.
Here is the QNX source code, as announced on another thread. More importantly,
here is an awesome wiki article that answers my three-year-old questions about how to achieve pre-emption in a kernel that was not designed for SMP (I won't say "non-reentrant" anymore, since I was confused before and misusing the term).
The answer is so deceptively simple, and it was under my nose all along, but I didn't see it:
Kernel locking
In a uniprocessor system, only one thread is allowed to execute within the microkernel at a time. Most kernel operations are short in duration (typically a few microseconds on a Pentium-class processor). The microkernel is also designed to be completely preemptable and restartable for those operations that take more time. This design keeps the microkernel lean and fast without the need for large numbers of fine-grained locks. It is interesting to note that placing many locks in the main code path through a kernel will noticeably slow the kernel down. Each lock typically involves processor bus transactions, which can cause processor stalls.
In an SMP system, QNX Neutrino maintains this philosophy of only one thread in a preemptable and restartable kernel. The microkernel may be entered on any processor, but only one processor will be granted access at a time.
The magic word that I always wondered about but never figured out is "restartable".
In all other pre-emptible kernels I know of, each thread has its own kernel stack. Pre-empting the kernel in these systems is tantamount to switching threads, where the thread being suspended just happens to be running in kernel mode. It was conventional wisdom (or so I thought) that kernels with only a single kernel stack (per-processor, usually) are not pre-emptible. What confused me about the publicly available QNX documentation was that it suggested (but never confirmed) that QNX has only a single kernel stack, which would be protected by this global lock they kept hinting at.
QNX does in fact have only a single kernel stack per CPU, and most of its kernel is protected by what I guess I would describe as a custom spinlock (basically a bitmask that other CPUs can spin on). When a thread makes a system call on a CPU, it traps to the kernel and obtains the lock by setting the "in kernel" bit. It also disables pre-emption briefly by setting the "lock" bit while its context is saved to its TCB and the CPU switches to the kernel stack. Then, it clears the "lock" bit while doing work that is safe to be pre-empted. If a thread on another CPU makes a system call, it will trap and then immediately spin waiting for the "in kernel" bit to clear. Note that interrupts are enabled nearly all the time while this is happening (they are only disabled while switching to and from the kernel stack and while running the scheduler), and a CPU that is spinning waiting to enter the kernel can still handle interrupts. IPIs are used to notify the CPU that is running the kernel of any changes to the readiness of threads that might require the kernel to be pre-empted.
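If I had to sketch that lock in C, it might look something like the following. To be clear, this is just my mental model of what the docs describe, not QNX's actual code: the flag names (`IN_KERNEL`, `NO_PREEMPT`) and the use of C11 atomics are my own inventions.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define IN_KERNEL  0x1u  /* some CPU is currently executing in the microkernel   */
#define NO_PREEMPT 0x2u  /* the "lock" bit: kernel is in a non-preemptable phase */

static atomic_uint kernel_flags;  /* the bitmask other CPUs spin on */

/* Spin until no other CPU is in the kernel, then claim it.
   Interrupts would stay enabled on the spinning CPU the whole time. */
static void kernel_enter(void) {
    for (;;) {
        unsigned expected = atomic_load(&kernel_flags) & ~IN_KERNEL;
        if (atomic_compare_exchange_weak(&kernel_flags, &expected,
                                         expected | IN_KERNEL))
            return;  /* we flipped IN_KERNEL from 0 to 1 */
    }
}

static void kernel_exit(void) {
    atomic_fetch_and(&kernel_flags, ~IN_KERNEL);
}

/* Briefly disable/re-enable pre-emption around data-structure updates. */
static void preempt_disable(void) { atomic_fetch_or(&kernel_flags, NO_PREEMPT); }
static void preempt_enable(void)  { atomic_fetch_and(&kernel_flags, ~NO_PREEMPT); }

/* An ISR would only pre-empt the kernel when NO_PREEMPT is clear. */
static bool kernel_preemptable(void) {
    unsigned f = atomic_load(&kernel_flags);
    return (f & IN_KERNEL) && !(f & NO_PREEMPT);
}
```

Note there is only one shared word for the whole system, not a lock per data structure, which is exactly the "one big lock, but cheap" philosophy the quoted passage describes.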
The tricky part is how pre-emption is implemented. If I'm reading their wiki article correctly, every system call is implemented in two phases: all modifications to kernel data structures are grouped into one short window, and pre-emption is disabled only during that window. If an interrupt occurs (could be an IPI from another CPU, or just a device IRQ) while the kernel is pre-emptible, the ISR will notice that the "lock" bit is clear and, if a higher-priority thread has just become ready to run, it will simply run the scheduler and switch to that thread, effectively discarding the current contents of the kernel stack. This is safe because if the system call had started to modify kernel data structures, it would have set the "lock" bit first and the ISR never would have run in the first place.
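The ISR's decision boils down to two checks, which I'd sketch like this (again, names and flag layout are my guesses, not QNX's real code):

```c
#include <stdbool.h>

#define IN_KERNEL  0x1u  /* a CPU is executing in the microkernel      */
#define NO_PREEMPT 0x2u  /* the "lock" bit: mid-update, do not pre-empt */

/* Returns true when the ISR may discard the current kernel stack
   contents and run the scheduler instead of resuming the system call. */
bool isr_may_preempt(unsigned kernel_flags, int ready_prio, int current_prio) {
    if (kernel_flags & NO_PREEMPT)
        return false;              /* data structures are being modified */
    return ready_prio > current_prio;  /* only for a higher-priority thread */
}
```

Everything hinges on the invariant that the kernel stack only holds state that is safe to throw away whenever `NO_PREEMPT` is clear.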
The trick is, before the kernel (running in the context of the ISR) switches to the higher-priority thread, it modifies the program counter of the thread whose system call is being pre-empted (EIP/RIP in x86-speak) to point to the instruction that trapped in the first place (sysenter/syscall in x86-speak)! It's so simple it's brilliant. When the thread eventually gets re-scheduled, it will just trap again and re-start the system call that was aborted previously.
Once a system call finishes changing kernel data structures, it sets an "exit" flag so that, if it is pre-empted after that point, it will not be re-started.
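Putting the restart trick and the "exit" flag together, the context-switch path might look roughly like this. The struct layout, the field names, and the two-byte instruction length are all assumptions on my part (sysenter/syscall happen to encode in two bytes on x86, but real code would record the exact trap address rather than hard-code a length):

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSCALL_INSN_LEN 2  /* assumed: sysenter/syscall are 2 bytes on x86 */

struct saved_context {
    uint64_t rip;       /* saved PC, points just past the trapping instruction */
    bool     committed; /* the "exit" flag: kernel updates already applied     */
};

/* Called before switching away from a pre-empted system call: wind the
   saved PC back onto the trap instruction so the call re-issues itself
   the next time this thread runs -- unless it already committed. */
void restart_syscall(struct saved_context *ctx) {
    if (!ctx->committed)
        ctx->rip -= SYSCALL_INSN_LEN;
}
```

So a pre-empted-but-uncommitted call replays itself transparently, while a call pre-empted after its commit point just returns normally when rescheduled.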
I know there were some differences of opinion about the scalability of such a scheme compared to fine-grained locking... The issue is still not settled obviously. I just wanted to mention this technique because I think it's quite brilliant in its simplicity, and because it gives me a lot of closure since I've been wondering how the #@$!#%# QNX implements pre-emption for the better part of a decade now (since long before I found these forums)...