Normally I would apologize for such egregious thread necromancy, but since this was my very first thread on the OS Dev forum (back when it was at Mega-Tokyo), I don't feel too bad about it.
beyond infinity wrote:Anyway - it would be very interesting to have a sniff at QNX source code. But alas, ere this thing gonna happen the sun shines at midnight over the aequator.
And lo, the sun shone at midnight over the equator.
Here is the QNX source code, as announced on another thread. More importantly,
here is an awesome wiki article that answers my three-year-old questions about how to achieve pre-emption in a kernel that was not designed for SMP (I won't say "non-reentrant" anymore, since I was confused before and misusing the term).
The answer is so deceptively simple, and it was under my nose all along, but I didn't see it:
Kernel locking
In a uniprocessor system, only one thread is allowed to execute within the microkernel at a time. Most kernel operations are short in duration (typically a few microseconds on a Pentium-class processor). The microkernel is also designed to be completely preemptable and restartable for those operations that take more time. This design keeps the microkernel lean and fast without the need for large numbers of fine-grained locks. It is interesting to note that placing many locks in the main code path through a kernel will noticeably slow the kernel down. Each lock typically involves processor bus transactions, which can cause processor stalls.
In an SMP system, QNX Neutrino maintains this philosophy of only one thread in a preemptable and restartable kernel. The microkernel may be entered on any processor, but only one processor will be granted access at a time.
The magic word that I always wondered about but never figured out is "restartable".
In all other pre-emptible kernels I know of, each thread has its own kernel stack. Pre-empting the kernel in these systems is tantamount to switching threads, where the thread being suspended just happens to be running in kernel mode. It was conventional wisdom (or so I thought) that kernels with only a single kernel stack (per-processor, usually) are not pre-emptible. What confused me about the publicly available QNX documentation was that it suggested (but never confirmed) that QNX has only a single kernel stack, which would be protected by this global lock they kept hinting at.
QNX does in fact have only a single kernel stack per CPU, and most of its kernel is protected by what I guess I would describe as a custom spinlock (basically a bitmask that other CPUs can spin on). When a thread makes a system call on a CPU, it traps to the kernel and obtains the lock by setting the "in kernel" bit. It also disables pre-emption briefly by setting the "lock" bit while its context is saved to its TCB and the CPU switches to the kernel stack. Then, it clears the "lock" bit while doing work that is safe to be pre-empted. If a thread on another CPU makes a system call, it will trap and then immediately spin waiting for the "in kernel" bit to clear. Note that interrupts are enabled nearly all the time while this is happening (they are only disabled while switching to and from the kernel stack and while running the scheduler), and a CPU that is spinning waiting to enter the kernel can still handle interrupts. IPIs are used to notify the CPU that is running the kernel of any changes to the readiness of threads that might require the kernel to be pre-empted.
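If I had to sketch that lock in C, it might look something like the following. To be clear, this is just my mental model of what the docs describe, not QNX's actual code: the flag names (`IN_KERNEL`, `NO_PREEMPT`) and the use of C11 atomics are my own inventions.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define IN_KERNEL  0x1u  /* some CPU is currently executing in the microkernel   */
#define NO_PREEMPT 0x2u  /* the "lock" bit: kernel is in a non-preemptable phase */

static atomic_uint kernel_flags;  /* the bitmask other CPUs spin on */

/* Spin until no other CPU is in the kernel, then claim it.
   Interrupts would stay enabled on the spinning CPU the whole time. */
static void kernel_enter(void) {
    for (;;) {
        unsigned expected = atomic_load(&kernel_flags) & ~IN_KERNEL;
        if (atomic_compare_exchange_weak(&kernel_flags, &expected,
                                         expected | IN_KERNEL))
            return;  /* we flipped IN_KERNEL from 0 to 1 */
    }
}

static void kernel_exit(void) {
    atomic_fetch_and(&kernel_flags, ~IN_KERNEL);
}

/* Briefly disable/re-enable pre-emption around data-structure updates. */
static void preempt_disable(void) { atomic_fetch_or(&kernel_flags, NO_PREEMPT); }
static void preempt_enable(void)  { atomic_fetch_and(&kernel_flags, ~NO_PREEMPT); }

/* An ISR would only pre-empt the kernel when NO_PREEMPT is clear. */
static bool kernel_preemptable(void) {
    unsigned f = atomic_load(&kernel_flags);
    return (f & IN_KERNEL) && !(f & NO_PREEMPT);
}
```

Note there is only one shared word for the whole system, not a lock per data structure, which is exactly the "one big lock, but cheap" philosophy the quoted passage describes.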
The tricky part is how pre-emption is implemented. If I'm reading their wiki article correctly, every system call is implemented in two phases: all modifications to kernel data structures are grouped into one short window, and pre-emption is disabled only during that window. If an interrupt occurs (could be an IPI from another CPU, or just a device IRQ) while the kernel is pre-emptible, the ISR will notice that the "lock" bit is clear and, if a higher-priority thread has just become ready to run, it will simply run the scheduler and switch to that thread, effectively discarding the current contents of the kernel stack. This is safe because if the system call had started to modify kernel data structures, it would have set the "lock" bit first and the ISR never would have run in the first place.
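The ISR's decision boils down to two checks, which I'd sketch like this (again, names and flag layout are my guesses, not QNX's real code):

```c
#include <stdbool.h>

#define IN_KERNEL  0x1u  /* a CPU is executing in the microkernel      */
#define NO_PREEMPT 0x2u  /* the "lock" bit: mid-update, do not pre-empt */

/* Returns true when the ISR may discard the current kernel stack
   contents and run the scheduler instead of resuming the system call. */
bool isr_may_preempt(unsigned kernel_flags, int ready_prio, int current_prio) {
    if (kernel_flags & NO_PREEMPT)
        return false;              /* data structures are being modified */
    return ready_prio > current_prio;  /* only for a higher-priority thread */
}
```

Everything hinges on the invariant that the kernel stack only holds state that is safe to throw away whenever `NO_PREEMPT` is clear.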
The trick is, before the kernel (running in the context of the ISR) switches to the higher-priority thread, it modifies the program counter of the thread whose system call is being pre-empted (EIP/RIP in x86-speak) to point to the instruction that trapped in the first place (sysenter/syscall in x86-speak)! It's so simple it's brilliant. When the thread eventually gets re-scheduled, it will just trap again and re-start the system call that was aborted previously.
Once a system call finishes changing kernel data structures, it sets an "exit" flag so that, if it is pre-empted after that point, it will not be re-started.
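Putting the restart trick and the "exit" flag together, the context-switch path might look roughly like this. The struct layout, the field names, and the two-byte instruction length are all assumptions on my part (sysenter/syscall happen to encode in two bytes on x86, but real code would record the exact trap address rather than hard-code a length):

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSCALL_INSN_LEN 2  /* assumed: sysenter/syscall are 2 bytes on x86 */

struct saved_context {
    uint64_t rip;       /* saved PC, points just past the trapping instruction */
    bool     committed; /* the "exit" flag: kernel updates already applied     */
};

/* Called before switching away from a pre-empted system call: wind the
   saved PC back onto the trap instruction so the call re-issues itself
   the next time this thread runs -- unless it already committed. */
void restart_syscall(struct saved_context *ctx) {
    if (!ctx->committed)
        ctx->rip -= SYSCALL_INSN_LEN;
}
```

So a pre-empted-but-uncommitted call replays itself transparently, while a call pre-empted after its commit point just returns normally when rescheduled.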
I know there were some differences of opinion about the scalability of such a scheme compared to fine-grained locking... The issue is still not settled obviously. I just wanted to mention this technique because I think it's quite brilliant in its simplicity, and because it gives me a lot of closure since I've been wondering how the #@$!#%# QNX implements pre-emption for the better part of a decade now (since long before I found these forums)...