Kernel Preemption

Posted: Fri Sep 11, 2009 5:14 pm
by NickJohnson
Currently, my kernel is set up to be non-preemptible - interrupts are disabled on all system calls until they return. There is one kernel stack per processor (and because I haven't added SMP support, there is only one stack) on which most of the system call code is run, as well as one special stack per task (there can only be one thread running per task) used for saving task state images. This makes things easier: the kernel stack persists through task switches, while the image-saving stack switches with the new task. This is important in my kernel because multiple operations can perform task switches (timer interrupts AND certain IPC types).

Up until now, I haven't really worried about this design, because all of the system calls are relatively fast (drivers are out of the kernel), and unlikely to slow things down much. But from this quote by Brendan:
One kernel stack that's used by everything would mean that only one thread can be in the kernel at a time, which has serious performance implications. For example, you might have 16 CPUs where one of the CPUs is running kernel code, and a second CPU might get an IRQ. In this case, does the second CPU need to wait until the first CPU leaves the kernel before it can use the kernel stack?

Maybe you need one kernel stack per CPU, or one kernel stack per process, or one kernel stack per thread? One kernel stack per thread is much easier to do, especially if your kernel can be pre-empted.
I'm now afraid I may eventually incur the wrath of "serious performance implications". Should I be worried, or is this a minor problem for me? I don't really plan to run my OS on servers, but I would also like to be as future proof as possible while keeping things reasonably simple. And how exactly would you implement one stack per task if you have multiple task switch points?

Also, is there any way to figure out exactly how much time a system call takes to perform? From the wiki, it seems like the RTC needs interrupts on, so it would be impossible to use it. Is there some sort of Bochs feature that could do this?

Re: Kernel Preemption

Posted: Fri Sep 11, 2009 9:04 pm
by tantrikwizard
NickJohnson wrote:Should I be worried, or is this a minor problem for me? I don't really plan to run my OS on servers, but I would also like to be as future proof as possible while keeping things reasonably simple.
It depends on the purpose of the kernel. Multi-core desktops are now the standard, with more and more cores becoming commonplace. Multiple cores have always been around in mainframes, but as the cost comes down and they become more common on desktops, they should be taken advantage of. There's a lot of material written on parallel and concurrent programming, mostly revolving around mainframes, but much of it is relevant for desktops. Ideally no core is ever idle. If mutual exclusion is required, then only a single core can execute the critical section or access the shared resource at a time, which can be very costly depending on the requirements. Ideally a kernel is lock-free, allowing all cores simultaneous access to all resources, so that the possibility of contention is reduced to zero. This isn't always possible, but it is the ideal.
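To make that concrete, here's a rough sketch of the difference using GCC's __sync builtins (the counter and both function names are invented for illustration, not taken from any particular kernel):

Code: Select all

/*** Locked vs. lock-free update of a shared counter (illustrative) ***/

#include <stdint.h>

static uint64_t shared_counter;      /* e.g. a statistic touched by every core */
static volatile char lock;

/* Mutual exclusion: only one core in the critical section at a time;
 * any other core that wants in burns cycles spinning. */
void increment_locked(void)
{
    while (__sync_lock_test_and_set(&lock, 1))
        ;                            /* spin until the holder releases */
    shared_counter++;                /* critical section */
    __sync_lock_release(&lock);
}

/* Lock-free: every core may attempt the update at once and the hardware
 * arbitrates, so no core can block the others indefinitely. */
void increment_lockfree(void)
{
    __sync_fetch_and_add(&shared_counter, 1);
}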


NickJohnson wrote:And how exactly would you implement one stack per task if you have multiple task switch points?
Reduce your task switch points to one, so that only one switch can occur at a time. I ran into a similar problem by using one timer IRQ to switch threads while another timer IRQ switches processes. In my case the IRQ handlers disabled interrupts until the thread or process switch was finished, then re-enabled them, so there is no chance of conflict. Of course this is implementation specific - it just depends on your design.
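Something like this, in other words (all names are made up - switch_context() stands in for whatever your kernel actually does):

Code: Select all

/*** Single guarded task switch point (sketch, invented names) ***/

struct task;
extern struct task *current_task;
struct task *pick_next_task(void);
void switch_context(struct task *from, struct task *to);

void do_task_switch(void)
{
    struct task *next;

    asm volatile ("cli");            /* no IRQ can start a second switch */
    next = pick_next_task();
    switch_context(current_task, next);
    asm volatile ("sti");            /* switch finished, IRQs safe again */
}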

NickJohnson wrote:Also, is there any way to figure out exactly how much time a system call takes to perform? From the wiki, it seems like the RTC needs interrupts on, so it would be impossible to use it. Is there some sort of Bochs feature that could do this?
Read a timer on entry and exit and compare them. RDTSC works pretty well for rough estimates, and precision in the millionths of a second is good enough for most cases.
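For reference, reading the counter from C usually looks something like this (GCC-style inline assembly; the helper name is mine):

Code: Select all

/*** Reading the time stamp counter from C ***/

#include <stdint.h>

/* RDTSC returns the low 32 bits of the counter in EAX and the high
 * 32 bits in EDX. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
Then it's just "start = rdtsc();" on entry and "cycles = rdtsc() - start;" on exit.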

Re: Kernel Preemption

Posted: Sat Sep 12, 2009 2:31 am
by Brendan
Hi,
NickJohnson wrote:Currently, my kernel is set up to be non-preemptible - interrupts are disabled on all system calls until they return. There is one kernel stack per processor (and because I haven't added SMP support, there is only one stack) on which most of the system call code is run, as well as one special stack per task (there can only be one thread running per task) used for saving task state images. This makes things easier: the kernel stack persists through task switches, while the image-saving stack switches with the new task. This is important in my kernel because multiple operations can perform task switches (timer interrupts AND certain IPC types).

Up until now, I haven't really worried about this design, because all of the system calls are relatively fast (drivers are out of the kernel), and unlikely to slow things down much. But from this quote by Brendan:
One kernel stack that's used by everything would mean that only one thread can be in the kernel at a time, which has serious performance implications. For example, you might have 16 CPUs where one of the CPUs is running kernel code, and a second CPU might get an IRQ. In this case, does the second CPU need to wait until the first CPU leaves the kernel before it can use the kernel stack?

Maybe you need one kernel stack per CPU, or one kernel stack per process, or one kernel stack per thread? One kernel stack per thread is much easier to do, especially if your kernel can be pre-empted.
I'm now afraid I may eventually incur the wrath of "serious performance implications". Should I be worried, or is this a minor problem for me? I don't really plan to run my OS on servers, but I would also like to be as future proof as possible while keeping things reasonably simple. And how exactly would you implement one stack per task if you have multiple task switch points?
The way I see it, there are three separate problems:
  • Can the OS support multiple CPUs? With one kernel stack for everything, the answer is no. For a desktop/server OS this is extremely bad (e.g. OS is only capable of using 25% of a modern quad core CPU). For something like an embedded system (where there's only one CPU anyway) it can reduce complexity though.
  • Can the kernel be preempted if a higher priority task becomes ready to run? With one kernel stack per CPU, the answer is no. For an example, imagine the kernel does something that could take 100 ms. If something causes a higher priority task to become ready to run (e.g. the user presses a key or moves the mouse, or the network card receives a packet, or a disk drive completes an operation, etc) but the kernel just started this lengthy operation, then it will take at least 100 ms before the higher priority task can be given CPU time. For something like a HTTP/FTP server (something that's constantly waiting for network packets and disk I/O) this could cripple performance. It can also make a GUI feel "sluggish" and non-responsive, and for real time systems it can be a major problem. However, it can be a reasonable approach if all kernel operations are extremely fast anyway (e.g. a small nano-kernel or micro-kernel).
  • Is the kernel interruptable? This has similar effects to "can the kernel be preempted" - if an IRQ occurs that should cause a higher priority task to preempt a lower priority task, but the CPU is running kernel code, then the IRQ isn't handled until after the CPU leaves the kernel (and the low priority task isn't preempted by the high priority task until after that happens). It also creates major problems on SMP systems, because "interrupts disabled" also means that CPUs don't respond to IPIs either. For an example consider "TLB shootdown" - two CPUs could be waiting for each other to handle an "invalidate a TLB entry" IPI, and because neither CPU returns to user mode (or enables interrupts) both CPUs wait forever. It is possible to prevent IRQs while allowing IPIs (or to allow some IPIs and some IRQs while masking other IRQs) by masking all/some IRQs in the interrupt controller rather than disabling interrupts in the CPU (e.g. "CLI"), and if you're using the I/O APIC you can use the local APIC's "task priority register" to block lower priority IRQs while allowing higher priority IRQs, but in general it's more hassle than it's worth (better to have an interruptable kernel, even if it's not a preemptable kernel). However, it may be reasonable (in some cases - e.g. embedded systems) to have a kernel that is non-interruptable, non-preemptable and single-CPU only.
From what you've said above, it sounds to me like your micro-kernel counts as "half a kernel stack per task"; where you could insert special "preemption points" in any lengthy operations (if necessary). If the "worst case length of time spent at CPL=0 before returning to CPL=3" for your micro-kernel is very fast (and realizing that for SMP this includes the maximum amount of time the kernel may need to wait before successfully acquiring any reentrancy locks), then your biggest problem may be that the kernel is non-interruptable (mainly, the IPI handling).
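For example, a "preemption point" in a lengthy operation could be as simple as this (preempt_requested and reschedule() are illustrative names, not a prescription):

Code: Select all

/*** Hypothetical preemption point inside a lengthy kernel operation ***/

extern volatile int preempt_requested;  /* set when a higher priority task wakes */
void reschedule(void);
void process_chunk(int i);

void lengthy_operation(int n_chunks)
{
    int i;

    for (i = 0; i < n_chunks; i++) {
        process_chunk(i);
        if (preempt_requested)       /* safe point: no locks held here */
            reschedule();
    }
}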
NickJohnson wrote:Also, is there any way to figure out exactly how much time a system call takes to perform? From the wiki, it seems like the RTC needs interrupts on, so it would be impossible to use it. Is there some sort of Bochs feature that could do this?
The RTC isn't precise enough anyway - you'd find that all kernel functions almost always take "0 ticks" and occasionally some of them might take "1 tick"; and you'd need to analyze how frequently each kernel function takes "1 tick" (and take into account how often each kernel function is used) and then use this information to estimate probabilities.

Probably the best/easiest way would be to get the TSC count when you enter the kernel (e.g. RDTSC), and then do "current_TSC_count - TSC_count_on_entry" to calculate how many cycles you spent in the kernel each time. You could use a table to keep track of this information. For example:

Code: Select all

/*** Similar code at every kernel entry point ***/

    entry_time = rdtsc();
    reason_for_kernel_entry = ...;    /* kernel API function number, IRQ number, etc. */

Code: Select all

/*** Similar code at every kernel exit point ***/

    time_taken = rdtsc() - entry_time;
    cycle_table[reason_for_kernel_entry].total_cycles += time_taken;
    cycle_table[reason_for_kernel_entry].count++;
    if(cycle_table[reason_for_kernel_entry].worst_case < time_taken) cycle_table[reason_for_kernel_entry].worst_case = time_taken;

Code: Select all

/*** Code used when kernel is preempted ***/

    time_taken = rdtsc() - entry_time;
    cycle_table[reason_for_kernel_entry].total_cycles += time_taken;
    cycle_table[reason_for_kernel_entry].count++;
    if(cycle_table[reason_for_kernel_entry].worst_case < time_taken) cycle_table[reason_for_kernel_entry].worst_case = time_taken;
    entry_time = rdtsc();
    reason_for_kernel_entry = TASK_PREEMPTION;
That way you can find out the average time everything takes ("total_cycles / count") and the worst case time everything takes.
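For completeness, cycle_table might be declared along these lines (field names taken from the snippets above; the table size is a guess):

Code: Select all

/*** Possible declaration for the cycle table ***/

#include <stdint.h>

#define MAX_ENTRY_REASONS 256        /* API function numbers + IRQs + TASK_PREEMPTION */

struct cycle_stats {
    uint64_t total_cycles;           /* sum of all samples, for the average */
    uint64_t count;                  /* number of kernel entries recorded */
    uint64_t worst_case;             /* longest single visit, in cycles */
};

static struct cycle_stats cycle_table[MAX_ENTRY_REASONS];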


Cheers,

Brendan

Re: Kernel Preemption

Posted: Sat Sep 12, 2009 1:00 pm
by NickJohnson
Thanks for the responses!
Is the kernel interruptable? This has similar effects to "can the kernel be preempted" - if an IRQ occurs that should cause a higher priority task to preempt a lower priority task, but the CPU is running kernel code, then the IRQ isn't handled until after the CPU leaves the kernel (and the low priority task isn't preempted by the high priority task until after that happens). It also creates major problems on SMP systems, because "interrupts disabled" also means that CPUs don't respond to IPIs either. For an example consider "TLB shootdown" - two CPUs could be waiting for each other to handle an "invalidate a TLB entry" IPI, and because neither CPU returns to user mode (or enables interrupts) both CPUs wait forever. It is possible to prevent IRQs while allowing IPIs (or to allow some IPIs and some IRQs while masking other IRQs) by masking all/some IRQs in the interrupt controller rather than disabling interrupts in the CPU (e.g. "CLI"), and if you're using the I/O APIC you can use the local APIC's "task priority register" to block lower priority IRQs while allowing higher priority IRQs, but in general it's more hassle than it's worth (better to have an interruptable kernel, even if it's not a preemptable kernel). However, it may be reasonable (in some cases - e.g. embedded systems) to have a kernel that is non-interruptable, non-preemptable and single-CPU only.
I don't quite see the difference between "preemptible" and "interruptible" here. Is it just that with "interruptible" you only turn off interrupts that could cause task switches?
tantrikwizard wrote:NickJohnson wrote:
Also, is there any way to figure out exactly how much time a system call takes to perform? From the wiki, it seems like the RTC needs interrupts on, so it would be impossible to use it. Is there some sort of Bochs feature that could do this?

Read a timer on entry and exit and compare them. RDTSC works pretty well for rough estimates, and precision in the millionths of a second is good enough for most cases.
I didn't even realize the x86 had a built in timestamp. I'm sure that millionths of a second will be more than accurate enough when using a "clock=slowdown ips=1000000"-set Bochs. In theory, wouldn't that let me count individual clock cycles?

Now that I think about it, the reason I have one stack per CPU is that it makes task switching possible at multiple points - which implies that I need that stack *only* when actually performing a task switch. I could easily have a per-task kernel stack that jumps to a non-interruptible per-cpu stack only at the key moment of a task switch. Although this would increase the number of stacks in my system to 4 :shock: (user stack, task image stack, kernel stack, task switch stack), I could keep things fully preemptible *and* lock-free outside of a few instructions in task switches.

Time for some refhacktoring!

Re: Kernel Preemption

Posted: Sat Sep 12, 2009 4:13 pm
by Brendan
Hi,
NickJohnson wrote:
Is the kernel interruptable? This has similar effects to "can the kernel be preempted" - if an IRQ occurs that should cause a higher priority task to preempt a lower priority task, but the CPU is running kernel code, then the IRQ isn't handled until after the CPU leaves the kernel (and the low priority task isn't preempted by the high priority task until after that happens). It also creates major problems on SMP systems, because "interrupts disabled" also means that CPUs don't respond to IPIs either. For an example consider "TLB shootdown" - two CPUs could be waiting for each other to handle an "invalidate a TLB entry" IPI, and because neither CPU returns to user mode (or enables interrupts) both CPUs wait forever. It is possible to prevent IRQs while allowing IPIs (or to allow some IPIs and some IRQs while masking other IRQs) by masking all/some IRQs in the interrupt controller rather than disabling interrupts in the CPU (e.g. "CLI"), and if you're using the I/O APIC you can use the local APIC's "task priority register" to block lower priority IRQs while allowing higher priority IRQs, but in general it's more hassle than it's worth (better to have an interruptable kernel, even if it's not a preemptable kernel). However, it may be reasonable (in some cases - e.g. embedded systems) to have a kernel that is non-interruptable, non-preemptable and single-CPU only.
I don't quite see the difference between "preemptible" and "interruptible" here. Is it just that with "interruptible" you only turn off interrupts that could cause task switches?
"Non-interruptible" means interrupts are disabled (e.g. "CLI" instruction). "Non-preemptable" means that task switches are disabled. It's possible to have an interruptable kernel that is not preemptable (e.g. if an IRQ should cause a task to be preempted, then the preemption is postponed until the task attempts to return to user-space). It's also possible (in theory) to have a non-interruptable kernel that is preemptable, although in practice it's almost always insane (as IRQs are almost always the only things that can cause preemption). :)
NickJohnson wrote:I didn't even realize the x86 had a built in timestamp. I'm sure that millionths of a second will be more than accurate enough when using a "clock=slowdown ips=1000000"-set Bochs. In theory, wouldn't that let me count individual clock cycles?
In practice, the CPU's time stamp counter may count cycles (which changes when the CPU's speed changes) but it may also count "ticks" (which don't change when the CPU's speed changes). You'd need to check the documentation for your CPU to determine what it actually counts (it's different for different CPUs).
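On CPUs that support it, CPUID leaf 0x80000007 reports an "invariant TSC" bit (EDX bit 8) when the counter ticks at a constant rate regardless of frequency changes; a quick check might look like this (assuming GCC-style inline assembly):

Code: Select all

/*** Checking for an invariant TSC via CPUID (sketch) ***/

#include <stdint.h>

static int tsc_is_invariant(void)
{
    uint32_t eax, ebx, ecx, edx;

    /* First make sure extended leaf 0x80000007 exists at all. */
    asm volatile ("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                          : "a" (0x80000000));
    if (eax < 0x80000007)
        return 0;

    asm volatile ("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                          : "a" (0x80000007));
    return (edx >> 8) & 1;           /* EDX bit 8: invariant TSC */
}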

For Bochs, every instruction takes exactly 1 cycle (e.g. a complex instruction like "IDIV" takes the same time as a very simple instruction like "CLD"), and the actual speed of the CPU makes no difference.
NickJohnson wrote:Now that I think about it, the reason I have one stack per CPU is that it makes task switching possible at multiple points - which implies that I need that stack *only* when actually performing a task switch. I could easily have a per-task kernel stack that jumps to a non-interruptible per-cpu stack only at the key moment of a task switch.
If you've got a per-task kernel stack, then why do you need to bother with a non-interruptable per-CPU stack at all?
NickJohnson wrote:Although this would increase the number of stacks in my system to 4 :shock: (user stack, task image stack, kernel stack, task switch stack), I could keep things fully preemptible *and* lock-free outside of a few instructions in task switches.
It might increase the number of types of stack to 4, but if there's 1000 tasks running and you've got one user stack per task, one task image stack per task, one kernel stack per task and one task switch stack per CPU, then that adds up to 3001 stacks (where most OS's would just have 2000 stacks for 1000 tasks)...


Cheers,

Brendan

Re: Kernel Preemption

Posted: Sat Sep 12, 2009 5:38 pm
by NickJohnson
Brendan wrote:Hi,
NickJohnson wrote:Now that I think about it, the reason I have one stack per CPU is that it makes task switching possible at multiple points - which implies that I need that stack *only* when actually performing a task switch. I could easily have a per-task kernel stack that jumps to a non-interruptible per-cpu stack only at the key moment of a task switch.
If you've got a per-task kernel stack, then why do you need to bother with a non-interruptable per-CPU stack at all?
It's because in my kernel there are two different circumstances (scheduling and a type of IPC) under which a task switch can occur, which both have to run different pieces of code before and after the switch. Also, one (the IPC) has to store some data that it will access from the other side of the switch, which can't fit into the registers. It's easy to make this work with a small stack that is preserved through task switches (i.e. 1:1 ratio with the CPU), which stores the data and keeps things consistent during the small sections surrounding the switch. Because there is only one of these stacks per CPU, and there is an extremely small amount of stuff that happens on it in a guaranteed number of cycles, it's (imho) best to stop all interrupts while using it to make things simpler. I'm more than fine with a microsecond or two of no interrupts.
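Roughly this shape, with all the names invented just to show the idea:

Code: Select all

/*** Per-CPU switch stack carrying IPC data across a switch (sketch) ***/

struct task;
extern struct task *current_task;
void switch_context(struct task *from, struct task *to);
void stash_on_switch_stack(const void *data, unsigned len);
void deliver_from_switch_stack(struct task *to);

void ipc_send_and_switch(struct task *dest, const void *msg, unsigned len)
{
    asm volatile ("cli");                 /* a few uninterruptible instructions */
    stash_on_switch_stack(msg, len);      /* survives leaving the sender's stacks */
    switch_context(current_task, dest);   /* now running on dest's stacks */
    deliver_from_switch_stack(dest);      /* read the message back out */
    asm volatile ("sti");
}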