SMP and hardware task switching technique

Matthew
Member
Posts: 48
Joined: Wed Jul 01, 2009 11:47 am

SMP and hardware task switching technique

Post by Matthew »

I've been hacking on a small hobby OS to add SMP support. The OS currently uses hardware task switching on a single processor. When the timer IRQ fires, the handler places the current task at the end of a runqueue, pulls the next task off the front, and (if a switch is needed) does a far jump to that task's TSS selector.
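For context, the switch itself is nothing more than a far jump through the target task's TSS selector. A minimal sketch of that jump (the selector value and GDT setup are whatever the kernel already has; only the ljmp is the point here):

Code: Select all

#include <stdint.h>

/* Switch to another task by far-jumping through its TSS selector.
 * The 32-bit offset in the far pointer is ignored by the CPU when the
 * selector refers to a TSS descriptor; only the selector matters. */
static inline void hw_task_switch(uint16_t tss_selector)
{
    struct {
        uint32_t offset;    /* ignored for a jump to a TSS */
        uint16_t selector;  /* GDT selector of the target task's TSS */
    } __attribute__((packed)) target = { 0, tss_selector };

    __asm__ volatile ("ljmp *%0" : : "m"(target) : "memory");
}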

This seems fairly simple, so I thought I would generalize it to SMP. I broadcast an IPI, and each processor (under a lock) places its current task on the queue, pulls a new task off, and far jumps to it. I know it's not a good protocol, but I thought it would be a simple way to start. It actually works in Bochs, sometimes, though I haven't handled exiting back to the idle state well yet.

A common issue arises when Processor 1 is running Task 1, Processor 2 is running Task 2, and they want to swap tasks. Task 1 stays marked busy until the far jump executes on Processor 1, and Task 2 stays busy until the far jump executes on Processor 2. Each processor then tries to jump to a TSS whose Busy bit is still set, so both take a GPF.

I have looked through the Intel System Programming manual and I don't see any discussion of proper technique for task switching in SMP configurations. I've thought about manually clearing the Busy bit, but this seems like a dangerous hack, since at some point you could have the same task running on two processors. I could possibly also create some kind of "trampolining" task, but this seems like an awful lot of overhead. Does anyone know a better way?
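For reference, the Busy-bit hack mentioned above is only a one-byte change: the Busy bit is bit 1 of the TSS descriptor's type field (0x9 = available 32-bit TSS, 0xB = busy). Roughly like this, with gdt[] standing in for wherever the kernel keeps its GDT -- and with exactly the race described above:

Code: Select all

#include <stdint.h>

/* x86 GDT entry layout; the type field is the low 4 bits of "access". */
struct gdt_entry {
    uint16_t limit_low;
    uint16_t base_low;
    uint8_t  base_mid;
    uint8_t  access;     /* P | DPL | S | type */
    uint8_t  gran;
    uint8_t  base_high;
} __attribute__((packed));

extern struct gdt_entry gdt[];   /* assumed: the kernel's GDT */

/* Force a TSS descriptor from "busy" (type 0xB) back to "available"
 * (type 0x9). Nothing stops another CPU from jumping to it an instant
 * later, which is why this is dangerous. */
static void clear_tss_busy(uint16_t tss_selector)
{
    gdt[tss_selector >> 3].access &= (uint8_t)~0x02;
}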
pcmattman
Member
Posts: 2566
Joined: Sun Jan 14, 2007 9:15 pm
Libera.chat IRC: miselin
Location: Sydney, Australia (I come from a land down under!)

Re: SMP and hardware task switching technique

Post by pcmattman »

One of the easiest solutions is to keep a "run queue" per CPU, and lock threads to that CPU. That way you don't have to worry about threads switching across CPUs.

If you do need to change a thread's CPU, you would need to lock the run queues of both CPUs (so the other CPU will block should it attempt to access the queue) and then you simply unlink the thread from the first queue and link it into the second queue.
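Roughly, with made-up spinlock and thread types (nothing here is from any particular kernel), and taking the two locks in a fixed order so that two CPUs migrating threads towards each other can't deadlock:

Code: Select all

#include <stddef.h>

typedef struct thread { struct thread *next; } thread_t;
typedef struct { volatile int locked; } spinlock_t;
void spin_lock(spinlock_t *l);     /* assumed to exist elsewhere */
void spin_unlock(spinlock_t *l);

#define MAX_CPUS 8

struct runqueue {
    spinlock_t lock;
    thread_t  *head, *tail;
} runqueues[MAX_CPUS];

static void rq_remove(struct runqueue *rq, thread_t *t)
{
    thread_t *prev = NULL, *cur = rq->head;
    while (cur && cur != t) { prev = cur; cur = cur->next; }
    if (!cur) return;                              /* not on this queue */
    if (prev) prev->next = cur->next; else rq->head = cur->next;
    if (rq->tail == cur) rq->tail = prev;
}

static void rq_append(struct runqueue *rq, thread_t *t)
{
    t->next = NULL;
    if (rq->tail) rq->tail->next = t; else rq->head = t;
    rq->tail = t;
}

/* Move a thread between CPUs while holding both run queue locks,
 * always locking the lower-numbered CPU first to avoid deadlock. */
void migrate_thread(thread_t *t, int from_cpu, int to_cpu)
{
    int lo = from_cpu < to_cpu ? from_cpu : to_cpu;
    int hi = from_cpu < to_cpu ? to_cpu : from_cpu;

    spin_lock(&runqueues[lo].lock);
    if (hi != lo) spin_lock(&runqueues[hi].lock);

    rq_remove(&runqueues[from_cpu], t);
    rq_append(&runqueues[to_cpu], t);

    if (hi != lo) spin_unlock(&runqueues[hi].lock);
    spin_unlock(&runqueues[lo].lock);
}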

Keep in mind this isn't necessarily the best solution, but it's probably the easiest ;).
Matthew
Member
Posts: 48
Joined: Wed Jul 01, 2009 11:47 am

Re: SMP and hardware task switching technique

Post by Matthew »

I was hoping to avoid locking threads to a CPU but I see it is probably necessary in some form. I am thinking about some kind of automatic migration protocol where there are per-CPU queues as well as a central runqueue -- and a thread may only be moved to the central runqueue if it is currently not busy and sitting on a per-CPU queue.

Thanks
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: SMP and hardware task switching technique

Post by Brendan »

Hi,
Matthew wrote:I was hoping to avoid locking threads to a CPU but I see it is probably necessary in some form. I am thinking about some kind of automatic migration protocol where there are per-CPU queues as well as a central runqueue -- and a thread may only be moved to the central runqueue if it is currently not busy and sitting on a per-CPU queue.
Locking threads to specific CPUs isn't necessary (although "CPU affinity" can be a useful feature for other purposes), but avoiding unnecessary task switches (e.g. if there's 2 CPUs and only 2 threads running) will improve performance, and trying to keep CPU caches warm will help too (e.g. trying to schedule a thread on the same CPU it used last time, in the hope that some of its data is still in that CPU's cache).
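As a sketch of the "keep the cache warm" part, placement might look something like this (last_cpu and the queue-length helper are made-up names, not from any real scheduler):

Code: Select all

typedef struct thread { int last_cpu; } thread_t;

#define MAX_CPUS 8
extern int cpu_queue_length(int cpu);   /* assumed helper */

/* Prefer the CPU the thread last ran on, unless another CPU is clearly
 * less loaded -- trades a little balance for warm caches. */
int pick_cpu(thread_t *t)
{
    int best = t->last_cpu;
    for (int cpu = 0; cpu < MAX_CPUS; cpu++)
        if (cpu_queue_length(cpu) + 1 < cpu_queue_length(best))
            best = cpu;
    return best;
}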

To improve performance further, for SMP it's generally good to avoid doing the same thing on many CPUs at the same time, to avoid having CPUs fighting for the same re-entrancy locks, cache lines, etc. For this reason using a broadcast IPI probably isn't a good idea - better to have a separate timer for each CPU where each CPU's IRQ occurs at different times (which is probably part of the reason why there's a local APIC timer built into each CPU).

Lastly, I wouldn't use a fixed timer IRQ for scheduling (especially if the local APIC's timer can be used), as you end up with many more IRQs than necessary. For example, you might do a task switch every 100 ms but have an IRQ every 1 ms, where 99% of the IRQs aren't needed. The PIT and the local APIC timers have a "one shot" mode, where you'd set the delay until the next IRQ occurs during the task switch. This can help to improve performance and has other advantages. For example, you can disable a CPU's timer when there's only one task for that CPU to run (to avoid the overhead of handling any unnecessary IRQs), which also helps power management (e.g. avoid taking an idle CPU out of a sleep state to handle an unnecessary IRQ).
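For example, re-arming the local APIC timer in one-shot mode during each task switch could look roughly like this (the register offsets are the standard local APIC ones; lapic_base and the calibrated tick rate are assumed to come from elsewhere in the kernel):

Code: Select all

#include <stdint.h>

extern volatile uint32_t *lapic_base;    /* mapped LAPIC MMIO (assumed) */
extern uint32_t lapic_ticks_per_ms;      /* from calibration (assumed) */

#define LAPIC_LVT_TIMER   (0x320 / 4)
#define LAPIC_TIMER_INIT  (0x380 / 4)
#define LAPIC_TIMER_DIV   (0x3E0 / 4)

/* Called during a task switch: one IRQ when the new time slice ends. */
void arm_timeslice(uint32_t ms, uint8_t vector)
{
    lapic_base[LAPIC_TIMER_DIV]  = 0x3;      /* divide by 16 */
    lapic_base[LAPIC_LVT_TIMER]  = vector;   /* mode bits clear = one-shot */
    lapic_base[LAPIC_TIMER_INIT] = ms * lapic_ticks_per_ms;  /* starts countdown */
}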

In addition "one shot" mode could be used to gain a huge amount of precision in the scheduler. For example, imagine if a task wants to sleep for exactly 123456 ns - you could program the timer to generate an IRQ just before the task is meant to stop sleeping, then do a task switch, then busy loop for the remaining time. In this case a more precise timer means doing more useful work before the task switch and less busy waiting after the task switch. For the PIT you'd have up to 838 ns precision (rather than the 1000000 ns precision you'd get by setting the PIT to 1 KHz, for e.g.), and the local APIC is far more precise (it's typically able to get better than 10 ns precision).

A fixed frequency timer IRQ does make it easier to keep track of real time, but keeping track of real time has almost nothing to do with scheduling/task switching, and often there's better ways to track real time (e.g. using RDTSC where possible to get extreme precision, in conjunction with some other timer to compensate for any drift). Single-CPU OSs (that don't support APICs) tend to use the same timer (e.g. PIT) for both purposes because there aren't enough timers; but IMHO that's more of an excuse than a reason, as the CMOS/RTC's periodic IRQ makes a reasonable alternative for tracking real time when nothing better is present.
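For example, something along these lines, where tsc_hz is measured at boot against the PIT or RTC and corrected for drift every so often:

Code: Select all

#include <stdint.h>

extern uint64_t tsc_hz;      /* TSC frequency, calibrated elsewhere (assumed) */
extern uint64_t boot_tsc;    /* TSC value recorded at boot (assumed) */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t uptime_ns(void)
{
    uint64_t delta = rdtsc() - boot_tsc;
    /* split the conversion so the multiply can't overflow 64 bits */
    return (delta / tsc_hz) * 1000000000ull
         + (delta % tsc_hz) * 1000000000ull / tsc_hz;
}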


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Matthew
Member
Posts: 48
Joined: Wed Jul 01, 2009 11:47 am

Re: SMP and hardware task switching technique

Post by Matthew »

These sound like good suggestions. Right now I am trying to just get to the point of having it work, so I've made lots of ugly decisions, like a "big kernel lock", and the scheduling algorithm is pretty bad. I was wondering about better ways to do the timer interrupt, and I think I will go ahead and use the LAPIC timer.
Matthew
Member
Posts: 48
Joined: Wed Jul 01, 2009 11:47 am

Re: SMP and hardware task switching technique

Post by Matthew »

I seem to have gotten it working acceptably well for now. In my original implementation, fork() created a new task that began in userspace. This created a problem with the kernel lock, because the new task could not unlock it. So I unlocked it before the task switch -- but this left a brief window where two CPUs could try to run the same task, which caused the problems above. I modified fork() to start the new task in kernelspace, release the lock there, and then IRET out to userspace. So the kernel lock can now be safely unlocked after the old CPU has relinquished the task.
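In case it helps anyone else, the shape of that fix is roughly the following (names invented; fork() is assumed to push a ring-3 IRET frame -- EIP, CS, EFLAGS, ESP, SS -- onto the new task's kernel stack and point the new TSS's EIP at the stub):

Code: Select all

extern void release_kernel_lock(void);

/* New tasks start here in kernel mode. By the time this runs the old CPU
 * has already far-jumped away, so it is safe to drop the lock and only
 * then return to user space via the frame fork() prepared. */
__asm__ (
    ".globl new_task_trampoline\n"
    "new_task_trampoline:\n"
    "    call release_kernel_lock\n"
    "    iret\n"
);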

I've also investigated the LAPIC timer and I've managed to adapt some of your code to obtain the CPU bus frequency, so hopefully I can use that to create per-CPU scheduler timer interrupts.
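In case anyone finds this thread later: the usual trick is to program the PIT for a known interval, let the local APIC timer count down over that interval, and see how far it got. A rough sketch (port I/O helpers and the lapic_base mapping are assumed):

Code: Select all

#include <stdint.h>

extern volatile uint32_t *lapic_base;    /* mapped LAPIC MMIO (assumed) */

static inline void outb(uint16_t port, uint8_t v)
{ __asm__ volatile ("outb %0, %1" : : "a"(v), "Nd"(port)); }
static inline uint8_t inb(uint16_t port)
{ uint8_t v; __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port)); return v; }

#define LAPIC_LVT_TIMER   (0x320 / 4)
#define LAPIC_TIMER_INIT  (0x380 / 4)
#define LAPIC_TIMER_CURR  (0x390 / 4)
#define LAPIC_TIMER_DIV   (0x3E0 / 4)

/* Returns local APIC timer ticks per millisecond (with divide-by-16). */
uint32_t calibrate_lapic_timer(void)
{
    /* PIT channel 2, mode 0, one-shot of ~10 ms (1193182 Hz input clock). */
    outb(0x61, (uint8_t)((inb(0x61) & 0xFC) | 0x01));  /* gate on, speaker off */
    outb(0x43, 0xB0);
    outb(0x42, 11932 & 0xFF);
    outb(0x42, 11932 >> 8);

    lapic_base[LAPIC_TIMER_DIV]  = 0x3;          /* divide by 16 */
    lapic_base[LAPIC_LVT_TIMER]  = 0x10000;      /* masked while calibrating */
    lapic_base[LAPIC_TIMER_INIT] = 0xFFFFFFFF;   /* start counting down */

    while (!(inb(0x61) & 0x20))                  /* wait for the PIT to expire */
        ;

    uint32_t elapsed = 0xFFFFFFFF - lapic_base[LAPIC_TIMER_CURR];
    lapic_base[LAPIC_TIMER_INIT] = 0;            /* stop the LAPIC timer */
    return elapsed / 10;                         /* ticks per millisecond */
}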
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: SMP and hardware task switching technique

Post by Brendan »

Hi,
Matthew wrote:Right now I am trying to just get to the point of having it work, so I have lots of ugly decisions like a "big kernel lock" and the scheduling algorithm is pretty bad.
It's common (at least for me) to use many locks instead of one, to reduce the chance of lock contention (or, the chance of CPUs getting nothing useful done while trying to acquire a lock that's held by a different CPU).

For running tasks, the idea of using separate queues or linked lists for each CPU has already been mentioned. It's something I use for my scheduler designs, as it means more locks and less chance of lock contention.

In general, when the scheduler is trying to decide which task to run next, it takes the first task off the current CPU's "ready to run" queue and runs that. When a task uses all of its time slice, the scheduler decides which CPU the task should run on next (which is where CPU load balancing is done) and puts the task on the end of that CPU's "ready to run" queue. If a task blocks then it's not put on any CPU's "ready to run" queue until it becomes ready to run again (for example, if a task calls "sleep()" then the scheduler switches to the first task on the current CPU's ready to run queue, and when the task finishes sleeping the scheduler decides which "ready to run" queue to put the task on).

This means that for a computer with 64 CPUs you've got 64 "ready to run" queues, with one lock each (64 locks total); where inserting a task or removing a task from a queue can be very quick, and the chance of a CPU not being able to acquire a lock immediately is very small (especially if CPUs are also using separate timers, and not all trying to acquire these locks at the same time).
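A stripped-down version of that per-CPU queue logic, one lock per queue (thread_t, the spinlock helpers, current_cpu(), pick_cpu() and switch_to() are placeholders for whatever the kernel already has):

Code: Select all

typedef struct thread { struct thread *next; } thread_t;
typedef struct { volatile int locked; } spinlock_t;
void spin_lock(spinlock_t *l);
void spin_unlock(spinlock_t *l);

#define MAX_CPUS 64

struct runqueue {
    spinlock_t lock;
    thread_t  *head, *tail;
} runqueues[MAX_CPUS];

extern int  current_cpu(void);
extern int  pick_cpu(thread_t *t);    /* the CPU load balancing decision */
extern void switch_to(thread_t *t);

/* Called when the current task's time slice runs out. */
void on_timeslice_end(thread_t *current)
{
    /* Decide which CPU the outgoing task should run on next and put it
     * on the end of that CPU's "ready to run" queue. */
    struct runqueue *rq = &runqueues[pick_cpu(current)];
    spin_lock(&rq->lock);
    current->next = NULL;
    if (rq->tail) rq->tail->next = current; else rq->head = current;
    rq->tail = current;
    spin_unlock(&rq->lock);

    /* Take the first task off this CPU's own queue and run it. */
    rq = &runqueues[current_cpu()];
    spin_lock(&rq->lock);
    thread_t *next = rq->head;
    if (next) {
        rq->head = next->next;
        if (!rq->head) rq->tail = NULL;
    }
    spin_unlock(&rq->lock);

    if (next && next != current)
        switch_to(next);
    /* if the queue was empty the CPU would fall back to its idle task */
}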

Of course this is just a beginning. For example, the scheduler might have 256 task priorities, with 256 "ready to run" queues for each CPU (one for each priority), where the scheduler takes the first task off the highest priority non-empty "ready to run" queue for the current CPU (which means that for a computer with 64 CPUs, you end up with 16384 queues and 16384 locks, and very little chance of lock contention).

Also note that the queues themselves don't necessarily need to operate as FIFO queues, and when a task is put on a "ready to run" queue you don't have to put it at the end of that queue. For example, you could implement a real time scheduler by sorting the tasks on each queue in order of deadline, so that the task with the earliest deadline is always at the head of the queue. You can also use a different method for each queue - for example, you might have 4 queues for each CPU, where the first queue is in FIFO order, the second queue is in "earliest deadline first" order, the third queue is in "most likely to block" order, and the fourth queue is in "time when last run" order.

Also note that scalability has become a real concern for OS design. In theory, with twice as many CPUs it might be possible to get twice as much work done, but in practice this will never happen (due to both hardware and software limitations). For example, an OS with extremely good scalability might get 90% more work done with twice as many CPUs, while an OS with bad scalability might only get 50% more work done with twice as many CPUs, and the good OS might be able to get more work done with fewer CPUs. For example, with 4 CPUs the first OS might get 3.61 times as much work done as it would with one CPU (1.9 × 1.9), while with 8 CPUs the second OS might only get 3.375 times as much work done as it would with one CPU (1.5 × 1.5 × 1.5).

I'd assume we've all realized that CPU manufacturers have (almost?) reached the limit of how much performance they can get out of one core, and have been increasing the number of cores (and the number of logical CPUs) instead. As a rough guide, here's a timeline for both Intel and AMD chips:
  • January 2002, Intel's first single core with hyper-threading (Northwood, Pentium 4)
  • May 2005, Intel's first dual core (Smithfield, Pentium D)
  • May 2005, Intel's first dual core with hyper-threading (Smithfield, Pentium EE)
  • August 2005, AMD's first dual core (Manchester, Athlon 64 X2)
  • January 2007, Intel's first quad core (Kentsfield, Core 2 Quad)
  • March 2008, AMD's first quad core (Agena, Phenom X4)
  • November 2008, Intel's first quad core with hyper-threading (Bloomfield, Core i7)
  • June 2009, AMD's first six core (Istanbul, Opteron 24XX)
And here's the same thing as a graph (see the attached graph2.png).
Now, if it takes 10 years to write an OS, then based on this information how many CPUs would you expect to see after another 10 years?

Of course I should point out that this is "CPUs per chip", and ignores computers with multiple chips (e.g. 2-way, 4-way and 8-way motherboards)...


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Matthew
Member
Posts: 48
Joined: Wed Jul 01, 2009 11:47 am

Re: SMP and hardware task switching technique

Post by Matthew »

I got it to work like a charm with the LAPIC timer interrupt, which is nice. I also discovered that QEMU re-routes the PIT IRQ to IRQ2 in APIC mode. I still need to figure out why it won't fully boot under VMWare.

My scheduler uses a set of 32 queues, one for each priority level, and there is a bitmap which tracks which queues are populated. Then each CPU grabs a kernel lock in turn before checking the bitmap and dequeuing the next task to run. That is my adaptation of a simple single-CPU scheduler.
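For anyone curious, the bitmap lookup in a scheduler like that boils down to a find-first-set; a minimal sketch (the types are placeholders, and the caller is assumed to hold the kernel lock):

Code: Select all

#include <stdint.h>

typedef struct thread { struct thread *next; } thread_t;

#define NUM_PRIORITIES 32

static uint32_t  ready_bitmap;                  /* bit n set = queue n non-empty */
static thread_t *queue_head[NUM_PRIORITIES];
static thread_t *queue_tail[NUM_PRIORITIES];

/* Dequeue the first task from the highest-priority non-empty queue,
 * treating the lowest set bit as the highest priority. */
thread_t *dequeue_highest(void)
{
    if (ready_bitmap == 0)
        return 0;                               /* nothing ready: run idle task */

    int prio = __builtin_ctz(ready_bitmap);     /* find first set bit */
    thread_t *t = queue_head[prio];

    queue_head[prio] = t->next;
    if (!queue_head[prio]) {
        queue_tail[prio] = 0;
        ready_bitmap &= ~(1u << prio);          /* that queue is now empty */
    }
    return t;
}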

Edit: VMWare actually routes the 8254 timer to two IRQs in APIC mode, IRQ0 as an ExtINT and IRQ2 as a fixed int. Go figure. Working now.