How to organize threads in an SMP scheduler

rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

How to organize threads in an SMP scheduler

Post by rdos »

Well, I've tried to extend my single-core scheduler to a multi-core environment, and it didn't do well performance-wise (it isn't even stable with 3 cores or more). The idea was to keep the scheduler-lists in a common segment, and to extend the scheduler-locking to work on multiple cores. The problem with this design is that it doesn't scale well, and that IO-bound threads waste a lot of time locking other cores out of the scheduler. Only one core at a time can be in the scheduler, which creates a lot of lock-contention. It also seems like video-related code should run on the same core, but this really isn't a problem in the OS design, as the usual way to handle the GUI is to have a dedicated thread for GUI-redraw.

So, it seems to be time to fix threads to cores. In this design, there would be no need for a spinlock in the scheduler. In fact, the single-core locking should be fully usable, provided cores do not manipulate each other's thread-lists during scheduling. The only thing required is the multi-core locking of the scheduler-lists that is already present and that uses a spinlock. The problem with supporting signals (and waking up threads) on another core could be solved by IPIs when the target thread is not executing on the current core. A special list (protected by a spinlock) could be created for pending wakeups, much like the case today when ISRs send signals while the scheduler is executing.
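A minimal sketch of the pending-wakeup design described above, in C. The names (`core_t`, `wake_on_core`, `drain_wakeups`) and the IPI counter are invented for illustration; a real kernel would send an actual IPI and insert the drained threads into its run queue:

```c
#include <stdatomic.h>

#define MAX_PENDING 64

typedef struct {
    atomic_flag lock;          /* per-core spinlock for the wakeup list */
    int pending[MAX_PENDING];  /* thread ids waiting to be scheduled */
    int npending;
    int ipis_sent;             /* stands in for a real IPI here */
} core_t;

static void spin_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void spin_unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Queue `tid` on `target`, raising an IPI only if the list was empty:
   if entries are already pending, the earlier IPI covers this one too. */
void wake_on_core(core_t *target, int tid)
{
    spin_lock(&target->lock);
    int was_empty = (target->npending == 0);
    if (target->npending < MAX_PENDING)
        target->pending[target->npending++] = tid;
    spin_unlock(&target->lock);
    if (was_empty)
        target->ipis_sent++;   /* real code: send_ipi(target) */
}

/* The target core drains its list inside its own single-core scheduler lock. */
int drain_wakeups(core_t *c)
{
    spin_lock(&c->lock);
    int n = c->npending;
    c->npending = 0;
    spin_unlock(&c->lock);
    return n;                  /* real code: insert each thread in the run queue */
}
```

The `was_empty` test is what keeps IPI traffic down: while entries are still pending, the earlier IPI has not been fully serviced and will pick up new entries too.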

Load balancing could be implemented with a kernel thread. It could take a thread from the core with the shortest time in the null-thread (this information is readily available) and move it to the core with the longest time in the null-thread. This could be done once a second, or possibly more frequently. Another possible optimization is to move ISRs that send signals to the core where the signal-handler thread is located (this would eliminate an IPI), or to move threads that send signals to each other onto the same core when possible (which would also eliminate an IPI).
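The selection rule above (shortest vs. longest null-thread time) can be expressed in a few lines. The helper names are hypothetical:

```c
/* Least time spent in the null-thread = busiest core: a candidate source. */
int pick_busiest_core(const unsigned long *null_time, int ncores)
{
    int best = 0;
    for (int i = 1; i < ncores; i++)
        if (null_time[i] < null_time[best])
            best = i;
    return best;
}

/* Most time spent in the null-thread = idlest core: the move destination. */
int pick_idlest_core(const unsigned long *null_time, int ncores)
{
    int best = 0;
    for (int i = 1; i < ncores; i++)
        if (null_time[i] > null_time[best])
            best = i;
    return best;
}
```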

Since timers are already per-core, threads waiting for timeouts can be handled fully locally (when a thread starts a timer, it will start it on the current core and will itself be assigned to the current core), and when the timer expires, the thread can be woken up locally without the need for an IPI, provided the thread has not been moved by the load-balancer thread.

EDIT: Some further study of the current scheduler indicates that the only primitives that reschedule a thread are a thread waiting for a timeout, a thread waiting for a signal, and a thread being blocked in a specified (shared) list. Threads waiting for a timeout are not in any lists, and thus cannot be moved by the load-balancer, so they can always be handled without IPIs. Threads waiting for a signal are moved to a special list when signalled. These can be reactivated from ISRs. Threads being blocked in a shared list are also moved to a special list when being reactivated, and thus can be reactivated from ISRs as well. This means the current primitive for locking the thread lists can be used to move threads into another core's thread list, followed by an IPI that in principle only does a scheduler lock/unlock on the target core to handle the potentially new threads wanting to execute (this looks exactly like a signal/wakeup from an ISR, which already works).

An interesting question is whether shared lists and core thread lists should use the same spinlock or not. Typically, the code to remove threads from one list and insert them into another is fast (doubly-linked lists), so this locking will be brief. Still, non-shared locks will scale better.
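For reference, here is the O(1) list move the argument relies on: a doubly-linked list with a dummy head, where the spinlock (shared or per-list) would bracket just the two pointer operations in `tlist_move`. All types and names are illustrative:

```c
typedef struct thread {
    struct thread *prev, *next;
    int id;
} thread_t;

typedef struct {
    thread_t head;             /* circular list with a dummy head node */
} tlist_t;

void tlist_init(tlist_t *l)    { l->head.prev = l->head.next = &l->head; }

/* Unlink a thread from whatever list it is on: two pointer writes. */
void tlist_remove(thread_t *t)
{
    t->prev->next = t->next;
    t->next->prev = t->prev;
}

/* Append at the tail: four pointer writes. */
void tlist_append(tlist_t *l, thread_t *t)
{
    t->prev = l->head.prev;
    t->next = &l->head;
    l->head.prev->next = t;
    l->head.prev = t;
}

/* The whole critical section under the spinlock is these two O(1) calls. */
void tlist_move(tlist_t *to, thread_t *t)
{
    tlist_remove(t);
    tlist_append(to, t);
}
```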
Last edited by rdos on Tue Apr 26, 2011 8:35 am, edited 2 times in total.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to organize threads in an SMP scheduler

Post by Brendan »

Hi,
rdos wrote:Well, I've tried to extend my single core scheduler to a multi core environment, and it didn't do well performance-wise (it isn't even stable with 3 cores or more).
Are you sure it's just the scheduler? I have a suspicion that the problems in the scheduler are hiding other problems elsewhere; and if you got the scheduler working very well then you'd start noticing problems with locking/scalability in the memory management, in the IPC, in the VFS, in the drivers and in everything else.

I'm wondering whether it'd be more practical to revert RDOS back to "single CPU only" and do little more than maintenance and bugfixes (no new features); while starting "RDOS 2" from scratch with a complete redesign (and limited backward compatibility or no backward compatibility at all).

I mean, from what I remember about the original RDOS, there's:
  • Device driver API/framework from early 1990s
  • Hardware task switching (now unsupported/deprecated by CPU manufacturers)
  • Segmentation (now unsupported/deprecated by CPU manufacturers)
  • No consideration for multi-CPU during the design or implementation of anything
  • No consideration for power management during the design or implementation of anything
This is probably just the beginning - the visible part of the iceberg that is above the water. Trying to get it working well on modern (multi-CPU, 64-bit) systems might be like trying to carve a large sculpture out of granite using nothing more than your forehead.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

Brendan wrote:Are you sure it's just the scheduler? I have a suspicion that the problems in the scheduler are hiding other problems elsewhere; and if you got the scheduler working very well then you'd start noticing problems with locking/scalability in the memory management, in the IPC, in the VFS, in the drivers and in everything else.
Yes. I think it is only the scheduler. All the problems are located in the scheduler code. The current code base is deployed on 40+ installations right now, so bugs in the scheduler should have shown up in those (single-CPU) installations, or in my test installations. I cannot be sure that all the code is SMP-safe (I think some still might not be), but many of these issues have been solved (gate patching, real-time clock).
Brendan wrote:I'm wondering whether it'd be more practical to revert RDOS back to "single CPU only" and do little more than maintenance and bugfixes (no new features); while starting "RDOS 2" from scratch with a complete redesign (and limited backward compatibility or no backward compatibility at all).
Possibly, but I'm not so keen on this idea.
Brendan wrote:Hardware task switching (now unsupported/deprecated by CPU manufacturers)
This has been removed (and successfully so).
Brendan wrote:Segmentation (now unsupported/deprecated by CPU manufacturers)
Yes, but I'm not likely to drop segmentation anytime soon.
Brendan wrote:No consideration for multi-CPU during the design or implementation of anything
True, but the code was implemented with multitasking in mind from the start, and uses a very limited number of basic synchronization primitives that the scheduler needs to implement. So this is not a big issue. There are probably still a few cli/sti sequences used to block thread switches that need to be changed to spinlocks, but those are very few.
Brendan wrote:No consideration for power management during the design or implementation of anything
I don't see this as a big issue either. It can be added without much hassle if needed / wanted.

Besides, the major rewrite will be in supporting C/C++ in the kernel, which is a project underway. This will eventually replace parts of the device-drivers with something that is more readable and maintainable, especially for people other than myself. It is possible that, after many of the drivers have been ported to C/C++, alternative environments based on a non-segmented memory model could be created, but that is quite some time in the future.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to organize threads in an SMP scheduler

Post by Brendan »

Hi,
rdos wrote:
Brendan wrote:Are you sure it's just the scheduler? I have a suspicion that the problems in the scheduler are hiding other problems elsewhere; and if you got the scheduler working very well then you'd start noticing problems with locking/scalability in the memory management, in the IPC, in the VFS, in the drivers and in everything else.
Yes. I think it is only the scheduler. All the problems are located in the scheduler code. The current code base is deployed on 40+ installations right now, so bugs in the scheduler should have shown up in those (single-CPU) installations, or in my test installations. I cannot be sure that all the code is SMP-safe (I think some still might not be), but many of these issues have been solved (gate patching, real-time clock).
I'm not talking about bugs, I'm talking about scalability problems.

For an example: imagine some file system code that can't handle more than one (read/write/open/close) request at a time, and uses a single mutex (or queue or something) to postpone additional requests while a request is already in progress. This can work correctly on systems with 128 CPUs, despite being a massive bottleneck that cripples performance when 2 or more CPUs are trying to use it at the same time. In this case, you can't fix the massive performance problem in the scheduler - you'd have to redesign the file system code to handle an arbitrary (unlimited?) number of concurrent requests (and make sure that all the underlying storage device driver/s don't have similar problems).
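Brendan's bottleneck can be shown schematically. Both versions below are correct, but the coarse one serializes requests to unrelated files behind a single lock. This is a toy sketch using spinlocks and invented names, not actual RDOS or file-system code:

```c
#include <stdatomic.h>
#include <string.h>

#define NFILES 4

static void spin_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void spin_unlock(atomic_flag *l) { atomic_flag_clear(l); }

static char file_data[NFILES][16];

/* Coarse design: ONE lock for the whole file system. Correct on any
   number of CPUs, but every request on every file queues behind it. */
static atomic_flag fs_lock = ATOMIC_FLAG_INIT;

void fs_write_coarse(int f, const char *s)
{
    spin_lock(&fs_lock);
    strncpy(file_data[f], s, 15);
    spin_unlock(&fs_lock);
}

/* Finer design: one lock per file, so requests to different files can
   proceed on different CPUs at the same time. (Zero-initialized flags
   are treated as clear here, which holds on common implementations.) */
static atomic_flag file_lock[NFILES];

void fs_write_fine(int f, const char *s)
{
    spin_lock(&file_lock[f]);
    strncpy(file_data[f], s, 15);
    spin_unlock(&file_lock[f]);
}
```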


Cheers,

Brendan
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: How to organize threads in an SMP scheduler

Post by Owen »

On the other hand, there is a lot of merit to making an orderly transition from an old to a new architecture:
  1. Where needed, define the interfaces that the new code will use
  2. Implement the new interfaces, whether as a wrapper for the old interface or the old wrapping the new (Alternatively, you could make them independent but build a bridge between them)
  3. Incrementally migrate the system to the new interfaces
  4. Remove the old interfaces
I think many people would agree that a VFS layer which is a bottleneck is still better than no VFS layer at all. There is no point in throwing away all of your investment in the old architecture at once.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

The VFS layer has been present for quite a while, and it was designed with buffering and concurrency from the start. It was not designed to use an asynchronous interface, because of design choices (the asynchronous interface is messy). It was also designed to use server-threads to handle disc-IO requests, and has the possibility of using disc-scheduling algorithms. It also buffers file data and directory entries in RAM. There is also a provision for read-ahead. So, the absence of main-stream SMP-systems did not affect design choices in any significant way. If SMP is present, handler threads could run on multiple cores.

However, the VFS layer is not from the late 80s/early 90s. It was implemented around 2000 or something like that. At that time I compared performance with Windows, and it outperformed Windows on most counts after adding file caching. As a side effect of the file-caching feature, memory-mapped files were developed.

I'm sure the design could have been better adapted for large servers, if this had been a likely target, but this was not the case. The likely target was (and still is) embedded systems.

EDIT: The VFS does not have a single lock, nor is it limited to a single request at a time. There are locks per file (for file-content-related accesses) and locks per directory (which come in read-only and read-write variants). The server threads are per disc, and work against a list of requests. So, the reason why there is no asynchronous IO interface has to do with a design policy, not with any difficulty in adding such an interface, which would be quite possible. Besides, some file-systems also use a variant of asynchronous caching by creating many requests, queueing and starting them, and then waiting for them all to complete. This is possible to do in the interface against the physical disc, but not against file-data.

OTOH, there is a 4GB limitation for file sizes, and a 2TB limitation for partition sizes, which might soon become questionable limitations.

Another feature that makes SMP-porting easier is how server-threads that rely on interrupts are implemented. The server-thread blocks waiting for a signal. The ISR will unblock the server thread. The signal feature is a safe way to wait for an event, regardless of whether it has already happened, happens as the thread enters "wait-for-signal", or happens while the thread is already blocked. These drivers do not need spinlocks. The spinlocks are instead implemented in the signal/wait-for-signal functions when needed in the scheduler. The multiple-wait interface is also implemented on top of the signal/wait-for-signal interface.
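The latching behavior described here is what makes the primitive safe in all three orderings. A single-threaded model of the semantics (invented names, and booleans standing in for real blocking and wakeup-list manipulation):

```c
#include <stdbool.h>

typedef struct {
    bool pending;   /* the signal latches here if nobody is waiting */
    bool waiting;   /* a thread is blocked in wait_for_signal */
} signal_t;

/* ISR side: either wake the waiter, or latch for the next wait.
   Returns true if a waiter was woken. */
bool signal_send(signal_t *s)
{
    if (s->waiting) {
        s->waiting = false;
        return true;        /* real code: move the thread to a wakeup list */
    }
    s->pending = true;
    return false;
}

/* Thread side: consume a latched signal immediately, else block.
   Returns true if the thread would block. */
bool wait_for_signal(signal_t *s)
{
    if (s->pending) {
        s->pending = false;
        return false;       /* signal already happened: no block at all */
    }
    s->waiting = true;
    return true;            /* real code: save context and reschedule */
}
```

Because the signal either latches or wakes, the driver never needs its own spinlock: the race between "event fires" and "thread starts waiting" is resolved entirely inside these two functions.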
Last edited by rdos on Tue Apr 26, 2011 1:19 pm, edited 3 times in total.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

Owen wrote:On the other hand, there is a lot of merit to making an orderly transition from an old to a new architecture:
Yes, and the major factor against redesigning the scheduler from scratch is that there would be no advanced test-apps to stress it and validate it. That alone is such a strong factor against doing a completely new scheduler that I feel it is enough of a counter-argument on its own. If there were some missing feature in the scheduler interface, it would be better to implement it in the current scheduler first, and then move the whole interface to the new scheduler.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

The first milestone in the new scheduler design is completed. Runnable threads are now queued per core, and not per system. It works on single-core CPUs, while the SMP scheduler is now broken. Next will be to move the wakeup and signal lists to cores as well, and to provide a core selector per thread so it can be requeued on the correct core. After that, the SMP scheduler locks should be removed, and SMP should instead use the single-core scheduler locking scheme. At this point it should be possible to start RDOS on an SMP processor again, but all threads will run on the BSP. Next, the load balancer should be implemented, and start shuffling threads to different cores.

EDIT: Moving the lists and increasing lock granularity (separate locks per core and for the common lists) is done. The SMP scheduler locks are removed and replaced with single-core locks. Next is to test the new scheduler on an SMP processor, and to start implementing the load balancer. I think it is easier if the load balancer just sets up some rules, and doesn't perform them. One rule could be to move a thread from core 2 to core 0, and let any operation on a thread located on core 2 switch the thread (putting it on the wakeup list on the destination core and sending that core an IPI). That way, the core thread lists will not need any extra protection with spinlocks in the scheduler. Only the wakeup list and the signal list need a spinlock.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

OK, so after reinserting the APIC timer (which I happened to remove before), the new scheduler now also works on both my SMP machines. They run everything on the BSP, and the AP cores just run their null-threads.

The GUI still performs poorly on my AMD Athlon, though. I don't know why that is. It has an ATI Radeon HD 4250, which should be a fairly new card, but it still performs poorly when used with the LFB. When the GUI app is switched away from "input focus", and it can work against ordinary RAM instead, performance increases by at least a factor of 100. Very strange.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: How to organize threads in an SMP scheduler

Post by Owen »

Is the LFB mapped as WC (Write Combining)? WC offers massive performance gains for write-only devices like graphics frame buffers (and frame buffers truly should be write-only; WC memory is not even self-coherent).

I presume also, of course, that the application isn't reading from the framebuffer.

In most cases, I would say that you should have the app draw to RAM and the GUI system should then blit that buffer to the display in one go.
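The pattern Owen suggests is a plain row-by-row copy from a system-memory back buffer to the framebuffer; sequential, write-only access like this is exactly what a WC mapping handles well. A hedged sketch with invented parameter names (pitches in bytes, which may differ between buffer and framebuffer):

```c
#include <string.h>
#include <stdint.h>

/* Copy a w x h pixel region from a RAM back buffer to the framebuffer.
   One memcpy per row keeps the writes sequential, so the CPU's
   write-combining buffers can merge them into large bursts. */
void blit(uint8_t *fb, size_t fb_pitch,
          const uint8_t *back, size_t back_pitch,
          int w, int h, int bytes_per_pixel)
{
    size_t row = (size_t)w * bytes_per_pixel;
    for (int y = 0; y < h; y++)
        memcpy(fb + (size_t)y * fb_pitch,
               back + (size_t)y * back_pitch,
               row);
}
```

The app renders (including any read-modify-write combine operations) against the back buffer in cacheable RAM, and only this copy ever touches the LFB.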
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

Owen wrote:Is the LFB mapped as WC (Write Combining)? WC offers massive performance gains for write-only devices like graphics frame buffers (and frame buffers truly should be write-only; WC memory is not even self-coherent).
I'll have to check on that.
Owen wrote:I presume also, of course, that the application isn't reading from the framebuffer.
The test application definitely does, as it exercises all types of combine-codes directly against the LFB, but this is usually not done in an ordinary app.
Owen wrote: In most cases, I would say that you should have the app draw to RAM and the GUI system should then blit that buffer to the display in one go.
Yes.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

Basics for the load balancer are now in place. It uses elapsed time per thread (this is kept with sub-microsecond resolution) to first calculate the time spent in null threads, and then calculate the differentials between cores to see what the past interval (currently 100ms) looked like. Then the elapsed times for the busiest (currently 32) threads are recorded in a sorted array. The load balancer will then pick threads to move so that load is more balanced (assuming the same load pattern) at the next read-out. When this works, the number of IPIs will also be taken into account. I plan to record the number of IPIs that each thread causes to each core, and then use the busiest IPI paths to try to move threads to cores that will minimize IPIs.
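The read-out step can be sketched as follows: core load is the interval minus null-thread time, and the busiest threads are kept sorted by elapsed time. Structures and names here are invented for illustration:

```c
#include <stdlib.h>

typedef struct {
    int tid;                 /* thread id */
    unsigned long elapsed;   /* CPU time used during the last interval */
} tstat_t;

/* Core load over the interval = interval length minus null-thread time. */
unsigned long core_load(unsigned long interval, unsigned long null_time)
{
    return interval - null_time;
}

/* Sort descending by elapsed time, so stats[0] is the busiest thread. */
static int by_elapsed_desc(const void *a, const void *b)
{
    unsigned long ea = ((const tstat_t *)a)->elapsed;
    unsigned long eb = ((const tstat_t *)b)->elapsed;
    return (ea < eb) - (ea > eb);
}

void sort_busiest(tstat_t *stats, int n)
{
    qsort(stats, n, sizeof *stats, by_elapsed_desc);
}
```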

EDIT: The action of moving a thread from one core to another is really simple. All the load balancer needs to do is modify p_core_sel in the thread control block to point to the new core. This can be done without locking. The scheduler will do the move as the thread is unblocked or preempted, when it notices that p_core_sel is not the current core.

The scheduler code to move a thread to another core is also really simple. All it does is take the core lock (a spinlock), record whether any pending requests are already present, add the thread to the wakeup-list, release the lock, and, if there were no pending requests, send a wakeup-IPI to the target core. The wakeup-IPI does a conditional scheduler lock/unlock, which will activate the thread if the scheduler is not busy (if it is busy, the thread will be activated as the scheduler lock goes down to zero).
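The two EDIT paragraphs can be condensed into a small model: the balancer only rewrites `p_core_sel` (the field named in the post), and the scheduler checks it at the next unblock or preemption. Everything besides that field name is invented for illustration:

```c
typedef struct {
    int p_core_sel;   /* the core this thread should run on */
} tcb_t;

/* Balancer side: a lock-free "request" to move the thread. No list is
   touched here, so no spinlock is needed. */
void balancer_move(tcb_t *t, int target_core)
{
    t->p_core_sel = target_core;
}

/* Scheduler side, run when `t` is unblocked or preempted on
   `current_core`: returns the core that should receive the thread
   (via its wakeup list and an IPI), or -1 to keep it local. */
int migration_target(const tcb_t *t, int current_core)
{
    return (t->p_core_sel != current_core) ? t->p_core_sel : -1;
}
```

Because the move only ever happens on the thread's current core, the per-core thread lists need no extra spinlock protection, exactly as the post argues.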
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

The load balancer now gives reasonable suggestions for moving threads. I had to set up a couple of restrictions before this worked properly:

* Do not move anything if the differential between core loads is less than 2ms (which translates to 2%).

* Only move threads from cores that are loaded more than the average core.

* Do not move threads that load the current core too much (currently set to 0.75 times the overloading of the current core). This means a single, busy thread will not be moved.

* Do not move threads that used less than 100us CPU time during the last interval.

Using these criteria, nothing happens as the command-line interpreter runs (this is correct). OTOH, when the disc is accessed, the load balancer sees enough load on the disc server thread, and thus suggests moving it (or the command-line interpreter) to a new core. Next I will see what happens as I actually do the move. :wink:
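The four restrictions can be combined into a single predicate. The thresholds come from the post; the function and parameter names are invented, and integer arithmetic stands in for the 0.75 factor:

```c
#include <stdbool.h>

/* All times in microseconds over the last balancing interval. */
bool should_move(unsigned long diff_us,      /* busiest minus idlest core load */
                 unsigned long src_load_us,  /* source core's load             */
                 unsigned long avg_load_us,  /* average core load              */
                 unsigned long thread_us)    /* candidate thread's CPU time    */
{
    if (diff_us < 2000)                 /* rule 1: differential >= 2ms */
        return false;
    if (src_load_us <= avg_load_us)     /* rule 2: only above-average cores */
        return false;
    unsigned long overload = src_load_us - avg_load_us;
    if (thread_us * 4 > overload * 3)   /* rule 3: thread > 0.75 * overload */
        return false;
    if (thread_us < 100)                /* rule 4: skip near-idle threads */
        return false;
    return true;
}
```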
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

OK, so the load balancer seems to work OK, but some synchronization primitives seem to fail. Now that the scheduler-locks are per core (and not per system), some things that could previously be handled by locking the scheduler can no longer be handled that way. Mostly it seems to be critical sections that are broken. These need to use the global list spinlock. However, the problem is when a thread is blocked on a critical section. It needs to keep the spinlock until it has been queued in the wait-list, but with the current scheduler design this is not possible. IOW, I need a new blocking function that presumes the global list spinlock is already taken, puts the thread in the target list, and releases the spinlock. Ideally, it should first put the thread in the section's blocked list, release the spinlock, save the thread context (registers) and reschedule. Alternatively: save the thread context, put the thread in the section's blocked list, release the spinlock, and reschedule. The latter scenario would create more lock-contention, but it would be faster when there is no lock-contention. Maybe critical sections should even contain a spinlock field, so that each section could have a private spinlock instead. This would increase the size of critical-section data for non-SMP systems, but it would scale better on SMP systems.
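The blocking function asked for here can be sketched in the first ("ideal") ordering: enqueue while the spinlock is still held, then release, then reschedule. Releasing before the enqueue would let a signalling CPU miss the thread entirely. All names are invented:

```c
#include <stdatomic.h>

#define MAXWAIT 16

typedef struct {
    atomic_flag lock;      /* the section's (or global list) spinlock */
    int waiters[MAXWAIT];  /* threads blocked on this section */
    int nwait;
} section_t;

static void spin_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void spin_unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Called with s->lock ALREADY held (taken when the section was found
   busy): queue the thread first, only then drop the lock, then block. */
void block_in_section(section_t *s, int tid)
{
    if (s->nwait < MAXWAIT)
        s->waiters[s->nwait++] = tid;  /* safe: lock is still held */
    spin_unlock(&s->lock);
    /* real code: save thread context (registers) and reschedule here */
}

/* Owner releasing the section: pop one waiter under the same lock.
   Returns the woken thread id, or -1 if nobody was waiting. */
int unblock_one(section_t *s)
{
    spin_lock(&s->lock);
    int tid = (s->nwait > 0) ? s->waiters[--s->nwait] : -1;
    spin_unlock(&s->lock);
    return tid;                        /* real code: queue on wakeup list / IPI */
}
```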
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: How to organize threads in an SMP scheduler

Post by rdos »

The load-balancer approach is a complete disaster for the message-passing application. It organizes the threads so that the common receiver-thread is on one core, and the sender threads are on the remaining cores. This means each message needs two IPIs, and performance becomes really bad. In fact, performance is so bad that it goes down by a factor of 7-8 when going from one core to three. Not only that, but the null threads will run most of the time while there is useful work queued on other cores that could be run. I'm beginning to suspect that threads ready for execution should be in a common pool instead. That way priorities will also be honored, something that does not happen when work is local.

OTOH, the message passing application might be a bad example, and more "normal" applications might benefit more from the load-balancing approach.