Hi,
rdos wrote:
Theoretically, there are 256 priorities, but in practice only a few are used. These are:
* Zero. For null threads
* One. For typical application threads
* Above one. For kernel threads.
But I still don't see how fixing threads to cores would work. Normally, threads are not executing some numerical application for hours, but do some short code-sequence, and then block waiting for events. Threads that are blocked consume no CPU time. Therefore, with static allocation of threads to cores, one core could have 5 blocked threads, while another could have 5 runnable threads, but could only work on one of them. Since these states change frequently, there would be a lot of overhead when trying to compensate by moving threads between cores in order to utilize cores as efficiently as possible.
In general, with a fixed number of spinlocks, twice as many CPUs means twice as many CPUs fighting for the same spinlock/s and twice as many task switches, which adds up to 4 times as much lock contention. With 256 queues and 256 spinlocks (one for each priority), if only a few of those priorities are actually used then you're effectively only using a few locks. Under heavy load; for 2 CPUs you might have a 1% chance of lock contention; for 4 CPUs you might have a 4% chance of lock contention; for 8 CPUs you might have a 16% chance of lock contention; for 16 CPUs you might have a 64% chance of lock contention; and for 32 CPUs you might get better performance by refusing to use half the CPUs.
I typically use "queue per priority, per CPU", with 1024 priorities (actually 4 scheduling policies with 256 priorities per scheduling policy). On a system with 2 CPUs I'll have 2048 queues with 2048 spinlocks and almost no chance of lock contention. On a system with 256 CPUs I'll have 262144 spinlocks and almost no chance of lock contention.
When the scheduler is deciding which task to switch to, it finds the highest priority (non-empty) queue for that CPU and takes the first task off that queue. This is fast and there's almost no chance of lock contention (regardless of how many CPUs). Load balancing, etc occurs when threads are placed on a queue, not when they're taken off a queue. When a thread is put onto a queue, you already know the thread's priority (or policy and priority) and its "CPU affinity"; out of the CPUs in the thread's "CPU affinity" you determine which CPU is the best CPU (e.g. has the least load), then put the thread on the queue that corresponds to the best CPU and the thread's priority.
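To make this more concrete, here's a rough sketch of the idea in C. This isn't my actual code - the spinlock is just a toy test-and-set lock to keep the sketch self-contained, and pick_best_cpu() stands in for whatever load balancing logic you like (one way to score CPUs is sketched further down):

Code: Select all
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CPUS        256
#define NUM_PRIORITIES  1024            /* 4 policies * 256 priorities each */

/* Toy test-and-set spinlock, just to keep the sketch self-contained.
 * (Each lock would need explicit ATOMIC_FLAG_INIT initialisation in real code.) */
typedef atomic_flag spinlock_t;
static void spinlock_acquire(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                               /* spin */
}
static void spinlock_release(spinlock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

typedef struct thread {
    struct thread *next;
    unsigned priority;                  /* 0..NUM_PRIORITIES-1, higher = more urgent */
    /* ...affinity mask, saved state, etc... */
} thread_t;

typedef struct {
    spinlock_t lock;
    thread_t *head;
    thread_t *tail;
} run_queue_t;

/* One queue (and one lock) per priority, per CPU - 262144 of them for 256 CPUs. */
static run_queue_t queues[MAX_CPUS][NUM_PRIORITIES];

/* Placeholder for the load balancer: picks the best CPU allowed by the
 * thread's affinity. */
extern unsigned pick_best_cpu(const thread_t *t);

/* Scheduler on CPU 'cpu': take the first thread off the highest priority
 * non-empty queue belonging to that CPU. */
thread_t *pick_next_thread(unsigned cpu)
{
    for (int prio = NUM_PRIORITIES - 1; prio >= 0; prio--) {
        run_queue_t *q = &queues[cpu][prio];
        if (q->head == NULL)            /* unlocked peek - cheap skip of empty queues */
            continue;
        spinlock_acquire(&q->lock);
        thread_t *t = q->head;
        if (t != NULL) {
            q->head = t->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        spinlock_release(&q->lock);
        if (t != NULL)
            return t;
    }
    return NULL;                        /* nothing runnable - run the idle/null thread */
}

/* Called when a thread becomes ready to run. Load balancing happens here,
 * when the thread is put on a queue, not when it's taken off one. */
void enqueue_thread(thread_t *t)
{
    run_queue_t *q = &queues[pick_best_cpu(t)][t->priority];

    spinlock_acquire(&q->lock);
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
    spinlock_release(&q->lock);
}

Scanning all 1024 priorities on every task switch sounds expensive, but a per-CPU bitmap of "which priorities have non-empty queues" turns finding the highest non-empty queue into a few bit-scan instructions.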
In theory this can lead to unbalanced load. For example, on a system with 2 CPUs each CPU could have 10 threads in its queues, and all the threads for one CPU could take ages while the threads for the other CPU block very quickly, leaving you with 10 threads for one CPU and none for the other. In practice this isn't a problem at all - high priority threads don't consume much CPU time and get re-balanced relatively quickly, and for the low priority threads (e.g. the CPU bound ones) the amount of time it takes for them to get CPU time doesn't matter so much (and there's less chance of them becoming unbalanced anyway).
Also note that the code to determine which CPU is the "best" CPU can be much more complex than just comparing CPU loads; and can take into account NUMA domains, hyper-threading, CPU temperatures, cache sharing, etc. People only pin threads to specific CPUs when it improves performance, and it only improves performance when the scheduler isn't very smart (e.g. doesn't take things like NUMA domains, hyper-threading and caches into account); so smarter (more complex) scheduling can mean more flexible load balancing (because the user won't pin threads).
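For example, "best CPU" could be a simple scoring loop like this (building on the thread_t sketch above). The cpu_info fields, helper names and weights are all invented for illustration - the point is that this only runs when a thread is placed on a queue, so making it smarter is cheap:

Code: Select all
/* Hypothetical per-CPU bookkeeping the scheduler keeps up to date. */
typedef struct {
    unsigned load;              /* e.g. number of ready threads, or recent CPU time used */
    unsigned numa_domain;
    unsigned temperature;       /* from the CPU's thermal sensor */
    unsigned sibling_busy;      /* non-zero if the hyper-threading sibling is busy */
} cpu_info_t;

extern cpu_info_t cpu_info[MAX_CPUS];
extern unsigned   num_cpus;

/* Assumed helpers: an affinity test, and the thread's preferred NUMA domain. */
extern int      affinity_allows(const thread_t *t, unsigned cpu);
extern unsigned thread_home_domain(const thread_t *t);

/* Lower score = better. The weights are arbitrary; a real scheduler would tune
 * them and could also look at shared caches, power management states, etc. */
unsigned pick_best_cpu(const thread_t *t)
{
    unsigned best_cpu = 0;              /* assumes the affinity mask allows at least one CPU */
    unsigned best_score = ~0u;

    for (unsigned cpu = 0; cpu < num_cpus; cpu++) {
        if (!affinity_allows(t, cpu))
            continue;

        unsigned score = cpu_info[cpu].load * 100;
        if (cpu_info[cpu].numa_domain != thread_home_domain(t))
            score += 500;                       /* penalise remote memory accesses */
        if (cpu_info[cpu].sibling_busy)
            score += 50;                        /* penalise sharing a core via hyper-threading */
        score += cpu_info[cpu].temperature;     /* mildly prefer cooler cores */

        if (score < best_score) {
            best_score = score;
            best_cpu = cpu;
        }
    }
    return best_cpu;
}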
rdos wrote:
Brendan wrote:
rdos wrote:
A design feature is that it should be possible to provide exact timers.
Define "exact". 100% accurate with infinite precision is never going to happen.
Something like "+/- 5 ms with 1 ms precision" is achievable (if it's backed up with NTP support). Something like "+/- 5 ms with 1 us precision" might also be possible depending on which CPU and/or if HPET or local APICs are present. Anything better than "+/- 1 ms with 1 ns precision" just isn't going to happen regardless of how good your software and hardware is.
Also don't forget that higher accuracy and/or higher precision means higher overhead - it's a compromise.
It depends on the CPU used. On a 386, it will be in the order of several ms, while on a modern CPU it will be far better (probably around a few hundred microseconds).
On a 386 (which doesn't support RDTSC, local APIC/s, HPET or ACPI timer), you can use the PIT chip to get an IRQ every 5 ms; then if/when "high precision" is needed you can read the "remaining count low byte" in the PIT chip to determine how much of the current 5 ms has been consumed. In this case you can get about 2 us precision. Of course this has relatively high overhead (e.g. it'd take several us to read the "high precision" time, so you wouldn't want to do that too often). On multi-CPU systems the PIT gets worse, because you'd need some sort of lock (so that a CPU doesn't read the PIT count while another CPU is handling the PIT's IRQ or something). Basically it's a compromise between precision and overhead, and I'd understand if you decided that the overhead of reading the "remaining count low byte" wasn't worth the extra precision.
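For illustration, the "IRQ counter plus remaining count" trick looks something like this (x86, GCC/Clang inline asm; the "ticks" counter is assumed to be incremented by the PIT IRQ handler, and on multi-CPU systems the whole read would sit under the lock mentioned above):

Code: Select all
#include <stdint.h>

/* Minimal port I/O helpers. */
static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
}
static inline uint8_t inb(uint16_t port)
{
    uint8_t val;
    __asm__ __volatile__("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

#define PIT_FREQ_HZ   1193182u          /* PIT input clock, ~1.193182 MHz */
#define PIT_RELOAD    5966u             /* ~5 ms between IRQs */

extern volatile uint64_t ticks;         /* incremented by the PIT IRQ handler (assumed) */

/* Nanoseconds since boot: coarse part from the IRQ count, sub-tick part from
 * the PIT's remaining count. Interrupts are assumed to be disabled here
 * (otherwise an IRQ between reading 'ticks' and latching the count makes the
 * result jump); on SMP, take the lock too. */
uint64_t pit_time_ns(void)
{
    uint64_t base = ticks;

    outb(0x43, 0x00);                       /* latch channel 0's current count */
    uint16_t remaining = inb(0x40);         /* low byte first... */
    remaining |= (uint16_t)inb(0x40) << 8;  /* ...then high byte */

    uint32_t elapsed = PIT_RELOAD - remaining;   /* ticks into the current 5 ms */
    return base * 5000000ull
         + (uint64_t)elapsed * 1000000000ull / PIT_FREQ_HZ;
}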
There's also the CMOS/RTC which has a "periodic IRQ" that can be configured for between 2 Hz and 8192 Hz. It's unreliable at higher frequencies on some computers (I wouldn't use 2048 Hz or faster), but at 1024 Hz you could use the IRQ to increment a counter and end up with about 1 ms precision. It has fairly high overhead though (it's a legacy device with slow IO port accesses, where you need to read a status register every IRQ in addition to sending an EOI).
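A sketch of the setup and the IRQ side (standard CMOS/RTC register numbers, reusing the inb()/outb() helpers and <stdint.h> from the PIT sketch above; interrupts are assumed to be disabled while poking the CMOS index/data ports):

Code: Select all
/* Rate N gives 32768 >> (N - 1) Hz, so N = 6 gives 1024 Hz. */
void rtc_init_1024hz(void)
{
    outb(0x70, 0x8A);                   /* select register A (NMI disabled) */
    uint8_t a = inb(0x71);
    outb(0x70, 0x8A);
    outb(0x71, (a & 0xF0) | 0x06);      /* rate = 6 -> 1024 Hz */

    outb(0x70, 0x8B);                   /* select register B (NMI disabled) */
    uint8_t b = inb(0x71);
    outb(0x70, 0x8B);
    outb(0x71, b | 0x40);               /* set PIE (periodic interrupt enable) */
}

volatile uint64_t rtc_ticks;            /* each tick is ~0.977 ms at 1024 Hz */

/* The IRQ 8 handler must read register C, or no further IRQs arrive. */
void rtc_irq_handler(void)
{
    rtc_ticks++;
    outb(0x70, 0x0C);
    (void)inb(0x71);                    /* acknowledge in the RTC itself */
    /* ...then send an EOI to the PIC/APIC... */
}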
For the local APIC, the main clock runs at the speed of the CPU's bus, which (depending on the computer) can be from about 33 MHz to about 1 GHz. Just like the PIT, you can read its "remaining count", and by doing that you can end up with between 30 ns and 1 ns precision. Unfortunately, in some systems (with HyperTransport or QuickPath, rather than a single/common FSB) the local APICs in different CPUs aren't synchronised and may run at slightly different speeds. This can make it harder to keep different CPUs synchronised. Also, the local APIC timer is typically used in multi-CPU systems for the scheduler's IRQ, and using it for the current time as well can complicate things a little more. This ends up being a compromise too - maybe the extra precision isn't worth the (only slightly higher) overhead and the extra complexity.
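Reading the "remaining count" is just two MMIO reads. A sketch, assuming the local APIC sits at its usual default physical address (the real base comes from the IA32_APIC_BASE MSR, and needs to be mapped if paging is enabled):

Code: Select all
#include <stdint.h>

#define LAPIC_BASE            ((uintptr_t)0xFEE00000)
#define LAPIC_TIMER_INITCNT   0x380     /* Initial Count register */
#define LAPIC_TIMER_CURCNT    0x390     /* Current Count register */

static inline uint32_t lapic_read(uint32_t reg)
{
    return *(volatile uint32_t *)(LAPIC_BASE + reg);
}

/* Bus-clock ticks elapsed since the timer was last loaded. Converting to
 * nanoseconds is ticks * 1000000000 / bus_hz, where 'bus_hz' is measured
 * during boot (e.g. against the PIT). */
uint32_t lapic_elapsed_ticks(void)
{
    uint32_t initial = lapic_read(LAPIC_TIMER_INITCNT);
    uint32_t current = lapic_read(LAPIC_TIMER_CURCNT);
    return initial - current;           /* the timer counts down from 'initial' */
}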
For the TSC, the speed is between 33 MHz and about 4 GHz, so it works out to about 30 ns to 0.25 ns precision. Unfortunately (as you already know) the TSC is messy and can take a lot of work to avoid all the potential problems. This ends up being a compromise too - maybe the extra precision isn't worth the extra complexity.
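Reading the TSC itself is the easy part (the messy parts are elsewhere - calibrating it, checking it's invariant, keeping CPUs in sync, etc):

Code: Select all
#include <stdint.h>

/* GCC/Clang, x86. On out-of-order CPUs you may want a serialising instruction
 * (CPUID) or RDTSCP if exact ordering against surrounding code matters. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}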
Then there's the ACPI timer. This has a main clock running at 3.579545 MHz that increments a counter. It's relatively easy to read and easy to get 280 ns precision from it. There's no real downside here, other than older computers don't support ACPI.
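A sketch of reading it (the I/O port comes from the FADT's "PM Timer Block" field, so the variable here is a placeholder):

Code: Select all
#include <stdint.h>

#define ACPI_PM_TIMER_HZ  3579545u      /* ~279 ns per tick */

extern uint16_t acpi_pm_timer_port;     /* from the FADT, filled in during boot */

/* 32-bit port read, analogous to the inb() above. */
static inline uint32_t inl(uint16_t port)
{
    uint32_t val;
    __asm__ __volatile__("inl %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

/* The counter is 24-bit on most chipsets (32-bit if the FADT's TMR_VAL_EXT
 * flag is set); masking to 24 bits works for both, as long as deltas are
 * computed modulo 2^24 (which wraps roughly every 4.7 seconds). */
uint32_t acpi_pm_timer_read(void)
{
    return inl(acpi_pm_timer_port) & 0x00FFFFFF;
}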
Finally, there's HPET. The main clock is about 10 MHz and it increments a "main counter register". It's easy to read this counter and get 100 ns precision. There's no real downside here, other than older computers don't have a HPET.
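A sketch (the MMIO base comes from the ACPI "HPET" table; the register offsets are from the HPET specification):

Code: Select all
#include <stdint.h>

extern volatile uint8_t *hpet_base;     /* mapped during boot, from the ACPI "HPET" table */

#define HPET_GENERAL_CAPS   0x000       /* bits 63:32 = counter period in femtoseconds */
#define HPET_GENERAL_CONFIG 0x010       /* bit 0 = overall enable */
#define HPET_MAIN_COUNTER   0x0F0

static inline uint64_t hpet_read64(uint32_t reg)
{
    return *(volatile uint64_t *)(hpet_base + reg);
}
static inline void hpet_write64(uint32_t reg, uint64_t val)
{
    *(volatile uint64_t *)(hpet_base + reg) = val;
}

/* Start the main counter and return its tick period in femtoseconds
 * (100000000 fs = 100 ns for a 10 MHz HPET). On 32-bit kernels the 64-bit
 * counter would be read as two 32-bit halves, with care taken over wrap. */
uint32_t hpet_init(void)
{
    uint32_t period_fs = (uint32_t)(hpet_read64(HPET_GENERAL_CAPS) >> 32);
    hpet_write64(HPET_GENERAL_CONFIG, hpet_read64(HPET_GENERAL_CONFIG) | 1);
    return period_fs;
}

uint64_t hpet_now(void)
{
    return hpet_read64(HPET_MAIN_COUNTER);
}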
If an OS supports "all of the above", it could determine what hardware is present during boot and use the best time source that the hardware supports. In this case you get to write code that compares the capabilities of supported timers and makes a compromise between precision and overhead during boot. It's also the most complex (think of it as "the sum of all complexities") so I could probably understand if you decided the extra complexity isn't worth the precision/overhead benefits.
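The boot-time decision itself can be as simple as a preference order, e.g. (the detection helpers are placeholders for the OS's own CPUID/ACPI probing, and the ordering is just one possible compromise):

Code: Select all
/* Assumed detection helpers. */
extern int tsc_is_usable(void);         /* invariant, synchronised across CPUs, ... */
extern int hpet_present(void);
extern int acpi_pm_timer_present(void);

typedef enum { TIMESRC_PIT, TIMESRC_ACPI, TIMESRC_HPET, TIMESRC_TSC } timesrc_t;

timesrc_t select_time_source(void)
{
    if (tsc_is_usable())
        return TIMESRC_TSC;
    if (hpet_present())
        return TIMESRC_HPET;
    if (acpi_pm_timer_present())
        return TIMESRC_ACPI;
    return TIMESRC_PIT;                 /* always present on PC compatibles */
}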
Cheers,
Brendan