Hi,
rdos wrote:
Theoretically, there are 256 priorities, but in practice only a few are used. These are:
* Zero. For null threads
* One. For typical application threads
* Above one. For kernel threads.
But I still don't see how fixing threads to cores would work. Normally, threads are not executing some numerical application for hours, but do some short code-sequence, and then block waiting for events. Threads that are blocked consume no CPU time. Therefore, with static allocation of threads to cores, one core could have 5 blocked threads, while another could have 5 runnable threads, but could only work on one of them. Since these states change frequently, there would be a lot of overhead when trying to compensate by moving threads between cores in order to utilize cores as efficiently as possible.
In general, with a fixed number of spinlocks, twice as many CPUs means twice as many CPUs fighting for the same spinlock/s and twice as many task switches, which adds up to 4 times as much lock contention. With 256 queues and 256 spinlocks (one for each priority), if only a few of those priorities are actually used then you're effectively only using a few locks. Under heavy load; for 2 CPUs you might have a 1% chance of lock contention; for 4 CPUs you might have a 4% chance of lock contention; for 8 CPUs you might have a 16% chance of lock contention; for 16 CPUs you might have a 64% chance of lock contention; and for 32 CPUs you might get better performance by refusing to use half the CPUs.
I typically use "queue per priority, per CPU", with 1024 priorities (actually 4 scheduling policies with 256 priorities per scheduling policy). On a system with 2 CPUs I'll have 2048 queues with 2048 spinlocks and almost no chance of lock contention. On a system with 256 CPUs I'll have 262144 spinlocks and almost no chance of lock contention.
When the scheduler is deciding which task to switch to, it finds the highest priority (non-empty) queue for that CPU and takes the first task off that queue. This is fast and there's almost no chance of lock contention (regardless of how many CPUs). Load balancing, etc occurs when threads are placed on a queue, not when they're taken off a queue. When a thread is put onto a queue, you already know the thread's priority (or policy and priority) and its "CPU affinity"; out of the CPUs in the thread's "CPU affinity" you determine which CPU is the best CPU (e.g. has the least load), then put the thread on the queue that corresponds to the best CPU and the thread's priority.
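To make this more concrete, here's a rough sketch of the idea in C. This isn't my actual code - the spinlock is just a toy test-and-set lock to keep the sketch self-contained, and pick_best_cpu() stands in for whatever load balancing logic you like (one way to score CPUs is sketched further down):

Code: Select all
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CPUS        256
#define NUM_PRIORITIES  1024            /* 4 policies * 256 priorities each */

/* Toy test-and-set spinlock, just to keep the sketch self-contained.
 * (Each lock would need explicit ATOMIC_FLAG_INIT initialisation in real code.) */
typedef atomic_flag spinlock_t;
static void spinlock_acquire(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                               /* spin */
}
static void spinlock_release(spinlock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

typedef struct thread {
    struct thread *next;
    unsigned priority;                  /* 0..NUM_PRIORITIES-1, higher = more urgent */
    /* ...affinity mask, saved state, etc... */
} thread_t;

typedef struct {
    spinlock_t lock;
    thread_t *head;
    thread_t *tail;
} run_queue_t;

/* One queue (and one lock) per priority, per CPU - 262144 of them for 256 CPUs. */
static run_queue_t queues[MAX_CPUS][NUM_PRIORITIES];

/* Placeholder for the load balancer: picks the best CPU allowed by the
 * thread's affinity. */
extern unsigned pick_best_cpu(const thread_t *t);

/* Scheduler on CPU 'cpu': take the first thread off the highest priority
 * non-empty queue belonging to that CPU. */
thread_t *pick_next_thread(unsigned cpu)
{
    for (int prio = NUM_PRIORITIES - 1; prio >= 0; prio--) {
        run_queue_t *q = &queues[cpu][prio];
        if (q->head == NULL)            /* unlocked peek - cheap skip of empty queues */
            continue;
        spinlock_acquire(&q->lock);
        thread_t *t = q->head;
        if (t != NULL) {
            q->head = t->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        spinlock_release(&q->lock);
        if (t != NULL)
            return t;
    }
    return NULL;                        /* nothing runnable - run the idle/null thread */
}

/* Called when a thread becomes ready to run. Load balancing happens here,
 * when the thread is put on a queue, not when it's taken off one. */
void enqueue_thread(thread_t *t)
{
    run_queue_t *q = &queues[pick_best_cpu(t)][t->priority];

    spinlock_acquire(&q->lock);
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
    spinlock_release(&q->lock);
}

Scanning all 1024 priorities on every task switch sounds expensive, but a per-CPU bitmap of "which priorities have non-empty queues" turns finding the highest non-empty queue into a few bit-scan instructions.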
In theory this can lead to unbalanced load. For example, on a system with 2 CPUs each CPU could have 10 threads in its queues, and all the threads for one CPU could take ages while the threads for the other CPU block very quickly, leaving you with 10 threads for one CPU and none for the other. In practice this isn't a problem at all - high priority threads don't consume much CPU time and get re-balanced relatively quickly, and for the low priority threads (e.g. the CPU bound ones) the amount of time it takes for them to get CPU time doesn't matter so much (and there's less chance of them becoming unbalanced anyway).
Also note that the code to determine which CPU is the "best" CPU can be much more complex than just comparing CPU loads; and can take into account NUMA domains, hyper-threading, CPU temperatures, cache sharing, etc. People only pin threads to specific CPUs when it improves performance, and it only improves performance when the scheduler isn't very smart (e.g. doesn't take things like NUMA domains, hyper-threading and caches into account); so smarter (more complex) scheduling can mean more flexible load balancing (because the user won't pin threads).
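For example, "best CPU" could be a simple scoring loop like this (building on the thread_t sketch above). The cpu_info fields, helper names and weights are all invented for illustration - the point is that this only runs when a thread is placed on a queue, so making it smarter is cheap:

Code: Select all
/* Hypothetical per-CPU bookkeeping the scheduler keeps up to date. */
typedef struct {
    unsigned load;              /* e.g. number of ready threads, or recent CPU time used */
    unsigned numa_domain;
    unsigned temperature;       /* from the CPU's thermal sensor */
    unsigned sibling_busy;      /* non-zero if the hyper-threading sibling is busy */
} cpu_info_t;

extern cpu_info_t cpu_info[MAX_CPUS];
extern unsigned   num_cpus;

/* Assumed helpers: an affinity test, and the thread's preferred NUMA domain. */
extern int      affinity_allows(const thread_t *t, unsigned cpu);
extern unsigned thread_home_domain(const thread_t *t);

/* Lower score = better. The weights are arbitrary; a real scheduler would tune
 * them and could also look at shared caches, power management states, etc. */
unsigned pick_best_cpu(const thread_t *t)
{
    unsigned best_cpu = 0;              /* assumes the affinity mask allows at least one CPU */
    unsigned best_score = ~0u;

    for (unsigned cpu = 0; cpu < num_cpus; cpu++) {
        if (!affinity_allows(t, cpu))
            continue;

        unsigned score = cpu_info[cpu].load * 100;
        if (cpu_info[cpu].numa_domain != thread_home_domain(t))
            score += 500;                       /* penalise remote memory accesses */
        if (cpu_info[cpu].sibling_busy)
            score += 50;                        /* penalise sharing a core via hyper-threading */
        score += cpu_info[cpu].temperature;     /* mildly prefer cooler cores */

        if (score < best_score) {
            best_score = score;
            best_cpu = cpu;
        }
    }
    return best_cpu;
}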
rdos wrote:
Brendan wrote:
rdos wrote:
A design feature is that it should be possible to provide exact timers.
Define "exact". 100% accurate with infinite precision is never going to happen.
Something like "+/- 5 ms with 1 ms precision" is achievable (if it's backed up with NTP support). Something like "+/- 5 ms with 1 us precision" might also be possible depending on which CPU and/or if HPET or local APICs are present. Anything better than "+/- 1 ms with 1 ns precision" just isn't going to happen regardless of how good your software and hardware is.
Also don't forget that higher accuracy and/or higher precision means higher overhead - it's a compromise.
It depends on the CPU used. On a 386, it will be in the order of several ms, while on a modern CPU it will be far better (probably around a few hundred microseconds).
On a 386 (which doesn't support RDTSC, local APIC/s, HPET or ACPI timer), you can use the PIT chip to get an IRQ every 5 ms; then if/when "high precision" is needed you can read the "remaining count low byte" in the PIT chip to determine how much of the current 5 ms has been consumed. In this case you can get about 2 us precision. Of course this has relatively high overhead (e.g. it'd take several us to read the "high precision" time, so you wouldn't want to do that too often). On multi-CPU systems the PIT gets worse, because you'd need some sort of lock (so that a CPU doesn't read the PIT count while another CPU is handling the PIT's IRQ or something). Basically it's a compromise between precision and overhead, and I'd understand if you decided that the overhead of reading the "remaining count low byte" wasn't worth the extra precision.
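For illustration, the "IRQ counter plus remaining count" trick looks something like this (x86, GCC/Clang inline asm; the "ticks" counter is assumed to be incremented by the PIT IRQ handler, and on multi-CPU systems the whole read would sit under the lock mentioned above):

Code: Select all
#include <stdint.h>

/* Minimal port I/O helpers. */
static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
}
static inline uint8_t inb(uint16_t port)
{
    uint8_t val;
    __asm__ __volatile__("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

#define PIT_FREQ_HZ   1193182u          /* PIT input clock, ~1.193182 MHz */
#define PIT_RELOAD    5966u             /* ~5 ms between IRQs */

extern volatile uint64_t ticks;         /* incremented by the PIT IRQ handler (assumed) */

/* Nanoseconds since boot: coarse part from the IRQ count, sub-tick part from
 * the PIT's remaining count. Interrupts are assumed to be disabled here
 * (otherwise an IRQ between reading 'ticks' and latching the count makes the
 * result jump); on SMP, take the lock too. */
uint64_t pit_time_ns(void)
{
    uint64_t base = ticks;

    outb(0x43, 0x00);                       /* latch channel 0's current count */
    uint16_t remaining = inb(0x40);         /* low byte first... */
    remaining |= (uint16_t)inb(0x40) << 8;  /* ...then high byte */

    uint32_t elapsed = PIT_RELOAD - remaining;   /* ticks into the current 5 ms */
    return base * 5000000ull
         + (uint64_t)elapsed * 1000000000ull / PIT_FREQ_HZ;
}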
There's also the CMOS/RTC which has a "periodic IRQ" that can be configured for between 2 Hz and 8192 Hz. It's unreliable at higher frequencies on some computers (I wouldn't use 2048 Hz or faster), but at 1024 Hz you could use the IRQ to increment a counter and end up with about 1 ms precision. It has fairly high overhead though (it's a legacy device with slow IO port accesses, where you need to read a status register every IRQ in addition to sending an EOI).
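A sketch of the setup and the IRQ side (standard CMOS/RTC register numbers, reusing the inb()/outb() helpers and <stdint.h> from the PIT sketch above; interrupts are assumed to be disabled while poking the CMOS index/data ports):

Code: Select all
/* Rate N gives 32768 >> (N - 1) Hz, so N = 6 gives 1024 Hz. */
void rtc_init_1024hz(void)
{
    outb(0x70, 0x8A);                   /* select register A (NMI disabled) */
    uint8_t a = inb(0x71);
    outb(0x70, 0x8A);
    outb(0x71, (a & 0xF0) | 0x06);      /* rate = 6 -> 1024 Hz */

    outb(0x70, 0x8B);                   /* select register B (NMI disabled) */
    uint8_t b = inb(0x71);
    outb(0x70, 0x8B);
    outb(0x71, b | 0x40);               /* set PIE (periodic interrupt enable) */
}

volatile uint64_t rtc_ticks;            /* each tick is ~0.977 ms at 1024 Hz */

/* The IRQ 8 handler must read register C, or no further IRQs arrive. */
void rtc_irq_handler(void)
{
    rtc_ticks++;
    outb(0x70, 0x0C);
    (void)inb(0x71);                    /* acknowledge in the RTC itself */
    /* ...then send an EOI to the PIC/APIC... */
}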
For the local APIC, the main clock runs at the speed of the CPU's bus, which (depending on the computer) can be from about 33 MHz to about 1 GHz. Just like the PIT, you can read its "remaining count", and by doing that you can end up with between 30 ns and 1 ns precision. Unfortunately, in some systems (with HyperTransport or QuickPath, rather than a single/common FSB) the local APICs in different CPUs aren't synchronised and may run at slightly different speeds. This can make it harder to keep different CPUs synchronised. Also, the local APIC timer is typically used in multi-CPU systems for the scheduler's IRQ, and using it for the current time as well can complicate things a little more. This ends up being a compromise too - maybe the extra precision isn't worth the (only slightly higher) overhead and the extra complexity.
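Reading the "remaining count" is just two MMIO reads. A sketch, assuming the local APIC sits at its usual default physical address (the real base comes from the IA32_APIC_BASE MSR, and needs to be mapped if paging is enabled):

Code: Select all
#include <stdint.h>

#define LAPIC_BASE            ((uintptr_t)0xFEE00000)
#define LAPIC_TIMER_INITCNT   0x380     /* Initial Count register */
#define LAPIC_TIMER_CURCNT    0x390     /* Current Count register */

static inline uint32_t lapic_read(uint32_t reg)
{
    return *(volatile uint32_t *)(LAPIC_BASE + reg);
}

/* Bus-clock ticks elapsed since the timer was last loaded. Converting to
 * nanoseconds is ticks * 1000000000 / bus_hz, where 'bus_hz' is measured
 * during boot (e.g. against the PIT). */
uint32_t lapic_elapsed_ticks(void)
{
    uint32_t initial = lapic_read(LAPIC_TIMER_INITCNT);
    uint32_t current = lapic_read(LAPIC_TIMER_CURCNT);
    return initial - current;           /* the timer counts down from 'initial' */
}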
For the TSC, the speed is between 33 MHz and about 4 GHz, so it works out to about 30 ns to 0.25 ns precision. Unfortunately (as you already know) the TSC is messy and can take a lot of work to avoid all the potential problems. This ends up being a compromise too - maybe the extra precision isn't worth the extra complexity.
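Reading the TSC itself is the easy part (the messy parts are elsewhere - calibrating it, checking it's invariant, keeping CPUs in sync, etc):

Code: Select all
#include <stdint.h>

/* GCC/Clang, x86. On out-of-order CPUs you may want a serialising instruction
 * (CPUID) or RDTSCP if exact ordering against surrounding code matters. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}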
Then there's the ACPI timer. This has a main clock running at 3.579545 MHz that increments a counter. It's relatively easy to read and easy to get 280 ns precision from it. There's no real downside here, other than older computers don't support ACPI.
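A sketch of reading it (the I/O port comes from the FADT's "PM Timer Block" field, so the variable here is a placeholder):

Code: Select all
#include <stdint.h>

#define ACPI_PM_TIMER_HZ  3579545u      /* ~279 ns per tick */

extern uint16_t acpi_pm_timer_port;     /* from the FADT, filled in during boot */

/* 32-bit port read, analogous to the inb() above. */
static inline uint32_t inl(uint16_t port)
{
    uint32_t val;
    __asm__ __volatile__("inl %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

/* The counter is 24-bit on most chipsets (32-bit if the FADT's TMR_VAL_EXT
 * flag is set); masking to 24 bits works for both, as long as deltas are
 * computed modulo 2^24 (which wraps roughly every 4.7 seconds). */
uint32_t acpi_pm_timer_read(void)
{
    return inl(acpi_pm_timer_port) & 0x00FFFFFF;
}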
Finally, there's HPET. The main clock is about 10 MHz and it increments a "main counter register". It's easy to read this counter and get 100 ns precision. There's no real downside here, other than older computers don't have a HPET.
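A sketch (the MMIO base comes from the ACPI "HPET" table; the register offsets are from the HPET specification):

Code: Select all
#include <stdint.h>

extern volatile uint8_t *hpet_base;     /* mapped during boot, from the ACPI "HPET" table */

#define HPET_GENERAL_CAPS   0x000       /* bits 63:32 = counter period in femtoseconds */
#define HPET_GENERAL_CONFIG 0x010       /* bit 0 = overall enable */
#define HPET_MAIN_COUNTER   0x0F0

static inline uint64_t hpet_read64(uint32_t reg)
{
    return *(volatile uint64_t *)(hpet_base + reg);
}
static inline void hpet_write64(uint32_t reg, uint64_t val)
{
    *(volatile uint64_t *)(hpet_base + reg) = val;
}

/* Start the main counter and return its tick period in femtoseconds
 * (100000000 fs = 100 ns for a 10 MHz HPET). On 32-bit kernels the 64-bit
 * counter would be read as two 32-bit halves, with care taken over wrap. */
uint32_t hpet_init(void)
{
    uint32_t period_fs = (uint32_t)(hpet_read64(HPET_GENERAL_CAPS) >> 32);
    hpet_write64(HPET_GENERAL_CONFIG, hpet_read64(HPET_GENERAL_CONFIG) | 1);
    return period_fs;
}

uint64_t hpet_now(void)
{
    return hpet_read64(HPET_MAIN_COUNTER);
}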
If an OS supports "all of the above", it could determine what hardware is present during boot and use the best time source that the hardware supports. In this case you get to write code that compares the capabilities of supported timers and makes a compromise between precision and overhead during boot. It's also the most complex (think of it as "the sum of all complexities") so I could probably understand if you decided the extra complexity isn't worth the precision/overhead benefits.
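The boot-time decision itself can be as simple as a preference order, e.g. (the detection helpers are placeholders for the OS's own CPUID/ACPI probing, and the ordering is just one possible compromise):

Code: Select all
/* Assumed detection helpers. */
extern int tsc_is_usable(void);         /* invariant, synchronised across CPUs, ... */
extern int hpet_present(void);
extern int acpi_pm_timer_present(void);

typedef enum { TIMESRC_PIT, TIMESRC_ACPI, TIMESRC_HPET, TIMESRC_TSC } timesrc_t;

timesrc_t select_time_source(void)
{
    if (tsc_is_usable())
        return TIMESRC_TSC;
    if (hpet_present())
        return TIMESRC_HPET;
    if (acpi_pm_timer_present())
        return TIMESRC_ACPI;
    return TIMESRC_PIT;                 /* always present on PC compatibles */
}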
Cheers,
Brendan