Hi,
AlfaOmega08 wrote:Brendan wrote:If one CPU is running at 2 GHz and another CPU is running at 500 MHz (because it got too hot and needs to cool down, for example); then if 2 tasks have the same priority and one gets 10 ms on the first CPU, then the other should get 40 ms on the other (slower) CPU, so that they both get the same fair share of CPU time.
Just realized that, if my knowledge isn't totally wrong, the APIC timer runs at the same frequency as the CPU. So if I calculate the CPU frequency at boot, with full speed, and keep that as the frequency of the LAPIC Timer, if the CPU slows down, the timer will follow it, and threads will have more time.
No. Normally the local APIC timer runs at the speed of the bus, not at the speed of the CPU. The speed of the bus is constant and isn't affected by CPU speed changes. Newer local APICs do have a "TSC deadline mode" which runs at the speed of the CPU's time stamp counter; but for all newer CPUs the CPU's time stamp counter also runs at a fixed frequency and isn't affected by CPU speed changes.
AlfaOmega08 wrote:Brendan wrote:For this reason, your kernel needs "sleep_until( when );". You could implement a "sleep( duration )" function like this:
Code: Select all
void sleep( uint64_t duration ) {
    sleep_until( now() + duration );    /* turn the relative delay into an absolute wake time */
}
This also means that tasks sleep until "now() > task->wake_time"; and you don't have to decrease a "time remaining" value for every sleeping task every time the timer IRQ occurs.
Using sleep_until, I don't even need a sleep_queue, right? Just having all the threads in a queue, and if now() > wakeup_time, then it can be run. What about using the HPET One Shot Timer to move a thread from the sleep queue to the "ready to run" queue? I still don't get the point... sorry.
Start with a simple linked list that is sorted according to each task's "wake time". For a one shot timer, you'd work out how long until the next task should wake up ("delay = sleep_queue_head->wake_time - now()") and set the one shot timer to generate an IRQ when that delay expires. When the IRQ occurs you'd do something like:
Code: Select all
IRQhandler {
    acquire_spinlock(sleep_queue_lock);
    while( (sleep_queue_head != NULL) && ( sleep_queue_head->wake_time <= now() ) ) {
        temp_task = sleep_queue_head;
        sleep_queue_head = temp_task->next;    // Remove task from sleep queue
        make_task_ready_to_run(temp_task);     // Make the task "ready to run"
    }
    if(sleep_queue_head != NULL) {
        delay = sleep_queue_head->wake_time - now();
        set_one_shot_timer(delay);
    }
    release_spinlock(sleep_queue_lock);
}
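The insertion side ("sleep_until()") is just a sorted insert into that list, plus re-arming the timer if the new task ends up at the head of the queue. Here's a minimal sketch, assuming a hypothetical "get_current_task()" helper and a hypothetical "block_current_task()" that stops the task running until it's made ready again:
Code: Select all
void sleep_until( uint64_t wake_time ) {
    task_t *current = get_current_task();    /* hypothetical helper */
    task_t **p = &sleep_queue_head;

    current->wake_time = wake_time;

    acquire_spinlock(sleep_queue_lock);

    /* Walk the sorted list to find the insertion point */
    while( (*p != NULL) && ((*p)->wake_time <= wake_time) ) {
        p = &(*p)->next;
    }
    current->next = *p;
    *p = current;

    /* If the new task became the head of the queue, the timer needs to fire sooner */
    if(sleep_queue_head == current) {
        set_one_shot_timer(wake_time - now());
    }

    release_spinlock(sleep_queue_lock);

    block_current_task();                    /* hypothetical: don't run again until woken */
}
In a real kernel you'd also need IRQs disabled while holding the lock here.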
This seems relatively simple, but there are 3 problems. The first problem is that inserting a task into the sleep queue can be quite slow when there are thousands of tasks on the sleep queue (you need to search through thousands of entries to find the right place to insert). The second problem is that many CPUs may be fighting for the "sleep_queue_lock" (especially when CPUs are inserting lots of tasks and insertion is slow) - you may have to wait for many other CPUs to search thousands of entries each before you can even start searching yourself. The third problem is interrupt latency - you have to disable IRQs when you're inserting to make sure that the HPET IRQ doesn't occur and try to acquire the lock on the same CPU that already has the lock (and cause deadlock), and you really don't want to have interrupts disabled while you're searching a list of thousands of entries if you can avoid it.
The easiest way to fix the first problem is to have more "sorted linked lists"/queues (e.g. if there are 1000 queues then each queue will only have 1/1000th of the entries and you can insert tasks onto a queue 1000 times faster). The easiest way to reduce the second problem is to have more locks (e.g. with 1000 queues and 1000 locks, the chance of 2 or more CPUs fighting for the same lock is 1000 times lower). Increasing the number of lists/queues improves the insertion time, which also helps with the interrupt latency problem.
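As a rough sketch of what that data structure might look like (the type and constant names here are just for illustration):
Code: Select all
#define NUM_QUEUES 1000

typedef struct {
    spinlock_t lock;      /* one lock per queue, so 2 CPUs rarely want the same lock */
    task_t *head;         /* sorted linked list of sleeping tasks */
} sleep_queue_t;

sleep_queue_t sleep_queues[NUM_QUEUES];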
So, if you're going to increase the number of queues, what is the smartest way of doing that?
To begin with, I really like "per-CPU" structures. If something is "per CPU" then only one CPU can access it and you can just disable IRQs and have no lock at all. HPET normally doesn't have enough timers capable of one-shot though, so you'd have to use local APIC timers for this; and if you do use local APIC timers for "per CPU" sleep queues then you have to do something about the "local APIC turned off" problem. To solve the "local APIC turned off" problem you can have per-CPU sleep queues that use the local APIC, and an extra global sleep queue that uses HPET that is used to keep things going when a local APIC is turned off. Basically, just before a CPU goes to sleep you'd shift everything from that CPU's sleep queue onto the global sleep queue.
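A very rough sketch of that hand-over, done just before the CPU turns its local APIC off (the queue/lock names and the "insert_into_sorted_queue()" helper are illustrative, not a fixed API):
Code: Select all
void hand_sleep_queue_to_hpet(void) {
    task_t *task;

    acquire_spinlock(global_sleep_queue_lock);
    while( this_cpu_sleep_queue_head != NULL ) {
        task = this_cpu_sleep_queue_head;
        this_cpu_sleep_queue_head = task->next;                     /* remove from per-CPU queue */
        insert_into_sorted_queue(&global_sleep_queue_head, task);   /* illustrative helper */
    }

    /* Make sure HPET will fire in time for the earliest task on the global queue */
    if(global_sleep_queue_head != NULL) {
        set_hpet_one_shot_timer(global_sleep_queue_head->wake_time - now());
    }
    release_spinlock(global_sleep_queue_lock);
}
Because both lists are sorted, you could also merge them in a single pass rather than inserting tasks one at a time.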
That is a good start, but it still might not be enough. To reduce insertion times more you can have multiple sleep queues per CPU and multiple global sleep queues.
One way to increase the number of queues more would be to do something like "queue_number = task_wake_time % number_of_queues". This is easy and would spread things out well; but it means that your CPU's timer's IRQ handler would have to worry about all of the CPU's queues and not just one of them.
A slightly different way to increase the number of queues more would be to do something like "queue_number = (task_wake_time - time_for_current_first_queue) / time_per_queue" (where "time_per_queue" might be something like 128 ms). In this case, the CPU's timer's IRQ handler only needs to find the first queue and doesn't need to look at all of the CPU's queues. The problem with this is extremely long delays - you probably don't want to have millions of queues per CPU, just in case a task wants to sleep for 12 days.
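In code the two schemes might look something like this (a sketch only - "number_of_queues", "time_for_current_first_queue" and "time_per_queue" are illustrative names):
Code: Select all
/* Scheme 1: simple modulo - spreads tasks out well, but the CPU's timer
   IRQ handler has to look at every one of the CPU's queues */
queue_number = task_wake_time % number_of_queues;

/* Scheme 2: each queue covers a fixed slice of time (e.g. 128 ms) - the
   timer IRQ handler only needs to look at the current first queue */
queue_number = (task_wake_time - time_for_current_first_queue) / time_per_queue;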
Most delays are short (e.g. less than 5 seconds). If "time_per_queue" is 128 ms then you'd only need about 40 queues to handle delays that are up to 5 seconds long (40 * 128 ms = 5.12 seconds). If any task wants a delay that is longer than this, you can put the task onto an extra "delay too big" queue. Every 128 ms you'd check the "delay too big" queue and see if there's anything that could/should be shifted to the last "128 ms" queue. However, at the same time the "current first queue" would become empty, and you'd rotate the queues - the old "current first queue" becomes the new "current last queue", and then you check the "delay too big" queue and see if there's anything that should be shifted to the new "current last queue".
Think of it like this (with only 3 queues plus the "delay too big queue"):
Code: Select all
Queue1: ABCDE
Queue2: FGH
Queue3: IJ
QueueX: KLM
You'd wake up task A, then task B, then task C, etc. Eventually it ends up looking like this:
Code: Select all
Queue1:
Queue2: FGH
Queue3: IJ
QueueX: KLMN
Then (possibly a little later) you're finished with Queue1 and have to rotate the queues:
Code: Select all
Queue1: FGH
Queue2: IJ
Queue3:
QueueX: KLMN
And then see if there's any tasks on "QueueX" that need to be shifted:
Code: Select all
Queue1: FGH
Queue2: IJ
Queue3: KL
QueueX: MN
After that you'd wake up task F, task G, etc; then rotate the queues again.
Of course with linked lists, if you know that "QueueX" is sorted, you only have to find where to split the list and split the list at that point. You don't need to remove tasks from "QueueX" and insert them onto "Queue3" one at a time.
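Here's a minimal sketch of that rotation and split, assuming the new "current last queue" is empty when the rotation happens (as in the example above) and using illustrative names ("queue_x_head" for the "delay too big" queue, "queue_end_time" for the last wake time the queues now cover):
Code: Select all
/* Rotate: the old (now empty) "current first queue" becomes the new "current last queue" */
new_last_queue = current_first_queue;
current_first_queue = (current_first_queue + 1) % number_of_queues;
time_for_current_first_queue += time_per_queue;
queue_end_time = time_for_current_first_queue + number_of_queues * time_per_queue;

/* Split the sorted "delay too big" queue - everything with a wake time that now
   fits on the new "current last queue" is moved in one go, not one task at a time */
task_t *moved_head = queue_x_head;
task_t **p = &queue_x_head;

while( (*p != NULL) && ((*p)->wake_time < queue_end_time) ) {
    p = &(*p)->next;
}
if(p != &queue_x_head) {                          /* at least one task needs to move */
    queues[new_last_queue].head = moved_head;     /* the new last queue was empty */
    queue_x_head = *p;                            /* "delay too big" queue keeps the rest */
    *p = NULL;                                    /* terminate the moved portion */
}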
The end result of all this is that you might have 64 queues per CPU (using the local APIC timer); plus another 64 global queues (using HPET). You only ever insert tasks onto the per CPU queues. With 16 CPUs you've got 1024 queues, and if there's 10000 tasks on those queues then you've only got an average of about 9.7 tasks on each queue. You can search a list of 10 things fairly quickly (far more quickly than you can search a list of 10000 things), and (because there's no need for locks for the per CPU queues) most of the lock contention problems are gone. The global queues still need locks, but (except for the HPET IRQ handler) you only need to acquire those locks when you're merging per CPU queues with the global queues before turning a CPU off, which isn't going to be happening too often anyway.
Cheers,
Brendan