When I first designed my thread scheduler, I remember choosing between two options: setting up the HPET to send a lowest-priority MSI (so the interrupt goes to the CPU with the lowest PPR), or just sending it to the BSP only, having the BSP update the timer-based structures, and then sending a broadcast IPI to the other APs.
I chose the second option because at the time it seemed like the simpler approach: it avoided introducing more interrupt handlers in case another AP got the actual HPET interrupt. Looking back on this, and having recently taken a look at the Windows kernel, I see it is also done this way there. Hence my question: other than avoiding extra complexity, are there other solid reasons to do this?
Reason for using timer interrupt on BSP only.
Re: Reason for using timer interrupt on BSP only.
Are you talking about the scheduler/preemption or about timers? For scheduling your proposed process is extremely inefficient because your CPUs have to handle two IRQs instead of one. I seriously doubt that any mainstream OS does what you describe. I also doubt that they use the HPET for scheduling at all; the local APIC timer is much better suited for this task.
For timers (e.g. when implementing sleep() or other timeouts) your proposal can actually make sense, but keeping CPU-local timers will still reduce cache thrashing and improve latencies.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Re: Reason for using timer interrupt on BSP only.
Hi,
devsau wrote: "When I first designed my thread scheduler, I remember choosing between setting up the HPET to send a lowest-priority MSI (send the interrupt to the CPU with the lowest PPR), or just send it to the BSP only, have the BSP update timer-based structures, and then send a broadcast IPI off to the other APs.
The reason I did this was because at the time it just seemed like a simpler approach without having to introduce more interrupt handlers in case another AP got the actual HPET interrupt. Looking back on this, and also having recently taken a look at the Windows kernel, I see it is also done this way. Hence my question, other than introducing extra complexity, are there other solid reasons to do this?"
For a thread scheduler; every sane OS currently uses the local APIC timer in some way.
The problem is that often the same local APIC timer is used for other things (as part of a general purpose "high precision timer" abstraction that's used for everything - networking time-outs, device driver delays, etc), and unfortunately (for a lot of CPUs) when a CPU is put into a "very low power consumption" state to save power (because it's been idle for long enough) the local APIC timer stops working. In this case it's reasonable to shift all of the remaining "timer events" (that are likely to have nothing to do with thread scheduling because the CPU is idle) over to a different timer, like HPET, until the CPU has work to do and is brought back to a "higher power consumption" state (and its local APIC timer starts working again).
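To make the hand-off concrete, here is a minimal sketch of moving a CPU's pending "timer events" onto a shared fallback list before it enters a deep sleep state. It is not taken from any real kernel: `struct timer_list`, `MAX_EVENTS` and all three functions are hypothetical names, and the locking that the shared list would need is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 16  /* hypothetical fixed capacity, for illustration only */

/* A list of absolute deadlines in one common time unit (e.g. nanoseconds). */
struct timer_list {
    uint64_t deadline[MAX_EVENTS];
    size_t   count;
};

/* Append one deadline; returns 0 on success, -1 if the list is full. */
static int timer_list_add(struct timer_list *l, uint64_t deadline)
{
    if (l->count >= MAX_EVENTS)
        return -1;
    l->deadline[l->count++] = deadline;
    return 0;
}

/* Earliest deadline in the list, or UINT64_MAX if the list is empty. */
static uint64_t timer_list_earliest(const struct timer_list *l)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < l->count; i++)
        if (l->deadline[i] < best)
            best = l->deadline[i];
    return best;
}

/*
 * Move pending events from `src` (a CPU's local list) to `dst` (the list
 * served by the fallback timer). Events that do not fit stay in `src`.
 * Returns the number of events moved.
 */
static size_t timer_list_migrate(struct timer_list *src, struct timer_list *dst)
{
    size_t moved = 0;
    assert(src != dst);
    while (src->count > 0 && dst->count < MAX_EVENTS) {
        dst->deadline[dst->count++] = src->deadline[--src->count];
        moved++;
    }
    return moved;
}
```

The idle path would call `timer_list_migrate()` just before the CPU goes to sleep, and the fallback timer's comparator would then be programmed from `timer_list_earliest()`. A real implementation would also tag each event with its owning CPU so the events can be handed back when that CPU wakes up.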
Note: This is mostly done for scalability - you do not want all CPUs trying to use the same shared resource (e.g. the same HPET timer) because that means you need locks, etc (and get lock contention and other problems that ruin performance for "many CPUs"); so to avoid that you use an "each CPU does its own timing independently (where possible)" approach (which leads to using local APIC timers for everything).
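As a rough illustration of the "each CPU does its own timing independently" idea, the sketch below gives every CPU its own slot that only that CPU ever touches, so no lock is needed. The names (`cpu_timers`, `cpu_timer_arm`, `cpu_timer_expired`), the CPU count and the 64-byte cache line size are all assumptions made for the example; reprogramming the actual local APIC timer is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS    8           /* hypothetical CPU count for the sketch */
#define CACHE_LINE  64          /* assumed cache line size in bytes */
#define NO_DEADLINE UINT64_MAX  /* marker for "nothing armed" */

/*
 * One slot per CPU, aligned and padded to a full cache line so that two
 * CPUs never write to the same line (no false sharing).
 */
struct cpu_timer {
    _Alignas(CACHE_LINE) uint64_t next_deadline;
};

static struct cpu_timer cpu_timers[MAX_CPUS];

/* Reset every slot to "nothing armed". */
static void cpu_timers_init(void)
{
    for (int i = 0; i < MAX_CPUS; i++)
        cpu_timers[i].next_deadline = NO_DEADLINE;
}

/*
 * Called only by CPU `cpu` itself: remember the earliest deadline.
 * A real kernel would also reprogram that CPU's local APIC timer here.
 */
static void cpu_timer_arm(int cpu, uint64_t deadline)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    if (deadline < cpu_timers[cpu].next_deadline)
        cpu_timers[cpu].next_deadline = deadline;
}

/*
 * Called from CPU `cpu`'s own timer interrupt: returns 1 and clears the
 * slot if the armed deadline has passed, otherwise returns 0.
 */
static int cpu_timer_expired(int cpu, uint64_t now)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    uint64_t d = cpu_timers[cpu].next_deadline;
    if (d == NO_DEADLINE || d > now)
        return 0;
    cpu_timers[cpu].next_deadline = NO_DEADLINE;
    return 1;
}
```

Because CPU 0 never reads or writes the slot of CPU 1, arming a timer costs no atomic operations and no cross-CPU cache traffic, which is the scalability benefit described above.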
If HPET is only used when CPUs are in a "very low power consumption" state, it doesn't make sense to broadcast HPET's IRQ/s to all CPUs because that will wake all CPUs from their "very low power consumption" state and ruin the power savings. Ideally you want to ensure that only one CPU (that is not in some kind of power saving state) receives HPET's IRQ/s (and ensure that only one CPU handles the "timer events" on behalf of any/all CPUs that are currently in a "very low power consumption" state).
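One possible way to pick that single CPU is sketched below. It assumes the kernel keeps a bitmask of CPUs that are not in a deep sleep state; the function name `pick_hpet_target` and the mask layout are hypothetical, and the code that actually reprograms the interrupt routing is not shown.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose which CPU should receive the HPET IRQ.
 * Bit n of `awake_mask` is set when CPU n is NOT in a deep sleep state.
 * `current` is the CPU the IRQ is routed to right now (or -1 if none).
 * Returns the chosen CPU number, or -1 if every CPU is asleep.
 */
static int pick_hpet_target(uint64_t awake_mask, int current)
{
    /* Keep the existing routing if that CPU is still awake. */
    if (current >= 0 && current < 64 &&
        (awake_mask & (UINT64_C(1) << current)))
        return current;

    /* Otherwise fall back to the lowest-numbered awake CPU. */
    for (int cpu = 0; cpu < 64; cpu++)
        if (awake_mask & (UINT64_C(1) << cpu))
            return cpu;

    return -1;
}
```

Preferring the current target avoids rewriting the routing every time some other CPU changes state. A result of -1 means all CPUs are asleep, so the caller has to accept that the IRQ will wake one of them.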
Also note that without power management (and without any "very low power consumption" states), the HPET might also be used to keep everything in sync with "wall clock time" (as part of a tiered approach - e.g. NTP used to keep HPET in sync with "wall clock time", then HPET used to keep local APIC timers and TSC in sync with "wall clock time"); but the HPET IRQ/s are not needed for this (it can be done with HPET's "main counter" without using any of HPET's comparators). This means that HPET's IRQ can be enabled on demand - e.g. HPET's IRQ could be enabled (if it's not enabled already) when a CPU enters a "very low power consumption" state and its "timer events" are shifted to HPET, and then the IRQ can be disabled when there are no more "timer events" left for HPET to handle.
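As an example of using only the main counter, the sketch below estimates the TSC frequency from two samples of the TSC and two samples of the HPET main counter taken at the same moments. The HPET reports its tick period in femtoseconds in its capabilities register; everything else here (the function name `tsc_hz_from_hpet`, taking the deltas as parameters rather than reading hardware) is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the TSC frequency in Hz.
 *   tsc_delta       TSC ticks elapsed during the measurement window
 *   hpet_delta      HPET main counter ticks elapsed in the same window
 *   hpet_period_fs  HPET tick period in femtoseconds
 * Returns 0 if the window is too short to measure (under 1 ns).
 * The window is assumed to be short (milliseconds), so the 64-bit
 * intermediate products below do not overflow.
 */
static uint64_t tsc_hz_from_hpet(uint64_t tsc_delta, uint64_t hpet_delta,
                                 uint32_t hpet_period_fs)
{
    /* 1 ns = 1,000,000 fs */
    uint64_t elapsed_ns = (hpet_delta * hpet_period_fs) / UINT64_C(1000000);
    if (elapsed_ns == 0)
        return 0;
    /* ticks per ns, scaled up to ticks per second */
    return (tsc_delta * UINT64_C(1000000000)) / elapsed_ns;
}
```

With a 10 MHz HPET (period 100,000,000 fs), a window of 100,000 HPET ticks is 10 ms; if the TSC advanced by 30,000,000 ticks in that window the estimate is 3 GHz. The same calculation works for the local APIC timer by substituting its tick count for `tsc_delta`.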
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Reason for using timer interrupt on BSP only.
My scheduler uses APIC timers for each CPU; for sure there is extra complexity.
For me I wanted a large number of context switches per second, so the extra complexity was worth coding.
On my 24 core server, with 2 apps per core, each app sets a timer and waits for the time-out, then increments a QWORD, then waits again...
I am context switching 55 million times per second, saving and restoring AVX data every time.
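To put that figure in perspective, a quick back-of-the-envelope helper (the name `ns_per_switch` is made up for this example) converts a machine-wide switch rate into the average time each core has per switch, assuming the switches are spread evenly across the cores.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average nanoseconds per context switch on one core, assuming the
 * machine-wide rate is spread evenly across `cores` cores.
 * Integer division truncates the result.
 */
static uint64_t ns_per_switch(uint64_t switches_per_sec, uint64_t cores)
{
    assert(switches_per_sec != 0);
    return (UINT64_C(1000000000) * cores) / switches_per_sec;
}
```

For 55 million switches per second on 24 cores this gives about 2.3 million switches per second per core, or roughly 436 ns per switch including the AVX save and restore.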
Ali
Re: Reason for using timer interrupt on BSP only.
Hi,
"I am context switching 55 million times per second"
This is the problem then.
Normally, assuming a priority-based round robin scheduler, you assign variable time-slices to threads. Higher-priority threads (e.g. GUI) get smaller time-slices, but they are guaranteed to preempt a lower-priority thread whenever they are unblocked. Lower-priority threads (e.g. non-real-time 3D renderer) get bigger time-slices, but they are guaranteed to get preempted whenever a higher-priority thread unblocks. This way you keep a balance between responsiveness (high-priority threads should be more responsive) and the number of context switches (too many context switches don't let the CPU do the work it's intended for).
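A minimal sketch of those two rules, under assumed numbers: four priority levels, a 5 ms slice for the highest priority, and a slice that doubles with each lower level. The constants and the names `time_slice_ms` and `should_preempt` are illustrative only, not taken from any particular scheduler.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PRIORITIES 4  /* hypothetical: 0 = highest, 3 = lowest */
#define BASE_SLICE_MS  5  /* hypothetical slice for the highest priority */

/*
 * Higher priority (smaller number) gets a shorter time slice;
 * each lower priority level doubles it: 5, 10, 20, 40 ms.
 */
static uint32_t time_slice_ms(unsigned priority)
{
    assert(priority < NUM_PRIORITIES);
    return (uint32_t)BASE_SLICE_MS << priority;
}

/*
 * A thread that has just been unblocked preempts the running thread
 * only if its priority is strictly higher (numerically smaller).
 */
static int should_preempt(unsigned running_priority,
                          unsigned unblocked_priority)
{
    return unblocked_priority < running_priority;
}
```

With these numbers a GUI thread at priority 0 runs for at most 5 ms at a time but interrupts a renderer at priority 3 immediately, while the renderer gets 40 ms slices whenever nothing more important is runnable.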
Hope this helps.
Regards,
glauxosdever
Re: Reason for using timer interrupt on BSP only.
glauxosdever wrote: "This way you keep a balance between responsiveness (high-priority threads should be more responsive) and the number of context switches (too many context switches don't let the CPU do the work it's intended for)."
Very much correct, there is a lot to take into consideration.
For example: I loaded the same app 6 million times onto the same server and did the same test, with each core having around 250,000 apps to deal with. Each app again running a timer handler as fast as the OS could handle.
The limiting factor in this test was cache misses: the performance really drops off and the scheduler has a much harder time scheduling things. The more advanced the scheduler, the slower it ran. You really have to count, time, and test your code again and again, and then, for me, usually rewrite it a few times.
I find it very rewarding once you get something working, just keep on thinking....
Ali