Page 1 of 1

Algorithms for IRQ Balancing

Posted: Tue Mar 13, 2018 10:07 pm
by ShukantPal
Hi OSDever's,

I have search on google how IRQ-balancing is done internally, but no site explains how. How would an IRQ balancer work?

There are things like MSI where the interrupt directly reaches the destination CPU and where the IOAPIC redirects them to the CPU. If we keep a count of how many interrupts reached the CPU wouldn't that be inefficient - I think we should record the rate of interrupts firing otherwise the data would be useless if a device was recently plugged in or plugged out. But if we do - how? Or we could reset all of the records periodically - like zeroing the no. of interrupts occurred every 15 seconds?

Now, when would the IRQ balancer run? Should there be a periodically running iRQ-balancer on a system thread? Also, which devices need to be "pinned" - HPET, PIT, PS/2 Keyboard and Mouse?

Any specific algorithms for IRQ-balancing? I have the idea that I could use my CPU topology tree - balance within each domain - if two separate domains have a load-difference having greater overhead compared to transferring of IRQs - transfer them?

Regards,
Shukant Pal

Re: Algorithms for IRQ Balancing

Posted: Tue Mar 13, 2018 11:27 pm
by Brendan
Hi,
ShukantPal wrote:I have search on google how IRQ-balancing is done internally, but no site explains how. How would an IRQ balancer work?
There's 2 main approaches:
  • Explicit IRQ balancing - use "fixed delivery" and periodically reconfigure almost everything yourself.
    • Advantages: Always works
      Disadvantages: Slow to react, higher overhead
  • Automated IRQ balancing - use "lowest priority delivery" and the local APIC's TPR to take care of it automatically.
    • Advantages: Balancing decision is made by hardware for each IRQ individually when an IRQ occurs (and not later when it's too late), no overhead for "reconfigure all the things" in software
      Disadvantages: May not be supported properly in some motherboards for IO APIC, probably not supported properly for MSI. OS has to manage TPR properly.
These approaches are not mutually exclusive (it's possible for an OS to combine both them - e.g. use explicit IRQ balancing for some IRQs and automated IRQ balancing for other IRQs).

Note that balancing IRQs evenly across all CPUs is something I'd consider "very wrong". For example, if a CPU is in a sleep state then I'd prefer IRQs be sent to other CPUs where possible to avoid worse performance/worse IRQ latency (the time it takes to wake a CPU up before the interrupt handler can start) and to avoid worse power consumption. For another example, if one CPU is running a medium priority task and another CPU is running a high priority task, then I'd prefer IRQs be sent to the CPU running the medium priority task (so that more important/higher priority tasks are interrupted less and important work gets done faster). Of course because the scheduler frequently changes tasks and because CPUs enter/leave various sleep states (e.g. "HLT") relatively often I advocate using automated IRQ balancing as much as possible. Like most things, Linux does everything wrong (originally it didn't do IRQ balancing at all, and then someone retro-fitted an optional hack that is only capable of explicit IRQ balancing and only tries to do the "balance IRQs across all CPUs" stupidity).

For all cases (explicit or automatic balancing) it should probably be "per NUMA domain". More specifically, where possible, IRQs from devices connected to "NUMA domain #N" should only be sent to CPUs that are also within "NUMA domain #N". This avoids some traffic across any (quickpath, hypertransport) "inter-NUMA-domain" links; and allows IRQs for each NUMA domain to be balanced independently (which is better for scalability, especially for explicit IRQ balancing).

Finally; if IRQ balancing can cause an IRQ to be sent to any CPU within a group of CPUs; then it makes sense for all CPUs in that group to share the same IDT. This mostly prevents the use of "one IDT per CPU" to increase the maximum number of independent IRQs an OS can support. For NUMA, this doesn't preclude a "one IDT per NUMA domain" approach, which means that you can still get the benefits or "more than about 220 IRQs" where it matters most (on huge servers, with devices in multiple NUMA domains).
ShukantPal wrote:There are things like MSI where the interrupt directly reaches the destination CPU and where the IOAPIC redirects them to the CPU. If we keep a count of how many interrupts reached the CPU wouldn't that be inefficient - I think we should record the rate of interrupts firing otherwise the data would be useless if a device was recently plugged in or plugged out. But if we do - how? Or we could reset all of the records periodically - like zeroing the no. of interrupts occurred every 15 seconds?
For explicit IRQ balancing; I'd be tempted to do something more like "IRQ_rate = IRQ_rate/2" (rather than "IRQ_rate = 0") periodically (after IRQs are rebalanced), so that previous data isn't discarded but has less effect. I'd also do the periodic rebalancing more often (e.g. every second rather than every 15 seconds) so that it's able to react to changes in load less slowly.
ShukantPal wrote:Now, when would the IRQ balancer run? Should there be a periodically running iRQ-balancer on a system thread? Also, which devices need to be "pinned" - HPET, PIT, PS/2 Keyboard and Mouse?
Excluding interrupts from the local APIC itself (e.g. local APIC's timer, performance monitoring counter overflow, etc); I don't think any IRQs need to be pinned to a specific CPU (and due to power management I'd suggest that it's a mistake to pin normal IRQs to a specific CPU because that specific CPU might be put into a deep sleep state).
ShukantPal wrote:Any specific algorithms for IRQ-balancing? I have the idea that I could use my CPU topology tree - balance within each domain - if two separate domains have a load-difference having greater overhead compared to transferring of IRQs - transfer them?
In theory (especially for monolithic kernels) it might be nice to (e.g.) try to take cache sharing into account when doing IRQ balancing (e.g. try to ensure that an IRQ only goes to a CPU that's in a group of CPUs that all share the same L2 cache, in the hope of minimising cache misses when the interrupt handler starts).

Of course I have no idea how much complexity you're willing to deal with (especially for an initial implementation).


Cheers,

Brendan