IRQ balancing

AlexShpilkin
Posts: 15
Joined: Thu Aug 29, 2013 4:10 pm

IRQ balancing

Post by AlexShpilkin »

Can anyone point me to any materials regarding IRQ balancing on (modern, boring, shared-memory, mostly uniform and cache-coherent) multiprocessors, or express their own thoughts on the matter? One would expect it to be a problem receiving almost as much academic attention as process scheduling, but in fact I can’t find anything apart from the documentation section on the irqbalance website, which only briefly describes a rather ad-hoc approach.

Rationale, in case I don’t get around to implementing things and it might still be useful to somebody else: this seems to be one of the few policy-related things the resident part of a kernel/monitor has to be concerned with. Scheduling decisions might (or might not; I’m not yet sure) be postponable to some sort of “timer driver”, and lightweight intra-CPU synchronous IPC à la Liedtke is all coding and no policy, but something has to decide, centrally, where to deliver a particular IRQ. (Conveniently, deciding where to deliver inter-CPU synchronous IPC is exactly the same question.)
tlf30
Member
Posts: 35
Joined: Fri Feb 15, 2013 9:29 pm

Re: IRQ balancing

Post by tlf30 »

In the past I have dealt with this in several ways. The way I like (I am not going to say that it is the best) is to have my task scheduler give a higher priority to tasks that have more external interrupts to handle. This way when an external interrupt is thrown, the IRQ handling code is run. The task(s) registered to be using that IRQ are put in a high-priority queue where they are most likely to be run sooner and deal with the end result of the IRQ. This would be in cases like network data being received and the buffers needing to be emptied.
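A minimal sketch of that idea in C (task_t, runqueue_push and the priority levels are hypothetical stand-ins for whatever the kernel actually has):

Code: Select all

#include <stddef.h>

#define MAX_IRQS   256
#define PRIO_BOOST 0                    /* highest-priority queue */

typedef struct task {
    int          base_prio;             /* normal scheduling priority        */
    struct task *irq_next;              /* next task registered for this IRQ */
} task_t;

extern void runqueue_push(task_t *t, int prio); /* hypothetical scheduler hook */

static task_t *irq_waiters[MAX_IRQS];   /* tasks registered for each IRQ line */

/* Called from the low-level handler after acknowledging the interrupt:
 * move every task registered for this line into the high-priority queue
 * so it runs sooner and drains the device (e.g. empties network buffers). */
void irq_boost(unsigned irq)
{
    for (task_t *t = irq_waiters[irq]; t != NULL; t = t->irq_next)
        runqueue_push(t, PRIO_BOOST);   /* reverts to base_prio afterwards */
}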
Programming is like fishing, you must be very patient if you want to succeed.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: IRQ balancing

Post by Brendan »

Hi,

Let's start by assuming it's a NUMA system with 4 NUMA domains, like this:

Code: Select all

             ________
            |        |
            | IO Hub |
            |________|
    _____    ___:____        ________    _____
   |     |  |        |      |        |  |     |
   | RAM |--| CPUs   |------| CPUs   |--| RAM |
   |_____|  | 0 to 1 |      | 2 to 3 |  |_____|
            |________|      |________|
                :          /    :
                :         /     :
                :        /      :
                :       /       :
                :      /        :
                :     /         :
    _____    ___:____/       ___:____    _____
   |     |  |        |      |        |  |     |
   | RAM |--| CPUs   |------| CPUs   |--| RAM |
   |_____|  | 4 to 5 |      | 6 to 7 |  |_____|
            |________|      |________|
                             ___:____
                            |        |
                            | IO Hub |
                            |________|
Each IO Hub connects to PCI devices; which means there are PCI devices in NUMA domain #0 (top left) and more PCI devices in NUMA domain #3 (bottom right). Obviously you'd want the device drivers for devices connected to NUMA domain #0 to be running on CPUs that are in NUMA domain #0 (CPUs 0 and 1), and drivers for devices connected to NUMA domain #3 to be using CPUs 6 and 7.

Let's also assume that the OS doesn't have a single global IDT, but has a different IDT for each NUMA domain. This means that (using MSI) you can have 150 interrupt vectors for devices in NUMA domain #0 and another 150 interrupt vectors for devices in NUMA domain #3, and you're not limited to a global max. of about 200 interrupt vectors.
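As a rough sketch (assuming x86-64; the idt_entry layout and MAX_DOMAINS are illustrative, and gate setup is omitted), each CPU would simply lidt the table belonging to its own domain during bring-up:

Code: Select all

#include <stdint.h>

#define MAX_DOMAINS 4
#define IDT_ENTRIES 256

struct idt_entry { uint64_t low, high; };   /* one 16-byte long-mode gate */
struct idt_ptr { uint16_t limit; uint64_t base; } __attribute__((packed));

/* One full IDT per NUMA domain, so vector 0x60 in domain #0 and vector
 * 0x60 in domain #3 can belong to different devices. */
static struct idt_entry idt[MAX_DOMAINS][IDT_ENTRIES]
    __attribute__((aligned(16)));

/* Run on each CPU during bring-up: point it at its domain's table. */
void load_domain_idt(unsigned domain)
{
    struct idt_ptr p = {
        .limit = sizeof idt[domain] - 1,
        .base  = (uint64_t)&idt[domain][0],
    };
    __asm__ volatile ("lidt %0" : : "m"(p));
}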

Now...

When a device in NUMA domain #3 sends an IRQ you want it to go to CPU 6 or 7. If CPU 6 is in a low power mode you don't want to wake it up (which would be bad for power consumption, and bad for latency because waking CPUs up takes time). If CPU 6 is running a high priority task and CPU 7 is running a low priority task, then you don't want to interrupt the high priority task. Finally, if neither CPU is in a low power mode and both are running similar priority tasks, then you want the IRQs to be balanced reasonably evenly (e.g. about half to CPU 6 and half to CPU 7).

If you look into the way IRQ priorities interact with APICs, you'll notice there's a "send to lowest priority CPU" mode, and that a CPU's priority is determined (in part) by a "task priority register" in the CPU's local APIC. If the task priority register is higher than the IRQ's priority then the CPU won't accept it, which would be bad: when all CPUs are running at "too high priority", none of them accepts the IRQ. However, this won't happen if the task priority register is kept within the range that corresponds to exception handlers anyway. This means you can adjust the task priority register during task switches and when putting a CPU to sleep or waking it up, so that IRQs are automatically sent to the "best" CPU by hardware.
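A minimal sketch of that, assuming xAPIC mode with the conventional MMIO base at 0xFEE00000 already mapped (the priority scale itself is an illustrative choice):

Code: Select all

#include <stdint.h>

#define LAPIC_BASE 0xFEE00000u   /* conventional xAPIC MMIO base (assumed mapped) */
#define LAPIC_TPR  0x80u         /* Task Priority Register offset */

static inline void lapic_write(uint32_t reg, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)(LAPIC_BASE + reg) = val;
}

/* Called on every task switch (and when entering/leaving sleep). The TPR
 * stays within 0x00..0x1F, the classes covering exception vectors 0 to 31,
 * so it never masks an IRQ; but "lowest priority" arbitration still sees a
 * CPU running important work as less attractive. */
void update_tpr(unsigned task_prio)   /* 0 = idle ... 31 = most important */
{
    lapic_write(LAPIC_TPR, task_prio & 0x1F);
}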

Now let's think about CPUs 2, 3, 4 and 5. What if CPUs 6 and 7 are both running very high priority tasks and/or in a power saving state? Maybe we want CPUs 4 and 5 to help handle IRQs from NUMA domain #3, because it's worth paying the "wrong NUMA domain" penalty in that case. To do this, you can set the task priority registers for CPUs 4 and 5 in the "high to very high priority" range (depending on what they're doing); that way, if (e.g.) CPU 4 is running a low priority task and CPUs 6 and 7 are currently running very high priority tasks (or asleep), the IRQ would automatically get sent to CPU 4.
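Expressed as a single policy function feeding the update_tpr sketch above (the bias of 16 is illustrative, not a measured value):

Code: Select all

#include <stdint.h>

/* Map (task priority, distance from the device's home domain) to a TPR
 * value. A remote CPU advertises itself as busier than any home CPU at
 * the same task priority, so it only wins the IRQ when the home CPUs
 * are asleep or running very high priority work. */
uint32_t tpr_for(uint32_t task_prio /* 0..15 */, unsigned hops_from_device)
{
    uint32_t bias = (hops_from_device > 0) ? 16u : 0u;
    uint32_t tpr  = task_prio + bias;
    return (tpr > 0x1Fu) ? 0x1Fu : tpr;   /* stay in the exception range */
}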

Now...let's look at the boring/simple computer that looks like this:

Code: Select all

             ________
            |        |
            | IO Hub |
            |________|
    _____    ___:____ 
   |     |  |        |
   | RAM |--| CPUs   |
   |_____|  | 0 to 1 |
            |________|
Is this "no NUMA", or is it "NUMA with only one NUMA domain"? There's no difference. ;)

All the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case.

Note that none of the above is really that easy. There are things like the APIC logical destination register that come into it, differences between xAPIC and x2APIC, things like "directed EOI" to take into account, dodgy hardware, and so on. What I'm trying to do is describe an "ideal framework" without the complications/distractions.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
ggodw000
Member
Posts: 396
Joined: Wed Nov 18, 2015 3:04 pm
Location: San Jose San Francisco Bay Area

Re: IRQ balancing

Post by ggodw000 »

Never dealt with it before, but I've encountered the term more than once. Looks like my guess for what it is was completely wrong:
I was thinking IRQ balancing might have something to do with PCIe interrupts being evenly distributed over the 4 lines INT#A, INT#B, INT#C and INT#D, which many PCIe devices share.
key takeaway after spending years in the sw industry: a big issue becomes small because everyone jumps on it and fixes it; a small issue becomes big since everyone ignores it and it causes a catastrophe later. #devilisinthedetails
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: IRQ balancing

Post by Candy »

INTA-INTD was a backward-compatible hack for PCI Express to look like PCI, where it was in turn a backward-compatible hack to fit on top of ISA interrupt routing (the PIC), with some of the lines multiplexed and rotated around between slots. Remember that all PCIe interrupts are MSI, so there's no reason not to have 256 different ones.
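For reference, a sketch of how an x86 MSI address/data pair is composed (field layout per the Intel SDM; actually programming the device's MSI capability registers is omitted). The "lowest priority" delivery mode is what lets the TPR tricks described above steer the message:

Code: Select all

#include <stdint.h>

#define MSI_ADDR_BASE    0xFEE00000u
#define MSI_DM_LOGICAL   (1u << 2)   /* destination mode: logical            */
#define MSI_RH_LOWEST    (1u << 3)   /* redirection hint: arbitrate          */
#define MSI_DELIV_LOWPRI (1u << 8)   /* delivery mode 001b = lowest priority */

/* The address selects the destination CPU set; the data carries the vector.
 * A device writes these two words to signal its interrupt. */
uint32_t msi_address(uint8_t dest)
{
    return MSI_ADDR_BASE | ((uint32_t)dest << 12) | MSI_RH_LOWEST | MSI_DM_LOGICAL;
}

uint32_t msi_data(uint8_t vector)
{
    return MSI_DELIV_LOWPRI | vector;   /* edge-triggered; no level bits set */
}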
AlexShpilkin
Posts: 15
Joined: Thu Aug 29, 2013 4:10 pm

Re: IRQ balancing

Post by AlexShpilkin »

Brendan: thanks for the detailed explanation (with pictures, even!). However, although of course
Brendan wrote:all the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case,
I didn’t dismiss (device communication latency in) NUMA because I think it’s irrelevant, or because the simple case can’t be handled as a degenerate case of the complex one: I just thought the problem was complicated enough already. In fact, I was thinking about another part of the problems NUMA causes: shared caches. To wit, the CPU of the laptop I’m writing this on has three levels of memory caches and two levels of TLBs, shared to various degrees between four logical cores. Especially after all the discussion about the costs of hardware cache coherency (ask the lockless people), it looks like thread migration is a really bad idea most of the time, and the irqbalance text agrees. Thus the tradeoff I was talking about was not “subpar device communication speeds vs. CPU power and thread priority”, but “cache thrashing vs. the same”. Sorry for not being clear about it the first time. Is anybody aware of any sound approach to this problem?
AlexShpilkin
Posts: 15
Joined: Thu Aug 29, 2013 4:10 pm

Re: IRQ balancing

Post by AlexShpilkin »

tlf30 wrote:This way when an external interrupt is thrown, the IRQ handling code is run.
The silent assumption in this sentence is that we know which CPU (on a multiprocessor) the IRQ is to be delivered to. That is the unavoidable part of the problem I was talking about. Whether or not to
tlf30 wrote:have my task scheduler give a higher priority to tasks that have [...] external interrupts
or to handle interrupts in the kernel or to not have priorities at all, on the other hand, is an unrelated design decision.