OSDev.org

Posted: **Sun Oct 04, 2015 4:06 pm**

Can anyone point me to any materials regarding IRQ balancing on (modern, boring, shared-memory, mostly uniform and cache-coherent) multiprocessors, or express their own thougts on the matter? It appears it should be a problem receiving almost as much academic attention as process scheduling, but I can’t in fact find anything apart from the documentation section on the irqbalance website, which only scarcely describes a rather ad-hoc approach.

Rationale, in case I don’t get around to implementing things and it might still be useful to somebody else: It seems that this is one of the few policy-related things the resident part of a kernel/monitor has to be concerned with: scheduling decisions might (or might not, I’m not yet sure) be postponable to some sort of “timer driver”, lightweight intra-CPU synchronous IPC à la Liedtke is all coding and no policy, but something has to decide, centrally, where to deliver a particular IRQ. (Conveniently, deciding where to deliver inter-CPU synchronous IPC is exactly the same question.)

Posted: **Sun Oct 25, 2015 1:39 pm**

In the past I have dealt with this in several ways. The way I like (I am not going to say that it is the best)is to have my task scheduler give a higher priority to tasks that have more external interrupts that they need to handle. This way when an external interrupt is thrown, the IRQ handling code is run. The task(s) that is registered to be using that IRQ is put in a high priority queue where is is most likely to be run sooner and deal with the end result of the IRQ. This would be in cases like where network data was received and the buffers need emptied.

Posted: **Mon Oct 26, 2015 12:41 am**

HI,

Let's start by assuming it's a NUMA system with 4 NUMA domains, like this:

Code: Select all

             ________
            |        |
            | IO Hub |
            |________|
    _____    ___:____        ________    _____
   |     |  |        |      |        |  |     |
   | RAM |--| CPUs   |------| CPUs   |--| RAM |
   |_____|  | 0 to 1 |      | 2 to 3 |  |_____|
            |________|      |________|
                :          /    :
                :         /     :
                :        /      :
                :       /       :
                :      /        :
                :     /         :
    _____    ___:____/       ___:____    _____
   |     |  |        |      |        |  |     |
   | RAM |--| CPUs   |------| CPUs   |--| RAM |
   |_____|  | 4 to 5 |      | 6 to 7 |  |_____|
            |________|      |________|
                             ___:____
                            |        |
                            | IO Hub |
                            |________|

Each IO Hub connects to PCI devices; which means there's PCI devices in NUMA domain #0 (top left) and more PCI devices in NUMA domain #3 (bottom right). Obviously you'd want the device drivers for devices connected to NUMA domain #0 to be running on CPUs that are in NUMA domain #0, which are CPUs 0 and 1; and devices connected to NUMA domain #3 to be using CPUs 6 and 7.

Let's also assume that the OS doesn't have a single global IDT, but has a different IDT for each NUMA domain. This means that (using MSI) you can have 150 interrupt vectors for devices in NUMA domain #0 and another 150 interrupt vectors for devices in NUMA domain #3, and you're not limited to a global max. of about 200 interrupt vectors.

Now...

When a device in NUMA domain #3 sends an IRQ you want it to go to CPU 6 or 7. If CPU 6 is in a low power mode you don't want to wake it up (which would be bad for power consumption and bad for latency because waking CPUs up takes time). If CPU 6 is running a high priority task and CPU 7 is running a low priority task, then you don't want to interrupt the high priority task. Finally; if neither CPU is in a low power mode and both are running similar priority tasks, then you want the IRQs to be balanced reasonably evenly (e.g. about half to CPU 6 and half to CPU 7).

If you look into the way IRQ priorities interact with APICs, you'll notice there's a "send to lowest priority CPU" mode and a CPU's priority is determined (in part) by a "task priority register" in the CPU's local APIC. If the task priority register is higher than the IRQ's priority then the CPU won't accept it, which would be bad - when all CPUs are running at "too high priority" none accept the IRQ. However, if the task priority register is kept within the range that corresponds to exception handlers anyway this won't happen. This means you can adjust the task priority register during task switches and when putting a CPU to sleep and waking it up; so that IRQs are automatically sent to the "best" CPU by hardware.

Now let's think about CPUs 2, 3, 4 and 5. What if CPUs 6 and 7 are both running very high priority tasks and/or in a power saving state? Maybe we want CPUs 4 and 5 to help handle IRQs from NUMA domain #3 because its worth paying the "wrong NUMA domain" penalty in that case. To do this, you can set the task priority registers for CPUs 4 and 5 in the "high to very high priority" range (depending on what they're doing); and that way if (e.g.) CPU 4 is running a low priority task and CPUs 6 and 7 are currently running very high priority tasks (or asleep), then the IRQ would automatically get sent to CPU 4.

Now...let's look at the boring/simple computer that looks like this:

Code: Select all

             ________
            |        |
            | IO Hub |
            |________|
    _____    ___:____ 
   |     |  |        |
   | RAM |--| CPUs   |
   |_____|  | 0 to 1 |
            |________|

Is this "no NUMA", or is it "NUMA with only one NUMA domain"? There's no difference.

All the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case.

Note that none of the above is really that easy. There's things like APIC logical destination register that come into it, and differences between xAPIC and x2APIC, and things like "directed EOI" to take into account, and dodgy hardware, and... What I'm trying to do is describe an "ideal framework" without the complications/distractions.

Cheers,

Brendan

Posted: **Fri Jan 01, 2016 1:48 pm**

never dealt with it before but encountered the terms more than once. looks like my guess for what it is is completely wrong:
i was thinking irq balancing might have something to do with PCIe interrupts being evenly distributed by 4 lines INT#A INT#B INT#C INT#D in which many PCIe devices interrupt shared.

Posted: **Mon Jan 04, 2016 3:47 am**

INTA-INTD was a backward-compatible hack for Pci-Express to look like PCI, where it was a backward-compatible hack to fit on top of ISA interrupt routing (PIC) with some of them multiplexed and rotated around between slots. Remember that all Pci-E interrupts are MSI, so there's no reason not to have a 256 different ones.

Posted: **Mon Apr 04, 2016 4:05 pm**

Brendan: thanks for the detailed explanation (with pictures, even!). However, although of course

Brendan wrote:all the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case,

I didn’t dismiss (device communication latency in) NUMA because I think it’s irrelevant or that the simple case can’t be handled as a degenerate case of the complex one: I just thought the problem was complicated enough already. In fact, I was actually thinking about another part of the problems NUMA causes: shared caches. To wit, the CPU of the laptop that I’m writing this on has three levels of memory caches and two levels of TLBs shared to various degrees between four logical cores. Especially after all the discussion about the costs of hardware cache coherency (ask the lockless people), it looks like thread migration is a really bad idea most of the time, and the irqbalance text agrees. Thus the tradeoff I was talking about was not “subpar device communication speeds vs. CPU power and thread priority”, it was “cache thrashing vs. the same”. Sorry about not being clear about it the first time, and—is anybody aware of any sound approach to this problem?

Posted: **Mon Apr 04, 2016 4:13 pm**

tlf30 wrote:This way when an external interrupt is thrown, the IRQ handling code is run.

The silent assumption in this sentence is that we know which CPU (on a multiprocessor) is to deliver the IRQ to. This is the unavoidable part of the problem I was talking about. Whether or not to

tlf30 wrote:have my task scheduler give a higher priority to tasks that have [...] external interrupts

or to handle interrupts in the kernel or to not have priorities at all, on the other hand, is an unrelated design decision.

OSDev.org

IRQ balancing

IRQ balancing

Re: IRQ balancing

Re: IRQ balancing

Re: IRQ balancing

Re: IRQ balancing

Re: IRQ balancing

Re: IRQ balancing