Hi,
Let's start by assuming it's a NUMA system with 4 NUMA domains, like this:
           ________
          |        |
          | IO Hub |
          |________|
  _____    ___:____        ________    _____
 |     |  |        |      |        |  |     |
 | RAM |--| CPUs   |------| CPUs   |--| RAM |
 |_____|  | 0 to 1 |      | 2 to 3 |  |_____|
          |________|      |________|
              :          /    :
              :         /     :
              :        /      :
              :       /       :
              :      /        :
              :     /         :
  _____    ___:____/       ___:____    _____
 |     |  |        |      |        |  |     |
 | RAM |--| CPUs   |------| CPUs   |--| RAM |
 |_____|  | 4 to 5 |      | 6 to 7 |  |_____|
          |________|      |________|
                           ___:____
                          |        |
                          | IO Hub |
                          |________|
Each IO Hub connects to PCI devices, which means there are PCI devices in NUMA domain #0 (top left) and more PCI devices in NUMA domain #3 (bottom right). Obviously you'd want the device drivers for devices connected to NUMA domain #0 to run on the CPUs that are in NUMA domain #0 (CPUs 0 and 1), and the drivers for devices connected to NUMA domain #3 to run on CPUs 6 and 7.
Let's also assume that the OS doesn't have a single global IDT, but instead has a different IDT for each NUMA domain. This means that (using MSI) you can have 150 interrupt vectors for devices in NUMA domain #0 and another 150 interrupt vectors for devices in NUMA domain #3, rather than being limited to a global maximum of about 200 interrupt vectors.
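A minimal sketch of what per-domain vector allocation could look like (all names here are hypothetical, not from any real kernel - the point is just that each domain has its own independent vector space):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: one 256-entry IDT per NUMA domain, each with its
 * own allocation table for MSI vectors. */

#define MAX_DOMAINS        4
#define IDT_ENTRIES      256
#define FIRST_IRQ_VECTOR  32   /* vectors 0 to 31 are reserved for exceptions */

typedef struct {
    uint8_t used[IDT_ENTRIES];   /* 1 = vector allocated in this domain's IDT */
} domain_idt_t;

static domain_idt_t domain_idt[MAX_DOMAINS];

/* Allocate a free interrupt vector from the given domain's IDT.
 * Returns the vector number, or -1 if that domain's IDT is full. */
static int alloc_vector(int domain)
{
    for (int v = FIRST_IRQ_VECTOR; v < IDT_ENTRIES; v++) {
        if (!domain_idt[domain].used[v]) {
            domain_idt[domain].used[v] = 1;
            return v;
        }
    }
    return -1;
}
```

Because every domain has its own IDT, the same vector number can be handed out once per domain - that's where the extra vectors come from.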
Now...
When a device in NUMA domain #3 sends an IRQ you want it to go to CPU 6 or 7. If CPU 6 is in a low power mode you don't want to wake it up (which would be bad for power consumption and bad for latency because waking CPUs up takes time). If CPU 6 is running a high priority task and CPU 7 is running a low priority task, then you don't want to interrupt the high priority task. Finally; if neither CPU is in a low power mode and both are running similar priority tasks, then you want the IRQs to be balanced reasonably evenly (e.g. about half to CPU 6 and half to CPU 7).
If you look into the way IRQ priorities interact with APICs, you'll notice there's a "send to lowest priority CPU" delivery mode, and that a CPU's priority is determined (in part) by the task priority register in the CPU's local APIC. If the task priority register is higher than the IRQ's priority then the CPU won't accept the IRQ at all, which would be bad - when every CPU is running at "too high priority", none of them accept it. However, if the task priority register is always kept within the range that corresponds to exception handlers (vectors that no IRQ uses anyway), this can't happen. This means you can adjust the task priority register during task switches and when putting a CPU to sleep or waking it up, so that IRQs are automatically sent to the "best" CPU by hardware.
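In case it helps, here's roughly how the MSI message for that mode could be composed. This follows the xAPIC MSI address/data layout; the helper and constant names are mine, so check the Intel manuals before trusting the bit positions:

```c
#include <assert.h>
#include <stdint.h>

/* Compose an xAPIC MSI address/data pair for "lowest priority" delivery
 * to a logical group of CPUs (helper name is illustrative only). */

#define MSI_ADDR_BASE       0xFEE00000u
#define MSI_ADDR_DEST(d)    ((uint32_t)(d) << 12)  /* destination ID, bits 19:12 */
#define MSI_ADDR_RH         (1u << 3)              /* redirection hint */
#define MSI_ADDR_DM_LOGICAL (1u << 2)              /* logical destination mode */

#define MSI_DATA_DELIV_LOWPRI (1u << 8)            /* delivery mode 001, bits 10:8 */

static void compose_msi_lowpri(uint8_t logical_dest, uint8_t vector,
                               uint32_t *addr, uint32_t *data)
{
    *addr = MSI_ADDR_BASE | MSI_ADDR_DEST(logical_dest)
          | MSI_ADDR_RH | MSI_ADDR_DM_LOGICAL;
    *data = MSI_DATA_DELIV_LOWPRI | vector;        /* edge-triggered */
}
```

With the redirection hint set and logical destination mode, the hardware is told "pick whichever CPU in this group currently has the lowest priority".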
Now let's think about CPUs 2, 3, 4 and 5. What if CPUs 6 and 7 are both running very high priority tasks and/or are in a power saving state? Maybe we want CPUs 4 and 5 to help handle IRQs from NUMA domain #3, because it's worth paying the "wrong NUMA domain" penalty in that case. To do this, you can set the task priority registers for CPUs 4 and 5 in the "high to very high priority" range (depending on what they're doing); that way, if (e.g.) CPU 4 is running a low priority task while CPUs 6 and 7 are running very high priority tasks (or are asleep), the IRQ automatically gets sent to CPU 4.
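The whole TPR policy could be sketched like this. The function name, the priority scale and the exact bands are all invented for illustration - the only real constraint is staying within 0x00 to 0x1F (the exception-handler range) so that no CPU ever refuses an IRQ outright, it's only more or less preferred:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping from scheduler state to the local APIC's task
 * priority register (TPR). All values stay in 0x00..0x1F (the priority
 * classes covered by exception vectors), so a CPU is never so "busy"
 * that it blocks IRQs - "lowest priority" delivery just prefers
 * whichever CPU reports the smallest value. */

#define TPR_SLEEPING 0x1F   /* least preferred: don't wake this CPU if avoidable */

/* task_prio: 0 (lowest) to 15 (highest); in_home_domain: this CPU is in
 * the same NUMA domain as the device's IO Hub. */
static uint8_t compute_tpr(int task_prio, int in_home_domain, int sleeping)
{
    if (sleeping)
        return TPR_SLEEPING;
    if (in_home_domain)
        return (uint8_t)task_prio;        /* 0x00..0x0F: preferred band */
    /* Neighbouring-domain CPU: upper band, so it only "wins" when every
     * CPU in the home domain is asleep or running a very high priority
     * task - i.e. when the "wrong NUMA domain" penalty is worth paying. */
    int tpr = 0x10 + task_prio;
    return (uint8_t)(tpr > 0x1E ? 0x1E : tpr);  /* stay below "sleeping" */
}
```

You'd call something like this from the task switch code and from the sleep/wake paths, writing the result into the TPR, and the hardware does the rest of the routing for you.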
Now...let's look at the boring/simple computer that looks like this:
           ________
          |        |
          | IO Hub |
          |________|
  _____    ___:____
 |     |  |        |
 | RAM |--| CPUs   |
 |_____|  | 0 to 1 |
          |________|
Is this "no NUMA", or is it "NUMA with only one NUMA domain"? There's no difference.
All the same shenanigans we were doing for the complicated NUMA systems end up being perfectly fine for the much simpler "NUMA with only one NUMA domain" case.
Note that none of the above is really that easy. There are things like the APIC's logical destination register that come into it, differences between xAPIC and x2APIC, things like "directed EOI" to take into account, dodgy hardware, and... What I'm trying to do is describe an "ideal framework" without the complications/distractions.
Cheers,
Brendan