
Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 2:01 am
by rdos
This is a generic discussion of how to distribute IRQ load in an SMP system. People are welcome to describe their ideas in the context of their own OS implementations, or how popular OSes do it. I don't want to see off-topic posts about how bad particular implementations are, nor any other posts unrelated to the subject.

Some background from my own findings.

One way to balance IRQ load is the "lowest priority delivery" mode that the APIC architecture introduces. It is available for IO-APIC originated IRQs, for some HPETs, and in the MSI architecture that some (modern) PCI devices implement. When using lowest priority delivery, the scheduler must load the TPR register with an appropriate task priority as it switches tasks.
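For concreteness, here is a minimal sketch (in C) of what the IO-APIC side of lowest priority delivery can look like. The register offsets are the standard IO-APIC ones; the MMIO base address, the function name and the destination value are placeholders for illustration, not code from any particular OS:

#include <stdint.h>

#define IOAPIC_BASE  0xFEC00000u        /* common default base; read it from ACPI/MP tables in practice */
#define IOREGSEL     0x00               /* register select */
#define IOWIN        0x10               /* register data window */
#define REDTBL_LO(n) (0x10 + 2 * (n))   /* low dword of redirection entry n */
#define REDTBL_HI(n) (0x11 + 2 * (n))   /* high dword of redirection entry n */

static void ioapic_write(uint8_t reg, uint32_t val)
{
    volatile uint32_t *ioapic = (volatile uint32_t *)IOAPIC_BASE;
    ioapic[IOREGSEL / 4] = reg;         /* select the register ... */
    ioapic[IOWIN / 4]    = val;         /* ... then write it through the window */
}

/* Route IO-APIC input 'gsi' to 'vector', delivered to whichever CPU in the
 * logical destination group currently has the lowest priority (TPR). */
void route_irq_lowest_priority(unsigned gsi, uint8_t vector, uint8_t logical_dest)
{
    uint32_t lo = vector
                | (1u << 8)             /* delivery mode 001b = lowest priority */
                | (1u << 11);           /* destination mode 1 = logical */
    uint32_t hi = (uint32_t)logical_dest << 24;

    ioapic_write(REDTBL_HI(gsi), hi);   /* program the destination first */
    ioapic_write(REDTBL_LO(gsi), lo);   /* then unmask (bit 16 = 0), edge triggered, active high */
}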

Lowest priority delivery seems to work on most computers, but I have two cases of computers based on Intel Core Duo chips where IRQs are not delivered when this mode is used. It is possible that I have missed some setting on those processors / motherboards, but there could also be issues with the delivery mode itself.

At least on several AMD processors, it seems like when all CPUs have the same priority, IRQs are always delivered to the same CPU (in the AMD case, the active CPU with the highest APIC ID). This means that lowest priority delivery will not ensure that IRQ load is balanced between CPUs, only that IRQs are delivered to the CPU that currently has the lowest priority.

Another way to balance IRQs is to let the Power Manager (or some other thread/function) calculate the CPU with the lowest load, and redirect one or more IRQs to it in fixed delivery mode. This could be done on the scale of a few times per second. The upside of this approach is that it could do real IRQ load balancing. The downside is that it won't achieve lowest priority delivery of IRQs. An IRQ might interrupt a CPU executing a high priority task when there are other idle CPUs that would be better to use.
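A sketch of that idea, assuming hypothetical helpers for reading per-CPU load and reprogramming the IO-APIC (none of these names come from a real OS):

#include <stdint.h>

extern unsigned num_cpus;
extern unsigned cpu_load(unsigned cpu);               /* e.g. percent busy over the last interval */
extern unsigned pick_irq_to_move(void);               /* choose a busy IRQ to migrate */
extern uint8_t  apic_id_of(unsigned cpu);
extern void     ioapic_set_fixed_dest(unsigned gsi, uint8_t apic_id);  /* fixed, physical delivery */

/* Called from the power manager a few times per second. */
void rebalance_irqs(void)
{
    unsigned best = 0;

    /* find the CPU with the lowest load */
    for (unsigned cpu = 1; cpu < num_cpus; cpu++)
        if (cpu_load(cpu) < cpu_load(best))
            best = cpu;

    /* retarget one IRQ to it; with fixed delivery it now always interrupts 'best' */
    ioapic_set_fixed_dest(pick_irq_to_move(), apic_id_of(best));
}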

Re: Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 5:24 am
by rdos
I'm thinking about redesigning the current concept, which is partly why I started the thread. Originally, I used lowest priority delivery on all IRQs, but since then I've had to change this, first for ISA-based IRQs and later also for PCI-based ones, as that was the only way to make two of my PCs work.

As I mentioned in the other thread, I also need load balancing, and the distribution of IRQs between CPUs makes a big difference there: IRQs wake up server threads, which in turn wake up applications, and since these are all short-running, they tend to execute on the CPU that received the original IRQ.

Re: Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 5:48 am
by Solar
rdos wrote:This is a generic discussion of how to distribute IRQ load in an SMP system.
http://www.alexonlinux.com/smp-affinity ... g-in-linux

and

http://www.alexonlinux.com/why-interrup ... good-thing

talk about Linux, but there might be some useful tidbits in there, especially on the subject of whether it's wise to spread out IRQs equally:
AlexOnLinux wrote:My point is that round-robin style interrupt delivery can be quite nasty on performance. It is much better to deliver interrupts from a certain device to a given core.

Re: Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 6:43 am
by rdos
Solar wrote:
rdos wrote:This is a generic discussion of how to distribute IRQ load in an SMP system.
http://www.alexonlinux.com/smp-affinity ... g-in-linux

and

http://www.alexonlinux.com/why-interrup ... good-thing

talk about Linux, but there might be some useful tidbits in there, especially on the subject of whether it's wise to spread out IRQs equally:
AlexOnLinux wrote:My point is that round-robin style interrupt delivery can be quite nasty on performance. It is much better to deliver interrupts from a certain device to a given core.
OK, that was an interesting viewpoint from the Linux community.

This issue seems to be somewhat related to how the scheduler works. In Linux, it seems like the device that handles the interrupt is pegged to a particular core, and thus it would be optimal to deliver the IRQ to the same core in order to minimize latency (or to send an IPI to it so it grabs the server thread). In RDOS (and possibly other OSes?), the server thread is not bound to a particular core; as soon as the IRQ signals it to wake up, it is run by the CPU that woke it. Therefore, there is no need to peg the server thread and IRQ to a particular core in RDOS, and if the IRQ is moved to a new CPU, so is the server thread. This is also why moving IRQs can achieve load balancing.
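A generic sketch of that wake-up path (not actual RDOS code; all names are made up for illustration): the IRQ handler signals a wait object, and the CPU doing the signalling puts the unblocked thread on its own run queue, so the server thread ends up running wherever the IRQ was taken.

struct thread;

extern struct thread *waiters_pop(void *wait_object);            /* first thread blocked on the object, or NULL */
extern void           enqueue_on_current_cpu(struct thread *t);  /* local run queue of the executing CPU */
extern void           request_reschedule(void);                  /* reschedule when the IRQ handler returns */

/* Called from an IRQ handler to wake the server thread for this device. */
void irq_signal(void *wait_object)
{
    struct thread *t = waiters_pop(wait_object);
    if (t) {
        enqueue_on_current_cpu(t);   /* no fixed affinity: the thread runs on this CPU */
        request_reschedule();
    }
}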

The same goes for wakeups from timer IRQs. If a global timer is used (HPET), all timeouts will be served by the CPU that is programmed to handle the HPET IRQ, and if there are many of them, this affects the load balance. By slowly moving the HPET IRQ between CPUs, a balanced load can be achieved.

They also confirm my finding:
On some computers IO-APIC does not support logical delivery mode. This can be because of buggy BIOS or too many CPUs. On such computers physical interrupt delivery mode is the only thing that works, so binding single interrupt to single core is the only choice and the only thing you can do is switch the core from one to another.

Re: Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 7:30 am
by Solar
rdos wrote:In Linux, it seems like the device that handles the interrupt is pegged to a particular core [...] Therefore, there is no need to peg the server thread and IRQ to a particular core in RDOS, and if the IRQ is moved to a new CPU, so is the server thread.
The need is called "cache coherence" / "cold cache". You might want to ignore the issue to keep things simple, but if you want good performance you want to run repetitive tasks (like handling a NIC interrupt) on the same core, so as to enjoy the (significant) benefits of a hot cache.

Edit: That is, effectively, what was said in the linked article, and what Brendan elaborated on in quite some detail in that other thread...

Re: Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 8:40 am
by rdos
Solar wrote:
rdos wrote:In Linux, it seems like the device that handles the interrupt is pegged to a particular core [...] Therefore, there is no need to peg the server thread and IRQ to a particular core in RDOS, and if the IRQ is moved to a new CPU, so is the server thread.
The need is called "cache coherence" / "cold cache". You might want to ignore the issue to keep things simple, but if you want good performance you want to run repetitive tasks (like handling a NIC interrupt) on the same core, so as to enjoy the (significant) benefits of a hot cache.

Edit: That is, effectively, what was said in the linked article, and what Brendan elaborated on in quite some detail in that other thread...
Yes, I know that the server thread should run on the same core as much as possible, and that would still largely be the case, because IRQs are only switched slowly (say 4 times per second) in order to keep the cores equally loaded over time so temperatures stay roughly the same. The primary difference between the Linux design and mine is that Linux keeps the server thread on some fixed core and then tries to match the IRQ to this core, while in my design the server thread is automatically run where the IRQ occurs.

Re: Distributing IRQs between cores in an SMP system

Posted: Fri May 25, 2012 8:52 am
by Brendan
Hi,

In my opinion, for micro-kernels (e.g. where the IRQ handlers in the kernel only cause message/s to be sent to task/s) there's no need to worry about cache (as the message sending code is just as likely to be in every CPU's cache anyway). The best possible IRQ handling is to use "lowest priority, logical delivery" configured so that interrupts are sent to the CPU that currently has the lowest priority in the NUMA domain that is "closest" to the device; and correctly manage each CPU's TPR.

For correctly managing TPR, the TPR register has a "priority" and a "priority sub-class". Leave the "priority" set to zero. Use the "priority sub-class" to tell the APIC what priority the CPU is running at (e.g. 0x00 = very low priority, 0x0F = very high priority). Then:
  • If the CPU is running a task, set the TPR in the range from 0x08 to 0x0F to correspond to the task's priority (e.g. for a very low priority task set the TPR to 0x08 and for a very high priority task set the TPR to 0x0F). This means that IRQs are more likely to be sent to CPUs that are running low priority tasks (and therefore it helps the performance of high priority tasks).
  • If a CPU is idle (but not in a sleep state), set the TPR to "as low as possible" (0x00). This means IRQs are less likely to interrupt CPUs that are doing other work.
  • If a CPU is in a sleep state, set the TPR to "high" (e.g. 0x0F or maybe 0x0E). Bringing a CPU out of its sleep state takes time, which increases IRQ latency and is bad for performance. It's also bad for power consumption.
  • If changing TPR isn't too expensive (it shouldn't be), consider lowering TPR when you enter the kernel and raising it when leaving the kernel. I'd subtract 7. For example, if you're running a very high priority task set the TPR to 0x0F when you return to CPL=3 and change it to 0x08 when you enter the kernel; and if you're running a very low priority task set the TPR to 0x08 when you return to CPL=3 and change it to 0x01 when you enter the kernel. This helps to avoid the overhead of CPL=3 -> CPL=0 -> CPL=3 switching.
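A rough sketch of that TPR scheme, assuming an xAPIC whose TPR is memory-mapped at offset 0x80 from the local APIC base (x2APIC would use an MSR instead); the 0..7 task priority range and the function names are assumptions for illustration only:

#include <stdint.h>

#define LAPIC_BASE 0xFEE00000u   /* common default; read IA32_APIC_BASE in practice */
#define LAPIC_TPR  0x80

static inline void lapic_set_tpr(uint8_t subclass)
{
    /* priority class (bits 7:4) stays zero; only the sub-class is used */
    *(volatile uint32_t *)(LAPIC_BASE + LAPIC_TPR) = subclass & 0x0F;
}

/* task_priority: 0 = very low .. 7 = very high */
void on_task_switch(unsigned task_priority) { lapic_set_tpr(0x08 + task_priority); }
void on_idle(void)                          { lapic_set_tpr(0x00); }  /* preferred IRQ target  */
void before_sleep(void)                     { lapic_set_tpr(0x0F); }  /* avoid waking this CPU */

/* Optional: subtract 7 while inside the kernel, restore on return to CPL=3. */
void on_kernel_entry(unsigned task_priority) { lapic_set_tpr(0x01 + task_priority); }
void on_kernel_exit(unsigned task_priority)  { lapic_set_tpr(0x08 + task_priority); }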
If the chipset is dodgy and doesn't support "lowest priority delivery", then the only choice is to fall back to fixed delivery (something similar to what I describe for monolithic kernels).

For monolithic kernels, where the kernel's IRQ handler passes control directly to the corresponding device driver/s IRQ handler/s, the cache problem has a much larger effect. In this case I'd probably consider using fixed delivery instead, so that the most frequent IRQ is sent to the first CPU on core#0, the second most frequent IRQ is sent to the first CPU on core#1, etc. This means reconfiguring everything when a CPU goes to sleep or is woken up again, or whenever IRQs become too unbalanced, with no attempt to avoid interrupting high priority tasks.
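A sketch of that fixed-delivery assignment, with hypothetical helpers for interrupt counts and IO-APIC reprogramming:

#include <stdint.h>

#define NUM_IRQS 24   /* number of IO-APIC inputs; purely illustrative */

extern unsigned online_cores;
extern uint64_t irq_count(unsigned gsi);                  /* interrupts seen since the last rebalance */
extern uint8_t  apic_id_of_core(unsigned core);
extern void     ioapic_set_fixed_dest(unsigned gsi, uint8_t apic_id);

/* Re-run when a CPU sleeps/wakes or when the IRQ load becomes too unbalanced. */
void assign_irqs_fixed(void)
{
    unsigned gsi[NUM_IRQS];

    for (unsigned i = 0; i < NUM_IRQS; i++)
        gsi[i] = i;

    /* sort IRQs by frequency, most frequent first (simple insertion sort) */
    for (unsigned i = 1; i < NUM_IRQS; i++) {
        unsigned v = gsi[i], j = i;
        while (j > 0 && irq_count(gsi[j - 1]) < irq_count(v)) {
            gsi[j] = gsi[j - 1];
            j--;
        }
        gsi[j] = v;
    }

    /* busiest IRQ -> core#0, next -> core#1, ..., wrapping round-robin */
    for (unsigned i = 0; i < NUM_IRQS; i++)
        ioapic_set_fixed_dest(gsi[i], apic_id_of_core(i % online_cores));
}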


Cheers,

Brendan

Re: Distributing IRQs between cores in an SMP system

Posted: Thu Jun 07, 2012 9:00 am
by Owen
Other thought: if IRQs are unevenly distributed, the logical consequence is that the heavily loaded core will get less non-IRQ work done. A consequence of this is that its load average will increase, and hence the scheduler should begin shedding tasks from it because the overall system load is unbalanced.

In other words, unbalanced IRQ load is not necessarily a bad thing, because the rest of the system should be able to compensate for it. In fact, it is probably a good thing, because it should, in theory, mean that one core keeps the IRQ handler in cache.

Re: Distributing IRQs between cores in an SMP system

Posted: Thu Jun 07, 2012 12:44 pm
by rdos
Owen wrote:Other thought: if IRQs are unevenly distributed, the logical consequence is that the heavily loaded core will get less non-IRQ work done. A consequence of this is that its load average will increase, and hence the scheduler should begin shedding tasks from it because the overall system load is unbalanced.
That works if enough non-IRQ related tasks are present, but not otherwise. If you have only one major activity going, which is triggered by an IRQ, all load will be on a single core. If the load is heavy enough to overheat this core, you will be in for trouble.

Re: Distributing IRQs between cores in an SMP system

Posted: Fri Jun 08, 2012 1:12 am
by Owen
rdos wrote:
Owen wrote:Other thought: if IRQs are unevenly distributed, the logical consequence is that the heavily loaded core will get less non-IRQ work done. A consequence of this is that its load average will increase, and hence the scheduler should begin shedding tasks from it because the overall system load is unbalanced.
That works if enough non-IRQ related tasks are present, but not otherwise. If you have only one major activity going, which is triggered by an IRQ, all load will be on a single core. If the load is heavy enough to overheat this core, you will be in for trouble.
If you're using lowest priority first, and a core is servicing an IRQ, then a direct consequence of this is that it is no longer the lowest priority core. Ergo, some of the IRQ load will be distributed to the next core.

I'd say an overheating core is an indication of a failing cooling system... but even so, the correct course of action is to thermally throttle the core, which will cause some of the IRQ load to be shed to another core.

Re: Distributing IRQs between cores in an SMP system

Posted: Fri Jun 08, 2012 3:07 am
by rdos
Owen wrote:If you're using lowest priority first, and a core is servicing an IRQ, then a direct consequence of this is that it is no longer the lowest priority core. Ergo, some of the IRQ load will be distributed to the next core.
I no longer use lowest priority delivery, but fixed delivery, because some chipsets do not work properly with lowest priority delivery. Another reason is that delivery is not spread out when all cores are idle or run at the same priority; in that scenario the interrupt is always delivered to the same core every time.

In the scenario where one major task is triggered by an IRQ, fixed delivery means the load will always land on the same core. I address this by slowly moving IRQs around so that the lowest loaded core takes over one IRQ from another core on a regular basis (a few times per second). This way I can ensure that the long-term load is similar across cores, and thus that temperatures stay roughly even on single-socket systems.