Strange IO-APIC register values.

Posted: Sun Nov 19, 2017 7:39 pm
by wangt13
I am learning about the IO-APIC on x86 systems.
To do that, I used my VMware VM (Ubuntu 16.10, running on ESX).

I wrote a C program to dump the IO-APIC registers, and tried to interpret the values according to the IO-APIC page on OSDev.

Here is the output.

Code: Select all

# ./dumpioapic
Reg[0] = 0x01000000
Reg[1] = 0x00170011
Reg[2] = 0x01000000
Redir[12] = 300000000000931: vec = 31, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[14] = 300000000000930: vec = 30, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[16] = 300000000000933: vec = 33, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[18] = 300000000000934: vec = 34, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[1a] = 300000000000935: vec = 35, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[1c] = 300000000000936: vec = 36, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[1e] = 300000000000937: vec = 37, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[20] = 300000000000938: vec = 38, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[22] = 30000000000a939: vec = 39, d_mode = 1, dest_mode = 1, pin_pol = 1, tr_mode = 1
Redir[24] = 30000000000093a: vec = 3a, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[26] = 30000000000093b: vec = 3b, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[28] = 30000000000093c: vec = 3c, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[2a] = 30000000000093d: vec = 3d, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[2c] = 30000000000093e: vec = 3e, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[2e] = 10000000000093f: vec = 3f, d_mode = 1, dest_mode = 1, pin_pol = 0, tr_mode = 0
Redir[30] = 30000000000a954: vec = 54, d_mode = 1, dest_mode = 1, pin_pol = 1, tr_mode = 1
And, below is the output of 'cat /proc/interrupts | grep -i "IO-APIC"'.

Code: Select all

# cat /proc/interrupts | grep -i "IO-APIC"
  0:         16          0   IO-APIC   2-edge      timer
  1:          2          7   IO-APIC   1-edge      i8042
  6:          4          0   IO-APIC   6-edge      floppy
  8:          1          0   IO-APIC   8-edge      rtc0
  9:          0          0   IO-APIC   9-fasteoi   acpi
 12:        148          3   IO-APIC  12-edge      i8042
 14:          0          0   IO-APIC  14-edge      ata_piix
 15:     227096        238   IO-APIC  15-edge      ata_piix
 16:          1          0   IO-APIC  16-fasteoi   vmwgfx
 17:       4330      29827   IO-APIC  17-fasteoi   ioc0
To my surprise, the vector number shown by my C program did NOT match the output of /proc/interrupts.
I was also confused by my own output when I compared it with the description on the IO-APIC page of OSDev.

My code follows the IO-APIC page on OSDev, but I could NOT work out why it produces this strange output.
Here is my C code for your reference.

Code: Select all

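    /* Map the IO-APIC register page at its assumed physical base, 0xFEC00000. */
    addr = mmap(0, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0xFEC00000);
    if (addr == MAP_FAILED) {
        printf("Failed to mmap\n");
        goto out;
    } else {
        for (i = 0; i < 0x40; i ++) {
            /* Skip the indices between the ID/version/arbitration registers
               (0x00-0x02) and the start of the redirection table (0x10). */
            if ((i > 2) && (i < 0x10)) {
                continue;
            }
            /* Write the register index to IOREGSEL (offset 0x00), then read
               the selected register through IOWIN (offset 0x10). */
            *(volatile uint32_t *)(addr) = i;
            data = *(volatile uint32_t*)(addr + 0x10);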
            if (i >= 0x10) {
                if ((i % 2) == 1) {
                    redir_reg.raw_reg |= (uint64_t)data << 32;
                    if (redir_reg.reg_bits.mask == 0) {
                        printf("Redir[%x] = %lx",
                                i - 0x1, redir_reg.raw_reg);
                        printf(": vec = %x, d_mode = %x, dest_mode = %x, pin_pol = %x, tr_mode = %x\n",
                                redir_reg.reg_bits.vector, redir_reg.reg_bits.deliver_mode, redir_reg.reg_bits.dest_mode,
                                redir_reg.reg_bits.pin_polarity, redir_reg.reg_bits.trigger_mode);
                    }
                } else {
                    redir_reg.raw_reg = data;
                }
            } else {
                printf("Reg[%x] = 0x%08x\n", i, data);
            }
        }
    }
Thanks,
-Tao

Re: Strange IO-APIC register values.

Posted: Sun Nov 19, 2017 8:40 pm
by bellezzasolo
OK, strange issue.
From your register dump, it's clear that you've found an IO-APIC.
IOAPICID, IOAPICVER and IOAPICARB are all in order, and well formed.
This eliminates issues caused by your assumption of the IOAPIC's address (I believe it's potentially remappable).
It's also clear that there's no funny endianness stuff going on; you've built the values correctly.
That leaves the conclusion that the host OS (Linux) isn't feeding you the raw vectors in /proc/interrupts.
I would suggest changing

Code: Select all

printf("Redir[%x] = %lx",
i - 0x1, redir_reg.raw_reg);
to

Code: Select all

printf("Redir[%x] = %lx",
(i - 0x11)/2, redir_reg.raw_reg);
to print the redirection table entry number (the X in IOREDTBLX) rather than the register index. That said, with a bit of mental gymnastics you can work it out from the current output.
My guess is that Linux stores some base vector (0x30), and the quoted IO-APIC 2 is an offset.
This will be a hangover from supporting the PIC mark 1.
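
For reference, the index arithmetic and the bit fields printed by the dump can be sketched in C as below. This is only a sketch, not wangt13's actual code: the struct and the helper names `ioredtbl_entry_index` and `decode_redir` are made up for illustration, and the bit positions assume the usual IO-APIC redirection entry layout (vector in bits 0-7, delivery mode in 8-10, destination mode in 11, polarity in 13, trigger mode in 15, mask in 16, destination in 56-63).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields of one redirection entry. */
struct redir_fields {
    uint8_t vector;        /* bits 0-7 */
    uint8_t delivery_mode; /* bits 8-10 */
    uint8_t dest_mode;     /* bit 11: 0 = physical, 1 = logical */
    uint8_t pin_polarity;  /* bit 13: 0 = active high, 1 = active low */
    uint8_t trigger_mode;  /* bit 15: 0 = edge, 1 = level */
    uint8_t mask;          /* bit 16: 1 = masked */
    uint8_t destination;   /* bits 56-63 */
};

/* Map a register index (0x10 and up) to its redirection table entry.
   The low (even) and high (odd) register of a pair give the same entry. */
static unsigned ioredtbl_entry_index(unsigned reg)
{
    assert(reg >= 0x10);
    return (reg - 0x10) / 2;
}

/* Split a raw 64-bit redirection entry into its fields. */
static struct redir_fields decode_redir(uint64_t raw)
{
    struct redir_fields f;
    f.vector        = (uint8_t)(raw & 0xFF);
    f.delivery_mode = (uint8_t)((raw >> 8) & 0x7);
    f.dest_mode     = (uint8_t)((raw >> 11) & 0x1);
    f.pin_polarity  = (uint8_t)((raw >> 13) & 0x1);
    f.trigger_mode  = (uint8_t)((raw >> 15) & 0x1);
    f.mask          = (uint8_t)((raw >> 16) & 0x1);
    f.destination   = (uint8_t)(raw >> 56);
    return f;
}
```

With the dump above, the line labelled Redir[12] is entry 1 and the line labelled Redir[30] is entry 16.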

Re: Strange IO-APIC register values.

Posted: Sun Nov 19, 2017 9:58 pm
by Brendan
Hi,
wangt13 wrote:To my surprise, the vector number shown by my C program did NOT match the output of /proc/interrupts.
I was also confused by my own output when I compared it with the description on the IO-APIC page of OSDev.
There's about 5 different "numbering schemes for things" involved:
  • IRQ number at the device
  • IRQ number on the source bus
  • Input line number at the interrupt controller
  • Global IRQ number
  • Interrupt vector at the CPU
The relationships between these numbering schemes vary from "linear" to "random/arbitrary". For example, "PCI IRQ A" at the device can be "PCI IRQ D" at the PCI bus, which could be "input number 33" at the IO APIC, which could be "IRQ #55" and might generate interrupt vector 123.

Linux is showing you "global IRQ number", and your utility is showing "interrupt vector" - they aren't the same numbering scheme.

Note that most of these have a basis in reality (e.g. something connected to "IO APIC input #3" literally is connected to "IO APIC input #3"); but "global IRQ number" doesn't - it's just a conceptual tool (meaningless without a definition, where the definition used by Linux is probably an extension of the definition that ACPI uses).

The definition that ACPI uses is:
  • IRQs 0, 1 and 3 to 15 have the same meaning as they used to for the legacy ISA bus
  • IRQs 16 to the total number of inputs that all IO APICs combined have are based on a sequential numbering of IO APIC inputs (e.g. if there are two IO APICs with 16 inputs each, then IRQ #17 would be input #1 on the 2nd IO APIC)
  • Anything else (e.g. anything using Message Signalled Interrupts/MSI) isn't part of ACPI's "global interrupt number" scheme and isn't defined
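
As a rough illustration of the sequential part of that definition (IRQs 16 and up), here is a small C sketch. The `ioapic_info` struct and `gsi_to_input` are hypothetical names, and the sketch assumes the OS has already recorded a base IRQ number and an input count for each IO APIC. It deliberately ignores the legacy IRQs below 16, which would need their own lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one IO APIC, filled in by the OS at boot. */
struct ioapic_info {
    uint32_t gsi_base;   /* first global IRQ number served by this IO APIC */
    uint32_t num_inputs; /* number of input pins on this IO APIC */
};

/* Find the IO APIC and input that serve a global IRQ number.
   Returns 0 on success, -1 if no IO APIC covers that number. */
static int gsi_to_input(const struct ioapic_info *apics, size_t count,
                        uint32_t gsi, size_t *apic_index, uint32_t *input)
{
    assert(apics != NULL && apic_index != NULL && input != NULL);
    for (size_t i = 0; i < count; i++) {
        if (gsi >= apics[i].gsi_base &&
            gsi - apics[i].gsi_base < apics[i].num_inputs) {
            *apic_index = i;
            *input = gsi - apics[i].gsi_base;
            return 0;
        }
    }
    return -1;
}
```

For the example in the list (two IO APICs with 16 inputs each, bases 0 and 16), looking up IRQ #17 yields IO APIC index 1 (the 2nd one) and input 1.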

Cheers,

Brendan

Re: Strange IO-APIC register values.

Posted: Sun Nov 19, 2017 10:48 pm
by wangt13
Thanks, Brendan, for the explanation; that makes sense.
I didn't even know there were 5 numbering schemes. :(

Could you point me in a direction from which I can finally figure out the mapping between 'global IRQ number' and 'interrupt vector'?

With that I may get a better/correct understanding of IRQ Balance mechanism in Linux + X86.

Thanks,
-Tao

Re: Strange IO-APIC register values.

Posted: Mon Nov 20, 2017 12:39 am
by Brendan
Hi,
wangt13 wrote:Could you point me in a direction from which I can finally figure out the mapping between 'global IRQ number' and 'interrupt vector'?
For APICs (local APIC and IO APIC), the interrupt vector is used as an "IRQ priority", and determines things like which IRQ is sent to the CPU next (when 2 or more IRQs are pending), and whether the work the CPU is doing is more important than the IRQ. For this reason the interrupt vector should be chosen to reflect the importance/priority of the IRQ. This has nothing to do with the "global IRQ number" at all, so there shouldn't be any way to determine "global IRQ number" from "interrupt vector" alone, or "interrupt vector" from "global IRQ number" alone.

Instead, you'd have to figure out how "global IRQ number" is related to "IO APIC input number", then use the corresponding IO APIC redirection table entry to determine which interrupt vector is used for that IO APIC input.
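
A rough C sketch of the first half of that chain, using the /proc/interrupts output from the first post. The helper names `parse_ioapic_line` and `pin_to_low_reg` are hypothetical, and the sketch assumes that the number in front of "-edge" or "-fasteoi" is the IO APIC input, which appears to fit the dump in this thread but is not something I've confirmed from the kernel source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one "IO-APIC" line of /proc/interrupts.
   On success stores the global IRQ number and the (assumed) IO APIC
   input and returns 0; returns -1 for lines that do not match. */
static int parse_ioapic_line(const char *line, int *irq, int *pin)
{
    const char *p = strstr(line, "IO-APIC");
    if (p == NULL)
        return -1;
    if (sscanf(line, " %d:", irq) != 1)      /* leading "NN:" column */
        return -1;
    if (sscanf(p + strlen("IO-APIC"), " %d-", pin) != 1) /* "NN-edge" */
        return -1;
    return 0;
}

/* Index of the low 32-bit register of the redirection entry for an input. */
static unsigned pin_to_low_reg(int pin)
{
    assert(pin >= 0);
    return 0x10u + 2u * (unsigned)pin;
}
```

Applied to the line for IRQ 0 (timer), this gives input 2 and register index 0x14, and the dump shows Redir[14] with vec = 30. For IRQ 16 (vmwgfx) it gives input 16 and register index 0x30, where the dump shows vec = 54.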

Of course Linux fails to do anything right, and it shouldn't be a surprise that they've got this wrong too. For example; for legacy ISA they've assigned interrupt vectors in "linear order" without any regard for IRQ priorities at all; so the PIT chip's IRQ which should've been higher priority (and would have been the highest possible priority for PIC chips) ends up being the lowest priority IRQ.
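
To make the priority point concrete, here is a tiny C sketch of how a vector number translates into priority at the local APIC. The function names are made up for illustration. It assumes the usual rule that the upper 4 bits of the vector form the priority class, and that of two pending vectors the numerically higher one is serviced first.

```c
#include <assert.h>
#include <stdint.h>

/* Priority class of a vector: its upper 4 bits. */
static unsigned vector_priority_class(uint8_t vector)
{
    assert(vector >= 0x10); /* vectors 0-15 are not valid for APIC interrupts */
    return vector >> 4;
}

/* Of two pending vectors, return the one the local APIC services first. */
static uint8_t serviced_first(uint8_t a, uint8_t b)
{
    return a > b ? a : b;
}
```

In the dump from the first post the legacy entries use vectors 0x30 to 0x3f, which all fall in class 3, and the timer's 0x30 is the lowest of them.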
wangt13 wrote:With that I may get a better/correct understanding of IRQ Balance mechanism in Linux + X86.
For IRQ balancing it doesn't matter how any numbering scheme is related to any other numbering scheme. It only matters what the OS does with IO APIC's redirection table entries.

For IRQ balancing it's best to start by looking at what 80x86 is capable of, and then after that look at how Linux did everything wrong.

What 80x86 (and IO APIC) is capable of is:
  • Automatically sending an IRQ to whichever CPU is doing the least important work (which can be used to automatically balance IRQs if "servicing an IRQ" is considered more important work than not servicing an IRQ)
  • Automatically sending an IRQ to whichever CPU within a certain group of CPUs is doing the least important work. Specifically; an OS can set up a group of CPUs for each NUMA domain, so that an IRQ from a device in NUMA domain #1 is sent to whichever CPU in NUMA domain #1 is doing the least important work
  • For some CPUs (older P6); automatically sending an IRQ to a CPU that is already executing the interrupt service routine for that interrupt; to improve cache efficiency.
  • Always sending an IRQ to the same CPU
Note that because this is built into the hardware, things like automatically sending an IRQ to whichever CPU is doing the least important work are able to adapt extremely quickly to changes in conditions. It's also relatively powerful - for example, if you know what you're doing you can use the APIC's features to make sure that IRQs don't wake CPUs out of power saving states (which hurts power consumption and latency) when there are alternatives.
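
As an illustration of the first two items in the list, a redirection table entry that asks for "lowest priority" delivery to a group of CPUs could be built roughly as in the C sketch below. The function name is hypothetical. The sketch assumes the standard entry layout (delivery mode 1 = lowest priority, bit 11 = logical destination mode) and treats the destination field as a bitmask of CPUs, which only holds for the flat logical destination model.

```c
#include <assert.h>
#include <stdint.h>

enum { DELIVERY_FIXED = 0, DELIVERY_LOWEST_PRIORITY = 1 };

/* Build an unmasked redirection entry that uses lowest priority delivery
   to the CPUs named in logical_dest_mask (flat logical model assumed). */
static uint64_t make_lowest_priority_entry(uint8_t vector,
                                           uint8_t logical_dest_mask,
                                           int level_triggered,
                                           int active_low)
{
    uint64_t e;

    assert(vector >= 0x10);              /* vectors 0-15 are not usable */
    e  = vector;                                   /* bits 0-7 */
    e |= (uint64_t)DELIVERY_LOWEST_PRIORITY << 8;  /* bits 8-10 */
    e |= 1ULL << 11;                               /* logical destination */
    if (active_low)
        e |= 1ULL << 13;                           /* pin polarity */
    if (level_triggered)
        e |= 1ULL << 15;                           /* trigger mode */
    e |= (uint64_t)logical_dest_mask << 56;        /* bits 56-63 */
    return e;
}
```

Calling it with vector 0x31 and mask 0x03 (edge triggered, active high) produces 0x0300000000000931, which is the same raw value the dump in the first post shows for Redir[12].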

For Linux, the default behaviour is to ignore all of the hardware's features and always send an IRQ to the same CPU. Then (to work around the fact that it's incredibly idiotic) people wrote user-space tools/daemons that regularly poll for information from the kernel (e.g. how often each IRQ has occurred) and ask the kernel to make changes by reprogramming the IO APIC, to regularly change which CPU will be the only CPU that each IRQ will be sent to for the next period of time. Of course this wastes a lot of time polling for information (which is extremely bad in some cases - e.g. when the computer is idle a CPU has to be woken up just to poll when nothing changed) and wastes a lot of time repeatedly reconfiguring the IO APIC, and doesn't adapt to changes in conditions very rapidly. To make this worse, it's left up to the end user to configure it all (because doctors and janitors and secretaries and teenagers browsing facebook are all far more qualified to make these decisions than Linux kernel developers), and the user-space tools/daemons have a large number of complicated options to make sure that almost all users are so confused that it's impossible for anyone to configure it properly (if any of them ever realise it actually exists in the first place).


Cheers,

Brendan

Re: Strange IO-APIC register values.

Posted: Mon Nov 20, 2017 10:07 am
by Schol-R-LEA
Brendan wrote:For Linux, the default behaviour is to ignore all of the hardware's features and always send an IRQ to the same CPU.
As an aside, to the best of anyone's knowledge has there ever been any discussion or public statements by Torvalds et al. on why this is the default? I may dive into the kernel mailing list to see, but I thought I would ask here first, especially since my own understanding of the APIC and the issues involved with them is limited.

In the absence of other information, my guess is that it has to do with compatibility - specifically, compatibility with non-Intel CPUs (whether it is another x86 model, such as the AMD chips or even something much older such as the Cyrix chips, or completely different designs such as ARM or SPARC), or across different models of CPU (you mentioned at least one instance where this is the case, and I wouldn't be surprised if there were others).

However, this leads me to think it is a matter of inadequate configuration control, either at installation or at boot-up. Even if it is a reasonable or even necessary default, the kernel design should be capable of determining when a more effective solution is possible.

Another possibility - which might be unrelated to the previous one, or could be tied to the configuration issue - is that it conflicted with some other design choice somehow. If there was some design decision that somehow blocked them from using the hardware IRQ balancing (or they thought it did) which they could not or would not back out of, then they may have simply left it as it was.

Or it could just be, you know, stupidity. Or laziness. Or pigheadedness, something Linus has in abundance - I can see him simply deciding one day that he didn't like the idea of hardware IRQ balancing, and that was that. Or internal politics among the kernel group, something that we've seen often enough too. Or maybe it's something they've scheduled to discuss 'later' and set to a 'wash the dog first' priority for some reason - again, not an unusual situation with the Linux kernel team. Or maybe it just never came up.

Like I said, I might dig into this a bit more, just because it's something that could use an explanation, and might be worth understanding for other OS devs. I just thought I would ask if anyone else here already knew anything more about the reasoning (or lack thereof) behind this seemingly bone-headed choice on the Linux kernel devs' part.

This is something I think worth investigating, regardless of whether their reasoning is something which otherwise sensible people could disagree on - or perhaps especially if it isn't, as mistakes often tell us more than successes do. The Lessons Learned aspect is justification enough for finding out more.

Re: Strange IO-APIC register values.

Posted: Mon Nov 20, 2017 11:36 am
by Schol-R-LEA
Huh. They had kernel IRQ balancing... and then removed it from the 32-bit version of the kernel in 2008.

Huh.

Gonna keep going, I suspect there's something interesting going on with that.

Re: Strange IO-APIC register values.

Posted: Mon Nov 20, 2017 7:29 pm
by Brendan
Hi,
Schol-R-LEA wrote:Huh. They had kernel IRQ balancing... and then removed it from the 32-bit version of the kernel in 2008.
That's interesting; but you can see from this patch that their kernel IRQ balancing uses the same "ignore the hardware's features, periodically poll, regularly reprogram the IO APIC, and be too slow to adapt" method that the current user-space approach uses. In other words, it wasn't very good to begin with, and was then probably left to rot (e.g. not updated to support NUMA properly, etc).
Schol-R-LEA wrote:As an aside, to the best of anyone's knowledge has there ever been any discussion or public statements by Torvalds et al. on why this is the default? I may dive into the kernel mailing list to see, but I thought I would ask here first, especially since my own understanding of the APIC and the issues involved with them is limited.

In the absence of other information, my guess is that it has to do with compatibility - specifically, compatibility with non-Intel CPUs (whether it is another x86 model, such as the AMD chips or even something much older such as the Cyrix chips, or completely different designs such as ARM or SPARC), or across different models of CPU (you mentioned at least one instance where this is the case, and I wouldn't be surprised if there were others).
In the absence of other information, I'd expect that there are multiple small reasons rather than one big reason.

a) Linux is Unix, which originally didn't have any concept of thread priorities. This led to a culture of focusing on throughput (getting the most work done without caring if it's unimportant work being done at the expense of important work) that has persisted ever since.

b) Linux gets IRQ priorities wrong, and is a monolithic kernel where IRQ handlers run on the same CPU that received the IRQ. This is likely to be the biggest reason they care about IRQ balancing (to dilute the consequences of failing to do IRQ priorities properly).

c) Linux scheduler makes the (incorrect for multiple reasons) assumption that all CPUs are the same. This means that if one CPU is handling lots of IRQs while another CPU isn't, processes running on one CPU will be slower than processes running on the other CPU, and the scheduler won't compensate for that to ensure tasks that were supposed to be given equal CPU time actually do get equal CPU time. This is likely to be the second biggest reason they care about IRQ balancing.

d) The hardware isn't designed for IRQ balancing - it's designed to send IRQs to the "best" CPU, where "best" is influenced by how the OS uses the local APIC's "task priority register" and how the OS configures logical destinations, and where IRQ balancing is more of an accidental side effect than a goal. Basically, the hardware is more suited to a well designed OS (that uses the "task priority register" properly, that has a more intelligent scheduler, etc) where there isn't much reason to care about IRQ balancing, and the hardware is less suited to Linux.
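
A deliberately simplified model of that "best CPU" choice, as a C sketch. The function name is made up, and the model is an assumption rather than a description of any specific chip: it looks only at each CPU's task priority register, compares priority classes (upper 4 bits), assumes CPU n owns bit n of a flat logical destination mask, and breaks ties by picking the lowest-numbered CPU. Real hardware also takes in-service interrupts into account and its tie-breaking is implementation specific.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the CPU in dest_mask whose task priority class is lowest.
   tpr[n] is the task priority register value of CPU n.
   Returns the CPU index, or -1 if dest_mask selects no CPU. */
static int pick_lowest_priority_cpu(const uint8_t *tpr, int cpu_count,
                                    uint8_t dest_mask)
{
    int best = -1;

    assert(cpu_count >= 0 && cpu_count <= 8); /* 8 bits in a flat mask */
    for (int cpu = 0; cpu < cpu_count; cpu++) {
        if (!(dest_mask & (1u << cpu)))
            continue;                         /* CPU not in the group */
        if (best < 0 || (tpr[cpu] >> 4) < (tpr[best] >> 4))
            best = cpu;
    }
    return best;
}
```

In this model an OS that raises the task priority register while a CPU runs important work, and lowers it while the CPU is idle or running low priority work, steers IRQs toward the second kind of CPU without reprogramming the IO APIC.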

e) Back when Linux was first started 80x86 only really had PIC chips, so the original code wouldn't/couldn't have been designed for anything modern (NUMA, the ability to change which CPU an IRQ is sent to, the ability to control IRQ priorities, etc). This makes it harder to add support for newer hardware features because there's always a risk of breaking existing/working code; so people tend to make small incremental changes and are reluctant to make significant changes (e.g. adding an "IRQ balancing" hack on top of code that never did IRQs properly, rather than completely redesigning and rewriting everything to do with the management of IRQs).

f) Relying on the hardware's features means that there's a slightly higher risk of being susceptible to hardware bugs/quirks; and Linux mostly lacks a common system for tracking/managing these kinds of problems (e.g. easily maintained list/s of CPU, chipset, motherboard and firmware characteristics as file/s the kernel reads during boot) and typically uses checks for work-arounds hard-coded directly into the kernel itself and/or end-user hassle (forget about "works out of the box" and let it blow up in the user's face, then expect the user to search the internet looking for an elusive combination of hackery involving kernel compile time configuration and/or kernel command line gibberish and/or other hidden knobs spread all over the file system).

g) Linux is portable, which means that using features that are provided on 80x86 but aren't provided on other architectures increases maintenance costs. For this reason there's a tendency towards "lowest common denominator" (not using special hardware features).


Cheers,

Brendan