IOAPIC and the LDR - interrupt being sent to multiple CPUs
I've added support for IOAPIC, MSI, and MSI-X interrupts, but I'm having an issue on certain hardware where an interrupt is being handled on multiple CPUs instead of just the lowest priority one. Here's my setup:
- During boot I set the logical destination register (LDR) of the boot APIC to be 0x01 << 24
- When configuring the IOAPIC, I set the appropriate redirection table entries to 0x0100000000000936, which should be lowest priority, logical mode.
- When I bring up all of the application processors, I set all of their LDRs to be 0x02 << 24
- Once all APs are up, I update the IOAPIC pins to be: 0x0300000000000936
- At boot, I set the TPR to 0 for all local APICs, and don't touch it again after.
From reading the Intel docs (Vol3A chapter 10) and Brendan's post http://forum.osdev.org/viewtopic.php?p=202258#p202258 on the subject, it was my understanding that the LDR could be used to set up groups of processors to receive IPIs or interrupts. In my case, I have one group that is just the boot processor, and another group that contains all of the APs. Then when sending a lowest priority interrupt using logical mode, setting the IOAPIC destination field to (0x1 | 0x2) would send the interrupt to all local APICs, and they would arbitrate amongst themselves to determine which CPU has the lowest priority and therefore handles the interrupt.
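To make the value 0x0300000000000936 concrete, here is a minimal sketch that packs and unpacks a redirection table entry using the field positions from the 82093AA datasheet. The struct and function names are made up for illustration and are not taken from the kernel discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type. Field positions follow the 82093AA
 * redirection table entry layout. */
struct rte_fields {
    uint8_t vector;        /* bits 7:0 */
    uint8_t delivery_mode; /* bits 10:8, 0 = fixed, 1 = lowest priority */
    uint8_t logical;       /* bit 11, 0 = physical, 1 = logical */
    uint8_t active_low;    /* bit 13 */
    uint8_t level;         /* bit 15, 0 = edge, 1 = level */
    uint8_t masked;        /* bit 16 */
    uint8_t destination;   /* bits 63:56 */
};

/* Build the 64-bit entry from its fields. */
static uint64_t rte_pack(struct rte_fields f)
{
    assert(f.delivery_mode <= 7);
    return (uint64_t)f.vector
         | ((uint64_t)(f.delivery_mode & 7) << 8)
         | ((uint64_t)(f.logical & 1) << 11)
         | ((uint64_t)(f.active_low & 1) << 13)
         | ((uint64_t)(f.level & 1) << 15)
         | ((uint64_t)(f.masked & 1) << 16)
         | ((uint64_t)f.destination << 56);
}

/* Split a 64-bit entry back into its fields. */
static struct rte_fields rte_unpack(uint64_t rte)
{
    struct rte_fields f;
    f.vector        = (uint8_t)(rte & 0xFF);
    f.delivery_mode = (uint8_t)((rte >> 8) & 7);
    f.logical       = (uint8_t)((rte >> 11) & 1);
    f.active_low    = (uint8_t)((rte >> 13) & 1);
    f.level         = (uint8_t)((rte >> 15) & 1);
    f.masked        = (uint8_t)((rte >> 16) & 1);
    f.destination   = (uint8_t)(rte >> 56);
    return f;
}
```

Unpacking 0x0300000000000936 this way gives vector 0x36, delivery mode 1 (lowest priority), logical destination mode, edge triggered, unmasked, destination 0x03.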
Now this seems to work on most of the systems I've tested on, including VirtualBox. Only a single CPU actually handles the interrupt, and it's sometimes the BSP, sometimes an AP. However, on two Xeon server boxes I've got, a single interrupt is actually being handled by all application processors (but not by the boot processor). So my PIT timer (and therefore time of day) is incrementing at a rather amusing rate, since it's being bumped 15 times per tick instead of once (16 core machine).
Is my understanding of how grouping processors with the LDR works correct? Are there differences in how lowest priority mode is used on different classes of CPUs? Any light you can shine on the subject would be most appreciated. Thanks!
**edit: Only on the machines I'm having difficulty with, I noticed that bits 48:55 of the redirection table entries are being set. It took some time to track down since these bits are in the reserved section according to the 82093AA spec, but the ICH9 spec (http://www.intel.com/content/dam/doc/da ... asheet.pdf, pg. 477) says they are:
"Extended Destination ID (EDID) — RO. These bits are sent to a local APIC only when in Processor System Bus mode. They become bits 11:4 of the address."
Now I'm even more confused... what address is this referring to?
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Hi,
LINT0 wrote: Are there differences in how lowest priority mode is used on different classes of CPUs?
Yes (maybe).
For larger systems (for an unknown definition of "larger") the chipset and APICs may expect to be configured as "cluster mode" and not "flat mode" (mostly in the local APIC's "Destination Format Register"). In this case half of the logical destination is used to select the cluster, and the other half is used to select CPUs within the cluster.
Note that a system with 16-cores is probably designed for "16-cores with hyper-threading = 32 logical CPUs" (even if/when the CPUs don't actually support hyper-threading). For x2APIC everything uses "cluster mode", and there's a maximum of 16 logical CPUs per cluster, which implies that (if x2APIC was being used, for a system designed for >= 32 logical CPUs) you'd have to have a minimum of 2 clusters and the chipset support for "cluster mode". When you're not using x2APIC (and are using xAPIC instead) it's possible (likely?) that the chipset still expects to be using "cluster mode" (because it's required for x2APIC).
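As a rough illustration of the difference between the two models, the sketch below shows how a local APIC would decide whether a logical destination matches it, following the xAPIC description in Intel's manual (DFR model field in bits 31:28, logical ID in LDR bits 31:24). The function and macro names are hypothetical, and this models only the matching step, not the lowest priority arbitration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DFR_FLAT    0xFFFFFFFFu /* model field (bits 31:28) = 1111b */
#define DFR_CLUSTER 0x0FFFFFFFu /* model field (bits 31:28) = 0000b */

/* dfr, ldr: register values of one local APIC.
 * mda: the 8-bit destination carried by the interrupt message. */
static bool logical_match(uint32_t dfr, uint32_t ldr, uint8_t mda)
{
    uint8_t id = (uint8_t)(ldr >> 24);

    if ((dfr >> 28) == 0xF)
        return (id & mda) != 0;      /* flat: plain bitmask overlap */

    assert((dfr >> 28) == 0x0);      /* only the two defined models */
    if (mda == 0xFF)
        return true;                 /* all ones = broadcast */

    /* cluster: high nibble must name this APIC's cluster,
     * low nibble is a bitmask of CPUs inside that cluster */
    return ((mda >> 4) == (id >> 4)) && ((mda & id & 0x0F) != 0);
}
```

For example, a destination of 0x03 matches an LDR of 0x02 << 24 in flat mode, while in cluster mode a destination of 0x13 only reaches APICs whose logical ID has 1 in the high nibble.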
If:
- the local APIC's "Destination Format Register" (in all CPUs) is configured as "cluster mode" but the chipset expects "flat mode", or
- the local APIC's "Destination Format Register" (in all CPUs) is configured as "flat mode" but the chipset expects "cluster mode", or
- the chipset expects "cluster mode" and the local APICs are correctly configured as "cluster mode"; but:
  - the IO APIC is sending to a cluster that doesn't exist (e.g. "destination cluster = 0x0300" where all local APICs think they're "cluster = 0x0100" or "cluster = 0x0200"), or
  - the "cluster ID" part of each CPU's "Destination Format Register" doesn't make sense with respect to the system's topology (e.g. different CPUs that are within the same cluster from the chipset's perspective have a different "cluster ID")
LINT0 wrote: **edit: Only on the machines I'm having difficulty with, I noticed that bits 48:55 of the redirection table entries is being set. Took some time to track it down since this is in the reserved section according to the 82093AA spec, but in the ICH9 spec (http://www.intel.com/content/dam/doc/da ... asheet.pdf, pg. 477), it says these are: "Extended Destination ID (EDID) — RO. These bits are sent to a local APIC only when in Processor System Bus mode. They become bits 11:4 of the address." Now I'm even more confused... what address is this referring to?
I suspect (based on information about interrupt remapping from Intel's "Intel Virtualization Technology for Directed I/O" specification combined with the absence of any information anywhere else I looked) that the "Extended Destination ID (EDID)" is only used for Itanium and isn't used for 80x86.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Thanks for the help. I am setting the local APIC DFR for flat mode, but hadn't considered clustering. The OS doesn't use/support x2APIC yet, and I've got it disabled in the BIOS, so only xAPIC should be in play. I'll play with the clustering mode though and see if that makes any difference.
And not that it makes any difference to your example, but I misspoke wrt the number of cores. I should have said 16 logical CPUs, as it's only an 8 core. I have run on a dual socket 8 core machine with 32 total logical, but I don't have that one available at the moment so I can't see if it behaves any differently. The other machine I've seen this on is an 8 core system with hyperthreading disabled in the BIOS, so I've only got 8 logical CPUs. But as you said, if it's still expecting clusters...
As for the EDID address, looking into it more, I think those bits are added into the address that gets written to the system bus in the same way an MSI message is. The Intel doc says those same bits in the MSI address are also reserved. Whether or not they're used for anything, it seems like you've found more than I have.
Thanks again. Will try setting up clustering mode and get back if I find something.
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
The Intel docs say that sending of lowest priority IPIs (and by extension, I would assume that includes I/O APIC and MSI/x interrupts) "is model specific and should be avoided by BIOS and operating system software."
Is there an MSR or CPUID flag that I can read that will tell me if the system supports lowest priority IPIs?
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Hi,
LINT0 wrote: The intel docs say that sending of lowest priority IPIs (and by extension, I would assume that includes I/O APIC and MSI/x interrupts) "is model specific and should be avoided by BIOS and operating system software."
Intel only ever say (e.g.) "The ability for a processor to send a lowest priority IPI is model specific and should be avoided by BIOS and operating system software." If they actually meant "to send and receive" then they'd be saying that the entire lowest priority delivery feature should never be used, and in that case there'd be no point providing the feature in any CPU and no point documenting it anywhere (they would've described it as "reserved" or "deprecated" instead). Also note that there isn't any "is model specific" note for lowest priority delivery in the MSI section (or in any IO APIC datasheet, or in their x2APIC specification, or in their "Virtualization Technology for Directed I/O" specification) - it's only in the "sending IPIs from CPUs" section and not in anything that describes "sending IPIs from hardware (to CPUs)".
LINT0 wrote: Is there an MSR or CPUID flag that I can read that will tell me if the system supports lowest priority IPIs?
As far as I know; there is no flag in CPUID or in any MSR to determine if a CPU can send lowest priority IPIs. There's also no other information anywhere else (e.g. in Intel's specification updates, on any web page online, etc) that says which CPU models do/don't support sending lowest priority IPIs.
I think you're going to need to try some experiments (things that shouldn't make any difference, but might). For a start, make sure you set the TPR on each CPU differently before any interrupt occurs (maybe the "chipset arbitrator" needs to be "reinformed" of CPU priorities). Also try setting the TPR for each CPU differently and see if that makes a difference (e.g. if the IRQ is being received by "all CPUs that are at the same lowest priority").
Also see what happens when the logical destination register is different for each CPU. Intel use "one bit in logical destination for each CPU" in their example, and while there's nothing in the docs to indicate that each CPU's logical destination needs to be different (and it goes against what is described in the docs) your "a single interrupt is actually being handled by all application processors (but not on the boot processor)" makes me suspicious (e.g. that could be interpreted as "a single interrupt is actually being handled by all CPUs that have the same logical destination").
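The "one bit in logical destination for each CPU" arrangement could be sketched like this. It is only an illustration with hypothetical function names, and it assumes flat mode, which leaves room for just 8 distinct bits (so it cannot cover a 16 CPU machine on its own).

```c
#include <assert.h>
#include <stdint.h>

#define FLAT_MODE_MAX_CPUS 8

/* LDR register value giving CPU number cpu_index (0..7) its own bit.
 * The logical ID lives in bits 31:24 of the LDR. */
static uint32_t ldr_for_cpu(unsigned cpu_index)
{
    assert(cpu_index < FLAT_MODE_MAX_CPUS);
    return (UINT32_C(1) << cpu_index) << 24;
}

/* 8-bit destination value that targets the first cpu_count CPUs. */
static uint8_t dest_mask_for_cpus(unsigned cpu_count)
{
    assert(cpu_count >= 1 && cpu_count <= FLAT_MODE_MAX_CPUS);
    return (uint8_t)((1u << cpu_count) - 1u);
}
```

With this scheme no two CPUs share a logical destination, which is exactly the condition the experiment is meant to isolate.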
Cheers,
Brendan
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Brendan wrote: Intel only ever say (e.g.) "The ability for a processor to send a lowest priority IPI is model specific and should be avoided by BIOS and operating system software.". If they actually meant "to send and receive" then they'd be saying that the entire lowest priority delivery feature should never be used, and in that case there'd be no point providing the feature in any CPU and no point documenting it anywhere (they would've described it as "reserved" or "deprecated" instead). Also note that there isn't any "is model specific" note for lowest priority delivery in the MSI section (or in any IO APIC datasheet, or in their x2APIC specification, or in their "Virtualization Technology for Directed I/O" specification) - it's only in the "sending IPIs from CPUs" section and not in anything that describes "sending IPIs from hardware (to CPUs)".
Thanks. I couldn't fathom receiving lowest priority interrupts from I/O APIC and MSI/x devices wouldn't work, but am clutching at straws.
Brendan wrote: I think you're going to need to try some experiments (things that shouldn't make any difference, but might). For a start, make sure you set the TPR on each CPU differently before any interrupt occurs (maybe the "chipset arbitrator" needs to be "reinformed" of CPU priorities). Also try setting the TPR for each CPU differently and see if that makes a difference (e.g. if the IRQ is being received by "all CPUs that are at the same lowest priority"). Also see what happens when the logical destination register is different for each CPU. Intel use "one bit in logical destination for each CPU" in their example, and while there's nothing in the docs to indicate that each CPU's logical destination needs to be different (and it goes against what is described in the docs) your "a single interrupt is actually being handled by all application processors (but not on the boot processor)" makes me suspicious (e.g. that could be interpreted as "a single interrupt is actually being handled by all CPUs that have the same logical destination").
I'll try tweaking the TPR to see what happens, but my understanding is that it's used only to block interrupts of a lower priority than a given threshold, not all interrupts.
I have set up several different processor "groups" using the LDR, and what I see is that on the Xeon machines, the processors that all share the same LDR "group" all handle the interrupt. For example:
- CPU 0 = LDR 0x1
- CPU 1-5 = LDR 0x4
- CPU 6-12 = LDR 0x10
- CPU 13-15 = LDR 0x80
I then program the IO redirection table such that the destination is 0x95 (0x1 | 0x4 | 0x10 | 0x80).
What I see then is that CPUs 1-5 all handle the interrupt. The rest do not. Again, this seems to be different on the Xeons (at least the ones I'm testing with), since it works correctly, or at least as expected, on the rest of my desktop hardware.
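For reference, here is a small model of what flat mode matching should do with these groups. It is a sketch with hypothetical names that only computes which CPUs are candidates for a given destination byte; the arbitration that is supposed to pick a single winner is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CPUS 16

/* Fill in the logical IDs used in the experiment above. */
static void fill_group_ldrs(uint8_t ldr_id[NUM_CPUS])
{
    assert(ldr_id != NULL);
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        if (cpu == 0)
            ldr_id[cpu] = 0x01;
        else if (cpu <= 5)
            ldr_id[cpu] = 0x04;
        else if (cpu <= 12)
            ldr_id[cpu] = 0x10;
        else
            ldr_id[cpu] = 0x80;
    }
}

/* Bit n of the result is set if CPU n's logical ID overlaps dest
 * (flat mode match rule). */
static uint16_t flat_candidates(const uint8_t ldr_id[NUM_CPUS], uint8_t dest)
{
    uint16_t set = 0;
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (ldr_id[cpu] & dest)
            set |= (uint16_t)(1u << cpu);
    return set;
}
```

With destination 0x95 every CPU is a candidate (0xFFFF), so by the documented model exactly one of the 16 should take the interrupt. The set that actually handles it on the Xeons, CPUs 1-5, is the candidate set for destination 0x04 alone (0x003E).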
*edit*
Another interesting tidbit... I haven't gotten all the way through the VT for Directed I/O doc yet, so I don't know if it's in any way relevant, but since you mentioned it, I'll bring it up.
According to the chipset datasheet, the extended destination ID bits (that I noticed were being set on these Xeon machines) "become bits 11:4 of the address." According to the VT for Directed I/O doc in section 5.1.2, bit 4 of the address specifies whether the interrupt request is in compatibility form or remappable form. So if there is an EDID that has its last bit on, I'm wondering if that would make the request a remappable request. I do see some IO APIC redir table entries with that bit on, but not on all interrupts, and not on the PIT interrupt which is the one I'm using to test with. I don't know if there's anything the OS would need to do to support the remappable format, or if the system will handle that internally. Regardless, just thought it was interesting.
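To show how those bits would line up, here is a sketch that assumes the I/O APIC's bus message uses the same address layout as an MSI address (0xFEE in bits 31:20, destination in 19:12, redirection hint in bit 3, destination mode in bit 2). That equivalence is a guess, not something the datasheet states outright, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: MSI-style address layout, with the EDID placed in
 * bits 11:4 as the ICH9 datasheet describes. */
static uint32_t fsb_interrupt_address(uint8_t dest, uint8_t edid,
                                      bool redirection_hint, bool logical)
{
    return UINT32_C(0xFEE00000)
         | ((uint32_t)dest << 12)
         | ((uint32_t)edid << 4)
         | ((uint32_t)redirection_hint << 3)
         | ((uint32_t)logical << 2);
}

/* VT-d section 5.1.2: address bit 4 selects compatibility (0)
 * or remappable (1) format. */
static bool looks_remappable(uint32_t address)
{
    assert((address >> 20) == 0xFEE); /* must be an interrupt address */
    return ((address >> 4) & 1u) != 0;
}
```

Under that assumption, an entry whose EDID has its lowest bit set would produce an address with bit 4 set, which is the case the question above is about.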
Thanks,
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Hi,
LINT0 wrote: I'll try tweaking the TPR to see what happens, but my understanding is that it's used only to block interrupts of a lower priority than a given threshold, not all interrupts.
TPR does 2 different (but related) things - it blocks lower priority interrupts, and it also affects the priority used for "send to lowest priority".
Note that by setting TPR to values in the range 0x00 to 0x1F you're not blocking any IRQs and only affecting the priority used for "send to lowest priority"; which is useful for various reasons. For a simple example, by setting TPR to a value from 0x00 to 0x1F during task switches (to reflect the thread's priority) you can make it so higher priority threads are less likely to be interrupted (and lower priority threads more likely to be interrupted) and improve performance for higher priority threads. For another example; when you put a CPU to sleep you can set TPR higher (e.g. 0x08 for HLT, 0x1F for the deepest sleep state, etc) to reduce the latency involved in taking it out of sleep when other CPUs are available. This is also why (AMD's) "mov cr8, ..." instruction is disappointing - for no known/obvious reason, CR8 is only 4 bits and doesn't set the "sub-priority" part of the TPR (the lowest 4 bits) so it's far less useful for these kinds of performance tweaks.
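A rough sketch of the task switch idea: map a scheduler priority onto the 0x00 to 0x1F range, and see what survives a round trip through CR8. The priority scale (0 = lowest, max_prio = highest) and the function names are assumptions for illustration; writing the result to the local APIC's TPR is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TPR_SOFT_MAX 0x1Fu /* highest value used for arbitration hints */

/* Scale a thread priority in 0..max_prio onto TPR 0x00..0x1F. */
static uint8_t tpr_for_thread(unsigned prio, unsigned max_prio)
{
    assert(max_prio > 0 && prio <= max_prio);
    return (uint8_t)((prio * TPR_SOFT_MAX) / max_prio);
}

/* CR8 holds only the priority class, i.e. TPR bits 7:4. */
static uint8_t cr8_from_tpr(uint8_t tpr)
{
    return (uint8_t)(tpr >> 4);
}

/* Writing CR8 sets TPR bits 7:4 and clears the sub-priority bits 3:0. */
static uint8_t tpr_from_cr8(uint8_t cr8)
{
    return (uint8_t)((cr8 & 0x0F) << 4);
}
```

Within 0x00 to 0x1F, CR8 can only express two values (0x00 and 0x10), which is why the memory mapped TPR is the more useful interface for this kind of tuning.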
LINT0 wrote: I have set up several different processor "groups" using the LDR, and what I see is that on the Xeon machines, the processors that all share the same LDR "group" all handle the interrupt. For example: CPU 0 = LDR 0x1; CPU 1-5 = LDR 0x4; CPU 6-12 = LDR 0x10; CPU 13-15 = LDR 0x80. I then program the IO redirection table such that the destination is 0x95 (0x1 | 0x4 | 0x10 | 0x80). What I see then is that CPUs 1-5 all handle the interrupt. The rest do not. Again, this seems to be different on the Xeons (at least the ones I'm testing with), since it works correctly, or at least as expected, on the rest of my desktop hardware.
I suspect that Intel's "System Programmer's Manual" is wrong (for at least some CPUs and/or some chipsets); and the real behaviour is "find the logical destination with the lowest priority; then send to all processors that share that same logical destination".
Note that if you're not changing TPR then CPUs 1 to 5 probably always win the lowest priority contest (or at least, they win when they aren't already handling another IRQ that temporarily raises the priority). Intel has a comment about that somewhere.
Cheers,
Brendan
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Just a couple additional data points...
- VMware also exhibits this behavior. Virtual Box doesn't.
- Interesting though it may be, the EDID/remappable stuff appears to be irrelevant. In vmware, those bits are not set and the problem persists.
- Initializing each local APIC's TPR to something unique (between 0x0 and 0x1F) has no effect. I'm not sure what you mean by "set the TPR on each CPU differently before any interrupt occurs" as by definition, I can't know when an interrupt will occur.
Also, with respect to determining whether or not a local APIC can send a lowest-priority IPI, it seems the only sure-fire way to know is to try, then check bit 4 of the ESR.
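The ESR check could look roughly like the sketch below. Bit 4 is the "Redirectable IPI" error bit and the ESR sits at offset 0x280 in the local APIC register page; the function names are hypothetical, and the caller is assumed to have just sent a lowest priority IPI and to pass a pointer to the mapped register page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APIC_ESR_OFFSET      0x280u
#define ESR_REDIRECTABLE_IPI (1u << 4)

/* Read the Error Status Register. The ESR is written first so the
 * APIC latches its current error state before the read. */
static uint32_t apic_read_esr(volatile uint32_t *apic)
{
    assert(apic != NULL);
    apic[APIC_ESR_OFFSET / 4] = 0;
    return apic[APIC_ESR_OFFSET / 4];
}

/* True if the ESR value reports a rejected lowest priority IPI. */
static bool esr_has_redirectable_ipi(uint32_t esr)
{
    return (esr & ESR_REDIRECTABLE_IPI) != 0;
}
```

Usage would be along the lines of: send the IPI, wait for delivery status to clear, then test `esr_has_redirectable_ipi(apic_read_esr(apic))`.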
Brendan wrote: I suspect that Intel's "System Programmer's Manual" is wrong (for at least some CPUs and/or some chipsets); and the real behaviour is "find the logical destination with the lowest priority; then send to all processors that share that same logical destination".
Well that would... suck. Linux reportedly uses lowest priority mode, so I'll dig through their code to see if I can spot them doing something different. I'm still not 100% convinced I haven't just screwed something up.
Brendan wrote: Note that if you're not changing TPR then CPUs 1 to 5 probably always win the lowest priority contest (or at least, they win when they aren't already handling another IRQ that temporarily raises the priority). Intel has a comment about that somewhere.
Even when CPUs 1-5 have a different TPR value, they all still handle the interrupt... Will keep digging...
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Hi,
LINT0 wrote: Even when CPUs 1-5 have a different TPR value, they all still handle the interrupt... Will keep digging...
That's...
I've been digging a little too. Originally when I was learning about all this stuff (>10 years ago now) I couldn't figure out when/if to use "flat mode" or "cluster mode" and had a conversation with someone (a Linux developer I think, but I can't remember) who mostly said "ignore cluster mode, it's only needed for massive servers". They were right at the time (as even dual-core was rare then); but it seems this has changed since and I've failed to notice(!).
More specifically; I found a "FORCE_APIC_CLUSTER_MODEL" flag and a "FORCE_APIC_PHYSICAL_DESTINATION_MODE" flag in the "fixed feature flags" in ACPI's FADT; where the descriptions are quite interesting:
ACPI's FORCE_APIC_CLUSTER_MODEL flag wrote: A one indicates that all local APICs must be configured for the cluster destination model when delivering interrupts in logical mode. If this bit is set, then logical mode interrupt delivery operation may be undefined until OSPM has moved all local APICs to the cluster model. Note that the cluster destination model doesn’t apply to Itanium™ Processor Family (IPF) local SAPICs. This bit is intended for xAPIC based machines that require the cluster destination model even when 8 or fewer local APICs are present in the machine.
ACPI's FORCE_APIC_PHYSICAL_DESTINATION_MODE flag wrote: A one indicates that all local xAPICs must be configured for physical destination mode. If this bit is set, interrupt delivery operation in logical destination mode is undefined. On machines that contain fewer than 8 local xAPICs or that do not use the xAPIC architecture, this bit is ignored.
From these descriptions (and a look at some Linux patches) it seems that when there are more than 8 CPUs or when FORCE_APIC_CLUSTER_MODEL is set you have to use "cluster mode"; and when FORCE_APIC_PHYSICAL_DESTINATION_MODE is set you can't use logical destinations at all.
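That rule of thumb can be written down as a small decision function. This is only a sketch: the enum and function names are made up, the bit positions (18 and 19) are the ones the ACPI specification lists for these two FADT fixed feature flags, and the "more than 8 CPUs" condition is the inference above rather than something ACPI spells out.

```c
#include <assert.h>
#include <stdint.h>

#define FADT_FORCE_APIC_CLUSTER_MODEL             (UINT32_C(1) << 18)
#define FADT_FORCE_APIC_PHYSICAL_DESTINATION_MODE (UINT32_C(1) << 19)

enum apic_dest_policy {
    DEST_LOGICAL_FLAT,
    DEST_LOGICAL_CLUSTER,
    DEST_PHYSICAL
};

/* fadt_flags: the 32-bit Flags field of the FADT.
 * cpu_count: number of local xAPICs in the machine. */
static enum apic_dest_policy choose_dest_policy(uint32_t fadt_flags,
                                                unsigned cpu_count)
{
    assert(cpu_count >= 1);

    /* ACPI says the physical mode flag is ignored with fewer than 8 xAPICs. */
    if ((fadt_flags & FADT_FORCE_APIC_PHYSICAL_DESTINATION_MODE) && cpu_count >= 8)
        return DEST_PHYSICAL;

    if ((fadt_flags & FADT_FORCE_APIC_CLUSTER_MODEL) || cpu_count > 8)
        return DEST_LOGICAL_CLUSTER;

    return DEST_LOGICAL_FLAT;
}
```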
Of course now I have no idea how you're supposed to use "cluster mode" (how you determine the correct "destination format register" for each cluster). I suspect that firmware does it for you in that case, but can't find a concrete specification saying so.
Cheers,
Brendan
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Brendan wrote: More specifically; I found a "FORCE_APIC_CLUSTER_MODEL" flag and a "FORCE_APIC_PHYSICAL_DESTINATION_MODE" flag in the "fixed feature flags" in ACPI's FADT; where the descriptions are quite interesting:
Nice find! Unfortunately, neither flag bit is set on my test systems...
Brendan wrote: From these descriptions (and a look at some Linux patches) it seems that when there are more than 8 CPUs or when FORCE_APIC_CLUSTER_MODEL is set you have to use "cluster mode"; and when FORCE_APIC_PHYSICAL_DESTINATION_MODE is set you can't use logical destinations at all.
Well, I've definitely got more than 8 CPUs. Guess I'm just gonna have to bite the bullet and figure out clustering.
Brendan wrote: Of course now I have no idea how you're supposed to use "cluster mode" (how you determine the correct "destination format register" for each cluster). I suspect that firmware does it for you in that case, but can't find a concrete specification saying so.
I'm in the same boat. From what I've seen so far, figuring out how the clusters are set up will come from some combination of the APIC ID (which encodes processor "positions"), and I *think* data from the ACPI SRAT table, but I could be mistaken. I think it's largely handled by the system as long as you know which APICs are part of which clusters.
Thanks again for all the help!
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Just a little follow up with what I discovered....
First off, I still haven't implemented a clustering model, so that's all still up in the air. I also can't say conclusively what my original problem was. Instead, I've bypassed the problem by reverting to physical destination mode.
What Linux does is... complicated. When I tested Linux on my 16 CPU machine, it was also using physical destination mode. They also send all interrupts to CPU 0, even after SMP has brought up all the other CPUs. To balance them, there is a user-space application called irqbalance that sets CPU affinities for each IRQ in the procfs. The kernel may then move them as requested. When I tried Linux on my 32 CPU system, despite having x2apic disabled in the BIOS, Linux configured and used x2apic, which of course requires clustering.
In my own testing, I've discovered that the I/O APIC spec and Intel manual are wrong. Or more accurately, the systems I've tested on don't conform to those specs. Using physical destination mode, I can target all 32 CPUs of my big machine, indicating that the system is taking the full 8 bits of the destination field rather than just the 4 the spec claims it will use. Also, looking at Linux, nowhere do they truncate an APIC ID to 4 bits when writing that field; they always just take the full 8-bit ID and write it directly into the destination field. In my testing, this seems to work for MSI, MSI-X and the I/O APIC. My attempts to get any info from anybody at Intel proved fruitless.
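For illustration, writing the full 8-bit APIC ID into the destination field of a redirection table entry might look like the sketch below. The field positions (vector in bits 7:0, delivery mode in bits 10:8, destination mode in bit 11, destination in bits 63:56) follow the I/O APIC redirection entry layout; the macro names and `ioapic_rte_physical` are made up for this example, and polarity, trigger mode and mask bits are simply left at 0 here.

```c
#include <stdint.h>

/* Hypothetical names for the redirection table entry fields. */
#define IOAPIC_RTE_DELIVERY_FIXED  (0ull << 8)   /* bits 10:8 = 000 */
#define IOAPIC_RTE_DEST_PHYSICAL   (0ull << 11)  /* bit 11 = 0      */
#define IOAPIC_RTE_DEST_SHIFT      56            /* bits 63:56      */

/* Build a fixed-delivery, physical-destination entry.
 * The APIC ID is NOT masked to 4 bits; all 8 bits are written. */
uint64_t ioapic_rte_physical(uint8_t vector, uint8_t apic_id)
{
    return (uint64_t)vector
         | IOAPIC_RTE_DELIVERY_FIXED
         | IOAPIC_RTE_DEST_PHYSICAL
         | ((uint64_t)apic_id << IOAPIC_RTE_DEST_SHIFT);
}
```

With vector 0x36 and APIC ID 0x1f this gives 0x1f00000000000036, where a 4-bit truncation would have targeted APIC ID 0x0f instead.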
So while it makes me nervous, and I can't guarantee that all machines I'll run across will behave this way, for now I'll just use physical destination mode. I've also implemented a way for the system administrator to redirect an IRQ to a different CPU should my basic round-robin IRQ->CPU allocation scheme prove deficient. When I do x2apic, I'll worry about clustering. Hopefully, lowest priority mode will then work and I can let the hardware worry about balancing IRQs. Until then, I'll stick with physical to be "compatible" with most machines. Of course, lowest priority does still work for all the non-Xeon, <=8 CPU systems I've tried (except for VMware), so feel free to use it if you're not going to run on big hardware or have special code to detect it.
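A round-robin IRQ->CPU allocation with an administrator override, as described above, could be sketched roughly as follows. This is not the actual implementation; the table size, the function names and the list of online APIC IDs are all assumptions for the example, and reprogramming the I/O APIC or MSI registers after a change is left out.

```c
#include <stdint.h>

#define MAX_IRQS 256                    /* assumed table size */

static uint8_t  irq_target[MAX_IRQS];   /* APIC ID currently chosen per IRQ */
static unsigned next_cpu;               /* index of the next CPU to hand out */

/* Give the IRQ to the next CPU in the list and advance the cursor.
 * apic_ids holds the APIC IDs of the online CPUs.
 * Returns the chosen APIC ID, or -1 on bad arguments. */
int irq_assign_round_robin(unsigned irq, const uint8_t *apic_ids, unsigned cpu_count)
{
    if (irq >= MAX_IRQS || cpu_count == 0)
        return -1;

    uint8_t id = apic_ids[next_cpu % cpu_count];
    next_cpu = (next_cpu + 1) % cpu_count;
    irq_target[irq] = id;
    return id;
}

/* Administrator override: pin an IRQ to a specific APIC ID. */
int irq_redirect(unsigned irq, uint8_t apic_id)
{
    if (irq >= MAX_IRQS)
        return -1;
    irq_target[irq] = apic_id;
    return 0;
}

/* Look up the APIC ID an IRQ is currently routed to, or -1 if out of range. */
int irq_get_target(unsigned irq)
{
    return irq < MAX_IRQS ? irq_target[irq] : -1;
}
```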
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Hi,
LINT0 wrote: What Linux does is... complicated. When I tested Linux on my 16 CPU machine, it was also using physical destination mode. They also send all interrupts to CPU 0, even after SMP has brought up all the other CPUs. To balance them, there is a user-space application called irqbalance that sets CPU affinities for each IRQ in the procfs. The kernel may then move them as requested. When I tried Linux on my 32 CPU system, despite having x2apic disabled in the BIOS, Linux configured and used x2apic, which of course requires clustering.
Yes, Linux is considerably poor at this and considerably poor at most things involving priorities (e.g. ensuring the most important work is being done, rather than just any work). Because they're considerably poor at everything else, any benefits from doing IRQ balancing well would be ruined by doing things like scheduling and power management badly, so (even with extensive testing, benchmarks, etc) they can't find a reason to care about doing IRQ balancing well. Mostly, they only do "very coarse grained IRQ balancing" that's unable to adapt quickly to rapidly changing conditions (e.g. the priority of the currently running thread).
LINT0 wrote: In my own testing, I've discovered that the I/O APIC spec and Intel manual are wrong. Or more accurately, the systems I've tested on don't conform to those specs. Using physical destination mode, I can target all 32 CPUs of my big machine, indicating that the system is taking the full 8 bits of the destination field rather than just the 4 it claims it will use. Also looking at Linux, nowhere do they truncate an APIC ID to 4 bits when writing that field; they always just take the full 8-bit ID and write it directly into the destination field. In my testing, this seems to work for MSI MSIx and the I/O APIC. My attempts to get any info from anybody at Intel proved fruitless.
According to the manuals; for physical destination the destination is 8-bit (for xAPIC) or 32-bit (for x2APIC); and the destination is only 4-bit when you're using logical destination (and not physical destination) and using xAPIC (and not x2APIC) and when you're using "cluster mode" (and not "flat mode").
Cheers,
Brendan
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
Brendan wrote: According to the manuals; for physical destination the destination is 8-bit (for xAPIC) or 32-bit (for x2APIC); and the destination is only 4-bit when you're using logical destination (and not physical destination) and using xAPIC (and not x2APIC) and when you're using "cluster mode" (and not "flat mode").
In the I/O APIC spec, it has this to say about the destination field: "If the Destination Mode of this entry is Physical Mode (bit 11=0), bits [59:56] contain an APIC ID. If Logical Mode is selected (bit 11=1), the Destination Field potentially defines a set of processors. Bits [63:56] of the Destination Field specify the logical destination address". I've also checked a couple of the Intel chipset datasheets and they say the same thing. Intel's programmer manual does say that all 8 bits are used for IPIs (Pentium 4/Xeon) and MSI/x though. Now I need to double check the I/O APIC interrupts on CPUs > 16...
Re: IOAPIC and the LDR - interrupt being sent to multiple CP
I recently ran into a similar situation where an interrupt was delivered to multiple processors. I fixed it by ensuring that each processor had a unique logical processor ID. I believe the usage model is to use the destination ID in the MSI interrupt to determine the potential recipients. Thus, for xAPIC in cluster mode you could put your BSP in its own cluster, or you could assign the BSP a logical ID of 1 and assign the other processors unique logical IDs of 2, 4, and 8. These would all be in cluster 0, and you could direct the interrupt to one of the "other" processors by using a destination ID of 0xe (i.e. cluster 0, matching any processor with one of bits 3:1 set).
In an attempt at clarity, you could set the BSP to have a logical ID of 0x28 and the other 3 processors to have logical IDs of 0x21, 0x22, and 0x24. The interrupt could then be sent to destination ID 0x27.
If you wanted to use flat mode, then you can have up to 8 processors with logical ids of 1, 2, 4, 8, 0x10, 0x20, 0x40, and 0x80. The destination id of the interrupt would determine the set of processors that is eligible to receive the interrupt.
I suspect that both our problems were that we expected to use the logical destination register to create the group rather than using the destination id associated with the interrupt to define the group.
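The matching rules described above can be written down as a small sketch. It only models which local APICs consider themselves addressed by a logical destination; it does not model lowest priority arbitration. The function names are invented for this example, and the cluster-model broadcast case (cluster value 0xf) is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flat model: the logical ID (LDR bits 31:24) and the destination are both
 * 8-bit bitmaps; the APIC is addressed if they share at least one set bit. */
bool apic_flat_match(uint8_t logical_id, uint8_t dest)
{
    return (logical_id & dest) != 0;
}

/* Cluster model: the high nibble is a cluster number that must be equal,
 * and the low nibble is a 4-bit bitmap that must overlap. */
bool apic_cluster_match(uint8_t logical_id, uint8_t dest)
{
    bool same_cluster = (logical_id >> 4) == (dest >> 4);
    return same_cluster && ((logical_id & dest & 0x0f) != 0);
}
```

With the cluster example above, destination 0x27 matches logical IDs 0x21, 0x22 and 0x24 but not the BSP's 0x28. In the flat model, the setup from the first post (every AP sharing logical ID 0x02, destination 0x03) makes every AP a match, and the APICs have no distinct bit to tell them apart.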