x2apic losing interrupt after setting ISR around SMI

xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

x2apic losing interrupt after setting ISR around SMI

Post by xeyes »

Title was updated to reflect new understanding of the actual issue, see posts below for details.



On my laptop there are Fn key combos that can be used to adjust the screen backlight.

These keys work even when running my kernel, which knows nothing about the backlight and doesn't handle any ACPI interrupts. I'm guessing that an SMI is fired when the key combos are pressed and the FW adjusts the backlight in SMM.

However, the EOI my timer handler sends near the end of its execution sometimes gets lost around this process. Namely, if I keep pressing the key combos to adjust brightness up or down I can get it into this state.

Why do I believe that the EOI is lost?

1. The action of sending EOI is hardcoded in the timer handler, no way around it.

2. While the core seems frozen, after sending it an NMI from a different core, I can see that it is executing the idle loop (and the IF bit is set), but the bits for the timer interrupt vector are high in both the ISR and the IRR so no more timer interrupts can go through to that core.

3. What's more, if the NMI handler sends an EOI after seeing that the timer interrupt vector's bit is set in the ISR, the frozen core will recover and go back to normal operation.
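
The check in points 2 and 3 can be sketched roughly as follows. This is a sketch under assumptions, not the actual kernel code: the `rdmsr_fn`/`wrmsr_fn` accessors and all function names are made up for illustration, and it assumes the stuck vector is the highest-priority bit in the ISR, since an EOI always clears that one. The MSR numbers follow the x2APIC rule from the Intel SDM (xAPIC offset shifted right by 4, plus 0x800).

```c
#include <stdint.h>

/* x2APIC register MSRs: (xAPIC MMIO offset >> 4) + 0x800. */
#define X2APIC_MSR_EOI   0x80Bu  /* xAPIC offset 0x0B0 */
#define X2APIC_MSR_ISR0  0x810u  /* xAPIC offset 0x100, 8 MSRs x 32 bits */
#define X2APIC_MSR_IRR0  0x820u  /* xAPIC offset 0x200, 8 MSRs x 32 bits */

/* Hypothetical MSR accessors; a real kernel would wrap rdmsr/wrmsr here. */
typedef uint64_t (*rdmsr_fn)(uint32_t msr);
typedef void (*wrmsr_fn)(uint32_t msr, uint64_t value);

/* MSR that holds the ISR (or IRR) bit for a vector, given the bank base. */
static inline uint32_t apic_bank_msr(uint32_t bank_base, uint8_t vector)
{
    return bank_base + (vector / 32u);
}

/* Bit mask of the vector inside that MSR. */
static inline uint32_t apic_bank_mask(uint8_t vector)
{
    return 1u << (vector % 32u);
}

/* Returns 1 if the vector is marked in-service, 0 otherwise. */
static inline int vector_in_service(rdmsr_fn rd, uint8_t vector)
{
    uint64_t isr = rd(apic_bank_msr(X2APIC_MSR_ISR0, vector));
    return (isr & apic_bank_mask(vector)) != 0;
}

/* Called from the NMI handler: send an EOI only if the vector is stuck
 * in the ISR. Assumes this vector is the highest-priority in-service one.
 * Returns 1 if an EOI was written, 0 otherwise. */
static inline int nmi_unstick_vector(rdmsr_fn rd, wrmsr_fn wr, uint8_t vector)
{
    if (!vector_in_service(rd, vector))
        return 0;
    wr(X2APIC_MSR_EOI, 0);
    return 1;
}
```

For example, vector 0x75 lives in bit 21 of MSR 0x813 (ISR) and MSR 0x823 (IRR).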

I'm not familiar with SMI/SMM and didn't find anything about SMI/SMM eating EOIs in the manual. Any pointers on where to look further?
Last edited by xeyes on Tue Jul 26, 2022 11:54 pm, edited 2 times in total.
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: EOI lost (around SMI?)

Post by nullplan »

EOI to the PIC is a single command, right? It is unlikely that SMM would interfere with that.

When an SMI happens, the current state of execution is serialized (with the notable exception of the NMI gate), and when SMM is done executing, it uses the RSM instruction to return to the OS already running. It is of course possible that the SMM handler corrupts the serialized execution state. However, in that case most OSes would have a problem. But Windows would not, on account of it taking control of ACPI on startup.

It is possible that the SMM handler isn't tested very well. In that case you may get out of the problem by writing an ACPI driver. Have fun doing that!

Another possibility is that the PIC in your system doesn't work very well with multiple cores. In that case you may get out of it by writing an APIC driver. That is way easier than a full-fledged ACPI driver, you only need to read the static tables, no AML.

A third possibility is that you have a race condition around interrupt delivery and halting. No way to know without reading your source.
Carpe diem!
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: EOI lost (around SMI?)

Post by xeyes »

nullplan wrote:EOI to the PIC is a single command, right? It is unlikely that SMM would interfere with that.
I'm not saying that SMM is interfering, but if I don't adjust the backlight brightness this doesn't happen at all, so SMM/SMI seems related.
nullplan wrote: Another possibility is that the PIC in your system doesn't work very well with multiple cores. In that case you may get out of it by writing an APIC driver. That is way easier than a full-fledged ACPI driver, you only need to read the static tables, no AML.
Did more experiments as below, x2apic is a smoking gun, but the whole thing is still a mystery.

a. Switching to xapic instead of x2 makes the issue go away, even if I try very hard at adjusting the backlight.

b. DMAR table is not opting out of x2.

c. setting up the interrupt remappers makes things worse, as in much easier to get into frozen state using backlight adjust key combos.

d. using high priority vector in the 0xF* range for timer interrupt makes things worse, as in all cores freeze together instead of just core 0.

e. 32b Linux does not use x2 on this machine, whether or not ACPI is turned off, and it doesn't have this problem.

f. 64b Linux uses x2 when it sees the DMAR table, it also sets up the interrupt remappers, and of course(?) it also does not have this problem either. One difference I saw from dmesg is that Linux uses clustered mode and I'm using physical dest mode.


Seems that x2 is not happy with my setup and sometimes gets confused by the events around SMM and my timer handler.

nullplan wrote: A third possibility is that you have a race condition around interrupt delivery and halting. No way to know without reading your source.
Handler and idle loop code are both of the garden variety.

timer handler C code:

Code: Select all

void handler(...)
{
    // bunch of house keeping stuff, doesn't IRET or call anything that won't return.

    send_eoi(); // writes 0 to the EOI register at offset B0

    // scheduler may not return
    if (time_to_schedule)
        choose_next_task(...);

    // NOTE: I thought the scheduler might be too slow? 
    // but moving send_eoi() to points closer to various irets didn't help either. 
}
Its ASM helper pushes registers, calls the C function, pops the registers on return and then irets.


idle loop C code:

Code: Select all

do
{
    asm volatile("hlt");
}while(not_done);
What sort of race are you envisioning though?
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: x2apic loses EOI (around SMI?)

Post by nullplan »

xeyes wrote:What sort of race are you envisioning though?
The classic problem is to have an interrupt between the last time you check that no interrupt occurred and actually halting. In that case, the interrupt will not wake up the CPU, as the halt hasn't started yet, and the CPU will appear to be frozen. However, your architecture does not appear to suffer from that problem, it seems pretty solid as far as I can tell.
Carpe diem!
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Re: EOI lost (around SMI?)

Post by Octocontrabass »

xeyes wrote:f. 64b Linux uses x2 when it sees the DMAR table, it also sets up the interrupt remappers, and of course(?) it also does not have this problem either. One difference I saw from dmesg is that Linux uses clustered mode and I'm using physical dest mode.
Does the FADT say you must use clustered mode?
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: x2apic loses EOI (around SMI?)

Post by xeyes »

nullplan wrote:
xeyes wrote:What sort of race are you envisioning though?
The classic problem is to have an interrupt between the last time you check that no interrupt occurred and actually halting. In that case, the interrupt will not wake up the CPU, as the halt hasn't started yet, and the CPU will appear to be frozen. However, your architecture does not appear to suffer from that problem, it seems pretty solid as far as I can tell.
Ah, racing with the wake up event is a classic way to get stuck. It probably can't happen here as the idle loop can't be blocked.

This does make me wonder whether the local APIC lost the EOI or the interrupt itself. Per the manual, the APIC sets the ISR bit before it dispatches the interrupt to the core, so the sequence is not exactly atomic; maybe the APIC forgets to actually dispatch it after an SMM session?

However, the local APIC is probably part of the core, so it sounds super unlikely for it to have such an obvious bug.
Octocontrabass wrote:
xeyes wrote:f. 64b Linux uses x2 when it sees the DMAR table, it also sets up the interrupt remappers, and of course(?) it also does not have this problem either. One difference I saw from dmesg is that Linux uses clustered mode and I'm using physical dest mode.
Does the FADT say you must use clustered mode?
Bit 18 of the feature flags? It's not set. I think it means that when using logical mode you must not use the flat model, and it doesn't preclude the use of physical mode?
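
A small sketch of how the two relevant FADT flag bits could be decoded. The bit positions (18 for FORCE_APIC_CLUSTER_MODEL, 19 for FORCE_APIC_PHYSICAL_DESTINATION_MODE) come from the ACPI specification; the enum, the function name, and the choice to let bit 19 win when both bits are set are assumptions made for illustration.

```c
#include <stdint.h>

/* FADT "Flags" bits from the ACPI specification (Fixed Feature Flags). */
#define FADT_FORCE_APIC_CLUSTER_MODEL  (1u << 18)
#define FADT_FORCE_APIC_PHYSICAL_DEST  (1u << 19)

/* Hypothetical summary of what the firmware asks for. */
enum apic_dest_policy {
    DEST_NO_CONSTRAINT,      /* neither bit set: the OS may choose */
    DEST_CLUSTER_REQUIRED,   /* logical mode must use the cluster model */
    DEST_PHYSICAL_REQUIRED   /* physical destination mode must be used */
};

/* Decode the FADT flags. Precedence of bit 19 over bit 18 is an assumption. */
static inline enum apic_dest_policy fadt_dest_policy(uint32_t fadt_flags)
{
    if (fadt_flags & FADT_FORCE_APIC_PHYSICAL_DEST)
        return DEST_PHYSICAL_REQUIRED;
    if (fadt_flags & FADT_FORCE_APIC_CLUSTER_MODEL)
        return DEST_CLUSTER_REQUIRED;
    return DEST_NO_CONSTRAINT;
}
```

With bit 18 clear, as on this machine, the result is `DEST_NO_CONSTRAINT`, so physical destination mode should be allowed as far as the FADT is concerned.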
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: x2apic losing interrupt after setting ISR around SMI

Post by xeyes »

Tried a few more things.

1. What's lost is likely the interrupt itself, not EOI.

Found out by switching to TSC deadline mode of the timer. In this mode, the NMI needs to not only EOI but also re-arm the timer for the timer handler to recover. Thus it is likely that the ISR bit is set but the timer handler wasn't invoked. Otherwise the timer handler would have re-armed the timer already.


2. The threshold for the high priority vector is 0x76 but it doesn't make sense.

Vector 0x75 or below causes core 0 to lose interrupts, vector 0x76 or above causes all cores to freeze when the keys are pressed. My kernel doesn't use anything near these numbers so can't think of any particular reasons for these 2 numbers to be special.


3. The issue doesn't seem related to the timer or clustered mode (or not) either.

Using HPET to interrupt the timer vector causes the exact same problem.
Switching to clustered logical mode didn't help either, with or without interrupt remapping.
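
The reasoning in point 1 rests on how the LVT timer modes differ, which the sketch below tries to make concrete. In periodic mode the APIC fires again on its own, so an EOI alone is enough to recover. In TSC-deadline mode the timer fires once and disarms itself, so only code that writes a new deadline (normally the timer handler) keeps it going. The LVT bit layout and the MSR numbers are from the Intel SDM; the helper names are hypothetical.

```c
#include <stdint.h>

#define X2APIC_MSR_LVT_TIMER   0x832u  /* xAPIC offset 0x320 */
#define MSR_IA32_TSC_DEADLINE  0x6E0u  /* write non-zero to arm */

#define LVT_MASKED             (1u << 16)
#define LVT_TIMER_ONESHOT      (0u << 17)
#define LVT_TIMER_PERIODIC     (1u << 17)
#define LVT_TIMER_TSC_DEADLINE (2u << 17)

/* Compose an LVT timer register value: vector in bits 7:0,
 * mask in bit 16, timer mode in bits 18:17. */
static inline uint32_t lvt_timer_value(uint8_t vector, uint32_t mode, int masked)
{
    return (uint32_t)vector | mode | (masked ? LVT_MASKED : 0u);
}

/* Value to write to IA32_TSC_DEADLINE to re-arm the timer. The hardware
 * fires once when the TSC reaches the deadline and then disarms itself,
 * so this has to be computed and written again after every interrupt. */
static inline uint64_t next_tsc_deadline(uint64_t tsc_now, uint64_t ticks_per_period)
{
    return tsc_now + ticks_per_period;
}
```

If the handler had run, it would have written a new deadline already; needing the NMI to re-arm suggests the handler never ran for that interrupt.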


Then I decided to give ACPICA a try. It more or less works, printing things like "Transition to ACPI mode successful" and can shutdown the computer.

But it made the Fn key combos ineffective (can't change brightness anymore), maybe the SMI isn't happening anymore, or maybe bios no longer does real work even if it gets SMI.


Maybe there is just some incompatibility between x2apic, the display driver in FW (the key combo to switch to an external monitor also works without ACPICA, and can also result in a core 0 freeze), and how I'm setting something up :? :?
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: x2apic losing interrupt after setting ISR around SMI

Post by nullplan »

xeyes wrote:But it made the Fn key combos ineffective (can't change brightness anymore), maybe the SMI isn't happening anymore, or maybe bios no longer does real work even if it gets SMI.
Well, obviously. You switched to ACPI mode, so you told the firmware that you now want to handle the button presses. This likely made the firmware change those interrupts from SMI to NMI or normal event. You now have to check ACPI for the event block that tells you how to tell if one of these buttons was pressed, and check it for the backlight device to tell how to set the brightness. And you need to connect the two things yourself (i.e. handle the "brightness up" button press event in a way that leads to an increased backlight brightness). In Linux, this goes all the way to userspace. The ACPI event generates a message to a certain netlink group, which something like "acpid" will catch and handle in a user-defined way, typically with a shell script that reads out the brightness setting and increases it by some amount.
Carpe diem!
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Re: x2apic losing interrupt after setting ISR around SMI

Post by Octocontrabass »

xeyes wrote:2. The threshold for the high priority vector is 0x76 but it doesn't make sense.

Vector 0x75 or below causes core 0 to lose interrupts, vector 0x76 or above causes all cores to freeze when the keys are pressed. My kernel doesn't use anything near these numbers so can't think of any particular reasons for these 2 numbers to be special.
Those are the default vectors for ISA IRQ13 and IRQ14. Did you relocate the legacy PICs to different vectors, or just mask them?

ACPI has a _PIC method you might need to use.
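
To spell out the mapping mentioned above: with the conventional PC firmware setup, the master 8259 delivers IRQ0-7 on vectors 0x08-0x0F and the slave delivers IRQ8-15 on vectors 0x70-0x77. A minimal sketch of that arithmetic, with made-up function names:

```c
#include <stdint.h>

/* Default (firmware) vector for a legacy ISA IRQ, or -1 if out of range.
 * Master 8259: IRQ0-7  -> 0x08-0x0F
 * Slave 8259:  IRQ8-15 -> 0x70-0x77 */
static inline int legacy_pic_default_vector(unsigned irq)
{
    if (irq < 8)
        return 0x08 + (int)irq;
    if (irq < 16)
        return 0x70 + (int)(irq - 8);
    return -1;
}

/* Reverse lookup: which legacy IRQ uses this vector by default, or -1. */
static inline int legacy_pic_default_irq(uint8_t vector)
{
    if (vector >= 0x08 && vector <= 0x0F)
        return vector - 0x08;
    if (vector >= 0x70 && vector <= 0x77)
        return 8 + (vector - 0x70);
    return -1;
}
```

So 0x75 and 0x76 correspond to IRQ13 and IRQ14 unless the PICs have been remapped.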
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: x2apic losing interrupt after setting ISR around SMI

Post by xeyes »

Octocontrabass wrote:
xeyes wrote:2. The threshold for the high priority vector is 0x76 but it doesn't make sense.

Vector 0x75 or below causes core 0 to lose interrupts, vector 0x76 or above causes all cores to freeze when the keys are pressed. My kernel doesn't use anything near these numbers so can't think of any particular reasons for these 2 numbers to be special.
Those are the default vectors for ISA IRQ13 and IRQ14. Did you relocate the legacy PICs to different vectors, or just mask them?

ACPI has a _PIC method you might need to use.
They are remapped away. The 2 vectors themselves are not special; others below or above them cause the same issue as well.

What seems special is vector 75.8, or the gap between the two, which causes the machine to behave very differently once crossed :(


_PIC didn't seem to change how the interrupts work either: interrupts work with or without it, and the interrupt that is lost is still lost. I get that it only sets 1 flag in the AML space? It didn't seem to write any port/register/address or talk to the EC, AFAIK, judging by what ACPICA is doing.


Also noticed that after the detour to ACPI and _BCM, the issue itself is still there: when I call _BCM, I can also cause x2apic to lose an interrupt, just like the BIOS did. _BCM seems to only issue 2 IO port writes (always to the same port, using the same value, regardless of level setting).

So I'm now confused about not only how _BCM caused the interrupt to be lost but also how it works. Maybe the writes are doorbells that wake up the FW to look at some temporary values stored in AML space, before the FW goes on to talk to the backlight/GPU the same way as in an SMI?
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Re: x2apic losing interrupt after setting ISR around SMI

Post by Octocontrabass »

xeyes wrote:_PIC didn't seem to change either how the interrupts work, interrupts are working with or without it, interrupt that is lost is still lost. I get that it only sets 1 flag in the AML space?
In theory, the firmware running in SMM could read that flag. Looks like it isn't doing that here, or at least not in a way that would fix the problem.
xeyes wrote:_BCM seems to only issue 2 IO port writes (always to the same port, using the same value, regardless of level setting).
Is it an Intel chipset? Is it port 0xB2? It sounds like writing that port triggers SMI.
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: x2apic losing interrupt after setting ISR around SMI

Post by xeyes »

Octocontrabass wrote:
xeyes wrote:_PIC didn't seem to change either how the interrupts work, interrupts are working with or without it, interrupt that is lost is still lost. I get that it only sets 1 flag in the AML space?
In theory, the firmware running in SMM could read that flag. Looks like it isn't doing that here, or at least not in a way that would fix the problem.
xeyes wrote:_BCM seems to only issue 2 IO port writes (always to the same port, using the same value, regardless of level setting).
Is it an Intel chipset? Is it port 0xB2? It sounds like writing that port triggers SMI.
Wow that's a very accurate guess! It's a 7 series (ivy bridge) chipset and writes F5 to B2 during _BCM. Does this point to anything though?

I added experimental support for long mode (proudly supporting 4GB linear and 4GB physical address space) as it seems odd that 32bit Linux doesn't enable x2apic on this machine. But again it didn't seem to change anything: both the backlight-adjusting SMI and _BCM still have a high chance of sending core 0 into the "ISR bit set but interrupt handler didn't run" state.

:( Running out of ideas here, maybe I should just use another core as a watchdog to nudge core 0 as needed.
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Re: x2apic losing interrupt after setting ISR around SMI

Post by Octocontrabass »

xeyes wrote:Wow that's a very accurate guess! It's a 7 series (ivy bridge) chipset and writes F5 to B2 during _BCM. Does this point to anything though?
It confirms that SMI is responsible for the lost interrupt.
xeyes wrote: :( Running out of ideas here, maybe I should just use another core as a watchdog to nudge core 0 as needed.
There must be something you're doing that's different from what Linux does; otherwise Linux would have the same issue. Which timer does Linux use? How is it configured?
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: x2apic losing interrupt after setting ISR around SMI

Post by xeyes »

Octocontrabass wrote:
xeyes wrote:Wow that's a very accurate guess! It's a 7 series (ivy bridge) chipset and writes F5 to B2 during _BCM. Does this point to anything though?
It confirms that SMI is responsible for the lost interrupt.
xeyes wrote: :( Running out of ideas here, maybe I should just use another core as a watchdog to nudge core 0 as needed.
There must be something you're doing that's different from what Linux does; otherwise Linux would have the same issue. Which timer does Linux use? How is it configured?
:lol: I'm sure that there must be many things that are set up differently. Don't know a good way to tell on real hardware, but I've seen Linux using the tsc deadline mode of the apic timer in virtual machines.

In this case though, the issue is not specific to a timer or timers but affects interrupts in general. I tried HPET previously and its interrupts can also get lost. I even coerced HDA into sending periodic interrupts to trigger the timer interrupt handler, and again its interrupts face the same issue. So there is nothing special about timer interrupts; they just happen frequently enough to be the easiest to affect and notice.
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Re: x2apic losing interrupt after setting ISR around SMI

Post by Octocontrabass »

If the problem is indeed the APIC configuration, you can use something like msr-tools in Linux to compare the x2APIC registers against your OS.

I can't imagine what else it could be if it's not the APIC.
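
If msr-tools isn't handy, the same registers can be read directly through the Linux msr driver, where the file offset into `/dev/cpu/N/msr` is the MSR index. The sketch below is only an illustration under assumptions: it needs root and the `msr` module loaded, the function names are made up, and the register list is just a starting selection of readable x2APIC MSRs (write-only ones such as EOI at 0x80B are left out).

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR through the Linux msr driver; the file offset is the MSR
 * index. Returns 0 on success, -1 on any failure (no driver, no
 * permission, or an MSR the CPU refuses to read). */
static inline int read_msr_at(const char *path, uint32_t msr, uint64_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, (off_t)msr, SEEK_SET) != (off_t)-1 &&
             read(fd, out, sizeof(*out)) == (ssize_t)sizeof(*out);
    close(fd);
    return ok ? 0 : -1;
}

/* Print a selection of APIC-related MSRs for one CPU, for diffing against
 * the values the hobby kernel programs. */
static inline void dump_x2apic(unsigned cpu)
{
    static const uint32_t regs[] = {
        0x01B,  /* IA32_APIC_BASE (bit 10: x2APIC enable) */
        0x802,  /* APIC ID */
        0x803,  /* version */
        0x808,  /* TPR */
        0x80A,  /* PPR */
        0x80D,  /* LDR */
        0x80F,  /* spurious interrupt vector */
        0x832,  /* LVT timer */
        0x835,  /* LVT LINT0 */
        0x836,  /* LVT LINT1 */
        0x837,  /* LVT error */
    };
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%u/msr", cpu);

    for (size_t i = 0; i < sizeof(regs) / sizeof(regs[0]); i++) {
        uint64_t v;
        if (read_msr_at(path, regs[i], &v) == 0)
            printf("cpu%u msr 0x%03x = 0x%016llx\n", cpu,
                   (unsigned)regs[i], (unsigned long long)v);
        else
            printf("cpu%u msr 0x%03x: read failed\n", cpu, (unsigned)regs[i]);
    }
}
```

Calling `dump_x2apic(0)` under Linux and comparing the output with the same registers printed by your own kernel on core 0 would be one way to spot configuration differences.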