
IRQ1 when HLT on test machine zeroes RIP upper 32 bits.

Posted: Tue Mar 13, 2018 2:26 pm
by shacknetisp
The basic path the code takes is:

Code:

/* Set up IRQ1 with the interrupt controller. */
/* Install the interrupt handler at vector IRQ_BASE (currently 32) + 1. */
/* Enable interrupts with sti. */

for (;;) {
    __asm__ volatile ("hlt");   /* sleep until the next interrupt */
}
In Bochs, Qemu, and Virtualbox this works as expected - pressing a key on the keyboard generates an IRQ, the interrupt is handled, and execution is resumed in the hlt loop waiting for the next key.
On my test machine, however, an HP Pavilion (zv5000 according to all the labels, zv5200 as reported by the system), the IRQ is generated but the interrupt handler receives a pushed RIP value with the upper 32 bits zeroed. This, of course, breaks the return from the handler, because my kernel is positioned at the top of the address space, where the upper 32 bits of every address are all 1s.
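The symptom can be checked for inside the handler by looking at the RIP value the CPU pushed. The following is only a minimal sketch, assuming the kernel is linked into the top 4 GiB of the address space (so every valid return address has its upper 32 bits set to all 1s). The names `KERNEL_UPPER32`, `rip_is_truncated` and `rip_repair` are hypothetical and do not come from the kernel discussed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: all kernel code lives at 0xFFFFFFFFxxxxxxxx. */
#define KERNEL_UPPER32 UINT64_C(0xFFFFFFFF00000000)

/* True if the saved RIP has lost its upper half (upper 32 bits all zero). */
static bool rip_is_truncated(uint64_t saved_rip)
{
    return (saved_rip >> 32) == 0;
}

/* Diagnostic aid only: put the upper half back on a truncated RIP,
 * and leave an intact RIP untouched. */
static uint64_t rip_repair(uint64_t saved_rip)
{
    return rip_is_truncated(saved_rip) ? (saved_rip | KERNEL_UPPER32)
                                       : saved_rip;
}
```

Patching the saved RIP this way would only hide the problem; it is mainly useful for logging how often, and for which IRQs, the truncation happens.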

I've done some experiments, trying to find what the source of this may be:
  • Perhaps most importantly, this does not occur when the loop is empty or a pause instruction. It only fails when the loop is a hlt.
  • This issue does not occur for other IRQs which I've tested, the PIT IRQ0 and RTC IRQ8.
  • This issue does not occur when firing the interrupt manually with the int instruction.
  • This issue (or anything similar) does not occur in plain 32-bit protected mode.
  • The issue occurs whether I set up IRQ1 through the PIC or APIC. It is not affected by the APIC system being initialized or not.
At this point I'm stuck trying to find what is causing this; nothing I've read describes this kind of problem. Has anyone else encountered a similar issue, or does anyone know what is happening here and how to approach it?

Re: IRQ1 when HLT on test machine zeroes RIP upper 32 bits.

Posted: Tue Mar 13, 2018 9:37 pm
by Brendan
Hi,
shacknetisp wrote:I've done some experiments, trying to find what the source of this may be:
  • Perhaps most importantly, this does not occur when the loop is empty or a pause instruction. It only fails when the loop is a hlt.
  • This issue does not occur for other IRQs which I've tested, the PIT IRQ0 and RTC IRQ8.
  • This issue does not occur when firing the interrupt manually with the int instruction.
  • This issue (or anything similar) does not occur in plain 32-bit protected mode.
  • The issue occurs whether I set up IRQ1 through the PIC or APIC. It is not affected by the APIC system being initialized or not.
These tests rule out all the "more likely" possibilities I could think of and leave mostly just the "firmware is buggy" option (e.g. maybe HLT triggers SMM and the firmware's SMM handler only supports 32-bit software properly).

I did some searching and found that Linux has a special work-around to make sure that (if the firmware's DMI says the computer is "Pavilion zv5000") only the maximum C state is used. Further digging suggests this was done because the C2 state ("stop clock") is buggy, but I'm not sure about the C1 state ("HLT"). Note that Linux also has generic work-arounds for various "HLT is buggy" situations done during boot (e.g. to test if HLT wakes up from an IRQ).

However, before assuming that the problem is the firmware, there's one little test I'd suggest.

Originally, when AMD released long mode, their documentation ("BIOS and Kernel Developer's Guide for AMD Athlon™ 64 and AMD Opteron™ Processors") described a BIOS function intended to inform the BIOS of whether or not the OS intends to use 64-bit/long mode. Nobody really had any idea why this BIOS function was created, and I assumed it might be so that the BIOS could reconfigure itself (and things like its SMM handler) differently if the OS says it intends to use 64-bit. Since then (as far as I know) almost everyone (including motherboard/firmware manufacturers) has deprecated/ignored the BIOS function and mostly forgotten that it ("temporarily") existed. Note that HP also had a "Pavilion zv5000z" that did use an AMD Athlon CPU, and it's reasonable to assume that parts of the firmware are shared by both "Pavilion zv5000" and "Pavilion zv5000z".

I'm thinking that there's a tiny chance (if you're not doing it already) that the problem is that you're not informing the BIOS that you intend to use 64-bit during boot. To do this, here's the relevant part of AMD's original documentation:
AMD wrote:12.21 Detect Target Operating Mode Callback

The operating system notifies the BIOS what the expected operating mode is with the Detect Target Operating Mode callback (INT 15, function EC00h). Based on the target operating mode, the BIOS can enable or disable mode specific performance and functional optimizations that are not visible to system software.

This callback does not change the operating mode; it only declares the target mode to the BIOS. It should be executed only once by the BSP before the first transition into long mode.

The default operating mode assumed by the BIOS is Legacy Mode Target Only. If this is not the target operating mode, system software must execute this callback to change it before transitioning to long mode for the first time. If the target operating mode is Legacy Mode Target Only, the callback does not need to be executed.

The Detect Target Operating Mode callback inputs are stored in the AX and BL registers. AX has a value of EC00h, selecting the Detect Target Operating Mode function. One of the following values in the BL register selects the operating mode:
  • 01h — Legacy Mode Target Only. All enabled processors will operate in legacy mode only.
  • 02h — Long Mode Target Only. All enabled processors will switch into long mode once.
  • 03h — Mixed Mode Target. Processors may switch between legacy mode and long mode, or the preferred mode for system software is unknown. This value instructs the BIOS to use settings that are valid in all modes.
  • All other values are reserved.
The Detect Target Operating Mode callback outputs are stored in the AH register and CF (carry flag in the EFLAGS register), and the values of other registers are not modified. The following output values are possible:
  • AH = 00h and CF = 0, if the callback is implemented and the value in BL is supported.
  • AH = 00h and CF = 1, if the callback is implemented and the value in BL is reserved. This indicates an error; the target operating mode is set to Legacy Mode Target Only.
  • AH = 86h and CF = 1, if the callback is not supported.
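To make the three documented outcomes easier to act on, the AH/CF pair returned by the callback can be mapped to a small result code. This is only a sketch: the real-mode call itself (AX = EC00h, BL = mode, INT 15h) has to be issued by the boot code before the first switch to long mode and is not shown, and the names `dtom_mode`, `dtom_result` and `dtom_classify` are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values for BL, as listed in the AMD text quoted above. */
enum dtom_mode {
    DTOM_LEGACY_ONLY = 0x01,
    DTOM_LONG_ONLY   = 0x02,
    DTOM_MIXED       = 0x03
};

enum dtom_result {
    DTOM_OK,              /* AH = 00h, CF = 0: implemented, BL accepted   */
    DTOM_RESERVED_VALUE,  /* AH = 00h, CF = 1: implemented, BL reserved   */
    DTOM_NOT_SUPPORTED,   /* AH = 86h, CF = 1: callback not supported     */
    DTOM_UNKNOWN          /* any other combination (not documented)       */
};

/* Interpret the AH register and carry flag left by INT 15h, AX = EC00h. */
static enum dtom_result dtom_classify(uint8_t ah, bool cf)
{
    if (ah == 0x00 && !cf) return DTOM_OK;
    if (ah == 0x00 && cf)  return DTOM_RESERVED_VALUE;
    if (ah == 0x86 && cf)  return DTOM_NOT_SUPPORTED;
    return DTOM_UNKNOWN;
}
```

A boot loader would typically treat `DTOM_NOT_SUPPORTED` as harmless and carry on, since the documentation says the BIOS then simply keeps its default assumption.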

Cheers,

Brendan

Re: IRQ1 when HLT on test machine zeroes RIP upper 32 bits.

Posted: Wed Mar 14, 2018 1:57 pm
by shacknetisp
I wasn't using the Detect Target Operating Mode call, so thanks for pointing to that documentation, but after adding it to my setup sequence it only returned the "not supported" output.
After posting, I discovered that the actual model of the computer is a bit quirky: the labels on the case itself all say "Pavilion zv5000", which is what I initially read, but the system reports the name as "Pavilion zv5200".

For what it's worth, I was able to boot a 64-bit Linux on the system to check its logs, but I see nothing out of the ordinary regarding hlt, SMM, PS/2, or the keyboard itself.

Re: IRQ1 when HLT on test machine zeroes RIP upper 32 bits.

Posted: Wed Mar 14, 2018 2:09 pm
by BrightLight
Brendan wrote:These tests rule out all the "more likely" possibilities I could think of and leave mostly just the "firmware is buggy" option (e.g. maybe HLT triggers SMM and the firmware's SMM handler only supports 32-bit software properly).
This is actually an issue I've never faced nor have ever thought of, so I have a relevant question regarding this. If I'm using ACPI and ACPI SCI is enabled, no SMM code can run, right? I mean, once I properly enable ACPI, my OS is the only software running, isn't it?

Re: IRQ1 when HLT on test machine zeroes RIP upper 32 bits.

Posted: Wed Mar 14, 2018 11:18 pm
by Brendan
Hi,
shacknetisp wrote:I wasn't using the Detect Target Operating Mode call, so thanks for pointing to that documentation, but after adding it to my setup sequence it only returned the "not supported" output.
After posting, I discovered that the actual model of the computer is a bit quirky: the labels on the case itself all say "Pavilion zv5000", which is what I initially read, but the system reports the name as "Pavilion zv5200".

For what it's worth, I was able to boot a 64-bit Linux on the system to check its logs, but I see nothing out of the ordinary regarding hlt, SMM, PS/2, or the keyboard itself.
This could be as simple as (e.g.) the firmware's SMM code responsible for "PS/2 emulation for USB devices" only working for old 32-bit OSs (that didn't support USB), while newer/64-bit OSs that do support USB have no problem because they all disable the emulation.

Note: If I remember correctly (no guarantee!), when HLT is interrupted by an SMI the CPU updates RIP so that it points to the instruction after the HLT, and the SMM code is supposed to figure out (depending on what caused the SMI) whether it should return to the instruction after the HLT (and leave "return RIP" alone) or return to the HLT instruction itself (and adjust "return RIP" to reverse what the CPU did). I can imagine SMM code (which is typically "unreal mode like" and not 64-bit) getting this wrong in a way that only affects "return to HLT" but doesn't affect "return to any other instruction", and in a way that only affects "emulated IRQ 1" and doesn't affect real IRQs that weren't emulated by SMM. Of course this is all pure speculation - it's almost impossible to guess what the SMM code actually does.
omarrx024 wrote:
Brendan wrote:These tests rule out all the "more likely" possibilities I could think of and leave mostly just the "firmware is buggy" option (e.g. maybe HLT triggers SMM and the firmware's SMM handler only supports 32-bit software properly).
This is actually an issue I've never faced nor have ever thought of, so I have a relevant question regarding this. If I'm using ACPI and ACPI SCI is enabled, no SMM code can run, right? I mean, once I properly enable ACPI, my OS is the only software running, isn't it?
SMM can still be used for some things (e.g. "ECC RAM scrubbing") when ACPI/SCI is enabled (and when any/all kinds of legacy emulation are disabled).


Cheers,

Brendan