
AHCI: Have to insert random sleeps for real hardware

Posted: Mon Feb 22, 2021 8:59 am
by 8infy
Hi, today I spent about 6 hours trying to get my AHCI driver to work on all 3 of my laptops (all from the 2012-2016 range).
It ended up working on all 3, but now my code is sprinkled with random sleeps in different places and I'm trying to understand why.
I should also mention that I try to implement it as closely to the spec as possible, without relying on the BIOS for any kind of
HBA/port initialization.

1. If I don't reset the HBA and ports everything works out of the box of course (thanks BIOS).

2. If I reset the HBA, I have to wait about 1 extra second for the Phy interface to come back online on all ports after the HBA reset bit in GHC is cleared.
(My HBA reset code is written 1:1 according to the spec, and the spec doesn't mention the extra wait on top of the reset bit anywhere. Why???)

3. I do check the staggered spin-up bit, as well as the cold presence detection bit. I have one laptop that supports staggered spin-up, and again I have to wait for X amount of time
before communication is established after setting the spin-up bit in the port, and the spec mentions no way to verify that it's back online properly.

4. If I don't reset the ports after resetting the HBA on one specific laptop, ATA IDENTIFY hangs the port (the command issue bit is never cleared, no errors either). Why? Again, nothing in the spec.
(For some reason it also hangs the PS/2 emulation with it (lol), and I do perform BIOS handoff of course! Again 100% according to spec.)

5. I have to wait about 500ms after enabling the DMA engines for a port. Again, the spec doesn't mention that such a delay is needed at all.
(And again my DMA engine enabling code is 100% spec compliant, as in I set the FRE bit first, then verify CR is off, then finally set ST.)
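
For reference, the enable order described in item 5 (set FRE, confirm CR is clear, then set ST) can be sketched as a bounded poll instead of a fixed delay. This is only a sketch under assumptions: `FakePortCmd` is a hypothetical stand-in for the memory-mapped PxCMD register so the example runs off-target, and `start_port` and `max_polls` are illustrative names, not part of the driver above.

```cpp
#include <cassert>
#include <cstdint>

// PxCMD bit positions from the AHCI specification.
constexpr uint32_t CMD_ST  = 1u << 0;  // start processing the command list
constexpr uint32_t CMD_FRE = 1u << 4;  // FIS receive enable
constexpr uint32_t CMD_CR  = 1u << 15; // command list DMA engine running (read-only)

// Hypothetical simulated register: CR drops after a few reads, the way a
// real engine needs some time to become idle.
struct FakePortCmd {
    uint32_t value;
    int reads_until_idle;

    uint32_t read()
    {
        if ((value & CMD_CR) && --reads_until_idle <= 0)
            value &= ~CMD_CR;
        return value;
    }

    // CR is read-only, so a write keeps whatever CR currently holds.
    void write(uint32_t v) { value = (v & ~CMD_CR) | (value & CMD_CR); }
};

// Sets FRE, polls until CR is clear, then sets ST.
// Returns false if CR never clears within max_polls reads.
template <typename Cmd>
bool start_port(Cmd& cmd, int max_polls)
{
    assert(max_polls > 0);
    cmd.write(cmd.read() | CMD_FRE);

    for (int i = 0; i < max_polls; ++i) {
        const uint32_t now = cmd.read();
        if (!(now & CMD_CR)) {
            cmd.write(now | CMD_ST);
            return true;
        }
    }
    return false; // engine still running: report it instead of sleeping blindly
}
```

On real hardware the poll count would be replaced by a time-based deadline, and `read`/`write` would be volatile MMIO accesses.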

I feel like inserting sleeps is just a random hack that happened to work, and maybe I missed some important bit that I have to check, or some other initialization I have to perform...
Any ideas as to where I could actually read about proper 100% safe AHCI initialization code without random sleep() calls sprinkled everywhere? Apparently not the AHCI specification.
I have tried reading the Linux source code but it's obfuscated to an extent where it's not really readable. It also implements random workarounds for different controllers, which is just too much for me atm.

UPD: I managed to get rid of the sleeps everywhere except HBA reset and port reset. Those are still a mystery to me, and I don't understand which bit indicates that a reset is fully complete :(
UPD2: After a few more hours of trying different delays, 50ms seems to be the shortest delay that works on all 3 AHCI controllers. I guess I'll leave it at that for now.
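
One way to avoid fixed sleeps in general is to poll a status bit against a deadline. The helper below is a minimal sketch of that idea, not something taken from the driver: `wait_for` is a hypothetical name, and `std::chrono::steady_clock` stands in for a kernel timer such as `Timer::nanoseconds_since_boot()`.

```cpp
#include <cassert>
#include <chrono>

// Polls `condition` until it returns true or `timeout` has elapsed.
// Returns true if the condition was observed, false on timeout.
template <typename Predicate>
bool wait_for(Predicate condition, std::chrono::nanoseconds timeout)
{
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    for (;;) {
        if (condition())
            return true;
        if (std::chrono::steady_clock::now() >= deadline)
            return condition(); // one last look so a late success is not missed
    }
}
```

A caller would pass a lambda that reads the register in question, for example one that checks whether a port's DET field has reached 3, plus the longest time it is willing to wait. The caller then learns whether the condition was met instead of assuming it after a delay.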

As an example here's my code for resetting a port. (It's done after disabling the DMA engines in a different function)

Code:

void AHCI::reset_port(size_t index)
{
    auto sctl = port_read<PortSATAControl>(index);
    sctl.device_detection_initialization = PortSATAControl::DeviceDetectionInitialization::PERFORM_INITIALIZATION;
    port_write(index, sctl);

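    // The spec requires DET = 1 to be held for at least 1 ms so that COMRESET is actually sent.
    static constexpr size_t comreset_delivery_wait = Time::nanoseconds_in_millisecond;
    auto wait_begin = Timer::nanoseconds_since_boot();
    auto wait_end = wait_begin + comreset_delivery_wait;

    // Busy-wait until the 1 ms hold time has passed.
    while (wait_end > Timer::nanoseconds_since_boot());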

    sctl = port_read<PortSATAControl>(index);
    sctl.device_detection_initialization = PortSATAControl::DeviceDetectionInitialization::NOT_REQUESTED;
    port_write(index, sctl);

    wait_begin = Timer::nanoseconds_since_boot();
    wait_end = wait_begin + comreset_delivery_wait;

    auto ssts = port_read<PortSATAStatus>(index);

    while (ssts.device_detection != PortSATAStatus::DeviceDetection::DEVICE_PRESENT_PHY) {
        ssts = port_read<PortSATAStatus>(index);

        if (wait_end < Timer::nanoseconds_since_boot())
            break;
    }

    if (ssts.device_detection != PortSATAStatus::DeviceDetection::DEVICE_PRESENT_PHY)
        runtime::panic("AHCI: Port physical layer failed to come back online after reset");

    // Removing this will cause the driver to break on real hw
    sleep::for_milliseconds(50);

    m_hba->ports[index].error = 0xFFFFFFFF;

    log("AHCI") << "successfully reset port " << index;
}

Re: AHCI: Have to insert random sleeps for real hardware

Posted: Sat Feb 27, 2021 5:22 am
by 8infy
Ok, after many attempts I have finally figured out the actual AHCI HBA initialization/port reset routine, which is confirmed to work on 3 different laptops from different
years as well as VMware and QEMU. It is described in the AHCI specification, but it is scattered across different parts of it, so you have to
collect all the parts and put them together.

Here's how it should work:
1. Enable AHCI mode by setting GHC.AE.
2. Perform BIOS handoff if supported.
3. Make sure all ports are idle (ST, CR, FRE and FR are cleared).
4. Perform a standard AHCI reset by setting the GHC.HR bit and waiting for it to become 0.
5. Re-enable AHCI mode by setting GHC.AE again (the reset clears it).
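
Steps 1, 4 and 5 above can be sketched as follows; the BIOS handoff and idle checks (steps 2 and 3) are left out. This is an illustration under assumptions: `FakeGhc` is a hypothetical stand-in for the memory-mapped GHC register so the example runs off-target, it models a controller whose reset also clears AE, and `reset_hba` and `max_polls` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// GHC bit positions from the AHCI specification.
constexpr uint32_t GHC_HR = 1u << 0;  // HBA reset, self-clearing
constexpr uint32_t GHC_AE = 1u << 31; // AHCI enable

// Hypothetical simulated register: the reset completes after a few reads,
// at which point HR clears itself and AE is lost as well.
struct FakeGhc {
    uint32_t value = 0;
    int reads_until_reset_done = 3;

    uint32_t read()
    {
        if ((value & GHC_HR) && --reads_until_reset_done <= 0)
            value = 0;
        return value;
    }

    void write(uint32_t v) { value = v; }
};

// Enables AHCI mode, requests an HBA reset, polls for HR to clear and then
// enables AHCI mode again. Returns false if HR is still set after max_polls.
template <typename Ghc>
bool reset_hba(Ghc& ghc, int max_polls)
{
    assert(max_polls > 0);
    ghc.write(ghc.read() | GHC_AE); // step 1
    ghc.write(ghc.read() | GHC_HR); // step 4: request the reset

    for (int i = 0; i < max_polls; ++i) {
        if (!(ghc.read() & GHC_HR)) {
            ghc.write(ghc.read() | GHC_AE); // step 5
            return true;
        }
    }
    return false; // HBA is hung; the caller decides how to recover
}
```

A real driver would bound the poll by time rather than by a read count; the AHCI spec treats an HR bit that is still set after 1 second as a hung controller.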
For each implemented port:
1. Set both the command list base and the FIS receive base to valid physical addresses.
2. Set the FIS receive enable (FRE) bit in the port command register (otherwise PxTFD.STS.BSY will stay set to 1 forever).
3. If staggered spin-up is supported, set the SUD bit in the port command register (PxCMD.SUD).
4. Wait about 1ms for DET in the SATA status register (PxSSTS.DET) to become 3.
5. If it doesn't become 3 within that time, the port has no device attached, so we continue to the next port.
6. Clear the port error register (PxSERR) by writing 0xFFFFFFFF to it (otherwise, again, the port will be stuck in BSY forever).
7. Spin on PxTFD.STS.DRQ and PxTFD.STS.BSY; they must be cleared within a small amount of time (after the device has finished transferring the initial FIS with the signature and stuff).

That's pretty much it. Resetting a single port is the same, except that you skip step 3 (setting the SUD bit) and start with the standard port reset sequence (clearly described in the AHCI spec).
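
The per-port steps 2 to 7 can be sketched like this. It is only a sketch under assumptions: `FakePort` is a hypothetical simulated port (DET walks 0, 1, 3 on successive reads, and BSY stays set while PxSERR is non-zero, mirroring the behaviour described above), step 1 is omitted, and `bring_up_port`, `PortState` and `max_polls` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t PXCMD_SUD = 1u << 1;   // spin-up device
constexpr uint32_t PXCMD_FRE = 1u << 4;   // FIS receive enable
constexpr uint32_t PXSSTS_DET_MASK = 0xF; // device detection field
constexpr uint32_t DET_PRESENT_PHY = 0x3; // device present, Phy established
constexpr uint32_t PXTFD_DRQ = 1u << 3;
constexpr uint32_t PXTFD_BSY = 1u << 7;

// Hypothetical simulated port registers, so the sketch runs off-target.
struct FakePort {
    bool device_attached;
    uint32_t cmd;
    uint32_t serr;
    int ssts_reads;

    explicit FakePort(bool attached)
        : device_attached(attached), cmd(0), serr(0x04050002), ssts_reads(0) {}

    uint32_t read_ssts()
    {
        if (!device_attached)
            return 0x0;
        ++ssts_reads;
        if (ssts_reads >= 3)
            return 0x3;
        return ssts_reads == 2 ? 0x1 : 0x0;
    }

    void write_serr(uint32_t bits) { serr &= ~bits; } // write 1 to clear
    uint32_t read_tfd() const { return serr != 0 ? PXTFD_BSY : 0x50; }
};

enum class PortState { NoDevice, Ready, Stuck };

template <typename Port>
PortState bring_up_port(Port& port, bool staggered_spin_up, int max_polls)
{
    assert(max_polls > 0);
    port.cmd |= PXCMD_FRE;     // step 2
    if (staggered_spin_up)
        port.cmd |= PXCMD_SUD; // step 3

    // Steps 4 and 5: only DET == 3 counts as "device present".
    bool present = false;
    for (int i = 0; i < max_polls && !present; ++i)
        present = (port.read_ssts() & PXSSTS_DET_MASK) == DET_PRESENT_PHY;
    if (!present)
        return PortState::NoDevice;

    port.write_serr(0xFFFFFFFF); // step 6

    // Step 7: wait for BSY and DRQ to drop.
    for (int i = 0; i < max_polls; ++i)
        if (!(port.read_tfd() & (PXTFD_BSY | PXTFD_DRQ)))
            return PortState::Ready;
    return PortState::Stuck;
}
```

Returning a state instead of panicking lets the caller skip empty ports and decide separately what to do with a port that never leaves BSY.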

EDIT:
Another super important thing I learned today. The spec says:
Wait for a positive indication that a device is attached to the port (the maximum amount
of time to wait for presence indication is specified in the Serial ATA Revision 2.6
specification). This is done by polling PxSSTS.DET. If PxSSTS.DET returns a value of 1h
or 3h when read, then system software shall continue to the next step, otherwise if the
polling process times out system software moves to the next implemented port and
returns to step 1.
The part where the spec says 1h OR 3h is bullshit. Each port's DET state goes from 0 to 1 to 3 during reset. If you went on
to the next step after detecting 1h, you would clear the error register too early, and the port would hang in the BSY state. I verified
this on multiple AHCI controllers! (Therefore you MUST wait for 3h and assume no device is present otherwise.)
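
As a small illustration of that rule, the check below decodes the DET field (the low 4 bits of PxSSTS) and accepts only 3h. The names `Det`, `det_of` and `safe_to_clear_serr` are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// PxSSTS.DET values defined by the AHCI specification.
enum class Det : uint8_t {
    NoDevice     = 0x0, // no device detected, no Phy communication
    PresentNoPhy = 0x1, // device detected, Phy communication not established
    PresentPhy   = 0x3, // device detected and Phy communication established
    Offline      = 0x4, // Phy offline
};

// DET lives in bits 3:0 of PxSSTS; SPD and IPM sit above it.
constexpr Det det_of(uint32_t pxssts)
{
    return static_cast<Det>(pxssts & 0xF);
}

// Per the observation above: continue only on 3h, never on 1h.
constexpr bool safe_to_clear_serr(uint32_t pxssts)
{
    return det_of(pxssts) == Det::PresentPhy;
}

static_assert(!safe_to_clear_serr(0x1), "1h is only a transitional state");
static_assert(safe_to_clear_serr(0x3), "3h means the link is up");
```

Masking first matters because a live link also reports speed and power state in the upper bits, so a raw comparison of the whole register against 3 would fail.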