Hi,
rdos wrote:Brendan wrote:Sadly, Linux does something like this too - take a high precision timer like the local APIC timer, and use it as a general purpose timing thing so that you can bury it under millions of networking timeouts (that have no need for high precision). It's stupid because there's always a minimum amount of time between delays, and when there's too many things using the same timer you have to group things together to avoid the "minimum time between delays" problem. For example, if "foo" should happen in 1000 ns and "bar" should happen in 1234 ns, then you can't setup a 234 ns delay and have to bunch them together, and "foo" ends up happening 234 ns too late. Things that don't need such high precision should use a completely different timer to avoid screwing up the precision for things that do need high precision.
I think the major difference between RDOS and Linux is how ISRs and timer-callbacks are coded. In RDOS, you should keep both ISRs and timer-callbacks short. A typical ISR and/or timer callback only consists of a signal to wake a server-thread, along with clearing some interrupt conditions. User-apps cannot use timers directly at all (they are kernel-only and run as ISRs). Since timer-callbacks are generally shorter than the overall interrupt latency, mixing precision timers is not a problem. You would not gain anything by using separate hardware for high precision timers as it is the interrupt latency that determines response times, not the resolution of the timer. In order to get ns resolution for timed events, it is necessary to run on a dedicated core without preemption and interrupt load.
You're missing the point.
Imagine you're using the PIT in "one shot" mode (the local APIC timer is similar, with different numbers). You set the count, the PIT decreases it at a fixed rate, and an IRQ occurs when the count reaches zero. The largest value you can use for the count is 65536, which gives a maximum delay of 54.9254 ms. What is the minimum value you can set the count to? If you set the count too low you're going to have problems with race conditions, and the timer and the PIC chip can fail to keep up (leading to missed IRQs, etc). The minimum count you can actually use in practice might be 119, giving a minimum delay of 99.7 us.
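That arithmetic can be sketched as follows (a sketch under the assumptions above; the practical minimum of 119 is the number quoted in this post, not a hardware-defined limit, and the function names are mine):

```c
#include <stdint.h>

/* PIT input clock: 1193182 Hz, so one tick is ~838.095 ns. */
#define PIT_HZ        1193182ULL
#define PIT_MIN_COUNT 119      /* practical minimum, as assumed above     */
#define PIT_MAX_COUNT 65536    /* a count of 0 is treated as 65536 by the PIT */

/* Convert a requested delay in nanoseconds to a one-shot PIT count,
   clamped to the usable range described above. */
static uint32_t pit_count_for_ns(uint64_t ns)
{
    uint64_t count = (ns * PIT_HZ) / 1000000000ULL;
    if (count < PIT_MIN_COUNT) count = PIT_MIN_COUNT;
    if (count > PIT_MAX_COUNT) count = PIT_MAX_COUNT;
    return (uint32_t)count;
}

/* The delay (in nanoseconds) a given count actually produces. */
static uint64_t pit_ns_for_count(uint32_t count)
{
    return ((uint64_t)count * 1000000000ULL) / PIT_HZ;
}
```

With integer truncation, 1.2345 ms comes out as a count of 1472 rather than the rounded 1473 below; either is within one tick (~838 ns) of the requested delay.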
Basically, for the PIT you can get 838 ns precision, but only for delays between 99.7 us and 54925.4 us. Now let's assume the PIT is being used by the scheduler, and the scheduler wants a task to run for exactly 1.2345 ms. That's easy to do - you set the count to 1473 and the IRQ occurs after 1.2345 ms (actually "1.2345144424780244416538973528759 ms", but it's close enough). That gives the scheduler about 838 ns precision, which is very nice.
Now imagine someone decides to use the timer for something else at the same time. The scheduler wants a task to run for exactly 1.2345 ms, but the networking stack has asked for a delay that will expire in 1.200 ms time. After 1.2 ms has passed the IRQ for the networking stack occurs, and you want another IRQ in 34.5 us (because 1.2 ms + 34.5 us = 1.2345 ms). Unfortunately 34.5 us is below the minimum you can ask for in practice, so you have to use the minimum itself, resulting in an IRQ that occurs after 99.7 us. The scheduler's 1.2345 ms delay ends up being a 1.2997 ms delay because someone else was using the timer too. Instead of getting 838 ns precision the scheduler can only really get 99.7 us precision; and using the timer for multiple things has made it worse than 100 times less precise.
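A tiny simulation makes the lateness concrete (names and structure are mine, not from any real kernel; only the ~99.7 us practical minimum from above is modelled, working directly in nanoseconds):

```c
#include <stdint.h>

/* Practical minimum delay of the shared one-shot timer (~99.7 us). */
#define MIN_DELAY_NS 99700ULL

static uint64_t clamp_delay(uint64_t ns)
{
    return ns < MIN_DELAY_NS ? MIN_DELAY_NS : ns;
}

/* Fire two deadlines through one shared timer: the networking stack's
   deadline first, then the scheduler's. Returns how late the
   scheduler's deadline ends up (both deadlines in ns from "now",
   with net_deadline <= sched_deadline). */
static uint64_t scheduler_lateness_ns(uint64_t net_deadline,
                                      uint64_t sched_deadline)
{
    uint64_t now = clamp_delay(net_deadline);  /* first IRQ: networking */
    now += clamp_delay(sched_deadline - now);  /* second IRQ: scheduler */
    return now - sched_deadline;
}
```

For the numbers above, `scheduler_lateness_ns(1200000, 1234500)` gives 65200 ns: the 34.5 us remainder gets stretched to the 99.7 us minimum. With a dedicated timer the scheduler's deadline would not be clamped at all.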
rdos wrote:Brendan wrote:What is "elapsed time"? I'd use TSC for this if I could (and fall back to HPET if TSC can't be used, and fall back to ACPI's counter if both HPET and TSC can't be used).
Elapsed time is how many tics have elapsed since the system started (simply explained, but not entirely true). A tic is one period on the PIT, which is convenient since 2^32 tics is one hour.
2^32 tics is closer to 0.99988669323428908795205349655706 hours. For every hour you can expect to be wrong by half a second; which adds up to about 68.6 seconds per week. You'd want to fix this problem (and also fix other causes of drift) by adding a tiny bias. For example, every "10 ms" you might add 10.0000001 ms to your counter. In this case "2^32 tics is one hour" isn't more convenient than anything else (e.g. you could just as easily use "64-bit nanoseconds since the start of the year 2000" if you wanted; or maybe 32-bit seconds and 32-bit fractions of a second) because you're just adding a pre-computed amount.
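The drift is easy to check from the PIT's 1193182 Hz input clock (a small sketch; the function names are mine):

```c
#include <stdint.h>

#define PIT_HZ 1193182ULL

/* Real elapsed time represented by 2^32 PIT tics: ~3599.59 s,
   not the 3600 s you'd get by calling it "one hour". */
static double tics_2_32_in_seconds(void)
{
    return 4294967296.0 / (double)PIT_HZ;
}

/* Error accumulated per week if each 2^32 tics is counted as an hour:
   (3600 - 3599.59) s/hour * 24 * 7 is roughly 68.7 s. */
static double error_seconds_per_week(void)
{
    return (3600.0 - tics_2_32_in_seconds()) * 24.0 * 7.0;
}
```

This is why a pre-computed bias per timer IRQ is needed anyway, at which point the "2^32 tics = 1 hour" unit buys nothing over plain nanoseconds or fixed-point seconds.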
rdos wrote:I thus use 8 bytes to represent elapsed time.
Which means you can't just "add [ticks],value" and have to update both dwords atomically with something like "lock cmpxchg". Not that it really matters much, given that the biggest problem is that whenever the timer IRQ occurs the cache line is modified and all other CPUs have to fetch the new contents of the cache line, causing excessive cache traffic on many-CPU systems (avoiding this is the biggest advantage of "loosely synchronised TSC").
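In C11 terms the update looks roughly like this (a sketch, not RDOS code; on 32-bit x86 the compiler implements the 64-bit atomic add with a `lock cmpxchg8b` loop, since a plain "add [ticks],1" would only cover the low dword):

```c
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit tick counter that any CPU may read while the timer IRQ
   updates it. The atomic RMW keeps both dwords consistent, but every
   update still dirties the cache line for all readers. */
static _Atomic uint64_t ticks;

/* Called from the timer IRQ. */
static void timer_irq_tick(void)
{
    atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
}

/* Safe to call from any CPU. */
static uint64_t elapsed_ticks(void)
{
    return atomic_load_explicit(&ticks, memory_order_relaxed);
}
```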
rdos wrote:Brendan wrote:Your normal "generic timer" stuff uses NMI? Sounds seriously painful to me.
No. I have reserved NMI for the crash debugger. When the scheduler hits a fatal error, it will send NMI to all other cores to freeze them, regardless of whether they have interrupts enabled or not. Thus, NMI is not available.
It doesn't take a genius to set an "entering crash debugger" flag, and test that flag at the start of the NMI handler to determine if the cause of the NMI was the watchdog or a crash.
rdos wrote:So, if the user loads the PIC device-driver, it will look for PITs for timers / elapsed time, as those are commonly found on older hardware without APIC. If the user loads the APIC device-driver, it would make different choices, selecting between APIC timer, HPET or PIT. The only choice users have is to select which interrupt controller is available.
So if the user says there's a PIC, you ignore the flag in the ACPI tables that says whether there are PIC chips, and also assume there's no HPET even if there is one? If the user says there are APICs, you ignore the details in the ACPI MADT that say whether there are APICs, and assume there's no PIT even if there is one?
If someone cuts off their arm and is bleeding to death, instead of making the obvious assumption (that they want some first aid) based on easily observable facts, do you ask them if they want a hamburger?
rdos wrote:As can be seen, it reports that the PIT exists, but doesn't have any interrupts. It also reports that HPET has both IRQ 0 and 8.
That's probably 100% correct - there is a PIT (but it might only be used for speaker control), and HPET (plus SMM) is used to emulate PIT channel 0 and IRQ0 for legacy OSs that don't enable "ACPI mode". An OS that looks at the AML is assumed to use "ACPI mode", where the firmware doesn't bother using SMM to emulate the PIT (the OS just uses the HPET directly).
Cheers,
Brendan