Implementation of system time
Hello! This is my first question/post here. I've been working on my OS for a couple of months in my spare time, and I absolutely love it, and this forum/wiki.
I've come to the point where I need precise system time and a timer/event system. I have a basic idea nailed down pretty well: set up an IRQ at 'X' Hz, add '10^9 / X' to a 'nanoseconds since boot' counter in the handler, and then use the TSC to interpolate between IRQs to get nanosecond-precise time.
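Roughly, the read path I have in mind looks like this (just a sketch; the variables correspond to the COUNTER_MAIN/LAST_TSC/RATE_TSC names I use below):

Code:
#include <stdint.h>

/* All three are written by the timer IRQ handler at 'X' Hz. */
static volatile uint64_t ns_since_boot;   /* COUNTER_MAIN */
static volatile uint64_t last_tsc;        /* TSC value sampled at the last IRQ */
static volatile uint64_t ns_per_tsc_q32;  /* RATE_TSC as 32.32 fixed point */

static inline uint64_t tsc_read(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Interpolate: base count plus TSC ticks elapsed since the last IRQ,
 * converted to ns. 'elapsed' stays small between IRQs, so the 32.32
 * multiply doesn't overflow. Note that this unsynchronized read is
 * exactly the racy part the rest of my question is about. */
uint64_t system_time_ns(void)
{
    uint64_t elapsed = tsc_read() - last_tsc;
    return ns_since_boot + ((elapsed * ns_per_tsc_q32) >> 32);
}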
But the actual implementation of this system is where I am unsure; I've thought of two ways to go about it.
1) Have the HPET/PIT raise an IRQ on the BSP, where it performs the logic needed to increment the counter (writing COUNTER_MAIN, LAST_TSC, and RATE_TSC). But what if I need to sample the system time from another CPU? I could end up reading in between the writes to COUNTER_MAIN and LAST_TSC, getting a result that's up to '1000/X' ms ahead, and a reading immediately after could be less than the older reading (or a race condition?). I was thinking of making a sort of 'swap chain' where there are two sets of counters, swapped after each update, so we never write to the set of counters that might be sampled.
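To illustrate, the 'swap chain' I'm picturing is something like this (untested sketch; single writer on the BSP, readers on any CPU):

Code:
#include <stdint.h>
#include <stdatomic.h>

struct time_set {
    uint64_t counter_main;  /* ns since boot at the IRQ */
    uint64_t last_tsc;      /* TSC at the IRQ */
    uint64_t rate_tsc;      /* TSC->ns conversion factor */
};

static struct time_set sets[2];
static _Atomic unsigned current_set;  /* index readers should use */

/* BSP timer IRQ: write the inactive set, then publish it atomically. */
void time_irq_update(uint64_t ns, uint64_t tsc, uint64_t rate)
{
    unsigned next = atomic_load_explicit(&current_set, memory_order_relaxed) ^ 1;
    sets[next].counter_main = ns;
    sets[next].last_tsc = tsc;
    sets[next].rate_tsc = rate;
    atomic_store_explicit(&current_set, next, memory_order_release);
}

/* Any CPU: snapshot a set that isn't being written. */
struct time_set time_sample(void)
{
    unsigned idx = atomic_load_explicit(&current_set, memory_order_acquire);
    return sets[idx];
}

(One hole I can already see: a reader that stalls mid-copy for two whole IRQ periods could still see a torn set, so maybe a sequence counter that readers re-check would be the safer variant.)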
2) I could(?) send the IRQ to each CPU in the system using one of the I/O APIC entries, although I'm not sure how to do this (0xFF for the destination, like with IPIs?), and the documentation isn't very clear to me. This would ensure there are no issues with data races, bad readings during an update, or anything of that nature. But it requires extra overhead, since each CPU has to run the IRQ and find its own set of counters, and reading system time would also require getting a unique identifier and calculating where the counters are in memory.
This basically leads me to ask: are there any major issues with either of these methods? Is there one you would prefer? And are there any issues with power states?
And one final question: is a timer/event system, with time buckets and whatnot, actually needed, or at least is the development time justified? And why can't a scheduler perform any timing/event callbacks on its own, without having to implement a dedicated system for it?
Thank you to anyone who takes the time to read and reply!
Re: Implementation of system time
For system time, I would simply use a hardware counter. In many cases, the TSC is consistent and can be used, or else you can read the HPET timer. I would not count IRQs (which is what you are proposing), just use a counter that doesn't overflow. Counting interrupts has the problem that you must halt everything you are currently doing to keep track of a value that is not always needed. This has power management implications as well, but mostly it is just plain unnecessary.
As for updating multiple variables in such a way that the values are consistent, if you cannot do so atomically, use a spinlock. This is the prototypical example of when to use a spinlock. The lock will only be held for a very short time, either to update or to read those variables.
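In sketch form, using the counter names from your post (inside the kernel you'd also disable interrupts around the lock, so the timer IRQ can't deadlock against a reader on the same CPU):

Code:
#include <stdint.h>
#include <stdatomic.h>

static atomic_flag time_lock = ATOMIC_FLAG_INIT;
static uint64_t counter_main, last_tsc, rate_tsc;

static void time_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&time_lock, memory_order_acquire))
        ;  /* spin; the critical sections below are only a few stores */
}

static void time_lock_release(void)
{
    atomic_flag_clear_explicit(&time_lock, memory_order_release);
}

/* Timer IRQ: update all three variables as one unit. */
void time_update(uint64_t ns, uint64_t tsc, uint64_t rate)
{
    time_lock_acquire();
    counter_main = ns;
    last_tsc = tsc;
    rate_tsc = rate;
    time_lock_release();
}

/* Any CPU: read a consistent snapshot. */
void time_read(uint64_t *ns, uint64_t *tsc, uint64_t *rate)
{
    time_lock_acquire();
    *ns = counter_main;
    *tsc = last_tsc;
    *rate = rate_tsc;
    time_lock_release();
}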
Carpe diem!
Re: Implementation of system time
In the recent re-write of my kernel I opted to assume a constant TSC rate and calibrate that on startup through a PIT one-shot - no interrupts involved. The HPET as a timing source has been widely considered problematic; TSCs have been constant for over a decade, and they're zero-cost and have gigahertz precision available.
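The calibration itself amounts to timing a fixed PIT countdown against the TSC. A rough sketch (PIT channel 2 in mode 0, gated and polled through port 0x61; outb/inb are the usual port I/O helpers):

Code:
#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{ __asm__ volatile("outb %0, %1" : : "a"(val), "Nd"(port)); }

static inline uint8_t inb(uint16_t port)
{ uint8_t v; __asm__ volatile("inb %1, %0" : "=a"(v) : "Nd"(port)); return v; }

static inline uint64_t rdtsc(void)
{ uint32_t lo, hi; __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo; }

/* Measure TSC ticks across a ~10 ms PIT channel-2 one-shot
 * (11932 PIT ticks at 1193182 Hz). */
uint64_t tsc_ticks_per_10ms(void)
{
    outb(0x61, (inb(0x61) & ~0x02) | 0x01); /* gate channel 2 on, speaker off */
    outb(0x43, 0xB0);                       /* channel 2, lo/hi byte, mode 0 */
    outb(0x42, 11932 & 0xFF);
    outb(0x42, 11932 >> 8);

    uint64_t start = rdtsc();
    while (!(inb(0x61) & 0x20))             /* OUT2 goes high at terminal count */
        ;
    return rdtsc() - start;
}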
Re: Implementation of system time
zberry7 wrote: Hello! This is my first question/post here. I've been working on my OS for a couple of months in my spare time, and I absolutely love it, and this forum/wiki.
I've come to the point where I need precise system time and a timer/event system. I have a basic idea nailed down pretty well: set up an IRQ at 'X' Hz, add '10^9 / X' to a 'nanoseconds since boot' counter in the handler, and then use the TSC to interpolate between IRQs to get nanosecond-precise time.
So, what will your IRQ do? You will not do anything meaningful in it, and so it is just wasting CPU cycles.
Instead, you always need to have a preemption timer that the scheduler updates, so rather than updating system time in an IRQ, it's better to do it in the scheduler when it sets up the preemption timer, since it reads system time at that point. Every time you read system time, you read out the counter (PIT, HPET, or whatever), subtract the previous read-out from it, and add the difference to system time (possibly after a conversion).
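In sketch form it is not much code (hpet_read_counter() is a stand-in for whatever counter the system has; the update must run under the system-time spinlock mentioned above):

Code:
#include <stdint.h>

extern uint64_t hpet_read_counter(void); /* placeholder for PIT/HPET/... */

static uint64_t system_time_ns;  /* accumulated system time */
static uint64_t last_count;      /* counter value at the previous read-out */
static uint64_t ns_per_count;    /* conversion factor; a fixed-point factor
                                    would avoid rounding drift */

/* Called whenever system time is read, e.g. by the scheduler when it
 * re-arms the preemption timer. */
uint64_t system_time_read(void)
{
    uint64_t now = hpet_read_counter();
    system_time_ns += (now - last_count) * ns_per_count;
    last_count = now;
    return system_time_ns;
}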
Timers are best implemented by creating a list of active timers sorted in expire order. The expire time of each timer could be expressed in system time, which also makes sure the hardware counter behind system time is read regularly, so it doesn't overflow.
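A minimal sketch of such a list (singly linked, earliest expiry at the head; real code would take a lock around both operations):

Code:
#include <stdint.h>
#include <stddef.h>

struct timer {
    uint64_t expires_ns;       /* absolute system time */
    void (*callback)(void *);
    void *arg;
    struct timer *next;
};

static struct timer *active;   /* earliest expiry first */

/* Insert in expire order so the head is always the next timer to fire. */
void timer_add(struct timer *t)
{
    struct timer **p = &active;
    while (*p && (*p)->expires_ns <= t->expires_ns)
        p = &(*p)->next;
    t->next = *p;
    *p = t;
}

/* Run from the timer tick / scheduler: fire everything that has expired. */
void timer_expire(uint64_t now_ns)
{
    while (active && active->expires_ns <= now_ns) {
        struct timer *t = active;
        active = t->next;
        t->callback(t->arg);
    }
}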
zberry7 wrote: 1) Have the HPET/PIT raise an IRQ on the BSP, where it performs the logic needed to increment the counter (writing COUNTER_MAIN, LAST_TSC, and RATE_TSC). But what if I need to sample the system time from another CPU? I could end up reading in between the writes to COUNTER_MAIN and LAST_TSC, getting a result that's up to '1000/X' ms ahead, and a reading immediately after could be less than the older reading (or a race condition?). I was thinking of making a sort of 'swap chain' where there are two sets of counters, swapped after each update, so we never write to the set of counters that might be sampled.
It will depend on the hardware available. I think using the HPET for system time is optimal: it is a global resource and can therefore be read from any processor core and still give the same result. The only difference in a multicore solution is that updating system time must use a spinlock. If the HPET is not available and you use the TSC, things get more problematic, since the TSC is a per-core resource that cannot be assumed to be consistent regardless of which core it is read from.
Timers in a multicore system are best implemented per processor core, for instance using the APIC timer. If you only have the PIT, then you must make timers global, but then OTOH, such systems typically are single-core anyway.
zberry7 wrote: 2) I could(?) send the IRQ to each CPU in the system using one of the I/O APIC entries, although I'm not sure how to do this (0xFF for the destination, like with IPIs?), and the documentation isn't very clear to me. This would ensure there are no issues with data races, bad readings during an update, or anything of that nature. But it requires extra overhead, since each CPU has to run the IRQ and find its own set of counters, and reading system time would also require getting a unique identifier and calculating where the counters are in memory.
You shouldn't use IRQs for system time, and timers should be per core, so their IRQs will trigger per core too. If you have no per-core timer, you should set the system timer to trigger on the BSP only.
Re: Implementation of system time
klange wrote: In the recent re-write of my kernel I opted to assume a constant TSC rate and calibrate that on startup through a PIT one-shot - no interrupts involved. The HPET as a timing source has been widely considered problematic; TSCs have been constant for over a decade, and they're zero-cost and have gigahertz precision available.
I believe you can use CPUID to check if the TSC is 'invariant' (or whatever the correct adjective is), but does this mean it won't stop or change rate when going into a deep power-saving mode as well?
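For reference, the check I had in mind is something like this (assuming I'm reading the manuals right: CPUID leaf 0x80000007, EDX bit 8, is the invariant-TSC flag on both Intel and AMD):

Code:
#include <stdint.h>
#include <stdbool.h>

static bool tsc_is_invariant(void)
{
    uint32_t eax, ebx, ecx, edx;

    /* First check that extended leaf 0x80000007 exists at all. */
    __asm__ volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(0x80000000u));
    if (eax < 0x80000007u)
        return false;

    __asm__ volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(0x80000007u));
    return edx & (1u << 8);  /* invariant TSC */
}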
I was not accounting for the fact that each CPU will have a different TSC value, so if I wanted to use the TSC alone for wall-clock time, this would require some method to sync the TSCs of all CPUs, correct? How did you overcome this issue?
And my only issue with using the PIT counter is the reduced precision (~800 ns, I believe). The HPET seems like a decent solution with a precision of 100 ns or better, but it might not be present (at least it is not on my dev VM).
Edit: Sorry I edited my post after you replied!
Last edited by zberry7 on Mon Jun 07, 2021 7:30 am, edited 2 times in total.
Re: Implementation of system time
zberry7 wrote: I believe you can use CPUID to check if the TSC is 'invariant' (or whatever the correct adjective is), but does this mean it won't stop or change rate when going into a deep power-saving mode as well?
I think most hobby OSes won't bother with power-saving modes, but the TSC could also change frequency as the core frequency changes, and that's more problematic.
From Wikipedia:
There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. Therefore, a program can get reliable results only by limiting itself to run on one specific CPU.
That's actually a big problem, because you want to read out system time on all cores in the system and get an exact count, not just an IRQ count from a shared IRQ.
zberry7 wrote: Really the main task of the 20Hz IRQ I was setting up was to constantly recalibrate TSC to ensure any changes in rate are *mostly* accounted for, but if it's not needed that simplifies things a lot!
That won't help much unless you only want 50 ms of precision, something I consider quite inadequate.
zberry7 wrote: And reading the other replies it seems like the HPET/PIT counters might further simplify things since they're consistent across CPUs; in the responses there's conflicting advice on whether HPET is reliable though. Thank you everyone.
The main problem with the HPET is that you cannot rely on it always running at the same frequency, so you might need to convert counts to your own internal format. The PIT is better in that regard, since it always runs at the same frequency, but accessing legacy I/O ports might be a rather slow operation on modern CPUs.
As for calibration of time, you'd either do that with some timeserver (like NTP) or possibly by calculating drift from the CMOS real-time clock.
Re: Implementation of system time
zberry7 wrote: I believe you can use CPUID to check if the TSC is 'invariant' (or whatever the correct adjective is), but does this mean it won't stop or change rate when going into a deep power-saving mode as well?
rdos wrote: I think most hobby OSes won't bother with power-saving modes, but the TSC could also change frequency as the core frequency changes, and that's more problematic.
Constant TSC guarantees the rate of the TSC remains the same across frequency changes and certain power states (T, P, and C1), and has been true of all Intel CPUs since the Prescott Pentium 4 (~2005). Invariant TSC is a further guarantee that the TSC keeps ticking across C-states; everything since Nehalem supports this (~2010). Ultimately, this is why e.g. Linux supports many different clock sources and will dynamically "upgrade" to better ones depending on the runtime environment.
zberry7 wrote: I was not accounting for the fact that each CPU will have a different TSC value, so if I wanted to use the TSC alone for wall-clock time, this would require some method to sync the TSCs of all CPUs, correct? How did you overcome this issue?
Frankly, by not caring that much about multiple-CPU systems. It's a bit of a bogeyman to worry about clock synchronization across a pair of Xeons in a hobby OS. (And unless they're vastly different cores, in which case you have a whole other class of problems to deal with, they're probably close enough for a user-visible clock anyway?)
Re: Implementation of system time
klange wrote: Frankly, by not caring that much about multiple-CPU systems. It's a bit of a bogeyman to worry about clock synchronization across a pair of Xeons in a hobby OS. (And unless they're vastly different cores, in which case you have a whole other class of problems to deal with, they're probably close enough for a user-visible clock anyway?)
Probably a dumb question, but with a single CPU package with multiple cores, would the TSC value vary between the different cores?
Honestly, I'm probably overthinking this; I just wanted to get the best accuracy possible so I can use this for everything in the future: a timer/event system, scheduling, and whatnot. Basically I'm just trying to get a good 'nanoseconds since boot', and converting to UTC seems trivial once I have that.
Re: Implementation of system time
klange wrote: Constant TSC guarantees the rate of the TSC remains the same across frequency changes and certain power states (T, P, and C1), and has been true of all Intel CPUs since the Prescott Pentium 4 (~2005). Invariant TSC is a further guarantee that the TSC keeps ticking across C-states; everything since Nehalem supports this (~2010). Ultimately, this is why e.g. Linux supports many different clock sources and will dynamically "upgrade" to better ones depending on the runtime environment.
I think the most important issue is whether the TSC is ONE hardware resource in a multicore processor, or one per core. If you read out different TSC values from different cores, it becomes highly problematic to keep a shared system time with the TSC.
Also, the availability of this on AMD is important too; if they have a different implementation, it is still problematic to create a TSC-centered design.
Last edited by rdos on Tue Jun 08, 2021 1:08 am, edited 1 time in total.
Re: Implementation of system time
zberry7 wrote: Probably a dumb question, but with a single CPU package with multiple cores, would the TSC value vary between the different cores?
It could, because originally the TSC counted cycles of the CPU clock, and the frequency on different cores might be throttled differently. (If just one single-threaded program is actively using CPU time, one core might be running at 3 GHz while everything else is running at 1 GHz.) Depending on what era and manufacturer your CPU is from, you might or might not have a constant/invariant TSC.
Re: Implementation of system time
linguofreak wrote: It could, because originally the TSC counted cycles of the CPU clock, and the frequency on different cores might be throttled differently. Depending on what era and manufacturer your CPU is from, you might or might not have a constant/invariant TSC.
Kind of. I've decided that the TSC is not usable for system time, but if it is possible to identify particular CPUs where the TSC is shared and invariant, then I might decide to prioritize it. Currently, I prioritize the HPET highest, both for system time and as a timer resource, and fall back to the APIC timer or PIT when the HPET is not functional.
Re: Implementation of system time
I'll buy the possibility of the TSCs varying wildly on asymmetric systems with vastly different CPUs, and maybe even on otherwise-symmetric systems with a core in a separate package running at a fundamentally different clock rate. But as long as an individual core is capable of remaining consistent in its tick rate (see the previous post on constant vs. invariant in older and newer chips), I would think you can resolve that with per-CPU multipliers (and I'd further strongly suggest never using TSC values directly anyway, since they have so much variance). From my understanding of Intel's description of invariant TSC operation, though, I'm reasonably certain cores on the same physical chip will get the same counter, as they'll share the same base clock source: the "big thing" with invariant TSC was the switch to the base clock instead of the actual CPU clock (hence why power states, frequency scaling, etc. don't affect the TSC rate anymore).
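To be concrete about what I mean by per-CPU multipliers: something like the following, where each CPU's offset and rate would come from a calibration handshake at AP bring-up (a sketch under that assumption; the handshake itself needs fences and several sampling rounds to be trustworthy):

Code:
#include <stdint.h>

#define MAX_CPUS 64  /* arbitrary for the sketch */

/* Per-CPU correction so (local_tsc + offset) ticks at a common rate. */
struct percpu_tsc {
    int64_t  offset;          /* added to the local TSC reading */
    uint64_t ns_per_tsc_q32;  /* per-CPU rate, 32.32 fixed point */
};

static struct percpu_tsc cpu_tsc[MAX_CPUS];

uint64_t tsc_to_ns(unsigned cpu, uint64_t tsc)
{
    const struct percpu_tsc *t = &cpu_tsc[cpu];
    uint64_t adjusted = tsc + (uint64_t)t->offset;
    /* 128-bit multiply so a large TSC value doesn't overflow. */
    return (uint64_t)(((__uint128_t)adjusted * t->ns_per_tsc_q32) >> 32);
}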
As an anecdotal aside, Linux of course supports multiple clock sources - on modern systems the choices are generally between TSC, HPET, and ACPI Power Management timers - and Linux strongly prefers the TSC on SMP systems if it's invariant, with the HPET being the "second best" option (and there were, historically, some long discussions on the HPET being unfavorable - for precisely the reason rdos prefers it! Being a single shared device, it requires a bit of finesse to access it from multiple cores vs. the "zero-cost" local TSC).
Re: Implementation of system time
zberry7 wrote: And one final question: is a timer/event system, with time buckets and whatnot, actually needed, or at least is the development time justified? And why can't a scheduler perform any timing/event callbacks on its own, without having to implement a dedicated system for it?
Well, it can be implemented without one (my kernel does that, for example). I do have a dedicated system for time, but that's totally irrelevant to the scheduler, which sets up task-switch interrupts independently on its own.
Basically, you're mixing system time and timer events. You want to separate these, because precision and accuracy are two different things:
- timer event: the one that interrupts tasks and is used by the scheduler; has to be precise but not accurate
- system time: used for the wall clock; has to be accurate but not precise
Let me explain: for timer events you probably want microsecond or nanosecond precision, but it isn't an issue if it's not accurate, because you only use it for relative measurements (time passed since the last task switch). Nobody will complain (or notice, for that matter) if your tasks run a few nanoseconds more or less. Best to implement with the LAPIC, PIT, or HPET.
Now for the system time, you don't need more than one-second precision, but it must be accurate: users don't care about sub-second precision on the clock shown in the UI, but they expect the system time not to drift over time. It is also expected to keep counting while the computer is turned off. Best to implement with the RTC (or HPET with RTC emulation).
Another advantage of separating the two is that you can configure them differently. System time will always use regular intervals (one IRQ per second, probably; no more often), while the timer event can run in one-shot mode: you switch to a task and set the timer in one-shot mode for the time that task is allowed to run. After the next task switch you set it again, but now for the next task's timeslice, which might differ from the previous task's. This is called a tickless kernel.
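For example, with the LAPIC timer (a sketch; lapic_write() is your MMIO helper, and lapic_ticks_per_ns_q32 is assumed to come from a one-time calibration using the same divider):

Code:
#include <stdint.h>

#define LAPIC_LVT_TIMER  0x320  /* timer LVT entry */
#define LAPIC_TIMER_INIT 0x380  /* initial count */
#define LAPIC_TIMER_DIV  0x3E0  /* divide configuration */

extern void lapic_write(uint32_t reg, uint32_t val); /* MMIO helper (assumed) */
extern uint32_t lapic_ticks_per_ns_q32;              /* from calibration */

/* Called at every task switch: arm a one-shot for this task's timeslice. */
void preempt_arm(uint64_t timeslice_ns, uint8_t vector)
{
    uint32_t ticks = (uint32_t)((timeslice_ns * lapic_ticks_per_ns_q32) >> 32);
    lapic_write(LAPIC_TIMER_DIV, 0x3);     /* divide by 16 */
    lapic_write(LAPIC_LVT_TIMER, vector);  /* one-shot: mode bits 17:18 clear */
    lapic_write(LAPIC_TIMER_INIT, ticks ? ticks : 1); /* write starts countdown */
}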
Cheers,
bzt
Re: Implementation of system time
bzt wrote: Let me explain: for timer events you probably want microsecond or nanosecond precision, but it isn't an issue if it's not accurate, because you only use it for relative measurements (time passed since the last task switch). Nobody will complain (or notice, for that matter) if your tasks run a few nanoseconds more or less. Best to implement with the LAPIC, PIT, or HPET.
I generally agree. You also have the uncertainty of when a task will run on a preemptive kernel.
bzt wrote: Now for the system time, you don't need more than one-second precision, but it must be accurate: users don't care about sub-second precision on the clock shown in the UI, but they expect the system time not to drift over time. It is also expected to keep counting while the computer is turned off. Best to implement with the RTC (or HPET with RTC emulation).
I disagree. Timestamps are important and must have far better precision than one second. You use them to log communication or events, for instance, and then you want decent precision (at least milliseconds, but I have microseconds). There is also a small difference between waiting for 25 milliseconds and waiting until a certain time passes, and you want both with similar precision. Implementing timers you can poll for completion also benefits from a system time with decent precision.
So, you want a system clock that increases monotonically so you can wait for events in real time. You also want a wall clock, but that can be implemented by adding an offset to the system clock, which also allows implementing summer/winter time shifts. You can compensate for a slightly unstable system time by slowly modifying the offset, or by syncing it with the RTC, NTP, or some other time reference.
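In sketch form (system_time_read() being the monotonic read-out from earlier; the slew factor is arbitrary):

Code:
#include <stdint.h>

extern uint64_t system_time_read(void); /* monotonic ns since boot */

static uint64_t boot_to_utc_ns;  /* offset from monotonic time to UTC */

/* The monotonic clock never jumps; the wall clock is just an offset on top. */
uint64_t wall_clock_ns(void)
{
    return system_time_read() + boot_to_utc_ns;
}

/* RTC/NTP adjustment: slew the offset gradually instead of stepping it,
 * so the wall clock never appears to run backwards. */
void wall_clock_adjust(int64_t error_ns)
{
    boot_to_utc_ns += error_ns / 128;  /* correct 1/128 of the error per sync */
}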
Re: Implementation of system time
klange wrote: As an anecdotal aside, Linux of course supports multiple clock sources - on modern systems the choices are generally between TSC, HPET, and ACPI Power Management timers - and Linux strongly prefers the TSC on SMP systems if it's invariant, with the HPET being the "second best" option (and there were, historically, some long discussions on the HPET being unfavorable - for precisely the reason rdos prefers it! Being a single shared device, it requires a bit of finesse to access it from multiple cores vs. the "zero-cost" local TSC).
I suspect our different preferences are based on differences in design. An important goal in my design is to have a fixed-frequency "ticks" resource that can always be counted on to exist; therefore, my system time increments at the frequency of the PIT (something like 1.193 MHz). I suspect Linux instead puts the burden of adapting to the reference on the application developer, who needs to figure out which frequency the TSC runs at. A definite advantage of the TSC is that applications can read it directly without syscalls, but I have an idea for how to get that in my design too: mapping a page in application memory to the system time counter. This could then be used to implement timers in userspace without syscalls. However, the resolution would then be in the millisecond range, since the counter would be updated by timers in the kernel, and the preemption timeout is one millisecond.
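The mapped page could look something like this (a sketch of the idea, not an actual implementation; a seqlock-style counter keeps the read consistent without syscalls):

Code:
#include <stdint.h>
#include <stdatomic.h>

/* Layout of the page the kernel maps read-only into each process. */
struct time_page {
    _Atomic uint32_t seq;    /* kernel makes this odd while updating */
    uint64_t ns_since_boot;  /* written by the kernel timer path */
};

/* Userspace read: retry if the kernel updated the page mid-read. */
uint64_t user_time_ns(volatile struct time_page *tp)
{
    uint32_t s;
    uint64_t ns;
    do {
        s = atomic_load_explicit(&tp->seq, memory_order_acquire);
        ns = tp->ns_since_boot;
        atomic_thread_fence(memory_order_acquire);
    } while ((s & 1) || s != atomic_load_explicit(&tp->seq, memory_order_relaxed));
    return ns;
}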