Efficient timekeeping on SMP systems
Posted: Wed Mar 14, 2018 12:23 pm
I am currently considering adding timekeeping functionality to my OS. This is a brainstorming thread to discuss the issues that arise during this implementation. Note that I do not want to discuss hardware details like APIC vs. HPET vs. TSC.
The most difficult part of the timekeeping functionality is designing a reliable monotonic clock (similar to Linux CLOCK_MONOTONIC_RAW). Once you have a monotonic clock, everything else (i.e. the realtime clocks) can be derived from that. The main problem here is that modern systems (e.g. NUMA systems) do not have a central clock source. Different CPUs might have different sources for their APIC and TSC counters and different latencies when accessing external clocks like the HPET.
How can we design a reliable monotonic clock in this scenario? We have to take into account that some applications want to query such clocks thousands to hundreds of thousands of times per second¹, so the implementation needs to be very efficient. The first idea for implementing a highly efficient clock is to use a local time source (like RDTSCP or the local APIC) and to map an array of local-to-global time offsets plus an array of tick-to-nanosecond values into each process. These arrays would be indexed by CPU index, so that a thread could query its own CPU index, look up the clock data and compute the global time. They would also reside in shared, usermode-read-only memory. That approach, however, is too simplistic. In particular, it easily breaks monotonicity. When the clock data is not accurate enough, a thread can observe a clock value C1, send it to another CPU via some fast channel (e.g. shared memory) and then observe another clock value C2 on the second CPU with C2 < C1, even though there is a synchronization operation (i.e. the shared memory read) between the two CPUs.
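To make the naive scheme concrete, here is a minimal sketch in C of what such a per-CPU clock read could look like. All names (clock_page, percpu_clock_data, ns_per_tick_q32) and the 32.32 fixed-point conversion are made up for illustration, not part of any existing ABI, and it assumes the kernel programs IA32_TSC_AUX with the CPU index so that RDTSCP returns it together with the TSC value.

[code]
/* Sketch only: hypothetical per-CPU clock data, assumed to be mapped
 * read-only into every process by the kernel. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp() */

#define MAX_CPUS 256

struct percpu_clock_data {
    int64_t  tsc_offset;      /* local TSC -> global ticks offset */
    uint64_t ns_per_tick_q32; /* tick -> nanosecond factor, 32.32 fixed point */
};

extern const struct percpu_clock_data clock_page[MAX_CPUS];

static inline uint64_t monotonic_ns(void)
{
    unsigned int cpu;
    /* RDTSCP reads the TSC and IA32_TSC_AUX (assumed to hold the CPU
     * index) in one instruction, so a migration cannot tear the pair. */
    uint64_t tsc = __rdtscp(&cpu);
    const struct percpu_clock_data *d = &clock_page[cpu];

    uint64_t ticks = tsc + (uint64_t)d->tsc_offset;
    /* 32.32 fixed-point multiply; overflow handling omitted for brevity. */
    return (ticks * d->ns_per_tick_q32) >> 32;
}
[/code]

This is exactly the scheme that breaks down in the handoff scenario above: two CPUs with slightly wrong offsets or scale factors can produce C2 < C1 across a shared-memory transfer.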
This brings me to the main question of this thread: How can we do better? A perfect solution (i.e. one that guarantees monotonicity under all circumstances and is still efficient enough) is probably impossible due to physical constraints. But surely things can be improved over this naive approach.
For the discussion, let us assume some basic hardware properties that make our lives easier:
- There is an efficient and accurate local clock source on each CPU. Also assume that threads can identify the CPU they are running on and read the local clock value without races. In x86 terms, assume that RDTSCP and the invariant (constant-rate) TSC feature are available.
- However, we cannot assume that clocks on different CPUs have either the same frequency or the same offset.
- We also cannot accept solutions whose running time is on the order of multiple microseconds, at least not in the average case.
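Under these assumptions, the problematic handoff from above can be written down directly. The following is a rough sketch, not a test I am proposing as part of the design: it reuses the hypothetical monotonic_ns() from the earlier sketch and assumes the two threads are pinned to different CPUs (pinning not shown). The release/acquire pair is the synchronization operation mentioned above, yet the assertion can still fail if the per-CPU calibration data is inaccurate.

[code]
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>
#include <pthread.h>

extern uint64_t monotonic_ns(void);      /* hypothetical clock from the sketch above */

static _Atomic uint64_t published = 0;   /* 0 = nothing published yet */

static void *sender(void *arg)           /* assumed pinned to CPU A */
{
    (void)arg;
    uint64_t c1 = monotonic_ns();
    atomic_store_explicit(&published, c1, memory_order_release);
    return NULL;
}

static void *receiver(void *arg)         /* assumed pinned to CPU B */
{
    (void)arg;
    uint64_t c1;
    while ((c1 = atomic_load_explicit(&published, memory_order_acquire)) == 0)
        ;                                /* spin until the value arrives */
    uint64_t c2 = monotonic_ns();
    assert(c2 >= c1);                    /* can fire with inaccurate calibration */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, sender, NULL);
    pthread_create(&b, NULL, receiver, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
[/code]

Any improved design should make that assertion hold (or at least bound how badly it can be violated) without blowing the per-query cost past the budget above.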
¹ This happens even for legitimate reasons. I have worked with HPC codes that really want to be able to do limited profiling and tracing in production to analyze their own behavior on the fly (because running external debugging or instrumentation tools increases their running time to unacceptable levels).