Efficient timekeeping on SMP systems

Korona · Post by **Korona** » Wed Mar 14, 2018 12:23 pm

I currently consider adding timekeeping functionality to my OS. This is a brainstorming thread to discuss the issues that arise during this implementation. Note that I do not want to discuss hardware details like APIC vs. HPET vs. TSC.

The most difficult part of the timekeeping functionality is designing a reliable monotonic clock (similar to Linux CLOCK_MONOTONIC_RAW). Once you have a monotonic clock, everything else (i.e. the realtime clocks) can be derived from that. The main problem here is that modern systems (e.g. NUMA systems) do not have a central clock source. Different CPUs might have different sources for their APIC and TSC counters and different latencies when accessing external clocks like the HPET.

How can we design a reliable monotonic clock in this scenario? We have to take into account that some applications want to query such clocks thousands to hundreds of thousands times per second¹, so the implementation needs to be very efficient. The first idea of implementing a highly efficient clock is using a local time source (like RDTSCP or the local APIC) and mapping an array of local-to-global time offsets plus an array of tick-to-nanosecond values into each process. These arrays would be indexed by CPU indices so that a thread could query its own CPU index, lookup the clock data and compute the global time. They would also reside in shared, usermode-read-only memory. That approach, however, is too simplistic. In particular, it easily breaks monotonicity. When clock data is not accurate enough, threads can observe a clock value C1, send it to another CPU via some fast channel (e.g. shared memory) and them observe another clock value C2 on the second CPU with C2 < C1, even though there is a synchronization operation (i.e. the shared memory read) between the two CPUs.

This brings me to the main question of this thread: How can we do better? A perfect solution (i.e. one that guarantees monotonicity under all circumstances and is still efficient enough) is probably impossible due to physical constraints. But surely things can be improved over this naive approach.

For the discussion, let us assume some basic hardware properties, that make our lives easier:

There is a efficient and accurate local clock source for each CPU. Also assume, that threads can identify the CPU they are on and the local clock value without races. In x86 terms, assume that RDTSCP and the constant TSC feature is available.
However, we cannot assume that clocks on different CPUs have either the same frequency or the same offset.
We also cannot accept solutions that require running times in the order of multiple microseconds, at least not in the average case.

Any ideas? How do other OSs solve this problem?

¹ This happens even for legitimate reasons. I worked with some HPC codes before that really want to be able to do limited profiling and tracing during production to analyze their own behavior on-the-fly (because running external debugging or instrumentation tools increases their running time to unacceptable levels).

Brendan · Post by **Brendan** » Wed Mar 14, 2018 10:46 pm

Hi,

Korona wrote:I currently consider adding timekeeping functionality to my OS. This is a brainstorming thread to discuss the issues that arise during this implementation. Note that I do not want to discuss hardware details like APIC vs. HPET vs. TSC.

I'm not sure that it's possible to discuss issues that arise during implementation without discussing the hardware details that are responsible for those issues.

Korona wrote:Any ideas?

Last year I started switching everything over to "time ranges" to properly cope with uncertainty - e.g. if you ask what the time is you might be told something like "I don't know exactly, but the time must be between 1234500 and 1234567".

For elapsed time, this means you end up with a calculation like:

Code: Select all

    elapsed.min = max(end.min - start.max, 0);
    elapsed.max = end.max - start.min;

The basic idea is that it's impossible to avoid a (hopefully small) amount of uncertainty, and providing "time ranges" allows software to manage the uncertainty in whatever way makes sense for that software. That might be simple averaging ("assumed_time = (time.min + time.max)/2;") or using "worst case" or "best case", or keeping track of uncertainty throughout all calculations (e.g. "speed.min = distance.min / elapsed.max; speed.max = distance.max / elapsed.min;").

Of course for some things the range can change (e.g. NTP, where the range gets narrower as more iterations are done); and for security purposes the kernel may/should deliberately make the range worse (e.g. subtract some random from "time.min" and add some random to "time.max" to intentionally increase the uncertainty and make timing side channels less reliable - e.g. at least as unreliable as a "roll your own counter by wasting an entire CPU to atomically increment a shared counter").

Note 1: in my opinion, "performance" (e.g. how quickly an attacker can successfully exploit your stupidity) is not a valid reason to allow user space code access to the RDTSC and RDTSCP instructions.

Note 2: If you need to know the strict order of events, then you probably shouldn't be using time at all (and should probably be using a counter instead, like "event_number = global_atomic_event_counter++;"), partly because this solves the problem caused when 2 or more things happen at the same time.

Note 3: For profiling, RDTSC is the wrong tool for the job. If performance monitoring counters are too slow then find out why and fix them.

Cheers,

Brendan

OSDev.org

Efficient timekeeping on SMP systems

Efficient timekeeping on SMP systems

Re: Efficient timekeeping on SMP systems