Lockless 32-bit timer to 64-bit counter

OSwhatever · Post by **OSwhatever** » Sat Mar 09, 2013 3:41 pm

There are many systems that only provide a HW 32-bit free running counter but internally in my OS the timer type is 64-bits. Of course this is not a problem when you can detect the wrap around case. In in order avoid problems with preemption I now lock the timer reading so that there is no condition where several CPUs tries updated that hsb part at the same time or it was preempted in that part. So basically it looks like this:

Code: Select all

SavedIrqStatus_t saveStatus = m_lock.LockDisableInterrupts();

uint32_t lsb = 0xffffffff - m_timerBase.Input32(TIMER1VALUE);

if(lsb < m_lastLsbVal)
{
        m_timerHsb += 1;
}

m_lastLsbVal = lsb;

m_lock.UnlockRestoreInterrupts(saveStatus);

This works but it has the problem that it needs to lock every time you need to read the time which happens not too seldom. I've noticed that this lock degrades performance quite a lot in some test cases. In some test cases, user processes will read the time continuously and they start to compete with the kernel scheduler that also needs to read the time and the result is severe degradation. If I remove the lock, the system behaves much better. So what I need is to find out a way to handle the wrap around case in a lockless way, but I can't really find a way to get it right. Do you know a good way?

Nessphoro · Post by **Nessphoro** » Sat Mar 09, 2013 4:02 pm

Well, dedicate one core to be the timer (I mean it is the only core that will update the counter) and then you can just use LOCK to get your ticks in a deterministic manner.

gravaera · Post by **gravaera** » Sat Mar 09, 2013 4:05 pm

Yo:

You can investigate multiple reader locks, or using atomic operations to read the time, and atomic increment to update the time, etc, etc. However you solve the issue of handling wrap-around is up to you, though.

--Peace out,
gravaera

rdos · Post by **rdos** » Sat Mar 09, 2013 4:06 pm

Maybe you could save the last reading (per core), and then add the difference to the last reading to a (per core) current time instead? This assumes cores read the timer often enough.

Besides, do you really need to disable interrupts for this operation? Wouldn't a typical spinlock (without disabling interrupts) work as well (provided you don't read the time from IRQs)?

Also, not all computers have a global 32-bit timer. Some older computers only have PIT-timer, while some others only have the per-core APIC timer.

OSwhatever · Post by **OSwhatever** » Sat Mar 09, 2013 4:17 pm

gravaera wrote:Yo:

You can investigate multiple reader locks, or using atomic operations to read the time, and atomic increment to update the time, etc, etc. However you solve the issue of handling wrap-around is up to you, though.

--Peace out,
gravaera

I've played around with atomic operation but I never get right, there is always a loophole somewhere. It's two values we have to play around with here, the last lsb value and the hsb value.

OSwhatever · Post by **OSwhatever** » Sat Mar 09, 2013 4:20 pm

rdos wrote:Maybe you could save the last reading (per core), and then add the difference to the last reading to a (per core) current time instead? This assumes cores read the timer often enough.

Besides, do you really need to disable interrupts for this operation? Wouldn't a typical spinlock (without disabling interrupts) work as well (provided you don't read the time from IRQs)?

Also, not all computers have a global 32-bit timer. Some older computers only have PIT-timer, while some others only have the per-core APIC timer.

Absolutely you have to disable the interrupt, that's is something basically have to do for almost everything in the kernel, disable interrupts+lock. Otherwise you will end up in deadlock situations because you might use these in the next thread or in the interrupt.

rdos · Post by **rdos** » Sun Mar 10, 2013 3:06 am

OSwhatever wrote:
rdos wrote:Maybe you could save the last reading (per core), and then add the difference to the last reading to a (per core) current time instead? This assumes cores read the timer often enough.

Besides, do you really need to disable interrupts for this operation? Wouldn't a typical spinlock (without disabling interrupts) work as well (provided you don't read the time from IRQs)?

Also, not all computers have a global 32-bit timer. Some older computers only have PIT-timer, while some others only have the per-core APIC timer.
Absolutely you have to disable the interrupt, that's is something basically have to do for almost everything in the kernel, disable interrupts+lock. Otherwise you will end up in deadlock situations because you might use these in the next thread or in the interrupt.

Yes, I forgot. If the operation cannot be accessed in scheduler or in IRQs, a better option is a semaphore, but this is not applicable in this case as it is used in scheduler.

But I'm still puzzled how this simple operation could affect performance so much? Maybe the code in your C-statements is not very optimal? And why save the IRQ-state? Can't you just use cli in the request and sti in release part? You really shouldn't support nesting spinlocks anyway!

Combuster · Post by **Combuster** » Sun Mar 10, 2013 3:47 am

Typically systems implement a compare-and-swap in bus width which means you can update such 64-bit counters at will, make sure there's only one thread actually applying the wraparound, and any readers can just as well grab the value in whole.

if there's no atomic 64-bit operation, you'll probably get to resort to things like the snapshot algorithm (which is wait-free, but at a price).

Brendan · Post by **Brendan** » Sun Mar 10, 2013 5:18 am

Hi,

rdos wrote:Maybe you could save the last reading (per core), and then add the difference to the last reading to a (per core) current time instead? This assumes cores read the timer often enough.

This is the right way to do it - no locks needed anywhere (but some sort of timer IRQ used to make sure the counter is read often enough). You can calculate the maximum amount of time between counter reads easily enough - "2**32/counter_frequency" or "2**32 * time_between_counter_updates". For example, if the hardware updates the counter every 100 ns (or at a rate of 10 MHz) then you'd have to make sure the counter is read at least once every 42949672960 us (or about every 7 minutes).

If an IRQ handler could read the time, or if a task switch can occur while the time is being read, then you would have to disable IRQs (e.g. "pushfd; cli; ...; popfd"), but there's no need for anything more expensive than that.

OSwhatever wrote:Also, not all computers have a global 32-bit timer. Some older computers only have PIT-timer, while some others only have the per-core APIC timer.

In that case you use something radically different (none of the code for "global 32-bit counter" needs to be the same as any of the code for "PIT", or any of the code for "TSC", or...).

Cheers,

Brendan

OSwhatever · Post by **OSwhatever** » Sun Mar 10, 2013 11:58 am

Brendan wrote:Hi,

This is the right way to do it - no locks needed anywhere (but some sort of timer IRQ used to make sure the counter is read often enough). You can calculate the maximum amount of time between counter reads easily enough - "2**32/counter_frequency" or "2**32 * time_between_counter_updates". For example, if the hardware updates the counter every 100 ns (or at a rate of 10 MHz) then you'd have to make sure the counter is read at least once every 42949672960 us (or about every 7 minutes).

If an IRQ handler could read the time, or if a task switch can occur while the time is being read, then you would have to disable IRQs (e.g. "pushfd; cli; ...; popfd"), but there's no need for anything more expensive than that.

In that case you use something radically different (none of the code for "global 32-bit counter" needs to be the same as any of the code for "PIT", or any of the code for "TSC", or...).

Cheers,

Brendan

At a first glance this is the best solution as it only alters CPU private variables and has no locking. However, as it is CPU private data it also means that I have to disable the interrupts while I retrieve the CPU specific data and updating the counter otherwise there can lead to errors due to preemption and also that the thread can migrate to another CPU. I played around with 64-bit cmpxchg and I finally made lockless version that I think should hold water.

In both cases you need to visit the timer often enough so that you don't miss a wrap, though.

Brendan · Post by **Brendan** » Sun Mar 10, 2013 8:54 pm

Hi,

OSwhatever wrote:
Brendan wrote:This is the right way to do it - no locks needed anywhere (but some sort of timer IRQ used to make sure the counter is read often enough). You can calculate the maximum amount of time between counter reads easily enough - "2**32/counter_frequency" or "2**32 * time_between_counter_updates". For example, if the hardware updates the counter every 100 ns (or at a rate of 10 MHz) then you'd have to make sure the counter is read at least once every 42949672960 us (or about every 7 minutes).

If an IRQ handler could read the time, or if a task switch can occur while the time is being read, then you would have to disable IRQs (e.g. "pushfd; cli; ...; popfd"), but there's no need for anything more expensive than that.

In that case you use something radically different (none of the code for "global 32-bit counter" needs to be the same as any of the code for "PIT", or any of the code for "TSC", or...).
At a first glance this is the best solution as it only alters CPU private variables and has no locking. However, as it is CPU private data it also means that I have to disable the interrupts while I retrieve the CPU specific data and updating the counter otherwise there can lead to errors due to preemption and also that the thread can migrate to another CPU. I played around with 64-bit cmpxchg and I finally made lockless version that I think should hold water.

In both cases you need to visit the timer often enough so that you don't miss a wrap, though.

Imagine 2 CPUs both trying to "CMPXCHG" the same location. For "best case" the first CPU fetches the cache line, reads some data from it, then makes the cache line "exclusive" and modifies that cache line; then the second CPU does the same thing. For "worst case" the second CPU fetches the cache line and reads some data from it, but then the cache line is evicted (due to the first CPU making it "exclusive" and modifying it), then the second CPU tries to do "CMPXCHG" and has to fetch the cache line a second time, and sees that data was modified and has to retry.

Now imagine 16 CPUs that all do "CMPXCHG" frequently - that poor little cache line is going to be bouncing all over the place. The time taken to transfer the cache line between CPUs (plus the retries needed when other CPUs have modified it between the original read and the CMPXCHG) adds up to a scalability problem. The more CPUs you've got the worse it gets.

For single-CPU, disabling (and then enabling) IRQs is probably just as fast as CMPXCHG; but there's no scalability problems at all, so (for multi-CPU) it's much much better.

Cheers,

Brendan

xenos · Post by **xenos** » Mon Mar 11, 2013 1:30 am

Brendan wrote:Imagine 2 CPUs both trying to "CMPXCHG" the same location.

Is it really the same location? I thought this method uses one counter per CPU?

OSDev.org

Lockless 32-bit timer to 64-bit counter

Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter

Re: Lockless 32-bit timer to 64-bit counter