This is very platform-specific, but I'd like to get some idea of the latency involved in the memory-locking instructions.
The best case is basically:
CPU 1: "hey guys, I want to write memory/cache line #n"
other CPUs: "OK"
CPU 1: "done, I've got the new (dirty) cache line #n here, ask me if you need it"
And I guess it just doesn't relinquish the line as long as its local lock is held (lock xchg executing), so as long as it already has ownership of the dirty line it doesn't need to go through the whole protocol.
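For context, this is roughly the kind of loop I mean; a minimal sketch using GCC's atomic builtins (which compile down to an implicitly locked xchg on x86), with names of my own invention:

```c
/* Minimal test-and-set spinlock sketch (hypothetical names, GCC/Clang builtins).
 * __atomic_exchange_n on an int compiles to XCHG on x86, which is implicitly
 * LOCK-prefixed, so the CPU holds the cache line exclusively for the swap. */
static inline void spin_lock(int *locked)
{
    /* Keep swapping 1 in until we read back 0 (i.e. the lock was free).
     * Every failed attempt still pulls the line over in the dirty state. */
    while (__atomic_exchange_n(locked, 1, __ATOMIC_ACQUIRE) != 0)
        ;
}

static inline void spin_unlock(int *locked)
{
    /* A plain store with release ordering is enough to drop the lock on x86. */
    __atomic_store_n(locked, 0, __ATOMIC_RELEASE);
}
```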
But this must be a killer on global locks shared between CPUs. Has anyone got links to actual timing numbers? Something like "how much time does it take on architecture X to transfer a line from L1 cache to L1 cache between CPUs in a best-case scenario".
I've got a dual Opteron and an AMD X2 to test on, but I suspect just putting two CPUs fighting over a lock in a loop wouldn't be accurate, as hardware optimizations and bus "opportunity windows" would likely come into play in such an aggressive test.
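For the record, the kind of naive test I mean would look roughly like the sketch below (just an illustration, not measured data): two threads pinned to different cores bounce a flag through one cache line, and elapsed time over the number of handoffs gives a rough per-transfer figure. The core numbers, iteration count, and use of Linux/glibc pthread_setaffinity_np are my assumptions.

```c
/* Rough cache-line ping-pong sketch (assumes Linux + glibc, GCC/Clang).
 * Two threads pinned to different cores hand a flag back and forth;
 * elapsed time / (2 * ITERS) approximates one line handoff plus protocol cost.
 * Build: gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

/* Aligned to a 64-byte boundary to keep false sharing out of the picture. */
static _Alignas(64) int flag;

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *partner(void *arg)
{
    (void)arg;
    pin_to_cpu(1);                       /* arbitrary second core */
    for (int i = 0; i < ITERS; i++) {
        while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) != 1)
            ;                            /* wait for the main thread's token */
        __atomic_store_n(&flag, 0, __ATOMIC_RELEASE);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pin_to_cpu(0);
    pthread_create(&t, NULL, partner, NULL);

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++) {
        __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
        while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) != 0)
            ;                            /* wait until the other core took it back */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("~%.1f ns per handoff\n", ns / (2.0 * ITERS));
    return 0;
}
```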
memory locking latencies
- Combuster
Ideally you wouldn't want to wait at all, so making your code rely less on repeated locking is preferably the best solution.
Even then, there are some other things you might want to consider:
Reading from memory does not cause the cache line to enter a dirty state. If a lock is taken and threads only read its state, there are no cache synchronisation issues: the dirty line gets flushed to memory and all CPUs can share a copy in their caches without further synchronisation (until the lock is released and the lock is written once more).
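In other words, the usual refinement is to spin on a plain read and only issue the locked exchange once the lock looks free, so the line can sit shared in every cache while the lock is held; a sketch, again with my own names:

```c
/* Test-and-test-and-set sketch (hypothetical names, GCC/Clang builtins).
 * Waiters spin on a plain load, so the line can stay shared in every cache;
 * only when the lock looks free do they issue the locked XCHG. */
static inline void spin_lock_ttas(int *locked)
{
    for (;;) {
        /* Plain read: no write, so the line is not invalidated elsewhere. */
        while (__atomic_load_n(locked, __ATOMIC_RELAXED) != 0)
            ;
        /* Looks free: now do the real locked exchange. */
        if (__atomic_exchange_n(locked, 1, __ATOMIC_ACQUIRE) == 0)
            return;
    }
}
```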
But memory synchronisation is a bit tricky on Intel processors, as you can't simply force memory ordering around reads without either making the page uncacheable or using CPU-dependent instructions, whereas if you write, you can add a lock prefix and have something that works even on the most ancient of x86 processors.
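The classic form of that trick, as far as I know, is a lock-prefixed add of zero to something harmless like the top of the stack, which acts as a full barrier on any x86 that has the LOCK prefix (shown here as 32-bit GCC inline assembly; a sketch, not necessarily what you'd ship):

```c
/* Full memory barrier via a LOCK-prefixed no-op read-modify-write on the
 * stack; works on x86 CPUs that predate LFENCE/SFENCE/MFENCE. 32-bit form. */
static inline void legacy_mb(void)
{
    __asm__ __volatile__("lock; addl $0,(%%esp)" ::: "memory", "cc");
}
```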
But exact latencies are hard to come by, and at best they differ per processor series. Just remember that having a 'fight' over a resource takes more time than simply looking at it.
Combuster wrote: Ideally you wouldn't want to wait at all, so making your code rely less on repeated locking is preferably the best solution.
As hard as you try, you always come to a point where you need frequent locking/unlocking.
In my kernel I'm focusing on low latency at the cost of throughput, to the point where I've got a software-driven PCM DAC working in user space,
which needs very frequent locking/unlocking to keep latency as low as possible.
(Looks pointless, but it's a good way to measure ~20 µs latency without an expensive logic probe. Besides, if you want pure number-crunching power, just use Linux.)
There is a lot of information out there about cache associativity, pipeline stalls, and all that, but I could not find anything about multi-processor cache-to-cache communication delays.
I'm missing those (in my case rather important) numbers to figure out whether I can make it go any faster, as I have no idea how severe that potential bottleneck is.