Implementing a lock using non-cacheable memory

Implementing a lock using non-cacheable memory

Post by u9012063 »

Hi folks,

I'm a student at Stony Brook University trying to implement a lock (e.g. a spin_lock) in my system using non-cacheable memory. Below is my idea; any comments are much appreciated!

I have a system with 8 GB of RAM, an 8-core x86 Xeon CPU, and a special *device* on my PCIe bus that can also issue reads/writes to the memory (RAM). The traditional spin_lock, which uses test-and-set, does not work here because the *device* is not in the same cache-coherent domain as the 8 CPU cores. My proposal is to make the address of the lock non-cacheable, so that neither the Xeon cores nor the special device caches the lock. Then it should not matter that the *device* is outside the coherence domain. (Is this correct?)

Analysis Scenario:
Assume Xeon core1 and the device are contending/busy-looping for the lock. At the moment the lock is freed by core2, because the lock is non-cacheable, neither core1 nor the device has it cached. Both core1 and the device immediately issue a "test-and-set" to the address of the lock. Core1's request comes over the memory bus, while the device's request comes over the PCIe bus to the northbridge and then to the memory bus. I think "test-and-set" is an atomic instruction, so either core1 or the device wins the bus and successfully sets the lock.

I read the Linux kernel's implementation of test_and_set_bit; all of these functions add a LOCK prefix. My understanding is that x86 LOCK flushes the cache, and since my lock is non-cacheable, it should make no difference.
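For concreteness, here is a minimal sketch of the kind of test-and-set spin lock I have in mind, written with GCC's __atomic builtins (which compile to a LOCK-prefixed instruction on x86). The names are mine, and mapping the lock word into a non-cacheable page is assumed to be done elsewhere:

Code:

/* Minimal test-and-set spin lock sketch (hypothetical names). */
typedef struct {
    volatile unsigned char locked;   /* assumed to live in a UC-mapped page */
} naive_spinlock_t;

static void naive_spin_lock(naive_spinlock_t *l)
{
    /* __atomic_test_and_set emits a LOCK-prefixed xchg on x86 */
    while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
        ;   /* busy-loop until the current holder releases */
}

static void naive_spin_unlock(naive_spinlock_t *l)
{
    __atomic_clear(&l->locked, __ATOMIC_RELEASE);
}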

Another way to put this question is: can we implement a lock in SMP using memory located in an MMIO (non-cacheable) region?

Thank you for any comments.
William (Cheng-Chun)

Re: Implementing a lock using non-cacheable memory

Post by thepowersgang »

If the device can see and honors the LOCK signal, then yes, you could. But there are lockless algorithms you could use instead (which is what existing hardware does).

Re: Implementing a lock using non-cacheable memory

Post by Brendan »

Hi,
u9012063 wrote:Another way to put this question is: can we implement a lock in SMP using memory located in an MMIO (non-cacheable) region?
Memory-mapped IO is very different from a "device that reads/writes the memory (RAM)". In this case, the device's memory-mapped IO area would need to be uncached.

You do not want CPUs contending for a "hardware lock" in the (uncached) MMIO area. If you do this (e.g. all CPUs fighting for the "hardware lock") you will flood the PCI bus with unnecessary traffic and cripple the performance of the entire computer (including anything trying to use the PCI bus for things that have nothing to do with your device).

Instead, you'd want a "software lock" (in cached RAM) so that only a CPU that has successfully acquired the "software lock" attempts to access the "hardware lock" in the (uncached) MMIO area. This means that worst case is one CPU contending with the device (not many CPUs contending with each other and the device).
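A rough sketch of the layering (made-up names; it also assumes, for the moment, that the "hardware lock" really can be acquired with an atomic test-and-set over the bus, which the rest of this thread questions):

Code:

/* Two-level locking sketch (hypothetical names). */
static volatile unsigned char soft_lock;    /* ordinary cached, coherent RAM */
static volatile unsigned char *hard_lock;   /* points into the uncached MMIO area */

void device_acquire(void)
{
    /* All local CPUs contend here; MESI keeps this spinning cheap. */
    while (__atomic_test_and_set(&soft_lock, __ATOMIC_ACQUIRE))
        ;
    /* At most one local CPU ever touches the uncached MMIO word, so the
     * PCI bus only ever carries one CPU's lock traffic. */
    while (__atomic_test_and_set(hard_lock, __ATOMIC_ACQUIRE))
        ;
}

void device_release(void)
{
    __atomic_clear(hard_lock, __ATOMIC_RELEASE);
    __atomic_clear(&soft_lock, __ATOMIC_RELEASE);
}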

Also note that thepowersgang is entirely correct - if the device isn't extremely badly designed, then there will be a way to do this with atomic accesses. You may or may not still need the "software lock" (in cached RAM); but should never need the "hardware lock" in the MMIO area.
u9012063 wrote:Analysis Scenario:
Assume Xeon core1 and the device are contending/busy-looping for the lock. At the moment the lock is freed by core2, because the lock is non-cacheable, neither core1 nor the device has it cached. Both core1 and the device immediately issue a "test-and-set" to the address of the lock. Core1's request comes over the memory bus, while the device's request comes over the PCIe bus to the northbridge and then to the memory bus. I think "test-and-set" is an atomic instruction, so either core1 or the device wins the bus and successfully sets the lock.
This is quite wrong. Because of the "MESI protocol" that ensures cache coherency, a CPU can perform a LOCKed operation on a cache line (by putting that cache line into the Exclusive state) without flushing the cache and (if the cache line was previously in the Modified state) without any extra bus activity at all.


Cheers,

Brendan

Re: Implementing a lock using non-cacheable memory

Post by u9012063 »

Hi Brendan and thepowersgang,

Thanks again for your explanation.
Brendan wrote:
u9012063 wrote:Another way to put this question is: can we implement a lock in SMP using memory located in an MMIO (non-cacheable) region?
Memory-mapped IO is very different from a "device that reads/writes the memory (RAM)". In this case, the device's memory-mapped IO area would need to be uncached.

You do not want CPUs contending for a "hardware lock" in the (uncached) MMIO area. If you do this (e.g. all CPUs fighting for the "hardware lock") you will flood the PCI bus with unnecessary traffic and cripple the performance of the entire computer (including anything trying to use the PCI bus for things that have nothing to do with your device).
I have two x86 hosts (host1 and host2) connected via an NTB (non-transparent bridge). Host2 can access host1's memory region through the NTB device, and the lock is located in host1's memory. From host2's perspective, the address of the lock is in a BAR, so I have no choice but to make the lock memory uncacheable.

Brendan wrote: Instead, you'd want a "software lock" (in cached RAM) so that only a CPU that has successfully acquired the "software lock" attempts to access the "hardware lock" in the (uncached) MMIO area. This means that worst case is one CPU contending with the device (not many CPUs contending with each other and the device).
I see. So this reduces the problem to one CPU and one device contending for the "hardware lock".
Brendan wrote: Also note that thepowersgang is entirely correct - if the device isn't extremely badly designed, then there will be a way to do this with atomic accesses. You may or may not still need the "software lock" (in cached RAM); but should never need the "hardware lock" in the MMIO area.
Sorry, I'm not sure I get the idea. Are you saying that it's possible to do this using a software lock (in cached RAM)?
My NTB device simply generates PCIe reads/writes to the lock's memory region in host1. Traditionally, a lock operation needs a "read-modify-write", which on the local host means:
1. LOCK prefix
2. Read from memory
3. Write to the memory
How do I check whether my device is correctly designed and can do "read-modify-write" atomically? Is a device capable of issuing a LOCKed operation?
Or, if we reduce the problem to one CPU, one device, and one MMIO uncacheable lock, is it possible to have a correctly working lock (no matter how bad the performance might be)?
u9012063 wrote:Analysis Scenario:
Assume Xeon core1 and the device are contending/busy-looping for the lock. At the moment the lock is freed by core2, because the lock is non-cacheable, neither core1 nor the device has it cached. Both core1 and the device immediately issue a "test-and-set" to the address of the lock. Core1's request comes over the memory bus, while the device's request comes over the PCIe bus to the northbridge and then to the memory bus. I think "test-and-set" is an atomic instruction, so either core1 or the device wins the bus and successfully sets the lock.
Brendan wrote: This is quite wrong. Because of the "MESI protocol" that ensures cache coherency, a CPU can perform a LOCKed operation on a cache line (by putting that cache line into the Exclusive state) without flushing the cache and (if the cache line was previously in the Modified state) without any extra bus activity at all.
Thanks. Now I understand that a LOCKed operation might happen entirely within the cache without any bus activity. However, are you saying that even if we use an MMIO "uncacheable" lock, the lock could still end up in a cache line and be subject to the cache-coherency protocol?

Thanks again.
Regards,
William

Re: Implementing a lock using non-cacheable memory

Post by Owen »

The most important question: Does your PCI-E card have any way to assert the lock signal over the PCI-Express bus?

There is no LOCK(#) signal for conventional PCI, so for that it is impossible. Was this feature added with PCI-E? If it wasn't, then you're screwed.

Also: Why is cached vs uncached memory relevant from the host processor side? On x86 platforms, device accesses to memory are snooped and passed through caches.

Re: Implementing a lock using non-cacheable memory

Post by Brendan »

Hi,
u9012063 wrote:I have two x86 hosts (host1 and host2) connected via an NTB (non-transparent bridge). Host2 can access host1's memory region through the NTB device, and the lock is located in host1's memory. From host2's perspective, the address of the lock is in a BAR, so I have no choice but to make the lock memory uncacheable.
This is different again. From host1's perspective the device is not MMIO and does reads and writes to normal RAM on host1; and from host2's perspective the device is MMIO and doesn't touch normal RAM on host2.

In this case, regardless of what you do, host2 can't expect "LOCK" to work; you have to find a way that only uses atomic reads and atomic writes (with no "atomic read-modify-write" at all) on host2, and the area will have to be uncached on host2. Also, because host2 can only do atomic reads and atomic writes, there's no reason for host1 to treat the RAM as uncached.
u9012063 wrote:
Brendan wrote:Also note that thepowersgang is entirely correct - if the device isn't extremely badly designed, then there will be a way to do this with atomic accesses. You may or may not still need the "software lock" (in cached RAM); but should never need the "hardware lock" in the MMIO area.
Sorry, I'm not sure I get the idea. Are you saying that it's possible to do this using a software lock (in cached RAM)?
It's just a plain old lock that has nothing to do with the device (that's used to ensure only one CPU can attempt to interact with the device at a time). Of course now that I know there's 2 computers involved I need to clarify - you'd have one "software lock" on host1 (to ensure only one CPU on host 1 can access the device at a time), and another "software lock" on host2 (to ensure that only one CPU on host 2 can access the device at a time). However, these software locks won't prevent one CPU on host 1 and another CPU on host 2 from using the device at the same time (that would've been what the "hardware lock" was supposed to be for, before I found out it's not a normal device).

Now the problem is that I have no idea what you're using the NTB for. Depending on what you're actually doing, you might be able to split the area into 2 separate areas (e.g. one area for data written by host1 and read by host2, and another area for the reverse) and implement some sort of signalling to determine when it's safe for others to read. You might also be able to split the area into many small areas (e.g. one area for each CPU on host1, and another area for each CPU on host2) to get rid of the need for both of the "software locks" (each CPU can write to its own area, with some sort of signalling to determine when it's safe for others to read).
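As a sketch of the "signalling" idea (illustrative names and layout; note that it uses only atomic loads and stores, never a LOCKed read-modify-write, so it doesn't care that the bridge can't carry LOCK):

Code:

#include <stdint.h>
#include <string.h>

/* One direction of the split: host1 writes this area, host2 only reads it.
 * A sequence counter (seqlock-style) signals when the payload is stable. */
struct mailbox {
    volatile uint32_t seq;        /* odd = write in progress, even = stable */
    unsigned char payload[60];
};

void mailbox_write(struct mailbox *m, const void *src, size_t len)  /* host1 */
{
    uint32_t s = m->seq;
    __atomic_store_n(&m->seq, s + 1, __ATOMIC_RELAXED);  /* mark busy (odd) */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    memcpy(m->payload, src, len);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    __atomic_store_n(&m->seq, s + 2, __ATOMIC_RELAXED);  /* mark stable (even) */
}

int mailbox_read(struct mailbox *m, void *dst, size_t len)          /* host2 */
{
    uint32_t s1 = __atomic_load_n(&m->seq, __ATOMIC_RELAXED);
    if (s1 & 1)
        return 0;                        /* writer is mid-update; retry later */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    memcpy(dst, m->payload, len);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    uint32_t s2 = __atomic_load_n(&m->seq, __ATOMIC_RELAXED);
    return s1 == s2;                     /* valid only if seq didn't change */
}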


Cheers,

Brendan

Re: Implementing a lock using non-cacheable memory

Post by dschatz »

You may not be able to depend on hardware synchronization between the two hosts, but this doesn't mean all hope is lost. You can use Dekker's algorithm to provide mutual exclusion between the two hosts as long as they share memory. If you might have multiple threads on each host, then you can use a standard lock per host which synchronizes access to the inter-host lock.
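For reference, a sketch of Dekker's algorithm with my own variable names; all three shared variables would live in the memory window both hosts can see, and only atomic loads and stores are used:

Code:

/* Dekker's algorithm for exactly two contenders (ids 0 and 1). */
volatile int wants[2];   /* wants[i]: contender i wants the critical section */
volatile int turn;       /* whose turn it is when both want in */

void dekker_lock(int self)
{
    int other = 1 - self;
    /* SEQ_CST matters: x86 can reorder our store past our load otherwise. */
    __atomic_store_n(&wants[self], 1, __ATOMIC_SEQ_CST);
    while (__atomic_load_n(&wants[other], __ATOMIC_SEQ_CST)) {
        if (__atomic_load_n(&turn, __ATOMIC_SEQ_CST) != self) {
            __atomic_store_n(&wants[self], 0, __ATOMIC_SEQ_CST); /* back off */
            while (__atomic_load_n(&turn, __ATOMIC_SEQ_CST) != self)
                ;                                    /* wait for our turn */
            __atomic_store_n(&wants[self], 1, __ATOMIC_SEQ_CST); /* retry */
        }
    }
}

void dekker_unlock(int self)
{
    __atomic_store_n(&turn, 1 - self, __ATOMIC_SEQ_CST); /* hand over the turn */
    __atomic_store_n(&wants[self], 0, __ATOMIC_SEQ_CST);
}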

Re: Implementing a lock using non-cacheable memory

Post by u9012063 »

Hi,

Thanks for all the valuable feedback. I'll summarize my understanding below:
I have a system with two x86 servers connected with a PCIe cable and an NTB (Non-Transparent Bridge):

| Host2 | --------- | NTB | --------- | Host1 |

Host2 RAM: soft_lock; Host1 RAM: soft_lock, hard_lock

The magic of the NTB is that host2 can read/write memory addresses in host1 by issuing reads/writes to the BAR address.
Reads/writes from host2 get translated and land in host1's RAM. Of course, since it's in a BAR, host2 maps the address as uncacheable.

Multiple processes running on host1 and host2 want to share a lock, so I need a global, across-machine lock. The "software lock" (cacheable) suggested by Brendan solves the intra-host contention issue, so only one CPU in each host holds the "software lock", indicated as soft_lock above. To prevent a CPU from host2 and a CPU from host1 from entering the critical section at the same time, we need a global "hardware lock" (uncacheable), denoted hard_lock above. From host2's perspective, the hard_lock is in its BAR, so it is uncacheable.

Further, because read-modify-write requires the LOCK# signal, when host2 attempts to acquire the lock, host2's LOCK# won't assert on host1's memory bus, and PCI-E does not support issuing a LOCK# operation. As Owen points out, even though writes issued from host2 to host1's hard_lock are snooped and passed through the caches, this is not enough to guarantee read-modify-write (because there is no LOCK#), but it does guarantee atomic reads and atomic writes.

As suggested by Brendan, I could divide the memory into exclusive sections (one region that host1 writes and host2 reads, and vice versa). Or, as proposed by dschatz, since there are only two CPUs contending for the lock, I could implement the lock using Dekker's algorithm. I will carefully analyze both solutions; thanks for all the comments, and please correct me if I've misunderstood!

Regards,
William (Cheng-Chun Tu)

Re: Implementing a lock using non-cacheable memory

Post by Combuster »

I would wonder why you want a mutex-type relationship with a hardware device in the first place, rather than a message-passing architecture in any of its forms.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]

Re: Implementing a lock using non-cacheable memory

Post by dschatz »

u9012063 wrote: As suggested by Brendan, I could divide the memory into exclusive sections (one region that host1 writes and host2 reads, and vice versa). Or, as proposed by dschatz, since there are only two CPUs contending for the lock, I could implement the lock using Dekker's algorithm. I will carefully analyze both solutions; thanks for all the comments, and please correct me if I've misunderstood!
The two approaches are orthogonal. You can partition the memory to avoid the need for synchronization and I strongly recommend doing so if possible. But if you really do need synchronization, it can be achieved using well known algorithms.

Re: Implementing a lock using non-cacheable memory

Post by u9012063 »

Combuster wrote:I would wonder why you want a mutex-type relationship with a hardware device in the first place, rather than a message-passing architecture in any of its forms.
Message passing is indeed another alternative when the system does not have globally shared memory. It then becomes necessary to move data from one local memory to another by means of send/receive/interrupt (please correct me if I'm wrong). Since the NTB is a special device providing shared memory across machines, we are exploring the possibility of a mutex-type lock. Thank you.

Re: Implementing a lock using non-cacheable memory

Post by u9012063 »

dschatz wrote:
u9012063 wrote: As suggested by Brendan, I could divide the memory into exclusive sections (one region that host1 writes and host2 reads, and vice versa). Or, as proposed by dschatz, since there are only two CPUs contending for the lock, I could implement the lock using Dekker's algorithm. I will carefully analyze both solutions; thanks for all the comments, and please correct me if I've misunderstood!
The two approaches are orthogonal. You can partition the memory to avoid the need for synchronization and I strongly recommend doing so if possible. But if you really do need synchronization, it can be achieved using well known algorithms.
Thanks for reminding me. Actually, my use case is a request queue (a Linux kernel linked list) whose elements are frequently sent to a storage device for processing. In the normal case, the queue has a lock to prevent multiple cores from accessing it at the same time. Assume this queue is in host1's RAM; now I want this queue to be shareable with host2 so that host2 can safely insert elements into host1's queue. As a result, I need a lock that works across machines.

In fact, it's possible that we will have more than 2 hosts, so Dekker's algorithm might not be sufficient. We are looking into generalizations of it:
http://authors.library.caltech.edu/2695 ... _TR_85.pdf
Or do you know of any other algorithms applicable here (ones that support N nodes and need no read-modify-write)?
Thank you.

Re: Implementing a lock using non-cacheable memory

Post by Combuster »

host2 can safely insert elements into host1's queue. As a result, I need a lock
"Need" is the wrong term. I'd give host2 a dedicated queue in host1's memory so there's never any write contention, and do away with the lock.
Or do you know of any other algorithms applicable here?
See your textbook (or Wikipedia). It starts with an L.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]

Re: Implementing a lock using non-cacheable memory

Post by dschatz »

u9012063 wrote: Or do you know of any other algorithms applicable here (ones that support N nodes and need no read-modify-write)?
The most common algorithm is Lamport's bakery algorithm, which works with N nodes as long as N is constant and known ahead of time.

Another algorithm from Lamport that works well under low contention: http://research.microsoft.com/en-us/um/ ... -mutex.pdf
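For reference, a sketch of the bakery algorithm with illustrative names; the arrays live in the shared memory window and only atomic loads and stores are used. Note that the tickets grow without bound, so a real implementation has to consider overflow:

Code:

#define N_NODES 4   /* must be fixed and known ahead of time */

volatile int choosing[N_NODES];
volatile unsigned number[N_NODES];   /* 0 = not interested */

void bakery_lock(int self)
{
    /* Pick a ticket one larger than any ticket currently visible. */
    __atomic_store_n(&choosing[self], 1, __ATOMIC_SEQ_CST);
    unsigned my = 0;
    for (int i = 0; i < N_NODES; i++) {
        unsigned n = __atomic_load_n(&number[i], __ATOMIC_SEQ_CST);
        if (n > my)
            my = n;
    }
    my += 1;
    __atomic_store_n(&number[self], my, __ATOMIC_SEQ_CST);
    __atomic_store_n(&choosing[self], 0, __ATOMIC_SEQ_CST);

    for (int i = 0; i < N_NODES; i++) {
        while (__atomic_load_n(&choosing[i], __ATOMIC_SEQ_CST))
            ;   /* wait until node i has finished picking its ticket */
        for (;;) {
            /* wait while node i holds a smaller ticket (ties broken by id) */
            unsigned n = __atomic_load_n(&number[i], __ATOMIC_SEQ_CST);
            if (n == 0 || n > my || (n == my && i >= self))
                break;
        }
    }
}

void bakery_unlock(int self)
{
    __atomic_store_n(&number[self], 0, __ATOMIC_SEQ_CST);
}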