OSDev.org

Posted: **Sun Mar 07, 2010 4:00 am**

I thought that it would be better to have read-write-locks, because I have many structures where many threads can read, but only 1 thread can write. So I implemented it, but don´t know if it will work the way I did it or if there is a better way to handle this.

Code: Select all

struct rwLock_t {
	uint32t volatile readers;
	uint32t volatile writer;
};

static inline uint32t rwLockAcquireRead(struct rwLock_t *lock) {
	uint32t res;
	
	asm volatile("0:\ttest $1,(%%edx)\n\t"
				"je 1b\n\t"
				"pause\n\t"
				"jmp 0b\n\t"
				"1:\tlock addl $1,(%%eax)\n\t"
				"test $1,(%%edx)\n\t"
				"je 2b\n\t"
				"lock subl $1,(%%eax)\n\t"
				"pause\n\t"
				"jmp 1b\n\t"
				"2:\tpushfl\n\t"
				"popl %[output]\n\t"
				"cli"
				:[output]"=q"(res)
				:"a"(&lock->readers), "d"(&lock->writers));
	
	return res;
}

static inline void rwLockReleaseRead(struct rwLock_t *lock, uint32t flags) {
	asm volatile("lock subl $1,(%[input])"
				:
				:[input]"q"(&lock->readers));
	
	if(likely(flags & 0x200))
		asm volatile("sti");
}

static inline uint32t rwLockAcquireWrite(struct rwLock_t *lock) {
	uint32t res;
	
	asm volatile("movl $1,%%edx\n\t"
				"0:\ttest $1,(%%eax)\n\t"
				"je 1f\n\t"
				"pause\n\t"
				"jmp 0b\n\t"
				"1:\txchg (%%eax),%%edx\n\t"
				"testl $1,%%edx\n\t"
				"jne 0b\n\t"
				"2:\tcmpl $0,(%%ebx)\n\t"
				"je 3f\n\t"
				"pause\n\t"
				"jmp 2b\n\t"
				"3:\tpushfl\n\t"
				"pop %[output]\n\t"
				"cli"
				:[output]"=q"(res)
				:"a"(&lock->writer), "b"(&lock->readers)
				:"%edx", "memory");
	
	return res;
}

static inline void rwLockReleaseWrite(struct rwLock_t *lock, uint32t flags) {
	asm volatile(""
				:
				:
				:"memory");
	lock->writer= 0;
	
	if(likely(flags & 0x200)
		asm volatile("sti");
}

Posted: **Sun Mar 07, 2010 6:11 am**

Hi,

There's a race-condition in "rwLockAcquireRead()" that could allow reading and writing at the same time. You test if there's currently a writer, (then a writer could acquire the writers lock), then try to acquire the reader's lock.

There's another (minor) race-condition in both "rwLockAcquireRead()" and "rwLockAcquireWrite()". You acquire the lock, (then an IRQ could occur), then you do CLI. For this problem you'd want to disable IRQs, then attempt to acquire, then restore IRQs if you failed to acquire.

The other thing I'd be worried about is starvation - so many threads reading that no writer is ever able to write.

Finally, I'm not too sure if it matters or not, but I'd put "flags" on the clobber lists.

Cheers,

Brendan

Posted: **Sun Mar 07, 2010 7:00 am**

Brendan wrote: There's a race-condition in "rwLockAcquireRead()" that could allow reading and writing at the same time. You test if there's currently a writer, (then a writer could acquire the writers lock), then try to acquire the reader's lock.

Ok, this I don´t get. I look if there is a writer, if not I add 1 to the readers lock and then I look again if there is a writer and if there is I sub 1 from the readers lock and test again if there is a writer and so on.

Brendan wrote: There's another (minor) race-condition in both "rwLockAcquireRead()" and "rwLockAcquireWrite()". You acquire the lock, (then an IRQ could occur), then you do CLI. For this problem you'd want to disable IRQs, then attempt to acquire, then restore IRQs if you failed to acquire.

S***! This also means I have this problem in my spinlock code, so I need to fix it there too. Thanks!

Brendan wrote: The other thing I'd be worried about is starvation - so many threads reading that no writer is ever able to write.

I don´t really know if I understand what you mean, but this problem would then also occur when I use a spinlock for this. But I think as I always set the writer lock and then wait for all readers to finish and if the writer lock is set no new readers can get the lock, I will not have this problem.

Brendan wrote: Finally, I'm not too sure if it matters or not, but I'd put "flags" on the clobber lists.

I haven´t tested it, but as I remember everthing which stands in the input or output list can´t be in the clobbered list, because gcc/gas will give me an error then (I had/have a problem with this ... I had esi as input and changed it in the assembly, but I couldn´t put it into the clobbered list and so had to put it also on the output list).

Edit::

New version:

Code: Select all

static inline uint32t rwLockAcquireRead(struct rwLock_t *lock) {
	uint32t res;
	
	asm volatile("pushfl\n\t"
				"0:\tcli\n\t"
				"test $1,(%%edx)\n\t"
				"je 1b\n\t"
				"sti\n\t"
				"pause\n\t"
				"jmp 0b\n\t"
				"1:\tlock addl $1,(%%eax)\n\t"
				"test $1,(%%edx)\n\t"
				"je 2b\n\t"
				"lock subl $1,(%%eax)\n\t"
				"sti\n\t"
				"pause\n\t"
				"jmp 0b\n\t"
				"2:\tpopl %[output]"
				:[output]"=q"(res)
				:"a"(&lock->readers), "d"(&lock->writers)
				:"flags");
	
	return res;
}

static inline uint32t rwLockAcquireWrite(struct rwLock_t *lock) {
	uint32t res;
	
	asm volatile("movl $1,%%edx\n\t"
				"pushfl\n\t"
				"0:\tcli\n\t"
				"test $1,(%%eax)\n\t"
				"je 1f\n\t"
				"sti\n\t"
				"pause\n\t"
				"jmp 0b\n\t"
				"1:\txchg (%%eax),%%edx\n\t"
				"testl $1,%%edx\n\t"
				"jne 0b\n\t"
				"2:\tcmpl $0,(%%ebx)\n\t"
				"je 3f\n\t"
				"pause\n\t"
				"jmp 2b\n\t"
				"3:\tpop %[output]"
				:[output]"=q"(res)
				:"a"(&lock->writer), "b"(&lock->readers)
				:"%edx", "memory", "flags");
	
	return res;
}

Posted: **Sun Mar 07, 2010 8:55 am**

Hi,

FlashBurn wrote:
Brendan wrote: There's a race-condition in "rwLockAcquireRead()" that could allow reading and writing at the same time. You test if there's currently a writer, (then a writer could acquire the writers lock), then try to acquire the reader's lock.
Ok, this I don´t get. I look if there is a writer, if not I add 1 to the readers lock and then I look again if there is a writer and if there is I sub 1 from the readers lock and test again if there is a writer and so on.

Doh - I think I misread the original code. I also didn't notice the "je 2b" bug (which should've been "je 2f").

FlashBurn wrote:
Brendan wrote: The other thing I'd be worried about is starvation - so many threads reading that no writer is ever able to write.
I don´t really know if I understand what you mean, but this problem would then also occur when I use a spinlock for this. But I think as I always set the writer lock and then wait for all readers to finish and if the writer lock is set no new readers can get the lock, I will not have this problem.

My fault again - your code did/does give writers priority.

I checked through the new code (carefully this time) and couldn't find any bugs!

The only thing I can suggest is using "lock bts $0,(%%eax)" and "jc 0b" instead of "xchg (%%eax),%%edx", "testl $1,%%edx" and "jne 0b" in rwLockAcquireWrite(); and removing the initial "movl $1,%%edx" and removing EDX from the clobber list.

I'd also be tempted to replace each "sti" with the instructions "popfl" and "pushfl". This allows you to use the locks when interrupts must remain disabled (e.g. if a piece of code needs to acquire 2 different locks).

Cheers,

Brendan

Posted: **Sun Mar 07, 2010 9:05 am**

Brendan wrote: I'd also be tempted to replace each "sti" with the instructions "popfl" and "pushfl". This allows you to use the locks when interrupts must remain disabled (e.g. if a piece of code needs to acquire 2 different locks).

Changed, also in my spinlock code, because there I did the same, so now it should also work if I have a lock and need one more.

Brendan wrote: The only thing I can suggest is using "lock bts $0,(%%eax)" and "jc 0b" instead of "xchg (%%eax),%%edx", "testl $1,%%edx" and "jne 0b" in rwLockAcquireWrite(); and removing the initial "movl $1,%%edx" and removing EDX from the clobber list.

Ok, but I used xchg because it uses the lock prefix. As I remember right, you need to use the lock prefix here because it could happen that on smp system you don´t worked on the current value which is in memory, or am I here wrong? I also remember that the bit opcodes (like bt, bts, bsr and so on) are not so fast, or am I here wrong too?

Posted: **Sun Mar 07, 2010 9:21 am**

Brendan wrote:The other thing I'd be worried about is starvation - so many threads reading that no writer is ever able to write.

This is easily solvable if temporary inconsistency is okay (think applying patches to the kernel, where unpatched and patched objects can coexist) and only requires one mutex (for the writers) and an extra layer of indirection (all references point at a pointer to the object, not the object itself).
A writer simply acquires the lock, does the updates on a local copy of the structure, atomically switches the new pointer for the old one and releases the lock. Readers don't have to even touch locks.

FlashBurn wrote:As I remember right, you need to use the lock prefix here because it could happen that on smp system you don´t worked on the current value which is in memory, or am I here wrong?

Yes, that's correct. The lock prefix ensures that no other processor can write to the destination between the read and write issued by the processor it is issued on.

Posted: **Sun Mar 07, 2010 9:58 am**

Hi,

FlashBurn wrote:
Brendan wrote: The only thing I can suggest is using "lock bts $0,(%%eax)" and "jc 0b" instead of "xchg (%%eax),%%edx", "testl $1,%%edx" and "jne 0b" in rwLockAcquireWrite(); and removing the initial "movl $1,%%edx" and removing EDX from the clobber list.
Ok, but I used xchg because it uses the lock prefix. As I remember right, you need to use the lock prefix here because it could happen that on smp system you don´t worked on the current value which is in memory, or am I here wrong? I also remember that the bit opcodes (like bt, bts, bsr and so on) are not so fast, or am I here wrong too?

The newer version (October 2009, mostly for Core2) of Intel's "Optimisation Reference Manual" says:

BTS is 1 cycles latency and 1 cycle throughput.
XCHG is 2 to 3 cycles latency and 1 cycle throughput.
CMP is 1 cycles latency and 0.33 to 0.5 cycles throughput.

AMD's "Software Optimization Guide for AMD64 Processors" (September 2005) says:

"BTS mem16/32/64, imm8" is 5 cycles latency
"XCHG mem16/32/64, reg16/32/64" is 16 cycles
"CMP mreg16/32/64, imm16/32" is 1 cycles latency

My (aging, Pentium 4 era) hard copy of Intel's "Optimisation Reference Manual" says:

BTS is 8 to 9 cycles latency and 1 cycle throughput; but the footnotes say that for complex instructions the cycles are "estimates based on conservative and worst-case estimates".
XCHG is 1.5 cycles latency and 1 cycle throughput.
CMP is 1 cycles latency and 0.5 cycles throughput.

The oldest information I have is "INTEL 80386 PROGRAMMER'S REFERENCE MANUAL 1986", which says:

"BTS r/m32,r32" is 6/13 cycles.
"XCHG r/m32,r32" is 3 cycles.
"CMP r/m32,imm32" is 2/5 cycles.

Of course these numbers don't count a lot of things, like memory/cache access times, etc. I'd assume none of them include the effect of the LOCK prefix for BTS (although, for 80386 the LOCK is "0 cycles"). I'd also assume that for Intel's numbers they're doing "xchg reg32, reg32" (unlike AMD, Intel don't list each different version of the instruction).

Basically, I don't know which is faster on which CPUs (or which CPUs your code will be run on).

Cheers,

Brendan

Posted: **Sun Mar 07, 2010 10:14 am**

Ok, so now I´m using "lock bts" in my read-write-lock and my spinlock code.

Now I have only to debug my avl code once again and then I can go on coding servers

OSDev.org

read write locks

read write locks

Re: read write locks

Re: read write locks

Re: read write locks

Re: read write locks

Re: read write locks

Re: read write locks

Re: read write locks