Page 2 of 3

Re:SMP Compatibility

Posted: Sat Jul 17, 2004 5:54 am
by Schol-R-LEA
Pete wrote: Also. When you have more than one CPU, do they have their own virtual address spaces and just share the physical one or what?
That depends on the multiprocessing architecture. In Symetrical Multiprocessing, which is what is being discussed here (and is the most common approach for multiprocessing in a single motherboard system) the processors share memory and have equal access to the entire memory space and peripherals.

Other designs work differently, and there are many different variations. For example, in centralized multiprocessing or co-processing, there is a single primary processor which uses the other processors to perform tasks on it's behalf; co-processing can be homogenous, in which the CPUs are of the same type, or heterogeneous, with specialized co-processors performing as single focused task. Technically speaking, most single-CPU PCs these days fall into the category of heterogeneous co-processing, as they use specialized GPU and DSP processors for video and audio, but since the co-processors are rarely programmed directly, they are usually treated a single-processor systems.

Non-Uniform Memory Access (NUMA) is similar to SMP, except that each processor has it's own 'local' memory as well as the shared memory; for each processor, accessing it's own local memory is faster than accessing the shared memory. Some NUMA designs allow processors to access other processors' local memory directly, but at an even higher overhead than for shared memory, while other require that the data be transferred to shared memory in order to be accessed by more than one processor.

In parallel processing or multithread processing, CPUs not only share memory, they can share a single program, each processor performing a single thread of execution for the whole. In most systems today, this would be managed in the OS software rather than in hardware, though there are some special-purpose parallel systems in existence such as the infamous 'DES Killer', a computer designed in 1998 to decrypt DES-encrypted messages in a matter of hours (this was done to demonstrate the weakness of the algorithm; not surprisingly, the National Bureau of Standards introduced the AES encryption algorithm shortly afterwards).

All of these designs are what are called 'tightly-coupled' multiprocessing; 'loosely-coupled' designs, such as clustering or distributed processing, do not share memory, and in fact can be completely separate systems. Distributed processing is a general term for networked systems running under a shared operating system, though it is also applied to distributed projects such as SETI@home. Clustering is distributed processing (in the first sense) among a group of physically proximate, closely networked (and usually homogenous) systems.

Comments and corrections welcome.

Re:SMP Compatibility

Posted: Sun Jul 18, 2004 11:26 pm
by Dreamsmith
Pete wrote:Are you suggesting that I add something so that once it gets into the loop instead of just looping it causes a task switch?
That is, in general, how semaphores or mutexes are implemented. Task A acquires mutex M and enters the code section protected by it. A short time later, task B attempts to acquire mutex M. Task B cannot procede while task A is in M. If M were a spinlock, B would waste time on the processor spinning the whole rest of its timeslice, and any subsequent timeslices it received. But since M is not a spinlock but a properly implemented mutex, instead B is immediately removed from the run queue, and not scheduled for processor time until A releases M. When A does release M, it causes an immediate task switch to B, so it may immediately proceed into the critical section it's been waiting for. If M were a spinlock, B would not be immediately woken up when A released it, since the scheduler has no information on who is spinning or why.

For certain short and sweet bits of code, a proper mutex or semaphore may be overkill, but in general, anything more than a couple lines of code needing protection should be protected by a mutex or semaphore, and not be disabling interrupts and using a spinlock if it can be avoided. Spinning is bad. It shouldn't be done except in circumstances where you know for sure it would take longer to switch tasks than it would take to simply spin and wait. Spinlocks certainly shouldn't be used to protect any section of code whose execution time isn't O(1).

Re:SMP Compatibility

Posted: Mon Jul 19, 2004 2:19 am
by Therx
Thanks for all the help. I think I understand most of this now.

Pete

PS Anyone know where Tim's been recently?

Re:SMP Compatibility

Posted: Mon Jul 19, 2004 12:29 pm
by proxy
so i'm implementing my spinlock and was wondering if this is a proper implementation:

(assume m_Lock is initialized to 0)

Code: Select all

   void acquire() { while(test_and_set(1, &m_Lock)) { continue; } }
   void release() { test_and_set(0, &m_Lock); }
assuming test_and_set is implemented as so

Code: Select all

test_and_set:
   movl 4(%esp), %eax   /* get new_value into %eax */
   movl 8(%esp), %edx   /* get lock_pointer into %edx */
   lock         /* next instruction is locked */
   xchgl %eax, (%edx)    /* swap %eax with what is stored in (%edx) */
            /* ... and don't let any other cpu touch that */
            /* ... memory location while you're swapping */
   ret         /* return the old value that's in %eax */
is this generally correct?

Re:SMP Compatibility

Posted: Tue Jul 20, 2004 12:43 am
by Brendan
HI,
proxy wrote: so i'm implementing my spinlock and was wondering if this is a proper implementation:

<snipped>

is this generally correct?
Yes :)

You may wish to add the PAUSE instruction to "acquire()" though - to prevent one logical CPU from slowing down other logical CPUs within the same physical CPU/chip.


Cheers,

Brendan

Re:SMP Compatibility

Posted: Tue Jul 20, 2004 10:44 am
by proxy
glad to see it is on target, i have a question about this sleep. I realize that the lock/xchg instructions will slow down the other processors as it will temporarily have to be lock-step for that one instruction. So my question is how do you recomend I implement a sleep? Are we talking the type that would involve a context switch cause I'm a bit undecided as to whether or not it would break things (the other threads would have interrupts enabled in there context, but so long as they dont touch the spinlock protected resource it shoudl be fine anyway).

thoughts?

proxy

Re:SMP Compatibility

Posted: Wed Jul 21, 2004 1:09 am
by Brendan
Hi,
proxy wrote: glad to see it is on target, i have a question about this sleep. I realize that the lock/xchg instructions will slow down the other processors as it will temporarily have to be lock-step for that one instruction. So my question is how do you recomend I implement a sleep? Are we talking the type that would involve a context switch cause I'm a bit undecided as to whether or not it would break things (the other threads would have interrupts enabled in there context, but so long as they dont touch the spinlock protected resource it shoudl be fine anyway).
"PAUSE" is an actual CPU instruction (similar to "NOP"). It was introduced with Pentium 4 processors, but it's backward compatible with all IA-32 processors as it's coded as a "REP NOP" (which won't actually repeat on older CPUS because NOP wasn't originally meant to).

From Intel's reference manual:
Improves the performance of spin-wait loops. When executing a ?spin-wait loop,? a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

An additional function of the PAUSE instruction is to reduce the power consumed by a Pentium 4 processor while executing a spin loop. The Pentium 4 processor can execute a spin-wait loop extremely quickly, causing the processor to consume a lot of power while it waits for the resource it is spinning on to become available. Inserting a pause instruction in a spin-wait loop greatly reduces the processor?s power consumption.

This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction operates like a NOP instruction. The Pentium 4 and Intel Xeon processors implement the PAUSE instruction as a pre-defined delay. The delay is finite and can be zero for some processors. This instruction does not change the architectural state of the processor (that is, it performs essentially a delaying noop operation).
There's also more information about it's benefits when used in conjunction with hyper-threading in "7.7.6.1. USE THE PAUSE INSTRUCTION IN SPIN-WAIT LOOPS" :)

To avoid any delays when acquiring a lock that is free (the most common case) I tend to do something like:

Code: Select all

getLock:
    mov eax,1
    xchg [lock],eax
    test eax,eax
    je .acquired

.locked
    pause
    xchg [lock],eax
    test eax,eax
    jne .locked

.acquired
    ret
Rather than

Code: Select all

getLock:
    pause
    xchg [lock],eax
    test eax,eax
    jne getLock
    ret
This second version will always execute "pause" at least once, while the first won't...

Cheers,

Brendan

Re:SMP Compatibility

Posted: Wed Jul 21, 2004 3:51 pm
by Dreamsmith
Brendan wrote:"PAUSE" is an actual CPU instruction (similar to "NOP"). It was introduced with Pentium 4 processors, but it's backward compatible with all IA-32 processors as it's coded as a "REP NOP" (which won't actually repeat on older CPUS because NOP wasn't originally meant to).
Cute. It should also be noted that the behavior of REP before any other non-string instruction produces "undefined behavior", according to Intel. In fact, I think it generally does nothing, but I wouldn't bet any money of the fact...

If you're into obfuscating your code, NOP can also be expressed as XCHG AX,AX (which also encodes to 0x90), so PAUSE becomes REP XCHG AX,AX.

Of course, if you actually do this, you should be taken out back and shot... ;)

Re:SMP Compatibility

Posted: Wed Jul 13, 2005 7:15 am
by Kemp

Code: Select all

getLock:
    pause
    xchg [lock],eax
    test eax,eax
    jne getLock
    ret
That piece of code I don't quite get, how could EAX be different to itself? It would have to be changed while being compared which is impossible as far as I know ???

Re:SMP Compatibility

Posted: Wed Jul 13, 2005 7:44 am
by Brendan
Hi,
Kemp wrote:

Code: Select all

getLock:
    pause
    xchg [lock],eax
    test eax,eax
    jne getLock
    ret
That piece of code I don't quite get, how could EAX be different to itself? It would have to be changed while being compared which is impossible as far as I know ???
The "test eax,eax" instruction isn't the same as "cmp eax,eax". It's a shorter version of "test eax,0xFFFFFFFF", or an alternative to "cmp eax,0".

Basically, the code loops until the value stored at "[lock]" is zero while also setting the value at [lock] to 1. If another CPU already has the lock it won't return zero. If the other CPU frees the lock (using "mov dword [lock],0" for e.g.) then the "xchg [lock],eax" part will atomically set the value to non-zero and read the value zero, which ends the loop.

I prefer using the "BTS" (bit test and set) instruction for spinloops:

Code: Select all

getLock:
    lock bts [lock],1
    jc .acquired

.locked:
    pause
    lock bts [lock],1
    jc .locked
.acquired:
    ret
It's functionally equivelent, but an instruction shorter (although optimizing delay loops could be considered a sign of insanity).


Cheers,

Brendan

Re:SMP Compatibility

Posted: Wed Jul 13, 2005 7:52 am
by Kemp
Ah, a good example of reading something how you want to read it rather than how it really is. It might just be me, but the BTS version seems easier to understand from just looking at it rather than having to think (it's nice to have at least one thing that doesn't require too much thought). Also, yes, optimising delay loops is a possible sign of insanity, however writing an OS is a much bigger sign ;)

Re:SMP Compatibility

Posted: Fri Dec 09, 2005 1:55 pm
by Rob
Resurrection of a old thread, but shouldn't that last jump instruction read jnc instead of jc, Brendan?

Re:SMP Compatibility

Posted: Sat Dec 10, 2005 12:04 am
by Brendan
Hi,
Rob wrote:Resurrection of a old thread, but shouldn't that last jump instruction read jnc instead of jc, Brendan?
No - it's the first "jc" that is wrong!

It should be:

Code: Select all

getLock:
    lock bts [lock],1
    jnc .acquired         ; <-- corrected

.locked:
    pause
    lock bts [lock],1
    jc .locked
.acquired:
    ret
An even better way would be:

Code: Select all

getLock:
    lock bts [lock],1
    jc .acquired

.locked:
    pause
    bt [lock],1            ; <-- No lock prefix
    jc .locked

    lock bts [lock],1
    jc .locked

.acquired:
    ret
This version doesn't repeatedly lock the bus when the spinlock is under heavy contention...


[EDIT - added everything below here]

To unlock the lock, you'd only need to use:

Code: Select all

mov dword [lock],0
Which doesn't need a lock prefix, but please note that the "lock" variable must be aligned.

Also, to minimize cache thrashing you should define the "lock" variable like this:

Code: Select all

    align 128
lock:
    dd 0
    align 128
This makes sure that the lock exists in its own cache line, which is not shared by anything else. It can also waste a lot of RAM if you've got a lot of locks.

I'd advise some serious contemplation if you wanted to reduce the amount of wasted RAM - imagine 2 or more CPUs trying to acquire the lock while a third CPU is trying to do something with data in the same cache line (cache misses everywhere and a huge amount of bus traffic).

For similar reasons, it is usually not a good idea to use the same variable for multiple locks (e.g. bit 0 for lock A, bit 1 for lock B, bit 2 for lock C, etc).


Cheers,

Brendan

Re:SMP Compatibility

Posted: Sun Aug 20, 2006 10:28 am
by Kemp
Sorry to resurrect this (very) old thread, but just a quick question. Does you second improved example there suffer the same "jc instead of jnc" problem as the original or is it an intentional change in this case?

Re:SMP Compatibility

Posted: Sun Aug 20, 2006 12:10 pm
by Brendan
Hi,
Kemp wrote:Sorry to resurrect this (very) old thread, but just a quick question. Does you second improved example there suffer the same "jc instead of jnc" problem as the original or is it an intentional change in this case?
Cut & paste is a curse sometimes....

The second example should be:

Code: Select all

getLock:
    lock bts [lock],0
    jnc .acquired

.locked:
    pause
    bt [lock],0
    jc .locked

    lock bts [lock],0
    jc .locked

.acquired:
    ret
Of course macros are more fun...

Code: Select all

%macro LOCK_SECTION 1
    lock bts %1,0
    jnc %%acquired

%%locked:
    pause
    bt %1,0
    jc %%locked

    lock bts %1,0
    jc %%locked

%%acquired:
%endmacro
And checking for mistakes can be a good idea too:

Code: Select all

%macro LOCK_SECTION 1
  %ifdef DEBUGGING
    push eax
  %endif

%%retry:
    lock bts %1,0
    jnc %%acquired

  %ifdef DEBUGGING
    mov eax,%1
    test eax,1
    je %%retry
    shr eax,1
    cmp [gs:thread_ID],eax
    je LOCK_USED_TWICE_FROM_SAME_THREAD
  %endif

%%locked:
    pause
    bt %1,0
    jc %%locked

    lock bts %1,0
    jc %%locked

%%acquired:
  %ifdef DEBUGGING
    mov eax,[thread_ID]
    shl eax,1
    or %1,eax
    pop eax
  %endif
%endmacro


%macro UNLOCK_SECTION 1
  %ifdef DEBUGGING
    push eax
    mov eax,%1
    test eax,1
    je UNLOCK_WITHOUT_LOCK
    shr eax,1
    cmp [gs:thread_ID],eax
    pop eax
    jne UNLOCK_FROM_WRONG_THREAD
  %endif
    mov %1,0
%endmacro

Cheers,

Brendan