SMP Parallel problem

lemonyii · Post by **lemonyii** » Tue Oct 18, 2011 10:21 am

hi guys!
i'm working on SMP now, but i found it's not so easy. after a little job, all cpus come to main():

volatile ulong cpu_count = 0;
volatile struct x64_cpu *pcpuhead;

volatile extern void main( void* memdesc, struct x64_cpu* me ){
volatile static bool wait = true;
	if ( me==null ) {	// bsp
		vga_init();
		vga_prt( "Welcome! Bienvenue! ¡Bienvenidos! Herzlich Willkommen! Benvenuti! " );
		mem_init(memdesc);
		while ( cpu_count-- ) {
			struct x64_cpu *p = alloc( 4096 );
			p->id = cpu_count;
			p->sz = 4096;
			p->next = pcpuhead;
			pcpuhead = p;
			if ( cpu_count==0 ) {	// myself
				me = p;
				__asm(" mov %%rax,%%rsp " : : "a"( (ulong)me + 4096 ) );
			}
		}
		//cpu_init();
		kbd_init();
		wait = false;        //[b]let all APs go[/b]
	} else {
		vga_prt( "C" );       //[b]printed multi times[/b]
		//cpu_init();
		while(wait);         //[b]until i can go[/b]
	}
	vga_prt( "hello!\n" );     //[b]printed only once, by bsp[/b]
	//sched();
	while(1);	
}

there is a problem i think caused by cache. None AP see the "wait" to be false. even i use "wbinvd" after both "wait=false" and "while(wait)", nothing different.

PS. Sorry, my solution here before is deleted because it solved nothing. and follow statements changed.

2. and you can see that i used volatile before many variables, i don't know if it is necessary, is it? surely it didn't work in such situation, and i remembered there is a prefix to tell gcc a variable many change during excution (so it will load from main memory everytime ), what's that and will it work?
3. what can ensure a writethrough to / "readthrough" from main memory?
thank u!

lemonyii · Post by **lemonyii** » Tue Oct 18, 2011 12:01 pm

it's deep night in china now, 2 o'clock. i almost fall in sleep, but my brain work it out: solve it with my lock/unlock function, as metioned at 2F. 2 minutes later, i'm posting here. this is my code:

Code: Select all

	....
        	kbd_init();
		unlock( &wait, sizeof(wait) );
	} else {
		vga_prt( "A" );
		//cpu_init();
		while( lock( &wait, sizeof(wait) ) );
		unlock( &wait, sizeof(wait) );
	}

but i'm thinking about how fast it could be? do instructions with lock prefix write back a whole cache line or just 8 byte(bus wide)?
it's too late now, i have to sleep soon. i'm checking doc from amd/intel tomorrow, and i think i still need advise or critics about this problem.
thank you!

Brendan · Post by **Brendan** » Wed Oct 19, 2011 5:09 am

Hi,

pitfall wrote:As far as I could test them, 'lock' prefixed operations do not explicitly deal with the cache. Instead of this they deal with memory bus locking. The CPU grabs the membus lock for the number of cycles required for the operation to be performed, and then releases it. Other CPU's are just spinning on the membus lock. Then we can call the operation an "atomic" one ^^. Happy for having helped you.

On older CPUs, anything with a LOCK prefix (and a few other things) caused the CPU to lock the bus, then do the "read, modify, write", then unlock the bus. The more CPUs there are sharing a bus the more expensive this gets, and when CPUs are connected by links (e.g. Hyper-transport, Quickpath) and not a common/shared bus it gets messy.

For modern CPUs (maybe "Pentium 4 or later"), where possible, the CPU will do it in cache instead (e.g. put the cache line into the "exclusive" state, then do the "read, modify, write" in the cache). The existing cache coherency mechanism ensures that nothing else can access that cache line while this is happening, even though the bus (or links) aren't locked. It's probably more complex than that though (e.g. there's probably limitations and I'd assume the CPU would fall back to "lock the bus/links" if the access is split across cache line boundaries or page boundaries, or the data being modified is "uncacheable" or caches are disabled).

For lemonyii's problem, there is no "read, modify, write" instruction, and therefore a LOCK shouldn't have been needed (as long as the "wait" variable isn't misaligned). Instead, I think a memory barrier is all that is needed. Wikipedia has a good (and interesting) description of memory barriers, which also explains why "volatile" isn't enough for memory barriers.

Cheers,

Brendan

gerryg400 · Post by **gerryg400** » Wed Oct 19, 2011 6:00 am

Is a memory barrier required for this on x86 ? I thought that if core A wrote to a variable then core B would see the new value. Doesn't cache coherency guarantee that ?

I though that making the variable volatile to ensure the compiler does the right thing would be sufficient in this case. A fence would only be required for cache consistency with multiple variables.

Brendan · Post by **Brendan** » Wed Oct 19, 2011 6:48 am

Hi,

gerryg400 wrote:Is a memory barrier required for this on x86 ? I thought that if core A wrote to a variable then core B would see the new value. Doesn't cache coherency guarantee that ?

I though that making the variable volatile to ensure the compiler does the right thing would be sufficient in this case. A fence would only be required for cache consistency with multiple variables.

There's 2 different issues. The first is that the "volatile" keyword in C doesn't do what people think it does. From that wikipedia article:

In C and C++, the volatile keyword was intended to allow C and C++ programs to directly access memory-mapped I/O. Memory-mapped I/O generally requires that the reads and writes specified in source code happen in the exact order specified with no omissions. Omissions or reorderings of reads and writes by the compiler would break the communication between the program and the device accessed by memory-mapped I/O. A C or C++ compiler may not reorder reads and writes to volatile memory locations, nor may it omit a read or write to a volatile memory location, allowing a pointer to volatile memory to be used for memory-mapped I/O.

The C and C++ standards do not address multiple threads (or multiple processors), and as such, the usefulness of volatile depends on the compiler and hardware. Although volatile guarantees that the volatile reads and volatile writes will happen in the exact order specified in the source code, the compiler may generate code (or the CPU may re-order execution) such that a volatile read or write is reordered with regard to non-volatile reads or writes, thus limiting its usefulness as an inter-thread flag or mutex. Preventing such is compiler specific, but some compilers, like gcc, will not reorder operations around in-line assembly code with volatile and "memory" tags, like in: asm volatile ("" : : : "memory"); (See more examples in compiler memory barrier). Moreover, it is not guaranteed that volatile reads and writes will be seen in the same order by other processors due to caching, cache coherence protocol and relaxed memory ordering, meaning volatile variables alone may not even work as inter-thread flags or mutexes.

The second issue is when the write will be visible to other CPUs. 80x86 guarantees that writes will appear to occur in order, but does not guarantee when. For example, if you write to an address and then go into a loop or something (that doesn't do any writes); then that first write could (at least in theory) be postponed until after the loop terminates.

Also note that Intel introduced instructions specifically for memory barriers (MFENCE, LFENCE and SFENCE). For example, if you write to an address and then do an "MFENCE" instruction, then no reads (or writes) will occur until after the write has been made visible to other CPUs. This also doesn't guarantee "when", but does tend to starve the CPU of useful work that might cause the write to be postponed. There are also a group of (older) instructions that force stronger ordering (and behave a little like "MFENCE"): any IO port access, any instruction with a LOCK prefix, and any serialising instruction (e.g. CPUID, WBINVD, INVLPG, IRET, LGDT, etc).

Cheers,

Brendan

lemonyii · Post by **lemonyii** » Wed Oct 19, 2011 8:32 am

so u are implying that "fence-like" instructions have the best performance, then lock and wbinvd? as lock is a serialize instruction, it should be slower than wbinvd because it have to flush all dirty cache lines?

Brendan · Post by **Brendan** » Wed Oct 19, 2011 9:44 am

Hi,

lemonyii wrote:so u are implying that "fence-like" instructions have the best performance, then lock and wbinvd? as lock is a serialize instruction, it should be slower than wbinvd because it have to flush all dirty cache lines?

WBINVD is extremely expensive, and should be avoided as much as possible (fortunately there's very few reasons to ever use it). INVD would be the second most expensive (but fortunately there's no sane reason to ever use it for anything).

MFENCE, LFENCE and SFENCE are probably the fastest. The main disadvantage is that older CPUs don't support them (I checked - they require "Pentium III or later").

LOCK should be faster than accessing an IO port (and much faster than WBINVD), but slower than most of the other serialising instructions (including CPUID, LGDT, LIDT, MOV to debug register, INVLPG, etc).

Cheers,

Brendan

gerryg400 · Post by **gerryg400** » Wed Oct 19, 2011 1:41 pm

I understand what volatile does. And I believe volatile is required here to ensure that the compiler generates code in the else-clause (the one that the AP executes) that actually repeatedly loads 'wait' and doesn't simply assume 'wait' is true and do a 'while (true)'.

However, I still think the fence is not required because the BP writes to the variable and then continues on to write to the screen. I can't find any explicit rules but I'm pretty sure that the store buffer is flushed as a matter of course as the bus becomes available and that by the time the BP writes to the screen the 'wait' variable will have been flushed from the store buffer and be visible to the APs. As soon as the store buffer is flushed, cache coherency guarantees correct operation.

Is my understanding incorrect ?

Nugget · Post by **Nugget** » Wed Oct 19, 2011 4:49 pm

Wouldn't the declaration of the "wait" variable be on the stack due to it's local scope? Each thread (on different CPUs) should be using a different stack, so each would see "wait" at a different memory location.

Moving it to global scope should fix this.

gerryg400 · Post by **gerryg400** » Wed Oct 19, 2011 4:54 pm

This is from the Intel manual

Code: Select all

Spin_Lock:
    CMP lockvar, 0          ;Check if lock is free
    JE Get_Lock
    PAUSE                       ;Short delay
    JMP Spin_Lock

Get_Lock: 
    MOV EAX, 1
    XCHG EAX, lockvar     ;Try to get lock
    CMP EAX, 0                ;Test if successful 
    JNE Spin_Lock

Critical_Section:
<critical section code>
    MOV lockvar, 0
    ...
Continue:

The unlock code does not use a LOCK prefix or a fence when it modifies the lockvar but another processor will eventually see the variable become 0 and reach the "Get_Lock:" label.

[edit] fixed spelling

[edit2] Perhaps the PAUSE instruction helps with this ??

gerryg400 · Post by **gerryg400** » Wed Oct 19, 2011 4:54 pm

Nugget wrote:Wouldn't the declaration of the "wait" variable be on the stack due to it's local scope? Each thread (on different CPUs) should be using a different stack, so each would see "wait" at a different memory location.

Moving it to global scope should fix this.

It's static so it should be okay.

Brendan · Post by **Brendan** » Wed Oct 19, 2011 5:38 pm

Hi,

gerryg400 wrote:I understand what volatile does. And I believe volatile is required here to ensure that the compiler generates code in the else-clause (the one that the AP executes) that actually repeatedly loads 'wait' and doesn't simply assume 'wait' is true and do a 'while (true)'.

Something stronger than volatile would be recommended.

For example, consider this code:

Code: Select all

volatile int first = 0;
volatile int second = 0;

void something(void) {
    first = 1;
    second = 2;
    while(FALSE);
}

As far as I understand it, the compiler must do the writes to the volatile variables in the same order, but is free to shift non-volatile pieces of code (like the while loop) before them, and therefore the compiler could do this instead:

Code: Select all

volatile int first = 0;
volatile int second = 0;

void something(void) {
    while(FALSE);
    first = 1;
    second = 2;
}

gerryg400 wrote:However, I still think the fence is not required because the BP writes to the variable and then continues on to write to the screen. I can't find any explicit rules but I'm pretty sure that the store buffer is flushed as a matter of course as the bus becomes available and that by the time the BP writes to the screen the 'wait' variable will have been flushed from the store buffer and be visible to the APs. As soon as the store buffer is flushed, cache coherency guarantees correct operation.

Is my understanding incorrect ?

If the writes to the screen have occurred (and were done by the BSP), then "write-ordered with store forwarding" would imply that the write to "wait" has also occurred. The barrier would be there to make the write to "wait" occur sooner.

If neither a LOCK nor a barrier is necessary; then can any of us be sure there actually is a problem involving the "wait" variable?

Note: There are at least 2 problems in the original code that have nothing to do with the "wait=false" or "while(wait)".

Cheers,

Brendan

gerryg400 · Post by **gerryg400** » Wed Oct 19, 2011 5:41 pm

If neither a LOCK nor a barrier is necessary; then can any of us be sure there actually is a problem involving the "wait" variable?

I don't think there is a problem. I think there is a bug elsewhere. I've certainly never used a fence or lock prefix in this situation.

gerryg400 · Post by **gerryg400** » Wed Oct 19, 2011 7:54 pm

Brendan wrote:Note: There are at least 2 problems in the original code that have nothing to do with the "wait=false" or "while(wait)".

Welcome back, by the way.

Lemonyii, here are some of the potential problems I see.

1. The variable 'me' is assigned a value p but that value is not returned to the caller. This may be a problem for you if the calling function expects something in return.

2. Modifying the %rsp whilst you are in a C function is not good for lots of reasons. It's probably still possible to access your function parameters because they are indexed off the %rbp but there is no way to tell the compiler that you've done it and it will assume that the original %rbp and return address are on the new stack. It's very difficult to predict what sorts of problems this might cause and you need to find another way to do this.

3. vga_prt will need to have locking in it. It most likely increments an index into the screen and this needs to be done atomically with the writes to the screen buffer. It may work okay for single character strings and fail for multi character string. I'm only guessing that you don't have locking. This may in fact be your real bug and it might be that the locking that you added to your function 'fixes' the missing locking here.

lemonyii · Post by **lemonyii** » Wed Oct 19, 2011 8:24 pm

Welcome back, by the way.

thank you very much!

1. The variable 'me' is assigned a value p but that value is not returned to the caller. This may be a problem for you if the calling function expects something in return.

no, the main function never return.

2. Modifying the %rsp whilst you are in a C function is not good for lots of reasons. It's probably still possible to access your function parameters because they are indexed off the %rbp but there is no way to tell the compiler that you've done it and it will assume that the original %rbp and return address are on the new stack. It's very difficult to predict what sorts of problems this might cause and you need to find another way to do this.

my code is for x64 arch, so the args is in rdi, rsi, r8, ... no arg in stack unless more than 6 args are passed to here. and no local variable here in stack. but there really might be a problem, we won't know what will gcc do with stack, maybe the address of wait is in that...
i will check it out if this caused problems.
EDIT. i was wrong. the value of p might be affected. but deleting it will do nothing to our problem as i tested.

3. vga_prt will need to have locking in it. It most likely increments an index into the screen and this needs to be done atomically with the writes to the screen buffer. It may work okay for single character strings and fail for multi character string. I'm only guessing that you don't have locking. This may in fact be your real bug and it might be that the locking that you added to your function 'fixes' the missing locking here.

i see this very early, and i wanted to fix it after this is done. so i modified my code to print just a single character. and i have 8 cores emulated, so it's hardly that all cpus print at the same time which lead to only one character printed at last.

and unfortunately, fence instructions doesn't work, at least i didn't make it work.
i will keep trying.
thank u all guys!

OSDev.org

SMP Parallel problem

SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem

Re: SMP Parallel problem