xeyes wrote:Right now most of my issues are 'cross functional', like some thread has a few locks, then it decides to sleep.
Apparently that wasn't the original intention but a byproduct of situations like "a few layers down, it needs to wait for the hard disk".
I see. Yeah, that will be fun once I get to file system code.
xeyes wrote:I can't imagine many things that definitely need MP, it's mostly a throughput or performance enhancement.
Mostly, yes. However, consider that single-threaded performance has been mostly stagnant for the last decade. We will likely see more and more cores, and using only a single one means using an ever-smaller fraction of the available hardware.
xeyes wrote:That's also my excuse of not touching 64b yet
Your choice. I simply started my kernel as a 64-bit one. I have no intent of ever supporting 32-bit mode for anything other than legacy booting.
xeyes wrote:I probably included too much 'supporting' code and wasn't clear in the comments.
The barrier waits are between the two threads. Until both have reached the same point and are ready to continue, neither can continue.
[...]
If you believe that there's causality in this universe and time only flows in one direction, sum == 0 shouldn't have happened.
Having now read the link you provided, I understand a bit better. In that case, you have undefined behavior due to a data race on v1 and v2. According to C11's memory model, a data race occurs when multiple threads access the same variable without synchronization, and at least one of those accesses is a write. That is the case for v1 and v2 here, which are both written by one thread and read by the other. In order to resolve the data race, you would have to make the store and the load atomic. And the compiler's atomic load and store intrinsics probably contain the required memory barriers.
Which is also why you really ought to either read up on memory ordering or - my preference - use the compiler intrinsics for atomic operations, since those already contain the required barriers. I always find it hard to reason about barriers, since in sequential code they don't do anything. I have a similar problem with cache management instructions (the cache is supposed to be transparent).
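To make that concrete, here is roughly what I have in mind for your v1 (just a sketch; I'm guessing at how v1 is declared on your side, and the publish/consume helpers are made-up names):
Code: Select all
/* Shared between the two threads. All accesses go through the atomic
   intrinsics, so there is no data race and no undefined behavior. */
static int v1;

/* Writer thread: publish a new value. */
static void publish(int value)
{
    __atomic_store_n(&v1, value, __ATOMIC_SEQ_CST);
}

/* Reader thread: read the current value. */
static int consume(void)
{
    return __atomic_load_n(&v1, __ATOMIC_SEQ_CST);
}
The same pattern applies to v2, and SEQ_CST is the safe default if you don't want to reason about the weaker orderings.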
So anyway, I tried the test program presented in that link and got the following output:
Code: Select all
1 reorders detected after 608520 iterations
2 reorders detected after 4988844 iterations
3 reorders detected after 11804037 iterations
So I guess my Core i7 is not as reorder-happy as the blog author's Core 2 Duo. However, with atomic instructions in place, i.e. with the "transaction" part replaced with
Code: Select all
__atomic_store_n(&X, 1, __ATOMIC_SEQ_CST);
r1 = __atomic_load_n(&Y, __ATOMIC_SEQ_CST);
in thread 1 and similar for thread 2, the compiler just inserts the mfence instruction where the blog author originally put it. And indeed, running the program for a few minutes now fails to produce any output. Since two different memory locations are involved here, sequential consistency is the required ordering, since it is the only one that also orders operations on different addresses; the weaker orderings would still allow the store to be reordered past the subsequent load.
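For contrast, and this is just the code generation I would expect from GCC on x86-64 rather than something taken from the blog post, the weaker orderings would not buy you the fence here (X, Y and r1 as in the test program):
Code: Select all
/* Sequentially consistent store: compiles to a mov followed by an
   mfence (or an xchg), which keeps the store from being reordered
   past the following load. */
__atomic_store_n(&X, 1, __ATOMIC_SEQ_CST);
r1 = __atomic_load_n(&Y, __ATOMIC_SEQ_CST);

/* Release store and acquire load: on x86 these are plain mov
   instructions with no fence, so the store-load reordering the test
   detects can still happen. */
__atomic_store_n(&X, 1, __ATOMIC_RELEASE);
r1 = __atomic_load_n(&Y, __ATOMIC_ACQUIRE);
On x86, ordinary loads already have acquire semantics and ordinary stores have release semantics, which is why only the sequentially consistent store needs an explicit fence.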
I do find it telling that the guy chose to "solve" the problem first by restricting the process's CPU affinity. Instead of fixing the bug (the data race, in this case), he chose to make the whole program worse. That's like integrating a garbage collector into your program because you can't figure out a memory allocation bug. It's just shoddy craftsmanship. If a bridge builder added a pillar to an otherwise free-hanging bridge because he couldn't figure out the statics, we'd call him raving mad, yet here we are, essentially doing the same.