Hi,
Wow, OK...
My formula for "estimated cost of a spinlock" is exactly how Colonel Kernel explained:
T = Pa * Ta + Pb * Tb
Where:
- T = estimated average cost of the spinlock
- Pa = probability that the lock can be acquired without spinning
- Pb = probability that the lock can't be acquired without spinning
- Ta = average cost of acquiring the lock if no spinning is needed
- Tb = average cost of acquiring the lock if spinning is needed
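For example, with made-up numbers - Pa = 0.9, Ta = 3 cycles, Pb = 0.1 and Tb = 300 cycles - you'd get T = 0.9 * 3 + 0.1 * 300 = 32.7 cycles.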
It wasn't meant to be 100% accurate (it's just an estimate of the relative cost), just like my other statement about the estimated cost of a thread switch ("in the order of 5 to 500 times more CPU time") isn't 100% accurate either.
Still, take a look at some basic spinlock code (typically implemented as a macro):
Code:
lock bts dword [myLock], 0    ; try to grab bit 0 (CF = old value of the bit)
jnc .acquired                 ; bit was clear - we own the lock
.spin:
pause                         ; be nice to hyper-threading and save power while spinning
test dword [myLock], 1        ; read-only test while someone else owns the lock
jne .spin                     ; still set - keep spinning
lock bts dword [myLock], 0    ; looks free - try to grab it again
jc .spin                      ; someone else got it first - back to spinning
.acquired:
Then there's the code to free the lock, which on 80x86 can be as simple as a single store (stores aren't reordered with older stores, so no LOCK prefix is needed to release):
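Code:
mov dword [myLock], 0         ; clear bit 0 - the next "lock bts" will find the lock free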
In this case, the average cost of acquiring the lock if no spinning is needed (Ta) comes from the first 2 instructions and nothing else. On a modern CPU, if "[myLock]" is in the cache, the first instruction ("lock bts [myLock], 0") takes roughly 3 cycles and doesn't involve locking the bus. The second instruction (if it doesn't cause a branch mis-prediction) doesn't cost anything. If "[myLock]" is not in the cache, then you've got a "read for ownership" and bus-locking, and if the branch is mis-predicted there's also a branch mis-prediction penalty (roughly 25 cycles for Pentium 4/Netburst - less for CPUs with shorter pipelines).
For the average cost of acquiring the lock if spinning is needed (Tb), the cost of all the instructions that execute before the lock becomes free can be measured as the number of cycles until some other CPU frees the lock. When the lock does become free it'd take about 30 cycles (including a likely branch mis-prediction) plus the cost of a cache miss to exit the loop.
The cost of freeing the lock should also be included - a possible cache line miss (if someone else tried to acquire the lock while we were in our critical section) plus about 2 cycles.
This gives a more accurate estimate of:
T = Pa * (Ta + Pb * C + Pc * M) + Pd * (L + X + C + Pe * M) + Pf * C + F
Where:
- T = estimated average cost of the spinlock
- Pa = probability that the lock can be acquired without spinning
- Pb = probability that the cache line needs to be fetched when no spinning is needed
- Pc = probability that the branch is mis-predicted when no spinning is needed
- Pd = probability that the lock can't be acquired without spinning
- Ta = minimum cost of acquiring the lock if no spinning is needed
- C = cache line fetch cost
- M = branch mis-prediction penalty
- L = average number of cycles before the lock becomes free
- X = minimum cost of exit overhead (without cache line miss or branch mis-predict)
- Pe = probability of a branch mis-predict when exiting the spinloop
- Pf = probability that someone tries to acquire the lock before it's freed
- F = minimum cost of freeing the lock
Simplified - with guesses based on expected times for Pentium 4, an assumed 100% chance of a branch mis-predict when exiting the spinloop, an assumed 10% chance of a branch mis-predict when no spinning is needed, an assumed 25 cycle branch mis-prediction penalty, an assumed 100 cycle cache miss penalty, the realisation that Pf = Pd, and an assumed minimum cost of 2 cycles for freeing the lock - it becomes:
T = Pa * (Pb * 100 + 5.5) + (1 - Pa) * (L + 137 + C)
Where:
- T = estimated average cost of the spinlock
- Pa = probability that the lock can be acquired without spinning
- Pb = probability that the cache line needs to be fetched when no spinning is needed
- C = cache line fetch cost
- L = average number of cycles before the lock becomes free
Now, if (for a specific lock with specific expected usage) the probability that the lock can be acquired without spinning is 90%, the probability that the cache line needs to be fetched when no spinning is needed is 50%, the average time before a lock becomes free is 250 cycles (i.e. the average critical section protected by the lock is about 500 cycles), and the left-over "C" term is ignored, you get:
T = 0.9 * (0.5 * 100 + 5.5) + (1 - 0.9) * (250 + 137)
T = 0.9 * 55.5 + 0.1 * 387
T = 88.65 cycles
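Of course those numbers are still guesses - if you want real numbers for a specific CPU you can measure the uncontended path directly with RDTSC. A rough sketch (the RDTSC overhead would need to be calibrated out, and the result averaged over many runs, before it means much):
Code:
cpuid                         ; serialize, so earlier instructions have retired
rdtsc                         ; EDX:EAX = start time stamp
mov esi, eax                  ; keep the low 32 bits of the start time
lock bts dword [myLock], 0    ; the uncontended acquire path being measured
jnc .measured                 ; keep the branch in the measured path
.measured:
mov dword [myLock], 0         ; free the lock again
rdtsc                         ; EDX:EAX = end time stamp
sub eax, esi                  ; EAX = rough cycle count for one acquire + release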
In contrast, something like an "FXSAVE/FXRSTOR" pair (without any of the other costs of thread switches) is going to cost more than double this in terms of cycles used.
For my OS, thread switch costs include determining which thread to run next, adjusting the amount of time used by the thread, switching register state, and (in all cases) the cost of switching address spaces (TLB flush, etc). For me, the cost of a thread switch is around 2500 cycles. In this case a spinlock is going to be cheaper than a pair of thread switches unless "T" (from the formulas above) is > 5000 cycles (assuming that the thread isn't about to be pre-empted, which is something you can't predict when writing your kernel). That corresponds to insane amounts of lock contention and/or insanely long critical sections, and brings me back to the "3 rules of thumb for micro-kernel locking" I posted earlier, which can be summarized as "use spinlocks unless your code needs to be redesigned".
Of course locks in user-space and device drivers are still a completely different matter...
bewing wrote: In these calculations, I think you are all making a huge mistake by ignoring some hardware details -- that are in a sense hidden from you, yes -- but still exist.
Posit: you have a machine with 8 CPUs. The whole point of a spinlock is that there is a specific memory location that is updated/checked by each CPU in their own spinlock code, repeatedly. Either you work really hard to put this memory location into uncacheable memory -- or it will certainly fall into L1 and L2 cache on every CPU, from being accessed so often.
Under no circumstances should any lock be in uncacheable memory - access to RAM is slow, and the relative difference between CPU speed and RAM speed can be expected to continue growing.
bewing wrote: If the memory address is cacheable, on the other hand, then you just invalidated significant portions of the L1 and L2 caches of all 7 of the other CPUs on the machine. On a page that they are guaranteed to need again very very soon -- because they will be checking this spinlock themselves. Which means that all 7 of those CPUs will need to reload exactly the same page out of ram -- guaranteeing massive contention. Not to mention the tremendous load you place on the cache coherency mechanisms.
First, "page" in your statement above is misleading, as "page" typically means a unit of the virtual memory system (e.g. for 80x86 it's 4096 bytes, or possibly 2 MiB or 4 MiB) while for modern CPUs the cache coherency mechanism typically uses much smaller areas (e.g. 64-byte cache line sizes on modern 80x86).
Secondly, the cache coherency mechanism normally uses MESI (modified, exclusive, shared, invalid) states, with cache snooping. This means that if one CPU has modified a cache line and other CPU/s want that cache line, then the first CPU will send the modified cache line on the bus to memory and other CPU/s can snoop that transaction directly from the bus. I.e. the modified cache line goes from CPU to RAM and other CPU/s, not from CPU to RAM then from RAM to other CPU/s.
For a lock under extremely heavy contention, the same cache line would be in all CPUs' caches. When the current lock owner tries to free the lock it'd tell other CPUs to set the cache line to "invalid", then modify the cache line in its own cache. Soon after, the other spinning CPUs will request to read the cache line again, and the CPU that just freed the lock will send the modified cache line across the bus to the other CPUs (and RAM). Now the cache line is "shared" by all CPUs. The other CPUs will realise that the lock is now free and attempt to lock the bus and put the cache line in the "exclusive" state. The first CPU to lock the bus succeeds, all other CPUs change the cache line state to "invalid", and the winner sets the cache line to "exclusive" and does its "read/modify/write" atomic instruction. Soon after, the other spinning CPUs will request to read the cache line again, and the CPU that just acquired the lock will send the modified cache line across the bus to the other CPUs (and RAM). Now the cache line is "shared" by all CPUs again, and we're back where we started.
Basically, under extremely heavy contention, there's no bus traffic at all unless a CPU acquires the lock or frees the lock. When a CPU does acquire or free the lock it causes a request for exclusive access, one or more requests for cache line read, followed by a cache line write.
For a badly designed spinlock, it is much worse. For example, compare the spinlock code near the top of this post to something like this:
Code:
.spin:
lock bts dword [myLock], 0    ; locked read/modify/write on every iteration (needs the cache line exclusive every time)
jc .spin                      ; still taken - try the locked operation again immediately
In this case you would get a huge amount of bus traffic (but that's probably what you'd deserve for using such a badly designed spinlock to begin with).
Cheers,
Brendan