OS Support For Transactional Memory

Colonel Kernel · Post by **Colonel Kernel** » Thu Jun 07, 2007 5:05 pm

bewing wrote:Especially as Brendan admits that he still needs a spinlock (gah! one of the most disgusting cpu-cycle wasting misinventions in history!) to make his STM stuff work. -- Otherwise, the STM sounds ok to me.

You always need spinlocks in the implementation of an OS' memory manager, unless someone very clever can think of lock-free algorithms for everything (which I doubt is possible). If you try to implement the OS itself without spinlocks, I think you'll end up with a chicken-and-egg problem. Even compiler-assisted STM requires some run-time support, and I'm betting you can't avoid locking in the implementation.

Kevin McGuire · Post by **Kevin McGuire** » Thu Jun 07, 2007 6:28 pm

Colonel Kernel wrote:
bewing wrote:Especially as Brendan admits that he still needs a spinlock (gah! one of the most disgusting cpu-cycle wasting misinventions in history!) to make his STM stuff work. -- Otherwise, the STM sounds ok to me.
You always need spinlocks in the implementation of an OS' memory manager, unless someone very clever can think of lock-free algorithms for everything (which I doubt is possible). If you try to implement the OS itself without spinlocks, I think you'll end up with a chicken-and-egg problem. Even compiler-assisted STM requires some run-time support, and I'm betting you can't avoid locking in the implementation.

Just wanted to clarify that you do not need to use a spin lock. You can make the thread sleep and wait. Then considering that a operating system might be designed to not use paging and have the problem of TLB flushes we could say that sleeping threads is still not a valid argument for being wasteful being relative to to our goal which is the implementation of transactional memory..

Just trying to keep the facts straight. I appreciated you helping me keep the facts straight about hash tables and associative arrays. Thanks.

Reference:
http://en.wikipedia.org/wiki/Spinlock
Extract:

NoBody wrote: In software engineering, a spinlock is a lock where the thread simply waits in a loop ("spins") repeatedly checking until the lock becomes available. As the thread remains active but isn't performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread blocks (aka "goes to sleep").

Semaphores.

Brendan · Post by **Brendan** » Thu Jun 07, 2007 7:54 pm

Hi,

About spinlocks....

Spinlocks do waste cycles, however, a mutex (where you switch to another thread and then switch back when the lock is free) can waste even more cycles. Mutexes (if used inappropriately) can also increase contention and latency, and reduce scalability. The same applies to semaphores.

For example, imagine what happens if the lock becomes free immediately after your mutex decides to switch to another thread - in this case you've got a minimum of 2 task switches before you acquire the lock.

I have 3 "rules of thumb" for micro-kernel locking:

Rule 1: Use a spinlock if the critical section is faster than 2 task switches, unless the lock is heavily contended.
Rule 2: If the critical section is not faster than 2 task switches, then redesign your code.
Rule 3: If the lock is heavily contended, then redesign your code.

Of course locks in user-space and device drivers are a completely different matter...

Cheers,

Brendan

Colonel Kernel · Post by **Colonel Kernel** » Thu Jun 07, 2007 11:38 pm

Kevin McGuire wrote:Just wanted to clarify that you do not need to use a spin lock. You can make the thread sleep and wait.

And by putting a thread to sleep, you mean adding it to a queue of other sleeping threads, right? And what happens if more than one processor tries to modify that queue simultaneously? Do you see my point now?

At the lowest level of implementation, there are shared data structures, and you have to protect them from concurrent access. Fundamentally, there are only three ways to do this -- disable interrupts (which only works on a uniprocessor machine), use a lock-free algorithm (for something simple like a queue, this is feasible, but is not going to be feasible in the general case), or use a spinlock. All higher-level abstractions for safe concurrent access to shared data are built up from these primitives.

I suppose once non-virtualized HTM is out there, the kernel itself could use that, but I think it would be a bit complicated to use TM in the implementation of TM.

Kevin McGuire · Post by **Kevin McGuire** » Fri Jun 08, 2007 3:12 am

*shrug*

bewing · Post by **bewing** » Fri Jun 08, 2007 1:19 pm

Um, considering that this thread stayed completely on-topic until it began to deviate over the last 6 or 7 posts -- I kinda hate to do this, but ...

Theoretically, almost all of a computer's timeslices are supposed to be devoted to userspace apps, and therefore:
you can avoid almost all the complications that Colonel Kernel mentioned by simply enforcing that only one particular "core" is ever allowed to run kernel threads. Suddenly, you no longer need to deal with kernel threads on other cores competing for shared kernel resources, concurrently.

Crazed123 · Post by **Crazed123** » Fri Jun 08, 2007 1:33 pm

So then how do the other threads communicate with the kernel thread? In most systems, IPC entails dropping into kernel-space.

Kevin McGuire · Post by **Kevin McGuire** » Fri Jun 08, 2007 3:46 pm

And by putting a thread to sleep, you mean adding it to a queue of other sleeping threads, right? And what happens if more than one processor tries to modify that queue simultaneously? Do you see my point now?

It is just that I do not care to do your work for you and explain your point. If you make a point, make it clear and understandable.

However, just to make a point I will do your work for you in this case. By doing your work for you I will tell you that each individual processor in a SMP UMA system can have their own thread container. When migrating a thread you do a test and set (xchg) on a field in each processor's thread container (at the worst case scenario). I said test and set not sit and spin. If you test and set with success then you add a entry for this thread to another field. When the processor does a task switch it checks the migrating thread field and adds any threads added there. Eliminating the need for a ""spinlock"" as far as the implementation of scheduling interaction between processors goes.

As for the original matter which is a memory manager I still do not see you making a point as to why we have to use a "spinlock" which is interpreted by me as a non-sleeping spinlock just for the sake of clarification which was the original matter. The generic waste CPU cycles thread synchronization method. If you have a reason please do show some proofs as to why you think this is so. Then in that case you can also join in with the research for transactional memory by writing a paper.

I have trouble seeing your point in quite a few cases. As this one just wasted my time having to explain to you how it can be done, and having to argue with you about it.

I said it can be done with out a spinlock, and that is exactly what I mean.
If I tell you the moon is made out of cheese, then go get a ladder.

Colonel Kernel · Post by **Colonel Kernel** » Fri Jun 08, 2007 8:31 pm

Kevin McGuire wrote:I have trouble seeing your point in quite a few cases. As this one just wasted my time having to explain to you how it can be done, and having to argue with you about it.

First, take a deep breath.

I apologize if you took offense from my previous post; none was meant. It's obvious to me that we both have trouble understanding each other. Let's at least try to keep the tone more civil so this conversation remains interesting and relevant.

And now back to the technical bits...

Kevin McGuire wrote:each individual processor in a SMP UMA system can have their own thread container. When migrating a thread you do a test and set (xchg) on a field in each processor's thread container (at the worst case scenario). I said test and set not sit and spin. If you test and set with success then you add a entry for this thread to another field. When the processor does a task switch it checks the migrating thread field and adds any threads added there. Eliminating the need for a ""spinlock"" as far as the implementation of scheduling interaction between processors goes.

Yes. We can summarize this kind of technique as "shared lock-free queues".

As for the original matter which is a memory manager I still do not see you making a point as to why we have to use a "spinlock" which is interpreted by me as a non-sleeping spinlock just for the sake of clarification which was the original matter.

I have a hard time parsing that sentence, so I'll just try explaining where my comment came from. bewing made a comment about spinlocks being wasteful of CPU time and something to be avoided, which I actually agree with in principle. I then gave my opinion that I believe spinlocks are unavoidable in the general case of implementing an OS' memory manager, because it is very difficult to devise lock-free algorithms to update complex shared data structures atomically.

If I understand you correctly, you're proposing one of two things:

Memory management data structures can be kept simple enough that lock-free algorithms will suffice (I interpret your discussion of thread queues to be an analogy in this case), or
You can implement semaphores or mutexes using lock-free techniques

I apologize in advance if point 1 is a straw-man -- I am only guessing that this might be what you're trying to say. I am skeptical that updating several complex data structures atomically can be avoided in the implementation of a MM, if that MM is going to be scalable and reasonably fully-featured. I admit that this is more of a hunch based on my own experience designing my MM, and would be difficult to prove either way. If anybody finds a paper on lock-free MMs, let me know.

Point 2 is more interesting. Since mutexes and semaphores come down to queues and counters, it's not hard for me to believe that you could indeed implement them using only lock-free techniques ("lock-free" locks... how ironic

). While an interesting idea, I defer to Brendan's post for practical arguments on why using blocking locks in low-level MM code might not be such a good idea.

bewing · Post by **bewing** » Fri Jun 08, 2007 8:39 pm

Crazed123 wrote:So then how do the other threads communicate with the kernel thread? In most systems, IPC entails dropping into kernel-space.

Method 1:
The kernel thread, being a kernel thread, has access to all physical memory. There is a pool of memory mapped in userspace for an "IPC write queue", that was created by the kernel or a trusted IPC manager, and the kernel knows its location. As I said for my system, I use funnels for this sort of thing. The userspace threads do not access the queue memory space directly -- they use shared memory to send their IPC message to a "trusted" thread in userspace (i.e. written by ME) that copies the message into the IPC write queue.
When it runs, the kernel thread handler for IPC messages slurps in the queue, resets it, and dispatches all the messages. Each "ring" has its own IPC write queue, and the kernel IPC thread knows the locations of each of them.

Method 2:
Each userspace thread has a small "system request" structure, allocated for them at the time of their creation. When they want to send a message to a kernel process, they write it to the structure, and set a "service me" flag in a well known memory location, then perhaps block, waiting for completion. During its functional timeslice, the kernel thread polls however many "service me" memory locations you choose to create -- uses that to find all threads requesting system services, and processes all the requests all at once.

Method 3:
Each core still needs a scheduler running in ring 0 (which would be a trusted program, but technically not "part" of the kernel), and part of its function could be to forward IPC messages and system requests to the kernel, using a trusted queue system.

Brendan · Post by **Brendan** » Sat Jun 09, 2007 12:37 am

Hi,

bewing wrote:
Crazed123 wrote:So then how do the other threads communicate with the kernel thread? In most systems, IPC entails dropping into kernel-space.
Method 1:
The kernel thread, being a kernel thread, has access to all physical memory. There is a pool of memory mapped in userspace for an "IPC write queue", that was created by the kernel or a trusted IPC manager, and the kernel knows its location. As I said for my system, I use funnels for this sort of thing. The userspace threads do not access the queue memory space directly -- they use shared memory to send their IPC message to a "trusted" thread in userspace (i.e. written by ME) that copies the message into the IPC write queue.
When it runs, the kernel thread handler for IPC messages slurps in the queue, resets it, and dispatches all the messages. Each "ring" has its own IPC write queue, and the kernel IPC thread knows the locations of each of them.

The data is stored in memory, then there's a thread switch, then the data is copied to different memory, then there's another thread switch and the kernel finally does something, and there's yet another thread switch to get back where we started?

Now consider a computer with 8 CPUs, where 7 of those CPUs are trying to send and receive messages to/from each other and the kernel. Do you honestly think your single "trusted" thread in userspace (the one that inserts messages into the kernel's write queue) and your single kernel thread aren't going to be huge bottlenecks? You'll have about half the CPUs doing nothing while they wait for requests to be handled by the kernel thread. Perhaps you could have a "trusted" thread and a kernel thread for each CPU to avoid this bottleneck, but (assuming the kernel thread actually *does* something, which after reading about "Method 3" I'm not too sure of) you'll have problems when 2 or more kernel threads try to modify the same piece of kernel data at the same time.

In any case, I'm guessing this is going to cost a little more than the SYSCALL instruction would (with or without a few spinlocks thrown in)...

bewing wrote:Method 2:
Each userspace thread has a small "system request" structure, allocated for them at the time of their creation. When they want to send a message to a kernel process, they write it to the structure, and set a "service me" flag in a well known memory location, then perhaps block, waiting for completion. During its functional timeslice, the kernel thread polls however many "service me" memory locations you choose to create -- uses that to find all threads requesting system services, and processes all the requests all at once.

The kernel thread continually burns CPU cycles while waiting for something that may or may not happen sooner or later (polling)? At least with spinlocks you know that the lock will become free soon, and you aren't wasting cycles for something you hope might happen.

I'm not sure if you're planning on wasting an entire CPU on this polling (to reduce the time other CPUs need to stand around waiting before the kernel thread gets around to checking the "service me" flags), or if you're planning on wasting a lot of CPU time doing thread switches between wasting time polling (so that an entire CPU isn't wasted). I guess for single-CPU systems you'd have no choice.

bewing wrote:Method 3:
Each core still needs a scheduler running in ring 0 (which would be a trusted program, but technically not "part" of the kernel), and part of its function could be to forward IPC messages and system requests to the kernel, using a trusted queue system.

Rather than going to the shop to buy something you could ask a guy who once met another guy who's related to someone that's got the phone number for a person who works at the shop, and they could ask for them to deliver it, and hopefully some time next week they'll drop it around and you won't need worry about the small chance of needing to wait at the checkout! Is that the sort of thing you mean by "a trusted queue system"?

I'm sorry if all this sounds a bit harsh - you are correct in that it would be theoretically possible to write an OS without any locking at all. I just don't think you understand how impractical it is...

A spinlock takes about 5 cycles to acquire a lock that is free. To acquire a lock that isn't free, it'll take about 5 cycles plus the remainder of the critical section that the lock holder hasn't executed yet (an average of half the length of the critical section). For an example, if the critical section is 500 cycles long, and the lock is not free 10% of the time (i.e. it's under heavy contention), then it works out to an average cost of (roughly) "0.9 * 5 + 0.1 * (5 + 500 /2)" cycles, or about 30 cycles.

To avoid this "insane" waste of CPU time, you end up with thread switches that cost in the order of 5 to 500 times more CPU time. It doesn't make sense.

Cheers,

Brendan

Kevin McGuire · Post by **Kevin McGuire** » Sat Jun 09, 2007 6:21 am

I can not understand (0.9 * 5). Why are you taking a percent?

On a SMP UMA where a spinlock does not yield the following condition could be used as a rule of thumb for when to sleep and when to spin in regard to performance (which was never my regard until I saw the funny math equation).
X is the cost for the spinlock exit instruction sequence.
E is the cost for the spinlock test and spin.
C is the protected instruction block cost.
T is the thread's remaining time slice.
S is the cost of a context switch.
((E + X + C/2) < T) && (2S > T)

Yes. We can summarize this kind of technique as "shared lock-free queues".

No. It is a lock. Just not a spinlock. That was my entire point from the beginning. You do not call everything a spinlock since it is quite confusing when people read that..
Quote: "You always need spinlocks in the implementation of an OS' memory manager, unless someone very clever can think of lock-free algorithms for everything (which I doubt is possible)."
(spinlock || lockfree)
No. Take for instance a uniprocessor which could completely disable interrupts. I will specifically note that you spoke of a memory manager not a transactional memory implementation. I think this was confusing and wrong. There are locks other than a spinlock -- the word used was not concise enough and would be confusing to someone reading your post.

Brendan · Post by **Brendan** » Sat Jun 09, 2007 9:22 am

Hi,

Kevin McGuire wrote:I can not understand (0.9 * 5). Why are you taking a percent?

It's fairly simple - the average total cost of any lock would be the chance of no lock contention multiplied by the cost of acquiring the lock when there's no contention, plus the chance of lock contention multiplied by the cost of acquiring the lock when there is contention (plus the cost of freeing the lock).

Kevin McGuire wrote:On a SMP UMA where a spinlock does not yield the following condition could be used as a rule of thumb for when to sleep and when to spin in regard to performance (which was never my regard until I saw the funny math equation).
X is the cost for the spinlock exit instruction sequence.
E is the cost for the spinlock test and spin.
C is the protected instruction block cost.
T is the thread's remaining time slice.
S is the cost of a context switch.
((E + X + C/2) < T) && (2S > T)

If a time slice is 10 ms (common for both Windows and Linux) then T would be about 10 million cycles for a 2 GHz CPU. If "E + X + C/2" is not less than 10 million cycles, or if "2S" is more than 10 million cycles, then you should become Amish and vow to never touch any device that uses electricity again.

This leaves us with the equivelent of "(several hundred < several million) && (several hundred > several million)". I'm not sure if that means "always use a spinlock" or if it means "always use a mutex" though...

Cheers,

Brendan

bewing · Post by **bewing** » Sat Jun 09, 2007 10:51 am

Heh. Hi Brendan!

1) As you noticed, I was putting down extremely conceptually simple methods of achieving the stated goal, as a proof of concept -- not as a suggestion for a highly optimized solution.

2) Yes, I suspect that trading a set of rep movsd's for all your thread synchronization, locking, blocking, and spinlocks would probably come out in my favor, in terms of cpu cycles. The only way to find out is to try it and see.

3) On a core with 1000 threads, each thread spends 99.9% of it's time "sitting, doing nothing." That is, suspended. That is the time when all its system service calls can be efficiently asynchronously handled. No matter how it gets handled, if the request is completed by the time the thread's next timeslice comes around, there is no bottleneck at all.

4) Mutexes at the ends/beginnings of timeslices, and on system requests (that cannot be completed instantly in any case) are perfectly normal operation -- and it is not reasonable to be counting them as overhead for a particular methodology. All the thread switches you mentioned were of those types.

5) My hypothesis is that every core ALWAYS has at least one "ready to compute" thread in its tasklist at all times, so no core will ever be sitting, waiting for the completion of at least one system request -- to unblock at least one of its threads.

6) On a system with 8 cores, if processing system requests averages more than a couple percent of the total available CPU cycles -- then you've got a serious problem with the efficiency of your API. Using my technique of restricting the kernel to one core, I'd never expect the kernel threads to take more than half the cycles on that core (at peak), and usually less than a percent, I'd guess. No CPU or set of threads should ever send THAT many messages, of that length.

7) I did not say "polling LOOP" in Method 2. It was a single poll over perhaps 32 bytes of memory. Compiled efficiently, it could be done in 16 opcodes, and one cache fill.

Colonel Kernel · Post by **Colonel Kernel** » Sat Jun 09, 2007 11:00 am

Kevin McGuire wrote:I can not understand (0.9 * 5). Why are you taking a percent?

The word I would use to describe the 0.9 and 0.1 in Brendan's equation is "probability".

No. It is a lock. Just not a spinlock.

I was replying specifically to this part of your explanation:

When migrating a thread you do a test and set (xchg) on a field in each processor's thread container (at the worst case scenario). I said test and set not sit and spin. If you test and set with success then you add a entry for this thread to another field.

This "test and set" sequence, also known as "compare-and-swap", is the building block of all lock-free techniques for atomically updating shared memory. I realize that your intention was to show how to implement mutexes or semaphores using this technique, which I hadn't considered before. To me, these two things exist at different levels of abstraction. At a high level, you've implemented a kind of lock (not a spinlock, I agree). At the low level, where you're managing concurrent access to the queue that represents that lock, you're using compare-and-swap, which is a lock-free technique. I hope this clears up the terminology.

That was my entire point from the beginning. You do not call everything a spinlock since it is quite confusing when people read that..

What you described is definitely not a spinlock, I agree. I was not claiming that it was. I was ignoring mutexes and semaphores however, because 1) all the implementations of mutexes and semaphores that I'd seen so far rely on spinlocks (but as you pointed out, this is not the only way to do it), and 2) Like Brendan, I believe it is impractical to use locks that block the acquiring thread in performance-sensitive parts of a kernel, like a memory manager.

Quote: "You always need spinlocks in the implementation of an OS' memory manager, unless someone very clever can think of lock-free algorithms for everything (which I doubt is possible)."
(spinlock || lockfree)
No. Take for instance a uniprocessor which could completely disable interrupts.

I mentioned that already too:

At the lowest level of implementation, there are shared data structures, and you have to protect them from concurrent access. Fundamentally, there are only three ways to do this -- disable interrupts (which only works on a uniprocessor machine), use a lock-free algorithm (for something simple like a queue, this is feasible, but is not going to be feasible in the general case), or use a spinlock. All higher-level abstractions for safe concurrent access to shared data are built up from these primitives.

How many uniprocessor machines do you think will be in widespread use in 3-5 years? Even the QNX real-time OS is designed for use with SMP, so it's not just in desktops and servers where uniprocessors are disappearing. In other words, I don't think it's practical anymore to rely on disabling interrupts.

I will specifically note that you spoke of a memory manager not a transactional memory implementation. I think this was confusing and wrong.

The discussion started (wayyyy back) with Brendan proposing STM implemented in the OS' memory manager. The paper on XTM also discussed this technique. Page-based STM involves managing transaction read and write sets via the paging mechanism, so it makes sense to make it part of the MM. That's why I've been discussing locking in the MM -- because Brendan was talking about OS-level STM support when he mentioned using a spinlock.

There are locks other than a spinlock -- the word used was not concise enough and would be confusing to someone reading your post.

True, there are locks other than a spinlock. From my point of view, spinlocks and other types of locks exist at different levels of abstraction, as I explained above. I'll try to elaborate further.

I think the easiest way to make the distinction is to consider which part of a thread's execution you think of as the "critical section". Consider a mutex implemented as a queue of waiting threads protected by a spinlock. Imagine that a user-mode thread wants to enter a critical section because it's going to access memory in its process' address space that is shared with other threads of that process. From that thread's point of view, the mutex is the lock.

Now look at it from the kernel's point of view. When the user-mode thread makes the "Mutex_lock" system call, it's now kernel code that is running on that thread. That kernel code needs to access a data structure (the waiting thread queue of the mutex) that is shared by potentially all threads in the system, since it is in the kernel's address space. So from the kernel's point of view, its own implementation of "Mutex_lock" is another critical section. In this example, the kernel protects that critical section with a spin-lock. It could use "compare-and-swap" techniques instead, as you say.

This is what I mean by different locks being at different levels of abstraction. The mutex is a lock, but it is not a primitive mechanism -- it is further composed of other mechanisms (queue of waiting threads plus a means to protect it).