OS Support For Transactional Memory

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Post by Colonel Kernel »

Brendan wrote:Their results show that (for their implementation) pure page-based STM is between 3 and 8.3 times slower than XTM with hardware support, where the difference depends on how much overflows the hardware (page-based STM is 8.3 times slower than "HTM with no need for virtualization").

If you assume that "HTM with no need for virtualization" is as fast as we could reasonably expect, then IMHO 8.3 times slower is quite acceptable considering that HTM can't exist until CPU manufacturers implement it while page-based STM can exist now.
Now it just comes down to practical arguments. ;) With a nearly order-of-magnitude performance penalty, I'd say this "unaccelerated" page-based TM is only useful in some situations -- not in all the situations where you would otherwise use locks.

You can go ahead and implement it though -- then BCOS can be the first OS to support the fancy new HTM hardware when it finally hits the market. ;)
@ruibo: Thanks - it's a nice find! ;)
I second that -- thanks!
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Post by Brendan »

Hi,
Colonel Kernel wrote:If you assume that "HTM with no need for virtualization" is as fast as we could reasonably expect, then IMHO 8.3 times slower is quite acceptable considering that HTM can't exist until CPU manufacturers implement it while page-based STM can exist now.
Now it just comes down to practical arguments. ;) With a nearly order-of-magnitude performance penalty, I'd say this "unaccelerated" page-based TM is only useful in some situations -- not in all the situations where you would otherwise use locks.

At the moment (given that my primary concern is making parallel programming easier rather than performance), my only remaining reservation is how useful page-based STM (and transactional memory in general) would be for my OS (rather than how useful it'd be for OSs in general).

My original intent (for as much as is practical) was to rely on non-blocking message passing where each piece of data is owned by a single thread and is never accessed by other threads (with enforcement provided by the OS in the form of a relatively large amount (>= 1 GB) of thread-specific space). In this case the need for other forms of re-entrancy protection is reduced (as compared to more traditional systems), and providing support for both traditional locking and STM for any modifiable data left in "process space" may be overkill.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Post by Colonel Kernel »

Brendan wrote:My original intent (for as much as is practical) was to rely on non-blocking message passing where each piece of data is owned by a single thread and is never accessed by other threads (with enforcement provided by the OS in the form of a relatively large amount (>= 1 GB) of thread specific space).
I forgot about BCOS' "thread-space". Message passing definitely makes more sense in that context.

Ultimately, TM is a way to let us continue writing bad code (i.e. code that mutates global state); it just shifts the "badness" around by making the overall system more complex. :P
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Post by Candy »

Colonel Kernel wrote:
Brendan wrote:My original intent (for as much as is practical) was to rely on non-blocking message passing where each piece of data is owned by a single thread and is never accessed by other threads (with enforcement provided by the OS in the form of a relatively large amount (>= 1 GB) of thread specific space).
I forgot about BCOS' "thread-space". Message passing definitely makes more sense in that context.

Ultimately, TM is a way to let us continue writing bad code (i.e. code that mutates global state); it just shifts the "badness" around by making the overall system more complex. :P
Hate to hijack the thread, but since it has slowed and this is on topic: what about using join calculus instead? That is, explicitly defining a function with multiple entry points, where each entry point needs to be called at least once for a specific one to complete, at which point that function receives all the arguments and returns a value. The other calls return instantly (a post-message kind of thing) while that one function is synchronous. There's been talk of it on the Boost mailing list recently.
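A join pattern of this sort can be sketched at user level. The following is a minimal, hypothetical `Chord` class using Python threads (the real Cω chords are compiled language constructs, not a library, so take this purely as an illustration of the idea):

```python
import queue
import threading

class Chord:
    """A two-entry-point join: async_put() posts a message and returns
    immediately; sync_get() blocks until a matching async_put() has
    occurred, then runs the body with both arguments."""

    def __init__(self, body):
        self.body = body
        self.msgs = queue.Queue()      # pending asynchronous calls

    def async_put(self, x):
        # Asynchronous entry point: post and return at once.
        self.msgs.put(x)

    def sync_get(self, y):
        # Synchronous entry point: block until a partner call arrives,
        # then the body sees all the arguments and returns a value.
        x = self.msgs.get()
        return self.body(x, y)

# Usage: a producer posts a value asynchronously; the synchronous
# caller joins on it and completes the chord.
pair = Chord(lambda a, b: a + b)
threading.Thread(target=lambda: pair.async_put(40)).start()
print(pair.sync_get(2))   # blocks until the async call arrives, then prints 42
```

The thread-safe `queue.Queue` does the synchronization implicitly, which is the point Candy makes below: no explicit mutexes appear in the calling code.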
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Post by Colonel Kernel »

Candy wrote:Hate to hijack the thread, but since it has slowed and this is on topic: what about using join calculus instead? That is, explicitly defining a function with multiple entry points, where each entry point needs to be called at least once for a specific one to complete, at which point that function receives all the arguments and returns a value. The other calls return instantly (a post-message kind of thing) while that one function is synchronous. There's been talk of it on the Boost mailing list recently.
I'd like to see such a library available for C++, if it's easier to use than Boost Lambda and Boost Bind. ;)

I believe the parallel features of C Omega are based on the join calculus. It has language constructs called "chords" that represent the different "entry points" to a function. It's pretty neat, actually.
Crazed123
Member
Posts: 248
Joined: Thu Oct 21, 2004 11:00 pm

Post by Crazed123 »

Couldn't we represent that kind of thing very easily using threads?

Thread A waits for 4 messages. This makes it block.
Threads B, C, D, and E send messages at different times. Depending on the design of the system (synchronous or asynchronous) this can make them block.
Once the kernel detects one message waiting for delivery for each of the messages thread A wants, it unblocks thread A and delivers the messages when thread A runs. In a synchronous system, it also unblocks B, C, D, and E at this point.
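The scheme above can be sketched in user space, with Python threads and a shared queue standing in for kernel message delivery (all names here are illustrative, not part of any real kernel API):

```python
import queue
import threading

inbox = queue.Queue()   # stands in for thread A's kernel message queue

def sender(name, delay):
    # B, C, D, and E each post one message at a different time and
    # return immediately (the asynchronous case).
    threading.Timer(delay, lambda: inbox.put(name)).start()

def receiver(results):
    # Thread A blocks until all four messages have been delivered;
    # each get() blocks only as long as it needs to.
    got = [inbox.get() for _ in range(4)]
    results.extend(sorted(got))

results = []
a = threading.Thread(target=receiver, args=(results,))
a.start()
for i, name in enumerate("BCDE"):
    sender(name, 0.01 * i)
a.join()
print(results)   # → ['B', 'C', 'D', 'E']
```

A real kernel would perform the "four messages pending" check itself and only then make A runnable; here the blocking `get()` calls approximate that.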
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Post by Colonel Kernel »

Crazed123 wrote:Couldn't we represent that kind of thing very easily using threads?

Thread A waits for 4 messages. This makes it block.
Threads B, C, D, and E send messages at different times. Depending on the design of the system (synchronous or asynchronous) this can make them block.
Once the kernel detects one message waiting for delivery for each of the messages thread A wants, it unblocks thread A and delivers the messages when thread A runs. In a synchronous system, it also unblocks B, C, D, and E at this point.
That is more or less what goes on in C Omega, except that there is language support for expressing the pattern of messages that A is waiting for.
Crazed123
Member
Posts: 248
Joined: Thu Oct 21, 2004 11:00 pm

Post by Crazed123 »

C Omega looks like yet another cool thing invented by Microsoft Research from which the rest of us will never see the full benefits -- kind of like Singularity.

Then again, it is based on the heathen language C#, which M$ created by extending C with features stolen from Delphi.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Post by Candy »

Crazed123 wrote:Couldn't we represent that kind of thing very easily using threads?

Thread A waits for 4 messages. This makes it block.
Threads B, C, D, and E send messages at different times. Depending on the design of the system (synchronous or asynchronous) this can make them block.
Once the kernel detects one message waiting for delivery for each of the messages thread A wants, it unblocks thread A and delivers the messages when thread A runs. In a synchronous system, it also unblocks B, C, D, and E at this point.
The point is that the other four threads aren't blocked, and that the synchronization is entirely implicit in the code -- you don't have to state explicitly what you're doing in terms of mutexes and so on; that's done for you.
Crazed123
Member
Posts: 248
Joined: Thu Oct 21, 2004 11:00 pm

Post by Crazed123 »

I think whether the four threads block in Comega depends on whether or not you declared the applicable method of the chord async.

Also, if one implemented the blocking/non-blocking of sending threads in the kernel, could one implement mutual exclusion via the message-passing system? If one could, it would be nice to write programs in terms of synchronizing IPC and let the operating system handle the down-and-dirty mutual exclusion work internally.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Post by Brendan »

Hi,
Crazed123 wrote:Also, if one implemented the blocking/non-blocking of sending threads in the kernel, could one implement mutual exclusion via the message-passing system? If one could, it would be nice to write programs in terms of synchronizing IPC and let the operating system handle the down-and-dirty mutual exclusion work internally.
You could implement mutual exclusion via message passing. If there are only two threads involved, a thread can send a "give me the lock" message to the other thread if it doesn't already hold the lock. You could also implement a "lock manager" that accepts "give me lock X" and "I'm finished with lock X" messages.
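A lock manager of this sort can be sketched with Python threads, where a queue stands in for the message-passing system and the manager thread serializes the "give me lock X" / "finished with lock X" messages (the message format and names are made up for illustration):

```python
import queue
import threading

requests = queue.Queue()   # all messages to the lock-manager thread

def lock_manager():
    # Serves acquire/release messages; never touches a real mutex.
    held = {}                              # lock name -> list of waiters
    while True:
        op, name, reply = requests.get()
        if op == "stop":
            return
        if op == "acquire":
            waiters = held.setdefault(name, [])
            waiters.append(reply)
            if len(waiters) == 1:          # lock was free: grant at once
                reply.put("granted")
        elif op == "release":
            waiters = held[name]
            waiters.pop(0)                 # drop the current holder
            if waiters:                    # hand the lock to the next waiter
                waiters[0].put("granted")

def acquire(name):
    reply = queue.Queue()
    requests.put(("acquire", name, reply))
    reply.get()                            # block until the manager grants it

def release(name):
    requests.put(("release", name, None))

mgr = threading.Thread(target=lock_manager)
mgr.start()

counter = 0
def worker():
    global counter
    for _ in range(1000):
        acquire("X")
        counter += 1                       # critical section
        release("X")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
requests.put(("stop", None, None))
mgr.join()
print(counter)   # → 4000 (no lost updates)
```

Mutual exclusion falls out of the fact that only the manager thread ever inspects the lock state, which is exactly the "shift all operations on a piece of data into one thread" pattern Brendan describes next.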

However, in my experience software that's designed for message passing never needs mutual exclusion. It's a bit hard to explain, but instead of asking yourself "how do I protect FOO from multiple threads" you end up asking yourself "how do I split FOO across multiple threads so no data is shared".

In general I use one of two approaches -- "pipelining" (where each thread does some processing and sends its results to the next thread in the pipeline), or "encapsulation" (where all operations that affect a piece of data are shifted into a separate thread, and other threads ask this thread to perform any operations on the data instead of doing the operations themselves).

Pipelining works well if you're converting data from one or more formats into one or more other formats (I'm currently messing with a compiler/assembler implemented as 5 or more stages). Encapsulation works well when pipelining isn't suitable.
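The pipelining approach can be sketched as threads connected by queues, with `None` as an assumed end-of-stream marker (the stage functions here are trivial placeholders):

```python
import queue
import threading

def stage(fn, inbox, outbox):
    # A pipeline stage: read an item, process it, forward the result.
    # A None marker means end-of-stream; forward it and stop.
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(fn(item))

# Two processing stages wired together with queues:
# raw input -> strip whitespace -> uppercase -> collected output.
q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(str.strip, q1, q2)),
    threading.Thread(target=stage, args=(str.upper, q2, q3)),
]
for t in threads:
    t.start()

for line in ["  hello ", " world  "]:
    q1.put(line)
q1.put(None)                # signal end of stream

out = []
while (item := q3.get()) is not None:
    out.append(item)
for t in threads:
    t.join()
print(out)   # → ['HELLO', 'WORLD']
```

Each stage owns only the item it is currently processing, so no data is shared and no locking is needed, at the cost of queueing between stages.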

For an example, consider a word-processor. At the centre of the word-processor you might have a "document thread" that stores the current document and performs all operations on that document. You could then use pipelining when a file is opened, with one thread that reads the data from disk and sends its output to a conversion thread, which converts the data into a common format and sends it to the document thread. The same can happen when a file is saved (the document thread sends the data to a conversion thread, which sends it to a "write file/s" thread).

Then you'd have a "user interface" thread that interacts with the document thread, where the user interface thread asks the document thread to do modifications and searches, and the document thread sends data that should be displayed to the user interface thread.

If the document thread becomes a bottleneck, you could split it into several threads (e.g. one document thread per 1000 lines). If you want a multi-user word-processor you can use multiple user interface threads (I've done this before).

With non-blocking messaging it's possible to have everything happening at the same time -- e.g. while a document is being loaded from disk it can simultaneously be converted into a common format, inserted into the document, displayed and modified by the user as it becomes available, converted into another format, and saved to disk.

All of this can be done without any mutual exclusion (or traditional locking of any kind).


Cheers,

Brendan
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Post by Candy »

Uhm... yeah, I've been trying to get that kind of processing done too. I have a demo program (which I can't demo, since it's proprietary) that decodes network traffic fully in parallel as described here -- using forwarding and messages, with each protocol as a single decoder. It sends the output to a number of windows which then display what the data means. The net result is a very scalable program that can do anything with your data simultaneously, coming from half a dozen sources in two dozen different protocols in any kind of stack. It can even do ad-hoc packet mangling: connect it to two serial ports and it mangles the traffic in between while forwarding, so you can test how your stack behaves on bad cables -- with the badness of the cables being configurable.
Crazed123
Member
Posts: 248
Joined: Thu Oct 21, 2004 11:00 pm

Post by Crazed123 »

Brendan: The encapsulation concept sounds conspicuously like object-oriented programming with one thread per object.
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Post by Colonel Kernel »

Crazed123 wrote:Brendan: The encapsulation concept sounds conspicuously like object-oriented programming with one thread per object.
Otherwise known as the "actor" model.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Post by bewing »

In the olden days (the 1970s) it was called a "funnel" model. I am using generalized funnels throughout my OS and applications to completely avoid locking under all circumstances. All this newfangled stuff sounds too complex to me -- especially as Brendan admits he still needs a spinlock (gah! one of the most disgusting CPU-cycle-wasting misinventions in history!) to make his STM stuff work. Otherwise, the STM sounds OK to me. But funnels still sound better, and those godawful locks still sound like the worst.