Machine identification

Hyperdrive
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Hi,
JackScott wrote:I have a quick question. Does this work via 56K modems, or do you require more like 10-gigabit fibre optic ethernet to do this? I'm suspecting the latter. If you have time (and are able to legally), an overview of the hardware you're planning to use would be quite interesting.
Just common hardware. Our cluster machines have...

CPU: AMD Athlon 64 X2 4200+
mainboard: MSI K8T Neo2 V2.0
graphics: ATI Radeon 9200 PRO
NIC: Intel 82540EM Gigabit Ethernet (Intel E1000)
RAM: 2 GB (I think)

And our network is switched gigabit ethernet.

[EDIT: mainboard corrected]
[EDIT 2: RAM added]

EDIT 3: lspci scan now attached

--TS
Attachments
lspci.txt (lspci scan)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Machine identification

Post by Brendan »

Hi,
Hyperdrive wrote:well let me put some things straight here. I hope explaining the concepts makes everything a bit clearer.
Thanks for that!
Hyperdrive wrote:Questions?
Yes - scalability and performance?

I've read research papers on (non-distributed) transactional memory and read through discussions from (what I considered) experts, and it all gave me the same impression - with any collisions performance goes bad fast (work lost due to roll-backs), well-designed explicit locking can always give better scalability/performance, and the main (only?) advantage of transactional memory is that it makes it easier for someone who doesn't necessarily have the skills or time for explicit locking to write correct code. Once you add "distributed" into it, the latencies for page fetch, validation, etc. increase, the window of time in which a collision can occur widens, and the chance of collisions increases.

Research papers that compare distributed shared memory to message passing also indicate that distributed shared memory is slower than message passing, but systems that use distributed shared memory can be easier to program for.

Basically to me it looks like you've done everything possible to make programming easier, which is why I'm curious about the scalability/performance...
Hyperdrive wrote:In traditional approaches for distributed applications you would pass messages around over the network. The problem is, the larger your application becomes, the more complex your protocol and the application implementation will be.
In all cases, a sane programmer will split a large/complex application into small/simple pieces. For message passing, each small/simple piece has its own small/simple protocol - it only becomes complex if you fail to split it into manageable pieces. However, this depends a lot on what sort of messaging - asynchronous messaging is more complex than synchronous messaging (but has better performance/scalability).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Hyperdrive
Member
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Brendan wrote:I've read research papers on (non-distributed) transactional memory and read through discussions from (what I considered) experts, and it all gave me the same impression - with any collisions performance goes bad fast (work lost due to roll-backs), well-designed explicit locking can always give better scalability/performance
[EDIT]
For non-distributed environments you may be right. But when it comes to distribution I'm not sure that's true. Locking in distributed systems isn't that easy.
[/EDIT]
Brendan wrote:the main (only?) advantage of transactional memory is that it makes it easier for someone who doesn't necessarily have the skills or time for explicit locking to write correct code. Once you add "distributed" into it, the latencies for page fetch, validation, etc. increase, the window of time in which a collision can occur widens, and the chance of collisions increases.

Research papers that compare distributed shared memory to message passing also indicate that distributed shared memory is slower than message passing, but systems that use distributed shared memory can be easier to program for.
That's entirely true.
Brendan wrote:Basically to me it looks like you've done everything possible to make programming easier, which is why I'm curious about the scalability/performance...
Around 80% of the IT costs in a company are due to buggy software. (That's just my estimate - I don't have exact figures at hand, but see for example http://www.nist.gov/public_affairs/releases/n02-10.htm?). Bugs come, among other things, from overly complex software. And distributed environments are even more complex. So the main idea is to reduce complexity, i.e. making programming easier. By reducing complexity for the programmer, it's very likely we'll see a significant decrease in bug counts. Lowering bug counts means less money wasted.

The other way around -- increasing performance gets harder the more you do it. It's relatively easy to get a 30% increase when your application is performing at 50% of the theoretical maximum. But it's incredibly hard to squeeze another 1% out of your application when it's performing at 97% of the theoretical max. Besides, an increase of performance saves you less money than eliminating bugs.

Bottom line: Good software is better than good performance. So making programming easier is better than pushing the envelope in terms of performance.
Brendan wrote:
Hyperdrive wrote:In traditional approaches for distributed applications you would pass messages around over the network. The problem is, the larger your application becomes, the more complex your protocol and the application implementation will be.
In all cases, a sane programmer will split a large/complex application into small/simple pieces. For message passing, each small/simple piece has its own small/simple protocol - it only becomes complex if you fail to split it into manageable pieces. However, this depends a lot on what sort of messaging - asynchronous messaging is more complex than synchronous messaging (but has better performance/scalability).
The pieces themselves are manageable. But by splitting you often increase the need for communication in distributed environments. If your application is growing, it gets harder to do it right. Which component needs data? Which data? When? How fast (timeouts)? How is it reachable? And even if you can answer all that, can you guarantee that the whole system has a consistent view?

Take WoW for example, which uses message passing. Why did they choose a client/server approach? Every avatar movement is reported to the server and other clients rely on the view the server provides. They could have chosen to do it more like peer-to-peer, but they didn't. Why? It's just too hard. Maintaining consistency with message passing is incredibly hard, and implementing it is very error-prone.

It's not fun if you want to ax your opponent but your application wasn't notified that the opponent ran away. It's no fun if you team up and the great battle is starting, but everyone sees a different scenario: Fred sees three opponents to the left while Alex sees only a big fat monster to the right. (Note: I really have no clue about WoW gaming - I just hear every now and then about "the adventures" of others. But I hope you get my point.)

--TS
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Machine identification

Post by Love4Boobies »

Hyperdrive, are you aware of this document?
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Machine identification

Post by Brendan »

Hi,

First, please don't take my reply literally! Let me play "the Devil's advocate" for a minute - hopefully you'll be able to shoot down my arguments with ease...
Hyperdrive wrote:
Brendan wrote:I've read research papers on (non-distributed) transactional memory and read through discussions from (what I considered) experts, and it all gave me the same impression - with any collisions performance goes bad fast (work lost due to roll-backs), well-designed explicit locking can always give better scalability/performance
[EDIT]
For non-distributed environments you may be right. But when it comes to distribution I'm not sure that's true. Locking in distributed systems isn't that easy.
[/EDIT]
For message passing, with careful design no locks are needed. Basically each small/simple piece has exclusive access to its own data.

For an example, imagine something like a telephone directory (a database of names, with addresses and phone numbers). You could have one thread for each letter of the alphabet, so that all the names that begin with the letter 'A' are only ever handled by thread #0, names beginning with 'B' are handled by thread #1, etc. Then (if necessary) you have a main thread that receives requests and forwards them to the appropriate worker thread. For example, you might want to add an entry for "Fred", so the main thread forwards this request to worker thread #5, and worker thread #5 creates the new entry with no locking at all because thread #5 is the only thread that has access to the data.
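A minimal sketch of that per-letter dispatch (Python and all the names here are just for illustration, not any real implementation):

Code:

import queue
import string
import threading

# Each worker owns its slice of the directory; no locks are needed because
# only that worker ever touches its own dictionary.
class Worker(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()   # messages are the only way in
        self.entries = {}            # private data, never shared

    def run(self):
        while True:
            op, name, payload, reply = self.inbox.get()
            if op == "add":
                self.entries[name] = payload
                reply.put("ok")
            elif op == "lookup":
                reply.put(self.entries.get(name))

# "main thread": forward each request to the worker that owns that letter
workers = {letter: Worker() for letter in string.ascii_uppercase}
for w in workers.values():
    w.start()

def request(op, name, payload=None):
    reply = queue.Queue()
    workers[name[0].upper()].inbox.put((op, name, payload, reply))
    return reply.get()

request("add", "Fred", "555-0137")   # handled only by the 'F' worker
print(request("lookup", "Fred"))     # -> 555-0137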

That's only the beginning though - for messaging it's relatively easy to extend this. You could add some redundancy (have 2 threads for 'A', 2 threads for 'B'), but why not go further? Have N threads for each group of worker threads, where requests that modify data go to all threads in the group, and requests that only find/read data are sent to 2 of the threads in the group so that the results from these threads can be compared (to detect software/hardware glitches). If a glitch is found you could ask a third thread and find out which of the previous results wasn't right (and test the old/dodgy thread a little, terminate it if it keeps failing, and spawn a new thread to replace it). Then you could add load balancing to this - send requests to find/read data to the least recently used threads, and if they're all busy spawn some more threads to handle the load (and if they're all idle maybe terminate some spare threads). You can do all of this without any locking at all.
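A rough sketch of the "send reads to two threads and compare" part (again, Python, the names, and the made-up glitch rate are purely illustrative):

Code:

import random

# Replicas here are plain functions standing in for whole worker threads;
# the 30% glitch rate exists only to exercise the tie-breaking path.
def make_replica(data, flaky=False):
    def lookup(name):
        if flaky and random.random() < 0.3:   # simulated software/hardware glitch
            return "garbage"
        return data.get(name)
    return lookup

directory = {"Albert": "555-0101", "Alice": "555-0102"}
replicas = [make_replica(directory),
            make_replica(directory, flaky=True),
            make_replica(directory)]

def verified_lookup(name):
    a, b = replicas[0](name), replicas[1](name)
    if a == b:
        return a                  # the two replicas agree
    c = replicas[2](name)         # disagreement: ask a third replica
    return a if a == c else b     # majority wins; the odd one out is suspect

print(verified_lookup("Albert"))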

There are cases here that look like they need some locking. For example, if someone changes their name from "Albert" to "Bob" then you might want to delete an entry that starts with 'A' and create a new entry that starts with 'B', and make these changes appear atomic. In this case it's still simple (you'd do it in the main thread) - all you'd need is a flag for each group of worker threads that says whether requests are queued (postponed until the atomic operation completes) or forwarded. This isn't really a lock though (it's more like a "queue requests" flag) because only one thread ever touches it.
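Sketched out, the "queue requests" flag might look something like this (illustrative names only):

Code:

from collections import defaultdict, deque

queueing = defaultdict(bool)    # per-group "queue requests" flag
pending = defaultdict(deque)    # requests postponed while the flag is set

def forward(group, request):    # stand-in for sending to a worker thread
    print("forward", request, "to group", group)

def handle(group, request):
    # only the main thread runs this, so the flag is not a lock
    if queueing[group]:
        pending[group].append(request)
    else:
        forward(group, request)

def atomic_move(old_group, new_group, delete_req, add_req):
    queueing[old_group] = queueing[new_group] = True
    forward(old_group, delete_req)          # e.g. delete "Albert" from 'A'
    forward(new_group, add_req)             # e.g. add "Bob" to 'B'
    queueing[old_group] = queueing[new_group] = False
    for g in (old_group, new_group):        # flush whatever was postponed
        while pending[g]:
            forward(g, pending[g].popleft())

handle("A", "lookup Alice")
atomic_move("A", "B", "delete Albert", "add Bob")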

This should scale fairly well until the single main thread becomes a bottleneck. If that's likely you'd just have more main threads. In this case the main threads would need to synchronize things with each other (e.g. the atomic operations and load balancing would need to be coordinated, but normal requests could be done without any coordination).

Also, all of this could be implemented one step at a time. For example, start with the simple worker thread then build on that. This is made easier because the main thread/s don't need to care how the worker threads operate, and the worker threads don't need to care how the main thread/s operate. The only thing that matters is they talk the same protocol. Maybe this is just a simple example, but then maybe it's just a small piece of something larger, where the "something larger" only needs to talk the same protocol as the main thread/s and doesn't need to know anything about how the main thread/s operate (or that the worker threads even exist).

For the same thing with distributed transactional shared memory, if each computer shares the same data then you end up with false sharing, and even if you made everything 4 KiB large to avoid that then you'd still be constantly waiting for pages of data to be transferred over the network. Of course you could split things apart (just like with the messaging example) to avoid this, but then there's no need for any shared memory or any transactional memory in the first place (except for creating a "performance crippled" emulation of messaging). To make it worse, the first version of the code might be both easy to write and correct, but it'll also be unusable due to horrendous performance, and the programmer won't know what to do to fix it because the OS is hiding all the details (and eventually, after several rewrites, the unfortunate programmer will come to realise there's nothing they can do to make the performance acceptable, and that adding more hardware just makes things worse).
Hyperdrive wrote:
Brendan wrote:Basically to me it looks like you've done everything possible to make programming easier, which is why I'm curious about the scalability/performance...
Around 80% of the IT costs in a company are due to buggy software. (That's just my estimate - I don't have exact figures at hand, but see for example http://www.nist.gov/public_affairs/releases/n02-10.htm?). Bugs come, among other things, from overly complex software. And distributed environments are even more complex. So the main idea is to reduce complexity, i.e. making programming easier. By reducing complexity for the programmer, it's very likely we'll see a significant decrease in bug counts. Lowering bug counts means less money wasted.
I agree - using a large/expensive server instead of a distributed system would reduce complexity. Complexity is only needed for better performance (e.g. if the application needs more processing power than one expensive server can provide).

NIST takes a different approach - to reduce the cost of buggy software, make it easier to find bugs. From your own link:
NIST wrote:The path to higher software quality is significantly improved software testing. Standardized testing tools, suites, scripts, reference data, reference implementations and metrics that have undergone a rigorous certification process would have a large impact on the inadequacies currently plaguing software markets. For example, the availability of standardized test data, metrics and automated test suites for performance testing would make benchmarking tests less costly to perform. Standardized automated testing scripts, along with standard metrics, also would provide a more consistent method for determining when to stop testing.
Hyperdrive wrote:The other way around -- increasing performance gets harder the more you do it. It's relatively easy to get a 30% increase when your application is performing at 50% of the theoretical maximum. But it's incredibly hard to squeeze another 1% out of your application when it's performing at 97% of the theoretical max. Besides, an increase of performance saves you less money than eliminating bugs.
Except the performance problem is caused by the memory model used by the OS, which isn't something an application can fix. If an application programmer does everything they possibly can to improve performance but the application still only performs at 10% of theoretical maximum performance, then is it realistic to ask them to rewrite the OS and the application to squeeze another 85% out of it?
Hyperdrive wrote:Bottom line: Good software is better than good performance. So making programming easier is better than pushing the envelope in terms of performance.
For desktop systems, I agree completely. The type of distributed system you're designing isn't used for desktop systems - it's mainly used for HPC. On average, programmers that write software for desktop use are "average". Programmers that write software for large servers are typically "above average", and programmers that write software for HPC are "above above average" (in that they've been writing software that scales to > 1000 nodes for a decade or so already). I'm not convinced that these "above above average" programmers really need something that makes programming easier at the expense of performance.
Hyperdrive wrote:
Brendan wrote:
Hyperdrive wrote:In traditional approaches for distributed applications you would pass messages around over the network. The problem is, the larger your application becomes, the more complex your protocol and the application implementation will be.
In all cases, a sane programmer will split a large/complex application into small/simple pieces. For message passing, each small/simple piece has its own small/simple protocol - it only becomes complex if you fail to split it into manageable pieces. However, this depends a lot on what sort of messaging - asynchronous messaging is more complex than synchronous messaging (but has better performance/scalability).
The pieces themselves are manageable. But by splitting you often increase the need for communication in distributed environments. If your application is growing, it gets harder to do it right. Which component needs data? Which data? When? How fast (timeouts)? How is it reachable? And even if you can answer all that, can you guarantee that the whole system has a consistent view?

Take WoW for example, which uses message passing. Why did they choose a client/server approach? Every avatar movement is reported to the server and other clients rely on the view the server provides. They could have chosen to do it more like peer-to-peer, but they didn't. Why? It's just too hard. Maintaining consistency with message passing is incredibly hard, and implementing it is very error-prone.
Peer-to-peer only works if you can trust the peers, and for WoW there's probably plenty of people willing to hack the game's code to get an advantage. Client/server means the (trusted) server can double check everything that it receives from the (untrusted) clients to make sure nothing dodgy is going on.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Re: Machine identification

Post by Colonel Kernel »

There is one thing that I think is frequently missed in the argument between shared memory and message-passing: Message-passing leads to better encapsulation and loose coupling of components. Shared memory often (but not always) makes it all too easy to share global state, write code with hard-to-detect side effects, etc.

I have some (recent) experience working on a highly scalable system that basically uses both of these techniques, and indeed, the parts that rely on shared state are much harder to understand, debug, etc. I just fixed an integration bug the day before yesterday that was triggered by a seemingly unrelated change in another part of the system (can't go into details since it's proprietary :) ). This wouldn't have happened if certain parts of the global state weren't co-mingled the way they were. It's almost like what Brendan said about "false sharing", but in terms of correctness and maintainability instead of performance. :)
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
Hyperdrive
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Hi,
Brendan wrote:
Hyperdrive wrote:Locking in distributed systems isn't that easy.
For message passing, with careful design no locks are needed. Basically each small/simple piece has exclusive access to its own data.

For an example, imagine something like a telephone directory (a database of names, with addresses and phone numbers). You could have one thread for each letter of the alphabet, so that all the names that begin with the letter 'A' are only ever handled by thread #0, names beginning with 'B' are handled by thread #1, etc. Then (if necessary) you have a main thread that receives requests and forwards them to the appropriate worker thread. For example, you might want to add an entry for "Fred", so the main thread forwards this request to worker thread #5, and worker thread #5 creates the new entry with no locking at all because thread #5 is the only thread that has access to the data.
[snipped]
Here you have switched the approach. That's the very mistake that is often made: the propensity to centralize things. Your main thread is a central point, the coordinator. There's no true distribution. The positive effects you gain come from having a central coordinator, not from doing message passing smartly.

Also, your approach lacks the ability to have data "as close as possible". If someone requests the phone number for "Fred", they ask your main thread, wait, and get the data. Fine. But you can't cache the data, because Fred could have moved and now has a different phone number. Every time someone wants to call Fred (or whomever), they have to ask the database.

My approach is that any node could have any data at any time: replication and migration both allowed. Your approach allows neither of them.
Brendan wrote: This should scale fairly well until the single main thread becomes a bottleneck. If that's likely you'd just have more main threads. In this case the main threads would need to synchronize things with each other (e.g. the atomic operations and load balancing would need to be coordinated, but normal requests could be done without any coordination).
Ah, here we go... The moment you de-centralize you get into trouble. Now imagine you have n main threads in a system with n nodes, each node running one main thread...

Brendan wrote: Also, all of this could be implemented one step at a time. For example, start with the simple worker thread then build on that. This is made easier because the main thread/s don't need to care how the worker threads operate, and the worker threads don't need to care how the main thread/s operate. The only thing that matters is they talk the same protocol. Maybe this is just a simple example, but then maybe it's just a small piece of something larger, where the "something larger" only needs to talk the same protocol as the main thread/s and doesn't need to know anything about how the main thread/s operate (or that the worker threads even exist).
Brendan wrote: For the same thing with distributed transactional shared memory, if each computer shares the same data then you end up with false sharing,
Admittedly, false sharing is a severe problem.
Brendan wrote: and even if you made everything 4 KiB large to avoid that
That won't fix it in all cases, but that's another story...
Brendan wrote: then you'd still be constantly waiting for pages of data to be transferred over the network.
Why? You can replicate (i.e. cache) data. Only if the page was invalidated do you have to request it again. Writes happen less often than reads. And more often than not, only a few nodes write. So collisions are rare and mostly the data is cached.
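As a toy sketch of what I mean (REMOTE and fetch_page are made-up stand-ins for the real page transfer):

Code:

REMOTE = {"page_7": "directory entries A-C"}   # pretend this lives on another node

def fetch_page(page):
    print("network fetch of", page)
    return REMOTE[page]

class NodeCache:
    def __init__(self):
        self.local = {}

    def read(self, page):
        if page not in self.local:    # miss: one network round trip
            self.local[page] = fetch_page(page)
        return self.local[page]       # hit: no communication at all

    def invalidate(self, page):       # another node wrote to this page
        self.local.pop(page, None)

cache = NodeCache()
cache.read("page_7")        # fetched once
cache.read("page_7")        # served locally, no traffic
cache.invalidate("page_7")  # a remote write invalidated our copy
cache.read("page_7")        # fetched again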
Brendan wrote:Of course you could split things apart (just like with the messaging example) to avoid this, but then there's no need for any shared memory or any transactional memory in the first place (except for creating a "performance crippled" emulation of messaging).
There's no need to split things apart. If data is shared then communication is needed (and you get higher latencies) - regardless of whether messaging or distributed memory is used. If the data isn't used by someone else (i.e. no sharing), there's no communication - regardless of whether messaging or distributed memory is used.
Brendan wrote: To make it worse, the first version of the code might be both easy to write and correct, but it'll also be unusable due to horrendous performance, and the programmer won't know what to do to fix it because the OS is hiding all the details (and eventually after several rewrites, the unfortunate programmer will come to realise there's nothing they can do to make the performance acceptable, and that adding more hardware just makes things worse).
Well, I think for a first version of the code the performance would be "acceptable". And I admit that programmers need a good understanding of how the TDM works to improve performance. And yes, the room for improvements is somewhat limited because there is a high level of abstraction. So the OS has to provide as many optimizations as possible.

And if nothing else helps, you can always get away with providing weaker consistency models (transactional consistency is a very strong one). Then the application can decide which consistency model fits and use that one. Or better: provide a simple (!) interface to plug in consistency models. That's not impossible - we have plans to do so.
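Just to give an idea of the direction - a purely hypothetical sketch of such an interface (every name here is invented, nothing of this exists yet):

Code:

from abc import ABC, abstractmethod

class ConsistencyModel(ABC):
    @abstractmethod
    def may_commit(self, read_set, invalidated):
        """Decide whether a finished transactional task may commit."""

class TransactionalConsistency(ConsistencyModel):
    # strong default: roll back if anything we read was invalidated meanwhile
    def may_commit(self, read_set, invalidated):
        return read_set.isdisjoint(invalidated)

class BestEffortConsistency(ConsistencyModel):
    # weaker model for workloads that can tolerate stale reads
    def may_commit(self, read_set, invalidated):
        return True

print(TransactionalConsistency().may_commit({"p3", "p7"}, invalidated={"p7"}))  # False -> retry
print(BestEffortConsistency().may_commit({"p3"}, invalidated={"p3"}))           # True -> commit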
Brendan wrote:Basically to me it looks like you've done everything possible to make programming easier, which is why I'm curious about the scalability/performance...
Hyperdrive wrote: Around 80% of the IT costs in a company are due to buggy software. (That's just my estimate - I don't have exact figures at hand, but see for example http://www.nist.gov/public_affairs/releases/n02-10.htm?). Bugs come, among other things, from overly complex software. And distributed environments are even more complex. So the main idea is to reduce complexity, i.e. making programming easier. By reducing complexity for the programmer, it's very likely we'll see a significant decrease in bug counts. Lowering bug counts means less money wasted.
I agree - using a large/expensive server instead of a distributed system would reduce complexity. Complexity is only needed for better performance (e.g. if the application needs more processing power than one expensive server can provide).
A large/expensive server is just the same - it's only very tightly coupled. The same arguments apply.
Brendan wrote: NIST takes a different approach - to reduce the cost of buggy software, make it easier to find bugs. From your own link:
NIST wrote:The path to higher software quality is significantly improved software testing. Standardized testing tools, suites, scripts, reference data, reference implementations and metrics that have undergone a rigorous certification process would have a large impact on the inadequacies currently plaguing software markets. For example, the availability of standardized test data, metrics and automated test suites for performance testing would make benchmarking tests less costly to perform. Standardized automated testing scripts, along with standard metrics, also would provide a more consistent method for determining when to stop testing.
By providing the link to the NIST report I aimed to substantiate my argument that buggy software is incredibly costly (using the headline as an "eye-catcher"). As you said, they take a different approach. I don't think they're doing the right thing.

They don't solve the problem, they just propose using better tools to see the problem more clearly. Read as: they don't prevent bugs from being made, they just put more effort into finding the bugs. Once you have to do extensive testing, you've already lost the war. NIST just closes the doors more loudly than others after their horses have bolted. [Hm... As a non-native speaker I hope you'll understand this proverb variation... Do you?]
Brendan wrote:
Hyperdrive wrote:The other way around -- increasing performance gets harder the more you do it. It's relatively easy to get a 30% increase when your application is performing at 50% of the theoretical maximum. But it's incredibly hard to squeeze another 1% out of your application when it's performing at 97% of the theoretical max. Besides, an increase of performance saves you less money than eliminating bugs.
Except the performance problem is caused by the memory model used by the OS, which isn't something an application can fix. If an application programmer does everything they possibly can to improve performance but the application still only performs at 10% of theoretical maximum performance, then is it realistic to ask them to rewrite the OS and the application to squeeze another 85% out of it?
I don't think that's the case. I agree that message passing could outperform the TDM (if done well), but I wouldn't expect such horrible performance from the TDM. Granted, I have no proof - but you don't have one either.
Brendan wrote:
Hyperdrive wrote:Bottom line: Good software is better than good performance. So making programming easier is better than pushing the envelope in terms of performance.
For desktop systems, I agree completely. The type of distributed system you're designing isn't used for desktop systems - it's mainly used for HPC. On average, programmers that write software for desktop use are "average". Programmers that write software for large servers are typically "above average", and programmers that write software for HPC are "above above average" (in that they've been writing software that scales to > 1000 nodes for a decade or so already). I'm not convinced that these "above above average" programmers really need something that makes programming easier at the expense of performance.
Why not use it for desktop systems? We have a virtual 3D world as a use case on top of our TDM. That's a "desktop application", isn't it? I would expect there are more applications for which a TDM is an excellent approach.

For HPC you need the "above above average" programmers because they can handle the complexity the programming model imposes if you want to write well-scaling software. But they are humans too, and so the software they write has bugs as well. I'd expect "above above average" programmers could also write well-scaling TDM software - but they wouldn't have such a hard time getting it right, and there would be fewer bugs.

And as I said, performance isn't everything. Does the best performance always justify a long development time + a lot of money for paying the "above above average" programmers (they don't work for free) + downtime due to bugs (that's the very costly part)?
Brendan wrote:
Hyperdrive wrote:
Brendan wrote: In all cases, a sane programmer will split a large/complex application into small/simple pieces. For message passing, each small/simple piece has its own small/simple protocol - it only becomes complex if you fail to split it into manageable pieces. However, this depends a lot on what sort of messaging - asynchronous messaging is more complex than synchronous messaging (but has better performance/scalability).
The pieces themselves are manageable. But by splitting you often increase the need for communication in distributed environments. If your application is growing, it gets harder to do it right. Which component needs data? Which data? When? How fast (timeouts)? How is it reachable? And even if you can answer all that, can you guarantee that the whole system has a consistent view?

Take WoW for example, which uses message passing. Why did they choose a client/server approach? Every avatar movement is reported to the server and other clients rely on the view the server provides. They could have chosen to do it more like peer-to-peer, but they didn't. Why? It's just too hard. Maintaining consistency with message passing is incredibly hard, and implementing it is very error-prone.
Peer-to-peer only works if you can trust the peers, and for WoW there's probably plenty of people willing to hack the game's code to get an advantage. Client/server means the (trusted) server can double check everything that it receives from the (untrusted) clients to make sure nothing dodgy is going on.
Sorry, I don't think that's the point. It's only an excuse - and not a bad one - but nothing more. Apart from the trust/security issues, I strongly doubt they could get the whole thing right in a fully distributed fashion.

--TS
Hyperdrive
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Colonel Kernel wrote:There is one thing that I think is frequently missed in the argument between shared memory and message-passing: Message-passing leads to better encapsulation and loose coupling of components. Shared memory often (but not always) makes it all too easy to share global state, write code with hard-to-detect side effects, etc.

I have some (recent) experience working on a highly scalable system that basically uses both of these techniques, and indeed, the parts that rely on shared state are much harder to understand, debug, etc. I just fixed an integration bug the day before yesterday that was triggered by a seemingly unrelated change in another part of the system (can't go into details since it's proprietary :) ). This wouldn't have happened if certain parts of the global state weren't co-mingled the way they were. It's almost like what Brendan said about "false sharing", but in terms of correctness and maintainability instead of performance. :)
I believe that with message passing the developers think harder about what the other components really have to know. With shared memory, many take the bait and share everything because it's so easy. (You know about PHP's "register_globals"? #-o) We just need more discipline there. But that's all a question of design and not of the underlying concept. You can always misuse the concept. Therefore you can always have good and bad software designs. Just pick your favourite ;)

--TS
Hyperdrive
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Love4Boobies wrote:Hyperdrive, are you aware of this document?
Yes, I am. But I considered it too complex to do in the early stages (of the OS's development). Many thanks, anyway.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Machine identification

Post by Brendan »

Hi,
Hyperdrive wrote:
Brendan wrote: then you'd still be constantly waiting for pages of data to be transferred over the network.
Why? You can replicate (i.e. cache) data. Only if the page was invalidated do you have to request it again. Writes happen less often than reads. And more often than not, only a few nodes write. So collisions are rare and mostly the data is cached.
I guess that's my main point - transactional distributed shared memory works well when no synchronization is needed anyway (read-only), but goes bad fast when it is actually needed (writes).

Writes might be very rare, but they're also extremely expensive.

For an (admittedly dodgy) example, if there's 25 nodes with 2 GHz dual core CPUs, then you're looking at a cluster capable of doing around 100 billion cycles per second. Now, imagine if the application has a piece of data that's constantly being read and extremely rarely being written to. A write causes pages to be invalidated and downloaded again, but also means that work done can't be committed and needs to be retried. If one write costs a total of 10 billion cycles on average (due to networking delays and lost work), then how many writes can you do per second to end up with 50% of the theoretical maximum performance? 5 of them? That would work out to one write for every 20 billion cycles.

If you double the number of nodes then you double the number of writes, and because each write affects each node you double the cost of each write. That's 4 times as much overhead - doubling the number of nodes reduces the amount of work the cluster gets done. ;)
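The same numbers written out (the 10-billion-cycle write cost is the same assumed figure as above, not a measurement):

Code:

nodes, cores, hz = 25, 2, 2_000_000_000
total = nodes * cores * hz                      # ~100 billion cycles per second
write_cost = 10_000_000_000                     # assumed cost of one write
writes_per_second = 5
useful = total - writes_per_second * write_cost
print(useful / total)                           # 0.5 -> 50% of theoretical maximum

# double the nodes: twice as many writes, each one costing twice as much
total2 = 2 * total
useful2 = total2 - (2 * writes_per_second) * (2 * write_cost)
print(useful2 / total2)                         # 0.0 -> the overhead has quadrupled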


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
thooot
Member
Posts: 30
Joined: Sun Jun 01, 2008 11:20 am

Re: Machine identification

Post by thooot »

One problem I can see with having a transactional system is when the apps running on your OS need to communicate with the world outside the cluster. If they aren't aware of when transactions start and end then they may send some data multiple times to an external machine (if a transaction gets aborted and restarted). Depending on the protocol they're using this could cause data corruption or functionality errors. The alternative would be to have the apps aware of the transactional nature of the system and force them to use a transactional protocol. Then if a transaction gets aborted you would need to let the app perform any work it needs to abort the transaction. This would increase app & protocol complexity, potentially introducing bugs :D .

Why are you using transactions rather than just doing something similar to the way cache coherency works (you sort of are already with the invalidation, but you have transactions on top of it)? This would allow apps to not have to be aware of the way the system works internally. And if as you say there are few writes, then there will be few page invalidations and it should have the same performance characteristics as your transactional model (possibly better as you won't have to roll back transactions and redo them).

*EDIT: Actually, to get around the whole outside-world issue you could buffer any network send traffic and commit the process first, before sending the data.
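Roughly like this (just a sketch with made-up names):

Code:

class TransactionalSocket:
    # Inside a transaction, sends only go into a buffer; they hit the wire
    # after the commit and are dropped if the transaction is rolled back.
    def __init__(self, real_send):
        self.real_send = real_send
        self.buffer = []

    def send(self, data):
        self.buffer.append(data)      # nothing leaves the node yet

    def on_commit(self):
        for data in self.buffer:      # safe: the transaction is now final
            self.real_send(data)
        self.buffer.clear()

    def on_abort(self):
        self.buffer.clear()           # the retry will regenerate these sends

sock = TransactionalSocket(real_send=lambda d: print("wire:", d))
sock.send(b"hello")     # buffered
sock.on_abort()         # transaction retried - nothing was sent twice
sock.send(b"hello")
sock.on_commit()        # only now does the data go out, exactly once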
Hyperdrive
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Hi,

sorry, for the last two weeks I was very busy finishing my diploma thesis, so I had no time to respond here. But now it's submitted and I'm back...
Brendan wrote:
Hyperdrive wrote:
Brendan wrote: then you'd still be constantly waiting for pages of data to be transferred over the network.
Why? You can replicate (i.e. cache) data. Only if the page was invalidated do you have to request it again. Writes happen less often than reads. And more often than not, only a few nodes write. So collisions are rare and mostly the data is cached.
I guess that's my main point - transactional distributed shared memory works well when no synchronization is needed anyway (read-only), but goes bad fast when it is actually needed (writes).

Writes might be very rare, but they're also extremely expensive.

For an (admittedly dodgy) example, if there's 25 nodes with 2 GHz dual core CPUs, then you're looking at a cluster capable of doing around 100 billion cycles per second. Now, imagine if the application has a piece of data that's constantly being read and extremely rarely being written to. A write causes pages to be invalidated and downloaded again, but also means that work done can't be committed and needs to be retried. If one write costs a total of 10 billion cycles on average (due to networking delays and lost work), then how many writes can you do per second to end up with 50% of the theoretical maximum performance? 5 of them? That would work out to one write for every 20 billion cycles.
If I understand you correctly, you are assuming that every single write has high costs. That's not true. First, there may be writes that don't lead to conflicts, and then nothing bad happens (no rollback, no lost time). Second, regardless of whether there was 1 conflict or 1000 conflicts, in many cases the costs are the same because the whole run of a transactional task is validated. If you consider both arguments, the costs per write are not as high as you maybe thought.

Some experiments show that it scales very well. Granted, that depends largely on the workload. Many applications may have no problem, others may favour a less strict consistency. We are well aware of this and are planning to support multiple consistency models. But we believe that transactional consistency should be the primary one and will always remain the load-bearing consistency (for "data" like object references and such).
Brendan wrote: If you double the number of nodes then you double the number of writes, and because each write affects each node you double the cost of each write. That's 4 times as much overhead - doubling the number of nodes reduces the amount of work the cluster gets done. ;)
That's not the case. By doubling the number of nodes you increase the probability of conflicts, and therefore the amount of work the cluster gets done may be reduced - but not with a 4x impact, not even anything near that.

--TS
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Machine identification

Post by Brendan »

Hi,
Hyperdrive wrote:If I understand you correctly, you are assuming that every single write has high costs. That's not true. First, there may be writes that don't lead to conflicts, and then nothing bad happens (no rollback, no lost time). Second, regardless of whether there was 1 conflict or 1000 conflicts, in many cases the costs are the same because the whole run of a transactional task is validated. If you consider both arguments, the costs per write are not as high as you maybe thought.
Let me simplify my argument...

Some workloads are insanely easy to do in parallel - a class of problems called "embarrassingly parallel". For these workloads it doesn't matter what you do - you'll get very good results, and you'd even get extremely good results (for a large group of single-CPU computers) if you used DOS. Saying your technique works well for these problems doesn't really help.

At the other extreme there are workloads that are virtually impossible to do in parallel. In these cases, regardless of how parallelism is achieved, you can't get worthwhile performance improvements by using a second core in the same computer, and using more than one computer is pointless.

The interesting part is in the middle - the range of workloads that fall between these extremes.

Now, draw a graph. Label the X-axis "difficulty", and on the left (at (0,0)) write "embarrassingly parallel" and on the right write "impossible". Now label the Y-axis - at the top write "lots of nodes" and at the bottom (at (0,0)) write "zero nodes". The idea is to plot statistics showing the point where adding more nodes doesn't help performance, for a range of workloads.

Draw a line in blue for "distributed with messaging", starting from the right (impossible, 1 node) that curves up towards infinity on the left (embarrassingly parallel, as many nodes as you can chuck at it). Now draw a line in red starting from the same place on the right (impossible, 1 node) that also curves up towards infinity on the left (embarrassingly parallel, as many nodes as you can chuck at it). Now make sure the red line is well below the blue line in the middle, and remains below the blue line for virtually all of the graph.
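If you want to see the shape I mean, something like this draws it (both curves are invented just to show the shape - there are no measurements behind them):

Code:

import numpy as np
import matplotlib.pyplot as plt

difficulty = np.linspace(0.01, 1.0, 200)    # 0 = embarrassingly parallel, 1 = impossible
messaging = 1.0 / difficulty                # blue line
shared_tdm = (1.0 / difficulty) ** 0.5      # red line: same endpoints, well below

plt.plot(difficulty, messaging, "b-", label="distributed with messaging")
plt.plot(difficulty, shared_tdm, "r-", label="distributed transactional shared memory")
plt.xlabel("difficulty (embarrassingly parallel -> impossible)")
plt.ylabel("nodes before adding more stops helping")
plt.yscale("log")
plt.legend()
plt.show()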

If you like, you could probably gather real statistics and (attempt to) prove me wrong...


Cheers,

Brendan
Attachments
graph.png (My example graph)
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Hyperdrive
Member
Posts: 93
Joined: Mon Nov 24, 2008 9:13 am

Re: Machine identification

Post by Hyperdrive »

Hi,
Brendan wrote:
Hyperdrive wrote:If I understand you correctly, you are assuming that every single write has high costs. That's not true. First, there may be writes that don't lead to conflicts, and then nothing bad happens (no rollback, no lost time). Second, regardless of whether there was 1 conflict or 1000 conflicts, in many cases the costs are the same because the whole run of a transactional task is validated. If you consider both arguments, the costs per write are not as high as you maybe thought.
Let me simplify my argument...

Some workloads are insanely easy to do in parallel - a class of problems called "embarrassingly parallel". For these workloads it doesn't matter what you do - you'll get very good results, and you'd even get extremely good results (for a large group of single-CPU computers) if you used DOS. Saying your technique works well for these problems doesn't really help.

At the other extreme there are workloads that are virtually impossible to do in parallel. In these cases, regardless of how parallelism is achieved, you can't get worthwhile performance improvements by using a second core in the same computer, and using more than one computer is pointless.

The interesting part is in the middle - the range of workloads that fall between these extremes
I think we talked past each other. I talked about types of workloads as in access patterns, you talk about types of workloads as in how hard it is to parallelize a job/task/algorithm. These are different things.

My point here is that we were talking about scalability issues - the question was whether the transactional distributed memory scales well or not. We discussed that writes can get very expensive - but that depends on the situation. It's not a matter of how well the job can be parallelized; it's about which access patterns the job exhibits.

There may be workloads with TDM-friendly access patterns and there may be others. Hopefully we can provide a suitable consistency model for the latter. If not, we lose in this case.

<OT>
But which systems don't have this behaviour? You provide some paradigms the user can work with. Depending on the use case, they will or won't fit. And every time a user phones you and complains that the paradigms you provided are not suitable for his needs, you can add a paradigm that will make him happy. But the next day another user phones you and complains that the paradigm you just added is totally silly for him, because he needed another one. Well, you'll lose the war regardless of what you do.

The only thing you can really do is provide things you think have a chance of being useful to some people. They'll use your product; the others will use another one.

The final questions are:
(1) How many people are willing to use your product?
(2) Given the answer for (1) - is the development effort worth it?
(3) Are the people that will probably use your product a clientele you want to support? (You maybe don't want to support script kiddies or something...)
(4) Are you really convinced about your product idea (thinking it is a good one), even though a huge number of people are (at least) very sceptical? Do you think it is worth doing it nonetheless, because there is a chance that future developments will help your idea become important?

Question (4) is relatively hard. It's more for researchers than for business people, I know. But guess what, I'm more of a researcher than a businessman - at least at the moment.
</OT>

We believe strongly that there is a whole range of applications with workloads that fit perfectly with our TDM paradigm.

Can you prove me wrong?

--TS
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Machine identification

Post by Brendan »

Hi,
Hyperdrive wrote:I think we talked past each other. I talked about types of workloads as in access patterns, you talk about types of workloads as in how hard it is to parallelize a job/task/algorithm. These are different things.
That doesn't change my argument...
Hyperdrive wrote:There may be workloads with TDM-friendly access patterns and there may be others. Hopefully we can provide a suitable consistency model for the latter. If not, we lose in this case.
There aren't any workloads with TDM-friendly access patterns - there are only workloads where the crappy characteristics of TDM aren't as obvious. For all workloads and all access patterns, you can get better performance and better scalability with shared memory and traditional locking, and much better performance and much better scalability with messaging (and careful design).

The question isn't whether TDM sucks or not, the question is how badly TDM sucks. Does it suck so badly that no sane person will ever want to use it, or is the level of suckage small enough to justify its use in some cases? IMHO the level of suckage is small enough to justify its use in some cases, but not in any of the cases where it's necessary to distribute a large application across many computers.

Of course distributed shared memory sucks too (even though it sucks less than TDM) - there's a reason why large distributed clusters use messaging (e.g. MPI) instead of distributed shared memory (e.g. OpenMP), and it has nothing to do with "programmer friendliness"...


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.