Concise Way to Describe Colour Spaces

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Concise Way to Describe Colour Spaces

Post by embryo2 »

Brendan wrote:
embryo2 wrote:It can make things like problem monitoring easier to implement. If we have high-level code (or bytecode), then it is possible to insert some checks, specific to particular hardware, when the code is being compiled.
Sure, every time you read a variable you'd check if it's the same value that was stored in the variable last "somehow", every time you add 2 numbers you follow that by a check (e.g. "c = a + b; if( c-a != b ) .."), every time you multiply you follow it with a check (e.g. "c = a * b; if(c / a != b) ..."), etc. Of course it's going to be much much slower, you won't be able to use 2 or more CPUs to spread the load, and it's still going to fail when (e.g.) the code itself is corrupted or (e.g.) the CPU fails or (e.g.) someone unplugs the wrong power or network cable.
If you are talking about memory failures, then yes, it's hard to defend against them. But there is plenty of hardware besides the memory. So, when someone unplugs the network cable we have (given a good design) something like layered exception handling: first the driver sets its output accordingly, then the TCP/IP implementation propagates an exception up to the level of the application that is currently working with the network. And even if there are no exception handlers defined by the programmer, the VM is still able to safely kill just one thread without the whole application being affected (whether via runtime checks or whatever else the VM allows us to implement).
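
To sketch what I mean in Java (all the class names here are invented for illustration - this is not any real driver API): the failure is raised at the lowest layer, wrapped on the way up, and only the one worker thread dies:

Code: Select all

class LinkDownException extends Exception {}             // thrown by the "driver" layer

class TransportException extends Exception {             // thrown by the "tcp/ip" layer
    TransportException(Throwable cause) { super(cause); }
}

class NetworkClient {
    byte[] fetch(String resource) throws TransportException {
        try {
            return driverRead(resource);                  // lowest layer signals the unplugged cable
        } catch (LinkDownException e) {
            throw new TransportException(e);              // wrap it and pop it up one level
        }
    }

    private byte[] driverRead(String resource) throws LinkDownException {
        throw new LinkDownException();                    // pretend the cable was just unplugged
    }
}

public class LayeredDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                new NetworkClient().fetch("/data");
            } catch (TransportException e) {
                System.out.println("network layer failed: " + e.getCause());
                // this thread ends here; the rest of the application keeps running
            }
        });
        worker.start();
        worker.join();
        System.out.println("application still alive");
    }
}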
Brendan wrote:
embryo2 wrote:It helps in practice. If a handler has a bug, then its execution is aborted with an exception and its thread just stops running (if the VM was designed by competent architects), but all other threads still work and the application loses just some small part of its functionality.
That's extremely naive at best. In practice those threads will crash at "unfortunate" times (e.g. in the middle of modifying data while holding several locks) and you'll be screwed.
Well, you can look at many web sites and find exactly this "extremely naive" situation: the server just works, the site just works, but one page is not displayed because of a bug.

And of course, the lock bookkeeping can be done by the VM, using the thread id to determine which thread should be resumed after the problem thread has crashed. If there is a concurrent data modification problem, then yes, the problem can spread to two threads, but it is still not a full application crash. But most often concurrent data modification is not a problem, because usually each handler modifies only its own private data.
Brendan wrote:
embryo2 wrote:but the VM's bugs affect millions of programs, so they can be detected very quickly. Then, regardless of a bug's complexity, the required fix will be made in a short time.
More like the opposite - every time they add new features to the VM they introduce more bugs that affect millions of programs.
If a bug affects millions of programs and nobody cares, then it seems to me that there is no bug at all.
Brendan wrote:
embryo2 wrote:And if you know about the problem, would you expect a program with a race condition to produce the same result every time? I think not. So, first we should ensure there is no race condition, and only then can we compare the outputs of a program from different computers.
First we ensure there's no race conditions "somehow" (with magic or prayer?); then we compare outputs of a program from different computers to both detect and avoid problems caused by race conditions (which makes it easy to detect and correct those race conditions that we "ensured" couldn't happen "somehow")?
We ensure that there is no race condition by designing the code to work with private data, for example. And if we know that the code modifies data that is shared across many threads, then why would we need to compare the output of such threads? Should we wonder if there will be a difference, or should we cry because there is no difference?
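
For illustration, a minimal Java sketch of that "private data" style (the task and the numbers are invented): each task touches only its own range and local variables, so there is nothing shared to race on:

Code: Select all

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PrivateDataDemo {
    static long sumOfSquares(int from, int to) {
        long s = 0;                          // purely local state - no locks needed
        for (int i = from; i < to; i++) s += (long) i * i;
        return s;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<Long>> tasks = List.of(    // disjoint, private ranges
                () -> sumOfSquares(0, 1000),
                () -> sumOfSquares(1000, 2000),
                () -> sumOfSquares(2000, 3000),
                () -> sumOfSquares(3000, 4000));
        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks))
            total += f.get();                // results are combined in a single thread
        pool.shutdown();
        System.out.println(total);
    }
}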
Brendan wrote:it's better to give users the ability to choose whether to use it or not (instead of simply assuming all the software that will ever be run on the OS will always be "not important enough").
Yes, it's better, but the development time cries for mercy.
Brendan wrote:Also note that I'm planning a distributed system. The chance of one computer failing might be "acceptably low"; but when you have 100 computers working together the chance that one of those computers will fail is 100 times higher than "acceptably low".
The granularity of your "distribution" is much finer than it is for existing solutions. So the overhead of your solution will be much bigger than the overhead of the existing solutions (heavy networking because of string comparison calls, for example). Then why would anybody need such a replacement for existing solutions?
Brendan wrote:
embryo2 wrote:Just let the running function finish and prevent new calls from reaching the old function (by changing its address). It's simple.
So now you need some sort of synchronisation point at the start and end of every function, plus some way to determine which functions use which data structures, and then you're still completely screwed if the function has some sort of main loop and you never leave that function (until/unless you exit the process). It's "simple" (like, winning the national lottery is simple - you just buy a ticket)!
No, actually it is you who needs the synchronization and the rest on every call. I never proposed low-level function interaction using something like messaging. Instead, I prefer coarse-grained solutions, where the synchronization code takes just a tiny fraction of the function's execution time (as it is for web services and the like).
Brendan wrote:If the box excludes everything unnecessary (e.g. one "box"/virtual address space for the application and a separate "box"/virtual address space for the library); and if there's a way to transfer information between "boxes" (e.g. messages) then it'd work because it's what I'm doing. The only difference is that you're using bloated/inefficient VMs to create the boxes
No, my box prevents access to the data that the developer has chosen to make inaccessible to a library. So, my box ensures the security of the data in a different way than yours - but without the overhead of messaging on every call.
Brendan wrote:which creates a whole new "VM can touch everything" security problem that's just as bad as the "library can touch everything" problem that you were trying to solve.
Well, then what about the "OS can touch everything" security problem? Why do you think the latter is any better than the former? And if you think they are comparable, then why do you stress the VM's problem only?
My previous account (embryo) was accidentally deleted, so I had no choice but to use something new. But maybe it was a good lesson about software reliability :)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Concise Way to Describe Colour Spaces

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:
embryo2 wrote:It can make things like problem monitoring easier to implement. If we have high-level code (or bytecode), then it is possible to insert some checks, specific to particular hardware, when the code is being compiled.
Sure, every time you read a variable you'd check if it's the same value that was stored in the variable last "somehow", every time you add 2 numbers you follow that by a check (e.g. "c = a + b; if( c-a != b ) .."), every time you multiply you follow it with a check (e.g. "c = a * b; if(c / a != b) ..."), etc. Of course it's going to be much much slower, you won't be able to use 2 or more CPUs to spread the load, and it's still going to fail when (e.g.) the code itself is corrupted or (e.g.) the CPU fails or (e.g.) someone unplugs the wrong power or network cable.
If you are talking about memory failures, then yes, it's hard to defend against them. But there is plenty of hardware besides the memory. So, when someone unplugs the network cable we have (given a good design) something like layered exception handling: first the driver sets its output accordingly, then the TCP/IP implementation propagates an exception up to the level of the application that is currently working with the network. And even if there are no exception handlers defined by the programmer, the VM is still able to safely kill just one thread without the whole application being affected (whether via runtime checks or whatever else the VM allows us to implement).
Um, in this case it failed (the application can't continue because it can't access whatever it needed the network for) and your "exceptions" nonsense does nothing to ensure the application continues working correctly; and to make things far worse you've killed a thread that's probably holding multiple locks where other threads are likely to be waiting for it to finish doing something (and/or waiting for those locks to be released), and therefore you've probably completely destroyed the entire application. It does not allow recovery.
embryo2 wrote:
Brendan wrote:
embryo2 wrote:It helps in practice. If a handler has a bug, then its execution is aborted with an exception and its thread just stops running (if the VM was designed by competent architects), but all other threads still work and the application loses just some small part of its functionality.
That's extremely naive at best. In practice those threads will crash at "unfortunate" times (e.g. in the middle of modifying data while holding several locks) and you'll be screwed.
Well, you can look at many web sites and find exactly this "extremely naive" situation: the server just works, the site just works, but one page is not displayed because of a bug.
I looked at many web sites. They're all using asynchronous messages (in the form of HTTP requests and replies over TCP/IP). Only some of them have "failover", which means that for a lot of them if the web server crashes the entire site is offline.
embryo2 wrote:And of course, the lock bookkeeping can be done by the VM, using the thread id to determine which thread should be resumed after the problem thread has crashed. If there is a concurrent data modification problem, then yes, the problem can spread to two threads, but it is still not a full application crash. But most often concurrent data modification is not a problem, because usually each handler modifies only its own private data.
If a program has 100 threads that all rely on the same global data structure (which is actually very common), and a thread crashes while modifying that data structure (leaving the global data structure in an inconsistent/corrupted state), then all 100 threads are affected and not 2.

If you think concurrent data modification is not a problem because usually each thread only modifies its own personal data; then I can only assume you've never written any software that uses threads.
embryo2 wrote:
Brendan wrote:
embryo2 wrote:but the VM's bugs affect millions of programs, so they can be detected very quickly. Then, regardless of a bug's complexity, the required fix will be made in a short time.
More like the opposite - every time they add new features to the VM they introduce more bugs that affect millions of programs.
If a bug affects millions of programs and nobody cares, then it seems to me that there is no bug at all.
When (e.g.) Oracle do an urgent security update that fixes 19 critical vulnerabilities (some of which could have led to compromising the system despite the "sandbox"); would they be bugs that nobody cares about?
embryo2 wrote:
Brendan wrote:
embryo2 wrote:And if you know about the problem, would you expect a program with a race condition to produce the same result every time? I think not. So, first we should ensure there is no race condition, and only then can we compare the outputs of a program from different computers.
First we ensure there's no race conditions "somehow" (with magic or prayer?); then we compare outputs of a program from different computers to both detect and avoid problems caused by race conditions (which makes it easy to detect and correct those race conditions that we "ensured" couldn't happen "somehow")?
We ensure that there is no race condition by designing the code to work with private data, for example. And if we know that the code modifies data that is shared across many threads, then why would we need to compare the output of such threads? Should we wonder if there will be a difference, or should we cry because there is no difference?
The "shared nothing" approach (where entities/threads only work on private data) also has race conditions (e.g. software that expects communication to occur in a certain order). It doesn't solve the problem. Your "design the code so that there's no race conditions" is pure wishful thinking.
embryo2 wrote:
Brendan wrote:it's better to give users the ability to choose whether to use it or not (instead of simply assuming all the software that will ever be run on the OS will always be "not important enough").
Yes, it's better, but the development time cries for mercy.
Yes; but spending extra time to maximise the chance that the OS is significantly better than existing OSs is a lot more sensible than quickly producing something worthless.
embryo2 wrote:
Brendan wrote:Also note that I'm planning a distributed system. The chance of one computer failing might be "acceptably low"; but when you have 100 computers working together the chance that one of those computers will fail is 100 times higher than "acceptably low".
The granularity of your "distribution" is much finer than it is for existing solutions. So the overhead of your solution will be much bigger than the overhead of the existing solutions (heavy networking because of string comparison calls, for example). Then why would anybody need such a replacement for existing solutions?
Programmers will need to intelligently split things up into processes. You wouldn't put a trivial string comparison (which is mostly a single "rep cmpsb" instruction) into a separate process.
embryo2 wrote:
Brendan wrote:
embryo2 wrote:Just let the running function finish and prevent new calls from reaching the old function (by changing its address). It's simple.
So now you need some sort of synchronisation point at the start and end of every function, plus some way to determine which functions use which data structures, and then you're still completely screwed if the function has some sort of main loop and you never leave that function (until/unless you exit the process). It's "simple" (like, winning the national lottery is simple - you just buy a ticket)!
No, actually it is you who needs the synchronization and the rest on every call. I never proposed low-level function interaction using something like messaging. Instead, I prefer coarse-grained solutions, where the synchronization code takes just a tiny fraction of the function's execution time (as it is for web services and the like).
Wrong. E.g. a service receives a request, executes any number of functions with no "per function" synchronisation whatsoever; then sends a reply.

For a more specific example; let's say there's a service that does arbitrary precision maths (using rational numbers in the "numerator/denominator * 2**exponent" form). You send it the request "simplify the expression: x/3 * (y+6) * (42 / 6) + 9 + z" and it does a massive amount of processing and returns the reply "7/3 * x * y + 14 * x + 9 + z".

Of course the OS might be using redundancy for the arbitrary precision maths service; so when the application sends the request the OS forwards it to 3 different instances of the service, 2 of them might return the reply "7/3 * x * y + 14 * x + 9 + z" and one might return the reply "7/3 * x * y + 14 * x + 11 + z" (or not return a reply at all), then the OS will compare them and find that one of them failed (and maybe replace the failed instance for next time) and give the application the "7/3 * x * y + 14 * x + 9 + z" reply; and this happens without the application knowing or caring that redundancy was being used and without the application knowing or caring that there was a failure.
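
As a rough Java sketch of just the voting step (the replicas are stubbed with lambdas here; in the real thing the kernel would be comparing reply messages, so treat this as illustration only):

Code: Select all

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class VoteDemo {
    // Send the same request to every replica, then return the majority reply.
    static String askWithRedundancy(List<Callable<String>> replicas)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        Map<String, Integer> votes = new HashMap<>();
        for (Future<String> f : pool.invokeAll(replicas, 5, TimeUnit.SECONDS)) {
            try {
                votes.merge(f.get(), 1, Integer::sum);
            } catch (ExecutionException | CancellationException e) {
                // a replica crashed or timed out - it simply gets no vote
            }
        }
        pool.shutdown();
        return votes.entrySet().stream()
                .filter(e -> e.getValue() > replicas.size() / 2)   // strict majority
                .map(Map.Entry::getKey)
                .findFirst()
                .orElseThrow();                                    // no majority at all
    }

    public static void main(String[] args) throws InterruptedException {
        String reply = askWithRedundancy(List.of(
                () -> "7/3 * x * y + 14 * x + 9 + z",
                () -> "7/3 * x * y + 14 * x + 9 + z",
                () -> "7/3 * x * y + 14 * x + 11 + z"));   // the faulty instance
        System.out.println(reply);   // majority wins; the bad reply is discarded
    }
}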
embryo2 wrote:
Brendan wrote:If the box excludes everything unnecessary (e.g. one "box"/virtual address space for the application and a separate "box"/virtual address space for the library); and if there's a way to transfer information between "boxes" (e.g. messages) then it'd work because it's what I'm doing. The only difference is that you're using bloated/inefficient VMs to create the boxes
No, my box prevents access to the data that the developer has chosen to make inaccessible to a library. So, my box ensures the security of the data in a different way than yours - but without the overhead of messaging on every call.
And without any of the other benefits; and relying on the severely flawed assumption that the VM is secure.

Basically; I have one single simple and elegant thing (asynchronous message passing) that has multiple benefits (flexibility, scalability, fault tolerance, availability, security); and you have an ugly/eclectic mixture of multiple different things (managed code, VMs, exceptions, whatever) where each thing is only for one purpose and is inferior for that purpose.
embryo2 wrote:
Brendan wrote:which creates a whole new "VM can touch everything" security problem that's just as bad as the "library can touch everything" problem that you were trying to solve.
Well, then what about the "OS can touch everything" security problem? Why do you think the latter is any better than the former? And if you think they are comparable, then why do you stress the VM's problem only?
The OS can't touch everything. The micro-kernel can; but it's a tiny piece of the OS that's relatively simple and far less likely to have security vulnerabilities than a huge bloated mess of complicated code (your VM). Of course when I say "the micro-kernel can" what I actually mean is that in a distributed system the micro-kernel on one computer can touch anything on that computer, but can't touch things on a different computer and therefore can't actually touch "everything".


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Antti
Member
Posts: 923
Joined: Thu Jul 05, 2012 5:12 am
Location: Finland

Re: Concise Way to Describe Colour Spaces

Post by Antti »

May I submit a feature request? Although applications do not care about redundancy and its transparency, system administrators may care about them in the interest of monitoring the distributed OS's general health. Perhaps this is already taken into account, but it would be really nice to have these kinds of features elegantly integrated into the OS's message passing system. What would be more interesting (for system administrators who just drink coffee in their offices and monitor the system) than to see a graphical map of the computers and some kind of statistical information about the message traffic?

Of course all the GUIs with bells and whistles for doing this are not important at this point. The important thing is to have a proper backend for this from the very beginning. Trying to retrofit it later sounds like a bad idea.
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: Concise Way to Describe Colour Spaces

Post by Rusky »

Brendan wrote:
embryo2 wrote:If you are talking about memory failures, then yes, it's hard to defend against them. But there is plenty of hardware besides the memory. So, when someone unplugs the network cable we have (given a good design) something like layered exception handling: first the driver sets its output accordingly, then the TCP/IP implementation propagates an exception up to the level of the application that is currently working with the network. And even if there are no exception handlers defined by the programmer, the VM is still able to safely kill just one thread without the whole application being affected (whether via runtime checks or whatever else the VM allows us to implement).
Um, in this case it failed (the application can't continue because it can't access whatever it needed the network for) and your "exceptions" nonsense does nothing to ensure the application continues working correctly; and to make things far worse you've killed a thread that's probably holding multiple locks where other threads are likely to be waiting for it to finish doing something (and/or waiting for those locks to be released), and therefore you've probably completely destroyed the entire application. It does not allow recovery.
You're both wrong. Throwing an exception runs destructors and/or finally blocks, which release locks. However, this doesn't mean exceptions make reliability easier - it's actually more complicated to write "exception safe" code that doesn't leave things in a bad state when an exception is thrown unexpectedly. The correct solution is to encode the possible error conditions (network cable unplugged, etc.) in the API, and force clients to handle them, rather than expect a VM to clean things up. A VM may be able to kill threads and release their resources, but it can never know how to restore the application's state - that's the application's job.
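
A toy Java example of the difference (names invented): the finally guarantees the unlock, but it's the catch-and-rollback that actually keeps the data consistent - and that's the part that's easy to forget:

Code: Select all

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

class ExceptionSafeQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<String> items = new ArrayDeque<>();
    private int count = 0;   // invariant: count == items.size()

    void push(String item) {
        lock.lock();
        try {
            items.push(item);
            count++;             // if validate() below throws, roll the change back;
            validate(item);      // otherwise the invariant silently breaks
        } catch (RuntimeException e) {
            items.pop();         // restore the invariant before propagating
            count--;
            throw e;
        } finally {
            lock.unlock();       // always runs - the lock can never leak
        }
    }

    private void validate(String item) {
        if (item.isEmpty()) throw new IllegalArgumentException("empty item");
    }
}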
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Concise Way to Describe Colour Spaces

Post by embryo2 »

Brendan wrote:
embryo2 wrote:If you are talking about memory failures, then yes, it's hard to defend against them. But there is plenty of hardware besides the memory. So, when someone unplugs the network cable we have (given a good design) something like layered exception handling: first the driver sets its output accordingly, then the TCP/IP implementation propagates an exception up to the level of the application that is currently working with the network. And even if there are no exception handlers defined by the programmer, the VM is still able to safely kill just one thread without the whole application being affected (whether via runtime checks or whatever else the VM allows us to implement).
Um, in this case it failed (the application can't continue because it can't access whatever it needed the network for) and your "exceptions" nonsense does nothing to ensure the application continues working correctly; and to make things far worse you've killed a thread that's probably holding multiple locks where other threads are likely to be waiting for it to finish doing something (and/or waiting for those locks to be released), and therefore you've probably completely destroyed the entire application. It does not allow recovery.
It recovers perfectly when the user plugs the cable back in. On the next access, the network-aware code just won't throw an exception. And this is possible because the application wasn't killed when the thread hit the bug, so the application just resumes normal processing with a new thread and, of course, with the locks freed by the VM.
Brendan wrote:I looked at many web sites. They're all using asynchronous messages (in the form of HTTP requests and replies over TCP/IP). Only some of them have "failover", which means that for a lot of them if the web server crashes the entire site is offline.
It seems you overlooked something important when you were looking at those sites. They just work, despite the fact that some page doesn't.

Another point here is the difference between a server, a site and a page. The page is the smallest part of a web application, visible as the content of a browser window. The web application is a set of pages, related in some way that is understandable to application users. The site usually corresponds to a domain name and can include one or more web applications. And finally the server is hardware that runs one or more sites. So your view of the situation - "if the web server crashes the entire site is offline" - is incorrect, because you stress hardware failure while we were talking here about safe handling of software failures.

And yet another point is about HTTP. Its request-response cycle is synchronous and doesn't resemble your messaging idea.
Brendan wrote:If a program has 100 threads that all rely on the same global data structure (which is actually very common), and a thread crashes while modifying that data structure (leaving the global data structure in an inconsistent/corrupted state), then all 100 threads are affected and not 2.
Yes, in such a situation 100 threads can produce incorrect results. But that situation is not as common as you think. If you look at a GUI desktop application or at a web site, you can see that these are essentially just sets of handlers with independent data structures.
Brendan wrote:If you think concurrent data modification is not a problem because usually each thread only modifies its own personal data; then I can only assume you've never written any software that uses threads.
Or you can rethink your vision of a typical application design.
Brendan wrote:When (e.g.) Oracle do an urgent security update that fixes 19 critical vulnerabilities (some of which could have led to compromising the system despite the "sandbox"); would they be bugs that nobody cares about?
Your assumption was about very slow VM bug fixing. My answer was about millions of programs that can be affected and a quick bug fix as a response to the massive outcry from users. Your next response was about hidden, never-fixed bugs. My answer was about the very low importance of bugs that are never fixed. Now you show us proof of quick bug fixing (confirming my previous post) and ask me about the importance of the fixed bugs.

Well, what can I say in response to such an exchange of messages? Maybe I can repeat the obvious - important bugs are fixed quickly, while bugs of low importance may only be fixed after some, possibly long, delay.
Brendan wrote:The "shared nothing" approach (where entities/threads only work on private data) also has race conditions (e.g. software that expects communication to occur in a certain order). It doesn't solve the problem. Your "design the code so that there's no race conditions" is pure wishful thinking.
My approach minimizes the problem, while your approach maximizes costs. Your way of improving reliability requires at least three computers and heavy communication overhead. My way (or just the traditional way) requires one computer and creates no communication overhead, while allowing some problems to pop up once in ten years.
Brendan wrote:Programmers will need to intelligently split things up into processes. You wouldn't put a trivial string comparison (which is mostly a single "rep cmpsb" instruction) into a separate process.
But your "separate process per library" approach just encourages a developer to create something very far from intelligent. The developer should always think about your messaging and how it affects performance instead of thinking about the actual task he is coding.
Brendan wrote:a service receives a request, executes any number of functions with no "per function" synchronisation whatsoever; then sends a reply.
If a function belongs to a library, then you are screwed. Every time there is a library we have the messaging overhead (interprocess communication), while the traditional approach has no such performance problem at all.
Brendan wrote:Of course the OS might be using redundancy for the arbitrary precision maths service; so when the application sends the request the OS forwards it to 3 different instances of the service, 2 of them might return the reply "7/3 * x * y + 14 * x + 9 + z" and one might return the reply "7/3 * x * y + 14 * x + 11 + z" (or not return a reply at all), then the OS will compare them and find that one of them failed (and maybe replace the failed instance for next time) and give the application the "7/3 * x * y + 14 * x + 9 + z" reply; and this happens without the application knowing or caring that redundancy was being used and without the application knowing or caring that there was a failure.
Yes, it should work, but at the price of 3 computers plus at least one more extra computer (to cover the communication expenses) instead of just 1. And the software complexity (for you and for the developers who use your OS) increases a lot. And all this is invented just for some rare situation, to show an advantage of your OS.

Maybe the reason is much simpler - you have many computers and want them to do something useful, so you invented the redundancy approach to utilize all your hardware.
Brendan wrote:Basically; I have one single simple and elegant thing (asynchronous message passing) that has multiple benefits (flexibility, scalability, fault tolerance, availability, security); and you have an ugly/eclectic mixture of multiple different things (managed code, VMs, exceptions, whatever) where each thing is only for one purpose and is inferior for that purpose.
Your "elegant" thing requires a lot of "not so elegant" helpers for it to be able to fulfil your promises. For every feature in the list (flexibility, scalability, fault tolerance, availability, security) you need to write a mixture of multiple different things where each thing is only for one purpose. It's like "everything is a file" approach. For memory region to look like file it is required to write a lot of code only for one purpose, despite of the "elegantness" of the initial abstraction.
Brendan wrote:The OS can't touch everything. The micro-kernel can
A VM also has its internal structure, and I could walk through the separation of its details to show you the analogy you are trying not to see. But such a discussion is absolutely pointless.
Brendan wrote:Of course when I say "the micro-kernel can" what I actually mean is that in a distributed system the micro-kernel on one computer can touch anything on that computer, but can't touch things on a different computer and therefore can't actually touch "everything".
Well, a distributed VM isn't such a rare animal, so you are still trying to dodge the problem in your argument about "the VM can touch everything". Yes, the VM can (though not a distributed variant), but the OS also can - so should we throw away all OSes?
My previous account (embryo) was accidentally deleted, so I had no choice but to use something new. But maybe it was a good lesson about software reliability :)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Concise Way to Describe Colour Spaces

Post by Brendan »

Hi,
Antti wrote:May I submit a feature request? Although applications do not care about redundancy and its transparency, system administrators may care about them in the interest of monitoring the distributed OS's general health. Perhaps this is already taken into account, but it would be really nice to have these kinds of features elegantly integrated into the OS's message passing system. What would be more interesting (for system administrators who just drink coffee in their offices and monitor the system) than to see a graphical map of the computers and some kind of statistical information about the message traffic?

Of course all the GUIs with bells and whistles for doing this are not important at this point. The important thing is to have a proper backend for this from the very beginning. Trying to retrofit it later sounds like a bad idea.
There are going to have to be multiple different views (e.g. a "hardware view" showing devices and their connections, a "software view" showing computers, processes, threads, etc.); with a way to get statistics on various things (RAM consumption, CPU usage, messages sent/received, etc.); possibly with colour coding; and possibly with the ability to generate graphs, etc.

For statistics, I'm planning to keep track of the number of messages sent and received for each thread, for each process and for each computer. I'm not planning to keep track of more detailed information, like which threads communicate with each other or statistics for individual message types - I'd worry the amount of data involved would be excessive and would change too rapidly.
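
For what it's worth, a sketch of how cheap that coarse bookkeeping can be (Java stand-in; per-process and per-computer rollups would work the same way with different keys):

Code: Select all

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Coarse per-thread message counters - cheap enough to bump on every send/receive.
class MessageStats {
    private final Map<Long, LongAdder> sent = new ConcurrentHashMap<>();
    private final Map<Long, LongAdder> received = new ConcurrentHashMap<>();

    void onSend(long threadId) {
        sent.computeIfAbsent(threadId, id -> new LongAdder()).increment();
    }

    void onReceive(long threadId) {
        received.computeIfAbsent(threadId, id -> new LongAdder()).increment();
    }

    long sentCount(long threadId) {
        LongAdder a = sent.get(threadId);
        return a == null ? 0 : a.sum();
    }
}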


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Concise Way to Describe Colour Spaces

Post by Brendan »

Hi,
Rusky wrote:
Brendan wrote:
embryo2 wrote:If you are talking about memory failures, then yes, it's hard to defend against them. But there is plenty of hardware besides the memory. So, when someone unplugs the network cable we have (given a good design) something like layered exception handling: first the driver sets its output accordingly, then the TCP/IP implementation propagates an exception up to the level of the application that is currently working with the network. And even if there are no exception handlers defined by the programmer, the VM is still able to safely kill just one thread without the whole application being affected (whether via runtime checks or whatever else the VM allows us to implement).
Um, in this case it failed (the application can't continue because it can't access whatever it needed the network for) and your "exceptions" nonsense does nothing to ensure the application continues working correctly; and to make things far worse you've killed a thread that's probably holding multiple locks where other threads are likely to be waiting for it to finish doing something (and/or waiting for those locks to be released), and therefore you've probably completely destroyed the entire application. It does not allow recovery.
You're both wrong. Throwing an exception runs destructors and/or finally blocks, which release locks. However, this doesn't mean exceptions make reliability easier - it's actually more complicated to write "exception safe" code that doesn't leave things in a bad state when an exception is thrown unexpectedly. The correct solution is to encode the possible error conditions (network cable unplugged, etc.) in the API, and force clients to handle them, rather than expect a VM to clean things up. A VM may be able to kill threads and release their resources, but it can never know how to restore the application's state - that's the application's job.
If an application is using a service (that happens to be running on another computer) and the user unplugs the network cable; then I'd rather the OS automatically start a new instance of that service to handle the application's request/s; so that there's no reason for the application to know or care that there was a problem (and no need for the application to deal with this situation at all, with exceptions or anything else). Note: while this sounds nice and easy (and it is) there is a limitation - the service must be "stateless".

Also note that (in my case) processes won't be using networking directly. The only API is "send message" and "get message"; and threads just send messages to other threads (without knowing or caring if the receiving thread is on the same computer or not). This has a few consequences; including the fact that it's impossible to guarantee a message will be delivered successfully before the (asynchronous) "send message" function returns.
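
As a sketch only (Java standing in for the real thing, with a String payload and thread ids invented for illustration), the whole API surface is something like:

Code: Select all

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Toy stand-in for the two-call API. Threads only ever see sendMessage/getMessage;
// whether the target thread is local or on another computer would be hidden below this.
class Messaging {
    private static final Map<Long, BlockingQueue<String>> inboxes = new ConcurrentHashMap<>();

    // Asynchronous: returning does NOT mean the message was delivered or handled.
    static void sendMessage(long targetThreadId, String message) {
        inboxes.computeIfAbsent(targetThreadId, id -> new LinkedBlockingQueue<>())
               .offer(message);
    }

    // Blocks the calling thread until something arrives in its own inbox.
    static String getMessage() throws InterruptedException {
        return inboxes.computeIfAbsent(Thread.currentThread().getId(),
                                       id -> new LinkedBlockingQueue<>())
                      .take();
    }
}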


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: Concise Way to Describe Colour Spaces

Post by Rusky »

Sure. You only need error handling at the levels of the system that you want to handle it. So a "network cable unplugged" error wouldn't be passed up to applications in the case of stateless services, but it would need to be handled in the actual networking code implementing "send message."
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Concise Way to Describe Colour Spaces

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:
embryo2 wrote:If you are talking about memory failures, then yes, it's hard to defend against them. But there is plenty of hardware besides the memory. So, when someone unplugs the network cable we have (given a good design) something like layered exception handling: first the driver sets its output accordingly, then the TCP/IP implementation propagates an exception up to the level of the application that is currently working with the network. And even if there are no exception handlers defined by the programmer, the VM is still able to safely kill just one thread without the whole application being affected (whether via runtime checks or whatever else the VM allows us to implement).
Um, in this case it failed (the application can't continue because it can't access whatever it needed the network for) and your "exceptions" nonsense does nothing to ensure the application continues working correctly; and to make things far worse you've killed a thread that's probably holding multiple locks where other threads are likely to be waiting for it to finish doing something (and/or waiting for those locks to be released), and therefore you've probably completely destroyed the entire application. It does not allow recovery.
It recovers perfectly when the user plugs the cable back in. On the next access, the network-aware code just won't throw an exception. And this is possible because the application wasn't killed when the thread hit the bug, so the application just resumes normal processing with a new thread and, of course, with the locks freed by the VM.
It recovers, if the user plugs the cable back in, and if the application developer spent time developing and testing the extra code needed to handle that case. Of course even if it does recover it's probably too late to matter anyway (e.g. the application has probably been "frozen" waiting for half an hour and the user probably gave up in disgust and terminated the application themselves).
embryo2 wrote:
Brendan wrote:I looked at many web sites. They're all using asynchronous messages (in the form of HTTP requests and replies over TCP/IP). Only some of them have "failover", which means that for a lot of them if the web server crashes the entire site is offline.
It seems you overlooked something important when you were looking at those sites. They just work, despite the fact that some page doesn't.
So you're saying that it works fine when there is no failure at all (e.g. the web server correctly returns a "page not found" error response)?
embryo2 wrote:Another point here is the difference between a server, a site and a page. The page is the smallest part of a web application, visible as the content of a browser window. The web application is a set of pages, related in some way that is understandable to application users. The site usually corresponds to a domain name and can include one or more web applications. And finally the server is hardware that runs one or more sites. So your view of the situation - "if the web server crashes the entire site is offline" - is incorrect, because you stress hardware failure while we were talking here about safe handling of software failures.
Sorry - this was confusing ("server" means multiple things). The web server (e.g. Apache) is software, and it runs on a server (hardware). If the software (e.g. Apache) crashes or the hardware it relies on fails, then (without "fail-over"/redundancy) the entire site is offline.
embryo2 wrote:
Brendan wrote:If a program has 100 threads that all rely on the same global data structure (which is actually very common), and a thread crashes while modifying that data structure (leaving the global data structure in an inconsistent/corrupted state), then all 100 threads are affected and not 2.
Yes, in such a situation 100 threads can produce incorrect results. But that situation is not as common as you think. If you look at a GUI desktop application or at a web site, you can see that these are essentially just sets of handlers with independent data structures.
This is like deciding it's safe for children to play on a busy highway after looking at a quiet street for 5 minutes and not seeing any traffic.
embryo2 wrote:
Brendan wrote:When (e.g.) Oracle do an urgent security update that fixes 19 critical vulnerabilities (some of which could have led to compromising the system despite the "sandbox"); would they be bugs that nobody cares about?
Your assumption was about very slow VM bug fixing. My answer was about millions of programs that can be affected and a quick bug fix as a response to the massive outcry from users. Your next response was about hidden, never-fixed bugs. My answer was about the very low importance of bugs that are never fixed. Now you show us proof of quick bug fixing (confirming my previous post) and ask me about the importance of the fixed bugs.

Well, what can I say in response to such an exchange of messages? Maybe I can repeat the obvious - important bugs are fixed quickly, while bugs of low importance may only be fixed after some, possibly long, delay.
If there are millions of programs affected by bugs in the VM, then fixing the VM can fix millions of programs. You think this is good because you only want to look at the "fix millions of programs" part. I think this is bad because millions of programs were affected by bugs in the VM in the first place.
embryo2 wrote:
Brendan wrote:The "shared nothing" approach (where entities/threads only work on private data) also has race conditions (e.g. software that expects communication to occur in a certain order). It doesn't solve the problem. Your "design the code so that there's no race conditions" is pure wishful thinking.
My approach minimizes the problem, while your approach maximizes costs. Your way of improving reliability requires at least three computers and heavy communication overhead. My way (or just the traditional way) requires one computer and creates no communication overhead, while allowing some problems to pop up once in ten years.
Your approach relies on luck and wishful thinking, and doesn't satisfy any of the objectives.

Note: My approach doesn't require 3 computers, it's just able to spread load over multiple computers if/when it's beneficial.
embryo2 wrote:
Brendan wrote:Programmers will need to intelligently split things up into processes. You wouldn't put a trivial string comparison (which is mostly a single "rep cmpsb" instruction) into a separate process.
But your "separate process per library" approach just encourages a developer to create something very far from intelligent. The developer should always think about your messaging and how it affects performance instead of thinking about the actual task he is coding.
Wrong. The task the developer is coding is just different. E.g. instead of writing an entire application, they'd be writing one of multiple pieces that communicate and would be more able to focus on that piece.
embryo2 wrote:
Brendan wrote:a service receives a request, executes any number of functions with no "per function" synchronisation whatsoever; then sends a reply.
If a function belongs to a library, then you are screwed. Every time there is a library we have the messaging overhead (interprocess communication), while the traditional approach has no such performance problem at all.
A service is not a "library of functions" like you're thinking. It's more like client/server where there's 2 processes communicating (except that it's peer-to-peer with multiple processes). It's like you're complaining that (e.g.) if web browser asks web server to do trivial things (string comparisons) the overhead would be bad; and you're right (the overhead would be bad in that case), but you're wrong because services aren't used for trivial things like that in the first place.
embryo2 wrote:
Brendan wrote:Of course the OS might be using redundancy for the arbitrary precision maths service; so when the application sends the request the OS forwards it to 3 different instances of the service, 2 of them might return the reply "7/3 * x * y + 14 * x + 9 + z" and one might return the reply "7/3 * x * y + 14 * x + 11 + z" (or not return a reply at all), then the OS will compare them and find that one of them failed (and maybe replace the failed instance for next time) and give the application the "7/3 * x * y + 14 * x + 9 + z" reply; and this happens without the application knowing or caring that redundancy was being used and without the application knowing or caring that there was a failure.
Yes, it should work, but at the price of 3 computers plus at least one more extra computer (to cover the communication expenses) instead of just 1. And the software complexity (for you and for the developers who use your OS) increases a lot. And all this is invented just for some rare situation, to show an advantage of your OS.

Maybe the reason is much simpler - you have many computers and want them to do something useful, so you invented the redundancy approach to utilize all your hardware.
In my computer room there's a network of 28 computers on a LAN. For total resources it adds up to about 70 CPUs, 32 GPUs and 150 GiB of RAM. My project is a distributed system. The challenge is to allow software to make use of all those resources. I should be able to have a single user that's using a single application that uses all those resources all by itself (maybe a 3D game of massive proportions). I should also be able to plug in 100 keyboards and 100 monitors and have 100 users running 300 normal desktop/GUI applications (where if some CPUs get overloaded the OS just shifts load to different CPUs on the LAN). If one of those 100 users is doing something that requires extremely high reliability, then why not let them have some redundancy?

Note that when there's only a single computer with a single CPU, the OS can still run 3 instances of a service on that one computer to provide a little extra reliability. This isn't as good (e.g. if that single CPU blows up then you can't expect things to keep running) and that single CPU might struggle to cope with the extra load; but it still works without 3 computers.

Finally; don't forget that the same software works "as is" for all situations and programmers don't have to care about any of this. It's just threads communicating with messages; and it doesn't matter (to programmers or their software) if there's lots of computers or just one, or if there's redundancy or not.
embryo2 wrote:
Brendan wrote:Basically; I have one single simple and elegant thing (asynchronous message passing) that has multiple benefits (flexibility, scalability, fault tolerance, availability, security); and you have an ugly/eclectic mixture of multiple different things (managed code, VMs, exceptions, whatever) where each thing is only for one purpose and is inferior for that purpose.
Your "elegant" thing requires a lot of "not so elegant" helpers for it to be able to fulfil your promises. For every feature in the list (flexibility, scalability, fault tolerance, availability, security) you need to write a mixture of multiple different things where each thing is only for one purpose. It's like "everything is a file" approach. For memory region to look like file it is required to write a lot of code only for one purpose, despite of the "elegantness" of the initial abstraction.
There's some code in the kernel to handle messages, processes/services, and redundancy (but half of it has to exist in some form anyway, for any multi-tasking OS of any description). There are no "helpers" in any process or anywhere else in user-space.
embryo2 wrote:
Brendan wrote:Of course when I say "the micro-kernel can" what I actually mean is that in a distributed system the micro-kernel on one computer can touch anything on that computer, but can't touch things on a different computer and therefore can't actually touch "everything".
Well, a distributed VM isn't such a rare animal, so you are still trying to dodge the problem in your argument about "the VM can touch everything". Yes, the VM can (though not a distributed variant), but the OS also can - so should we throw away all OSes?
You should start by figuring out the difference between an OS and a kernel.

After that, maybe calculate "total amount of code that could screw you". For a micro-kernel you're typically looking at about 128 KiB of code (less for most micro-kernels) written by a small number of people; and for "VM plus kernel" it's a significantly higher risk.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Concise Way to Describe Colour Spaces

Post by embryo2 »

Rusky wrote:Throwing an exception runs destructors and/or finally blocks, which release locks.
Yes, you're right. Sorry for my memory issue (the hardware problem :) ).
Rusky wrote:However, this doesn't mean exceptions make reliability easier - it's actually more complicated to write "exception safe" code that doesn't leave things in a bad state when an exception is thrown unexpectedly.
Unfortunately the biggest part of my experience is Java-based (a lot of exceptions and the like), so can you explain the other ways to write code "that doesn't leave things in a bad state"?
Rusky wrote:The correct solution is to encode the possible error conditions (network cable unplugged, etc.) in the API, and force clients to handle them
What is the important difference between a result code and an exception here? Exceptions are enforced by the VM, while result codes can simply be ignored; exceptions deliver some extended information about the problem, while result codes leave a developer with just 32 bits of information. What are the benefits of using result codes (apart from the almost invisible performance gain)?
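
To make the comparison concrete, a small Java sketch (invented names) showing the same failure surfaced both ways:

Code: Select all

// A result code the caller may silently forget to check, versus a checked
// exception the compiler forces the caller to handle.
class NetworkError extends Exception {
    final String device;   // extra context a bare 32-bit code can't carry
    NetworkError(String device, String msg) { super(msg); this.device = device; }
}

public class ErrorStyles {
    static final int E_UNPLUGGED = -1;

    static int connectByCode(String host) {
        return E_UNPLUGGED;                   // caller can silently ignore this
    }

    static void connectByException(String host) throws NetworkError {
        throw new NetworkError("eth0", "cable unplugged");   // impossible to ignore
    }

    public static void main(String[] args) {
        connectByCode("example.org");         // compiles fine - the error is lost
        try {
            connectByException("example.org");
        } catch (NetworkError e) {
            System.out.println(e.getMessage() + " on " + e.device);
        }
    }
}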
Rusky wrote:A VM may be able to kill threads and release their resources, but it can never know how to restore the application's state - that's the application's job.
Yes, and it stresses the importance of good application design instead of some universal OS-based features like messaging.
My previous account (embryo) was accidentally deleted, so I had no choice but to use something new. But maybe it was a good lesson about software reliability :)
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Concise Way to Describe Colour Spaces

Post by embryo2 »

Brendan wrote:If an application is using a service (that happens to be running on another computer) and the user unplugs the network cable; then I'd rather the OS automatically start a new instance of that service to handle the application's request/s; so that there's no reason for the application to know or care that there was a problem (and no need for the application to deal with this situation at all, with exceptions or anything else). Note: while this sounds nice and easy (and it is) there is a limitation - the service must be "stateless".
Do you mean the OS should start the remote service locally? But what if there are no binaries for the service implementation on the client OS side?
Brendan wrote:Also note that (in my case) processes won't be using networking directly. The only API is "send message" and "get message"; and threads just send messages to other threads (without knowing or caring if the receiving thread is on the same computer or not).
It's the same layered approach. Your set of layers differs from the set required to execute SQL queries, but the basic idea is identical - a piece of code doesn't have to know about some underlying complexity.
Brendan wrote:It recovers, if the user plugs the cable back in, and if the application developer spent time developing and testing the extra code needed to handle that case.
No, the same piece of code is invoked in the same way whether the cable is plugged in or not. If the cable is unplugged, an exception is thrown. If the cable is plugged in, the code executes normally. In the case of a stateless service no more coding is required; in the case of a stateful service some transaction support is recommended. So, in both cases the application will recover successfully.
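
For example (a Java sketch; fetchOverNetwork is an invented stand-in that "fails" twice and then "succeeds", as if the cable was plugged back in - note the success path has no recovery code at all):

Code: Select all

import java.io.IOException;

public class RetryDemo {
    static int attempts = 0;

    static String fetchOverNetwork() throws IOException {
        if (++attempts < 3) throw new IOException("cable unplugged");   // stand-in
        return "payload";
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            try {
                System.out.println(fetchOverNetwork());   // the normal path, unchanged
                break;
            } catch (IOException e) {
                Thread.sleep(1000);   // wait and retry; the user may replug meanwhile
            }
        }
    }
}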
Brendan wrote:So you're saying that it works fine when there is no failure at all (e.g. the web server correctly returns a "page not found" error response)?
No, I'm saying that it works fine even when there is a response with status code 500.
Brendan wrote:If there are millions of programs affected by bugs in the VM, then fixing the VM can fix millions of programs. You think this is good because you only want to look at the "fix millions of programs" part. I think this is bad because millions of programs were affected by bugs in the VM in the first place.
For an OS the situation is identical - millions of programs are affected by a bug in the OS. Is this bad? Yes. But what can we do?
Brendan wrote:Note: My approach doesn't require 3 computers, it's just able to spread load over multiple computers if/when it's beneficial.
Without 3 computers it is impossible to implement a system protected against hardware failure, because your result-voting approach just doesn't work for 2 computers and is useless for 1.
Brendan wrote:The task the developer is coding is just different. E.g. instead of writing an entire application, they'd be writing one of multiple pieces that communicate and would be more able to focus on that piece.
So, it requires a paradigm shift. Maybe that's why you were talking about fresh developers without any experience of existing development methods.
Brendan wrote:A service is not a "library of functions" like you're thinking.
Your initial words were about something like restarting a library. That's why I supposed you were going to implement a library of functions with some strange calling convention.
Brendan wrote:It's more like client/server where there's 2 processes communicating (except that it's peer-to-peer with multiple processes). It's like you're complaining that (e.g.) if web browser asks web server to do trivial things (string comparisons) the overhead would be bad; and you're right (the overhead would be bad in that case), but you're wrong because services aren't used for trivial things like that in the first place.
Then it's all about the traditional coarse-grained approach, as is the case for web services and web sites, but with a different transport protocol. Wouldn't it be easier just to take some existing RPC implementation with the underlying protocols already in place?
Brendan wrote:In my computer room there's a network of 28 computers on a LAN. For total resources it adds up to about 70 CPUs, 32 GPUs and 150 GiB of RAM. My project is a distributed system. The challenge is to allow software to make use of all those resources. I should be able to have a single user that's using a single application that uses all those resources all by itself (maybe a 3D game of massive proportions). I should also be able to plug in 100 keyboards and 100 monitors and have 100 users running 300 normal desktop/GUI applications (where if some CPUs get overloaded the OS just shifts load to different CPUs on the LAN). If one of those 100 users is doing something that requires extremely high reliability, then why not let them have some redundancy?
Yes. I agree, that your situation is demanding for some software support. But may be it worth to read more about distributed system architecture before you start to implement your messaging solution?
Brendan wrote:Note that when there's only a single computer with a single CPU, the OS can still run 3 instances of a service on that one computer to provide a little extra reliability. This isn't as good (e.g. if that single CPU blows up then you can't expect things to keep running) and that single CPU might struggle to cope with the extra load; but it still works without 3 computers.
For distributed systems this is the well-known case of software redundancy. There is no hardware failure protection, but a software crash caused by some specific state-dependent bug can be recovered from.
Brendan wrote:Finally; don't forget that the same software works "as is" for all situations and programmers don't have to care about any of this. It's just threads communicating with messages; and it doesn't matter (to programmers or their software) if there's lots of computers or just one, or if there's redundancy or not.
Programmers wouldn't have to care, provided they managed to make the paradigm shift from traditional development methods to the kind of message-oriented development you are proposing here.
Brendan wrote:There's some code in the kernel to handle messages, processes/services, and redundancy (but half of it has to exist in some form anyway, for any multi-tasking OS of any description). There are no "helpers" in any process or anywhere else in user-space.
Who defines message routing? What kinds of protocols will be involved? Who cares about network topology? What overhead should a system administrator working with your system expect? What new tools will administrators have to use?

There are many questions like those above, so a full implementation of your messaging will pull in a lot of additional code, even if you can't see it yet.
Brendan wrote:calculate "total amount of code that could screw you". For a micro-kernel you're typically looking at about 128 KiB of code (less for most micro-kernels) written by a small number of people; and for "VM plus kernel" it's a significantly higher risk.
Yes, "VM plus kernel" requires more code. There even were times when there was no code at all, but, for some reason, almost nobody today sees the contemporary life as more risky than the life in the times when there were no code at all.
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)
User avatar
Rusky
Member
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: Concise Way to Describe Colour Spaces

Post by Rusky »

embryo2 wrote:
Rusky wrote:However, this doesn't mean exceptions make reliability easier- it's actually more complicated to write "exception safe" code that doesn't leave things in a bad state when an exception is thrown unexpectedly.
Unfortunately the biggest part of my experience is Java-based (a lot of exceptions et al.), so can you explain the other ways to write code "that doesn't leave things in a bad state"?
If you use synchronized blocks to lock/unlock things, you're obviously doing things in between that shouldn't be interrupted. So if you get unwound by an exception instead of by finishing the synchronized block, it will leave you in the same state as if you had forgotten the lock and got preempted. Writing code without this problem is called being "exception safe" and basically involves a lot of subtle tricks involving destructors or finally blocks. It's one of the downsides of exception support, and the reason that (for example) Rust has the ability to throw but not to catch exceptions (they become thread-level aborts).
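To make that concrete, here is a minimal Java sketch (the Account and audit names are invented for illustration). With an explicit lock, the finally block is all that stands between an unexpected exception and a permanently leaked lock - and even with it, any shared state mutated before the throw is left half-updated:

Code:

import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    void withdraw(long amount) {
        lock.lock();
        try {
            balance -= amount;   // shared state is now mid-update
            audit(amount);       // if this throws, we unwind...
        } finally {
            lock.unlock();       // ...and without this line the lock is never released
        }
    }

    private void audit(long amount) {
        // stand-in for work that may throw a RuntimeException
    }
}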
embryo2 wrote:
Rusky wrote:The correct solution is to encode the possible error conditions (network cable unplugged, etc.) in the API, and force clients to handle them
What is the important difference between result codes and exceptions here? Exceptions are enforced by the VM, while result codes can simply be ignored; exceptions deliver extended information about the problem, while result codes leave a developer with just 32 bits of information. What are the benefits of using result codes (apart from the almost invisible performance gain)?
Exceptions are enforced by the language (not the VM), but in the wrong way- they kill you at runtime if you don't handle them. Simple 32-bit result codes are not what I'm talking about- properly-encoded error conditions are part of the function type and require you to explicitly handle the error. One example is a Result type that holds either the normal result or an error (with just as much information as an exception), but not both, forcing the program to handle both cases at compile time. Another example that Brendan has brought up before is passing in error handling continuations to functions, so that callers are forced, at compile time, to provide handlers for each error condition. Of course, sometimes the proper thing to do with an error is to pass it back to your caller- the difference here is that the compiler forces the programmer to make that decision, instead of potentially forgetting or not noticing the possibility of an exception.
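As a hedged Java sketch of such a Result type (an illustration of the idea, not any particular library's API): match() is the only accessor and it demands a handler for both cases, so forgetting the error path becomes a compile-time error instead of a runtime surprise.

Code:

import java.util.function.Function;

// Holds either a value or an error, never both.
abstract class Result<T, E> {
    // The only way to extract the contents is to handle *both* cases.
    public abstract <R> R match(Function<T, R> onOk, Function<E, R> onErr);

    public static <T, E> Result<T, E> ok(T value)  { return new Ok<>(value); }
    public static <T, E> Result<T, E> err(E error) { return new Err<>(error); }

    private static final class Ok<T, E> extends Result<T, E> {
        private final T value;
        Ok(T value) { this.value = value; }
        public <R> R match(Function<T, R> onOk, Function<E, R> onErr) {
            return onOk.apply(value);
        }
    }

    private static final class Err<T, E> extends Result<T, E> {
        private final E error;
        Err(E error) { this.error = error; }
        public <R> R match(Function<T, R> onOk, Function<E, R> onErr) {
            return onErr.apply(error);
        }
    }
}

A call site then reads like result.match(page -> render(page), err -> showError(err)), where render and showError are hypothetical handlers.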
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Concise Way to Describe Colour Spaces

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:If an application is using a service (that happens to be running on another computer) and the user unplugs the network cable; then I'd rather the OS automatically start a new instance of that service to handle the application's request/s; so that there's no reason for the application to know or care that there was a problem (and no need for the application to deal with this situation at all, with exceptions or anything else). Note: while this sounds nice and easy (and it is) there is a limitation - the service must be "stateless".
Do you mean the OS should start a remote service locally? But what if there are no binaries for the service implementation on the client OS side?
I mean, the OS starts another instance of the service "somewhere" (on any computer that is still part of the group of computers/cluster). This could be either local or remote. Note: it doesn't matter much where, except for performance. For performance, "where" is a compromise between CPU load and communication overhead; and I'll be adding a little information into the executable's header (e.g. "average amount of processing per request") so that the OS can make a more effective decision. For example; if something uses a massive amount of CPU time but has very little communication then you'd want it on whichever computer has the least work to do (regardless of communication costs), and if something doesn't use much CPU time but communicates a lot then you're going to want it "close" to whatever uses it instead (regardless of existing load on the computer).
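As a toy illustration of that compromise (the cost model and every name below are invented here, not taken from Brendan's actual design): score each candidate node by estimated compute cost plus communication cost, and pick the cheapest.

Code:

import java.util.List;

// CPU-heavy services go to idle nodes; chatty services go to nearby nodes.
// cpuHint and msgHint stand in for the executable-header fields.
class Placement {
    static class Node {
        double load;       // 0.0 = idle .. 1.0 = saturated
        double rttMillis;  // communication cost to the requester
    }

    static Node choose(List<Node> nodes, double cpuHint, double msgHint) {
        Node best = null;
        double bestCost = Double.MAX_VALUE;
        for (Node n : nodes) {
            double cost = cpuHint / (1.0 - Math.min(n.load, 0.99)) // queueing penalty
                        + msgHint * n.rttMillis;                   // messaging penalty
            if (cost < bestCost) { bestCost = cost; best = n; }
        }
        return best;
    }
}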

If there's no binary for the service on the computer that's chosen; then the OS uses a distributed file system so the computer will just have to fetch the file. More specifically, the computer will try to fetch an "already optimised for this computer" native version of the executable, and if that doesn't exist it will try to fetch the "portable byte-code" version of the executable file and compile it for the specific computer. In either case it will try to cache the executable locally after it's been fetched (to avoid fetching/compiling if it's needed again later).

If there's no executable for the service on the computer that's chosen and the executable file can't be obtained from the distributed file system; then the OS fails to start a new instance of the service.
embryo2 wrote:
Brendan wrote:Also note that (in my case) processes won't be using networking directly. The only API is "send message" and "get message"; and threads just send messages to other threads (without knowing or caring if the receiving thread is on the same computer or not).
It's the same layered approach. Your set of layers differs from the set required to execute SQL queries, but the basic idea is identical - a piece of code doesn't have to know about the underlying complexity.
Yes.
embryo2 wrote:
Brendan wrote:It recovers, if the user plugs the cable back in, and if the application developer spent time developing and testing the extra code needed to handle that case.
No, the same piece of code is invoked in the same way whether the cable is plugged in or not. If the cable is unplugged, an exception is thrown; if it is plugged in, the code executes normally. In the case of a stateless service no extra coding is required; in the case of a stateful service some transaction support is recommended. So, in both cases the application recovers successfully.
If the user never plugs the cable back in then the application won't recover successfully (because the OS lacks the ability to start a new instance of the service that the application requires). If the user does plug the cable in then it might recover successfully; but someone is going to have to write code to handle that (e.g. catch the exception and retry) and test to make sure it works properly (the code to handle it doesn't just magically appear out of thin air). Of course in my case the code to handle it doesn't magically appear out of thin air either; but it is built into the kernel and normal/application programmers don't need to do anything.
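For comparison, a minimal sketch of what that hand-written handling might look like on the exception side; fetchQuote and ServiceUnavailableException are invented names:

Code:

class RetryClient {
    String fetchQuoteWithRetry(int maxAttempts) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                return fetchQuote();                  // normal path
            } catch (ServiceUnavailableException e) { // e.g. cable unplugged
                if (attempt == maxAttempts) throw e;  // give up eventually
                Thread.sleep(1000L * attempt);        // back off, then retry
            }
        }
    }

    String fetchQuote() { /* stand-in for the real remote call */ return ""; }

    static class ServiceUnavailableException extends RuntimeException {}
}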
embryo2 wrote:
Brendan wrote:If there are millions of programs affected by bugs in the VM, then fixing the VM can fix millions of programs. You think this is good because you only want to look at the "fix millions of programs" part. I think this is bad because millions of programs were affected by bugs in the VM in the first place.
For an OS the situation is identical - millions of programs are affected by a bug in the OS. Is this bad? Yes. But what can we do?
What you can do is minimise the amount of code that processes depend on and that could have bugs.

More specifically, if a program has no bugs (e.g. has been extensively tested by things like static analysis, unit tests and people/users), you want to minimise the chance that it becomes buggy later on because something it depends on (e.g. kernel, virtual machine, dynamically linked libraries) was updated and has introduced new bugs that didn't exist before and couldn't have been found during that extensive testing. To minimise the chance of that happening you minimise the amount of code processes are forced to depend on (e.g. don't have any virtual machine or any dynamically linked libraries; and minimise the amount of code in the kernel).
embryo2 wrote:
Brendan wrote:Note: My approach doesn't require 3 computers, it's just able to spread load over multiple computers if/when it's beneficial.
Without 3 computers it is impossible to implement a hardware-failure-protected system, because your result-voting approach just doesn't work with 2 computers and is useless with 1.
No; it still works (even with only 1 computer); it's just less effective. For example (for the "1 computer" case) it would still guard against transient faults, and still guard against RAM faults (because each instance of the service is using different physical pages of RAM); and if the computer has 2 or more CPUs it can guard against one CPU failing (even if that CPU fails while an instance of the service is using it and that instance of the service has to be terminated).
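A user-space sketch of the voting idea, assuming a hypothetical Instance interface (in the OS being described this would happen transparently in the kernel, not in application code): run the same request through all instances and accept the first reply that gains a majority, letting crashed or hung instances lose their vote.

Code:

import java.util.*;
import java.util.concurrent.*;

class Voter {
    interface Instance { byte[] handle(byte[] request) throws Exception; }

    static byte[] vote(List<Instance> instances, byte[] request) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(instances.size());
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (Instance i : instances)
                futures.add(pool.submit(() -> i.handle(request)));

            Map<String, Integer> tally = new HashMap<>();
            for (Future<byte[]> f : futures) {
                try {
                    byte[] reply = f.get(5, TimeUnit.SECONDS);
                    int votes = tally.merge(Arrays.toString(reply), 1, Integer::sum);
                    if (votes * 2 > instances.size())   // majority reached
                        return reply;
                } catch (Exception crashedOrTimedOut) {
                    // a crashed or hung instance simply loses its vote
                }
            }
            throw new IllegalStateException("no majority answer");
        } finally {
            pool.shutdownNow();
        }
    }
}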
embryo2 wrote:
Brendan wrote:It's more like client/server where there's 2 processes communicating (except that it's peer-to-peer with multiple processes). It's like you're complaining that (e.g.) if web browser asks web server to do trivial things (string comparisons) the overhead would be bad; and you're right (the overhead would be bad in that case), but you're wrong because services aren't used for trivial things like that in the first place.
Then it's all about the traditional coarse-grained approach, as with web services and web sites, but with a different transport protocol.
Yes - in various ways it's similar to (e.g.) a web browser that's communicating with a web server that's communicating with an SQL server.
embryo2 wrote:Wouldn't it be easier just to take some existing RPC implementation with the underlying protocols already in place?
It might be easier - I don't know. I personally find it easier to invent my own way; and harder to figure out the details of someone else's method and then ensure my implementation matches their behaviour perfectly. Of course "easier" has nothing to do with how good it is.

The problem with (e.g.) RPC is that it's synchronous (designed to mimic the behaviour of function calls), which means that you can't just send 10 requests to 10 different computers and do other work while you're waiting for the replies, so as to get all 11 computers doing useful work in parallel. To achieve that with RPC you'd have to spawn 10 new threads and do one "request+reply" RPC on each thread, which ends up causing more overhead (thread creation, thread switching, etc), is very ugly (e.g. needing locks and stuff to manage state), and is much more error-prone.
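Sketched in Java, with CompletableFuture standing in for a hypothetical "send message / get message" primitive (askNode is an invented name): the ten requests return immediately, local work continues, and replies are gathered when needed - no thread-per-call required.

Code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class FanOut {
    static CompletableFuture<String> askNode(int node) {
        // stand-in: in a message-passing OS this would be a sendMessage() call
        return CompletableFuture.supplyAsync(() -> "reply from node " + node);
    }

    public static void main(String[] args) {
        List<CompletableFuture<String>> replies = new ArrayList<>();
        for (int node = 0; node < 10; node++)
            replies.add(askNode(node));        // returns immediately

        doLocalWork();                         // the 11th computer stays busy

        for (CompletableFuture<String> r : replies)
            System.out.println(r.join());      // gather replies when ready
    }

    static void doLocalWork() { /* ... */ }
}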
embryo2 wrote:
Brendan wrote:In my computer room there's a network of 28 computers on a LAN. For total resources it adds up to about 70 CPUs, 32 GPUs and 150 GiB of RAM. My project is a distributed system. The challenge is to allow software to make use of all those resources. I should be able to have a single user that's using a single application that uses all those resources all by itself (maybe a 3D game of massive proportions). I should also be able to plug in 100 keyboards and 100 monitors and have 100 users running 300 normal desktop/GUI applications (where if some CPUs get overloaded the OS just shifts load to different CPUs on the LAN). If one of those 100 users is doing something that requires extremely high reliability, then why not let them have some redundancy?
Yes, I agree that your situation demands some software support. But maybe it is worth reading more about distributed system architecture before you start to implement your messaging solution?
You think I've never read anything about distributed systems before?

Note that most distributed systems suck - e.g. they require manual configuration and assigned roles, and don't dynamically shift things around to cope with hardware failures or to balance load.
embryo2 wrote:
Brendan wrote:There's some code in the kernel to handle messages, processes/services, and redundancy (but half of it has to exist in some form anyway, for any multi-tasking OS of any description). There are no "helpers" in any process or anywhere else in user-space.
Who defines message routing? What kinds of protocols will be involved? Who cares about network topology? What overhead should a system administrator working with your system expect? What new tools will administrators have to use?
You're changing the subject. There is no "helper code" in processes to help with things like fault tolerance (unlike your "exceptions" idea where the exception handlers are helper code, and unlike your "VM" idea where the virtual machine is all helper code).

The kernel has to take care of routing. The networking protocol used will depend on the hardware/network card drivers (e.g. raw ethernet packets for typical LANs). Private messaging protocols (e.g. used internally within an application) are determined by the application developer. Public messaging protocols have a formal standardisation process I've mentioned before. Administrators will have a tool to view statistics for both hardware and software (that I've also mentioned before) but this isn't "new" (in that most OSs have something like this anyway).

Administrators and maintenance people will also have a tool to receive notifications from the OS and resolve issues. For example; if the OS detects that a user's mouse is due for cleaning it gets added as a low priority job in the "maintenance tool", and when admin/maintenance staff do that job they tell the OS it's been completed. This is used for scheduled maintenance, hardware failures, and recommendations (where OS recommends hardware changes when it detects bottlenecks and/or when hardware is getting close to its "mean time between failures").


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo2
Member
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Concise Way to Describe Colour Spaces

Post by embryo2 »

Rusky wrote:If you use synchronized blocks to lock/unlock things, you're obviously doing things in between that shouldn't be interrupted. So if you get unwound by an exception instead of by finishing the synchronized block, it will leave you in the same state as if you had forgotten the lock and got preempted. Writing code without this problem is called being "exception safe" and basically involves a lot of subtle tricks involving destructors or finally blocks. It's one of the downsides of exception support, and the reason that (for example) Rust has the ability to throw but not to catch exceptions (they become thread-level aborts).
If the words "a lot of subtle tricks" just mean try-catch-finally blocks, then I strongly disagree. They introduce an additional level of indentation and a few extra lines of code, but I can't describe that with words like "subtle" or "tricks" (let alone "a lot" of such mess).

And Java, like Rust (or is it more correct to say "Rust, like Java"?), also supports so-called runtime exceptions, which don't require a developer to write exception handlers.
Rusky wrote:Exceptions are enforced by the language (not the VM)
If we want to be precise then we should put it like this: one part of the exception architecture is enforced by the compiler, which is written according to the rules found in the language; and the second part is enforced by the VM (runtime exceptions, for example), which again is written according to the language's rules.
Rusky wrote:Simple 32-bit result codes are not what I'm talking about- properly-encoded error conditions are part of the function type and require you to explicitly handle the error. One example is a Result type that holds either the normal result or an error (with just as much information as an exception), but not both, forcing the program to handle both cases at compile time.
How can something like a C-style union force a developer to write an exception handler?
Rusky wrote:Another example that Brendan has brought up before is passing in error handling continuations to functions, so that callers are forced, at compile time, to provide handlers for each error condition.
The Java way of hiding those additional parameters is more concise.
My previous account (embryo) was accidentally deleted, so I have no choice but to use something new. But maybe it was a good lesson about software reliability :)
embryo2
Member
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: Concise Way to Describe Colour Spaces

Post by embryo2 »

Brendan wrote:I mean, the OS starts another instance of the service "somewhere" (on any computer that is still part of the group of computers/cluster). This could be either local or remote. Note: it doesn't matter much where, except for performance. For performance, "where" is a compromise between CPU load and communication overhead; and I'll be adding a little information into the executable's header (e.g. "average amount of processing per request") so that the OS can make a more effective decision. For example; if something uses a massive amount of CPU time but has very little communication then you'd want it on whichever computer has the least work to do (regardless of communication costs), and if something doesn't use much CPU time but communicates a lot then you're going to want it "close" to whatever uses it instead (regardless of existing load on the computer).

If there's no binary for the service on the computer that's chosen; then the OS uses a distributed file system so the computer will just have to fetch the file. More specifically, the computer will try to fetch an "already optimised for this computer" native version of the executable, and if that doesn't exist it will try to fetch the "portable byte-code" version of the executable file and compile it for the specific computer. In either case it will try to cache the executable locally after it's been fetched (to avoid fetching/compiling if it's needed again later).
Ok, I see your point. But this algorithm is exactly the helper I was talking about in my previous message. So, instead of an "elegant and clean" messaging solution you will have a set of implementations of different helper algorithms, handling all the special cases by defining one algorithm per case. In the example above we see service deployment and start-up mixed with a load-optimisation job and the need for a software repository. And for each case you have to invent and implement a helper algorithm.
Brendan wrote:If the user never plugs the cable back in then the application won't recover successfully (because the OS lacks the ability to start a new instance of the service that the application requires).
Yes, the ability to start a backup service is a nice thing. But it has nothing to do with messaging, because messaging is just one of the possible transport protocols below the service level.
Brendan wrote:If the user does plug the cable in then it might recover successfully; but someone is going to have to write code to handle that (e.g. catch the exception and retry) and test to make sure it works properly (the code to handle it doesn't just magically appear out of thin air). Of course in my case the code to handle it doesn't magically appear out of thin air either; but it is built into the kernel and normal/application programmers don't need to do anything.
In my case the handling code is located in the VM. If a developer doesn't implement an exception handler, the VM catches the exception outside of the thread's root method (function) and ensures that the required bookkeeping always happens.
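Java's runtime does expose exactly this hook: an exception that escapes a thread's root method is routed to an uncaught-exception handler, and only that thread dies. A minimal, runnable demonstration:

Code:

public class UncaughtDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread.setDefaultUncaughtExceptionHandler((thread, e) ->
            System.err.println(thread.getName() + " died: " + e));

        Thread worker = new Thread(() -> { throw new RuntimeException("bug"); });
        worker.start();
        worker.join();

        System.out.println("main thread still running");  // rest of app unaffected
    }
}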
Brendan wrote:What you can do is minimise the amount of code that processes depend on and that could have bugs.
Minimisation alone is not a silver bullet. More complex solutions with a lot of code can be safer than solutions with less complexity and less code. For example, a database plus its client represents a lot of code in total, but such a solution is much safer than a freshly invented file-based storage.
Brendan wrote:it still works (even with only 1 computer); it's just less effective. For example (for the "1 computer" case) it would still guard against transient faults, and still guard against RAM faults (because each instance of the service is using different physical pages of RAM); and if the computer has 2 or more CPUs it can guard against one CPU failing (even if that CPU fails while an instance of the service is using it and that instance of the service has to be terminated).
So, it works only partially - for some hardware configurations or some special kinds of problems.
Brendan wrote:The problem with (e.g.) RPC is that it's synchronous (designed to mimic the behaviour of function calls), which means that you can't just send 10 requests to 10 different computers and do other work while you're waiting for the replies, so as to get all 11 computers doing useful work in parallel. To achieve that with RPC you'd have to spawn 10 new threads and do one "request+reply" RPC on each thread, which ends up causing more overhead (thread creation, thread switching, etc), is very ugly (e.g. needing locks and stuff to manage state), and is much more error-prone.
In fact your solution also requires locks and the like to manage state. Every time a message is posted, a lock is required to ensure the queue structure (the state) is not corrupted by multiple threads. So here the standard solution is on par with yours. Thread creation is also not required, thanks to the widely used thread-pooling approach. And thread switching has to be implemented in your solution too (simply because it has threads).
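The point in miniature, using a standard Java blocking queue: the locking (or lock-free retry loop) is hidden inside the library, but the coordination cost is still there.

Code:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class MailboxDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> mailbox = new ArrayBlockingQueue<>(64);

        Thread sender = new Thread(() -> {
            try { mailbox.put("hello"); }       // blocks if full; internally synchronised
            catch (InterruptedException ignored) {}
        });
        sender.start();

        System.out.println(mailbox.take());     // blocks if empty; internally synchronised
        sender.join();
    }
}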
Brendan wrote:Note that most distributed systems suck - e.g. they require manual configuration and assigned roles, and don't dynamically shift things around to cope with hardware failures or to balance load.
Cloud computing offers dynamic reconfiguration options, and automatic load balancers are used everywhere. So your variant isn't the best in this area.
Brendan wrote:You're changing the subject. There is no "helper code" in processes to help with things like fault tolerance (unlike your "exceptions" idea where the exception handlers are helper code, and unlike your "VM" idea where the virtual machine is all helper code).
I'm not changing the subject; I'm pointing to the places where you can find the helpers. And your way of defending your idea by merging everything into the kernel doesn't look good, because I can counter by merging the kernel and the VM.
Brendan wrote:if the OS detects that a user's mouse is due for cleaning it gets added as a low priority job in the "maintenance tool", and when admin/maintenance staff do that job they tell the OS it's been completed.
If an OS is able to detect automatically that a mouse needs cleaning, then it should also detect the change after the mouse has been cleaned. And by the way, such detection isn't a trivial thing.
My previous account (embryo) was accidentally deleted, so I have no choice but to use something new. But maybe it was a good lesson about software reliability :)