
Opinion: InterProcess Object vs. Interpreted Code

Posted: Thu Oct 05, 2017 8:08 am
by SukantPal
Newer projects like Singularity, and some like JavaOS, run programs that are interpreted. This means there is a performance overhead for all user-mode software and services on the system, because they do not run directly on the machine. In exchange, it improves other things:
1. Programs can share one address space, leading to better TLB usage and fewer context switches.
2. IPC is faster because everything is in one address space.

This leads to the following questions:
1. If a program is immensely large then (on 32-bit systems) it would need a separate address space, so some context switches would occur, adding complexity for the software?
2. Also, the advantage of being interpreted software is gone. Am I right?

InterProcess Objects -
Kernel memory is shared among all address spaces. Objects in it can be used for IPC, and here I will refer to them as IPOs. If an IPO can only be manipulated in kernel mode, using an interpreted script/code, then that should be faster than making all of user mode interpreted, right? Also, security between processes would be implemented in hardware only.

When an object changes, a signal handler can be used to notify the process of the change in the object's state.

Example -
Let's say a system service implements graphical widgets in a kernel IPO. There is a list of widgets: (one) window, multiple text boxes, buttons, images, text, etc. We want to change the RGB color value of a button. So a client process (say a calculator) will use an IPO script loaded by the system service to manipulate the objects. It can pass an RGB color parameter to a function in the script and manipulate the object directly, with no need for message passing. Also, the system service can regularly update the screen's framebuffer, so there is no need to notify it of the change, because it will see the change automatically when it next updates the screen, right?

Security for IPO can be implemented by listing a set of functions that have authority to change the object in the kernel. The system-service would declare the IPO and only it would have the right to set its security parameters - which scripts/functions/code could change the object.

Is my model correct/usable for implementing in existing software and in my OS/kernel?

Re: Opinion: InterProcess Object vs. Interpreted Code

Posted: Thu Oct 05, 2017 7:29 pm
by Brendan
Hi,

Some random notes to start with..

Singularity used a managed language that was compiled to native code (it wasn't interpreted). This still causes a performance loss (e.g. run-time checks where the compiler couldn't avoid them, inability to use extremely optimised assembly, inability to use self-modifying and/or run-time generated code, etc).

The supposed performance advantages all come from retaining TLBs and nowhere else. The TLB usage advantages of "single address space" are greatly exaggerated (and the exaggeration backed up by very biased micro-benchmarks). For newer CPUs (that support address space IDs, where TLB entries can be tagged and not flushed when the virtual address space changes); there should be no advantage to "single address space" at all. For older CPUs (that don't support address space IDs); the TLB uses a "(pseudo) least recently used" eviction policy that typically causes one process' TLB entries to be evicted while other processes run, so that when you switch back to that process none of its TLB entries are left so you still get all of the TLB misses that "single address space" was supposed to avoid. It's this last case where very biased/misleading micro-benchmarks are often used (mostly involving rapidly switching between processes while doing almost nothing, so that the TLB entries don't become evicted like they would for normal software, and benefits that won't exist in practice are measured).
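The eviction argument can be illustrated with a toy model. This is only a sketch under loud assumptions: a fully associative TLB with true LRU (real TLBs are set-associative and pseudo-LRU), made-up sizes, and hypothetical names such as `LruTlb` and `misses_after_switch_back`. It is not a measurement of any real CPU.

```python
from collections import OrderedDict

class LruTlb:
    """Toy fully associative TLB with true LRU eviction (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (process, page) keys, oldest first

    def touch(self, process, page):
        """Access a page; return True on a TLB hit, False on a miss."""
        key = (process, page)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[key] = None
        return False

def misses_after_switch_back(tlb_size, pages_a, pages_b):
    """Run A, then B, then A again; count A's misses on its second run."""
    tlb = LruTlb(tlb_size)
    for page in range(pages_a):
        tlb.touch("A", page)
    for page in range(pages_b):
        tlb.touch("B", page)
    return sum(not tlb.touch("A", page) for page in range(pages_a))

# Micro-benchmark style: B does almost nothing, so A's entries survive.
print(misses_after_switch_back(tlb_size=64, pages_a=16, pages_b=4))   # 0
# B touches enough pages to push every one of A's entries out.
print(misses_after_switch_back(tlb_size=64, pages_a=16, pages_b=64))  # 16
```

In the first call the retained entries look like a big win; in the second, process A pays for every miss anyway, which is the situation the post describes for normal software.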

The other performance problem (with any form of TLB retention - both address space IDs and "single address space") is multi-CPU TLB shootdown. About half of the time (assuming "lazy TLB invalidation" is used where possible), when one CPU modifies a paging structure it has to inform other CPUs that any old TLB entries need to be invalidated, and this ends up being quite expensive. Without any form of TLB retention this can be avoided under various circumstances (e.g. single-threaded process, where you know no other CPU can have any stale TLB entries), and with any form of TLB retention this can't be avoided. What this means is that (especially for systems with lots of CPUs, and especially when running many single-threaded processes or when processes are confined to a single NUMA domain) the increased multi-CPU TLB shootdown overhead can be greater than the (partially illusionary) benefits of retaining TLBs, resulting in TLB retention schemes causing worse performance than a traditional "one address space per process" approach.

However performance is not the only issue. For security it's a disaster because it relies on all hardware being 100% perfect and all management code (e.g. the compiler or interpreter and its native run-time code) being 100% perfect. In reality neither hardware nor management code has ever been 100% perfect (there's always errata, glitches, compiler bugs, etc), so the security relies on a fantasy. To improve security you really want multiple levels, such that if one level of security fails (e.g. the compiler) there's still other levels of security (e.g. hardware protection mechanisms). The other issue is hardware failures and fault isolation. For example, with managed code there's no protection against sporadic RAM errors (and research done by Google suggests a rate of one RAM error per GiB per month; which is about 1 RAM error per day for a computer with 32 GiB of RAM).
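The "about 1 RAM error per day" figure follows directly from the quoted rate. A quick check, assuming a 30-day month (the rate and the 32 GiB are taken from the post; the month length is an assumption for a rough estimate):

```python
# Figures quoted in the post: one RAM error per GiB per month, 32 GiB of RAM.
ERRORS_PER_GIB_PER_MONTH = 1
RAM_GIB = 32
DAYS_PER_MONTH = 30  # assumption: an average month

errors_per_month = ERRORS_PER_GIB_PER_MONTH * RAM_GIB
errors_per_day = errors_per_month / DAYS_PER_MONTH

print(errors_per_month)          # 32
print(round(errors_per_day, 2))  # 1.07
```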
SukantPal wrote:1. If a program is immensely large then (on 32-bit systems) it would need a separate address space, so some context switches would occur, adding complexity for the software?
2. Also, the advantage of being interpreted software is gone. Am I right?
If a program is immensely large (e.g. large "in memory" database and/or memory mapped files) then for typical 64-bit systems (where you only have 48-bits of virtual address space) it would need a separate address space. For old 32-bit systems it's significantly worse, but I doubt there's many good reasons to support 32-bit CPUs for a single address space OS. If a process needs a separate address space then it destroys the (lack of) advantages when switching between that process and other processes, but other processes that still share the same address space would still get the (lack of) advantages.

Note: Recently Intel released a "5-level paging" extension (which increases virtual address size to 57 bits in long mode) because 48-bit virtual addressing was becoming too limiting (for a single process on a traditional OS) for some people (mostly large servers).
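For a sense of scale, the address space sizes mentioned above work out as follows. This is plain arithmetic; the helper name `virtual_address_space_tib` is just for illustration.

```python
def virtual_address_space_tib(bits):
    """Size in TiB of a virtual address space with the given address width."""
    return 2 ** bits / 2 ** 40

print(virtual_address_space_tib(32))  # 0.00390625 (4 GiB)
print(virtual_address_space_tib(48))  # 256.0
print(virtual_address_space_tib(57))  # 131072.0 (128 PiB)
```

In a single address space OS every process has to fit inside that one range together, which is why a single large process can already be a problem at 48 bits.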
SukantPal wrote:InterProcess Objects -
Kernel memory is shared among all address spaces. Objects in it can be used for IPC, and here I will refer to them as IPOs. If an IPO can only be manipulated in kernel mode, using an interpreted script/code, then that should be faster than making all of user mode interpreted, right? Also, security between processes would be implemented in hardware only.
This depends on too many things, how IPC works, how code is interpreted, etc. For example, there's a massive difference between "synchronous IPC" (task switch required every time any message is sent/received) and "asynchronous IPC" (task switch not required when messages are sent/received); and there's a massive difference between "user-space uses high-end JIT to interpret" and "kernel uses crude/simple interpretation".

Note that as far as I'm concerned "ideal" is asynchronous IPC where sender and receiver are running in parallel on different CPUs and using shared memory buffers/queues (so messages can be sent/received without the kernel being involved at all); and where there's no task switches for any reason (not for IPC and not for scheduling - e.g. more CPUs than running threads) and CPU's caches (and TLBs) aren't shared by multiple tasks. Of course this set of conditions wouldn't happen too often, and wouldn't last when they do happen (would only exist temporarily due to flow control, scheduling, etc).
SukantPal wrote:When an object changes, a signal handler can be used to notify the process of the change in the object's state.
A signal would be more expensive than handling the message in user-space - same privilege level switching and/or task switching, with additional difficulty for CPU's speculative/out-of-order pipelines due to breaking normal (more predictable) control flow.
SukantPal wrote:Example -
Let's say a system service implements graphical widgets in a kernel IPO. There is a list of widgets: (one) window, multiple text boxes, buttons, images, text, etc. We want to change the RGB color value of a button. So a client process (say a calculator) will use an IPO script loaded by the system service to manipulate the objects. It can pass an RGB color parameter to a function in the script and manipulate the object directly, with no need for message passing. Also, the system service can regularly update the screen's framebuffer, so there is no need to notify it of the change, because it will see the change automatically when it next updates the screen, right?
In this case the kernel will spend ages just figuring out which script (for which object) the caller wanted the kernel to interpret before anything gets done; and that's ignoring all the "script management" (creating/checking new scripts, removing/deleting old scripts, creating/checking new objects, removing/deleting old objects, etc).

I'd rather do "bundling". E.g. rather than just sending a tiny "change the RGB colour of this one little thing" request I'd construct a list of any number of requests ("change the RGB colour of this, and that, and move this over there, and delete that, and...") and send a list/bundle of requests as a single message. Essentially; instead of "the overhead of one message per request" you get "a fraction of the overhead of one message per request".
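The amortisation can be put into a rough cost model. The sketch below is only illustrative: the cost units and the two cost parameters are made-up inputs, and `total_cost` is a hypothetical helper, not anything from a real kernel.

```python
import math

def total_cost(requests, bundle_size, per_message_overhead, per_request_work):
    """Cost of sending `requests` requests in bundles of `bundle_size`.

    Units are arbitrary; both cost parameters are hypothetical inputs.
    """
    messages = math.ceil(requests / bundle_size)
    return messages * per_message_overhead + requests * per_request_work

# 100 widget updates, assuming 50 units of overhead per message, 1 per request.
print(total_cost(100, bundle_size=1, per_message_overhead=50, per_request_work=1))   # 5100
print(total_cost(100, bundle_size=25, per_message_overhead=50, per_request_work=1))  # 300
```

The work per request is unchanged; only the fixed per-message overhead is shared out across the bundle.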
SukantPal wrote:Security for IPO can be implemented by listing a set of functions that have authority to change the object in the kernel. The system-service would declare the IPO and only it would have the right to set its security parameters - which scripts/functions/code could change the object.
Security for this would be a fantasy (the same as the security of single address space - relying on 100% perfect hardware and compiler/interpreter software, which have never existed and probably will never exist) except that the consequences would be significantly worse because it's in kernel space.

Note: it's relatively easy to write a simple interpreter (e.g. using a "fetch_next_instruction(); switch(instruction) { case ....}" type of thing) but the performance is extremely bad; and if you improve performance beyond "disgusting" (e.g. use a full high-performance JIT approach) the complexity becomes many orders of magnitude higher and the chance of bugs/problems (and security holes) becomes many orders of magnitude higher too.
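For reference, the "fetch, then switch on the instruction" style of interpreter looks roughly like this. It is a minimal sketch with an invented three-opcode stack machine (`push`, `add`, `mul`), not the instruction set of any real system.

```python
def run(program):
    """Interpret a list of (opcode, operand) pairs on a small stack machine."""
    stack = []
    pc = 0
    while pc < len(program):
        opcode, operand = program[pc]  # fetch the next instruction
        pc += 1
        if opcode == "push":           # dispatch: one branch per opcode
            stack.append(operand)
        elif opcode == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif opcode == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack

# (2 + 3) * 4
print(run([("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]))  # [20]
```

Every instruction pays for the fetch and the dispatch chain on top of the actual work, which is where the poor performance comes from.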


Cheers,

Brendan

Re: Opinion: InterProcess Object vs. Interpreted Code

Posted: Thu Oct 05, 2017 8:58 pm
by SukantPal
Hi Brendan,

In your ideal situation - shared memory with message queues/buffers - there is also a security flaw, right? If two processes - a system service and a client process - are communicating, there is a chance of corrupting the message queue used by the service to collect messages from all clients. Okay, you can say that there will be a separate message queue for each client - but then 4 or 8 KB are wasted on a simple message queue for each client.

Even in asynchronous message-passing, you must have two user-to-kernel and kernel-to-user switches. Same in the IPO, right.

I wanted to say that if an IPO can be used in kernel space, then that would be more secure because the kernel will do permission checks, out-of-bounds checks, and so on. Interpreted code should be fast enough, because it will be in minute/little quantity. Changing an object's state is just a few lines of code and a few instructions on the CPU. Adding a few more due to interpretation shouldn't affect performance much.

Now, for multiple processes, say 15 processes: to inform all of them, you would have to send 15 messages/signals/other IPC to them. With an IPO, wouldn't you be far better off than all that?

Since you are stating that asynchronous IPC is far better than synchronous IPC: kernel IPOs will allow asynchronous IPC. Whenever the system service checks the object state, it will see that a change has occurred.

Bundling - With an interpreted script in the kernel, you can bundle several requests together. As the caller, you can make a list of functions to invoke and send that to the kernel to run on your behalf. Like

Code:

$Script Comment
change_rgb(U32);
insert_button(BUTTON_HANDLE);
This would make the kernel invoke both functions, satisfying the bundling feature.

Question -
Wouldn't an IPO help in using the RCU (Read-Copy-Update) technique? System services can update the object in user space and then write the object to kernel space in one go.

Re: Opinion: InterProcess Object vs. Interpreted Code

Posted: Fri Oct 06, 2017 2:12 am
by linguofreak
SukantPal wrote:
Even in asynchronous message-passing, you must have two user-to-kernel and kernel-to-user switches.
Not quite true. If the two processes are running on different CPUs, it is possible, if each is doing other work while waiting for the other to respond, for them to communicate without either invoking the kernel or being task-switched out.

Re: Opinion: InterProcess Object vs. Interpreted Code

Posted: Fri Oct 06, 2017 2:59 am
by Brendan
Hi,
SukantPal wrote:In your ideal situation - shared memory with message queues/buffers - there is also a security flaw, right? If two processes - a system service and a client process - are communicating, there is a chance of corrupting the message queue used by the service to collect messages from all clients. Okay, you can say that there will be a separate message queue for each client - but then 4 or 8 KB are wasted on a simple message queue for each client.
It'd be a shared buffer for each "sender and receiver pair" (e.g. with 100 clients talking to the same server, you'd have 100 shared buffers), and 2 processes that communicate like this would have to trust each other (e.g. sender can mess up queue for receiver) and there'd be some sort of security check when the connection is established; but there's no security problem for other processes or the kernel (neither sender nor receiver can interfere with anything that isn't using that shared buffer), and the processes need to trust each other a little anyway (e.g. that sensitive information stays between them and isn't leaked, that requests and replies meet some predetermined protocol dictating expected behaviour, etc).
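A per-pair buffer of this kind is usually a single-producer, single-consumer ring. The sketch below only models the indexing logic with ordinary Python objects; the class name `SpscRing` and the slot count are assumptions, and a real implementation would place the slots and indices in memory mapped into both processes and would need the appropriate memory barriers.

```python
class SpscRing:
    """Single-producer, single-consumer ring buffer (logic only).

    In a real system the slots and both indices would live in memory shared
    by exactly one sender and one receiver.
    """

    def __init__(self, slots):
        self.buffer = [None] * slots
        self.head = 0  # next slot to read; written only by the receiver
        self.tail = 0  # next slot to write; written only by the sender

    def send(self, message):
        """Sender side: return False if the queue is full (flow control)."""
        next_tail = (self.tail + 1) % len(self.buffer)
        if next_tail == self.head:
            return False
        self.buffer[self.tail] = message
        self.tail = next_tail
        return True

    def receive(self):
        """Receiver side: return None if the queue is empty."""
        if self.head == self.tail:
            return None
        message = self.buffer[self.head]
        self.head = (self.head + 1) % len(self.buffer)
        return message

# One ring per sender/receiver pair, e.g. 100 clients means 100 rings.
rings = {client_id: SpscRing(slots=8) for client_id in range(100)}
rings[7].send("set button colour")
print(rings[7].receive())  # set button colour
print(rings[8].receive())  # None
```

A misbehaving client can only damage its own ring; the rings of the other 99 clients are untouched, which matches the isolation argument above.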
SukantPal wrote:Even in asynchronous message-passing, you must have two user-to-kernel and kernel-to-user switches. Same in the IPO, right.
You seem/seemed to be assuming "1 request = 1 message = 1 kernel call = 1 task switch" where all of these things can be decoupled.

The "1 request = 1 message" part is not necessarily true (e.g. send a list of requests/replies in a single message if all requests/replies go to the same receiver); the "1 message = 1 task switch" part is not necessarily true (e.g. asynchronous messaging); and the "1 message = 1 kernel call" part is not necessarily true either (e.g. the shared memory scheme I described).

Note: for my OS I don't use the shared memory scheme I described. Instead each thread has 2 message buffers (at fixed virtual addresses) and there's "batch kernel API" support; so that with a single kernel call a thread can ask kernel to send up to 2 messages and receive up to 2 messages (plus anything else that doesn't involve the message buffers). I could pack 100 requests into each message and send 200 replies (as 2 messages) and allocate some RAM and spawn 12 threads and receive 200 requests (in 2 messages), all with a single kernel API call that doesn't cause a task switch at all.
SukantPal wrote:I wanted to say that if an IPO can be used in kernel space, then that would be more secure because the kernel will do permission checks, out-of-bounds checks, and so on.
No, it won't be more secure (unless you're in a fantasy world where all hardware and all of the kernel's software is proven to be 100% correct (by something far more rigorous than "formal mathematical proofs that dodgy/false original assumptions are upheld")).

There are 2 major principles for the design of secure systems. They are:
  • The principle of least privilege - restrict access to the minimum necessary to do what needs to be done
  • "Containerisation" - split things into pieces and isolate them from each other, to maximise the effectiveness of the principle of least privilege and minimise the damage a security breach can cause (e.g. split "web browser" into "javascript" and "HTTP" so that javascript doesn't need to be given access to TCP/IP just because some other part of the web browser needed it).
For both of these, you seem to want to do everything possible to ruin security (break the principle of least privilege by getting kernel to do things that don't need the kernel's privileges, and weaken the isolation between pieces/processes).
SukantPal wrote:Interpreted code should be fast enough, because it will be in minute/little quantity. Changing an object's state is just a few lines of code and a few instructions on the CPU. Adding a few more due to interpretation shouldn't affect performance much.
If the performance was irrelevant because it's in minute/little quantity, there'd be no point considering it in the first place. The reason you're considering it is because you're looking for ways to improve performance; and it either won't help (because it doesn't happen often enough to matter) or it won't help (because it happens often enough to be a performance disaster).
SukantPal wrote:Since you are stating that asynchronous IPC is far better than synchronous IPC: kernel IPOs will allow asynchronous IPC. Whenever the system service checks the object state, it will see that a change has occurred.
Imagine you have a "prime numbers" service. You send a request ("is the number 12345678 a prime number?"). Does your thread block until the reply arrives (synchronous) or can your thread do other things while waiting for the reply to arrive (asynchronous)? For your kernel IPOs I'd assume the former (synchronous - the thread waits for ages while the kernel runs a script that determines if it was a prime number).
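The difference can be sketched with a worker thread standing in for the "prime numbers" service. This is only an analogy: `is_prime` is a naive trial-division stand-in, and Python's `concurrent.futures` plays the role of the IPC mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    """Trial division; stands in for a slow 'prime numbers' service."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

with ThreadPoolExecutor(max_workers=1) as service:
    # Synchronous: the caller blocks until the reply is available.
    sync_reply = service.submit(is_prime, 12345678).result()

    # Asynchronous: send the request, do other work, collect the reply later.
    pending = service.submit(is_prime, 12345678)
    other_work = sum(range(1000))  # the caller stays busy in the meantime
    async_reply = pending.result()

print(sync_reply, async_reply, other_work)  # False False 499500
```

A script run by the kernel on the caller's behalf behaves like the first case: the calling thread cannot do anything else until the script has finished.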
SukantPal wrote:Bundling - With an interpreted script in the kernel, you can bundle several requests together. As the caller, you can make a list of functions to invoke and send that to the kernel to run on your behalf. Like

Code:

$Script Comment
change_rgb(U32);
insert_button(BUTTON_HANDLE);
This would make the kernel invoke both functions, satisfying the bundling feature.
Yes, but now kernel scripts have to handle all requests that could be put into a bundle regardless of how complex or time consuming the requests are.
SukantPal wrote:Question -
Wouldn't an IPO help in using the RCU (Read-Copy-Update) technique? System services can update the object in user space and then write the object to kernel space in one go.
RCU is used to implement atomic updates, to ensure a set of data is always in a self-consistent state. Your IPO would be a major disaster for this because you've doubled the number of places that need to be "atomically" updated (you can't update the copy in user-space and update the copy in kernel-space at the same time, so it's possible for something to see an inconsistent state - e.g. some data from user-space that has been updated and some data from kernel-space that hasn't been updated yet).
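The core of the read-copy-update pattern is that there is exactly one published reference to swap. A minimal sketch, assuming a single writer and relying on Python's garbage collector in place of RCU's grace periods; `RcuCell` and the widget fields are invented for illustration:

```python
import copy

class RcuCell:
    """Holds one published snapshot.

    Readers take the current reference; a writer copies the object, modifies
    the private copy, then publishes it with a single reference swap.
    """

    def __init__(self, value):
        self._current = value

    def read(self):
        return self._current  # always a complete old or complete new snapshot

    def update(self, mutate):
        new_value = copy.deepcopy(self._current)  # read + copy
        mutate(new_value)                         # update the private copy
        self._current = new_value                 # publish in one step

button = RcuCell({"rgb": (0, 0, 0), "label": "="})
old_snapshot = button.read()

def recolour(widget):
    widget["rgb"] = (255, 0, 0)
    widget["label"] = "OK"

button.update(recolour)
print(old_snapshot)   # {'rgb': (0, 0, 0), 'label': '='}
print(button.read())  # {'rgb': (255, 0, 0), 'label': 'OK'}
```

With two copies to keep in step (one in user space, one in kernel space) there is no single swap that publishes both, which is the inconsistency described above.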


Cheers,

Brendan

Re: Opinion: InterProcess Object vs. Interpreted Code

Posted: Fri Oct 06, 2017 10:36 am
by Korona
Kernel based "scripts" (interpreted or JIT compiled code snippets) can work in some situations. As an example, look at Linux' BPF infrastructure. However, I doubt that it can make a good IPC system.

I plan to add such a mechanism to my microkernel's interrupt handling so that it becomes possible to send EOI synchronously, which will improve the performance for shared IRQ lines a lot. However I won't write a complex JIT as I don't want to put the large attack surface of an optimizing compiler into supervisor mode. Instead I will aim for a single-pass, non-optimizing compiler for something like WebAssembly that does not have garbage collection and is designed to be safely embedded into larger frameworks.