Hi,
Some random notes to start with..
Singularity used a managed language that was compiled to native code (it wasn't interpreted). This still causes a performance loss (e.g. run-time checks where the compiler couldn't avoid them, inability to use extremely optimised assembly, inability to use self-modifying and/or run-time generated code, etc).
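As a rough illustration of the "run-time checks" part (hand-written C, not actual Singularity compiler output - the function and its signature are invented): when the index comes from outside the compiler's view (user input, a message, etc) the compiler can't prove it's in range, so every access pays for a check:
[code]
/* Illustration only (invented example, not actual compiler output):
 * when the index can't be proven in range at compile time, the managed
 * language's bounds check has to be done at run time on every access. */
#include <stdint.h>
#include <stddef.h>

int32_t checked_read(const int32_t *array, size_t length, size_t index)
{
    if (index >= length) {      /* the run-time check the compiler can't remove */
        return -1;              /* a managed runtime would throw an exception here */
    }
    return array[index];        /* the access the caller actually wanted */
}
[/code]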
The supposed performance advantages all come from retaining TLB entries and nowhere else. The TLB usage advantages of "single address space" are greatly exaggerated (and the exaggeration is backed up by very biased micro-benchmarks). For newer CPUs (that support address space IDs, where TLB entries can be tagged and not flushed when the virtual address space changes); there should be no advantage to "single address space" at all. For older CPUs (that don't support address space IDs); the TLB uses a "(pseudo) least recently used" eviction policy that typically causes one process' TLB entries to be evicted while other processes run, so that when you switch back to that process none of its TLB entries are left and you still get all of the TLB misses that "single address space" was supposed to avoid. It's this last case where very biased/misleading micro-benchmarks are often used (mostly involving rapidly switching between processes while doing almost nothing, so that the TLB entries don't get evicted like they would for normal software, and benefits that won't exist in practice get measured).
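To make the "address space IDs" part concrete for x86: with PCIDs, a context switch can change the virtual address space without flushing the TLB. A minimal sketch (my own illustration; it assumes CR4.PCIDE has already been enabled and that PCIDs are allocated/recycled elsewhere):
[code]
/* Minimal sketch of an address-space switch that retains TLB entries on
 * x86-64 with PCIDs (illustration only; assumes CR4.PCIDE is enabled and
 * PCIDs are managed elsewhere). */
#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)    /* bit 63 set = don't invalidate TLB entries */

static inline void load_cr3(uint64_t value)
{
    __asm__ volatile("mov %0, %%cr3" :: "r"(value) : "memory");
}

void switch_address_space(uint64_t pml4_phys, uint16_t pcid)
{
    /* The PCID lives in bits 0..11 of CR3; with bit 63 set the CPU keeps
     * TLB entries tagged with other PCIDs instead of flushing them. */
    load_cr3(pml4_phys | (pcid & 0xFFFULL) | CR3_NOFLUSH);
}
[/code]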
The other performance problem (with any form of TLB retention - both address space IDs and "single address space") is multi-CPU TLB shootdown. About half of the time (assuming "lazy TLB invalidation" is used where possible), when one CPU modifies a paging structure it has to inform other CPUs that any old TLB entries need to be invalidated, and this ends up being quite expensive. Without any form of TLB retention this can be avoided under various circumstances (e.g. a single-threaded process, where you know no other CPU can have any stale TLB entries), but with any form of TLB retention it can't be avoided. What this means is that (especially for systems with lots of CPUs, and especially when running many single-threaded processes or when processes are confined to a single NUMA domain) the increased multi-CPU TLB shootdown overhead can be greater than the (partially illusory) benefits of retaining TLBs, resulting in TLB retention schemes causing worse performance than a traditional "one address space per process" approach.
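As a rough sketch of the "lazy TLB invalidation" point (hypothetical kernel code - the structure fields and helper functions are invented for illustration):
[code]
/* Hypothetical sketch: deciding whether other CPUs need a TLB shootdown
 * IPI after a page table change.  All names below are invented. */
#include <stdint.h>
#include <stdbool.h>

struct process {
    int      thread_count;    /* number of threads in the process */
    uint64_t cpu_mask;        /* CPUs that may have entries for this process */
};

void invlpg_local(void *virt_addr);                       /* invalidate one entry on this CPU */
bool tlb_retention_used(void);                            /* ASIDs or "single address space"? */
void send_shootdown_ipi(uint64_t cpu_mask, void *addr);   /* IPI other CPUs and wait for acks */

void invalidate_page(struct process *proc, void *virt_addr)
{
    invlpg_local(virt_addr);    /* the current CPU always invalidates its own entry */

    /* Without TLB retention, a single-threaded process running on this CPU
     * can't have stale entries anywhere else, so the IPIs can be skipped. */
    if (!tlb_retention_used() && proc->thread_count == 1) {
        return;
    }

    /* With ASIDs or "single address space", stale entries may survive on
     * any CPU, so every CPU in the mask has to be interrupted. */
    send_shootdown_ipi(proc->cpu_mask, virt_addr);
}
[/code]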
However performance is not the only issue. For security it's a disaster because it relies on all hardware being 100% perfect and all management code (e.g. the compiler or interpreter and its native run-time code) being 100% perfect. In reality neither hardware nor management code has ever been 100% perfect (there's always errata, glitches, compiler bugs, etc), so the security relies on a fantasy. To improve security you really want multiple levels, such that if one level of security fails (e.g. the compiler) there's still other levels of security (e.g. hardware protection mechanisms). The other issue is hardware failures and fault isolation. For example, with managed code there's no protection against sporadic RAM errors (and research done by Google suggests a rate of one RAM error per GiB per month, which works out to about one RAM error per day for a computer with 32 GiB of RAM).
SukantPal wrote:
1. If a program is immensely large then (for 32-bit systems), it would need a separate address space, then some context switches would occur leading to complexity for the software?
2. Also, the advantage of being a interpreted software is gone, Am I right?
If a program is immensely large (e.g. a large "in memory" database and/or memory mapped files) then for typical 64-bit systems (where you only have 48 bits of virtual address space) it would need a separate address space. For old 32-bit systems it's significantly worse, but I doubt there are many good reasons to support 32-bit CPUs for a single address space OS. If a process needs a separate address space then it destroys the (lack of) advantages when switching between that process and other processes, but other processes that still share the same address space would still get the (lack of) advantages.
Note: Recently Intel released a "5-level paging" extension (which increases virtual address size to 57 bits in long mode) because 48-bit virtual addressing was becoming too limiting (for a single process on a traditional OS) for some people (mostly large servers).
SukantPal wrote:
InterProcess Objects -
Kernel-memory is shared among all address spaces. Object can be used for IPC and here I will refer to them as IPO. If a IPO can be manipulated in kernel-mode only using a interpreted script/code, then that should be faster than making the user-mode interpreted right. Also, security b/w processes would be implemented in hardware only.
This depends on too many things - how IPC works, how code is interpreted, etc. For example, there's a massive difference between "synchronous IPC" (a task switch required every time any message is sent/received) and "asynchronous IPC" (no task switch required when messages are sent/received); and there's a massive difference between "user-space uses a high-end JIT to interpret" and "kernel uses crude/simple interpretation".
Note that as far as I'm concerned "ideal" is asynchronous IPC where the sender and receiver are running in parallel on different CPUs and using shared memory buffers/queues (so messages can be sent/received without the kernel being involved at all); and where there are no task switches for any reason (not for IPC and not for scheduling - e.g. more CPUs than running threads) and the CPUs' caches (and TLBs) aren't shared by multiple tasks. Of course this set of conditions wouldn't occur too often, and wouldn't last when it does occur (it would only exist temporarily due to flow control, scheduling, etc).
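To show what I mean by "shared memory buffers/queues", here's a very small single-producer/single-consumer queue sketch in C11 (my own illustration, not a complete design - fixed-size messages, no flow control, no blocking/wakeup):
[code]
/* Minimal single-producer/single-consumer queue in shared memory (C11
 * atomics).  The point is only that send/receive involve no kernel calls
 * and no task switches while both sides happen to be running. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 256                      /* must be a power of two */

struct message { uint8_t data[64]; };

struct queue {
    _Atomic uint32_t head;             /* only written by the receiver */
    _Atomic uint32_t tail;             /* only written by the sender */
    struct message   slot[SLOTS];
};

bool queue_send(struct queue *q, const struct message *msg)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (tail - head == SLOTS)
        return false;                  /* full - flow control is the caller's problem */
    q->slot[tail % SLOTS] = *msg;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

bool queue_receive(struct queue *q, struct message *msg)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail)
        return false;                  /* empty */
    *msg = q->slot[head % SLOTS];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
[/code]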
SukantPal wrote:
When a object changes a signal-handler can be implemented, so that the process will be notified of a change-in-state of the object.
A signal would be more expensive than handling the message in user-space - the same privilege level switching and/or task switching, plus additional difficulty for the CPU's speculative/out-of-order pipelines due to breaking the normal (more predictable) control flow.
SukantPal wrote:
Example -
Let's say a system-service implements the graphical-widgets in a kernel IPO. There is a list of widgets like (one) window, multiple text-boxes, buttons, images, text, etc. We want to change the RGB color value of a button. So, client-process (say calculator) will use a IPO-script loaded by the system-service to manipulate the objects. It can push a RGB-color paramater to a function in the script and directly manipulate the object and no need for message-passing. Also, the system-service can regularly update the screen's framebuffer, so no need for notifying it for the change, because on updating the screen it will get know automatically, right.
In this case the kernel will spend 6 days just figuring out which script (for which object) the caller wanted the kernel to interpret before anything gets done; and that's ignoring all of the "script management" (creating/checking new scripts, removing/deleting old scripts, creating/checking new objects, removing/deleting old objects, etc).
I'd rather do "bundling". E.g. rather than just sending a tiny "change the RGB colour of this one little thing" request, I'd construct a list of any number of requests ("change the RGB colour of this, and that, and move this over there, and delete that, and ...") and send the list/bundle of requests as a single message. Essentially; instead of "the overhead of one message per request" you get "a fraction of the overhead of one message per request".
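A rough sketch of what "bundling" could look like (names, types and the IPC primitive are all invented for illustration):
[code]
/* Illustration of "bundling": many small GUI requests packed into one
 * message, so the per-message overhead is paid once per bundle instead of
 * once per request.  All names/types here are invented. */
#include <stdint.h>
#include <stddef.h>

enum request_type { REQ_SET_COLOUR, REQ_MOVE, REQ_DELETE };

struct request {
    uint32_t type;            /* enum request_type */
    uint32_t widget_id;       /* which widget the request applies to */
    uint32_t args[2];         /* e.g. RGB value, or new x/y position */
};

struct bundle {
    uint32_t       count;
    struct request req[64];
};

/* Hypothetical IPC primitive - one kernel call / one message per bundle. */
int send_message(int port, const void *data, size_t size);

void flush_bundle(int gui_port, struct bundle *b)
{
    if (b->count == 0)
        return;
    send_message(gui_port, b,
                 offsetof(struct bundle, req) + b->count * sizeof(struct request));
    b->count = 0;
}

void add_request(int gui_port, struct bundle *b, struct request r)
{
    if (b->count == 64)
        flush_bundle(gui_port, b);     /* bundle full - send it now */
    b->req[b->count++] = r;
}
[/code]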
SukantPal wrote:
Security for IPO can be implemented by listing a set of functions that have authority to change the object in the kernel. The system-service would declare the IPO and only it would have the right to set its security parameters - which scripts/functions/code could change the object.
Security for this would be a fantasy (the same as the security of a single address space - relying on 100% perfect hardware and compiler/interpreter software, which have never existed and probably never will exist), except that the consequences would be significantly worse because it's in kernel space.
Note: it's relatively easy to write a simple interpreter (e.g. using a "fetch_next_instruction(); switch(instruction) { case ....}" type of thing) but the performance is extremely bad; and if you improve performance beyond "disgusting" (e.g. use a full high-performance JIT approach) the complexity becomes many orders of magnitude higher and the chance of bugs/problems (and security holes) becomes many orders of magnitude higher too.
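For reference, the "fetch_next_instruction(); switch(instruction)" style of interpreter I mean looks something like this (a trivial stack machine with invented opcodes):
[code]
/* Deliberately simple switch-based interpreter (trivial stack machine,
 * opcodes invented for illustration).  Easy to write, but every virtual
 * instruction costs a fetch, the dispatch switch and a hard to predict
 * indirect branch - closing that gap means a JIT and a lot more complexity. */
#include <stdint.h>
#include <stddef.h>

enum opcode { OP_PUSH = 0, OP_ADD, OP_MUL, OP_HALT };

int64_t interpret(const uint8_t *code)
{
    int64_t stack[64];
    int     sp = 0;     /* index of the next free stack slot */
    size_t  ip = 0;     /* virtual instruction pointer */

    for (;;) {
        uint8_t instruction = code[ip++];          /* fetch_next_instruction() */
        switch (instruction) {
        case OP_PUSH:
            stack[sp++] = (int8_t)code[ip++];      /* push a small immediate */
            break;
        case OP_ADD:
            sp--; stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--; stack[sp - 1] *= stack[sp];
            break;
        default:                                   /* OP_HALT (or anything unknown) */
            return stack[sp - 1];                  /* result is on top of the stack */
        }
    }
}
[/code]
E.g. interpreting {OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT} returns 5, but pays for several loads and an unpredictable indirect branch per virtual instruction.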
Cheers,
Brendan