My flippant answer is that containerization makes it easier for the developers because even if they do something stupid that would either crash the OS, or lead to a runaway memory leak, they can rely on the
hypervisor to clean up the mess.
Mind you, even going back to the outset of virtualization technology in the late 1960s, it has often been treated as a way to fix problems with buggy process isolation, so there's a certain amount of truth to that joke.
So, a large part of the appeal of
system-level virtualization is the ability to more completely isolate programs which are known never to need to communicate with each other beyond the system's multiplexing of resources. That's one part of this, definitely.
This was also related to the main idea behind
Exokernel designs - if the programs don't need to communicate with each other, then you can virtualize the programs themselves, and leave it to them to tune their operations.
The difference is that programmers are lazy, including OS devs, and it is easier (and in some ways more flexible) for all involved to spin up a full copy of an OS inside the container, using an existing OS as the hypervisor, rather than come up with a new, stand-alone hypervisor (and a paravirtualized shared library model that everyone would have to learn, and then tweak on their own). Since the ultra-high performance and ultra-fine granularity of an exokernel is pretty much a case of YAGNI for your typical web service - most of the time, 90% of the performance hit is in network lag, anyway - the containerization advocates took the virtualization but ignored the rest.
Mind you, I could mention
Synthesis OS's synthetic machines (s-machines), in which individual programs would run paravirtualized and potentially have multiple isolated processes within the program, but I do enough fangirling over Massalin already.
But let's look at the history a bit more.
When IBM first came out with a
VM/370 in... what, 1972? - the main goal was to be able to run multiple OSes, allowing them to run the existing OS/360 and its descendant alongside Conversational Monitor System, meaning that they didn't need the sort of general-purpose timesharing system they so loathed in order to serve remote terminals. The fact that their attempt at a commercial timesharing system was an epic fail had nothing to do with this, of course.
IBM had always disliked the idea of dirty peasant users getting their grubby mitts on their beautiful *smirk*, elegant *giggle* batch-processing systems, which they were certain were The One True Way to Compute and always would be. However, the success of remote terminal serving systems such as SABRE (within their limited sphere of operation), and of timesharing systems such as ITS and Dartmouth TSS (within their limited user base), made it clear that they needed a response, so they came up with TSO, the
Time Sharing Option, which was an add-on bolted onto the side of the MVT variant of OS/360. Unfortunately, it sucked like a vacuum cleaner, because a) it was just a bag on the side of a system that wasn't designed for interactive use, b) IBM didn't really get the idea that users might not be very patient with delays during interactive use, and c) the first release was a prototype that got rushed to production for the specific purpose of convincing their customers not to bother with timesharing. People saw right through that last part, and in any case, most of the customers who actually needed timesharing (or thought they did) had jumped ship for more flexible and less costly systems from DEC, Honeywell, or GE even before it was released.
Even so, enough customers stuck it out that it failed to fail outright and became a
'Springtime for Hitler' situation for IBM, making it in some ways a preview of what would happen with the IBM PC.
Meanwhile, a group of researchers at their Cambridge, Massachusetts research center - the same place where TSO was developed - hit on the idea that one could run a simulated computer system on the very hardware being simulated: actually run it on said hardware, but monitor the simulated system and trap any operations that you don't want the simulation to perform on the real machine. This allowed them to create a virtual computer in which the simulation would run at full speed for most things, but could be prevented from doing dangerous or unwanted things, while still maintaining the illusion that it had complete control of a real computer. This is the idea that would eventually become virtualization as we now know it.
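If it helps to see the shape of that, here's a toy sketch in C of the trap-and-emulate control structure - emphatically not how CP was actually built (a real monitor runs the guest's unprivileged instructions directly on the CPU rather than interpreting them, which is the whole point), and every name and 'instruction' below is made up purely for the example:

```c
/* Toy illustration of the trap-and-emulate idea: let ordinary work run
 * untouched, intercept anything that would touch the real machine and
 * emulate it against virtual resources instead. */
#include <stdio.h>

enum op { OP_ADD, OP_LOAD_IMM, OP_START_IO, OP_HALT };  /* tiny fake ISA */

struct insn { enum op op; int a, b; };

struct guest {
    int regs[4];
    int running;
};

/* The 'monitor': called whenever the guest tries a privileged operation. */
static void trap_to_monitor(struct guest *g, const struct insn *i)
{
    switch (i->op) {
    case OP_START_IO:
        /* Emulate the I/O against a virtual device (here: just stdout). */
        printf("[monitor] guest asked for I/O: value %d\n", g->regs[i->a]);
        break;
    case OP_HALT:
        printf("[monitor] guest asked to halt; stopping the virtual machine\n");
        g->running = 0;
        break;
    default:
        break;
    }
}

static void run_guest(struct guest *g, const struct insn *prog, int len)
{
    for (int pc = 0; pc < len && g->running; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        /* Unprivileged instructions: in a real monitor these execute
         * directly on the hardware at full speed. */
        case OP_LOAD_IMM: g->regs[i->a] = i->b;            break;
        case OP_ADD:      g->regs[i->a] += g->regs[i->b];  break;
        /* Privileged instructions: trap to the monitor. */
        default:          trap_to_monitor(g, i);           break;
        }
    }
}

int main(void)
{
    struct insn prog[] = {
        { OP_LOAD_IMM, 0, 40 },
        { OP_LOAD_IMM, 1, 2 },
        { OP_ADD,      0, 1 },
        { OP_START_IO, 0, 0 },   /* would touch real hardware -> trapped */
        { OP_HALT,     0, 0 },
    };
    struct guest g = { {0}, 1 };
    run_guest(&g, prog, sizeof prog / sizeof prog[0]);
    return 0;
}
```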
Initially, they worked out a software-based system that ran on existing hardware, which was called 'Control Program/Cambridge Monitor System', or '
CP/CMS'. The 'monitor' acted as a single-tasking, single-user system which, from the perspective of the user, appeared to be running on dedicated hardware - not too different from using, say, a PDP-8, except that you didn't have (or need) access to the actual computer in order to use it.
That suited IBM's management, trainers, and field technical support right down to their socks, because it meant the system operators could spin up a CMS container as needed, and then forget about it, letting them live in their batch-processed Laputa and pretend timesharing users didn't exist most of the time.
It also appealed to their sales force - who were, after all, the ones who really called the shots at IBM - because they could claim to have a timesharing system without scaring away their institutional customers, whom they had spent half a decade convincing that timesharing was evil and batch processing was a gift from on high.
As research continued, they developed improvements which relied on hardware modifications made to the researchers' testbed mainframe. Many of the virtualization techniques still in use were developed at this time, but it would be almost two decades before microprocessors - which didn't even really exist at the time - would be able to implement them.
This worked out well enough that when IBM released the System/370 update of the System/360 line, they were ready to include that hardware virtualization support in some higher-end models. They created a dedicated hypervisor called
VM/370, and renamed the Cambridge Monitor System to
Conversational Monitor System, and the descendants of both are still a mainstay of IBM's mainframe, high-end server, and blade server systems.
OK, let's jump ahead a few years. We can skip the 80386 for now, since, while it was an impressive feat of engineering in many ways, it didn't really do anything new; in fact, it only provided a sliver of what the 1970s mainframes were doing, and wasn't even the first microprocessor to do so.
The important next step came about a decade later, though: the work at MIT on the exokernel idea. Basically, several OS researchers there looked at VM/CMS and the other client OSes VM ran and said, in effect, 'if all you are doing is running one program, why does the virtualized system need a general-purpose OS at all?' It was You Ain't Gonna Need It writ large, years before the term was even coined.
They decided to shuck all the general-purpose OS services in favor of a stripped-down hypervisor whose sole job was to multiplex access to the hardware, and have each program carry its own, highly tailored library of operations for interfacing with the hardware in a precise manner, with no wishy-washy stuff about abstraction layers and common interfaces. They did set up a system for paravirtualizing the libraries, so that if two or more containers needed the same library (presumably for something that wasn't a bottleneck), they could share the library rather than having separate copies.
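Just to make the division of labour concrete, here's a loose sketch - with entirely made-up names - of that split: a 'kernel' that does nothing but hand out raw slices of a resource, and an application-linked library that implements exactly the policy that one program wants (an append-only log, in this case) instead of a general-purpose filesystem. Real exokernels like Aegis/ExOS are far more sophisticated about protection; this only shows the shape of the idea:

```c
/* Loose sketch of the exokernel split: the 'kernel' multiplexes a raw
 * resource with no policy at all; each program links its own tailored
 * library that imposes whatever structure it wants. */
#include <stdio.h>
#include <string.h>

/* --- the 'exokernel' side: hand out raw extents, nothing more --- */
#define DISK_SIZE 4096
static unsigned char raw_disk[DISK_SIZE];   /* stand-in for real hardware */
static int next_free = 0;

struct extent { int start, len; };          /* a raw slice, no structure */

static struct extent exo_alloc_extent(int len)
{
    struct extent e = { next_free, len };
    next_free += len;
    return e;                               /* the kernel's job ends here */
}

/* --- a per-application 'library OS': append-only log policy --- */
struct log { struct extent ext; int used; };

static void log_init(struct log *l, int size)
{
    l->ext = exo_alloc_extent(size);
    l->used = 0;
}

static void log_append(struct log *l, const char *msg)
{
    int n = (int)strlen(msg);
    if (l->used + n > l->ext.len) return;   /* this app's choice: just drop it */
    memcpy(raw_disk + l->ext.start + l->used, msg, n);
    l->used += n;
}

int main(void)
{
    struct log l;
    log_init(&l, 256);
    log_append(&l, "hello from my own tiny 'library OS'\n");
    fwrite(raw_disk + l.ext.start, 1, l.used, stdout);
    return 0;
}
```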
The exokernel approach wasn't a bad idea, but it really was only suited to servers - for general-purpose interactive systems, there were too many programs that would need to interact with each other, meaning that they would run into the same kind of IPC overhead that microkernels did, if not worse.
Also, it added to the burden both on the developers of a server or application and on the system configurator, as you didn't have a standard set of OS services you could be certain would always be there - every system configuration would be unique, which is fine for a handful of systems but won't scale to hundreds of thousands or millions of them.
I know I said that I was done talking about Synthesis, but I should mention that a lot of the ideas in it were aimed at the same kind of micro-optimization as exokernels, but done programmatically, rather than by putting the burden on the application developers. However, it runs into another problem that also hurt exokernels: poor locality of memory access.
Still, the exokernel concept did leave an impact on the set of ideas that became containerization systems such as Docker. The Synthesis approach, not so much (it's too weird and complex for most people), though you do see a few echoes of it in some of the newer systems.
This brings us to the rise of server farms, which is the real reason for the widespread use of containerization. Basically, Docker and its ilk allow server managers to hand out small, dedicated slivers of their servers to people who need an Internet-based server to do Just One Thing, but don't have the time or inclination to dive deep into the design of their own special-snowflake exokernel client.
Docker is a compromise between security (because the containers are better isolated than ordinary processes would be), simplicity (because they only need to have that one service running in the container), familiarity (because it can still be running at least a rump version of a commonly used OS backing up said service, so the container admins and developers don't need to learn anything new), flexibility (because the owners of the container can set up the specific OS they need without it conflicting with the admins of the server farm as a whole - well, not too often, anyway), and low administration overhead for the server farm admins (since they can dump most of the work on the admins for the individual containers).
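For what it's worth, the 'better isolated than ordinary processes' part of that list comes, on Linux, from kernel namespaces and cgroups rather than anything Docker-specific. Here's a bare-bones sketch of the namespace half: clone() a child into its own PID and hostname (UTS) namespaces so it sees itself as PID 1 on its 'own' machine. It needs root (or CAP_SYS_ADMIN) to run, and a real container runtime also sets up mount namespaces, a new root filesystem, cgroup limits, and so on:

```c
/* Minimal Linux namespace demo: the child gets its own PID and UTS
 * namespaces, so its PID and hostname are private to it. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_main(void *arg)
{
    (void)arg;
    /* Only affects this UTS namespace, not the host's hostname. */
    sethostname("toybox", strlen("toybox"));

    char host[64] = {0};
    gethostname(host, sizeof host - 1);
    /* In the new PID namespace this prints pid=1. */
    printf("inside:  pid=%d hostname=%s\n", (int)getpid(), host);
    return 0;
}

int main(void)
{
    static char stack[STACK_SIZE];

    /* New PID + UTS namespaces; SIGCHLD so we can wait() for the child.
     * The stack grows downward, so pass the top of the buffer. */
    pid_t pid = clone(child_main, stack + STACK_SIZE,
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone (are you root?)");
        return EXIT_FAILURE;
    }
    printf("outside: child is pid=%d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```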
Containerization also has the advantage, common to more general types of 'cloud' virtualization, that the container doesn't need to be tied to a specific server, but can 'float' between physical hosts transparently, or even be running on multiple physical hosts simultaneously - provided you do a damn good job of synchronizing them; or better still, set up the services in such a way that the different copies don't need to be synchronized. Since both of those are a lot easier to do with a single-purpose service that only touches a very limited cross-section of the data and other resources, it makes some sense to do micro-services in separate containers when it is feasible, as opposed to having one big service that is harder to float efficiently.