My flippant answer is that containerization makes it easier for the developers because even if they do something stupid that would either crash the OS, or lead to a runaway memory leak, they can rely on the
hypervisor to clean up the mess.
Mind you, even going back to the outset of virtualization technology in the late 1960s, it has often been treated as a way to fix problems with buggy process isolation, so there's a certain amount of truth to that joke.
So, a large part of the appeal of
system-level virtualization is the ability to more completely isolate programs which are known never to need to communicate with each other beyond the system's multiplexing of resources. That's one part of this, definitely.
This was also related to the main idea behind
Exokernel designs - if the programs don't need to communicate with each other, then you can virtualize the programs themselves, and leave it to them to tune their operations.
The difference is that programmers are lazy, including OS devs, and it is easier (and in some ways more flexible) for all involved to spin up a full copy of an OS inside the container, using an existing OS as the hypervisor, rather than come up with a new, stand-alone hypervisor (and a paravirtualized shared library model that everyone would have to learn, and then tweak on their own). Since the ultra-high performance and ultra-fine granularity of an exokernel is pretty much a case of YAGNI for your typical web service - most of the time, 90% of the performance hit is in network lag, anyway - the containerization advocates took the virtualization but ignored the rest.
Mind you, I could mention
Synthesis OS's synthetic machines (s-machines), in which individual programs would run paravirtualized and potentially have multiple isolated processes within the program, but I do enough fangirling over Massalin already.
But let's look at the history a bit more.
When IBM first came out with a
VM/370 in... what, 1972? - the main goal was to be able to run multiple OSes, allowing them to run the existing OS/360 and its descendant alongside Conversational Monitor System, meaning that they didn't need the sort of general-purpose timesharing system they so loathed in order to serve remote terminals. The fact that their attempt at a commercial timesharing system was an epic fail had nothing to do with this, of course.
IBM had always disliked the idea of dirty peasant users getting their grubby mitts on their beautiful *smirk*, elegant *giggle* batch-processing systems, which they were certain were The One True Way to Compute and always would be. However, the success of remote terminal serving systems such as SABRE (within their limited sphere of operation), and of timesharing systems such as ITS and Dartmouth TSS (within their limited user base), made it clear that they needed a response, so they came up with TSO, the
Time Sharing Option, which was an add-on bolted onto the side of the MVT variant of OS/360. Unfortunately, it sucked like a vacuum cleaner, because a) it was just a bag on the side of a system that wasn't designed for interactive use, b) IBM didn't really get the idea that users might not be very patient with delays during interactive use, and c) the first release was a prototype that got rushed to production for the specific purpose of convincing their customers not to bother with timesharing. People saw right through that last part, and in any case, most of the customers who actually needed timesharing (or thought they did) had jumped ship for more flexible and less costly systems from DEC, Honeywell, or GE even before it was released.
Even so, enough customers stuck it out that it failed to fail outright and became a
'Springtime for Hitler' situation for IBM, making it in some ways a preview of what would happen with the IBM PC.
Meanwhile, a group of researchers at their Cambridge, Massachusetts research center - the same place where TSO was developed - hit on the idea that one could run a simulated computer system on the very hardware being simulated: actually run it on said hardware, but monitor the simulated system and trap any operations that you don't want the simulation to perform on the real machine. This allowed them to create a virtual computer in which the simulation would run at full speed for most things, but could be prevented from doing dangerous or unwanted things, while still maintaining the illusion that it had complete control of a real computer. This is the idea that would eventually become virtualization as we now know it.
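If it helps to see the shape of that, here's a toy sketch in C of the trap-and-emulate control structure - emphatically not how CP was actually built (a real monitor runs the guest's unprivileged instructions directly on the CPU rather than interpreting them, which is the whole point), and every name and 'instruction' below is made up purely for the example:

```c
/* Toy illustration of the trap-and-emulate idea: let ordinary work run
 * untouched, intercept anything that would touch the real machine and
 * emulate it against virtual resources instead. */
#include <stdio.h>

enum op { OP_ADD, OP_LOAD_IMM, OP_START_IO, OP_HALT };  /* tiny fake ISA */

struct insn { enum op op; int a, b; };

struct guest {
    int regs[4];
    int running;
};

/* The 'monitor': called whenever the guest tries a privileged operation. */
static void trap_to_monitor(struct guest *g, const struct insn *i)
{
    switch (i->op) {
    case OP_START_IO:
        /* Emulate the I/O against a virtual device (here: just stdout). */
        printf("[monitor] guest asked for I/O: value %d\n", g->regs[i->a]);
        break;
    case OP_HALT:
        printf("[monitor] guest asked to halt; stopping the virtual machine\n");
        g->running = 0;
        break;
    default:
        break;
    }
}

static void run_guest(struct guest *g, const struct insn *prog, int len)
{
    for (int pc = 0; pc < len && g->running; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        /* Unprivileged instructions: in a real monitor these execute
         * directly on the hardware at full speed. */
        case OP_LOAD_IMM: g->regs[i->a] = i->b;            break;
        case OP_ADD:      g->regs[i->a] += g->regs[i->b];  break;
        /* Privileged instructions: trap to the monitor. */
        default:          trap_to_monitor(g, i);           break;
        }
    }
}

int main(void)
{
    struct insn prog[] = {
        { OP_LOAD_IMM, 0, 40 },
        { OP_LOAD_IMM, 1, 2 },
        { OP_ADD,      0, 1 },
        { OP_START_IO, 0, 0 },   /* would touch real hardware -> trapped */
        { OP_HALT,     0, 0 },
    };
    struct guest g = { {0}, 1 };
    run_guest(&g, prog, sizeof prog / sizeof prog[0]);
    return 0;
}
```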
Initially, they worked out a software-based system that ran on existing hardware, which was called 'Control Program/Cambridge Monitor System', or '
CP/CMS'. The 'monitor' acted as a single-tasking, single-user system which, from the perspective of the user, appeared to be running on dedicated hardware - not too different from using, say, a PDP-8, except that you didn't have (or need) access to the actual computer in order to use it.
That suited IBM's management, trainers, and field technical support right down to their socks, because it meant the system operators could spin up a CMS container as needed, and then forget about it, letting them live in their batch-processed Laputa and pretend timesharing users didn't exist most of the time.
It also appealed to their sales force - who were, after all, the ones who really called the shots at IBM - because they could claim to have a timesharing system without scaring away their institutional customers, whom they had spent half a decade convincing that timesharing was evil and batch processing was a gift from on high.
As research continued, they developed improvements which relied on hardware modifications made to the researchers' testbed mainframe. Many of the virtualization techniques still in use were developed at this time, but it would be almost two decades before microprocessors - which didn't even really exist at the time - would be able to implement them.
This worked out well enough that when IBM released the System/370 update of the System/360 line, they were ready to include that hardware virtualization support in some higher-end models. They created a dedicated hypervisor called
VM/370, and renamed the Cambridge Monitor System to
Conversational Monitor System, and the descendants of both are still a mainstay of IBM's mainframe, high-end server, and blade server systems.
OK, let's jump ahead a few years. We can skip the 80386 for now, since, while it was an impressive feat of engineering in many ways, it didn't really do anything new; in fact, it only provided a sliver of what the 1970s mainframes were doing, and wasn't even the first microprocessor to do so.
The important next step came about a decade later, though: the work at MIT on the exokernel idea. Basically, several OS researchers there looked at VM/CMS and the other client OSes VM ran and said, in effect, 'if all you are doing is running one program, why does the virtualized system need a general-purpose OS at all?' It was You Ain't Gonna Need It writ large, years before the term was even coined.
They decided to shuck all the general-purpose OS services in favor of a stripped-down hypervisor whose sole job was to multiplex access to the hardware, and have each program carry its own, highly tailored library of operations for interfacing with the hardware in a precise manner, with no wishy-washy stuff about abstraction layers and common interfaces. They did set up a system for paravirtualizing the libraries, so that if two or more containers needed the same library (presumably for something that wasn't a bottleneck), they could share the library rather than having separate copies.
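Just to make the division of labour concrete, here's a loose sketch - with entirely made-up names - of that split: a 'kernel' that does nothing but hand out raw slices of a resource, and an application-linked library that implements exactly the policy that one program wants (an append-only log, in this case) instead of a general-purpose filesystem. Real exokernels like Aegis/ExOS are far more sophisticated about protection; this only shows the shape of the idea:

```c
/* Loose sketch of the exokernel split: the 'kernel' multiplexes a raw
 * resource with no policy at all; each program links its own tailored
 * library that imposes whatever structure it wants. */
#include <stdio.h>
#include <string.h>

/* --- the 'exokernel' side: hand out raw extents, nothing more --- */
#define DISK_SIZE 4096
static unsigned char raw_disk[DISK_SIZE];   /* stand-in for real hardware */
static int next_free = 0;

struct extent { int start, len; };          /* a raw slice, no structure */

static struct extent exo_alloc_extent(int len)
{
    struct extent e = { next_free, len };
    next_free += len;
    return e;                               /* the kernel's job ends here */
}

/* --- a per-application 'library OS': append-only log policy --- */
struct log { struct extent ext; int used; };

static void log_init(struct log *l, int size)
{
    l->ext = exo_alloc_extent(size);
    l->used = 0;
}

static void log_append(struct log *l, const char *msg)
{
    int n = (int)strlen(msg);
    if (l->used + n > l->ext.len) return;   /* this app's choice: just drop it */
    memcpy(raw_disk + l->ext.start + l->used, msg, n);
    l->used += n;
}

int main(void)
{
    struct log l;
    log_init(&l, 256);
    log_append(&l, "hello from my own tiny 'library OS'\n");
    fwrite(raw_disk + l.ext.start, 1, l.used, stdout);
    return 0;
}
```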
The exokernel approach wasn't a bad idea, but it really was only suited to servers - for general-purpose interactive systems, there were too many programs that would need to interact with each other, meaning that they would run into the same kind of IPC overhead that microkernels did, if not worse.
Also, it added to the burden both on the developers of a server or application and on the system configurator, as you didn't have a standard set of OS services you could be certain would always be there - every system configuration would be unique, which is fine for a handful of systems but won't scale to hundreds of thousands or millions of them.
I know I said that I was done talking about Synthesis, but I should mention that a lot of the ideas in it were aimed at the same kind of micro-optimization as exokernels, but done programmatically, rather than by putting the burden on the application developers. However, it runs into another problem that also hurt exokernels: poor locality of memory access.
Still, the exokernel concept did leave an impact on the set of ideas that became containerization systems such as Docker. The Synthesis approach, not so much (it's too weird and complex for most people), though you do see a few echoes of it in some of the newer systems.
This brings us to the rise of server farms, which is the real reason for the widespread use of containerization. Basically, Docker and its ilk allow server managers to hand out small, dedicated slivers of their servers to people who need an Internet-based server to do Just One Thing, but don't have the time or inclination to dive deep into the design of their own special-snowflake exokernel client.
Docker is a compromise between security (because the containers are better isolated than ordinary processes would be), simplicity (because they only need to have that one service running in the container), familiarity (because it can still be running at least a rump version of a commonly used OS backing up said service, so the container admins and developers don't need to learn anything new), flexibility (because the owners of the container can set up the specific OS they need without it conflicting with the admins of the server farm as a whole - well, not too often, anyway), and low administration overhead for the server farm admins (since they can dump most of the work on the admins for the individual containers).
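For what it's worth, the 'better isolated than ordinary processes' part of that list comes, on Linux, from kernel namespaces and cgroups rather than anything Docker-specific. Here's a bare-bones sketch of the namespace half: clone() a child into its own PID and hostname (UTS) namespaces so it sees itself as PID 1 on its 'own' machine. It needs root (or CAP_SYS_ADMIN) to run, and a real container runtime also sets up mount namespaces, a new root filesystem, cgroup limits, and so on:

```c
/* Minimal Linux namespace demo: the child gets its own PID and UTS
 * namespaces, so its PID and hostname are private to it. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_main(void *arg)
{
    (void)arg;
    /* Only affects this UTS namespace, not the host's hostname. */
    sethostname("toybox", strlen("toybox"));

    char host[64] = {0};
    gethostname(host, sizeof host - 1);
    /* In the new PID namespace this prints pid=1. */
    printf("inside:  pid=%d hostname=%s\n", (int)getpid(), host);
    return 0;
}

int main(void)
{
    static char stack[STACK_SIZE];

    /* New PID + UTS namespaces; SIGCHLD so we can wait() for the child.
     * The stack grows downward, so pass the top of the buffer. */
    pid_t pid = clone(child_main, stack + STACK_SIZE,
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone (are you root?)");
        return EXIT_FAILURE;
    }
    printf("outside: child is pid=%d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```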
Containerization also has the advantage, common to more general types of 'cloud' virtualization, that the container doesn't need to be tied to a specific server, but can 'float' between physical hosts transparently, or even be running on multiple physical hosts simultaneously - provided you do a damn good job of synchronizing them; or better still, set up the services in such a way that the different copies don't need to be synchronized. Since both of those are a lot easier to do with a single-purpose service that only touches a very limited cross-section of the data and other resources, it makes some sense to do micro-services in separate containers when it is feasible, as opposed to having one big service that is harder to float efficiently.