Re: What generates operating system files/folders, and when?
Posted: Mon Dec 23, 2019 7:38 pm
I would like to provide my own answer to the original question.
Quote: This is probably a really simple question, but I'm just beginning to think about hard drive access in my operating system and then realised that I actually don't know what bit of software creates the OS filesystem (for example /bin, /etc, etc.. on linux), and when it's generated. Does the kernel, when booting up, check that all of these folders/files exist, and if they don't, it creates them? This doesn't seem quite right.
Linux has what is called a virtual file system. It is not alone in this; in fact, essentially every modern operating system has one. A virtual file system allows many different kinds of file systems to be accessed through the same common interface. Operating systems like Linux, and its precursors and brethren in the Unix family, place all of these filesystems together in one namespace. This allows a directory like /usr to represent the contents of one filesystem while the rest of / is another, which was very common in the early days of Unix. These days you're more likely to see /home as a separate filesystem, rather than /usr, but the idea remains the same. Those directories represent file systems on a real disk that already existed there before the operating system started, and which the kernel assigned to those virtual paths on startup. But this unified, single-namespace virtual file system also allows Linux to do something else: have you ever looked in /sys or /proc? Those don't exist on a disk; they're synthesized by the kernel when you access them - truly virtual files within the virtual file system. On newer Linux systems, /dev may also work this way, but previously it did not: /dev used to be an actual directory on disk, containing special files called device files, which have device identification numbers stored in them instead of data. It used to be that /dev was set up manually by creating these files on disk, but this was later replaced with automated systems like udev, before being replaced with the synthesized version that allows the kernel to do it all automatically.
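To make the "one namespace, many filesystems" idea concrete, here is a rough sketch in Python. It is not how Linux implements its VFS; the class names (DiskFS, ProcFS, VFS) and the file names are made up for illustration. It only shows a mount table mapping path prefixes to filesystem objects, one of which synthesizes its contents on access.

```python
class DiskFS:
    """Stands in for a filesystem that already exists on a disk."""

    def __init__(self, files):
        self.files = files  # relative path -> bytes

    def read(self, path):
        return self.files[path]


class ProcFS:
    """Synthesized filesystem: nothing is stored, contents are computed on access."""

    def __init__(self):
        self.reads = 0

    def read(self, path):
        self.reads += 1
        if path == "reads":  # hypothetical file reporting how often we were read
            return str(self.reads).encode()
        raise FileNotFoundError(path)


class VFS:
    """A single namespace built from several mounted filesystems."""

    def __init__(self):
        self.mounts = {}  # mount point -> filesystem object

    def mount(self, point, fs):
        self.mounts[point.rstrip("/") or "/"] = fs

    def resolve(self, path):
        # The longest mount point that is a prefix of the path wins.
        best = None
        for point in self.mounts:
            prefix = point if point.endswith("/") else point + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise FileNotFoundError(path)
        return self.mounts[best], path[len(best):].lstrip("/")

    def read(self, path):
        fs, relative = self.resolve(path)
        return fs.read(relative)
```

Mounting one DiskFS at / and another at /home, plus a ProcFS at /proc, lets the same read() call reach all three, and the caller never needs to know which kind of filesystem answered.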
Quote: So let's say I'm booting up a computer from my kernel, which is loaded onto a CD-ROM. Hmm.. As I write this, it's all starting to make sense, so, is this correct?:
Let's start at the bootloader. There are a lot of different architectures where booting works in subtly different ways, but the general idea is the same: the bootloader is a very small program, stored somewhere the firmware - the computer itself - knows to look for it. In the context of a PC with a BIOS, the BIOS knows how to read hard disks and will check for these small chunks of code at the start of disks. Because the amount of code the firmware can load on its own is often limited (512 bytes in the classic PC BIOS: the first sector of the disk, known as the master boot record or MBR), bootloaders are typically broken up into multiple parts which we commonly call stages. GRUB, for example, has a small initial stage that loads another stage. GRUB is a big bootloader - it's essentially a full operating system in its own right. Its initial stage looks for a stage "1.5" which is stored at a hardcoded place on the disk and contains code to read from the filesystem on that disk. That stage can then look for files within the file system and find the final stage, which contains the rest of the GRUB code. All of this data - the MBR, the stage 1.5 code, and the rest of GRUB - is created by an installer.
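If you want to poke at an MBR yourself, here is a small Python sketch that checks a 512-byte sector for the 0x55 0xAA boot signature and reads the four-entry partition table that sits after the boot code. The offsets follow the classic MBR layout; the function names and the example partition values are my own inventions for illustration.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # boot code occupies the bytes before this
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of the sector
ENTRY_FORMAT = "<B3xB3xII"     # status, (CHS start), type, (CHS end), LBA start, sector count


def parse_mbr(sector):
    """Return the non-empty partition entries of a classic MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly one 512-byte sector")
    if sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("missing 0x55 0xAA boot signature")
    partitions = []
    for index in range(4):
        offset = PARTITION_TABLE_OFFSET + 16 * index
        status, ptype, lba_start, sectors = struct.unpack_from(ENTRY_FORMAT, sector, offset)
        if ptype != 0:  # type 0 marks an unused slot
            partitions.append({
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": sectors,
            })
    return partitions


def make_example_mbr():
    """Build a made-up MBR with no boot code and one bootable partition."""
    sector = bytearray(SECTOR_SIZE)
    struct.pack_into(ENTRY_FORMAT, sector, PARTITION_TABLE_OFFSET, 0x80, 0x83, 2048, 204800)
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)
```

On a real machine you would pass in the first 512 bytes of a disk image instead of make_example_mbr(). Note how little room is left for code once the table and signature are accounted for, which is exactly why multi-stage loaders exist.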
Quote: In this case, though, how does the bootloader know what is code, and to load into ram, and what is data (such as the /bin, and /etc folders), so to not load those into ram? In other words, when the bootloader (say, Grub), is loading the kernel from the hard drive into ram (by searching for the 0xbadb002 magic number), how does it know when it's got to the end of the kernel program which it wants to load to ram, and into the beginning of data?
As I said, GRUB is essentially an operating system in its own right. It has disk drivers and file system drivers. It knows what files and directories are, and when you give it a pathname it can find those files on disk. So it never has to guess where the kernel ends: the kernel is a file, and the filesystem records where that file is and how big it is. GRUB also knows what a Linux kernel is and how to read it and load it to the right place following Linux's boot protocol. Note that "the 0xbadb002 magic" is part of a specification called Multiboot (the actual header magic is 0x1BADB002) and is not used by Linux, but for a Multiboot-compliant kernel this magic sequence allows GRUB to find a header with information on how to load things. Regardless, both Multiboot and Linux's boot protocol define specifications for how the bootloader can pass control to the kernel and give it configuration settings and other data.
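For the curious, the header search a Multiboot loader performs can be sketched in a few lines of Python. Per the Multiboot (version 1) specification, the header must lie within the first 8192 bytes of the kernel file, be 4-byte aligned, and start with the magic 0x1BADB002 followed by a flags word and a checksum that makes the three sum to zero modulo 2^32. The function names and the fake kernel image below are assumptions for the sake of the example, not GRUB's code.

```python
import struct

MULTIBOOT_HEADER_MAGIC = 0x1BADB002
SEARCH_LIMIT = 8192  # the header must fit entirely inside this many bytes


def find_multiboot_header(image):
    """Return (offset, flags) of the Multiboot header, or None if absent."""
    limit = min(len(image), SEARCH_LIMIT)
    # Step in 4-byte increments because the header must be 32-bit aligned.
    for offset in range(0, limit - 11, 4):
        magic, flags, checksum = struct.unpack_from("<III", image, offset)
        if magic != MULTIBOOT_HEADER_MAGIC:
            continue
        # The checksum guards against the magic appearing by coincidence.
        if (magic + flags + checksum) & 0xFFFFFFFF == 0:
            return offset, flags
    return None


def make_example_image():
    """Build a made-up kernel image with a valid header at offset 64."""
    flags = 0x00000003
    checksum = -(MULTIBOOT_HEADER_MAGIC + flags) & 0xFFFFFFFF
    header = struct.pack("<III", MULTIBOOT_HEADER_MAGIC, flags, checksum)
    return b"\x00" * 64 + header + b"\x90" * 100
```

Finding the header only tells the loader that the file is a Multiboot kernel and which features it asks for. Where the code and data actually go in memory comes from the ELF headers of the file, or from optional address fields in the Multiboot header when the corresponding flag is set.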
Others in this thread talked about initrds - initial ramdisks. Typically in a Linux system, when the kernel first boots, it doesn't necessarily "mount" a physical disk. Alongside the kernel, the bootloader loads a file from disk which contains its own filesystem (a cpio archive in Linux's most common case, though it supports initial ramdisks in a number of formats). The filesystem stored in that ramdisk can have its own directory structure, including its own /bin or whatever you want to have there. You can run an entire system from this ramdisk on Linux, but typically it just contains programs critical to the startup process - scripts to mount other disks, sometimes a boot splash. But initrds aren't a requirement, and in fact the ones on the Ubuntu installation I'm using right now only contain a directory with microcode the kernel loads for the CPU.
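To show that an initial ramdisk really is just "a file with a filesystem inside", here is a simplified Python sketch that writes and reads the "newc" cpio format (magic 070701) that Linux uses for its initramfs. It deliberately ignores directory entries, compression, and most metadata, and the helper names and file contents are made up, so treat it as an illustration of the layout rather than a replacement for the real cpio tool.

```python
HEADER_SIZE = 110  # 6-byte magic + 13 fields of 8 hex digits each


def _pad4(n):
    """Number of padding bytes needed to reach a multiple of four."""
    return (4 - n % 4) % 4


def cpio_entry(name, data, mode=0o100644):
    """Encode one file as a newc cpio entry."""
    name_bytes = name.encode() + b"\x00"
    # ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check
    fields = [0, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_bytes), 0]
    out = b"070701" + b"".join(b"%08X" % field for field in fields) + name_bytes
    out += b"\x00" * _pad4(len(out))           # header + name padded to 4 bytes
    out += data + b"\x00" * _pad4(len(data))   # file data padded to 4 bytes
    return out


def read_cpio(archive):
    """Return {path: contents} for every entry before the trailer."""
    files = {}
    pos = 0
    while pos < len(archive):
        if archive[pos:pos + 6] != b"070701":
            raise ValueError("not a newc cpio entry at offset %d" % pos)
        fields = [int(archive[pos + 6 + 8 * i:pos + 14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + HEADER_SIZE
        name = archive[name_start:name_start + namesize - 1].decode()
        data_start = name_start + namesize
        data_start += _pad4(data_start)
        if name == "TRAILER!!!":  # conventional end-of-archive marker
            break
        files[name] = archive[data_start:data_start + filesize]
        pos = data_start + filesize
        pos += _pad4(pos)
    return files
```

Concatenating a few cpio_entry() results and finishing with cpio_entry("TRAILER!!!", b"", 0) gives an archive that read_cpio() can walk. The kernel does essentially the same walk at boot to populate its initial root filesystem in memory.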
So that's GRUB and Linux, but let me go on a tangent and talk about my own OS here:
I provide my OS as a live CD. It comes as a file - an "image" - which contains an ISO 9660 file system, suitable for burning to a real CD, but most people load it in an emulator. I have bootloaders available for BIOS and UEFI, and I do some complicated things with that CD filesystem to make this work well. The ISO 9660 file system contains a file which itself contains a FAT32 file system. This is used by UEFI, and some headers in the CD filesystem tell UEFI where it can find that FAT32 filesystem. The rest of the ISO 9660 filesystem contains both the BIOS bootloader and specially crafted files which point into the FAT32 filesystem and mirror its contents. This is known as a "hybrid filesystem", in that two different formats can be used to access the same files.
The FAT32 image contains the EFI bootloaders, the kernel, device driver modules and the ramdisk. The kernel is a Multiboot-compliant ELF binary, the device driver modules are ELF object files, and the ramdisk is a tarball. The EFI bootloaders use EFI's APIs to access the FAT32 filesystem, while the BIOS bootloader contains an ISO 9660 filesystem driver. Both the EFI bootloaders and the BIOS bootloader implement the Multiboot specification and use their respective file system interfaces to load the kernel, modules, and ramdisk into memory. Using the Multiboot spec, the bootloader provides information to the kernel on what additional files it loaded (the modules and ramdisk). The kernel then starts, does some early setup, and finishes loading each of the modules (which contain device drivers for file systems, among other things). It then attempts to mount the ramdisk into its virtual filesystem and execute /bin/init. To do that, one of the modules is a file system driver for tarballs, which allows the kernel to perform typical VFS actions (read directories, read files) on the contents of the tarball that is already in memory. /bin/init is a typical userspace ELF binary which can then do whatever it wants. My live CDs use a script-based startup system, so init will run each of the ordered startup scripts, and one of the early ones will replace the tarball filesystem with an in-memory "temporary file system", which can be written and expanded arbitrarily, and then copy the contents of the original tarball into that tmpfs so that the live CD has a "writable root filesystem". Like Linux, my OS also has a synthesized /proc as well as a synthesized /dev.
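The tarball-then-tmpfs step can be mimicked in Python as a rough analogy. This is not my kernel's code: it leans on Python's standard tarfile module where the real thing is a kernel module, and the names TarFS, TmpFS, build_ramdisk and copy_to_tmpfs are invented for the example. It shows a read-only view over a tarball that is already in memory, followed by a copy into a writable in-memory store.

```python
import io
import tarfile


def build_ramdisk(files):
    """Pack {path: bytes} into an uncompressed tarball held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


class TarFS:
    """Read-only filesystem view of a tarball that is already in memory."""

    def __init__(self, image):
        self._tar = tarfile.open(fileobj=io.BytesIO(image), mode="r")

    def listdir(self, directory):
        prefix = directory.strip("/")
        prefix = prefix + "/" if prefix else ""
        names = set()
        for member in self._tar.getmembers():
            if member.name.startswith(prefix):
                rest = member.name[len(prefix):]
                if rest:
                    names.add(rest.split("/")[0])
        return sorted(names)

    def read(self, path):
        return self._tar.extractfile(path.strip("/")).read()

    def file_paths(self):
        return [member.name for member in self._tar.getmembers() if member.isfile()]


class TmpFS:
    """Writable in-memory filesystem: just a dictionary of paths to contents."""

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path.strip("/")]

    def write(self, path, data):
        self.files[path.strip("/")] = data


def copy_to_tmpfs(tarfs):
    """Copy every file out of the tarball so the new root can be modified."""
    root = TmpFS()
    for path in tarfs.file_paths():
        root.write(path, tarfs.read(path))
    return root
```

After copy_to_tmpfs(), writes go to the TmpFS copy while the original tarball stays untouched, which is the whole point of the swap: the ramdisk image itself never needs to be writable.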