Before I say anything about the design, I should mention that I maintain a wiki that serves as a design document. You can access it here: http://darksideproject.hopto.org/wiki/Main_Page. Also, I know this post is pretty long, but I did my best to organize it into sections. If you don't want to read the whole thing, I would most prefer feedback on the object manager, memory manager, subsystems, and the device manager.
To give a general overview of my design, I plan to use a hybrid kernel in my OS. The definition of "hybrid" I'm using is that all system components run in kernel mode, but they're structured similarly to a microkernel to allow for flexibility and modularity. The kernel in my design is actually divided into two components, the executive and the kernel. The kernel only handles low-level, architecture-specific functionality (just like a microkernel), while the executive performs system resource management. In addition to that, I also use a module called the Hardware Abstraction Layer (HAL) to abstract away differences in system hardware (things like legacy vs ACPI, 8259 vs APIC, PIT vs APIC timer vs HPET).
Executive
The executive is the central component of my operating system. It provides the basic system functionality that all applications, libraries, and drivers need to interface with the hardware, such as memory management, a file manager, processes and threads, and device management.
Object Manager
The object manager is the central component of the executive. It manages all of the system's resources as objects and is responsible for keeping track of the resources allocated to processes. All resource access goes through the object manager. Each object managed by the object manager has a header and a body. The header contains generic object information used by the object manager, while the body contains class-specific data. Generic object information includes the interfaces an object exposes, its access permissions, and a reference count.
Object Classes
Every executive subsystem implements object classes. An object class is a specific type of resource managed by the object manager. They are similar to classes in OOP languages. Each object class consists of a set of methods and defines the layout of the object body. The executive implements the following object types:
- Directory - Object used by the object manager to create the object namespace
- Section - Object that maps a part of a process's virtual address space
- Inode - VFS structures that contain information about files
- File - Instance of an open file, device, pipe, or socket
- Process - Self-contained tasks with their own threads, address space, and objects
- Thread - Parts of processes that have their own execution path, registers, and stacks
- Event - Asynchronous events that can be sent to threads
- Pipe - Objects that provide a bidirectional data flow between 2 file handles
- Socket - Communication endpoints that allow for data exchange between processes
- Semaphore - Synchronization primitives that can be owned by multiple threads
- Mutex - Synchronization primitives that can be owned by one thread at a time
- RWLock - Special mutexes that allow multiple threads to read a resource at the same time, but only one to write to it
- Timer - Objects that fire an event after a certain amount of time
- Module - Dynamically loadable kernel modules
- Device - Hardware that's part of the system
Objects managed by the object manager are exposed to userspace through handles. Handles are opaque structures that refer to objects. They are created by the object manager whenever an object is opened. A process must own a handle to an object before using it. Each process has its own handle table, which is a table matching handles to objects. Handle table entries contain a pointer to the object and the permissions that the process has to access that object.
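A handle-table lookup as described above might look roughly like this. The entry layout, the permission bits, and treating a handle as a table index are all assumptions for the sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical handle-table entry; the real layout is an assumption. */
#define HANDLE_READ  0x1u
#define HANDLE_WRITE 0x2u

typedef struct handle_entry {
    void    *object;       /* pointer to the object */
    uint32_t granted_mask; /* permissions granted at open time */
} handle_entry_t;

/* Translate a handle (here just an index) into an object pointer,
 * checking that the requested access was granted at open time. */
static void *lookup_handle(handle_entry_t *table, size_t count,
                           size_t handle, uint32_t desired)
{
    if (handle >= count || table[handle].object == NULL)
        return NULL;                 /* invalid or closed handle */
    if ((table[handle].granted_mask & desired) != desired)
        return NULL;                 /* access denied */
    return table[handle].object;
}
```

Checking permissions at lookup time, against the mask stored when the handle was opened, is what makes handles opaque and safe to pass around userspace.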
Object Namespace
Objects managed by the object manager can be given a name to identify them. The object manager maintains an internal object namespace that organizes named objects in a hierarchy. This allows for objects to be categorized and opened in a uniform manner. In order to implement the object namespace, the object manager defines a directory object type. Directory objects contain a list of directory entries, which are structures that map object names to object pointers. This allows for directory objects to contain objects in the namespace, including other directory objects. Each object maintains a link count that keeps track of how many directory entries point to it.
The only way for userspace to gain access to objects is through the object namespace. User applications can open objects in the namespace and get a handle to that object. When userspace code opens an object, it requests a specific interface. An interface is a set of methods that can be called. Each object can provide multiple interfaces. Calling one of an object's methods invokes a syscall that causes the object's methods to be executed in the executive or redirected to another process or over the network. In this way, the executive becomes a namespace manager and RPC system.
Memory Manager
The memory manager is responsible for managing virtual memory. The memory manager is made up of the physical memory manager, the virtual memory manager, and the kernel heap.
Physical Memory Manager
The physical memory manager hands out physical memory pages. In order to keep track of the system's physical memory, it first needs to know what physical memory is available. This information is collected by the kernel bootloader and passed to the kernel as an array of memory ranges, each recording its start, size, and flags. Once the physical memory manager has a map of available memory, it uses it to initialize the buddy allocator.
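A sketch of what the bootloader-provided memory map might look like; the exact field widths and flag values are assumptions:

```c
#include <stdint.h>

/* Hypothetical memory-range entry handed over by the bootloader. */
#define MEM_AVAILABLE 0x1u

typedef struct mem_range {
    uint64_t start;  /* physical start address */
    uint64_t size;   /* length in bytes */
    uint32_t flags;  /* e.g. MEM_AVAILABLE */
} mem_range_t;

/* Sum the usable memory reported in the map, as the physical memory
 * manager would before seeding the buddy allocator. */
static uint64_t total_available(const mem_range_t *map, int count)
{
    uint64_t total = 0;
    for (int i = 0; i < count; i++)
        if (map[i].flags & MEM_AVAILABLE)
            total += map[i].size;
    return total;
}
```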
Virtual Memory Manager
The virtual memory manager is used to manage a process' address space. One of its major responsibilities is keeping track of the virtual memory used by each process. It uses structures called Virtual Address Descriptors, or VADs, for this purpose. VADs contain information about a specific region of virtual memory in a process's address space, including its start, size, and type. Every process has a set of VADs organized by memory address in an AVL tree, which allows the memory manager to find a VAD that corresponds to a certain address.
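The VAD lookup can be sketched like this. I've used a flat array with a linear scan as a stand-in for the AVL tree, and the field names are my own:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical VAD; a real implementation keeps these in an AVL tree
 * keyed by start address rather than a flat array. */
typedef enum { VAD_FREE, VAD_PRIVATE, VAD_SECTION } vad_type_t;

typedef struct vad {
    uintptr_t  start;  /* first address of the region */
    size_t     size;   /* region length in bytes */
    vad_type_t type;   /* what backs this region */
} vad_t;

/* Find the VAD covering an address (linear scan stands in for the
 * O(log n) tree lookup). */
static const vad_t *vad_find(const vad_t *vads, int count, uintptr_t addr)
{
    for (int i = 0; i < count; i++)
        if (addr >= vads[i].start && addr < vads[i].start + vads[i].size)
            return &vads[i];
    return NULL;  /* address not described by any VAD */
}
```

This is exactly the query the page fault handler needs: given a faulting address, find out what (if anything) is supposed to be there.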
Section Objects
The virtual memory manager also implements section objects. Section objects are objects that map parts of a process's virtual address space. They are the basis on which memory-mapped files and shared memory are built. A section object is used to map a portion of the virtual address space to either physical memory or a file. Once a section object is created, it can be mapped into an address space by mapping a view of it. A view is a window onto a specific portion of the memory or file that the section object describes.
Mapping a view of a section object reserves a portion of the address space but does not commit it. This is used to implement a scheme called demand paging. Demand paging is a mechanism by which pages are only allocated once they're accessed. Since the mapped view of the section object isn't committed, accessing it causes a page fault. The page fault handler consults the VAD tree for the process, and upon learning that the memory faulted on is occupied by a section object, it commits the memory. For physical-memory backed sections, it allocates a physical page and maps it. For file-backed sections, the same process occurs, but the file data is then read into memory.
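The commit step of that fault path can be simulated in a few lines. Here the "page table" is just a boolean array and allocation is a counter; every name is a placeholder, and the file-backed read is reduced to a comment:

```c
#include <stdbool.h>
#include <stdint.h>

/* Demand-paging simulation: committed[] stands in for page-table
 * present bits, pages_allocated for the physical allocator. */
enum { NPAGES = 16, PAGE_SIZE = 4096 };
static bool committed[NPAGES];
static int  pages_allocated;

static void handle_page_fault(uintptr_t addr, bool file_backed)
{
    unsigned page = addr / PAGE_SIZE;
    if (page >= NPAGES || committed[page])
        return;              /* out of range, or already committed */
    pages_allocated++;       /* allocate a physical page */
    committed[page] = true;  /* map it at the faulting address */
    if (file_backed) {
        /* a real kernel would now issue a non-cached read to the
         * VFS to fill the new page with file data */
    }
}
```

The key property demand paging buys you is visible even in this toy: touching the same page twice only costs one allocation, and untouched pages cost nothing.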
Kernel Heap
The kernel heap is used for allocating kernel data structures. It heavily relies on both the physical and virtual memory managers to get memory for allocations. The functions the kernel heap implements are allocation, freeing, and reallocation.
The kernel heap is made up of 2 suballocators, a heap allocator and a slab allocator. The heap allocator manages a large area of memory, which is subdivided into chunks. It reuses the buddy allocator that is used by the physical memory manager. It is mainly used to allocate strings and buffers which do not have a predetermined size, as well as infrequently allocated objects.
The slab allocator is used for objects that are allocated often, like threads and inodes. The way the slab allocator works is that there are slab caches for each type of object. Each slab cache contains several slabs, which are blocks of memory with a predefined size. The advantage of using the slab allocator is that slabs in a cache can be reused. When allocating from a slab, it searches for a slab that's free and returns it. When freeing a slab, it just marks the slab as free. That way, freed slabs can be easily reused, with no need to search for more free memory or perform splitting or coalescing of chunks.
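A minimal slab cache showing the "just mark it free" property described above. One fixed-size cache with a used-bitmap and an O(n) search; real slab allocators keep free lists and multiple slabs per cache, so treat this purely as an illustration:

```c
#include <stddef.h>
#include <stdbool.h>

/* Toy slab cache: fixed object size, one slab, a used[] bitmap. */
enum { SLAB_OBJECTS = 8, OBJECT_SIZE = 64 };

typedef struct slab_cache {
    unsigned char memory[SLAB_OBJECTS][OBJECT_SIZE];
    bool          used[SLAB_OBJECTS];
} slab_cache_t;

static void *slab_alloc(slab_cache_t *cache)
{
    for (int i = 0; i < SLAB_OBJECTS; i++) {
        if (!cache->used[i]) {
            cache->used[i] = true;    /* claim the slot */
            return cache->memory[i];
        }
    }
    return NULL;                      /* cache is full */
}

static void slab_free(slab_cache_t *cache, void *obj)
{
    ptrdiff_t i = ((unsigned char *)obj - &cache->memory[0][0]) / OBJECT_SIZE;
    cache->used[i] = false;           /* just mark it free for reuse */
}
```

Note that free is O(1) and involves no splitting or coalescing, which is exactly why slabs win for frequently allocated fixed-size objects like threads and inodes.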
Virtual Filesystem
The executive’s virtual filesystem, or VFS, provides a filesystem abstraction. It allows for multiple different filesystems to be accessed in a uniform manner. The virtual filesystem in my OS is based on the Unix filesystem. The VFS uses a node graph to keep track of the filesystem hierarchy. Volumes can be added to the filesystem tree by mounting them on a directory.
The VFS implements several object types. In order for the VFS to have security and reference counting, its object types are managed by the Object Manager. This allows for files and directories to be secured by access control lists and contain a reference count so that they're removed when they're no longer in use.
Inodes
The most important structure in the VFS is the index node, or inode. Inodes contain important information about files, such as the mountpoint the inode resides on, the file size, owning user and group, access, modification, and change time, and file mode. Each filesystem implements a subclass of inodes containing filesystem specific data.
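An inode carrying the fields listed above might be declared like this. The names follow classic Unix convention and the `fs_private` pointer stands in for the filesystem-specific subclass data; both are assumptions:

```c
#include <stdint.h>
#include <time.h>

struct mountpoint; /* opaque; defined by the mount code */

/* Hypothetical inode layout matching the fields described above. */
typedef struct inode {
    struct mountpoint *mount;      /* mountpoint the inode resides on */
    uint64_t  size;                /* file size in bytes */
    uint32_t  uid, gid;            /* owning user and group */
    uint32_t  mode;                /* file type and permission bits */
    time_t    atime, mtime, ctime; /* access/modification/change time */
    uint32_t  link_count;          /* directory entries pointing here */
    void     *fs_private;          /* filesystem-specific subclass data */
} inode_t;
```

Keeping the filesystem-specific part behind a single pointer (or, alternatively, embedding the generic inode inside a larger driver struct) is what lets every filesystem driver subclass the same base type.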
Directory Entries
Another important structure in the VFS is the directory entry. Directory entries are structures that map filenames to inodes. Directory inodes contain a list of directory entries, which allows them to contain inodes as children. Using the idea of directory entries, the same inode can be referenced multiple times in the filesystem hierarchy if multiple directory entries refer to it. This is known as hard linking. Each inode maintains a link count that keeps track of how many directory entries point to it. Directory entries are crucial in building the filesystem hierarchy.
Although directory entries are a major part of the VFS, they aren't actually provided by the VFS. This responsibility is owned by the Object Manager. The object manager maintains an internal object namespace that organizes named objects in a hierarchy, which the VFS integrates with. In order to implement the object namespace, the object manager defines a directory object type. Directory objects are used to contain objects in the namespace. The VFS becomes part of this namespace by creating its own directory object under the path name \VFS, which represents the root of the filesystem. This \VFS directory implements methods for both an object directory and an inode, allowing it to function as both.
Filesystem Drivers
A major component of the VFS is filesystem drivers. Filesystem drivers are responsible for treating a volume as a filesystem. All filesystem drivers implement a set of filesystem functions and register them with the VFS. All of these functions take a device as an argument, which allows these functions to be used on any devices that use that filesystem. Mountpoints can use a registered filesystem in order to handle filesystem requests.
Mountpoints
Mountpoints are locations in the VFS where volumes can be added to the filesystem. They contain an inode that functions as the mountpoint, a device that is mounted, and a filesystem to handle requests. The way they work is that a certain device is mounted at an inode with a specific filesystem. The VFS looks up the filesystem, and if it is found, creates a mountpoint for the device. This mountpoint is added to the mountpoint list, and then the inode is updated to point to the root of the mounted filesystem.
Caching
In order to speed up filesystem access, the VFS implements caching of file data and directory entries. File caching is a mechanism used to cache file data. It uses section objects provided by the memory manager to map 256 KB views of files into memory. Each inode contains a pointer to the view that holds its cached data. If cached I/O is allowed on the file, all read and write requests to files first attempt to read or write from the file cache. If the requested data is not in the cache, or the cache does not exist, the VFS maps a view of the file and reads or writes the data to it. The memory manager is responsible for reading in or writing out the data by sending a non-cached I/O request to the VFS.
Directory caching is a mechanism used to cache directory entries. It keeps the directory entries most likely to be used again in memory. The code that handles directory caching is implemented by the inode function used to look up a directory entry. When this function is called by the Object Manager while traversing the VFS namespace, it first attempts to get the directory entry from the inode's list. If this fails, and the function detects that the inode does not contain the full directory cache, it calls the filesystem driver to read in the specific directory entry that it's looking for. At this point, if the directory entry was successfully read in, it is returned. Otherwise, the function has failed to find a directory entry.
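That lookup flow can be sketched as follows. The `dentry_t` layout, the `cache_complete` flag, and `fs_read_dirent()` (stubbed out here) are all hypothetical:

```c
#include <string.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical cached directory entry. */
typedef struct dentry {
    char           name[32];
    int            inode_num;
    struct dentry *next;
} dentry_t;

/* Stand-in for the filesystem driver reading one entry from disk;
 * this stub pretends the entry is not on disk either. */
static dentry_t *fs_read_dirent(const char *name)
{
    (void)name;
    return NULL;
}

/* Lookup: try the cache, then fall back to the driver only when the
 * cache is known to be incomplete. */
static dentry_t *dir_lookup(dentry_t *cache, bool cache_complete,
                            const char *name)
{
    for (dentry_t *d = cache; d != NULL; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;              /* cache hit */
    if (cache_complete)
        return NULL;               /* fully cached: entry doesn't exist */
    return fs_read_dirent(name);   /* ask the filesystem driver */
}
```

The `cache_complete` check is the interesting part: once a directory is fully cached, a miss proves nonexistence without touching the disk (a negative lookup for free).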
Multitasking
The executive provides multitasking with processes and threads. Processes are self-contained tasks with their own threads, address space, and object handles. Threads are parts of processes that have their own execution path, registers, and stacks, but run in the same address space as every other thread in their parent process. Each process contains at least one executing thread.
Inter-process Communication
The executive exposes several IPC primitives to userspace, such as events, pipes, sockets, shared memory, synchronization primitives, and waitable timers. These IPC primitives are implemented as objects managed by the object manager. They can be shared between processes by naming them in the object namespace and opening them by name.
Events are asynchronous events that can be sent to threads. They're meant to be as generic as possible. Events can hold an arbitrary amount of data, which allows them to be used for asynchronous message passing. They are used for many purposes, such as notifying I/O completion, GUI events, and POSIX signals.
Pipes and sockets are both objects that are accessed through file handles. Pipes are objects that provide a bidirectional data flow between 2 file handles. Sockets are communication endpoints that allow for data exchange between processes on the same or different systems. They are session-layer interfaces that provide an abstraction on top of multiple transport-layer protocols.
Shared memory is memory that is shared between multiple processes. It is implemented using the Memory Manager's section objects.
Synchronization controls access to resources from threads. Synchronization is designed to enforce a mutual exclusion policy. There are three synchronization primitives that the kernel exposes to userspace: the semaphore, mutex, and readers/writer lock.
Timers are objects that fire an event after a certain amount of time. They can be used synchronously, meaning that an application blocks on the timer, or asynchronously, where an application is interrupted by an event when the timer finishes. There are two types of timers: manual-reset timers and periodic timers. Manual-reset timers will not fire again until they are reprogrammed. Periodic timers automatically reset each time they fire, allowing for timers to fire at a common interval.
Subsystems
The executive exposes the user mode API using syscalls. In order for the executive to be incredibly flexible, it allows for different subsystems to be loaded as kernel modules. Subsystems are pluggable syscall interfaces, which applications can run under. With subsystems, applications from any operating system can be run, as long as there's an appropriate subsystem.
Subsystems are written to support a certain API, such as my OS API, POSIX, or the Windows API. Each subsystem implements a set of syscalls exposing that API to userspace. My OS is planned to support several subsystems: the native subsystem, which implements the OS API; the POSIX subsystem, which implements the POSIX API; and the Windows subsystem, which implements the Windows API and emulates Windows features such as its volume management and the registry. I'm well aware that supporting all these subsystems will take a lot of effort, so my only immediate goal is the native subsystem. However, I'm keeping the other ones in mind.
Modules
The executive is designed to allow for kernel modules to be dynamically loaded. Kernel modules are executable files that the executive loads in order to add functionality to it at runtime. Modules are dynamically linked with the executive. The four main types of modules are device drivers, filesystem drivers, executable formats, and subsystems. Device drivers are the most common type of modules. They control devices. Filesystem drivers are a special type of driver, responsible for treating a volume as a filesystem. Executable format and subsystem modules are used to add their respective features to the executive.
To allow for the executive subsystems to find modules they want to load, the executive makes use of the module registry. The module registry is a database of modules that can be loaded. It allows for modules to be identified by the executive. The module registry is a text file that gets parsed by the bootloader and converted into a tree. This tree can be searched in order to find a module.
The way that modules are identified depends on what type of module they are. With device drivers, modules are identified by their device class, bus type, and device ID. The device ID is bus specific. For example, for a PCI ATA hard drive with a PCI vendor ID of 0x8086 and a PCI device ID of 0x7111, the device ID would be 0x80867111. Executable formats, filesystem drivers, and subsystems are identified by strings. An ELF executable format module would have a string of "elf", an EXT2 filesystem driver would have a string of "ext2", and the POSIX subsystem would have the string "posix".
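The PCI example above is just the 16-bit vendor and device IDs packed into one 32-bit value, assuming that's the intended encoding:

```c
#include <stdint.h>

/* Compose the bus-specific device ID from the PCI vendor/device
 * pair, per the 0x8086/0x7111 -> 0x80867111 example. */
static uint32_t pci_device_id(uint16_t vendor, uint16_t device)
{
    return ((uint32_t)vendor << 16) | device;
}
```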
Device Manager
The device manager is responsible for detecting and managing devices, performing power management, and exposing devices to userspace. It contains code for driver loading, I/O requests, and power management.
Drivers
Devices are managed by device drivers. As explained above, device drivers are the most common type of kernel module; they control devices. Device drivers are layered on top of each other in driver stacks. There are three types of device drivers: low level, intermediate level, and high level drivers.
Low level drivers are the lowest level drivers in the tree. The main low level drivers are bus drivers, which allow the device manager to detect devices on the system. Bus drivers interface directly with the hardware, and provide an interface for other drivers to access the hardware. Examples of bus drivers are PCI, PCI Express, and USB drivers. One special type of low level driver is a motherboard driver. Motherboard drivers are used by the kernel to provide bus detection and power management. When booting the kernel, the bootloader loads a motherboard driver that matches the system configuration. Examples of motherboard drivers are ACPI drivers.
Intermediate level drivers sit above the low level drivers. There are two types of intermediate level drivers: function drivers and filter drivers. Function drivers control the devices found on a bus. Examples of function drivers are video card drivers, storage device drivers, network card drivers, and input drivers. Filter drivers modify the behavior of other drivers. They sit below or above other drivers in driver stacks.
High level drivers sit on top of intermediate level drivers. They control software protocols that exist on top of those drivers. Examples of high level drivers are filesystem drivers and network protocol drivers.
Device Detection
The main role of the device manager is detecting devices on the system. Devices are organized in a tree structure, with devices enumerating their children. Device detection begins with the motherboard driver. The motherboard driver sits at the root of the device tree. It detects the buses present on the system as well as devices directly connected to the motherboard. Each bus is then recursively enumerated, with its children continuing to enumerate their children until the bottom of the device tree is reached.
Each device that is detected contains a list of resources for the device to use. Examples of resources are I/O, memory, IRQs, DMA channels, and configuration space. Devices are assigned resources by their parent devices. Devices just use the resources they're given, which allows the same device driver to work on different machines where the resource assignments may differ but the programming interface is otherwise the same.
Drivers are loaded for each device that's found. When a device is detected, the device manager finds the device's driver in the module registry. If not loaded already, the device manager loads the driver. It then calls the driver's add-device routine with a pointer to the device object. The add-device routine starts the device and creates a thread for that device to handle requests.
I/O Request Packets
I/O Request Packets, or IRPs, are data structures used to perform I/O requests. They contain data about the request, such as the I/O function code, buffer pointer, device offset, and buffer length. IRPs can be created by the kernel or drivers, and passed to other drivers. Every device driver has dispatch functions used to handle each I/O function code. Most I/O function codes are driver specific, but some are generic and shared by all drivers.
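An IRP carrying the fields listed above might look like this; the layout and the function-code enum are assumptions:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical I/O function codes; most real codes are driver
 * specific, only a few are generic. */
typedef enum { IO_READ, IO_WRITE } io_func_t;

/* Hypothetical IRP layout matching the fields described above. */
typedef struct irp {
    io_func_t   function; /* I/O function code */
    void       *buffer;   /* source/destination buffer */
    uint64_t    offset;   /* offset on the device */
    size_t      length;   /* buffer length in bytes */
    struct irp *next;     /* link in the driver's queue */
} irp_t;
```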
Each driver has a queue of IRPs for it to handle. Whenever an IRP is sent to a driver, the device manager queues the request, and if the main driver thread is asleep, wakes it up. The main driver thread dequeues IRPs and handles them until the queue is empty. A driver will handle an IRP by either passing the IRP to a lower-level driver in the driver stack or performing the I/O request.
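The drain loop of the main driver thread can be sketched as below. The request is reduced to a bare queue node, "handling" is a flag, and a real thread would block on an event once the queue empties:

```c
#include <stddef.h>

/* Minimal stand-ins for the IRP queue described above. */
typedef struct irp {
    struct irp *next;
    int         handled;
} irp_t;

typedef struct driver {
    irp_t *queue_head;
} driver_t;

/* Body of the main driver thread: dequeue and handle IRPs until the
 * queue is empty; returns how many were processed. */
static int driver_drain_queue(driver_t *drv)
{
    int n = 0;
    while (drv->queue_head != NULL) {
        irp_t *irp = drv->queue_head;
        drv->queue_head = irp->next;  /* dequeue */
        irp->handled = 1;             /* perform or forward the request */
        n++;
    }
    return n;  /* a real thread would now sleep until woken */
}
```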
Asynchronous I/O
There are two main types of I/O: synchronous I/O and asynchronous I/O. Synchronous I/O sends an I/O request and then puts the current thread to sleep until the I/O completes. Asynchronous I/O just sends the I/O request and then returns. I/O completion is reported asynchronously using a callback. Asynchronous I/O improves the efficiency of the system by allowing program execution to continue while I/O is performed. It also allows for multiple I/O requests to be started and then handled in the order they complete, not the order they were issued. However, this comes at the cost of making programming more complex than using synchronous I/O.
Internally, my OS uses asynchronous I/O for all of its I/O requests. IRPs are sent to drivers, and then the function that sent them immediately returns. Eventually, the main driver thread will execute, handling the I/O request. Once the I/O request completes, it returns through the driver stack and finally calls the specified callback. It does this by queueing an event to the thread. Once the thread gets executed, the callback will execute.
Synchronous I/O is simply implemented as a special case of asynchronous I/O. Just like with asynchronous I/O, an IRP is sent to the driver, but instead of returning, the thread goes to sleep. Once the I/O completion event is queued, the thread will wake up and execute the callback before returning.
Power Management
The device manager also performs power management. Power management is a feature of hardware that allows for the power consumption of the system and devices to be controlled. Each device managed by the device manager provides functions to set its power state. Setting the power state of a device will also affect its child devices' power states. For example, if the PCI bus is put to sleep, so are all of the devices on it. For power management support, all systems require a power management driver that controls the system power. On x86, this is done through ACPI. Each device also needs to support power management.
The device manager responds to power management events. Power management events can come from two sources: the user or the system. User-generated power management events are created by user mode applications. They are system-wide events for shutting down, rebooting, hibernating, or putting the system to sleep. When the device manager receives a system-wide power management event, it sets the power state of every device on the system.
System-generated power management events are events that come from the system hardware. Examples of system-generated power management events are plugging/unplugging an AC adapter or closing/opening the lid of a laptop. The device manager takes the appropriate action in response to the event.
Userspace Exposure
Devices are exposed to userspace through the device tree on /dev. /dev is actually a link to the \Device directory in the object namespace. The \Device directory contains device objects that represent each device in the system. Devices can be accessed directly through one of two ways: through normal file syscalls or through the object interface that a device provides. Because both of these methods can be complex to program, several device APIs are implemented in userspace that provide an abstraction over them.
Kernel
The kernel is responsible for architecture-specific code. It sits underneath the executive and performs I/O, trap dispatching, low-level virtual memory management, and thread scheduling. The kernel also implements synchronization primitives for use by the executive, which are spinlocks, semaphores, mutexes, and readers/writer locks. These services are exposed to the executive. In this way, the kernel pretty much serves as a microkernel, providing basic functionality that allows the executive to implement its services.
Hardware Abstraction Layer
The Hardware Abstraction Layer (HAL) is the lowest-level component in the OS. It implements machine-specific code, which is code that differs between machines with the same processor architectures, like IRQ controllers, system timers, real time clocks, and multiprocessor information. By abstracting different hardware configurations, the HAL provides the kernel with a consistent platform to run on.