Avoiding redundant copies with direct user IPC
Posted: Fri Jun 24, 2022 11:37 am
Hello, I'm currently trying to reduce excessive copying in my OS. While experimenting with various approaches I've come up with a model that should allow eliding most redundant copies relatively easily. I wonder what you think about it.
I'll briefly explain my current design and the problems I face with it, then I'll explain how my new model would (roughly) work.
Current design
There are two major aspects to my design: process isolation and asynchronous I/O.
The process isolation works by not explicitly exposing any global state. Instead, all operations are performed on an object. For example, where a UNIX system would use `int open(const char *path)`, my design essentially uses `int open(int object, const char *path)`. Not only does this make sandboxing trivial, it also makes it easy to add "proxy" objects. For example, a firewall can be implemented by imitating a real network interface and filtering requests as needed.
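To make the proxy idea concrete, here is a minimal sketch in Rust. All names here (`Object`, `Handle`, `is_blocked`, the `tcp/` policy) are hypothetical stand-ins, not my actual API; the point is just that a proxy implements the same interface as the real object.

Code:
// A hypothetical object interface: every operation takes an explicit
// object, so a process can only reach what it was handed.
struct Handle(u32);
enum Error { Denied }

trait Object {
    fn open(&self, path: &str) -> Result<Handle, Error>;
}

// Example policy, entirely made up for illustration.
fn is_blocked(path: &str) -> bool {
    path.starts_with("tcp/")
}

// A firewall wraps the real network interface and filters requests.
struct Firewall<T: Object> {
    inner: T, // the real network interface
}

impl<T: Object> Object for Firewall<T> {
    fn open(&self, path: &str) -> Result<Handle, Error> {
        if is_blocked(path) {
            return Err(Error::Denied); // filter requests as needed
        }
        self.inner.open(path) // forward allowed requests unchanged
    }
}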
Asynchronous I/O is implemented via ring buffers: a client puts requests in this buffer, notifies the kernel and the kernel then passes all these requests to the appropriate objects.
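As a rough sketch, such a ring could look something like the following. The layout and field names are illustrative only, not my actual ABI: the client bumps `head` after writing requests, and the kernel bumps `tail` after consuming them.

Code:
use core::sync::atomic::{AtomicU32, Ordering};

#[repr(C)]
struct Request {
    opcode: u8,  // e.g. open, read, write
    object: u32, // handle the operation targets
    buffer: u64, // pointer or offset to the request's data
    length: u64,
}

#[repr(C)]
struct RequestRing<const N: usize> {
    head: AtomicU32, // written by the client
    tail: AtomicU32, // written by the kernel
    entries: [Request; N],
}

impl<const N: usize> RequestRing<N> {
    /// Queue a request, or hand it back if the ring is full.
    fn push(&mut self, req: Request) -> Result<(), Request> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head.wrapping_sub(tail) as usize == N {
            return Err(req); // ring is full
        }
        self.entries[head as usize % N] = req;
        // Publish the entry before the kernel is notified.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Ok(())
    }
}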
User processes are able to receive these requests by exposing a "table". Currently the only available table type is a "StreamTable", which exposes all basic operations and has a simple API, but more specialized tables are intended to be added later (e.g. FileTable).
Problems with the current design
One obvious disadvantage of the current design is that data associated with a request must also go through the kernel. For small amounts of data this is not an issue, but for large amounts the impact of (double!) copying becomes noticeable.
There are ways to avoid these redundant copies, such as sharing mappings (and using COW where necessary), but these aren't entirely free either: you need to walk the page tables and invalidate TLB entries. These techniques also can't be used if the alignment of the source and destination doesn't match.
Potential solution: Sharing memory directly between processes
One way to avoid these redundant copies is allowing two processes to access a shared range of memory. One process can then write data to this range and another process can read data from this range, trivially avoiding redundant copies.
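A rough sketch of the flow, with hypothetical syscall names standing in for my actual API (the stubs are placeholders so the shape of the code is clear):

Code:
type Handle = u32;

// Stand-ins for kernel calls; names and signatures are made up.
fn create_memory_object(_size: usize) -> Handle { unimplemented!() }
fn share(_with: Handle, _obj: Handle) { unimplemented!() }
fn map_object(_obj: Handle) -> *mut u8 { unimplemented!() }

fn producer(consumer: Handle, data: &[u8]) {
    let mem = create_memory_object(data.len()); // anonymous shared memory object
    share(consumer, mem);                       // hand the object to the consumer
    let dst = map_object(mem);                  // map it into our address space
    unsafe {
        // Write directly into the shared pages; the consumer maps the
        // same physical memory, so the data never passes through the kernel.
        core::ptr::copy_nonoverlapping(data.as_ptr(), dst, data.len());
    }
}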
The approach can be taken further by using specialized I/O structures, much like how different hardware devices define their own structures, and by bypassing the kernel's asynchronous I/O queue entirely.
I am still unsure how exactly these memory objects should be shared though: while there is already a `share()` operation, it may make sense to allow sharing objects directly with e.g. `open()`, reducing the amount of I/O operations needed.
Notifying processes of changes to memory
Of course, we need a way to notify processes when an update has occurred to this shared memory. This can be done with semaphores.
A semaphore can hold a value (often an integer) and is atomically updated. When a thread wants to check if an update has occurred, it polls this semaphore. If no update has occurred, it can tell the kernel to wake it when it has.
The API for waiting on a single semaphore may look like `wait(semaphore, timeout)`, where the kernel will return to the thread when either the semaphore has been updated or the timeout has expired.
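Sketched in Rust, the poll-then-wait pattern would look roughly like this. `wait_syscall` is a hypothetical stand-in for the kernel call described above; the caller passes the value it last saw so a concurrent update can't be missed (futex-style).

Code:
use core::sync::atomic::{AtomicU32, Ordering};
use core::time::Duration;

// Stand-in: sleeps the thread until the semaphore's value differs
// from `seen` or the timeout expires.
fn wait_syscall(_sem: &AtomicU32, _seen: u32, _timeout: Duration) { unimplemented!() }

fn wait_for_update(sem: &AtomicU32, last_seen: &mut u32, timeout: Duration) {
    let cur = sem.load(Ordering::Acquire);
    if cur != *last_seen {
        *last_seen = cur; // an update already happened; no kernel round trip
        return;
    }
    // No update yet: ask the kernel to wake us when one occurs.
    wait_syscall(sem, cur, timeout);
    *last_seen = sem.load(Ordering::Acquire);
}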
It may also be useful to wait on a (small) set of semaphores. For this there can be a `wait_any` call which takes a list of semaphores.
For a large set of semaphores it makes more sense to let the kernel keep track of them. Such an API may look like this:
Code:
create_semaphore_set()              // returns a new, empty set
add_semaphores(set, semaphores)     // register semaphores with the set
remove_semaphores(set, semaphores)  // deregister semaphores from the set
wait_set_any(set, timeout)          // block until any semaphore in the set is updated
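A server with many clients might use it as follows. The Rust stubs mirror the pseudocode above and are placeholders, not a real implementation:

Code:
type Set = u32;
type Semaphore = u32;

// Stand-ins matching the pseudocode above.
fn create_semaphore_set() -> Set { unimplemented!() }
fn add_semaphores(_set: Set, _sems: &[Semaphore]) { unimplemented!() }
fn wait_set_any(_set: Set, _timeout_ms: u64) -> Option<Semaphore> { unimplemented!() }

fn handle_client(_sem: Semaphore) { /* process this client's shared buffer */ }

fn serve(clients: &[Semaphore]) {
    let set = create_semaphore_set();
    add_semaphores(set, clients);
    loop {
        // Like epoll/kqueue, the kernel reports which semaphore was
        // updated, so the server never has to scan all clients itself.
        if let Some(sem) = wait_set_any(set, 1000) {
            handle_client(sem);
        }
    }
}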
Potential problems with the solution
Two potential issues come to mind:
- Mapping and unmapping objects and allocating a semaphore just to transfer a small amount of data once may be far more expensive than simply copying the data into kernel space and back. It may make sense to keep the old API around for such operations.
- In many languages, unsynchronized access to shared memory is undefined behaviour. This can be avoided by using volatile accesses, but then the compiler is unable to coalesce or elide them. Luckily, LLVM (and Rust) exposes an intrinsic for copying data with volatile accesses: while these accesses cannot be elided, they can be coalesced. A better alternative would be a concept of "untrusted" memory in the language with corresponding operations, which I've also seen referred to as "freezing" operations. This thread goes into more detail. A rough sketch of the volatile approach is shown below the list.
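Here is a minimal sketch of reading from an untrusted shared buffer with stable Rust's `core::ptr::read_volatile`; `copy_from_shared` is a made-up helper, and a real implementation would likely want the volatile-memcpy intrinsic mentioned above so the accesses can be coalesced:

Code:
use core::ptr;

/// Copy `dst.len()` bytes out of an untrusted shared buffer.
/// `read_volatile` forces one real load per byte, so the compiler
/// can neither invent extra reads nor assume two reads of the same
/// location return the same value.
unsafe fn copy_from_shared(src: *const u8, dst: &mut [u8]) {
    for (i, byte) in dst.iter_mut().enumerate() {
        *byte = ptr::read_volatile(src.add(i));
    }
}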