Zero-copy file-IO in userspace
I'm working on the "microkernel" VFS again, and have come to one of the most complex issues: how to do file-IO mostly in user space.
The idea is that I will map parts of the file data in user space, along with a control block so the application can decide if the data is present or not, and if so, where to find it. Since I want zero-copy, I will map contents of the disc cache in user space. With modern disc APIs (AHCI, NVMe) the disc driver can add the physical address directly to the schedule, which means the OS is not reading anything, let alone mapping the buffers in linear address space.
The toughest issue with this approach is what to keep in user space, and what to keep in kernel space away from potential manipulation by user space. For instance, IO-wait lists must be in kernel space, and so must the physical address buffers. So, I need a user space structure that has file positions plus the base and size of mappings. These can be kept in user space without problems. A potential problem is that on a disc with 512 bytes per sector, file contents might not start at a page boundary, meaning the application could access things outside of the file. However, I've decided it would be too much of a problem, and would hurt speed, if these pages had to be mapped read-only. The application will have a per-sector bitmap of modified content so the kernel driver can write back to disc. I will not use page attributes as I feel these have too poor granularity, particularly when most discs use 512 bytes per sector.
Another issue is that multiple processes might have the same file open, and so each process that has the file open needs its own user space mapping. That means I need a central structure that keeps track of which processes have the file open. I think the cached file content should be here too, and so user space processes would first check the central structure before they issue read operations.
Locks present problems too. User-space sections cannot be accessed from kernel, and kernel mode sections cannot be accessed from user space. A potential solution might be to use some type of spinlock, but I think a better approach is to write lock-less code. The kernel must be able to steal buffers from applications in low memory situations, which can pose a problem.
A final issue is how to intercept normal file-IO operations to use memory mapped access instead. In the C library, I suppose I can simply rewrite the file-IO functions. The same applies to my file-IO class, which encapsulates files. However, the direct file-IO operations probably need to go to the kernel, and then the kernel will use the memory mapped interface. Not optimal since it requires syscalls, but should work.
Anyway, this would solve the problem of limitations of 32-bit protected mode that cannot have large caches mapped in kernel space. With this interface, all file data is mapped in user processes, and only small structures (physical addresses) need to be kept in linear kernel memory.
- Octocontrabass
Re: Zero-copy file-IO in userspace
rdos wrote: particularly when most discs use 512 bytes per sector.
Almost all modern disks use 4096 bytes per physical sector. They may offer a compatibility mode with 512-byte logical sectors, but you should only write whole 4096-byte physical sectors. This is especially important for SSDs, where write amplification caused by partial physical sector writes will shorten the drive's lifespan.
Re: Zero-copy file-IO in userspace
rdos wrote: particularly when most discs use 512 bytes per sector.
Octocontrabass wrote: Almost all modern disks use 4096 bytes per physical sector. They may offer a compatibility mode with 512-byte logical sectors, but you should only write whole 4096-byte physical sectors. This is especially important for SSDs, where write amplification caused by partial physical sector writes will shorten the drive's lifespan.
Right, and this would make the user space interface a lot simpler. Instead of needing to update bitmaps with written sectors, the page table dirty bit can be used for writing back whole 4096-byte areas.
The user space file part entry would then look like:
Code: Select all
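struct FilePartEntry
{
    int64 position;  /* file position where this part starts */
    int32 size;      /* size of the mapping */
    int32 base;      /* base of the mapping in user space */
};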
Using the dirty and accessed bits, a "second chance" algorithm can be developed for removing unused parts of the file. Every time a read has taken place the counter is increased, and a write might increase it by two (or more). When an untouched part is found, the counter is decreased, and if it is zero, the part is removed from the mapping. The replacement algorithm can be run on each part miss and regularly (like once a second). Write-back of file contents can be checked at the same time.
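A minimal sketch of that counter-based sweep, under stated assumptions: `PartState`, `sweep_part`, the weights, and the counter cap are hypothetical names and values, and the `accessed`/`dirty` flags stand in for the page-table bits the kernel would read and clear. It takes one reading of the rule above: an untouched part is decremented while its counter is non-zero, and removed once it is found untouched at zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-part bookkeeping, kept separate from FilePartEntry. */
struct PartState
{
    uint8_t counter;   /* usage score */
    bool    accessed;  /* stand-in for the page-table accessed bit */
    bool    dirty;     /* stand-in for the page-table dirty bit */
    bool    mapped;    /* false once the part has been removed */
};

/* Assumed weights: a write counts double, and the score is capped. */
enum { READ_WEIGHT = 1, WRITE_WEIGHT = 2, COUNTER_MAX = 8 };

/* One sweep step over a single part.
   Returns true if the part was dirty and needs write-back. */
static bool sweep_part(struct PartState *p)
{
    assert(p != NULL);
    bool writeback = p->dirty;

    if (p->dirty || p->accessed) {
        unsigned bump = p->dirty ? WRITE_WEIGHT : READ_WEIGHT;
        unsigned next = p->counter + bump;
        p->counter = (uint8_t)(next > COUNTER_MAX ? COUNTER_MAX : next);
    } else if (p->counter > 0) {
        p->counter--;          /* untouched: use up one chance */
    } else {
        p->mapped = false;     /* untouched with no chances left: remove */
    }

    /* Clear the bits so the next sweep only sees new activity. */
    p->accessed = false;
    p->dirty = false;
    return writeback;
}
```

Running `sweep_part` over every part on a miss and on a periodic timer gives both the eviction decision and the list of parts to write back in a single pass.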
Another mapping will map current state of the file, like current size, allocated size, attributes & name. This mapping will also be read only to user space.
I will need two new syscalls:
- Request to cache a new part of the file (position + desired size as parameters)
- Request to change the file size
Re: Zero-copy file-IO in userspace
The issue of kernel modifying the file parts using lockless code seems a bit complicated. There is also a speed issue of having to check 256 entries when searching for a specific part of a file. Ideally, a sorted array of entries or entry pointers should be used.
The 4k sector might be laid out like this:
Code: Select all
struct FileMap
{
    int16 SortedPtrArray[256];
    struct FilePartEntry PartArray[224];  /* 256 * 2 + 224 * 16 = 4096 bytes */
};
The last 32 entries in the sorted array would always be 0. Now a search can be completed in only 8 iterations.
I think that if the kernel inserts and removes entries in a smart way, lockless operation can be achieved. Inserts start from the top and create duplicate entries as they modify the sorted array. The part entries & memory mappings are created before linking them into the sorted array. Removes start from the bottom and overwrite the removed entry, again creating copies of higher entries until the last one is set to zero. When an entry is removed, the sorted array is modified first to make the part non-visible. There probably needs to be a timeout before the part entries are cleared, starting with setting base & size to zero. After another timeout, the file mappings can be removed. This implies that code wanting to read file contents must always start by finding the part entry, and then access the file data in a timely manner.
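A rough sketch of the lookup and of the duplicate-creating insert described above, not the actual implementation. Assumptions: the `int16`/`int32`/`int64` typedefs are replaced by `<stdint.h>` types, each `SortedPtrArray` slot holds a byte offset from the start of the page with 0 meaning "unused", and the helper names `entry_at`, `find_part`, and `insert_sorted` are made up for illustration. Memory ordering is ignored here; a real lockless version would need atomic stores and loads, plus the timeouts mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct FilePartEntry
{
    int64_t position;  /* file position where this part starts */
    int32_t size;      /* size of the mapping */
    int32_t base;      /* base of the mapping in user space */
};

struct FileMap
{
    uint16_t SortedPtrArray[256];         /* assumed: page offsets, 0 = unused */
    struct FilePartEntry PartArray[224];
};

_Static_assert(sizeof(struct FileMap) == 4096, "FileMap must fill one page");

static const struct FilePartEntry *entry_at(const struct FileMap *map, uint16_t off)
{
    return off ? (const struct FilePartEntry *)((const char *)map + off) : NULL;
}

/* Binary search for the part containing file position pos; NULL if not mapped. */
static const struct FilePartEntry *find_part(const struct FileMap *map, int64_t pos)
{
    assert(map != NULL);
    size_t lo = 0, hi = 256;
    const struct FilePartEntry *best = NULL;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct FilePartEntry *e = entry_at(map, map->SortedPtrArray[mid]);

        if (e == NULL || e->position > pos) {
            hi = mid;          /* unused slots sort after every used slot */
        } else {
            best = e;          /* last part starting at or before pos so far */
            lo = mid + 1;
        }
    }
    if (best != NULL && pos - best->position < best->size)
        return best;
    return NULL;
}

/* Link an already filled-in part entry into the sorted array.
   used is the number of occupied slots. Shifting starts at the top, so a
   concurrent reader sees a momentary duplicate but never a missing entry. */
static void insert_sorted(struct FileMap *map, uint16_t new_off, size_t used)
{
    assert(map != NULL && new_off != 0 && used < 256);
    const struct FilePartEntry *n = entry_at(map, new_off);
    size_t i = used;

    while (i > 0) {
        const struct FilePartEntry *prev = entry_at(map, map->SortedPtrArray[i - 1]);
        if (prev->position <= n->position)
            break;
        map->SortedPtrArray[i] = map->SortedPtrArray[i - 1];  /* duplicate */
        i--;
    }
    map->SortedPtrArray[i] = new_off;
}
```

Because the array stays sorted (with at most one duplicated slot) at every intermediate step of the insert, a binary search running concurrently still lands on a valid entry.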
Re: Zero-copy file-IO in userspace
There is a problem with locking the structure in user space, and also a problem with keeping the current file position. By adding another 4k page before the mapping page, and putting a spinlock and an array of current file positions there, I think I have a solution that should work. There is a need to keep the mappings per process, and so the kernel file structure should have a list of processes that have the file open. In the per-process structure, there would be a list of threads waiting for read completion of a particular requested mapping.
Re: Zero-copy file-IO in userspace
Now I can type a file from the new filesystem with my user space file class. The positions & physical addresses are first cached in the file descriptor, and then mapped to user space on request.
The filesystem I'm using is not 4k aligned, and so the whole file cannot be memory-mapped if it is fragmented. I still need to merge parts in a smarter way, and I also need to keep track of references and unmap when user space no longer uses the file or when there is no activity.
- Octocontrabass
Re: Zero-copy file-IO in userspace
rdos wrote: The filesystem I'm using is not 4k aligned, and so the whole file cannot be memory-mapped if it is fragmented.
Why does incorrect alignment prevent you from memory-mapping a fragmented file?
Re: Zero-copy file-IO in userspace
rdos wrote: The filesystem I'm using is not 4k aligned, and so the whole file cannot be memory-mapped if it is fragmented.
Octocontrabass wrote: Why does incorrect alignment prevent you from memory-mapping a fragmented file?
Because the alignment on disc determines which offset in a page a sector will be at. Since FAT32 uses clusters, and cluster size typically is larger than 8 sectors, if the cluster data on disc is aligned, then the end of one cluster and the start of the next will fall on a page boundary. This makes it possible to map the whole file even if clusters are not ordered. However, if one cluster ends in the middle of a page, then the rest of the page will be the contents of the next cluster on disc. If there is fragmentation, the file's next cluster is somewhere else, so the mapping will need to use another page, and the file cannot be mapped continuously in linear address space.
Previously, I implemented memory mapping by allocating new physical pages and copying the contents, but since I want a zero-copy solution with a general caching function, file data must be 4k aligned on disc. That is not a problem for new installations, since it is possible to partition discs so data is page aligned. However, things must still work on unaligned filesystems, although less efficiently. Writing 4k sectors to disc relates to this too, since if file data is 4k aligned, then the writes will be 4k aligned too.
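The alignment condition can be written down as a small check. This is only an illustration, assuming 512-byte logical sectors, 4096-byte pages, and a disc cache in which absolute sector 0 starts on a page boundary; `sector_page_offset` and `clusters_page_aligned` are hypothetical helpers, and `first_data_sector` is the absolute sector number (partition start included) of the first cluster's data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    SECTOR_BYTES     = 512,
    PAGE_BYTES       = 4096,
    SECTORS_PER_PAGE = PAGE_BYTES / SECTOR_BYTES   /* 8 */
};

/* Byte offset inside its 4k page at which an absolute sector begins. */
static uint32_t sector_page_offset(uint64_t sector)
{
    return (uint32_t)(sector % SECTORS_PER_PAGE) * SECTOR_BYTES;
}

/* True if every cluster starts and ends on a page boundary, so that
   clusters scattered over the disc can still be mapped back to back. */
static bool clusters_page_aligned(uint64_t first_data_sector,
                                  uint32_t sectors_per_cluster)
{
    assert(sectors_per_cluster > 0);
    return first_data_sector % SECTORS_PER_PAGE == 0 &&
           sectors_per_cluster % SECTORS_PER_PAGE == 0;
}
```

For example, a data area starting at absolute sector 8255 puts the first cluster 3584 bytes into a page, so every cluster boundary ends up in the middle of a page and a fragmented file cannot be mapped contiguously without copying.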
- Demindiro
Re: Zero-copy file-IO in userspace
rdos wrote: Because the alignment on disc determines which offset in a page a sector will be at.
Not necessarily. You can simply memcpy the data to the start of a page.
All filesystems that support some form of inline data/block suballocation/... need to do this anyways.
Re: Zero-copy file-IO in userspace
rdos wrote: Because the alignment on disc determines which offset in a page a sector will be at.
Demindiro wrote: Not necessarily. You can simply memcpy the data to the start of a page.
Then it is no longer zero-copy. The idea is that the disc cache physical address can directly be mapped in user space, and that user space can read data directly from the file. If the file system is properly aligned, you can also memory map the file so user space can handle the file as a continuous object directly using the disc cache. A special case is to memory map an executable file as copy-on-write.
Although, I might also support the use of memcpy and new physical pages, but that is legacy and not optimal.
Demindiro wrote: All filesystems that support some form of inline data/block suballocation/... need to do this anyways.
At least FAT with large enough cluster size (8 or more) supports it. I don't know about ext, but I suspect it should be possible there too.
- Octocontrabass
Re: Zero-copy file-IO in userspace
rdos wrote: Because the alignment on disc determines which offset in a page a sector will be at.
You can add an offset to realign the clusters to pages. Or, if you want to avoid hurting disk performance, you could use scatter/gather DMA to recombine fragmented data into contiguous pages. The latter option would require your cache to be a lot more complicated if you want to avoid partial writes, though.
rdos wrote: Writing 4k sectors to disc relates to this too, since if file data is 4k aligned, then the writes will be 4k aligned too.
On some disks, there is an offset between logical 512-byte sectors and physical 4k sectors.
Re: Zero-copy file-IO in userspace
rdos wrote: Because the alignment on disc determines which offset in a page a sector will be at.
Octocontrabass wrote: You can add an offset to realign the clusters to pages. Or, if you want to avoid hurting disk performance, you could use scatter/gather DMA to recombine fragmented data into contiguous pages. The latter option would require your cache to be a lot more complicated if you want to avoid partial writes, though.
The disc cache interface is complicated enough, so I'd rather not do that. The interface is also the same for all disc types, and consists of a list of physical addresses and sector numbers. That favors modern hardware, which typically has schedules where physical addresses can be inserted directly. For IDE though, it will be less effective since the physical address must be mapped in linear address space, but that is not a big problem since IDE is slow anyway.
The file data request flow is a bit special. The kernel file driver will queue a request to the file system (micro kernel approach). The filesystem will determine which sectors should be read, and then queue those to the disc cache manager running in kernel. When the device has finished reading all sectors, it will notify the kernel file driver directly with a list of sectors & physical addresses.
rdos wrote: Writing 4k sectors to disc relates to this too, since if file data is 4k aligned, then the writes will be 4k aligned too.
Octocontrabass wrote: On some disks, there is an offset between logical 512-byte sectors and physical 4k sectors.
Not good.
- Octocontrabass
Re: Zero-copy file-IO in userspace
rdos wrote: For IDE though, it will be less effective since the physical address must be mapped in linear address space, but that is not a big problem since IDE is slow anyway.
IDE supports DMA too.
Re: Zero-copy file-IO in userspace
rdos wrote: For IDE though, it will be less effective since the physical address must be mapped in linear address space, but that is not a big problem since IDE is slow anyway.
Octocontrabass wrote: IDE supports DMA too.
Using DMA for IDE is complicated, just as for EHCI. Typically, the full physical address range is not supported, which causes trouble. I doubt I will do a new IDE driver for the new file system that uses DMA. Too much old hardware uses IDE and doesn't have proper DMA support.
- AndrewAPrice
Re: Zero-copy file-IO in userspace
I've thought a bit about this, but put it on the back burner for now. I pass a shared memory buffer from the program through to the disk driver, and let the disk driver figure out what to do.
So in theory, if the physical address of the memory buffer lived in the first 4 GB (a 32-bit physical address), I could use IDE DMA to write directly into it, instead of reading into temporary memory in the disk driver and copying it across.
The problem is: I can't guarantee that buffer lives in 32-bit memory (although I only run qemu with 128 MB of RAM so it does). I've thought about making a system call along the lines of "allocate a page below this physical boundary" but I only want to expose this to drivers. There are other times when DMA will be useful (audio drivers) and I don't want the possibility of a malicious program crowding out valuable lower memory.
Perhaps we could have a "high performance mode" where we can ask the driver to share DMA memory with a program? We'd need kernel support so the driver can grab the memory back if the program refuses to release it. There also isn't a guarantee that a program can use "high performance mode" (e.g. the first 4 GB of memory is already claimed), in which case we'd have to fall back to doing some copying.
Another source of copying is the libc/application code. In my case, musl implements its own buffer, and then copies from this buffer into the one passed to 'fread'. Based on my experience porting font libraries, programs love to do many single/double digit length reads, so the overhead of copying out of a buffer is probably less than if we did IO for each small read. So, if you did have a "high performance mode" read/write, you'd probably want to use a different API for these cases. (I can imagine legitimate reasons for "high performance mode" other than copying a file. Perhaps you want to open a 5 GB Photoshop file which requires working with a large amount of data? Perhaps you want to decode an 8K video stream?)
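The fallback decision for the 32-bit DMA case mentioned above comes down to a range check on the buffer's physical address. A minimal sketch, where `dma32_reachable` is a hypothetical helper and the caller is assumed to already know the physical address and length of a physically contiguous buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the physical range [phys, phys + len) lies entirely below 4 GiB,
   i.e. a controller limited to 32-bit addresses can DMA straight into it.
   Otherwise the driver has to read into its own low buffer and copy. */
static bool dma32_reachable(uint64_t phys, uint64_t len)
{
    assert(len > 0);
    const uint64_t limit = UINT64_C(1) << 32;
    return phys < limit && len <= limit - phys;
}
```

The comparison is written as `len <= limit - phys` rather than `phys + len <= limit` so that it cannot overflow for very large inputs.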
My OS is Perception.