Zero-copy file-IO in userspace

rdos · Post by **rdos** » Mon Apr 10, 2023 2:59 pm

bellezzasolo wrote:There's a reason that all the major OSes offer some form of Scatter-Gather API. MSDN offer the example of database applications - https://learn.microsoft.com/en-us/windo ... her-scheme

GNU libc is less specific- https://www.gnu.org/software/libc/manua ... ather.html

Of course there is. The reason is that they implement the legacy read/write API for IO. This means that applications pass buffers that might span pages, and when these are passed to devices that use physical addresses, there is a need for scatter-gather.
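To make the page-spanning point concrete, here is a minimal sketch of how a caller-supplied buffer turns into a scatter list. It is only an illustration: `PAGE_SIZE`, `scatter_list` and the `translate` callback (virtual page number to physical page number) are hypothetical names and not part of any OS discussed in this thread.

```python
PAGE_SIZE = 4096  # assumed page size


def scatter_list(virt_addr, length, translate):
    """Split [virt_addr, virt_addr + length) into (phys_addr, size) pieces.

    `translate` is a hypothetical callback that maps a virtual page number
    to a physical page number.
    """
    pieces = []
    addr = virt_addr
    remaining = length
    while remaining > 0:
        offset = addr % PAGE_SIZE
        chunk = min(PAGE_SIZE - offset, remaining)
        phys = translate(addr // PAGE_SIZE) * PAGE_SIZE + offset
        # Merge with the previous piece when the two are physically adjacent.
        if pieces and pieces[-1][0] + pieces[-1][1] == phys:
            pieces[-1] = (pieces[-1][0], pieces[-1][1] + chunk)
        else:
            pieces.append((phys, chunk))
        addr += chunk
        remaining -= chunk
    return pieces
```

A buffer that starts near the end of one page and runs into the next pages yields more than one piece unless the physical pages happen to be adjacent, which is exactly why a device that works on physical addresses needs a scatter-gather list.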

Besides, reading the description of the Windows API, this is pretty similar to how my API works, except that in my implementation, the application will not provide the buffer, and alignment is handled by the OS and not by the calling application. I wouldn't call this "scatter-gather", rather it's a way to speed up file-IO by putting a lot of demands on the caller. So, yes, I more or less provide this support through the file map syscall. However, the file class is smart enough to always use this API for all file IO, without burdening the application with a lot of strange constraints.

bellezzasolo wrote: My OS is 64 bit, so virtual memory is no issue. Given that this is a capability offered by a plethora of modern hardware, I'd certainly consider it desirable to offer the API. The advantage is that the kernel virtual-physical translation layer can be common and is pretty much free. Then your file cache is almost just a special device driver that works in system memory, moving around pages.

Applications doing silly things with small writes can be addressed with userspace buffering IMO, not hard to add to a libc.

I think you will discover that this will not provide you with decent filesystem performance, and writing a 64 bit OS won't change this.

My original filesystem implementation, which is stable and runs on a lot of systems, is based on passing buffers. However, it also has caches both for disc sectors and file content. That was necessary to provide performance comparable to that of other OSes, and I'm pretty sure that both Windows & Linux have these caches.

Regardless of how you implement an API by passing buffers, you will be unable to handle small requests without copying. Buffering in userspace means you need to guess at optimal read sizes, and copying is necessary for your buffers.

bellezzasolo · Post by **bellezzasolo** » Mon Apr 10, 2023 3:51 pm

rdos wrote:
bellezzasolo wrote:There's a reason that all the major OSes offer some form of Scatter-Gather API. MSDN offer the example of database applications - https://learn.microsoft.com/en-us/windo ... her-scheme

GNU libc is less specific- https://www.gnu.org/software/libc/manua ... ather.html
Of course there is. The reason is that they implement the legacy read/write API for IO. This means that applications pass buffers that might span pages, and when these are passed to devices that use physical addresses, there is a need for scatter-gather.

Besides, reading the description of the Windows API, this is pretty similar to how my API works, except that in my implementation, the application will not provide the buffer, and alignment is handled by the OS and not by the calling application. I wouldn't call this "scatter-gather", rather it's a way to speed up file-IO by putting a lot of demands on the caller. So, yes, I more or less provide this support through the file map syscall. However, the file class is smart enough to always use this API for all file IO, without burdening the application with a lot of strange constraints.

bellezzasolo wrote: My OS is 64 bit, so virtual memory is no issue. Given that this is a capability offered by a plethora of modern hardware, I'd certainly consider it desirable to offer the API. The advantage is that the kernel virtual-physical translation layer can be common and is pretty much free. Then your file cache is almost just a special device driver that works in system memory, moving around pages.

Applications doing silly things with small writes can be addressed with userspace buffering IMO, not hard to add to a libc.
I think you will discover that this will not provide you with decent filesystem performance, and writing a 64 bit OS won't change this.

My original filesystem implementation, which is stable and runs on a lot of systems, is based on passing buffers. However, it also has caches both for disc sectors and file content. That was necessary to provide performance comparable to that of other OSes, and I'm pretty sure that both Windows & Linux have these caches.

Regardless of how you implement an API by passing buffers, you will be unable to handle small requests without copying. Buffering in userspace means you need to guess at optimal read sizes, and copying is necessary for your buffers.

Surely the optimal read size issue could be addressed by passing the requisite information as part of the result from fopen()?

What the file cache does internally isn't really my area of expertise, as I haven't got round to writing one. It's just that I don't see the harm in exposing a scatter-gather interface to the application. It's not really zero copy if the database application has to copy around data...

rdos · Post by **rdos** » Tue Apr 11, 2023 2:20 am

bellezzasolo wrote:Surely the optimal read size issue could be addressed by passing the requisite information as part of the result from fopen()?

You probably could, but then your fopen would be incompatible with fopen in POSIX. Besides, I don't think the application developer should need to consider internals in your filesystem implementation. Which is exactly what the MS scatter-gather API requires.

I do this by using the current file position and the size passed to read. If a read is not mapped, the requested position & size are passed to the filesystem. The filesystem then adjusts the size and position according to the actual file system used and its alignment. For instance, it will always set the size to at least one 4k page. The only requirement on the filesystem is that data at the requested file position is read, but data before and after can also be read and mapped to user space.
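The adjustment step can be sketched as plain rounding arithmetic. This is a minimal illustration, assuming the filesystem rounds the start down and the end up to a single alignment unit; `align_request` and its `unit` parameter are hypothetical names, and the real server may pick sizes differently (cluster size, read-ahead and so on).

```python
PAGE_SIZE = 4096  # the 4k page mentioned above


def align_request(pos, size, unit=PAGE_SIZE):
    """Expand (pos, size) to a unit-aligned (start, length).

    The result always covers the requested position and is at least
    one unit long, even for a zero-sized request.
    """
    start = (pos // unit) * unit
    end = -(-(pos + max(size, 1)) // unit) * unit  # round up to a unit boundary
    return start, end - start
```

For example, a 100-byte read at position 5000 becomes a request for the page starting at 4096, and a 200-byte read at position 4000 crosses a page boundary and becomes a two-page request starting at 0.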

So, with my API, an application wanting more control of the mapping process can itself issue map requests with wanted start position and size. It then can use the mapped file info to figure out where different parts of the file are mapped. Using this method, it can achieve true zero-copy. An application wanting to do random accesses and not bother about mapping details, can use the file class and pass buffers and let the file class send appropriate mapping requests to the filesystem.
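For readers who want to try the "application issues its own map requests" idea on a hosted OS, here is a rough analogue using Python's standard `mmap` module. It is not the API described above, only a sketch of the same principle: the caller asks for a window of the file and then works on a view of the mapped pages instead of a private copy. `map_window` is a hypothetical helper name.

```python
import mmap


def map_window(path, start, size):
    """Map [start, start + size) of a file read-only.

    Returns (mapping, view). The view is a memoryview into the mapped
    pages, so slicing it does not copy file data.
    """
    gran = mmap.ALLOCATIONGRANULARITY      # mapping offsets must be a multiple of this
    base = (start // gran) * gran          # aligned start of the mapping
    delta = start - base                   # where the requested data sits in the mapping
    with open(path, "rb") as f:
        # mmap keeps its own duplicate of the descriptor, so the file can be closed.
        m = mmap.mmap(f.fileno(), delta + size,
                      access=mmap.ACCESS_READ, offset=base)
    view = memoryview(m)[delta:delta + size]
    return m, view
```

The caller has to release the view (`view.release()`) before closing the mapping, which is a small-scale version of the buffer lifetime issues discussed later in this thread.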

rdos · Post by **rdos** » Mon Apr 17, 2023 5:07 am

I think I will add a new server function that allows better status & debugging functionality. The client would send a command to the server, the server would parse it just like the command line tool, and then return an answer in clear text. This could be run as a special tool (like ordinary ftp or telnet functions), or as part of the command shell.
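A minimal sketch of such a text command interface, assuming a simple table of named handlers. The class name, the `cache` command and its reply are invented for illustration and do not describe the actual server.

```python
import shlex


class CommandServer:
    """Parses a text command line and returns a clear-text answer."""

    def __init__(self):
        self._commands = {}
        self.register("help", self._help, "help: list commands and their syntax")

    def register(self, name, handler, syntax):
        # handler takes the list of argument words and returns a string
        self._commands[name] = (handler, syntax)

    def _help(self, args):
        return "\n".join(self._commands[name][1] for name in sorted(self._commands))

    def handle(self, line):
        words = shlex.split(line)
        if not words:
            return "error: empty command"
        entry = self._commands.get(words[0])
        if entry is None:
            return f"error: unknown command '{words[0]}'"
        handler, _syntax = entry
        return handler(words[1:])
```

Adding a command is then one `register` call on the server side, with no new syscall and no change to the client tool.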

I now also have a partition server, and a client could send commands like remove partition, create partition, init disc for MBR/GPT. It could also mount & unmount filesystems or show details about current partitions. This way I don't need to implement this as userlevel classes that are linked into partitioning tools, and I can hide this in the MBR and GPT partition servers. Actually, I could disallow direct disc access to vital filesystem data.

For the filesystem servers, I could ask for open files, which blocks they have mapped, how much memory the cache uses and so on.

I then don't need most of this functionality as syscalls, and it's easy to add new commands to the servers. I can also provide help commands that show syntax.

Particularly for creating bootable discs and EFI system partitions with support files, I could provide these in resource files to the partition manager and create the correct structure on the partitions.

Unlike how Fuse is constructed, I would definitely want a format function in the filesystem server (this should be required), and optionally a "check disc" tool that can fix crosslinking issues and other filesystem problems. These should not be special application tools; rather, they should be integrated with the filesystem server. Both of these can be run using the command line interface.

rdos · Post by **rdos** » Wed Oct 25, 2023 12:57 pm

The work has progressed a bit. I now have a working file-write API too. The file class will send a grow request to the file IO server, and the server will add clusters (FAT case) to the file, and then send a request to the disc server for the added sectors. The application will wait for completion of the request. The application will normally just write the data in userspace, and then the kernel side will check for modified pages and send write requests for these to the file IO server.
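The scan for modified pages can be pictured as follows. This is only a sketch under my own assumptions: the dirty bits are given as a simple list with one entry per mapped page, and consecutive dirty pages are merged into a single write request. Whether the real kernel side coalesces requests like this is not stated above. `dirty_runs` is a hypothetical name.

```python
def dirty_runs(dirty_bits):
    """Return (first_page, page_count) for each run of consecutive dirty pages."""
    runs = []
    start = None
    for page, dirty in enumerate(dirty_bits):
        if dirty and start is None:
            start = page                        # a new run begins here
        elif not dirty and start is not None:
            runs.append((start, page - start))  # the run ended on the previous page
            start = None
    if start is not None:                       # run extends to the last page
        runs.append((start, len(dirty_bits) - start))
    return runs
```

Each returned pair would correspond to one write request sent to the file IO server.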

I also decided that unaligned file data could not be mapped for user space, and so requests without alignment will be mapped as kernel only (but still in user space). The file class will notice the request is not aligned, and use the kernel read/write handle syscalls instead of accessing data in user space. I've also implemented the kernel methods, more or less by duplicating the file class in kernel space. One difference is that it will directly notify areas written, and so these requests don't need to be scanned for modified pages.

For unaligned filesystems, file write is a bit of a challenge. Normally, all write requests to unaligned files would need to be preceded by reads, but some smart coding has eliminated much of this on files that are not fragmented. However, all these writes must be done in kernel space since the files are not mapped with 4k alignment.
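To show why such writes normally need a preceding read, here is a generic read-modify-write sketch. It is not the optimisation mentioned above (that is not described in enough detail to reproduce); it only models a store that can be updated in whole blocks, and counts how many blocks have to be read first. `BLOCK` and `write_unaligned` are hypothetical names.

```python
BLOCK = 4096  # assumed transfer unit of the cache


def write_unaligned(disk, pos, data):
    """Write `data` at byte `pos` into `disk` (a bytearray) in whole blocks.

    Returns the number of blocks that had to be read before writing.
    """
    if not data:
        return 0
    end = pos + len(data)
    first = pos // BLOCK
    last = (end - 1) // BLOCK
    buf = bytearray((last - first + 1) * BLOCK)
    reads = 0
    # The head block is only partly overwritten, so its old content is needed.
    if pos % BLOCK:
        buf[:BLOCK] = disk[first * BLOCK:(first + 1) * BLOCK]
        reads += 1
    # Same for the tail block, unless it is the head block we already read.
    if end % BLOCK and (last != first or reads == 0):
        buf[-BLOCK:] = disk[last * BLOCK:(last + 1) * BLOCK]
        reads += 1
    offset = pos - first * BLOCK
    buf[offset:offset + len(data)] = data
    disk[first * BLOCK:(last + 1) * BLOCK] = buf
    return reads
```

Blocks that are completely covered by the write never need to be read, and a write that starts and ends on block boundaries needs no reads at all.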

I've also linked the new filesystem API to the POSIX handle concept, which has some advantages.

AndrewAPrice · Post by **AndrewAPrice** » Tue Jan 09, 2024 8:28 pm

I did implement this.

Assuming all stars align:

The kernel can find you some physical memory in a low enough address space that DMA can write directly to it.
The address you want to read is sector aligned.
The length you want to read is a multiple of the sector size.
The physical area you want to write to is contiguous.

Typing this out, this sounds very rare, but is actually easy to accomplish with memory mapped IO.
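The four conditions above can be written down as a single check. This is a sketch with assumed parameters: `phys_pages` is a list of physical page frame numbers backing the destination, and `dma_limit` is the highest address (exclusive) the device can reach. The function name and defaults are hypothetical.

```python
def can_zero_copy(file_offset, length, phys_pages,
                  sector_size=512, dma_limit=1 << 32, page_size=4096):
    """Return True if a read can be DMA'd straight into the destination pages."""
    # Start address and length must both be sector aligned.
    aligned = file_offset % sector_size == 0 and length % sector_size == 0
    # Every destination page must end at or below the device's DMA limit.
    reachable = all((p + 1) * page_size <= dma_limit for p in phys_pages)
    # The destination must be physically contiguous.
    contiguous = all(b == a + 1 for a, b in zip(phys_pages, phys_pages[1:]))
    return length > 0 and aligned and reachable and contiguous
```

With memory mapped IO the kernel chooses the destination pages itself and requests are page sized, so all of these checks tend to pass.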

Here is my loader where I memory map an ELF binary and parse it without copying (except to copy the sections into the final binary memory):
https://github.com/AndrewAPrice/Percept ... _loader.cc

rdos · Post by **rdos** » Wed Jan 10, 2024 2:15 am

I think this function is mostly useful for high speed disc hardware, and those are typically built on top of PCI using bus-mastering techniques (AHCI, NVMe). The disc hardware will typically support the physical memory available in the machine, and so there is no need to support DMA with a restricted address space. If such hardware is found, it's probably old and with poor performance, and then the disc driver can either use PIO or copy request data to proper buffers. A special case is USB drives. It would be possible to implement an interface for zero-copy in the USB stack if the USB hardware supports all physical addresses. However, at this point I decided to copy data and so when running on USB hardware, I don't currently support zero-copy (but could do in the future). The only hardware I support zero-copy on right now is NVMe.

iProgramInCpp · Post by **iProgramInCpp** » Sun Jan 14, 2024 10:28 am

Octocontrabass wrote: Almost all modern disks use 4096 bytes per physical sector.

Source?

Octocontrabass · Post by **Octocontrabass** » Sun Jan 14, 2024 4:37 pm

iProgramInCpp wrote:Source?

Mostly datasheets, but also documents like this one.

rdos · Post by **rdos** » Fri Sep 27, 2024 2:17 pm

I have not had time for this for nearly a year, but now I'm onto it again. Reading & writing to files now work very well, even if some performance enhancements can be added later (pre-read and pre-extend file).

I've solved a very tricky issue this week. When the file size is set to a lower value, like when deleting it, there is a need to clear caches. This is pretty complicated when file buffers are mapped in user space. The solution is to set the correct file size in the file object, but then only clusters (FAT case) that are not mapped will be freed, but the directory entry is set to the correct size and saved to disc. As the file class in user space notices that it has buffers above the file size, it will call a "flush" method to clear them, and then signals the file system server to update file clusters. If the user process never clears the cached entries, the file will become truncated after a restart when the server checks the cluster chain against the file size.

I also more or less have a working file delete that can handle that the file is open in another process. If the file cannot be truncated to zero size, then it will not be deleted. That way no clusters will be lost in the filesystem.

rdos · Post by **rdos** » Sat Sep 28, 2024 6:37 am

On second thoughts, I think I need to be able to "detach" mapped buffers in userspace. This is an important task in handling removable media. The trick should be to clear the physical address in the disc cache, and then free the page when it is unmapped in userspace. The normal case is that pages are not freed when they are unmapped since they are owned by the disc cache. Another case is when files are not aligned, in which case the same page could be mapped in several different contexts. However, these pages are not mapped accessible to user space, rather must be handled in kernel. Since the kernel code should always check validity before accessing data, these buffers can just be left mapped even if the disc sectors are freed and potentially reused for other purposes.

rdos · Post by **rdos** » Wed Oct 16, 2024 3:06 am

I changed the design so truncation of files now will "detach" the physical address from the disc cache, and the file size will be truncated. Affected buffers will be marked as "deleted". When the buffers are freed by the application, it will free the physical address. Should work both for truncating files and deleting them. This can be used in other scenarios too. For instance, when mapping an application in user space, the physical address from the disc cache can be detached from the cache and mapped in the application as both read and read-write without the risk of the contents being written back to the disc. This will only work if files are 4k aligned, otherwise, they will still need to be copied to a new 4k page.

rdos · Post by **rdos** » Sat Oct 26, 2024 1:02 pm

The file handle code is pretty messy and in need of a more or less complete rewrite. This is tricky since it will likely break vital functions. I want a 'virtual base class' interface but in assembly. Then the creator of the file initializes the access functions like read, write, get/set size and position, and some others. The old legacy file interface needs the position in the handle data, while the new server fs has this in userspace. This can be handled by letting the legacy fs add the position last. The new fs also needs another handle in addition to the process file descriptor.
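A 'virtual base class' without language support is essentially a table of function pointers that the creator fills in. Here is a rough model of that idea, with the fs-specific fields kept after the common part. All names (`Handle`, `make_legacy_handle`, the operation names) are made up for illustration, and the real interface is in assembly and will look different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Handle:
    """Common handle header: access functions installed by the file's creator."""
    ops: Dict[str, Callable]
    data: dict = field(default_factory=dict)  # fs-specific fields added last

    def call(self, name, *args):
        # Dispatch through the table, like an indirect call in assembly.
        return self.ops[name](self, *args)


def make_legacy_handle(content):
    """Legacy-style handle: the file position lives in the handle data."""
    def read(h, size):
        pos = h.data["pos"]
        chunk = content[pos:pos + size]
        h.data["pos"] = pos + len(chunk)
        return chunk

    def get_pos(h):
        return h.data["pos"]

    def set_pos(h, pos):
        h.data["pos"] = pos

    def get_size(h):
        return len(content)

    ops = {"read": read, "get_pos": get_pos, "set_pos": set_pos, "get_size": get_size}
    return Handle(ops=ops, data={"pos": 0})
```

Common code only ever goes through `call`, so a server-based handle could install different functions and leave out the position field without the caller noticing.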

The per process handle table is also a mess, and implemented differently in legacy and the new fs. It should be implemented in the common code and not in the different fs drivers.

I think I need to add a new interface for test only and then switch to the real one once most of it is working properly.

The new server based fs also needs a supervisor thread per process that has open files. This is needed to scan for modified pages and also to invalidate unused buffers in low physical memory situations.

Another complexity is the kernel mode file functions. For the server version, they need to map file contents in kernel memory as they should be able to operate on files from many different processes. The PE executable handler uses this method.

rdos · Post by **rdos** » Mon Oct 28, 2024 3:13 pm

To save GDT descriptors, I'm redesigning how the various segments for the three-level handle structure are organised. The file selector must be in the GDT since it is shared between many processes. The process file handle table can be in the LDT. In fact, by using a fixed LDT entry (with different linear bases in different processes), I can quickly access the handle table. The segments that define which processes have a given file open can also be in the LDT. So can the kernel mapping selector, and the handle object itself. This works since the FS servers never need to access process data and can send messages to the per-process server thread to invalidate buffers.

The locking scheme is kind of complex too, but by using a bitmap of allocated entries, I can use lock-free allocation of entries at all three levels. This makes synchronization a lot easier. It's the same principle as with the bitmap based physical memory allocator. You first determine the first free bit in a 32-bit dword, and try to acquire it with lock bts. If CY is returned, the allocation failed and you try again until it succeeds.
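The allocation loop can be modelled like this. Python has no `lock bts`, so the sketch uses a small lock inside `bts` purely as a stand-in for the atomicity of that one instruction; everything outside it is unsynchronized, as in the scheme described above. `Bitmap32` and `allocate` are hypothetical names.

```python
import threading


class Bitmap32:
    """One 32-bit dword of allocation bits; a set bit means 'allocated'."""

    def __init__(self):
        self.value = 0
        self._atomic = threading.Lock()  # stands in for the LOCK prefix only

    def bts(self, bit):
        """Model of `lock bts`: set the bit and return its old value (CY)."""
        with self._atomic:
            old = (self.value >> bit) & 1
            self.value |= 1 << bit
            return old


def allocate(bitmap):
    """Return the index of a newly allocated entry, or None if all 32 are taken."""
    while True:
        free = ~bitmap.value & 0xFFFFFFFF       # unsynchronized snapshot
        if free == 0:
            return None
        bit = (free & -free).bit_length() - 1   # first free bit in the snapshot
        if bitmap.bts(bit) == 0:                # CY clear: the entry is ours
            return bit
        # CY set: somebody else took the bit in between, so scan again
```

The snapshot may be stale by the time `bts` runs, which is harmless: a lost race just shows up as CY set and the loop retries with a fresh snapshot.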

Fork is a bit complicated, but in the first stage in the parent process, inherited handles are allocated in the GDT, and then in the second stage in the new process, the selectors are reallocated in the new LDT and the GDT descriptors are freed.

OSDev.org
