Zero-copy file-IO in userspace

rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

AndrewAPrice wrote:I've thought a bit about this, but put it on the back burner for now. I pass a shared memory buffer from the program through to the disk driver, and let the disk driver figure out what to do.

So in theory, if the physical address of the memory buffer lived in the first 32 bits, I could use IDE DMA to write directly into it instead of going through temporary memory in the disk driver and copying it across.
In my case, I don't let the program pass a buffer to the disc driver; instead, it must send a message to the VFS server (through a syscall) to request part of a file to be buffered. Once it is buffered and mapped into user space, the program can read from the buffer. However, this means that every cached sector for the IDE drive must use 32-bit memory allocated below 4G. This can become a problem if large files from the drive are accessed or are memory mapped.
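
Roughly, the flow from the application side looks something like this (a minimal sketch; vfs_request_buffer and vfs_map_buffer are just illustrative names, not my actual entry points):

Code: Select all

#include <stddef.h>
#include <stdint.h>

/* Hypothetical syscall wrappers -- names are illustrative only. */
extern int   vfs_request_buffer(int handle, uint64_t offset, size_t size);
extern void *vfs_map_buffer(int handle, uint64_t offset, size_t size, size_t *mapped);

struct file_window {
    void   *base;    /* buffer mapped into the application  */
    size_t  length;  /* number of bytes actually mapped     */
};

/* Ask the VFS server to buffer part of a file, then map it for reading. */
int read_through_vfs(int handle, uint64_t offset, size_t size, struct file_window *win)
{
    if (vfs_request_buffer(handle, offset, size) < 0)
        return -1;                    /* server could not buffer the range */

    win->base = vfs_map_buffer(handle, offset, size, &win->length);
    return win->base ? 0 : -1;
}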

So, I have an API for allocating memory below 4G, but this can be a limited resource, so I'm not keen on letting the IDE drive use such buffers. I'm more inclined to allocate a normal physical address and use PIO access to read/write the drive.

OTOH, PCs that have more than 4G of memory typically have a BIOS option to select between IDE and AHCI. For AHCI, the schedule is based on physical addresses and it should handle 64-bit addresses if the PC has more than 4G of RAM.
Octocontrabass
Member
Posts: 5494
Joined: Mon Mar 25, 2013 7:01 pm

Re: Zero-copy file-IO in userspace

Post by Octocontrabass »

rdos wrote:Typically, the full physical address range is not supported, which causes trouble.
DMA is so much faster than PIO that it might still be worth the extra effort to use bounce buffers for physical addresses above 4GB. And, if you happen to have an IOMMU, you can use it instead.
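
Roughly, the read path with a bounce buffer looks like this (a sketch only; the helper names are hypothetical, not from any real driver):

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers -- illustrative names only. */
extern void *dma_alloc_below_4g(uint32_t *phys_out, size_t size);
extern void  ide_dma_read(uint32_t phys, uint64_t lba, size_t size);

/* Read into a destination whose physical address may be above 4GB. */
void read_with_bounce(void *dest, uint64_t dest_phys, uint64_t lba, size_t size)
{
    if (dest_phys + size <= 0x100000000ULL) {
        /* Destination is reachable by the 32-bit DMA engine: go direct. */
        ide_dma_read((uint32_t)dest_phys, lba, size);
    } else {
        /* Bounce: DMA into a low buffer, then copy up. Still beats PIO. */
        uint32_t bounce_phys;
        void *bounce = dma_alloc_below_4g(&bounce_phys, size);
        ide_dma_read(bounce_phys, lba, size);
        memcpy(dest, bounce, size);
    }
}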
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

Octocontrabass wrote:
rdos wrote:Typically, the full physical address range is not supported, which causes trouble.
DMA is so much faster than PIO that it might still be worth the extra effort to use bounce buffers for physical addresses above 4GB. And, if you happen to have an IOMMU, you can use it instead.
IDE discs discovered through PCI might be candidates for DMA, but not those discovered at fixed IO ports. I also suspect that on really old hardware, DMA was even more limited with regard to the physical address range.
Octocontrabass
Member
Posts: 5494
Joined: Mon Mar 25, 2013 7:01 pm

Re: Zero-copy file-IO in userspace

Post by Octocontrabass »

rdos wrote:IDE discs discovered through PCI might be candidates for DMA, but not those discovered at fixed IO ports.
That's why the link in my earlier post is specifically for PCI. (There are actually two incompatible forms of DMA for PCI IDE, but the one I linked is much more common.)
rdos wrote:I also suspect that on really old hardware, DMA was even more limited with regard to the physical address range.
On really old hardware, DMA was usually impossible because the necessary pins on the IDE connector weren't wired to anything. In the rare cases where those pins were wired, they were usually wired to the ISA DMA controller, and ISA DMA is both more limited than PCI DMA and slower than PIO. (On slightly less-old hardware, you might have EISA DMA or MCA DMA instead, but good luck getting either of those to work...)
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

I think a smart design is to look for unused disc buffers while waiting for file data to be read & buffered. This is also a time when the application is less likely to use the memory structures. Since the buffers are in VFS server memory, I will need to free them on the server side. It doesn't feel like the normal blocking messaging method should be used for this; rather, the server could check for unused buffers regularly while it is idle.

The same goes for handling file writes. Performance will improve if writes are slightly delayed, particularly for "log type" writes. However, growing a file by adding clusters is better done the same way as requesting read buffers (server messaging). Small increments of file size are better delayed, though.
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

For sequential reads, it should be possible to queue disc requests in advance so that disc operations and processing in user space can occur in parallel. The file system server should be able to detect this pattern and queue future requests.

For sequential writes, it should be possible to pre-allocate clusters in the file server to achieve better parallelism.
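
Something like this is what I have in mind for detecting sequential reads on the server side (sketch only; the names are made up):

Code: Select all

#include <stdint.h>

/* Per-handle state the filesystem server could keep (illustrative). */
struct seq_state {
    uint64_t next_expected;   /* offset right after the last request    */
    unsigned streak;          /* how many requests arrived in sequence  */
};

extern void queue_disc_read(int handle, uint64_t offset, uint32_t size);

/* Called for every buffer request; queues a read-ahead once a
   sequential pattern has been seen a few times in a row. */
void on_buffer_request(struct seq_state *st, int handle, uint64_t offset, uint32_t size)
{
    if (offset == st->next_expected)
        st->streak++;
    else
        st->streak = 0;

    st->next_expected = offset + size;

    if (st->streak >= 2)
        queue_disc_read(handle, st->next_expected, size);   /* read ahead */
}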
AndrewAPrice
Member
Posts: 2297
Joined: Mon Jun 05, 2006 11:00 pm
Location: USA (and Australia)

Re: Zero-copy file-IO in userspace

Post by AndrewAPrice »

This is super specific to my OS. I noticed switching from PIO to DMA didn't noticeably improve performance. I've built a microkernel, so I added simple profiling at the syscall level. I discovered my bottleneck is mapping and unmapping shared memory into the driver.

I haven't added more granular profiling, but my hunch is that finding an address range to map the buffer into is slow. This needs improvement at some point, but my biggest takeaway is that mapping a 4MB buffer into memory just to write into a 4KB window of it is slow. :D So that's one thing I should address. Even with DMA there are times when I need to do this.

When I can use pure DMA (writing straight into the destination buffer without a temp buffer in between), I can skip mapping altogether. I still probably need to define an API to increment the reference count of the shared memory even when it's not mapped into the driver, in case an application frees the shared memory mid-read, it gets recycled, and another program crashes because the physical memory gets filled with junk.

Times when I discovered I have to use a temp buffer (a rough sketch of this check follows the list):
- When the destination buffer can't be accessed within 32-bit space.
- When we're not reading an entire sector-aligned chunk from the disk. (E.g. if I only want to read 1KB of a 2KB sector, I can't DMA into the user program's memory without overwriting 1KB of data elsewhere.)
- When the sector would be copied across page boundaries. (My kernel doesn't guarantee user memory is contiguous in physical memory.)
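
In rough C, the check boils down to something like this (all names are made up, and it assumes 32-bit IDE DMA and fragments that start on sector boundaries):

Code: Select all

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Hypothetical description of one piece of the destination buffer. */
struct dest_frag {
    uint64_t phys;   /* physical address of this piece */
    uint64_t len;    /* length in bytes                */
};

/* Can this fragment take a sector directly via 32-bit DMA,
   or do we have to go through a temp buffer and copy? */
bool can_dma_direct(const struct dest_frag *f, uint64_t sector_size)
{
    if (f->phys + f->len > 0x100000000ULL)
        return false;    /* not reachable within 32-bit space */
    if (f->len < sector_size || (f->len % sector_size) != 0)
        return false;    /* partial sector: DMA would overwrite data elsewhere */
    if ((f->phys / PAGE_SIZE) != ((f->phys + sector_size - 1) / PAGE_SIZE))
        return false;    /* sector would cross a page boundary */
    return true;
}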

Google Skia/fontconfig/libjpeg all use memory-mapped files (implementing this was a big performance improvement - read about my journey here), and thankfully, since page faults cause page-aligned file reads, these are good candidates for DMA without buffering.
My OS is Perception.
Octocontrabass
Member
Posts: 5494
Joined: Mon Mar 25, 2013 7:01 pm

Re: Zero-copy file-IO in userspace

Post by Octocontrabass »

AndrewAPrice wrote:- When we're not reading an entire sector-aligned chunk from the disk. (E.g. if I only want to read 1KB of a 2KB sector, I can't DMA into the user program's memory without overwriting 1KB of data elsewhere.)
- When the sector would be copied across page boundaries. (My kernel doesn't guarantee user memory is contiguous in physical memory.)
These situations can mostly be avoided by forcing applications to use a better API. It sounds like you're porting a lot of existing software, though, so changing the API may not be practical for you...
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

AndrewAPrice wrote:This is super specific to my OS. I noticed switching from PIO to DMA didn't noticeably improve performance. I've built a microkernel, so I added simple profiling at the syscall level. I discovered my bottleneck is mapping and unmapping shared memory into the driver.
My disc driver has the opposite problem. When I use PIO, I need to map the physical address in the cache into the linear address space of the disc driver, while DMA and bus-mastering don't need any mapping. Thus, the best performance will be achieved with modern devices that use bus-mastering through PCI.
AndrewAPrice wrote:When I can use pure DMA (writing straight into the destination buffer without a temp buffer in between), I can skip mapping altogether. I still probably need to define an API to increment the reference count of the shared memory even when it's not mapped into the driver, in case an application frees the shared memory mid-read, it gets recycled, and another program crashes because the physical memory gets filled with junk.
I keep reference counts along with the physical addresses in the disc cache.
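
Simplified, each cache entry looks roughly like this (illustrative, not the real layout):

Code: Select all

#include <stdint.h>

struct cache_entry {
    uint64_t lba;        /* first sector covered by this entry       */
    uint64_t phys;       /* physical address of the cached data      */
    uint32_t ref_count;  /* mappings/requests still using the entry  */
};

/* The data can only be recycled once nobody references it anymore. */
static int cache_entry_release(struct cache_entry *e)
{
    return --e->ref_count == 0;    /* caller recycles when this returns 1 */
}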
AndrewAPrice wrote:Times when I discovered I have to use a temp buffer:
- When the destination buffer can't be accessed within 32-bit space.
That might need copying. Another case is the USB disc driver: unless it has a physical interface to the USB schedule, it too will need buffering. Currently, my USB driver needs to copy the data. This is complicated by different USB hardware types having different support for 64-bit addresses.
AndrewAPrice wrote: - When we're not reading an entire sector-aligned chunk from the disk. (E.g. if I only want to read 1KB of a 2KB sector, I can't DMA into the user program's memory without overwriting 1KB of data elsewhere.)
- When the sector would be copied across page boundaries. (My kernel doesn't guarantee user memory is contiguous in physical memory.)
These are both non-issues in my design since the application cannot control where or how much data is read from disk.
AndrewAPrice wrote: Google Skia/fontconfig/libjpeg all use memory-mapped files (implementing this was a big performance improvement - read about my journey here), and thankfully, since page faults cause page-aligned file reads, these are good candidates for DMA without buffering.
I have memory-mapping of files & executables in my pipeline, although this will only work effectively if the file systems are 4k aligned.
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

Octocontrabass wrote:
AndrewAPrice wrote:- When we're not reading an entire sector-aligned chunk from the disk. (E.g. if I only want to read 1KB of a 2KB sector, I can't DMA into the user program's memory without overwriting 1KB of data elsewhere.)
- When the sector would be copied across page boundaries. (My kernel doesn't guarantee user memory is contiguous in physical memory.)
These situations can mostly be avoided by forcing applications to use a better API. It sounds like you're porting a lot of existing software, though, so changing the API may not be practical for you...
Or by implementing a cache, and drawing the data from the cache rather than directly from the disc.
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

I now have a good stress test program. It reads random-sized chunks from random positions in a 16M file with special formatting that allows checking whether the data is correct. It also does more reads near the beginning of the file to create overlaps in the cache. The disc cache size is set to 6M, which forces regular reads from the USB drive. The filesystem is not 4k aligned. The test is run on multicore CPUs (4 & 8 cores).

After many issues, it now completes 10M operations with all data being correct, no memory or resource leaks, and no faults.

When starting two copies of the stress test program (in different processes), it works until they request overlapping data and the filesystem server queues one request and then an overlapping one. The latter is never signalled as completed, since only one object can be signalled per 4k page. I either need to make sure the filesystem server requests whole 4k pages, add a timeout when waiting for completion, or both.

When the same app starts two threads that access the same file, there are synchronization issues. I think I need to integrate the futex into the mapping to achieve proper synchronization.
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

I redesigned the "micro-kernel" interface. Commands that need an answer are still sent the usual way (through a register block with possible array parameters), but most messages related to file data do not work well with "request-reply".

So, I added a command queue consisting of up to 256 commands that have a 64-bit field, a 32-bit field, a 16-bit file handle and a 16-bit command number. The user-mode driver (or partition driver) can quickly add commands to the queue and continue processing (if possible). The server-side filesystem can process the command queue in user mode until it's empty, at which point it must wait in the kernel for more data. The command queue is quite similar to modern PCI device schedules, except that it uses linear addresses and OS signalling rather than IRQs.

I have three file commands: "req buffer", "completed" and "mapped by usermode". Req buffer determines which range to buffer (if not already done), then queues a command to the cache and adds the request as pending. Completed removes the request from the pending list. Mapped by usermode can be used to improve performance on sequential reads by adding a new read command when the application maps the previous buffer. When dealing with non-aligned file systems, the pending queue must be used to delay overlapping requests.
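
The entry layout and the producer side look roughly like this (a simplified sketch; the real queue is shared with the server and the signalling is done in the kernel):

Code: Select all

#include <stdint.h>

#define CMD_QUEUE_SIZE 256

/* One queued file command: a 64-bit field, a 32-bit field,
   a 16-bit file handle and a 16-bit command number. */
struct file_cmd {
    uint64_t arg64;    /* e.g. file offset                          */
    uint32_t arg32;    /* e.g. size                                 */
    uint16_t handle;   /* file handle                               */
    uint16_t cmd;      /* req buffer, completed, mapped by usermode */
};

struct cmd_queue {
    struct file_cmd slots[CMD_QUEUE_SIZE];
    volatile uint32_t head;   /* producer adds here        */
    volatile uint32_t tail;   /* server consumes from here */
};

/* Producer side: add a command and continue processing. The server drains
   the queue in user mode and only enters the kernel to wait when empty. */
static int cmd_queue_push(struct cmd_queue *q, const struct file_cmd *c)
{
    uint32_t next = (q->head + 1) % CMD_QUEUE_SIZE;
    if (next == q->tail)
        return -1;             /* queue full */
    q->slots[q->head] = *c;
    q->head = next;
    return 0;
}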
bellezzasolo
Member
Posts: 110
Joined: Sun Feb 20, 2011 2:01 pm

Re: Zero-copy file-IO in userspace

Post by bellezzasolo »

The way I see it, the most complex request a user space program will issue is a scatter-gather request for numerous virtual addresses. The need for this is recognised in several operating systems - Windows has functions like ReadFileScatter, and there's the POSIX readv. The beauty of this is that a single buffer is just a special case.

The next step is physical address translation. It's not exactly complex to do: just call GetPhysicalAddress(), and where there's a discontinuity, add a new entry to the scatter-gather list.
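
A sketch of that translation step, assuming 4K pages and a GetPhysicalAddress()-style helper (the signature here is made up):

Code: Select all

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct sg_entry {
    uint64_t phys;   /* physical address of this run */
    uint64_t len;    /* length of the run in bytes   */
};

/* Hypothetical: returns the physical address backing a virtual address. */
extern uint64_t GetPhysicalAddress(const void *vaddr);

/* Walk the buffer page by page; start a new scatter-gather entry
   whenever the physical addresses stop being contiguous. */
size_t build_sg_list(const void *buf, size_t len,
                     struct sg_entry *sg, size_t max_entries)
{
    const uint8_t *p = buf;
    size_t count = 0;

    while (len > 0 && count < max_entries) {
        uint64_t phys  = GetPhysicalAddress(p);
        uint64_t chunk = PAGE_SIZE - (phys & (PAGE_SIZE - 1));
        if (chunk > len)
            chunk = len;

        if (count > 0 && sg[count - 1].phys + sg[count - 1].len == phys)
            sg[count - 1].len += chunk;            /* contiguous: extend */
        else {
            sg[count].phys = phys;
            sg[count].len  = chunk;
            count++;
        }
        p   += chunk;
        len -= chunk;
    }
    return count;   /* entries used (stops early if the list runs out) */
}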

Then the request reaches the hardware. A lot of modern hardware - NVMe, AHCI, xHCI, ... - supports scatter-gather. There are of course maximum lengths, but again you can just add extra entries.

If the driver has alignment requirements to deal with, and there's a misalignment, then that request needs buffering. Copying the buffer is then just an additional action to take on the completion of the I/O.

If you've got a file cache, then page alignment allows you to just do a copy-on-write.
Whoever said you can't do OS development on Windows?
https://github.com/ChaiSoft/ChaiOS
rdos
Member
Posts: 3266
Joined: Wed Oct 01, 2008 1:55 pm

Re: Zero-copy file-IO in userspace

Post by rdos »

I don't think userspace wants scatter-gather for file-IO. An app wants to read some parts of a file with the best possible performance. Some apps will do sequential reads using large buffers, while others might want to read a file a byte at a time. The former scenario typically works well with most OSes and filesystem implementations, while the latter doesn't. So, many apps that would prefer to read a byte at a time will try to create larger requests to achieve reasonable performance. With my new file class, this will not be necessary. The app can read the file using single-byte access, and the OS won't issue syscalls and traverse the full filesystem path; instead, most accesses will use contents cached in userspace.
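
The idea behind the file class, in rough terms (illustrative only; vfs_map_range stands in for the real mapping syscall):

Code: Select all

#include <stddef.h>
#include <stdint.h>

/* Hypothetical syscall: asks the VFS server to buffer and map part of
   the file, returning a pointer to the mapped window. */
extern void *vfs_map_range(int handle, uint64_t offset, size_t size);

#define WINDOW_SIZE (64 * 1024)

struct file_class {
    int       handle;
    uint64_t  win_start;   /* file offset of the mapped window    */
    uint8_t  *win;         /* window mapped into the application  */
};

/* Single-byte read: almost always served from the mapped window,
   so no syscall and no filesystem path traversal on the common path. */
int file_read_byte(struct file_class *f, uint64_t pos, uint8_t *out)
{
    if (!f->win || pos < f->win_start || pos >= f->win_start + WINDOW_SIZE) {
        f->win_start = pos & ~(uint64_t)(WINDOW_SIZE - 1);
        f->win = vfs_map_range(f->handle, f->win_start, WINDOW_SIZE);
        if (!f->win)
            return -1;
    }
    *out = f->win[pos - f->win_start];
    return 0;
}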

The next problem I want to solve is to use free physical memory for filesystem caches. In a 32-bit OS, it's impossible to cache several GB of filesystem data if you keep the caches in the virtual address space. Therefore, the disc driver will work with physical addresses, and the caches will be organized based on physical addresses and will not be mapped in the kernel address space. File data is actually only mapped in the application that has the file open. The filesystem sends the cache a number of sectors, and the cache then sends "back" the physical addresses of the sector data, which is then mapped in the application address space.

Metadata is mapped in the server and uses versioning. When an application wants metadata, the metadata is locked and the physical address from the server is mapped in the application. If the metadata changes while locked, a new version is created in the server. When the application unlocks the metadata, the physical address is freed if it's no longer used by any other application and it is not the current version.

The file system servers are a special form of application that runs in user space and has 2G of private memory per partition. The cache has 1G of memory per disc drive that is only accessible from kernel space.

This design also makes scatter-gather obsolete. There are no user-space (or kernel-space) virtual addresses sent to disc drivers that need translation to physical addresses, which might be non-contiguous and require scatter-gather.

Of note is that applications that want optimal performance will need to use the file-IO class provided, or use the file map API directly. If an application uses the read function, every read will issue a syscall, which gives poor performance for small reads.
Last edited by rdos on Mon Apr 10, 2023 2:46 pm, edited 1 time in total.
bellezzasolo
Member
Posts: 110
Joined: Sun Feb 20, 2011 2:01 pm

Re: Zero-copy file-IO in userspace

Post by bellezzasolo »

rdos wrote:I don't think userspace wants scatter-gather for file-IO. An app wants to read some parts of a file with the best possible performance. Some apps will do sequential reads using large buffers, while others might want to read a file a byte at a time. The former scenario typically works well with most OSes and filesystem implementations, while the latter doesn't. So, many apps that would prefer to read a byte at a time will try to create larger requests to achieve reasonable performance. With my new file class, this will not be necessary. The app can read the file using single-byte access, and the OS won't issue syscalls and traverse the full filesystem path; instead, most accesses will use contents cached in userspace.

The next problem I want to solve is to use free physical memory for filesystem caches. In a 32-bit OS, it's impossible to cache several GB of filesystem data if you keep the caches in the virtual address space. Therefore, the disc driver will work with physical addresses, and the caches will be organized based on physical addresses and will not be mapped in the kernel address space. File data is actually only mapped in the application that has the file open. The filesystem sends the cache a number of sectors, and the cache then sends "back" the physical addresses of the sector data, which is then mapped in the application address space. The file system servers are a special form of application that runs in user space and has 2G of private memory per partition. The cache has 1G of memory per disc drive that is only accessible from kernel space.

This design also makes scatter-gather obsolete. There are no user-space (or kernel-space) virtual addresses sent to disc drivers that need translation to physical addresses, which might be non-contiguous and require scatter-gather.
There's a reason that all the major OSes offer some form of scatter-gather API. MSDN offers the example of database applications - https://learn.microsoft.com/en-us/windo ... her-scheme

GNU libc is less specific - https://www.gnu.org/software/libc/manua ... ather.html

My OS is 64-bit, so virtual memory is no issue. Given that this is a capability offered by a plethora of modern hardware, I'd certainly consider it desirable to offer the API. The advantage is that the kernel's virtual-to-physical translation layer can be common and is pretty much free. Then your file cache is almost just a special device driver that works in system memory, moving pages around.

Applications doing silly things with small writes can be addressed with userspace buffering IMO; that's not hard to add to a libc.
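
Plain stdio already shows the idea: give the stream a big buffer and the tiny writes are coalesced in userspace, only hitting the kernel when the buffer fills or the stream is flushed.

Code: Select all

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("out.bin", "wb");
    if (!f)
        return 1;

    /* 64 KB userspace buffer; must be set before any I/O on the stream. */
    setvbuf(f, NULL, _IOFBF, 64 * 1024);

    for (int i = 0; i < 100000; i++)
        fputc(i & 0xFF, f);    /* one byte at a time, no syscall per byte */

    fclose(f);                 /* flushes the buffer */
    return 0;
}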
Whoever said you can't do OS development on Windows?
https://github.com/ChaiSoft/ChaiOS