
Implementing all block devices using mmap?

Posted: Thu May 05, 2016 1:28 pm
by Hellbender
As I'm rewriting & redesigning my VFS, I started thinking about combining file mmapping and reading/writing block devices.
More specifically, I thought about limiting the block device interface to just mapping disk sectors to memory.

A file system could keep all inode blocks, bitmaps, etc. mapped to memory, so they could be easily accessed by linear address.
When a file is opened, the file system would ask the underlying block device to map all of the file's blocks into a contiguous memory area.
Reading would just copy data from that mmapped memory into the user buffer; writing would copy data the other way.

The device driver would load data into memory on page faults, or prefetch sectors that are likely to be used.
Page allocator would trigger write-backs and invalidate clean pages that have not been used for some time, just as it would for 'normal' mmapped files.
Thus, unused physical memory would automagically be used to cache the data on disk.
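
To make the idea concrete, the block device interface could shrink to something like the sketch below. This is just an illustration; bdev_map, bdev_unmap and bdev_prefetch are made-up names, not an existing API.

Code:
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface sketch: the block device only knows how to map
 * sectors into virtual memory, and everything else follows from paging. */
typedef struct block_dev block_dev_t;

/* Map 'count' sectors starting at 'first_sector' to the virtual address
 * 'vaddr'; pages are populated lazily by the driver on page faults. */
int  bdev_map(block_dev_t *dev, uint64_t first_sector, size_t count, void *vaddr);

/* Drop a mapping; dirty pages are written back before the call returns. */
int  bdev_unmap(block_dev_t *dev, void *vaddr, size_t count);

/* Optional hint: the driver may fetch these sectors before they fault. */
void bdev_prefetch(block_dev_t *dev, uint64_t first_sector, size_t count);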

Pros?
  • The driver would know about all currently open files, and could prefetch co-located sectors of different directories/files using a single transfer.
  • I wouldn't need to worry about blocking IO, as I'd only have "wait while a page is being loaded" code, which is needed anyway to support mmapped files.
Cons?
  • Async IO would need e.g. a separate thread to copy data to/from the user buffer, sending a 'ready' signal once the copy operation is complete.
  • Char devices would be accessed in a completely different manner.
Any thoughts why this might be bad/good idea?

Re: Implementing all block devices using mmap?

Posted: Thu May 05, 2016 1:50 pm
by onlyonemac
I think it might be difficult to implement the repeated reordering of the blocks. How will you keep the block device mapped to a single block of memory while also reordering the blocks to make a particular file contiguous?

Also, are you mapping the entire block device to memory, or just the inode table, bitmaps, and any open files? If the latter, how is this different (in practice) from caching at the filesystem level, and how is it more efficient (in implementation) than caching at the filesystem level? (Caching at the filesystem level would offer additional advantages, because for example the filesystem driver can work with the inode table in a more efficient internal representation rather than working with it as it is stored on disk.) And would the determination of what blocks to map happen at the block device layer itself or be passed down from the filesystem layer?

Re: Implementing all block devices using mmap?

Posted: Thu May 05, 2016 2:37 pm
by Hellbender
I guess the bottom line is that, because the kernel needs to support mmapped files anyway, I'd like to use that mechanism as much as possible to reduce the amount of code in the block device and filesystem drivers.

Anyway, I would not map the whole block device as such. Instead, the filesystem would map whatever blocks it is interested in to suitable virtual addresses.
The filesystem would not need to keep a single file fully contiguous, or even fully mapped, if the virtual address space becomes fragmented.
onlyonemac wrote:Caching at the filesystem level would offer additional advantages, because for example the filesystem driver can work with the inode table in a more efficient internal representation rather than working with it as it is stored on disk.
Nothing prevents the filesystem from caching a different inode representation. To populate the inode cache, the filesystem would access the raw inode data that has been mmapped to memory. Mmapped devices would just replace all block device read/write calls with direct memory references. If the filesystem wants to access a certain raw inode, it would just form a pointer to the raw inode struct as "inode map base + sizeof(raw_inode_t) * ino". The kernel and device driver would make a best effort to keep the required data in physical memory.
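
As a rough sketch of that pointer arithmetic (the raw_inode_t layout and inode_map_base below are assumptions for illustration, not actual code):

Code:
#include <stdint.h>

/* Illustrative on-disk inode layout; the real fields depend on the filesystem. */
typedef struct {
    uint16_t mode;
    uint16_t links;
    uint32_t size;
    uint32_t block[12];
} raw_inode_t;

static raw_inode_t *inode_map_base;   /* set up when the inode table is mapped */

static raw_inode_t *raw_inode(uint32_t ino)
{
    /* Dereferencing the result may page-fault; the block driver then reads
     * the backing sectors in, exactly like any other mmapped page. */
    return inode_map_base + ino;      /* i.e. base + sizeof(raw_inode_t) * ino */
}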

Re: Implementing all block devices using mmap?

Posted: Fri May 06, 2016 12:18 am
by Brendan
Hi,
Hellbender wrote:Any thoughts why this might be bad/good idea?
The main problem with memory mapped files is that there's no sane way to handle any errors. If there's a read error, or you run out of "RAM + swap", or the user unplugs a USB flash device, or the file system or device driver crashes, or...

For all cases where an error occurs in a "memory mapped file" area, you only really have two choices - you can terminate the process, or use something hideously problematic like "signals" to deliver the error to the process.

For memory mapping all block devices you'd have the same problem.

Note: For files on "unreliable devices" (anything removable, anything with "higher than normal media failure rates", etc) I'd make sure the data is copied to something more reliable (RAM, swap) before returning from the function that creates a memory mapping (sacrifice performance for reliability), or I'd refuse to allow the file to be memory mapped. If you memory map all block devices you can't do the "copy to something more reliable first" thing and can't refuse.

The next problem you'll have for memory mapped block devices is that they're mostly used by file system code; and (properly designed) file system code typically needs/uses synchronisation guarantees to ensure writes occur in a well defined order (and ensure that a power failure or other problem at the wrong time doesn't leave you with a corrupted file system).
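
For example, a journalling filesystem typically needs an ordering like the sketch below, where the barrier between the two writes is something the interface has to be able to express (the bdev_* calls are hypothetical, shown only to make the requirement concrete):

Code:
#include <stddef.h>
#include <stdint.h>

typedef struct block_dev block_dev_t;

int bdev_write(block_dev_t *dev, uint64_t sector, const void *buf, size_t len);
int bdev_flush(block_dev_t *dev);   /* barrier: everything queued so far hits the media */

static int commit_transaction(block_dev_t *dev,
                              uint64_t journal_sector, const void *rec,  size_t rec_len,
                              uint64_t data_sector,    const void *data, size_t data_len)
{
    if (bdev_write(dev, journal_sector, rec, rec_len) < 0) return -1;
    if (bdev_flush(dev) < 0) return -1;      /* journal record must be durable first */
    if (bdev_write(dev, data_sector, data, data_len) < 0) return -1;
    return bdev_flush(dev);                  /* only now is the transaction durable  */
}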

The next problem you'll have is that the interfaces for file systems (and everything below that - storage device drivers, "software RAID" layers, whatever) need to be asynchronous for performance reasons. Mostly, you want an "as large as possible" list of pending operations so that you can optimise the order that operations are performed (e.g. taking into account things like head/seek travel time, IO priorities, etc). Worse, you want to be able to adjust the IO priority of pending operations in various cases (and also be able to cancel them). For example; if there's a low priority "prefetch block #1234" that was requested earlier (but postponed because higher priority operations were being done instead) but then you find out that the process needs this block of data ASAP (it goes from "prefetch for process" to "needed for process to continue"), then you want to change the IO priority of that request to a higher priority after the request was made; regardless of whether the request is still in the file system's list of pending operations, or if the request has made its way through the file system code and other layers and is in the storage device driver's list of pending operations.

With a "memory mapped device" interface, the file system can't communicate information like IO priorities to the lower layers (e.g. storage device driver), can't cancel pending operations, etc. Without this information the lower layers can't optimise the order that operations are performed (and can't avoid doing work that was cancelled), and performance will suffer (especially under load where performance matters most).


Cheers,

Brendan

Re: Implementing all block devices using mmap?

Posted: Fri May 06, 2016 1:33 am
by Hellbender
Brendan wrote:The main problem with memory mapped files is that there's no sane way to handle any errors. If there's a read error, or you run out of "RAM + swap", or the user unplugs a USB flash device, or the file system or device driver crashes, or...

(properly designed) file system code typically needs/uses synchronisation guarantees to ensure writes occur in a well defined order.

Mostly, you want an "as large as possible" list of pending operations so that you can optimise the order that operations are performed (e.g. taking into account things like head/seek travel time, IO priorities, etc).

Worse, you want to be able to adjust the IO priority of pending operations in various cases (and also be able to cancel them).
Excellent points as always, thanks. It kinda feels like I'd need some 'ensure that these mmapped regions are read/written' feature, which would mean I had just built an overly complicated async-io interface. So I might be better off providing an actual async-io interface, with priorities, guarantees and stuff.

This would actually be very good for the pager as well. If the async completion event is 'send signal to thread', and I use a special 'SIG_THREAD_CONT' signal, the page fault handler can use those async-reads to handle mmapped files and swapped memory. It just needs to temporarily set the thread's signal mask, make the thread wait for signals, and launch an 'ASAP async-read' with the priority of the process in question (because my threads can only block to wait for signals). The pager would use SIG_BUS, or maybe even SIG_KILL, as the signal to send on a read error.
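
Roughly, the fault path would look something like the sketch below. This is only a sketch of the flow described above; async_read_t and the helper functions are made-up names, and the signal numbers are placeholders.

Code:
#include <stdint.h>

typedef struct thread    thread_t;
typedef struct vm_area   vm_area_t;
typedef struct block_dev block_dev_t;

enum { SIG_THREAD_CONT = 32, SIG_BUS = 7 };        /* placeholder signal numbers */

typedef struct {
    block_dev_t *device;
    uint64_t     sector;
    void        *dest;          /* physical frame to read into            */
    int          priority;      /* inherited from the faulting process    */
    thread_t    *target;        /* thread to signal on completion/error   */
    int          signal_ok;     /* e.g. SIG_THREAD_CONT                   */
    int          signal_err;    /* e.g. SIG_BUS (or SIG_KILL)             */
} async_read_t;

/* Hypothetical kernel helpers, declared so the sketch is self-contained. */
void        *pmm_alloc_frame(void);
block_dev_t *vma_device(vm_area_t *vma);
uint64_t     vma_sector(vm_area_t *vma, uintptr_t addr);
int          process_priority(thread_t *t);
void         thread_wait_for_signal(thread_t *t, int sig);
int          async_read_submit(const async_read_t *rd);   /* copies the descriptor */

void pager_handle_fault(thread_t *t, vm_area_t *vma, uintptr_t fault_addr)
{
    async_read_t rd = {
        .device     = vma_device(vma),
        .sector     = vma_sector(vma, fault_addr),
        .dest       = pmm_alloc_frame(),
        .priority   = process_priority(t),
        .target     = t,
        .signal_ok  = SIG_THREAD_CONT,
        .signal_err = SIG_BUS,
    };

    thread_wait_for_signal(t, SIG_THREAD_CONT);    /* mark the thread as blocked */
    async_read_submit(&rd);                        /* queue the ASAP async-read  */
}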

That feels good. My VFS would provide only an 'aio.h'-like interface to access file system nodes, so that would also be the way to access block devices. Non-async IO would just wrap those calls with waits, and supporting 'aio.h' would be simple.
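
For reference, the usual pattern for layering a blocking read on top of an aio.h-style interface looks like this in userspace (standard POSIX calls; nothing here is specific to my VFS):

Code:
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Queue an asynchronous read, then wait for it to complete. */
ssize_t blocking_read(int fd, void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_read(&cb) < 0)
        return -1;

    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);      /* block until the request finishes */

    return aio_return(&cb);              /* bytes read, or -1 on error       */
}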

Re: Implementing all block devices using mmap?

Posted: Fri May 06, 2016 9:50 am
by onlyonemac
I'm still not seeing the benefit of memory-mapping at the filesystem level. The filesystem driver preferably uses its own structures in memory, so it doesn't need to work with pointers to memory-mapped inode blocks, and the userspace process probably doesn't care whether or not the file is memory-mapped unless it explicitly requests this.

If the whole idea is just to permit pointer arithmetic for parsing filesystem tables, then the filesystem driver can easily perform the same arithmetic and say to the block device driver "read sizeof(raw_inode_t) bytes at address inode_map_base + sizeof(raw_inode_t) * inode_index into the buffer at address raw_inode". Or, if you simply can't resist saying "raw_inode_t raw_inodes[]", you can read the whole inode table into memory at once and work with it there, committing changed entries back to disk as they are changed (or in whatever other order is preferable).
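
In other words, something like this (blockdev_read and the inode layout are made-up names, just to illustrate the alternative):

Code:
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t mode; uint16_t links; uint32_t size; /* ... */ } raw_inode_t;

/* Hypothetical block layer call: read 'len' bytes starting at 'byte_offset'. */
int blockdev_read(int dev, uint64_t byte_offset, void *buf, size_t len);

int read_raw_inode(int dev, uint64_t inode_table_offset,
                   uint32_t inode_index, raw_inode_t *out)
{
    uint64_t off = inode_table_offset + (uint64_t)sizeof(raw_inode_t) * inode_index;
    return blockdev_read(dev, off, out, sizeof(*out));
}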

Re: Implementing all block devices using mmap?

Posted: Fri May 06, 2016 10:15 am
by Hellbender
onlyonemac wrote:I'm still not seeing the benefit of memory-mapping at the filesystem level.
I guess the thing I was looking for was the separation of "what to read" (e.g. mmapped areas) from "when to read" (e.g. the page fault handler), and avoiding double work when supporting both normal reads and mmapped files.
But in hindsight, the same thing can be achieved using async-io (enqueuing async reads vs. executing those reads), with the added benefit of having clear 'data ready' notifications (signals), cancellation, and priorities. It also seems that mmapped files are easily implemented on top of async reads.