VFS in userspace & how to improve it

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
max
Member
Posts: 616
Joined: Mon Mar 05, 2012 11:23 am
Libera.chat IRC: maxdev
Location: Germany

VFS in userspace & how to improve it

Post by max »

Heyho guys,

As some of you may have read, I am writing a pure microkernel (Ghost), which means that everything beyond the very basic kernel functionality runs in userspace. This led me to putting the virtual filesystem in userspace as well. This has a great benefit: the system doesn't crash if the VFS goes down; it can just restart it. But it is a real pain in the @$$ when it comes to performance.

At the moment, I'm allocating a new shared memory area each time a read or write is done; this is obviously bad and slow. So I thought about allocating a shared memory area when the user calls "open" and freeing that memory on "close". That's just one part; the other thing is that messaging is involved: the user process sends a message that it wants to read something, the VFS handles it, reads the data and sends a response, and the user process reads the response.

Maybe someone has an idea on how to improve the performance? One shared memory area per file would be a start, even though it must then be synchronized or be per-thread. Any ideas are welcome.
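
To make it concrete, the request/response layout I have in mind looks roughly like this (just a sketch, all names made up):

Code: Select all

#include <stddef.h>

// Sketch of the per-open protocol: "open" sets up one shared buffer
// that later reads/writes reference by handle, instead of creating a
// new shared memory area per request. All names are invented.
typedef int ShmHandle;

typedef struct {
        int       type;     // MSG_OPEN, MSG_READ, MSG_WRITE, MSG_CLOSE
        int       fd;       // valid after the open response
        ShmHandle shm;      // shared transfer buffer, created on open
        size_t    offset;   // file offset for read/write
        size_t    length;   // bytes to move through the shared buffer
        int       status;   // filled in by the VFS in its response
} VfsRequest;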

Greets,
Max
Last edited by max on Mon Jan 26, 2015 10:10 am, edited 1 time in total.
sortie
Member
Posts: 931
Joined: Wed Mar 21, 2012 3:01 pm
Libera.chat IRC: sortie

Re: VFS in userspace & how to improve it

Post by sortie »

(I made some unhelpful posts go poof - please don't bikeshed opinions)

My system is not a microkernel, but I do have my main actual filesystem (ext2) in user-space, where the kernel VFS backend sends messages to a filesystem process and awaits responses. It's not too well implemented. It's not that slow, but many small filesystem operations have a cost that adds up, so I looked into it. I profiled how much time was spent in user-space, in the kernel, in the filesystem process and so on during a self-hosting build on my OS.

The main issue I noticed was that the message passing resulted in the sending thread yielding, then every other thread in the system ran until the filesystem process got its turn, and then the rest of the system's threads ran until the original thread ran again. Basically, every I/O operation gave every single thread in the system a chance to run. This is a ridiculously high cost, especially if other threads in the system are CPU-bound and use up their entire timeslice.

I decided to improve my scheduling by introducing a new concept: the thread that is executing might not be the one owning the timeslice. When a thread sends a message and blocks, the yield routine notices it's waiting for another thread, so without yielding the timeslice it just switches to the filesystem process's listening thread. Once it gets a response, or something invalidates this, it switches back to the true thread, which may then continue or yield to another thread again. The result is that a small operation causes a message to be sent, the scheduler instantly switches to the right thread, which handles the small message and responds immediately, and this continues until the client thread is done or the timeslice is up. This made my system significantly faster. I even extended other areas of my system with similar features (such as pipes). The generic concept is that a thread knows what it is waiting for, and if its timeslice isn't up, the scheduler runs the waited-for thread in the true thread's place.
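
In rough C (all names invented, purely to illustrate), the send path looks something like this:

Code: Select all

#include <stdbool.h>

struct Thread {
        struct Thread *waiting_for;  // thread we want to run in our place
        bool           blocked;
        /* ... registers, timeslice bookkeeping, ... */
};

void switch_to(struct Thread *t);    // context switch, provided elsewhere

// Sketch: block the sender and switch straight to the server's
// listening thread without ending the timeslice. The slice is still
// billed to the sender, the "true" thread; the scheduler switches
// back once the reply arrives.
void send_and_block(struct Thread *self, struct Thread *server)
{
        self->blocked = true;
        self->waiting_for = server;
        switch_to(server);
}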

I think that cost was much higher than the memory allocations, but those matter too at some point. I suggest you prepare areas in advance that can be used cheaply, allocating more if needed (or waiting for one to be freed), so you can use an efficient constant-time allocator.
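
For the prepared areas, a plain free list of preallocated buffers is enough to get constant time, roughly like this (a sketch with invented names; locking omitted):

Code: Select all

struct Buffer {
        struct Buffer *next;
        void          *mem;   // preallocated shared memory area
};

static struct Buffer *free_list;

// Pop a prepared buffer in O(1); NULL means allocate more, or wait.
struct Buffer *buffer_get(void)
{
        struct Buffer *b = free_list;
        if (b)
                free_list = b->next;
        return b;
}

// Return a buffer to the pool in O(1).
void buffer_put(struct Buffer *b)
{
        b->next = free_list;
        free_list = b;
}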

My recommendation is just to implement it for now, then profile. Profile and figure out what needs improvement.
Kevin
Member
Posts: 1071
Joined: Sun Feb 01, 2009 6:11 am
Location: Germany

Re: VFS in userspace & how to improve it

Post by Kevin »

max wrote:This has a great benefit: the system doesn't crash if the VFS goes down; it can just restart it.
Really that easy? You need to keep a copy of the VFS binary around, because you won't be able to read it from disk when you need it. And your applications need to be able to cope with the fact that the in-memory state of the VFS service is lost. Depending on your implementation, this might mean that they need to be prepared to lose all of their file descriptors at any point, or just that all the mountpoints are gone.

Most of this is solvable, but certainly way beyond "just restart it".
At the moment, I'm allocating a new shared memory area each time a read or write is done; this is obviously bad and slow.
I think you need to consider the whole path of the data from the application down to the disk. It's rather unlikely that an application will be content with having a single SHM buffer per file and directly working with the data in there. If you can do it, that's zero-copy with permanent SHM and probably close to optimal, but it will only cover a few corner cases.

For the other cases, you either need to copy data from/to the SHM, or you can decide to just temporarily share the buffer the application is already working on. The latter unfortunately requires page alignment, but even though it creates a new SHM area for each request, it should be much quicker than copying all the data around. If it isn't, your SHM allocation code could probably use some optimisation.
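
The alignment part is just rounding the user's buffer out to whole pages before mapping it, along these lines (a sketch, names invented):

Code: Select all

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096

// Sketch: compute the page-aligned window around a user buffer so the
// region can be temporarily mapped into the filesystem process. Note
// that whole pages get shared, so other data in the first and last
// page becomes visible too; that's the price of this approach.
void share_window(uintptr_t buf, size_t len,
                  uintptr_t *map_base, size_t *map_len)
{
        *map_base = buf & ~(uintptr_t)(PAGE_SIZE - 1);
        *map_len  = ((buf + len + PAGE_SIZE - 1)
                        & ~(uintptr_t)(PAGE_SIZE - 1)) - *map_base;
}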

You might decide that you want to offer both options.
So I thought about allocating a shared memory area when the user calls "open" and freeing that memory on "close". That's just one part; the other thing is that messaging is involved: the user process sends a message that it wants to read something, the VFS handles it, reads the data and sends a response, and the user process reads the response.
How about getting the VFS process out of the way for reads and letting the application communicate directly with the file system driver?
Maybe someone has an idea on how to improve the performance?
In tyndur we ended up moving the VFS into the kernel because the usermode one hadn't worked out that well.

Actually, in tyndur we have some proof-of-concept code where even the file system driver is eliminated from the I/O path: it can just upload the mapping of its blocks, and the kernel then applies this mapping to any requests and submits them directly to the disk driver. Going directly from the application to the disk driver is something you can hardly do without kernel support, unless you want every process to be able to access anything on the disk.
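
The uploaded mapping is essentially an extent list that the kernel can walk for each request, something like this (a sketch, names invented):

Code: Select all

#include <stdint.h>
#include <stddef.h>

// Sketch: the FS driver uploads extents ("this file range lives in
// these disk blocks"); the kernel translates each request with the
// table and submits it straight to the disk driver.
struct Extent {
        uint64_t file_offset;  // byte offset within the file
        uint64_t disk_offset;  // byte offset on disk backing that range
        uint64_t length;       // bytes covered by this extent
};

// Translate a file offset to a disk offset, or -1 if unmapped.
int64_t map_offset(const struct Extent *ext, size_t n, uint64_t off)
{
        for (size_t i = 0; i < n; i++) {
                if (off >= ext[i].file_offset &&
                    off - ext[i].file_offset < ext[i].length)
                        return (int64_t)(ext[i].disk_offset
                                         + (off - ext[i].file_offset));
        }
        return -1;
}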
Developer of tyndur - community OS of Lowlevel (German)
max
Member
Posts: 616
Joined: Mon Mar 05, 2012 11:23 am
Libera.chat IRC: maxdev
Location: Germany

Re: VFS in userspace & how to improve it

Post by max »

Hey sortie, hey Kevin,

the scheduling idea sounds good, but for my taste this is too much logic for a scheduler. Getting my current design really performant would involve making the system much less compatible with Unix software, and even though I don't need too much 'nix, I want to stay as compatible as possible. I'll have to rethink all that and give you guys an update once I've found a solution.
Thanks for your replies so far! :)

Greets,
Max
sortie
Member
Posts: 931
Joined: Wed Mar 21, 2012 3:01 pm
Libera.chat IRC: sortie

Re: VFS in userspace & how to improve it

Post by sortie »

max wrote:the scheduling idea sounds good, but for my taste this is too much logic for a scheduler.
I quite disagree. It's very simple. Each thread has a pointer to a thread it would prefer to run in its place. When a thread blocks on something only another thread can do, that pointer gets set. The scheduler just needs to check that variable and keep track of which is the true thread and which is the active thread. Finally, at key locations (like mutex unlock) it should return to the true thread to see if it can progress now. It's really not that much code, and I think it's actually a great scheduler primitive. I expect it to be useful for threads that interact a lot with other threads. But yes - this is just an example of an optimization I did after profiling showed that roundtrips to the filesystem server were a bottleneck.
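
As a sketch (field names invented):

Code: Select all

struct Thread {
        struct Thread *run_instead;  // set when blocking on another thread
        int            blocked;
};

// Sketch of the scheduler-side check: follow the chain of waited-for
// threads to something runnable and run it in the true thread's
// place. Timeslice accounting stays with true_thread. (A cycle here
// would mean deadlock; detection is omitted.)
struct Thread *pick_to_run(struct Thread *true_thread)
{
        struct Thread *t = true_thread;
        while (t->blocked && t->run_instead)
                t = t->run_instead;
        return t;
}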
Kevin
Member
Posts: 1071
Joined: Sun Feb 01, 2009 6:11 am
Location: Germany

Re: VFS in userspace & how to improve it

Post by Kevin »

Just doing a directed yield could already improve things quite a bit. At least this is what I observed back then when I combined RPC requests with an immediate task switch to the target process (and back for the response). But then, that was probably only because several tasks were busy-waiting for input; if your tasks sleep properly, then in simple cases with no active background processes it shouldn't make any difference, because only one task is runnable anyway.
Developer of tyndur - community OS of Lowlevel (German)
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

In a message passing kernel it is quite usual for the scheduler to be distributed around the kernel. This is especially easy to implement if your kernel uses one kernel thread per core.

When a message is being sent, the sender is blocked and the receiver woken up inside the msg_send kernel call. In fact, if the msg is small (as it often is), it's possible to switch to the receiver's memory context right there and do the msg copy. The receiver can be boosted to the sender's priority and take over its timeslice. The scheduler need not be involved because the correct thread is already running.
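
In sketch form (all helper names invented), msg_send does the hand-off itself:

Code: Select all

#include <stddef.h>

typedef struct Thread Thread;

// Placeholders for kernel internals:
void block(Thread *t);
void set_priority(Thread *t, int prio);
int  get_priority(Thread *t);
void switch_address_space(Thread *t);
void copy_msg(Thread *recv, const void *msg, size_t len);
void run_with_timeslice_of(Thread *next, Thread *donor);

// Sketch: for a small message, block the sender, boost the receiver
// to the sender's priority, copy the payload in the receiver's
// context, and let it take over the sender's timeslice. No separate
// scheduler pass is needed.
void msg_send(Thread *sender, Thread *recv, const void *msg, size_t len)
{
        block(sender);
        if (get_priority(recv) < get_priority(sender))
                set_priority(recv, get_priority(sender));
        switch_address_space(recv);
        copy_msg(recv, msg, len);
        run_with_timeslice_of(recv, sender);
}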
If a trainstation is where trains stop, what is a workstation ?
willedwards
Member
Posts: 96
Joined: Sat Mar 15, 2014 3:49 pm

Re: VFS in userspace & how to improve it

Post by willedwards »

L4 is an example of a microkernel which gives the remaining quantum to the message recipient if the recipient is blocked waiting for a message.

http://www.nicta.com.au/pub?doc=6930 is a nice write-up which covers the technical details and other microkernel lessons learned.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

Max, I have my VFS in userspace. My VFS is basically just a tree. The tree contains files that only exist at runtime, like devices, but it also contains the open files in the system. This makes it much like the Linux dentry cache.
Each entry in the tree has a list of pages. User processes can call read or write to access the data in the pages, or can mmap whole pages into their memory space.
Each read/write or mmap requires a send and a reply, so in that sense it is slower than a monolithic kernel call, but I'm not worried about the performance yet. The biggest performance issues will be contention on the VFS tree and how well my page and inode caches work. The message passing performance is a small part of the design and not something I believe I'll need to look at soon.
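
In outline, an entry looks something like this (a sketch, names invented):

Code: Select all

#include <stdint.h>

// Sketch: a page of cached file data.
struct Page {
        uint64_t     offset;    // file offset this page caches
        void        *mem;       // page-sized buffer
        struct Page *next;
};

// Sketch of a VFS tree entry, dentry-cache style: children form the
// name tree, pages hold the cached data that read/write/mmap use.
struct VfsNode {
        const char     *name;
        struct VfsNode *parent;
        struct VfsNode *children;  // first child
        struct VfsNode *sibling;   // next entry in the same directory
        struct Page    *pages;     // cached data pages for this file
};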
If a trainstation is where trains stop, what is a workstation ?
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: VFS in userspace & how to improve it

Post by Rusky »

Another benefit of sortie's more general method is that it solves priority inversion: if a high-priority process is blocked on something currently owned by a lower-priority process, the scheduler can automatically run the lower-priority process during the high-priority one's timeslice (still billing the time to the high-priority one), rather than accidentally giving a mid-priority process time first.

A message passing-based mechanism probably wouldn't handle that sort of scenario.

As far as user-space VFS performance goes, a lot of the same optimizations apply as when you're doing it in-kernel: when to allocate buffers, for example. For message passing overhead, you could probably design the VFS so that it doesn't do anything security-sensitive, make it a library you link into each process, and have it send messages directly to the specific file system servers.

In fact, to reduce message passing even more, you could turn the file system servers into simple disk block cache permission managers. All they do is understand the filesystem structure/metadata enough to keep track of which pages have which disk blocks cached and which processes are allowed to map them in. Then you just put everything else in a library that uses a simple mmap interface to the servers to implement open/read/write/close itself.
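
A sketch of the library side (fs_map and all other names here are invented):

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical call into the FS server: map the cached disk blocks
// backing this file range into our address space (messaging the
// server only if they aren't mapped yet).
void *fs_map(int fd, uint64_t offset, size_t len);

// Sketch: read() implemented entirely in the library. Once the
// mapping exists, reads are a local memcpy with no server roundtrip.
size_t lib_read(int fd, uint64_t offset, void *buf, size_t len)
{
        void *src = fs_map(fd, offset, len);
        if (src == NULL)
                return 0;  // not cached/permitted; caller must handle it
        memcpy(buf, src, len);
        return len;
}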
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

Rusky wrote:A message passing-based mechanism probably wouldn't handle that sort of scenario.
What do you mean ? Which scenario wouldn't it handle ?

Rusky wrote:In fact, to reduce message passing even more, you could turn the file system servers into simple disk block cache permission managers. All they do is understand the filesystem structure/metadata enough to keep track of which pages have which disk blocks cached and which processes are allowed to map them in. Then you just put everything else in a library that uses a simple mmap interface to the servers to implement open/read/write/close itself.
How would the clients and servers synchronise ? For example when the client needs a part of a file to be loaded into the cache, how would it ask for that ? And when the server loads a page, how would it tell the client that it is ready ? Surely messages (or something with the same cost as messages) would be needed anyway.
If a trainstation is where trains stop, what is a workstation ?
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: VFS in userspace & how to improve it

Post by Rusky »

gerryg400 wrote:
Rusky wrote:A message passing-based mechanism probably wouldn't handle that sort of scenario.
What do you mean ? Which scenario wouldn't it handle ?
Imagine, as a simplification, you have 3 processes: process A is high priority, process B is low priority, and process C is somewhere in between. Now, B and A share a resource, protected by a lock. Process B locks it and then is preempted because A and C, which are higher priority, need to run. Process A really should run because it's the highest priority. Without anything in the scheduler to tell it to run B because A is waiting on it, it will run process C instead. If you only donate timeslices when a message is sent, you might miss this.
gerryg400 wrote:
Rusky wrote:In fact, to reduce message passing even more, you could turn the file system servers into simple disk block cache permission managers. All they do is understand the filesystem structure/metadata enough to keep track of which pages have which disk blocks cached and which processes are allowed to map them in. Then you just put everything else in a library that uses a simple mmap interface to the servers to implement open/read/write/close itself.
How would the clients and servers synchronise ? For example when the client needs a part of a file to be loaded into the cache, how would it ask for that ? And when the server loads a page, how would it tell the client that it is ready ? Surely messages (or something with the same cost as messages) would be needed anyway.
I should have separated the parts of my post better. I'm not saying there wouldn't be any messages, just that this way there can be fewer messages. The implementation without libraries might in the worst case end up with a message to the VFS server, then a message to the file system server, then a message to the disk cache/driver server, then messages all the way back.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

Rusky wrote:
gerryg400 wrote:
Rusky wrote:A message passing-based mechanism probably wouldn't handle that sort of scenario.
What do you mean ? Which scenario wouldn't it handle ?
Imagine, as a simplification, you have 3 processes: process A is high priority, process B is low priority, and process C is somewhere in between. Now, B and A share a resource, protected by a lock. Process B locks it and then is preempted because A and C, which are higher priority, need to run. Process A really should run because it's the highest priority. Without anything in the scheduler to tell it to run B because A is waiting on it, it will run process C instead. If you only donate timeslices when a message is sent, you might miss this.
That sounds like a priority inversion problem that should be solved in one of the usual ways priority inversion is solved (e.g. priority inheritance). A message passing system can easily handle that scenario without any help from the scheduler, by simply elevating the priority of the receiver of a message to that of the highest-priority sender/message on its queue.
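
For example (a sketch, names invented; dropping the priority back down when the queue drains is omitted):

Code: Select all

struct Msg {
        int         sender_prio;
        struct Msg *next;
};

struct Server {
        int         prio;      // effective priority of the receiver
        struct Msg *queue;
};

// Sketch: on every enqueue the receiver inherits the highest
// priority of any sender on its queue.
void msg_enqueue(struct Server *recv, struct Msg *m)
{
        m->next = recv->queue;           // push (FIFO handling omitted)
        recv->queue = m;
        if (m->sender_prio > recv->prio)
                recv->prio = m->sender_prio;
}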
Rusky wrote:
gerryg400 wrote:
Rusky wrote:In fact, to reduce message passing even more, you could turn the file system servers into simple disk block cache permission managers. All they do is understand the filesystem structure/metadata enough to keep track of which pages have which disk blocks cached and which processes are allowed to map them in. Then you just put everything else in a library that uses a simple mmap interface to the servers to implement open/read/write/close itself.
How would the clients and servers synchronise ? For example when the client needs a part of a file to be loaded into the cache, how would it ask for that ? And when the server loads a page, how would it tell the client that it is ready ? Surely messages (or something with the same cost as messages) would be needed anyway.
I should have separated the parts of my post better. I'm not saying there wouldn't be any messages, just that this way there can be fewer messages. The implementation without libraries might in the worst case end up with a message to the VFS server, then a message to the file system server, then a message to the disk cache/driver server, then messages all the way back.
The correct way to address that issue (and it solves other problems too) is to make the page/disk cache part of the VFS. The VFS is then a tree of all open (and recently used) files with cached pages of those files. The filesystem server and disk driver are only messaged on a cache miss. When a cache miss occurs, the extra message passes are not noticed because of the time taken for the actual disk I/O.
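
Sketched out (helpers invented), the read path becomes:

Code: Select all

#include <stdint.h>

struct VfsNode;
struct Page;

// Placeholders for the cache and the message path:
struct Page *cache_lookup(struct VfsNode *n, uint64_t off);
struct Page *cache_alloc(struct VfsNode *n, uint64_t off);
void         fs_server_fill(struct VfsNode *n, struct Page *p);

// Sketch: the hot path never leaves the VFS; only a miss messages the
// filesystem server, and that cost hides behind the disk I/O anyway.
struct Page *vfs_read_page(struct VfsNode *node, uint64_t offset)
{
        struct Page *p = cache_lookup(node, offset);
        if (p)
                return p;               // cache hit: no messages at all
        p = cache_alloc(node, offset);
        fs_server_fill(node, p);        // message + actual disk I/O
        return p;
}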
If a trainstation is where trains stop, what is a workstation ?
eryjus
Member
Posts: 286
Joined: Fri Oct 21, 2011 9:47 pm
Libera.chat IRC: eryjus
Location: Tustin, CA USA

Re: VFS in userspace & how to improve it

Post by eryjus »

sortie wrote:
max wrote:the scheduling idea sounds good, but for my taste this is too much logic for a scheduler.
I quite disagree. It's very simple. Each thread has a pointer to a thread it would prefer to run in its place. When a thread blocks on something only another thread can do, that pointer gets set. The scheduler just needs to check that variable and keep track of which is the true thread and which is the active thread. Finally, at key locations (like mutex unlock) it should return to the true thread to see if it can progress now. It's really not that much code, and I think it's actually a great scheduler primitive. I expect it to be useful for threads that interact a lot with other threads. But yes - this is just an example of an optimization I did after profiling showed that roundtrips to the filesystem server were a bottleneck.
I have to agree with sortie. I asked a similar question some time ago, and it led me to code a function so that when a process is waiting on a mutex (even if the lock holder has a lower priority), it donates its quantum to the process holding the lock. In my 32-bit kernel this was not implemented as a scheduling task; it was a yield function that has a specific process as its target (the holder of the mutex lock).

My thought process was simple: if a higher-priority task needed to do something but was waiting for a lock to be released by a lower-priority task, it would donate its quantum to the process holding the lock in an attempt to get to the task at hand as soon as it could; it certainly would not get there by spinning and waiting for a lower-priority task (which might never get any CPU time) to release the lock. I figure the same thinking is sound for a spinlock.

Here is the code from my 32-bit kernel -- I hope it helps illustrate:

Code: Select all

void DonateQuantum(Process *switchTo, ProcStatus sts, Mutex *mtx)
{
        uint32 flags = DisableInterrupts();
        Process *switchFrom = currentProcess;

        // The lock holder is about to run in our place; take it off
        // whatever status queue it is currently on.
        ListDelInit(GetProcStsQueue(switchTo));

        if (currentProcess) {
                // Mark the caller as waiting and park it on the
                // mutex's wait list until the lock is released.
                SetProcStatus(currentProcess, sts);
                ListAddTail(GetProcStsQueue(currentProcess), &(mtx->waitList));
        }

        // Hand over the caller's remaining quantum and make the lock
        // holder the running process.
        SetProcQuantumLeft(switchTo, GetProcQuantumLeft(currentProcess));
        SetProcStatus(switchTo, PROC_RUN);
        currentProcess = switchTo;

        ExecTaskSwitch(switchFrom, switchTo);
        RestoreInterrupts(flags);
}
Its call from LockMutex() looks like this:

Code: Select all

DonateQuantum(mtx->lockHolder, PROC_MTXW, mtx);
Adam

The name is fitting: Century Hobby OS -- At this rate, it's gonna take me that long!
Read about my mistakes and missteps with this iteration: Journal

"Sometimes things just don't make sense until you figure them out." -- Phil Stahlheber
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: VFS in userspace & how to improve it

Post by Rusky »

gerryg400 wrote:That sounds like a priority inversion problem that should be solved in one of the usual ways priority inversion is solved (e.g. priority inheritance). A message passing system can easily handle that scenario without any help from the scheduler, by simply elevating the priority of the receiver of a message to that of the highest-priority sender/message on its queue.
It is priority inheritance, just generalized so that a message send isn't the only way to bestow a process's priority on another. Putting it in the scheduler is a slight complication, but also a performance win, because you don't wake up the blocked application just to have it donate its time slice.
gerryg400 wrote:The correct way to address that issue (and it solves other problems too) is to make the page/disk cache part of the VFS. The VFS is then a tree of all open (and recently used) files with cached pages of those files. The filesystem server and disk driver are only messaged on a cache miss. When a cache miss occurs, the extra message passes are not noticed because of the time taken for the actual disk I/O.
Sure, if you still want the VFS as a separate server, with a minimum of one cross-domain call per FS operation. Moving the VFS into a library is another perfectly valid organization that gets rid of a lot of those message sends while keeping the nice property of only messaging the FS/disk driver on cache misses. It's actually been implemented in several research kernels at MIT, although in the context of exokernels, so they had the file systems in the kernel rather than in servers.

Microkernels are great for isolating failure, but you have to think about where you want to isolate failures. The VFS doesn't really have anything that needs to be isolated from applications (unless you want to make that tradeoff, as you could with any other shared library), but if your VFS server goes down, it's going to impact a lot more applications than if a VFS library takes down its one host application.