VFS in userspace & how to improve it

gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

Rusky wrote:
gerryg400 wrote:That sounds like a priority inversion problem that should be solved in one of the usual ways that priority inversion is solved (e.g. priority inheritance). A message passing system can easily handle that scenario without any help from the scheduler by simply elevating the priority of the receiver of a message to that of the highest-priority sender/message on its queue.
It is priority inheritance - just generalized so a message send isn't the only way to bestow a process's priority on another. Putting it in the scheduler is a slight complication but also a performance win because you don't wake up the blocked application just to have it donate its time slice.
I disagree and so do others. Did you read the paper that willedwards linked? I don't actually understand your statement about not wanting to wake up the blocked app. Isn't that precisely what we want to do?
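For concreteness, here is a minimal sketch of the queue-based priority inheritance rule described above: the receiver's effective priority is its own priority, boosted by the highest-priority message waiting on its queue. All names here are invented for illustration, not taken from anyone's actual kernel.

#include <stddef.h>

struct message {
    int sender_prio;           /* priority of the sending thread */
    struct message *next;      /* queue link */
};

struct thread {
    int base_prio;             /* the thread's own priority */
    struct message *queue;     /* pending messages */
};

/* Effective priority: own priority, raised to that of the
   highest-priority sender still waiting on the queue. */
static int effective_prio(const struct thread *t)
{
    int prio = t->base_prio;
    for (const struct message *m = t->queue; m != NULL; m = m->next)
        if (m->sender_prio > prio)
            prio = m->sender_prio;
    return prio;
}

The send path (or scheduler) would recompute this whenever the queue changes, so the boost appears as soon as a hot message arrives and disappears once it has been received.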
Rusky wrote:
gerryg400 wrote:The correct way to address that issue (and it solves other problems) is to make the page/disk cache part of the VFS. So the VFS is a tree of all open (and recently used) files with cached pages of those files. The filesystem server and disk driver are only messaged on a cache miss. When a cache miss occurs, the extra message passes are not noticed because of the time taken for actual disk IO.
Sure, if you still want the VFS as a separate server, with a minimum of one cross-domain call per FS operation. Moving the VFS into a library is another perfectly valid organization that gets rid of a lot of those message sends, keeping the nice property of only messaging the FS/disk driver on cache misses. It's actually been implemented in several research kernels at MIT, although in the context of exokernels, so they had the file systems in the kernel rather than in servers.

Microkernels are great for isolating failure, but you have to think about where you want to isolate the failure to. The VFS doesn't really have anything that needs to be isolated from applications (unless you want to make that tradeoff like you could with any other normal shared library), but if your VFS server goes down, it's going to impact a lot more applications than if a VFS library takes down its one host application.
MIT exokernels are irrelevant. What microkernels are great for or not great for is irrelevant. Isolation is irrelevant. The VFS going down is irrelevant.

Having the VFS in a shared library might be worth considering, but how would files be shared between processes? All files are potentially shared resources, and a VFS process allows the cache buffers to be shared. I'm not sure how that's possible with a library.
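For reference, roughly the kind of structure that "tree of all open (and recently used) files with cached pages" implies, with a cache lookup that only falls through to a filesystem message on a miss. Every type and field name here is invented for illustration.

#include <stdint.h>
#include <stddef.h>

struct cached_page {
    uint64_t file_offset;         /* page-aligned offset into the file */
    void *frame;                  /* frame holding the cached data */
    struct cached_page *next;     /* per-file list (a radix tree in practice) */
};

struct vfs_node {
    char name[256];
    int open_count;               /* >0: open; 0: recently used, reclaimable */
    struct cached_page *pages;    /* the page cache lives here, in the VFS */
    struct vfs_node *children;    /* directory entries */
    struct vfs_node *sibling;
};

/* A read hits the cache first; only on a miss does the VFS message
   the filesystem server and disk driver. */
static struct cached_page *lookup_page(struct vfs_node *n, uint64_t off)
{
    for (struct cached_page *p = n->pages; p != NULL; p = p->next)
        if (p->file_offset == off)
            return p;
    return NULL;   /* cache miss: send a request to the FS server */
}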
If a trainstation is where trains stop, what is a workstation ?
Kevin
Member
Posts: 1071
Joined: Sun Feb 01, 2009 6:11 am
Location: Germany

Re: VFS in userspace & how to improve it

Post by Kevin »

gerryg400 wrote:When a cache miss occurs, the extra message passes are not noticed because of the time taken for actual disk IO.
You wish. You might get "good enough" performance with this in a simple case where you're reading a small text file and nothing else is happening in parallel. That is, assuming your messaging is efficient; if it isn't, you can lose a lot right there. But if another process needs the CPU in the background, the additional overhead will be noticed anyway. Or even without that: with SSDs, "disks are slow, we can do any stupidity without it being noticed" is becoming less true even for the simple case.
eryjus wrote:My thought process was simple: if a higher priority task needed to do something but was waiting for a lock to be released by a lower priority task, it would donate its quantum to the process holding the lock in an attempt to get to the task at hand as soon as it could -- it certainly would not get there spinning and waiting for a lower priority task (which might never get any CPU time) to release a lock. I figure the same thinking is sound with a spinlock.
Nope, it's not. A spinlock that yields is not a spinlock any more, by definition.
gerryg400 wrote:Having the VFS in a shared library might be worth considering, but how would files be shared between processes? All files are potentially shared resources, and a VFS process allows the cache buffers to be shared. I'm not sure how that's possible with a library.
Shared memory, I guess. As long as you only map data pages that the application has access to, this should work out.

However, I'm not sure how to control the cache size then. When memory is running out, you probably want to throw some cached data away rather than producing an OOM error.
Developer of tyndur - community OS of Lowlevel (German)
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

Kevin wrote:
gerryg400 wrote:When a cache miss occurs, the extra message passes are not noticed because of the time taken for actual disk IO.
You wish. You might get "good enough" performance with this in a simple case where you're reading a small text file and nothing else is happening in parallel. That is, assuming your messaging is efficient; if it isn't, you can lose a lot right there. But if another process needs the CPU in the background, the additional overhead will be noticed anyway. Or even without that: with SSDs, "disks are slow, we can do any stupidity without it being noticed" is becoming less true even for the simple case.
As is always the case, if it's good enough you stop; if it's not, you keep going. The next optimisation, I imagine, would be to implement shared memory between the VFS page cache and the driver. That, combined with some read-ahead, should provide another biggish boost. As is always the case, profiling a working system helps. Right now this is as far as I intend to go in my OS.

BTW, this arrangement with the VFS controlling the page cache makes it almost trivial to implement mmapped files. The kernel can convert page faults in a process into messages to the VFS. The VFS can load the page (with some readahead) and signal the kernel to restart the faulted process.
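A rough sketch of that fault-to-message path, under the assumption of a hypothetical kernel/VFS message interface; none of these calls are real, they just mark where the real ones would go.

#include <stdint.h>

struct fault_msg {
    int      pid;       /* faulting process */
    uint64_t vaddr;     /* faulting virtual address */
    int      fd;        /* backing file of the mapping */
    uint64_t offset;    /* file offset that backs vaddr */
};

/* Hypothetical kernel/VFS primitives; placeholders only. */
void block_process(int pid);
void send_to_vfs(const struct fault_msg *m);
void *cache_fill(int fd, uint64_t offset);    /* load page + readahead */
void map_and_resume(int pid, uint64_t vaddr, void *frame);

/* Kernel side, in the page fault handler for an mmapped region: */
void handle_mmap_fault(int pid, uint64_t vaddr, int fd, uint64_t offset)
{
    struct fault_msg m = { pid, vaddr, fd, offset };
    block_process(pid);     /* keep the faulting process off the CPU */
    send_to_vfs(&m);        /* turn the fault into a message */
}

/* VFS server side: fill the cache (with some readahead), then ask the
   kernel to map the frame and restart the faulted process. */
void vfs_handle_fault(const struct fault_msg *m)
{
    void *frame = cache_fill(m->fd, m->offset);
    map_and_resume(m->pid, m->vaddr, frame);
}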
Kevin wrote:
gerryg400 wrote:Having the VFS in a shared library might be worth considering, but how would files be shared between processes? All files are potentially shared resources, and a VFS process allows the cache buffers to be shared. I'm not sure how that's possible with a library.
Shared memory, I guess. As long as you only map data pages that the application has access to, this should work out.

However, I'm not sure how to control the cache size then. When memory is running out, you probably want to throw some cached data away rather than producing an OOM error.
Me neither. It's difficult enough to see how to share and force-reclaim the pages between the memory manager and the VFS without the problem of individual processes also needing page caches. Also, shared memory still needs synchronisation, so I'm not sure how we are reducing the number of messages.
If a trainstation is where trains stop, what is a workstation ?
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: VFS in userspace & how to improve it

Post by Rusky »

gerryg400 wrote:I don't actually understand your statement about not wanting to wake up the blocked app. Isn't that precisely what we want to do?
There are two ways to do priority inheritance that have been mentioned here. One way is to let the applications do it: when a high priority app is blocked on a lower priority one, it explicitly donates its timeslice. This is how you'd have to do it with the messaging-centric timeslice donation system several users wanted to stick with because they thought that sortie's expansion on top of it was too complicated. The other way is what sortie described: let the scheduler know which processes are blocked on which others, and have it directly schedule the blocked-on processes.
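A minimal sketch of the second approach, assuming the scheduler keeps a blocked_on link per thread: when the thread it picks is blocked, it follows the chain and runs whatever that thread is transitively waiting on. Names are hypothetical.

#include <stddef.h>

struct thread {
    int runnable;               /* nonzero if ready to run */
    struct thread *blocked_on;  /* thread this one waits for, or NULL */
};

/* If the picked thread is blocked, run whatever it is (transitively)
   waiting on, so the donated time reaches a thread that can make
   progress on its behalf. */
struct thread *resolve(struct thread *t)
{
    while (!t->runnable && t->blocked_on != NULL)
        t = t->blocked_on;
    return t;
}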
gerryg400 wrote:Having the VFS in a shared library might be worth considering, but how would files be shared between processes? All files are potentially shared resources, and a VFS process allows the cache buffers to be shared. I'm not sure how that's possible with a library.
The things you labeled as irrelevant are precisely the reasons that make a shared-library VFS worth considering, but in any case file sharing is managed by the file system servers, or whatever manages permissions. They just let applications map in pages of the files they're allowed access to, while the VFS library figures out what pages to request.
gerryg400 wrote:BTW, this arrangement with the VFS controlling the page cache makes it almost trivial to implement mmapped files. The kernel can convert page faults in a process into messages to the VFS. The VFS can load the page (with some readahead) and signal the kernel to restart the faulted process.
That is just as doable with the VFS as a shared library, by signalling the process itself on a page fault. The VFS library approach, however, is more flexible, because applications can specialize the page fault handler for their own workloads, as well as implement their own prefetching algorithms. This can give you some pretty impressive performance improvements when they're needed.
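As a sketch of what such specialization could look like, assuming a hypothetical VFS-library API for registering a fault handler (neither call below is a real interface): a database-style application might prefetch a whole B-tree node's worth of pages per fault instead of relying on default sequential readahead.

#include <stdint.h>

/* Hypothetical VFS-library interface; not a real API. */
typedef void (*fault_handler_t)(int fd, uint64_t file_offset);
void vfs_lib_set_fault_handler(fault_handler_t h);
void vfs_lib_request_page(int fd, uint64_t offset);

/* A B-tree-aware policy: on a fault, pull in the whole node (here
   assumed to span four 4 KiB pages) instead of sequential readahead. */
static void btree_fault(int fd, uint64_t file_offset)
{
    for (int i = 0; i < 4; i++)
        vfs_lib_request_page(fd, file_offset + (uint64_t)i * 4096);
}

int main(void)
{
    vfs_lib_set_fault_handler(btree_fault);
    /* ... open and map files, run the workload ... */
    return 0;
}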
gerryg400 wrote:
Kevin wrote:However, I'm not sure how to control the cache size then. When memory is running out, you probably want to throw some cached data away rather than producing an OOM error.
Me neither. It's difficult enough to see how to share and force-reclaim the pages between the memory manager and the VFS without the problem of individual processes also needing page caches. Also, shared memory still needs synchronisation, so I'm not sure how we are reducing the number of messages.
The server that owns the disk cache and handles application requests (file system or disk driver or something else, doesn't matter here) can revoke the mappings when it needs to reclaim space. Later access by the application process will simply page fault again.
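A sketch of that revocation step, assuming a hypothetical unmap call that the cache owner can aim at another process's address space: pull the page out of every sharer, and the next access simply faults and re-requests it.

#include <stdint.h>
#include <stddef.h>

struct mapping {
    int      pid;            /* process sharing this cache page */
    uint64_t vaddr;          /* where the page is mapped there */
    struct mapping *next;
};

/* Hypothetical primitives; placeholders only. */
void unmap_remote(int pid, uint64_t vaddr);
void free_frame(void *frame);

/* Reclaim one cached page: pull it out of every address space first.
   A later access in any of those processes just page faults again. */
void revoke_page(struct mapping *users, void *frame)
{
    for (struct mapping *m = users; m != NULL; m = m->next)
        unmap_remote(m->pid, m->vaddr);
    free_frame(frame);
}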

The MIT exokernel designs really are a good place to look for how to implement this sort of system. Before dismissing it because of a problem you can't solve off the top of your head, you may want to look at how they did it.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: VFS in userspace & how to improve it

Post by gerryg400 »

Rusky wrote:There are two ways to do priority inheritance that have been mentioned here. One way is to let the applications do it: when a high priority app is blocked on a lower priority one, it explicitly donates its timeslice. This is how you'd have to do it with the messaging-centric timeslice donation system several users wanted to stick with because they thought that sortie's expansion on top of it was too complicated. The other way is what sortie described: let the scheduler know which processes are blocked on which others, and have it directly schedule the blocked-on processes.
I read this and realised that you are not really paying attention.

I actually said
gerryg400 wrote:When a message is being sent, the sender is blocked and the receiver woken up inside the msg_send kernel call. In fact, if the msg is small (as it often is), it's possible to switch to the receiver's memory context right there and do the msg copy. The receiver can be boosted to the sender's priority and take over its timeslice. The scheduler need not be involved because the correct thread is already running.
The application does nothing. The kernel message passing code (and elsewhere e.g. futex, signal and interrupt code) does the work automatically. No-one suggested the application do it.
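A sketch of that send path, with all names hypothetical: the payload is copied while the sender is still current, the receiver inherits the sender's priority, and the kernel switches straight to it with the remaining timeslice, so no separate scheduling pass is needed.

#include <string.h>
#include <stddef.h>

enum state { READY, BLOCKED };

struct thread {
    int        prio;
    enum state state;
    void      *msg_buf;   /* kernel-side receive buffer */
};

/* Hypothetical primitives; placeholders only. */
void switch_address_space(struct thread *t);
void run_with_remaining_timeslice(struct thread *t);

void msg_send_small(struct thread *sender, struct thread *recv,
                    const void *payload, size_t len)
{
    /* Copy the small payload while the sender is still current. */
    memcpy(recv->msg_buf, payload, len);

    /* Priority inheritance: the receiver now works on the sender's
       behalf, so it runs at least as hot as the sender. */
    if (sender->prio > recv->prio)
        recv->prio = sender->prio;

    /* Block the sender and switch straight to the receiver, which
       finishes out the sender's timeslice; no scheduler pass needed. */
    sender->state = BLOCKED;
    switch_address_space(recv);
    run_with_remaining_timeslice(recv);
}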

Sortie never mentioned priority inheritance in his posts. It's not clear whether he uses that to solve priority inversion. I guessed that the processes he was speaking about all had the same priority (since they all seemed to run in round-robin fashion until the correct one was found).

I believe it was you who suggested Sortie's method was "a slight complication". I am of the view that, as described, it's too simple.

It possibly doesn't handle the case where the target is a multithreaded server and the target thread cannot be uniquely identified. This is quite a common case when dealing with a multithreaded server like a VFS, and choosing which thread or threads to schedule next requires intimate knowledge of the state of the message queue at the time the call is made. If you recall, in one of my earlier posts I mentioned distributing the scheduler around the kernel. By that I meant that scheduling decisions are made on the spot in the message passing code, because right then and there it's possible to determine exactly what to do.

It also may not handle the case where the server already has a waiting thread as efficiently as it might: in many cases it should be possible to deliver the message and run the receiver without involving the scheduler at all.

Sortix is a monolithic kernel. Clearly it has some advanced features and this is a good example. There is absolutely nothing wrong with what sortie has done. I will say though that in a microkernel this would not really be considered an advanced feature but just something that must be done to achieve an acceptable level of performance.
If a trainstation is where trains stop, what is a workstation ?
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: VFS in userspace & how to improve it

Post by Rusky »

Okay, you got me. max never said sortie's idea sounded like too much logic for a scheduler; Kevin, you, and willedwards didn't all talk about explicitly donating timeslices as part of passing messages; and you understood perfectly what I was talking about when I said you could also implement priority inheritance with sortie's more general method of yielding a timeslice when a thread is blocked on another, rather than only when sending a message.

Sarcasm aside, it seems like we agree that priority inheritance is a useful kernel feature that is worth its "extra" logic.