Questions about userspace FS interaction

8infy · Post by **8infy** » Sun Mar 21, 2021 5:27 am

Hi, I recently finished implementing an AHCI driver and am now starting to work on the VFS, with FAT32 as the initial filesystem. I have a few questions:

1. I already made it so that if a thread is blocked on a disk read/write request and someone tries to kill it, the kill is deferred until that thread gets unblocked.
I now realize that it might not be enough because if a thread tries to e.g. write to a file, I might have to fetch some other unrelated parts of the filesystem
like the file allocation table for FAT32, so essentially one userspace request gets potentially broken down into multiple read/write requests, and during those
I would expect the thread to be invulnerable so that it can complete the requested transaction atomically. So essentially my question is how do other kernels
handle the case where someone tries to kill a thread that's currently executing an important syscall like writing a file that can also yield/get blocked multiple
times during the request? Since all of that is asynchronous the kernel/scheduler must somehow recognize that although the thread is technically dead it's still
kinda inside a "critical section" and must still be scheduled until it's out of the "critical section".

2. I've also been thinking about how to go about implementing the disk cache. So far I'm leaning towards implementing it inside the filesystem, but I'm not completely sure.
Maybe it should be cached on multiple layers, both disk and each filesystem? What do you think is the best way to go about this?

Would really appreciate any information I could get on this one, thanks

Korona · Post by **Korona** » Sun Mar 21, 2021 5:53 am

You simply do not immediately kill a thread while it is in kernel space. That can wreak all kinds of havoc, not only file system inconsistencies. For example, the thread might hold a lock that will never get released.

Instead, you can wake up the thread, and make the blocking operation return the kernel-equivalent of EINTR. This error can then be propagated up the call stack until you reach the syscall handler again, where you kill the thread.
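As a rough illustration of that unwinding, here is a user-space model in C. Every name in it (`struct thread`, `block_on_disk`, `fs_write`, `syscall_write`, `K_EINTR`) is hypothetical and only stands in for whatever the real kernel uses; the blocking call is simulated by a flag check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical kernel-internal status codes (not from any real kernel). */
enum { K_OK = 0, K_EINTR = -4 };

struct thread {
    bool kill_pending;  /* set by whoever requests the kill */
    bool exited;        /* set only by the syscall handler */
    int  locks_held;    /* stand-in for kernel locks held by the thread */
};

/* Stand-in for a blocking wait: a pending kill wakes it early with K_EINTR. */
int block_on_disk(struct thread *t)
{
    return t->kill_pending ? K_EINTR : K_OK;
}

/* A filesystem operation made of several blocking steps. */
int fs_write(struct thread *t, int steps)
{
    int rc = K_OK;
    t->locks_held++;                 /* take a lock */
    for (int i = 0; i < steps; i++) {
        rc = block_on_disk(t);
        if (rc != K_OK)
            break;                   /* unwind instead of dying here */
    }
    t->locks_held--;                 /* released on every exit path */
    return rc;
}

/* Syscall entry point: the only place where the thread is torn down. */
int syscall_write(struct thread *t, int steps)
{
    int rc = fs_write(t, steps);
    if (t->kill_pending) {
        assert(t->locks_held == 0);  /* nothing is leaked at this point */
        t->exited = true;
    }
    return rc;
}
```

The point of the sketch is that every layer gets a chance to release its locks and leave its own state consistent before the thread goes away.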

8infy · Post by **8infy** » Sun Mar 21, 2021 5:59 am

Korona wrote:You simply do not immediately kill a thread while it is in kernel space. That can wreak all kinds of havoc, not only file system inconsistencies. For example, the thread might hold a lock that will never get released.

Instead, you can wake up the thread, and make the blocking operation return the kernel-equivalent of EINTR. This error can then be propagated up the call stack until you reach the syscall handler again, where you kill the thread.

I do understand that, I only kill a thread when it yields execution with state set to dead, holding a kernel lock also disables interrupts so it would never yield.
But when reading a disk it would yield multiple times while the request is being processed, so you have to somehow keep track of that, right?

Korona · Post by **Korona** » Sun Mar 21, 2021 6:06 am

Then you can simply add another way to disable (and re-enable) killing the thread.

I'd suggest to do it the other way around though: have killing disabled entirely in the kernel, and only kill at specific points (e.g., via a possible_thread_kill() function or whatever).
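A minimal sketch of the "kill only at specific points" idea, modelled in user-space C. Only `possible_thread_kill()` is a name from this post; `struct thread`, `thread_request_kill`, `fs_write_blocks` and `syscall_write` are assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

struct thread {
    bool kill_requested;  /* set asynchronously by the killer */
    bool dead;            /* set only at an explicit kill point */
    int  blocks_written;  /* progress made by the simulated write */
};

/* Requesting a kill only records the request; nothing is torn down. */
void thread_request_kill(struct thread *t)
{
    t->kill_requested = true;
}

/* The kill point: returns true if the thread is (now) terminated. */
bool possible_thread_kill(struct thread *t)
{
    if (t->kill_requested)
        t->dead = true;
    return t->dead;
}

/* Multi-step operation with no kill point inside, so it always completes. */
void fs_write_blocks(struct thread *t, int n, bool kill_midway)
{
    for (int i = 0; i < n; i++) {
        if (kill_midway && i == 0)
            thread_request_kill(t);  /* simulate a kill arriving mid-operation */
        t->blocks_written++;         /* the operation still runs to completion */
    }
}

/* Syscall handler: the kill point sits just before the return to user mode. */
bool syscall_write(struct thread *t, int n, bool kill_midway)
{
    fs_write_blocks(t, n, kill_midway);
    return possible_thread_kill(t);
}
```

With this arrangement the default inside the kernel is "not killable", and each kill point is a place you have consciously decided is safe.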

8infy · Post by **8infy** » Sun Mar 21, 2021 6:15 am

Korona wrote:Then you can simply add another way to disable (and re-enable) killing the thread.

I'd suggest to do it the other way around though: have killing disabled entirely in the kernel, and only kill at specific points (e.g., via a possible_thread_kill() function or whatever).

Could you elaborate a bit more on what you mean by killing at specific points?

thewrongchristian · Post by **thewrongchristian** » Sun Mar 21, 2021 6:16 am

8infy wrote:Hi, I recently finished implementing an ahci driver and now starting to work on the VFS and FAT32 as the initial filesystem, and I have a few questions:

1. I already made it so that if a thread is blocked because of a disk read/write request if anyone tries to kill it, it's deferred until that thread gets unblocked.
I now realize that it might not be enough because if a thread tries to e.g write some file I might have to fetch some other unrelated parts of the filesystem
like the file allocation table for FAT32, so essentially one userspace request gets potentially broken down into multiple read/write requests, and during those
I would expect the thread to be invulnerable so that it can complete the requested transaction atomically. So essentially my question is how do other kernels
handle the case where someone tries to kill a thread that's currently executing an important syscall like writing a file that can also yield/get blocked multiple
times during the request? Since all of that is asynchronous the kernel/scheduler must somehow recognize that although the thread is technically dead it's still
kinda inside a "critical section" and must still be scheduled until it's out of the "critical section".

By kill, do you mean the equivalent of sending a SIGINT or SIGKILL under UNIX?

The standard method of handling that in a UNIX like system would be to check for pending signals just before returning to user mode, along with having multiple sleep states for sleeping processes.

When sleeping waiting for something, a process is considered either interruptible, or uninterruptible.

An interruptible sleeping process can be woken from its sleep by a signal, which will be detected, and whatever operation the process is doing will be short-circuited. An example of an interruptible process might be a process reading from a network socket. This is considered a 'slow' operation, as we never know when the data will be available, so the sleep waiting for data is an interruptible sleep. How this interruption is handled can become complex, as you might be part way through some operation, so you have to ensure either you can undo partial operations, or have all the resources you need without further sleeps before starting to change state. This is a similar problem to exception safety in languages like C++.

An uninterruptible sleeping process cannot be woken from its sleep by a signal. It will finish waiting for the resource it is sleeping on. An example of an uninterruptible sleep might be waiting for a disk read I/O to complete (such as in your example above), which is considered a 'fast' operation as we know the I/O device will complete or error within a bounded time. So in your case, the operation will do whatever it does in the filesystem code to completion, then once done, it will check for pending signals before returning from the write system call, and act appropriately.
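The two sleep states and the check on return to user mode can be modelled roughly as below. This is a user-space sketch with invented names (`struct proc`, `proc_sleep`, `proc_signal`, `proc_event`, `check_pending_on_return`), not the code of any particular UNIX.

```c
#include <assert.h>
#include <stdbool.h>

enum sleep_kind  { SLEEP_INTERRUPTIBLE, SLEEP_UNINTERRUPTIBLE };
enum wake_reason { WOKEN_BY_EVENT, WOKEN_BY_SIGNAL, STILL_ASLEEP };

struct proc {
    bool sleeping;
    enum sleep_kind kind;
    bool signal_pending;
    enum wake_reason reason;
};

/* Put the process to sleep in the given state. */
void proc_sleep(struct proc *p, enum sleep_kind kind)
{
    p->sleeping = true;
    p->kind = kind;
    p->reason = STILL_ASLEEP;
}

/* Signal delivery: always recorded, but wakes only interruptible sleepers. */
void proc_signal(struct proc *p)
{
    p->signal_pending = true;
    if (p->sleeping && p->kind == SLEEP_INTERRUPTIBLE) {
        p->sleeping = false;
        p->reason = WOKEN_BY_SIGNAL;
    }
}

/* The awaited event (e.g. disk I/O completion) wakes any sleeper. */
void proc_event(struct proc *p)
{
    if (p->sleeping) {
        p->sleeping = false;
        p->reason = WOKEN_BY_EVENT;
    }
}

/* Run just before returning to user mode; true means a signal must be handled. */
bool check_pending_on_return(struct proc *p)
{
    bool had_signal = p->signal_pending;
    p->signal_pending = false;
    return had_signal;
}
```

A signal sent during an uninterruptible disk wait is therefore not lost: it stays pending and is acted on once the write system call is about to return.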

8infy wrote: 2. I've also been thinking about how to go about implementing the disk cache. So far i'm leaning towards implementing it inside the filesystem, but i'm not completely sure.
Maybe it should be cached on multiple layers, both disk and each filesystem? What do you think is the best way to go about this?

Would really appreciate any information I could get on this one, thanks

In order to provide coherence with demand paged mmap data, almost any modern OS will cache file data at the page level. So in a UNIX like system, with files managed using vnodes in a VFS, pages will be cached using the vnode/offset as the key. Then, both the read/write system calls and the virtual memory subsystem, will reference data using the vnode/offset key, and see the same data, thus ensuring coherence between data mapped into address spaces and data read/written using file handles.

This wasn't always the case. Early UNIX cached at the device block level, and early mmap-based systems essentially duplicated the data cached at the device block level in user page mappings. The result was that data written using write system calls would end up in the device block buffer cache, but not in paged memory mappings of the same file.

So, in answer to your question, data caching is best handled in the VFS layer, where it can be managed on a per-vnode basis, using the vnode/offset as a key.
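To make the vnode/offset key concrete, here is a small hash-table sketch in C. The structure names, the 64-bucket table and the zero-fill on a miss are all assumptions for illustration; a real kernel would ask the filesystem to fill the page on a miss and would need locking and eviction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE   4096u
#define CACHE_SLOTS 64u

struct vnode { int id; };  /* hypothetical, stands in for a real vnode */

struct cached_page {
    const struct vnode *vn;        /* first half of the key */
    uint64_t index;                /* second half: offset / PAGE_SIZE */
    unsigned char data[PAGE_SIZE];
    struct cached_page *next;      /* hash chain */
};

static struct cached_page *buckets[CACHE_SLOTS];

static size_t slot_for(const struct vnode *vn, uint64_t index)
{
    return (size_t)((((uintptr_t)vn >> 4) ^ index) % CACHE_SLOTS);
}

/* Return the page covering (vn, offset), creating a zero-filled one on a miss.
 * Returns NULL only if memory allocation fails. */
struct cached_page *page_cache_get(const struct vnode *vn, uint64_t offset)
{
    uint64_t index = offset / PAGE_SIZE;
    size_t slot = slot_for(vn, index);

    for (struct cached_page *p = buckets[slot]; p != NULL; p = p->next)
        if (p->vn == vn && p->index == index)
            return p;              /* hit: every caller sees the same page */

    struct cached_page *p = calloc(1, sizeof *p);
    if (p == NULL)
        return NULL;
    p->vn = vn;
    p->index = index;
    p->next = buckets[slot];
    buckets[slot] = p;
    return p;
}
```

Because both the read/write path and the page-fault path would call the same lookup with the same key, they end up sharing one copy of the data.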

For filesystem meta-data, you also need a buffer mechanism that operates under this file cache layer. It can also, if you prefer, use the same vnode/offset cache as the data cache, which might be useful to avoid duplicating code, but you have to be careful about double mapping file data, so that it doesn't get cached at both the file vnode level and the device vnode level. Block devices also don't always use page sized blocks of data, so for that reason as well it might be worth having a different device block buffer interface distinct from the page cache interface.

8infy · Post by **8infy** » Sun Mar 21, 2021 6:27 am

I think I understand what you mean, so if it's an interruptible sleep the scheduler would interrupt it on a signal, and if it's not, the thread would check whether it got e.g. a SIGKILL while it was in an uninterruptible state and kill itself on return?

thewrongchristian · Post by **thewrongchristian** » Fri Mar 26, 2021 6:55 am

8infy wrote: I think I understand what you mean, so if it's an interruptible sleep the scheduler would interrupt it on a signal, and if it's not, the thread would check whether it got e.g. a SIGKILL while it was in an uninterruptible state and kill itself on return?

Exactly.

Interruptible system calls, like read(), will return EINTR if they're interrupted using the above mechanism (if the signal doesn't kill the process.)

On some systems, such as BSD, such system calls are restarted by default, so the calling code may never see the EINTR, and the system call would be called again after the signal has been delivered and handled in the calling process.

SysV, on the other hand, doesn't restart system calls by default, and returns EINTR.

In both, the behavior can be specified using the POSIX sigaction() and the SA_RESTART flag when specifying how to handle a signal.
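A small user-space demonstration of the difference, using only standard POSIX calls (`sigaction`, `setitimer`, `pipe`, `read`, `write`). The helper name `read_during_signal` and the 50 ms timer are choices made for this example. The handler writes one byte into the pipe so that a restarted `read()` has something to return.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int wake_fd = -1;  /* write end of the pipe, used by the handler */

static void on_alarm(int sig)
{
    int saved_errno = errno;
    char byte = 'x';
    (void)sig;
    if (write(wake_fd, &byte, 1) < 0) {  /* write() is async-signal-safe */
        /* nothing useful can be done inside a handler */
    }
    errno = saved_errno;
}

/* Block in read() on an empty pipe while SIGALRM fires.
 * Returns the byte count from read(), or -errno on failure. */
int read_during_signal(int restart)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -errno;
    wake_fd = fds[1];

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = restart ? SA_RESTART : 0;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -errno;

    struct itimerval timer;
    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50000;      /* one shot after 50 ms */
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return -errno;

    char buf;
    ssize_t n = read(fds[0], &buf, 1);   /* interrupted by SIGALRM */
    int result = (n < 0) ? -errno : (int)n;

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling `read_during_signal(0)` should report `-EINTR`, because the handler was installed without `SA_RESTART`. Calling `read_during_signal(1)` should return 1, because the call is restarted after the handler runs and then finds the byte the handler wrote.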

POSIX also defines how partial operations are handled. Since POSIX.1-2001, if an interrupted read() had partially read something into the user provided buffer, then a short read is returned instead of EINTR. Prior to POSIX.1-2001, this behavior was left unspecified, and SysV would return EINTR and BSD would return a short read.

rdos · Post by **rdos** » Sat Mar 27, 2021 2:15 pm

thewrongchristian wrote: For filesystem meta-data, you also need a buffer mechanism that operates under this file cache layer. It can also, if you prefer, use the same vnode/offset cache as the data cache, which might be useful to avoid duplicating code, but you have to be careful about double mapping file data, so that it doesn't get cached at both the file vnode level and the device vnode level. Block devices also don't always use page sized blocks of data, so for that reason as well it might be worth having a different device block buffer interface distinct from the page cache interface.

This is a real problem. If you map device data for a file in user mode, you are potentially exposing other information from the file system that doesn't belong to the file. The typical sector size is 512 bytes, while the page size is 4k, meaning that eight sectors fit into a page. If you map the disc device on 4k boundaries in the device cache, then there is no guarantee that file contents will start at offset 0 in a page. Also, with smaller cluster sizes, file data can be scattered in different clusters that are at arbitrary positions in 4k pages, meaning that you cannot guarantee that you can map file data in a contiguous data area without copying it. So if you want to create a zero-copy implementation, you will need to be able to handle the case where file fragments start and end at arbitrary positions within 4k pages, and you get potential security problems by exposing non-file data to user mode. One possible measure to reduce this problem is to map file data as read-only (or copy-on-write), and then it is at least possible to stop user mode from writing to non-file data parts of pages.
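The alignment arithmetic behind this can be written down in a few lines of C. The helper names are invented for the example, and it assumes the 512-byte sectors and 4k pages mentioned above, with the device cache mapping the disc on 4k boundaries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE      512u
#define PAGE_SIZE        4096u
#define SECTORS_PER_PAGE (PAGE_SIZE / SECTOR_SIZE)  /* 8 sectors per page */

/* Byte offset, inside its device-cache page, at which sector `lba` starts. */
uint32_t offset_in_page(uint64_t lba)
{
    return (uint32_t)((lba % SECTORS_PER_PAGE) * SECTOR_SIZE);
}

/* True if the extent covers only whole device pages, i.e. it could be mapped
 * into user space without exposing any sectors outside the extent. */
bool extent_is_page_aligned(uint64_t first_lba, uint64_t sector_count)
{
    return sector_count != 0 &&
           first_lba % SECTORS_PER_PAGE == 0 &&
           sector_count % SECTORS_PER_PAGE == 0;
}
```

For instance, a cluster starting at sector 11 begins 1536 bytes into its page, so mapping that page directly would also expose sectors 8 to 10, which may belong to something else.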

rdos · Post by **rdos** » Sat Mar 27, 2021 2:23 pm

8infy wrote:Hi, I recently finished implementing an ahci driver and now starting to work on the VFS and FAT32 as the initial filesystem, and I have a few questions:

This is a bit backwards. You need to define how your disc driver interface (and disc buffering) works before you start to write drivers. AHCI is particularly interesting in this regard since it operates on physical memory addresses and not linear ones. I'm reimplementing my disc interface & drivers so they use a physical memory interface rather than a linear one.

8infy wrote: 1. I already made it so that if a thread is blocked because of a disk read/write request if anyone tries to kill it, it's deferred until that thread gets unblocked.
I now realize that it might not be enough because if a thread tries to e.g write some file I might have to fetch some other unrelated parts of the filesystem
like the file allocation table for FAT32, so essentially one userspace request gets potentially broken down into multiple read/write requests, and during those
I would expect the thread to be invulnerable so that it can complete the requested transaction atomically. So essentially my question is how do other kernels
handle the case where someone tries to kill a thread that's currently executing an important syscall like writing a file that can also yield/get blocked multiple
times during the request? Since all of that is asynchronous the kernel/scheduler must somehow recognize that although the thread is technically dead it's still
kinda inside a "critical section" and must still be scheduled until it's out of the "critical section".

You cannot kill threads that are in kernel. Actually, I don't support killing threads at all, rather the thread itself must determine that it should terminate.

8infy wrote: 2. I've also been thinking about how to go about implementing the disk cache. So far i'm leaning towards implementing it inside the filesystem, but i'm not completely sure.
Maybe it should be cached on multiple layers, both disk and each filesystem? What do you think is the best way to go about this?

I implement the disk cache as a separate module.

OSDev.org
