How to handle read & write failures in a file system

rdos · Post by **rdos** » Sun Aug 08, 2021 6:29 am

Korona wrote:I don't get the "it slows down fs code" argument. It's just checking for a return value and propagating it, right? That's negligible compared to the cost of the syscall anyway. Plus, you probably need to check for other errors in the fs code anyway (e.g., out of disk space).

The file read operation adds a large set of sectors to the disc system, and when all are done translates them to physical addresses before returning them mapped to the application. If one of the sectors is bad, the whole operation should not fail. Thus, when translating the sectors to physical addresses the code would also need to check if the particular sector was read correctly. I don't want to pollute the cache with this information.

A better idea probably is to keep a list of known bad sectors and compare every sector to this. If the list is empty (usual case), no test will be done.
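A rough sketch of that idea in C, assuming a small fixed-capacity list (the names `bad_sector_list`, `bad_list_add` and `range_has_bad_sector` are made up for illustration and are not RDOS code). The point is only that the usual case costs one comparison against zero:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD_SECTORS 64            /* hypothetical fixed capacity */

struct bad_sector_list {
    uint64_t lba[MAX_BAD_SECTORS];    /* sector numbers known to be bad */
    size_t   count;                   /* 0 in the usual case */
};

/* Record a sector as bad. Returns false only if the list is full. */
static bool bad_list_add(struct bad_sector_list *l, uint64_t lba)
{
    for (size_t i = 0; i < l->count; i++)
        if (l->lba[i] == lba)
            return true;              /* already recorded */
    if (l->count == MAX_BAD_SECTORS)
        return false;
    l->lba[l->count++] = lba;
    return true;
}

/* True if any sector in [first, first + n) is known to be bad. */
static bool range_has_bad_sector(const struct bad_sector_list *l,
                                 uint64_t first, uint64_t n)
{
    if (l->count == 0)
        return false;                 /* fast path: nothing to compare */
    for (size_t i = 0; i < l->count; i++)
        if (l->lba[i] >= first && l->lba[i] - first < n)
            return true;
    return false;
}
```

A linear scan is assumed to be good enough because the list is expected to stay tiny; a sorted array or a hash would be the obvious change if it grew.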

thewrongchristian · Post by **thewrongchristian** » Sun Aug 08, 2021 10:34 am

rdos wrote:
Korona wrote:I don't get the "it slows down fs code" argument. It's just checking for a return value and propagating it, right? That's negligible compared to the cost of the syscall anyway. Plus, you probably need to check for other errors in the fs code anyway (e.g., out of disk space).
The file read operation adds a large set of sectors to the disc system, and when all are done translates them to physical addresses before returning them mapped to the application. If one of the sectors is bad, the whole operation should not fail.

Well that's where you and everyone else here who has responded differs.

If one of the sectors is bad, it doesn't matter if it's just a subset of the data that's bad, the read operation has failed to read the data required, and your application has to be told about it and act accordingly.

Now, it might be that your application can recover, binary chop and retry subsets of the read until it's isolated the bad data, and may be able to continue with the good data it has read. But there are unlikely to be many applications that can work this way. Most will require all the data read to be intact.
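For what the binary chop could look like on the application side, here is a minimal sketch in C. It assumes an all-or-nothing read primitive passed in as a callback (`read_fn` is a stand-in for something like a `pread` wrapper), and `fake_disk` / `fake_read` exist only so the example can be run without a failing drive. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed primitive: reads exactly len bytes at off, 0 on success, -1 on error. */
typedef int (*read_fn)(void *ctx, uint64_t off, void *buf, size_t len);

/* Read [off, off + len), bisecting on failure down to `unit` bytes.
 * Unreadable units are zero-filled. Returns the number of bad units,
 * so the caller still learns that the data is incomplete. */
static size_t read_salvage(read_fn rd, void *ctx, uint64_t off,
                           unsigned char *buf, size_t len, size_t unit)
{
    if (len == 0 || rd(ctx, off, buf, len) == 0)
        return 0;                          /* whole range is good */
    if (len <= unit) {
        memset(buf, 0, len);               /* isolated one bad unit */
        return 1;
    }
    /* Split on a unit boundary: 0 < half < len because len > unit here. */
    size_t units = (len + unit - 1) / unit;
    size_t half  = (units / 2) * unit;
    return read_salvage(rd, ctx, off, buf, half, unit)
         + read_salvage(rd, ctx, off + half, buf + half, len - half, unit);
}

/* In-memory stand-in for a device with one unreadable region (demo only). */
struct fake_disk {
    const unsigned char *data;
    uint64_t bad_off;                      /* start of the unreadable region */
    uint64_t bad_len;                      /* its length in bytes */
};

static int fake_read(void *ctx, uint64_t off, void *buf, size_t len)
{
    const struct fake_disk *d = ctx;
    if (off < d->bad_off + d->bad_len && d->bad_off < off + len)
        return -1;                         /* request touches the bad region */
    memcpy(buf, d->data + off, len);
    return 0;
}
```

Each failed range costs extra read attempts, which is why this only makes sense for the few applications that can actually use partial data.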

If your application has to be tolerant of failure, then that is where redundancy comes in, and redundant data can come in the form of RAID, preferably with end to end checksums to detect bad data, like ZFS.

With RAID, a bad sector can be spotted, and a copy of the data used instead and returned to your application. Now, that sector read error can be handled in some sort of event management interface as you advocate while the application continues on as normal, but the difference here is that the OS HAS read valid data and passed that to the application, and hidden the error because it can.

But that is a long way from what you're suggesting, which is to ignore the error and pass incorrect data to the application even if no redundant copy of the data is available. That is bad news.

Ethin · Post by **Ethin** » Sun Aug 08, 2021 1:36 pm

Yeah, this whole error handling argument doesn't make sense to me. Conditional checks that supposedly "pollute" the cache are minimal compared to syscalls or TLB shootdowns (hell, a cache line flush on a single processor is more expensive). Considering the fact that you've got speculative execution and branch prediction to speed up your code even further, each conditional check would most likely only cost you a cycle or two -- you'd be at that point bit-comparing registers anyway, which is intrinsically fast. If that was slow, mainstream OSes wouldn't do it. Even OpenBSD/FreeBSD handle errors the same way -- if a bad sector is read the entire operation fails. And OpenBSD is one of those OSes that wouldn't hesitate to change something out from under you when it suits them, so I'd think that if that was truly extremely costly as you say, they would be the first to change it. It's better to "fail fast" instead of delivering invalid data to an application and requiring it to go through hoops to figure out why it got invalid data.

rdos · Post by **rdos** » Sun Aug 08, 2021 2:12 pm

A very good case where this "a bad sector kills everything" approach is super-stupid is logs. If a sector is wrong in a log-file, you certainly don't want the whole log to become unreadable. Rather, you want to ignore the error and continue to read after it, and you also want to continue to add new data to the log. For text files, a sector with zeros typically will add a null-terminator. For binary files, you will typically have checksums or the layout is block-oriented. Here too, you don't want to lose the whole content just because a single sector is bad. The same with executable files. You don't want to get an error when loading it just because a single sector is bad. The bad sector might not be used. The same is relevant for imported functions that are missing. They might not be used.

rdos · Post by **rdos** » Sun Aug 08, 2021 2:22 pm

Ethin wrote:Even OpenBSD/FreeBSD handle errors the same way -- if a bad sector is read the entire operation fails. And OpenBSD is one of those OSes that wouldn't hesitate to change something out from under you when it suits them, so I'd think that if that was truly extremely costly as you say, they would be the first to change it. It's better to "fail fast" instead of delivering invalid data to an application and requiring it to go through hoops to figure out why it got invalid data.

I find the Unix file API horrible. A horrible legacy of the 70s that should long have been replaced with something better.

Octocontrabass · Post by **Octocontrabass** » Sun Aug 08, 2021 3:51 pm

No one is arguing that you shouldn't have a way to continue despite disk errors.

What we're saying is that disk errors must not be silent. If part of a file is unreadable, it should be up to the application to decide whether it can safely continue or fail the entire operation.

Ethin · Post by **Ethin** » Sun Aug 08, 2021 4:45 pm

I think the "loading an executable" analogy is a bad one. It doesn't hold up because the program loader *must* fail when a bad sector is read. You should *never* load an executable into RAM and run that code if a disk error occurs and a sector can't be read but is required for a program to be fully in RAM. The program loader can't guarantee that that sector is unused. If that sector is in fact used (either as code or data) and it doesn't load and is replaced with zeros, or something else, you're going to have undefined behavior that could majorly alter the flow of the program. In that instance, failing fast is preferable. Otherwise you risk (deliberately) introducing a security risk all because a sector was bad. If the sector had instructions in it and you replace it with zeros (which appears to be the ADD instruction) then you may or may not get a no-op or page fault. But if you fault, what do you do? If the sector is truly bad then you might not be able to actually read that sector of instructions. But if you're able to read it, you then need to poke at the memory of the program to insert the new instruction stream and then continue execution there -- and that's a very nasty hack. If you replace it with garbage, then the result is truly unpredictable and for all you know you might have a jump instruction that jumps into kernel code.
If the sector is data, on the other hand, you have no idea what you've just altered. It could be a string or a numeric constant. It could be a memory address. And so on. But I think I've illustrated just why you *shouldn't* do this.

rdos · Post by **rdos** » Mon Aug 09, 2021 2:47 am

Ethin wrote: If the sector had instructions in it and you replace it with zeros (which appears to be the ADD instruction) then you may or may not get a no-op or page fault. But if you fault, what do you do?

All faults in the application will first verify zip files with the central storage and then will verify that all unzipped files match those in the zip file. If a file doesn't match the zip, then it is recreated. This process is looped until all files match, and then the application is restarted. However, if the application just exits, this is interpreted as a temporary error and so no verification is done, and the system goes into a reboot loop.

However, I think that for executables, it's better to set bad sectors to 0xCC, which is int 3, and which will always create a fault (unless a debugger is attached). When used as the address 0xCCCCCCCC it will fault too.
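As a small illustration of the fill-pattern choice, the sketch below (C, with made-up names `fill_policy`, `fill_bad_sector` and `load_u32`, and an assumed 512-byte sector) fills an unreadable sector either with zeros or with 0xCC, the one-byte x86 INT3 opcode. Any four bytes of the 0xCC pattern read back as the value 0xCCCCCCCC:

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u                  /* assumed logical sector size */

enum fill_policy {
    FILL_ZERO,                            /* data files: zeros */
    FILL_INT3                             /* executables: 0xCC bytes */
};

/* Overwrite one sector-sized buffer with the pattern chosen for bad sectors. */
static void fill_bad_sector(unsigned char *sector, enum fill_policy policy)
{
    memset(sector, policy == FILL_INT3 ? 0xCC : 0x00, SECTOR_SIZE);
}

/* Read four bytes from the buffer as a 32-bit value, as a program would
 * when it loads a pointer or constant from the damaged region. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Whether 0xCCCCCCCC is actually an unmapped address depends on the memory layout of the system in question; that part is an assumption carried over from the post.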

rdos · Post by **rdos** » Mon Aug 09, 2021 3:06 am

Octocontrabass wrote:No one is arguing that you shouldn't have a way to continue despite disk errors.

What we're saying is that disk errors must not be silent. If part of a file is unreadable, it should be up to the application to decide whether it can safely continue or fail the entire operation.

I think that applications must always be prepared for the scenario that files do not contain what the application thinks they should contain. One reason might be that an improper file name is passed, with completely different contents from what the application expects. Another reason for corrupt files might be invalid cluster chains in FAT, and the operating system cannot realistically inform the application of this since it will only discover the problem by scanning the file system. So, what I'm suggesting is that since the application must be tolerant of non-expected file contents without being informed with errors from file-IO anyway, we might just as well signal bad sectors by loading predefined patterns in the file (like zeros).

I also think that the model that some people use for file-IO is very outdated. It might have been that in the 70s a file read actually did read stuff from the disc directly, and thus could easily give back error codes. I suspect even MS-DOS might have used this method. However, I'm sure no modern OS does it this way because of the poor performance. A typical modern OS caches disc contents in RAM, and also tries to read as many sectors as possible in each request. This means that the file operation doesn't directly correspond to disc activity, and so the physical disc driver cannot directly send back error codes to the application. The sectors could also be cached, in which case they will not be reread. I'm also pretty sure that all modern OSes also cache file contents. For writes, the OS must make sure that many small writes to a flash disc do not result in many sector writes to the disc, and so it must handle write operations lazily, only buffering them. This means that file-write will not get meaningful error feedback.

In the physical disc driver, a failed multisector read (or write) must be redone as many single sector operations in order to figure out exactly which sector is bad. Since I only cache 4k blocks, and typical sectors are 512 bytes, it means that in the disc cache I will need 8 new int fields for the error codes. Today, I only have a single 64-bit int for the physical address of the 4k block. Saving error codes per sector therefore will waste lots of memory for stuff that is rarely used.
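The retry step can be sketched as follows, assuming 512-byte sectors and a driver callback `sector_read_fn` that fails the whole request if any sector in it is bad (both are assumptions, as is every name here). Instead of eight int fields, this variant keeps only which of the 8 sectors failed, which fits in one byte per 4k block; the actual error codes are not stored. `fake_dev` / `fake_rd` are a stand-in device so the sketch runs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE        512
#define SECTORS_PER_BLOCK  8              /* 4 KiB block / 512-byte sectors */

/* Assumed driver primitive: 0 on success, -1 if any requested sector fails. */
typedef int (*sector_read_fn)(void *dev, uint64_t lba, uint32_t count,
                              unsigned char *buf);

/* Read one 4 KiB block. On a failed multisector read, retry sector by
 * sector and return a bitmask of the sectors that failed (bit i = sector i).
 * A return value of 0 means the whole block is good. */
static uint8_t read_block(sector_read_fn rd, void *dev, uint64_t first_lba,
                          unsigned char *buf)
{
    if (rd(dev, first_lba, SECTORS_PER_BLOCK, buf) == 0)
        return 0;
    uint8_t bad = 0;
    for (uint32_t i = 0; i < SECTORS_PER_BLOCK; i++) {
        unsigned char *dst = buf + (size_t)i * SECTOR_SIZE;
        if (rd(dev, first_lba + i, 1, dst) != 0) {
            memset(dst, 0, SECTOR_SIZE);  /* predefined pattern for bad data */
            bad |= (uint8_t)(1u << i);
        }
    }
    return bad;
}

/* Stand-in device with a single bad sector (demo only). */
struct fake_dev { uint64_t bad_lba; };

static int fake_rd(void *dev, uint64_t lba, uint32_t count, unsigned char *buf)
{
    const struct fake_dev *d = dev;
    if (d->bad_lba >= lba && d->bad_lba - lba < count)
        return -1;
    memset(buf, 0xAB, (size_t)count * SECTOR_SIZE);
    return 0;
}
```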

thewrongchristian · Post by **thewrongchristian** » Mon Aug 09, 2021 4:25 am

rdos wrote:
Ethin wrote: If the sector had instructions in it and you replace it with zeros (which appears to be the ADD instruction) then you may or may not get a no-op or page fault. But if you fault, what do you do?
All faults in the application will first verify zip files with the central storage and then will verify that all unzipped files match those in the zip file. If a file doesn't match the zip, then it is recreated. This process is looped until all files match, and then the application is restarted. However, if the application just exits, this is interpreted as a temporary error and so no verification is done, and the system goes into a reboot loop.

However, I think that for executables, it's better to set bad sectors to 0xCC, which is int 3, and which will always create a fault (unless a debugger is attached). When used as the address 0xCCCCCCCC it will fault too.

And you're worried about the performance impact of checking individual sector errors?

Who is doing this checking? If it's at user level, then all well and good. This sounds very much like policy that needs to be kept out of the kernel.

But, what happens if, because you don't propagate IO errors through read, your zip files on central storage are corrupt due to a bad sector, you'll unzip garbage (which will probably not work anyway)?

Part of the user/OS system call contract needs to be a reliable way of transferring data, and knowing that the data you read is intact. If you can't rely on that without jumping through extra hoops in user land, then you've just made extra work for yourself.

The POSIX contract for read is "I successfully read what you told me, or I tried my best but it failed."

Your OS contract for read is basically "I tried to read what you told me, but it may or may not have read successfully, so I may have left garbage for some or all of the data, and you might want to verify all the data just read to see if it makes sense to your application."

rdos · Post by **rdos** » Mon Aug 09, 2021 5:56 am

thewrongchristian wrote: And you're worried about the performance impact of checking individual sector errors?

The biggest impact actually is that the disc cache will consume several times more memory with error codes per sector. Something that directly affects performance due to fewer sectors buffered.

thewrongchristian wrote: Who is doing this checking? If it's at user level, then all well and good. This sounds very much like policy that needs to be kept out of the kernel.

When the system starts, it will start a loader. The loader has control over faults and the software that runs. However, it needs useful input from the application if it should do reasonable actions, and the application just exiting is not useful.

thewrongchristian wrote: But, what happens if, because you don't propagate IO errors through read, your zip files on central storage are corrupt due to a bad sector, you'll unzip garbage (which will probably not work anyway)?

The loader fetches the zip files with TCP/IP. It will also get an MD5 sum so it can verify zip files. The loader will make sure that it has rebooted the machine before it compares files, and if something is recreated, it will check if things are right after a reboot so it knows that the contents are correct on the disc, and not just in the disc cache.

thewrongchristian wrote: Part of the user/OS system call contract needs to be a reliable way of transferring data, and knowing that the data you read is intact. If you can't rely on that without jumping through extra hoops in user land, then you've just made extra work for yourself.

MD5 signatures will do that.

thewrongchristian wrote: The POSIX contract for read is "I successfully read what you told me, or I tried my best but it failed."

Your OS contract for read is basically "I tried to read what you told me, but it may or may not have read successfully, so I may have left garbage for some or all of the data, and you might want to verify all the data just read to see if it makes sense to your application."

With the POSIX contract, you just get a failure that a file cannot be read. This doesn't give you the correct contents, and if the app doesn't check error codes properly, then it might just continue with garbage. The POSIX contract also doesn't tell you if there are related problems in the filesystem, like crosslinked clusters, an unreadable FAT, or an invalid cluster chain. You need an event function for this. Also, if you only rely on POSIX, then all the errors are dispersed within many parts of the application, and so a supervisor cannot get notifications that something is wrong, and when doing its best to correct the error, will typically start the application again. With the same result.

It's a bit like USB errors. You cannot let those propagate to USB functions because then the general picture of USB errors will be lost as the USB functions just will do their best to correct the errors, and then will discard them. Thus, you will be lost about if your USB hardware operates properly or not. It's the same with disc devices. They should provide a generic event interface so a supervisor can get a good picture of the condition of your disc drives.

So, in reality, my situation is that the read will complete if the file size allows, but the supervisor (loader) will be notified there is a disc problem, and the VFS might try to fix it by modifying the cluster chain and marking the bad cluster as bad.
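A generic event interface of the kind described could be as simple as a bounded queue that the disc driver and VFS post to and the supervisor polls. The sketch below is C with entirely hypothetical names (`disk_event`, `event_post`, `event_poll`); it is single-threaded and leaves out the locking or interrupt masking a real kernel would need.

```c
#include <stdbool.h>
#include <stdint.h>

enum disk_event_kind {
    DISK_EV_READ_ERROR,
    DISK_EV_WRITE_ERROR,
    DISK_EV_FS_INCONSISTENT               /* e.g. an invalid cluster chain */
};

struct disk_event {
    enum disk_event_kind kind;
    uint32_t drive;                       /* which disc reported the event */
    uint64_t lba;                         /* affected sector, if any */
};

#define EVENT_QUEUE_SIZE 16u              /* must be a power of two */

struct event_queue {
    struct disk_event ev[EVENT_QUEUE_SIZE];
    uint32_t head;                        /* next slot to poll */
    uint32_t tail;                        /* next slot to fill */
    uint32_t dropped;                     /* events lost because the queue was full */
};

/* Called by the driver or VFS. Never blocks; counts events it has to drop. */
static bool event_post(struct event_queue *q, struct disk_event e)
{
    if (q->tail - q->head == EVENT_QUEUE_SIZE) {
        q->dropped++;
        return false;
    }
    q->ev[q->tail++ % EVENT_QUEUE_SIZE] = e;
    return true;
}

/* Called by the supervisor. Returns false when no event is pending. */
static bool event_poll(struct event_queue *q, struct disk_event *out)
{
    if (q->head == q->tail)
        return false;
    *out = q->ev[q->head++ % EVENT_QUEUE_SIZE];
    return true;
}
```

Nothing here prevents the read call from also returning an error to the application; the queue only gives the supervisor the overall picture.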

Octocontrabass · Post by **Octocontrabass** » Mon Aug 09, 2021 9:18 am

rdos wrote:Since I only cache 4k blocks, and typical sectors are 512 bytes,

No, typical sectors are bigger than that. Magnetic disks all use 4kB sectors now, and most flash disks use 4kB sectors too.

rdos wrote:Saving error codes per sector therefore will waste lots of memory for stuff that is rarely used.

Then don't save them. Error codes are only useful to the supervisor, all the application needs to know is that something failed.

Ethin · Post by **Ethin** » Mon Aug 09, 2021 9:22 am

Wait wait wait. I'm confused. When did zip files enter into this discussion? I was talking about executable files, not compressed archives. Unless you're telling me that:

Your filesystem is just a bunch of zip archives; or
Every executable will be associated with a zip archive (which doesn't make sense, what if I just compiled it?)

rdos · Post by **rdos** » Mon Aug 09, 2021 1:23 pm

Ethin wrote:Wait wait wait. I'm confused. When did zip files enter into this discussion? I was talking about executable files, not compressed archives. Unless you're telling me that:

Your filesystem is just a bunch of zip archives; or

Every executable will be associated with a zip archive (which doesn't make sense, what if I just compiled it?)

It's how our embedded system works (payment terminal). The loader also handles automatic upgrades. All of it is part of the application.

rdos · Post by **rdos** » Mon Aug 09, 2021 1:29 pm

Octocontrabass wrote:
rdos wrote:Since I only cache 4k blocks, and typical sectors are 512 bytes,
No, typical sectors are bigger than that. Magnetic disks all use 4kB sectors now, and most flash disks use 4kB sectors too.

True, but if you use FAT, then logical sectors in the FS are still 512. CDs use larger sectors though (2k?).

Octocontrabass wrote:
rdos wrote:Saving error codes per sector therefore will waste lots of memory for stuff that is rarely used.
Then don't save them. Error codes are only useful to the supervisor, all the application needs to know is that something failed.

I just got the idea that I can use a single bit to indicate some faulty content. When the code sees this bit set, it can then search through a bad sector list to resolve which sectors are good and which are bad. Then I can return failure on bad content, but still fill the data buffer with zeros.
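One place such a bit could live, sketched in C: if the cache entry is a 64-bit physical address of a 4k-aligned block, the low 12 bits are always zero and one of them can carry the flag. That alignment is an assumption about the cache layout, and the names (`entry_make`, `entry_has_bad`, `read_status`) are invented for the example. The slow path that consults the bad sector list and zero-fills is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRY_BAD_BIT   UINT64_C(1)            /* block contains a bad sector */
#define ENTRY_ADDR_MASK (~UINT64_C(0xFFF))     /* 4 KiB-aligned physical address */

/* Pack a 4 KiB-aligned physical address and the "has bad content" flag. */
static inline uint64_t entry_make(uint64_t phys, bool has_bad)
{
    return (phys & ENTRY_ADDR_MASK) | (has_bad ? ENTRY_BAD_BIT : 0);
}

static inline uint64_t entry_phys(uint64_t entry)
{
    return entry & ENTRY_ADDR_MASK;
}

static inline bool entry_has_bad(uint64_t entry)
{
    return (entry & ENTRY_BAD_BIT) != 0;
}

/* Status for a read that spans n cache entries: 0 if every block is clean,
 * -1 if any block is flagged. The data is returned either way; only the
 * flagged case would go on to search the bad sector list. */
static int read_status(const uint64_t *entries, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (entry_has_bad(entries[i]))
            return -1;
    return 0;
}
```

The cache entry stays a single 64-bit value, so the memory cost that motivated the objection does not change.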

OSDev.org
