Consistency checking FAT
How is this typically done? Do FS drivers do this automatically or rely on user intervention?
There seem to be two common problems in FAT:
* Crosslinked clusters
* Orphaned clusters
My idea is that if I use a bitmap for cluster allocation, I could also use it (initially) to discover the two problems above in the FS. If I follow all cluster chains for files & directories, setting the corresponding bit in the bitmap for every cluster, then I will notice crosslinking when a bit is already set. I can also discover clusters that are allocated in the FAT but not used by any chain by comparing the bitmap with the FAT entries, and free those. Crosslinking is probably best fixed by copying the contents so the objects get separate areas.
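A minimal sketch of that bitmap pass for FAT32 could look like the following. fat_next(), fat_set(), total_clusters and bitmap are assumed helpers and globals here, not any real API:

#include <stdint.h>

#define FREE     0
#define EOC_MIN  0x0FFFFFF8   /* start of the FAT32 end-of-chain range */

extern uint32_t fat_next(uint32_t cluster);                /* read FAT[cluster]  */
extern void     fat_set(uint32_t cluster, uint32_t value); /* write FAT[cluster] */
extern uint32_t total_clusters;
extern uint8_t *bitmap;       /* one bit per cluster, zeroed before the pass */

static int test_and_set(uint32_t c)
{
    uint8_t mask = (uint8_t)(1u << (c & 7));
    int was_set = bitmap[c >> 3] & mask;
    bitmap[c >> 3] |= mask;
    return was_set;
}

/* Walk one chain, marking every cluster. Returns the first crosslinked
   cluster, or 0 if the chain is clean. A ring inside a chain revisits a
   marked cluster, so loops are caught by the same test. */
uint32_t mark_chain(uint32_t first)
{
    for (uint32_t c = first; c >= 2 && c < EOC_MIN; c = fat_next(c))
        if (test_and_set(c))
            return c;
    return 0;
}

/* After every file and directory chain has been marked: a cluster that
   is non-free in the FAT but unmarked in the bitmap is orphaned. */
void free_orphans(void)
{
    for (uint32_t c = 2; c < total_clusters + 2; c++)
        if (fat_next(c) != FREE && !(bitmap[c >> 3] & (1u << (c & 7))))
            fat_set(c, FREE);
}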
There are also three other problems that are correctable:
* Incorrect cluster chain for directories. This can be discovered by finding non-zero entries after zero entries, and fixed by truncating the cluster chain to omit those.
* Cluster chains for files that map more than the file size. This is also fixed by truncating the cluster chain.
* File size that exceeds the cluster chain size. This is fixed by changing the file size to the size of the cluster chain (both size fixes are sketched below).
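A sketch of the two size fixes, reusing the hypothetical fat_next()/fat_set() helpers from the previous snippet:

#include <stdint.h>

#define FREE     0
#define EOC      0x0FFFFFFF   /* end-of-chain value to write            */
#define EOC_MIN  0x0FFFFFF8   /* start of the FAT32 end-of-chain range  */

extern uint32_t fat_next(uint32_t cluster);
extern void     fat_set(uint32_t cluster, uint32_t value);
extern uint32_t bytes_per_cluster;

/* Make a file's size field and its cluster chain agree: cut the chain
   if it maps more clusters than the size needs, shrink the size if the
   chain is too short. Assumes rings were already caught by the bitmap
   pass; zero-size files with a chain are ignored in this sketch. */
void reconcile(uint32_t first, uint32_t *size)
{
    uint32_t needed = (*size + bytes_per_cluster - 1) / bytes_per_cluster;
    uint32_t have = 0;

    for (uint32_t c = first; c >= 2 && c < EOC_MIN; ) {
        uint32_t next = fat_next(c);
        if (++have == needed && next >= 2 && next < EOC_MIN) {
            fat_set(c, EOC);                      /* chain too long: cut here */
            while (next >= 2 && next < EOC_MIN) { /* and free the tail */
                uint32_t n = fat_next(next);
                fat_set(next, FREE);
                next = n;
            }
            return;
        }
        c = next;
    }
    if (have < needed)
        *size = have * bytes_per_cluster;         /* chain too short: shrink size */
}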
Re: Consistency checking FAT
rdos wrote:
* Cluster chains for files that map more than the file size. This will also be fixed by truncating the cluster chain.

I think that the cluster chain should be assumed to be correct, and the file size should be changed. Else someone's file could possibly be truncated. But in all fairness, there is no way to tell which one would be better. That is why we have Ext4, NTFS, HFS+, and so on.
Re: Consistency checking FAT
rdos wrote:
How is this typically done? Do FS drivers do this automatically or rely on user intervention?

On Linux you have the "fsck" family of commands. They normally ask the user what to do, unless you specify "-f" (force fix) on the command line, in which case they fix everything they can without intervention. These "fsck" tools are not drivers; they are standard user-space applications which know nothing about devices in the kernel, they just receive a path as an argument. Since under UNIX everything is a file, that path can point to an image file as well as to a device like "/dev/sda1".
Now as for the FAT variant in particular, "fsck.fat" in the dosfstools package, you might want to take a look at its source to see what it's actually doing. It has been developed for nearly 30 years, so it is a very comprehensive tool that can fix almost all the problems of FAT file systems that everyday life might throw at you.
Cheers,
bzt
Re: Consistency checking FAT
rdos wrote:
Do FS drivers do this automatically or rely on user intervention?

FAT includes a bit to indicate whether the volume was cleanly unmounted. In some cases, Windows uses this bit to automatically run chkdsk before mounting a FAT filesystem. When chkdsk is started this way, it prompts the user before attempting to fix any inconsistencies.
I'm not sure if that really counts, since chkdsk isn't part of the filesystem driver. As far as I know, there are no filesystem drivers capable of repairing a FAT filesystem. (Compare to xfs, where repairs are usually performed by the filesystem driver, and fsck.xfs does nothing at all.)
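For reference, that flag lives in the high bits of the FAT[1] entry according to Microsoft's FAT specification (cleared while the volume is mounted, set again on clean unmount); a minimal check might look like this, with fat_entry() an assumed FAT reader:

#include <stdint.h>

/* Clean-shutdown flags in the FAT[1] entry, per Microsoft's FAT spec. */
#define FAT16_CLN_SHUT  0x8000u
#define FAT32_CLN_SHUT  0x08000000u

extern uint32_t fat_entry(uint32_t index);   /* hypothetical FAT reader */

/* Returns nonzero if the volume was not cleanly unmounted. */
int volume_is_dirty(int is_fat32)
{
    uint32_t e = fat_entry(1);
    return is_fat32 ? !(e & FAT32_CLN_SHUT) : !(e & FAT16_CLN_SHUT);
}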
Re: Consistency checking FAT
nexos wrote:
I think that the cluster chain should be assumed to be correct, and the file size should be changed. Else someone's file could possibly be truncated. [...]

Agreed.
I hope to eventually get to port Ext or NTFS, primarily to get support for 64-bit file sizes and more stable filesystems.
Re: Consistency checking FAT
bzt wrote:
On Linux you have the "fsck" family of commands. They normally ask the user what to do, unless you specify "-f" (force fix) on the command line, in which case they fix everything they can without intervention. [...]

Doing file checks manually like this is a non-option. It must be automatic, simply because RDOS runs in unattended systems where an operator cannot do this. The most important aspect of the FS driver is that it can still operate in the presence of severe errors, since anything else means a dead system and costly service trips.
Actually, it's my opinion that both formatting and automatic error recovery should be part of the FS driver and should happen "behind the scenes". That's one of the shortcomings of FUSE, which has neither as part of its concept.
Re: Consistency checking FAT
rdos wrote:
Doing file checks manually like this is a non-option. It must be automatic, simply because RDOS runs in unattended systems where an operator cannot do this. The most important aspect of the FS driver is that it can still operate in the presence of severe errors, since anything else means a dead system and costly service trips.

While I do agree in principle (even a human operator might simply not know how to proceed, unless they are a file system engineer and actually know what all the terms being thrown at them mean), the problem here is doing it right. A file system driver that responds to errors by automatically corrupting vital system files isn't going to be any good to anyone. At least for some errors, the resolution is easy (unlinked cluster found? Mark it free!). But how do you deal with errors that are simply not clear cut? For example, on FAT a cluster can end up being linked into two files. That essentially means that two files now share the same tail. Which of the two gets to keep the tail? Or does neither get it? How would you notice? How would you fix it in any feasible amount of time?
How to detect: if you are actively scanning for the condition, you can test every entry in the FAT for being equal to any other entry (and not being one of the special values for bad sectors or end-of-chain), but even in the best case this has quadratic time complexity with constant space complexity. Alternatively, you can make it linear in time with linear space. So it really does not scale up.

Once detected, you might be tempted to say that at most one of the files involved is going to have a file length compatible with the length of the chain. But then you have the problem that you need to identify the directory entries belonging to these cluster chains. For that you will first need to find the starting cluster of both files involved, which is not only costly but, if the FAT is inconsistent, can easily run forever (if some clusters end up building a ring in the FAT). And even once you've found the starting clusters, you have to find the directory entry somewhere in the file system that contains that cluster as its starting cluster value. And you have to do that while knowing the FAT to be inconsistent.
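As an aside, the ring case at least is cheap to guard against: no legitimate chain can be longer than the number of clusters on the volume, so bounding the walk turns an endless traversal into a detectable error. A sketch, again with hypothetical helpers:

#include <stdint.h>

#define EOC_MIN 0x0FFFFFF8            /* start of the FAT32 end-of-chain range */

extern uint32_t fat_next(uint32_t cluster);  /* hypothetical FAT reader */
extern uint32_t total_clusters;

/* Returns the chain length in clusters, or -1 if the walk takes more
   steps than there are clusters, which can only happen on a ring. */
int32_t chain_length(uint32_t first)
{
    uint32_t steps = 0;
    for (uint32_t c = first; c >= 2 && c < EOC_MIN; c = fat_next(c))
        if (++steps > total_clusters)
            return -1;
    return (int32_t)steps;
}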
So what do you do here? I have no real answer. The classic UNIX answer has always been to kick the can down the road. The syscall returns an error. The library passes the error to the application. The application writes an error message to the user. The user calls the IT department. The IT department has no-one to kick the can to, so they get stuck with it. Not really satisfying, but what else can we do?
Carpe diem!
Re: Consistency checking FAT
nullplan wrote:
But how do you deal with errors that are simply not clear cut? For example, on FAT a cluster can end up being linked into two files. That essentially means that two files now share the same tail. Which of the two gets to keep the tail? Or does neither get it? How would you notice? How would you fix it in any feasible amount of time?

That's crosslinking. You can detect it by creating a bitmap with all bits set to zero, and then following every cluster chain in the file system, setting the bit corresponding to each cluster. If a bit is already set, you have a crosslink. To avoid loops and other similar problems, for files you truncate the cluster chain if it goes beyond the file size. For a directory, you truncate the cluster chain if directory entries are invalid, or if zeroed entries are followed by non-zeroed ones. When you are done with this, you can detect unlinked clusters by comparing against the bitmap, and free those.
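The directory rule is easy to state in code because a first byte of 0x00 marks the end of a FAT directory, so anything non-zero after that is suspect. A sketch over one cluster's raw contents (DIR_ENTRY_SIZE is the standard 32 bytes):

#include <stdint.h>
#include <stddef.h>

#define DIR_ENTRY_SIZE 32

/* Returns nonzero if a non-empty entry follows an end-of-directory
   entry (first byte 0x00), i.e. the chain should be truncated here. */
int dir_cluster_is_corrupt(const uint8_t *data, size_t bytes)
{
    int seen_end = 0;
    for (size_t off = 0; off + DIR_ENTRY_SIZE <= bytes; off += DIR_ENTRY_SIZE) {
        if (data[off] == 0x00)
            seen_end = 1;
        else if (seen_end)
            return 1;
    }
    return 0;
}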
As for resolving crosslinks, the safest way should be to make a copy of the data and then separate the links so they no longer crosslink. After that, the crosslinked objects can be examined as ordinary entries. A potential problem could be that there is not enough free space.
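A sketch of that copy-and-split fix. alloc_cluster(), read_cluster() and write_cluster() are assumed helpers; 'prev' is the last private cluster of one object's chain and 'shared' is the first cluster of the crosslinked tail:

#include <stdint.h>

#define EOC      0x0FFFFFFF
#define EOC_MIN  0x0FFFFFF8

extern uint32_t fat_next(uint32_t cluster);
extern void     fat_set(uint32_t cluster, uint32_t value);
extern uint32_t alloc_cluster(void);                 /* returns 0 if disk is full */
extern void     read_cluster(uint32_t c, void *buf);
extern void     write_cluster(uint32_t c, const void *buf);
extern uint8_t  cluster_buf[];                       /* one cluster of scratch space */

/* Give one of the two crosslinked objects a private copy of the shared
   tail. Assumes rings were already removed by the bitmap pass.
   Returns 0 if the volume runs out of free space. */
int split_tail(uint32_t prev, uint32_t shared)
{
    for (uint32_t src = shared; src >= 2 && src < EOC_MIN; src = fat_next(src)) {
        uint32_t copy = alloc_cluster();
        if (!copy)
            return 0;                  /* not enough free space */
        read_cluster(src, cluster_buf);
        write_cluster(copy, cluster_buf);
        fat_set(prev, copy);           /* relink the chain to the fresh copy */
        fat_set(copy, EOC);            /* extended on the next iteration */
        prev = copy;
    }
    return 1;
}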
nullplan wrote:
How to detect: if you are actively scanning for the condition, you can test every entry in the FAT for being equal to any other entry, but even in the best case this has quadratic time complexity with constant space complexity. [...]

I think following cluster chains as I described above is more efficient. In the first iteration you will find all crosslinked clusters and which objects are crosslinked (except the first), and in a second iteration you can find the first object that links to them as well.
nullplan wrote:
So what do you do here? I have no real answer. The classic UNIX answer has always been to kick the can down the road. [...]

Returning error codes doesn't solve anything. In the current implementation, I silently ignore errors and return "not ok" but no error code. For some fatal errors, I simply hit a breakpoint in the kernel, and the production release will then record the register state and similar for further analysis & reboot.
Re: Consistency checking FAT
rdos wrote:
Doing file checks manually like this is a non-option.

Erhm, what? Are you replying to my post at all? What I told you was: check the dosfstools check.c source to see what it does when the "-f" automatic fix switch is given. I said nothing about manually running fsck; I said study its source to learn what decisions it is making.
Cheers,
bzt
Re: Consistency checking FAT
bzt wrote:
Erhm, what? Are you replying to my post at all? What I told you was: check the dosfstools check.c source to see what it does when the "-f" automatic fix switch is given. [...]

I might, but I doubt that a tool which runs when there are known problems in the filesystem, and which can run for a long time without annoying the user, is something to build upon for fixing AND detecting problems before the user is even aware there are any. I feel these are completely different approaches to the problem that are not compatible. Automatic fixing needs to be fast, or do stuff that will need to be done anyway. So, if clusters are put in a bitmap for allocation purposes, that bitmap can (as a side effect) be used for detecting crosslinking too. Fixing corrupt cluster chains as files are opened or directories are cached doesn't affect performance, and so can be done in real time.
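As a sketch of that lazy approach, an open path could run the earlier helpers before handing out any data (chain_length() and reconcile() are the hypothetical routines from the previous snippets):

#include <stdint.h>

struct open_file { uint32_t first_cluster; uint32_t size; };

extern int32_t chain_length(uint32_t first);              /* earlier sketch */
extern void    reconcile(uint32_t first, uint32_t *size); /* earlier sketch */

/* On-open repair: reject looped chains outright, then make the size
   field and the chain agree before the caller sees any data. */
int fat_open_check(struct open_file *f)
{
    if (chain_length(f->first_cluster) < 0)
        return -1;            /* ring in the FAT: needs a full repair pass */
    reconcile(f->first_cluster, &f->size);
    return 0;
}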
Cheers,
bzt