thewrongchristian wrote:
And you're worried about the performance impact of checking individual sector errors?
The biggest impact is actually that the disc cache will consume several times more memory if it has to keep error codes per sector, which directly affects performance because fewer sectors can be buffered.
thewrongchristian wrote:
Who is doing this checking? If it's at user level, then all well and good. This sounds very much like policy that needs to be kept out of the kernel.
When the system starts, it will start a loader. The loader has control over faults and over the software that runs. However, it needs useful input from the application if it is to take reasonable action, and an application that just exits gives it nothing to work with.
thewrongchristian wrote:
But, what happens if, because you don't propagate IO errors through read, your zip files on central storage are corrupt due to a bad sector, you'll unzip garbage (which will probably not work anyway)?
The loader fetches the zip files with TCP/IP. It also gets an MD5 sum so it can verify the zip files. The loader will make sure that it has rebooted the machine before it compares files, and if something is recreated, it will check that things are right after a reboot, so it knows that the contents are correct on the disc and not just in the disc cache.
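To make the verification step concrete, here is a minimal sketch in Python of comparing a fetched file against its MD5 sum. The function names `md5_of_file` and `verify_download` are made up for illustration and are not the loader's real API; the sketch also assumes the MD5 sum arrives as a hex string.

```python
import hashlib


def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large zip files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Hypothetical helper: True if the file matches the expected hex MD5 sum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Running this check after a reboot, as described above, means the bytes being hashed come from the disc and not from a warm disc cache.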
thewrongchristian wrote:
Part of the user/OS system call contract needs to be a reliable way of transferring data, and knowing that the data you read is intact. If you can't rely on that without jumping through extra hoops in user land, then you've just made extra work for yourself.
MD5 checksums will do that.
thewrongchristian wrote:
The POSIX contract for read is "I successfully read what you told me, or I tried my best but it failed."
Your OS contract for read is basically "I tried to read what you told me, but it may or may not have read successfully, so I may have left garbage for some or all of the data, and you might want to verify all the data just read to see if it makes sense to your application."
With the POSIX contract, you just get a failure saying that a file cannot be read. This doesn't give you the correct contents, and if the app doesn't check error codes properly, it might just continue with garbage. The POSIX contract also doesn't tell you if there are related problems in the filesystem, like crosslinked clusters, an unreadable FAT, or an invalid cluster chain. You need an event function for this. Also, if you rely only on POSIX, all the errors are dispersed across many parts of the application, so a supervisor cannot get notifications that something is wrong, and when it does its best to correct the error, it will typically just start the application again, with the same result.
It's a bit like USB errors. You cannot let those propagate to the USB functions, because then the general picture of USB errors is lost: the USB functions will just do their best to correct the errors and then discard them. Thus, you will have no idea whether your USB hardware operates properly or not. It's the same with disc devices. They should provide a generic event interface so a supervisor can get a good picture of the condition of your disc drives.
So, in reality, my situation is that the read will complete if the file size allows, but the supervisor (loader) will be notified there is a disc problem, and the VFS might try to fix it by modifying the cluster chain and marking the bad cluster as bad.
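A rough sketch of that kind of event interface, in Python for brevity. Everything here is hypothetical (`DiscEvent`, `DiscEventQueue`, `FakeDisc`, `read_sectors`), and zero-filling an unreadable sector is just a placeholder assumption; it only illustrates the idea that the read completes while the supervisor is told about the problem.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumed sector size for the sketch


@dataclass(frozen=True)
class DiscEvent:
    drive: int
    sector: int
    kind: str  # e.g. "bad_sector"


class DiscEventQueue:
    """Hypothetical event channel a supervisor (loader) can subscribe to."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def post(self, event):
        for listener in self._listeners:
            listener(event)


class FakeDisc:
    """Stand-in for a disc driver; sectors listed in `bad` cannot be read."""

    def __init__(self, drive, sectors, bad):
        self.drive = drive
        self._sectors = sectors
        self._bad = set(bad)

    def read_sector(self, n):
        if n in self._bad:
            return None  # the driver could not read this sector
        return self._sectors.get(n, bytes(SECTOR_SIZE))


def read_sectors(disc, first, count, events):
    """Read `count` sectors; always completes, reporting failures as events."""
    out = bytearray()
    for n in range(first, first + count):
        data = disc.read_sector(n)
        if data is None:
            # Tell the supervisor instead of failing the caller's read.
            events.post(DiscEvent(disc.drive, n, "bad_sector"))
            data = bytes(SECTOR_SIZE)  # placeholder contents (assumption)
        out += data
    return bytes(out)
```

The supervisor would call `subscribe` once at startup and then sees every disc problem in one place, regardless of which application triggered the read.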