Filesystem design
Posted: Wed Jan 05, 2005 4:29 pm
I've been slaving at it in my mind for a full year now, and I've come up with a first kind-of design. There are no on-disk file structures or anything yet, just a database-like design of the structures and a bunch of concepts that it supports.
Concept list:
- Slice/Section: Instead of being partitioned, the disk is divided into a bunch of slices, and any number of slices (from any disks) can be added together to form a section. Sections are neither physically nor logically separate (they don't divide the sectors the way partitioning does); they just allocate blocks as they see fit. Blocks are 4096 bytes (4K), which suffices for all purposes this is designed for. Sections and slices are stored in the two tables in the lower right-hand side; the blue indicates "primary key" or "unique in this table".
- Optional date/time keeping: Updating the date/time on every file access an OS does takes a lot of time (look at Windows/Linux; this was researched by the BeOS people), and for a log file or a system file it's totally useless. Anybody ever wondered when the OS last read comctl32.dll? Therefore, to save space, the date/time lives in a separate table (no space taken when it's switched off), and it can be switched on/off for each file or directory, and recursively through a user interface (of course). The default for user sections is to use date/time stamping (on both streams and file names); the default for the rest is not to. Streams and file names have separate date/time locations, since there can be more than one file name pointing at a file and it could well be of interest whether you made the last change.
- Versioning of files: Lots of files, especially user files, change often, and when they change, they change only a little. It's very useful to be able to go back (think of little 7-year-old Timmy, who erases all your text, types "mommy I love you" and saves that over your financial records for three years). Making a true incremental backup of such files keeps the file from growing much in size while still letting the user go back to an older version. The actual implementation of the DIFF function is left to OS modules; the default is just plain storing of multiple versions. Two bits control this: one is "file should be backed up/has multiple versions", the second is "backups should be incremental". Note on implementation: the new version should keep the old inode number, the old version (the diff) should get a new inode number, and the links (table) should be updated to reflect this.
- Encryption that works nicely with users and groups: You should know what encryption is. You can't read the file unless you know a given password. This scheme uses a few layers to work nicely with users and groups. Each group has a password (gpass), each user has a password (upass) and each encrypted file has a single password (fpass). The upass is hashed twice for authenticating to the system (OS), and then compared with the value stored in the UserID-Passwd table. If they match, the password is assumed correct, and the single hash of the original password is kept for further calculations. So there's a third value you calculate with that is not the password itself (which protects the password against memory dumps) and that is not identical to the hash on disk (which protects against adversaries that have also read this text). This single hash can be used as a symmetric key for decrypting the group password (auto-generated random bits), which is stored in the userid-groupid keyed table, encoded for this user. This way, the user receives the group password without knowing any password other than their own, and without passing any password along. The group password is hashed once, again to guard against memory dumps, and from then on is in single-hash mode too. Either of these single hashes can be used for decoding the appropriate userid-fileid or groupid-fileid entry, to get the file password (which is then hashed and only then used) to decrypt the file itself. Kinda complex, but it lets the just access the file without hassle, and keeps adversaries from accessing it at all.
- Compression, archive bit (for backups), the usual access control
- Extents for block management: See also the FS design topics that have already passed here. Saves space and is easier. Each extent can be compressed or encrypted separately, for fine-grained control. This lets the logging system cooperate nicely with gradually encoding a big file.
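To make the implementation note on versioning a bit more concrete, here's a minimal sketch of the inode swap in Python. It only covers the default "plain storing multiple versions" mode, not the incremental diff. The class name, the in-memory dicts and the history table are all made up for illustration; nothing here is an on-disk format.

```python
class VersionedStore:
    """Toy in-memory model: the live file keeps its inode number,
    older versions are moved to freshly allocated inode numbers."""

    def __init__(self):
        self.inodes = {}      # inode number -> file contents (bytes)
        self.history = {}     # live inode -> inodes of older versions, newest first
        self.next_inode = 1

    def _alloc(self):
        ino = self.next_inode
        self.next_inode += 1
        return ino

    def create(self, data):
        ino = self._alloc()
        self.inodes[ino] = data
        self.history[ino] = []
        return ino

    def write(self, ino, new_data):
        # The old version moves to a NEW inode number...
        old_ino = self._alloc()
        self.inodes[old_ino] = self.inodes[ino]
        # ...the new version keeps the OLD inode number, so every
        # file name already pointing at `ino` sees the new contents.
        self.inodes[ino] = new_data
        # The links table is updated to reflect this.
        self.history[ino].insert(0, old_ino)
        return old_ino

    def read(self, ino, versions_back=0):
        if versions_back == 0:
            return self.inodes[ino]
        return self.inodes[self.history[ino][versions_back - 1]]


store = VersionedStore()
records = store.create(b"financial records for three years")
store.write(records, b"mommy I love you")
print(store.read(records))                   # what Timmy saved
print(store.read(records, versions_back=1))  # the version before that
```

Keeping the live data on the old inode number means none of the file names have to be touched on a write; only the history table grows.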
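The key chain from the encryption bullet is easier to follow in code, so here's a rough Python sketch of it. Assumptions that are mine and not part of the design: SHA-256 as "the hash", a toy XOR keystream standing in for the symmetric cipher (NOT secure, just there so the example runs with the standard library only), the table layouts, and the uid/gid/file numbers. It only walks the group path (upass -> gpass -> fpass); the direct userid-fileid path would work the same way.

```python
import hashlib
import hmac
import os


def h(data: bytes) -> bytes:
    """One application of the hash (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a SHA-256 counter keystream.
    Placeholder only; encrypting and decrypting are the same operation."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = h(key + (offset // 32).to_bytes(8, "big"))
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], block))
    return bytes(out)


# --- What would be stored on disk (hypothetical ids: user 1000, group 50, file 7) ---
upass = b"correct horse"
gpass = os.urandom(32)   # auto-generated random bits
fpass = os.urandom(32)

passwd_table = {1000: h(h(upass))}                               # double hash
user_group_table = {(1000, 50): xor_stream(h(upass), gpass)}     # gpass, encoded for this user
group_file_table = {(50, 7): xor_stream(h(gpass), fpass)}        # fpass, encoded for this group
file_blocks = {7: xor_stream(h(fpass), b"financial records")}    # file, keyed by hashed fpass


def open_file(uid, gid, fid, typed_password):
    user_key = h(typed_password)                  # single hash, kept in memory
    if not hmac.compare_digest(h(user_key), passwd_table[uid]):
        raise PermissionError("wrong password")   # double hash did not match
    group_pass = xor_stream(user_key, user_group_table[(uid, gid)])
    group_key = h(group_pass)                     # group password in single-hash mode
    file_pass = xor_stream(group_key, group_file_table[(gid, fid)])
    return xor_stream(h(file_pass), file_blocks[fid])


print(open_file(1000, 50, 7, b"correct horse"))
```

Note that the plain upass is never stored and is only needed for the first hash; everything after that works from the single hashes, which is the whole point of the scheme.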