Filesystem design
I've been slaving away at this in my head for a full year now, and I've come up with a first rough design: no on-disk file structures or anything yet, just a database-like design of the structures and a bunch of concepts it supports.
Concept list:
- Slice/Section: Instead of partitioning, the disk is divided into a bunch of slices, and any number of slices (from any disks) can be added together to form a section. Slices are neither physically nor logically separate (they don't divide up the sectors the way partitioning does); they allocate blocks as they see fit. Blocks are 4096 bytes (4K) in size, which suffices for all the purposes this is designed for. Sections and slices are stored in the two tables in the lower right-hand side of the diagram; blue indicates "primary key" or "unique in this table".
- Optional date/time keeping: Updating the date/time on every file access an OS makes costs a lot of time (look at Windows/Linux; the BeOS people researched this), and for a log file or a system file it's totally useless. Has anybody ever wondered when the OS last read comctl32.dll? Therefore, to save space, the date/time lives in a separate table (taking no space when unused), and it can be switched on/off per file or directory, and recursively through a user interface (of course). The default for user sections is to keep date/time stamps (on both streams and file names); the default for the rest is not to. Streams and file names have separate date/time entries, since more than one file name can point at a file and it may well be of interest when you made the last change through each.
- Versioning of files: Lots of files, especially user files, change often, and when they change, only a little. It's very useful to be able to go back (think of a little 7-year-old Timmy who erases all your text, types "mommy I love you" and saves that over three years of your financial records), and in most cases not much of the file changes. Making a true incremental backup of such files keeps the file from growing much in size while still letting the user go back to an older version. The actual implementation of the DIFF function is left to OS modules; the default (two bits of indication: one for "file should be backed up/has multiple versions", one for "backups should be incremental") is just plain storing multiple versions. Implementation note: the new version should get the old inode number, the old version (the diff) should get a new inode number, and the links table should be updated to reflect this.
- Encryption that works nicely with users and groups: You should know what encryption is: you can't read the file unless you know a given password. This scheme uses a few layers to work nicely with users and groups. Each group has a password (gpass), each user has a password (upass), and each encrypted file has a single password (fpass). The upass is hashed twice for authenticating to the system (OS) and compared with the value stored in the UserID-Passwd table. If they match, the password is assumed correct, and the single hash of the original password becomes the working value: a third value, used for calculations, that is neither the password itself (protecting its secrecy against memory dumps) nor identical to the hash on disk (protecting against adversaries who have also read this text). This single hash can be used as a symmetric key for decrypting the group password (auto-generated random bits) that has been encoded for this user in the UserID-GroupID table. This way the user obtains the group password without knowing any password other than their own, and without sending any password along. The group password is then hashed once (again to protect against memory dumps) and is likewise in single-hash form. Either of these single hashes can be used to decode the appropriate UserID/GroupID-FileID entry, yielding the file password (which is itself hashed before use) to decrypt the file. Kind of complex, but it lets the authorized user access the file without hassle, and leaves adversaries without any way in.
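The chain of hashes described above can be sketched in miniature. This toy uses SHA-256 as the hash and a hash-derived XOR keystream as a stand-in for a real symmetric cipher; the variable names and the cipher choice are illustrative, not part of the design:

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # Stand-in for a real symmetric cipher: XOR with a key-derived keystream.
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += H(key + counter.to_bytes(4, "big"))
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Setup (done once): the UserID-Passwd table stores H(H(upass)); the group
# password is random bits stored encrypted under H(upass); the file password
# is stored encrypted under H(gpass).
upass = b"user secret"
gpass = b"\x13\x37" * 16            # auto-generated random bits in reality
fpass = b"\xca\xfe" * 16

stored_user_hash = H(H(upass))                  # UserID-Passwd table
enc_gpass = xor_cipher(H(upass), gpass)         # UserID-GroupID table
enc_fpass = xor_cipher(H(gpass), fpass)         # GroupID-FileID table

# Login: the OS double-hashes the typed password and compares.
typed = b"user secret"
assert H(H(typed)) == stored_user_hash          # authentication passed

# The single hash (never the password itself) unlocks the group password,
# whose single hash in turn unlocks the file password.
recovered_gpass = xor_cipher(H(typed), enc_gpass)
recovered_fpass = xor_cipher(H(recovered_gpass), enc_fpass)
file_key = H(recovered_fpass)                   # hashed once before use
```

Note how the value kept in memory after login (`H(typed)`) is neither the password nor the on-disk hash, matching the two protections described above.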
- Compression, archive bit (for backups), the usual access control
- Extents for block management: See also the FS design topics that have already passed here. Extents save space and are easier to manage. Each extent can be compressed or encrypted separately, for fine-grained control; this lets the logging system work nicely alongside encoding a big file gradually.
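A minimal sketch of what a per-stream extent list with per-extent compression/encryption flags could look like, including the mapping from a file offset to an on-disk location. All names are hypothetical, not taken from the actual table layout:

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # 4K blocks, as in the design

@dataclass
class Extent:
    start_block: int        # first block of this contiguous run
    block_count: int        # number of blocks in the run
    compressed: bool = False
    encrypted: bool = False

def locate(extents, file_offset):
    """Map a byte offset within the file to (extent, byte offset on disk)."""
    remaining = file_offset
    for ext in extents:
        size = ext.block_count * BLOCK_SIZE
        if remaining < size:
            return ext, ext.start_block * BLOCK_SIZE + remaining
        remaining -= size
    raise ValueError("offset beyond end of file")

# A stream of two extents; the second one is compressed independently.
stream = [Extent(100, 4), Extent(900, 2, compressed=True)]
ext, disk_off = locate(stream, 4096 * 5)   # lands in the second extent
```

Because each `Extent` carries its own flags, a background job can compress or encrypt a big file one extent at a time, which is what makes gradual encoding play nicely with logging.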
Re:Filesystem design
- Logging at a very low level: The FS isn't logged at any level present in this drawing; instead, blocks are logged as they're written to disk. Some sections (configurable) are logged: the FS driver writes all data blocks to disk (not actually changing anything yet), then writes the FS DB update to the DB sectors as a log record (written in a single go instead of gradually), and after a while a cleanup pass goes through the log and actually writes to the DB. The point (for people wondering what this adds) is that the FS DB is updated in a single write, which can either fail completely or succeed completely; there's no limbo state where the FS has, for instance, used an inode for a file but not yet assigned a file name. Should the system go down between those two writes, the FS would have leaked an inode. Logging prevents this.
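The write ordering above can be modeled in a few lines. This is a toy in-memory sketch of the log discipline (a record is applied entirely or discarded entirely), not the on-disk format:

```python
# Toy model: FS-DB sector updates go into the log as ONE record first,
# are applied later by a cleanup pass, and a torn record is simply
# discarded on recovery. Dicts stand in for real sectors.

disk = {}          # sector number -> contents (the FS DB proper)
log = []           # committed log records, each a dict of sector updates

def log_write(updates):
    """Write one atomic log record ("sector X becomes Y, ...")."""
    log.append(dict(updates))

def apply_log():
    """Cleanup pass: push logged updates into the FS DB, then drop them."""
    while log:
        record = log.pop(0)
        disk.update(record)

def recover(torn_record_present):
    # A record is either fully in the log or not there at all; a torn
    # record is ignored, so the DB never ends up half-updated (no
    # "leaked inode with no file name" limbo state).
    apply_log()

# Example: allocating an inode and assigning a name go in ONE record,
# so both happen or neither does.
log_write({1234: "inode 77 allocated", 1519: "name 'a.txt' -> inode 77"})
apply_log()
```

The key property is that `log_write` is the only durable step; everything after it is idempotent replay.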
- Other small things that I have forgotten probably.
- The image is at http://candy.dublet.org/fsdbdesign.png, bitmap version at candy.dublet.org/fsdbdesign.bmp. This is on my home line, so please don't overload it. Yes, it's capped at 32k/s.
PS @ Pype: I didn't put the shared section or quota discussion in use yet, can't fit it in easily. Should be possible though, do you have an idea?
PS @ all others: You can join in the discussion and use this FS for yourself too, we're kinda trying to get it a standard in OS dev country.
PS3: Another post coming in a few minutes.
PS4: Read past the "saved by trial version" watermark; it was either that or a BMP version. Fixed now, thanks to DennisCGc.
PS5: DF, increase message size. This is pathetic
PS6: The "Creator", third field from the left in the third table from the lower right hand corner is also a UserID, just not linked.
edit: moved my server to port 80 recently
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:Filesystem design
wow. I'll read all this later ... i've got an exam to prepare about artificial neurons, decision trees etc.
- Pype.Clicker
Re:Filesystem design
hmm ... Your discussion about how time should be stored seems interesting ... However, given the ever-increasing number of options and parameters to be stored about a file, I really wonder if it wouldn't be interesting to have some sort of generic key->value storage in the "information node" rather than trying to define a complex structure.
Re:Filesystem design
(Turning the crank on the good ol' eight-eight like a madman to give this a couple shots o' flak to see if it holds together. You know the drill - you don't have to answer me, you rather have to question yourself. ;D )
What is the amount of overhead required for keeping track of slices / sections in addition to actual file metadata keeping? What is the performance impact of having to look up this table repeatedly on every write access, possibly read access too?
Are you aware that, if you have multiple disks, one disk failing can render all data in a system corrupted / useless as random slices are lost?
Have you considered the performance impact when section A and section B get slices assigned alternatingly, scattering both A's and B's data all across the available disks?
Regarding date / time keeping - what is the advantage, compared to e.g. a Linux system having /usr, /var etc. mounted without atime and /home, /etc with atime? I just see another field of required configuration that can be screwed up. (Let alone that this table has to be looked up every time a file is accessed.)
As for the versioning, I'm completely with you there - versioning should be integral part of the FS instead of some application on top.
Every good solution is obvious once you've found it.
Re:Filesystem design
Solar wrote: (Turning the crank on the good ol' eight-eight like a madman to give this a couple shots o' flak to see if it holds together. You know the drill - you don't have to answer me, you rather have to question yourself. ;D )
I know, just answering the parts I know... I posted this here in the hope of a lot of replies with critiques in them. I wasn't hoping for a bunch of "what are you talking about" or "bad design, no detailed comment". Thanks for this one.
Solar wrote: What is the amount of overhead required for keeping track of slices / sections in addition to actual file metadata keeping?
Keeping track of these is very minimal. In a functioning system with 64-bit IDs for files, each file has an inode, whose links are segmented into bit fields (this is the practical implementation side): the lower 40 bits are an index into one of the tables of inodes (keyed by inode #), the next 8 are the slice ID (up to 256 slices should suffice) and the last 16 are for OS purposes (probably keeping sections and file systems apart). Allocating is similar to current allocation: you select any random slice with space available, or just throw it all into one big trough and let the section grab blocks out of it. Each file itself has to be on a single slice, though (one thing that's required for my next answer). The overhead will be minimal, as it's very transparent.
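The 64-bit link layout just described (low 40 bits inode index, next 8 bits slice ID, top 16 bits OS-reserved) can be sketched as pack/unpack helpers. The field ordering within the word is an assumption; the post only fixes the widths:

```python
INODE_BITS, SLICE_BITS, OS_BITS = 40, 8, 16

def pack_link(inode_index, slice_id, os_bits=0):
    """Pack the three fields into one 64-bit link value."""
    assert inode_index < (1 << INODE_BITS)
    assert slice_id < (1 << SLICE_BITS)
    assert os_bits < (1 << OS_BITS)
    return ((os_bits << (INODE_BITS + SLICE_BITS))
            | (slice_id << INODE_BITS)
            | inode_index)

def unpack_link(link):
    """Split a 64-bit link back into (inode_index, slice_id, os_bits)."""
    inode_index = link & ((1 << INODE_BITS) - 1)
    slice_id = (link >> INODE_BITS) & ((1 << SLICE_BITS) - 1)
    os_bits = link >> (INODE_BITS + SLICE_BITS)
    return inode_index, slice_id, os_bits

link = pack_link(123456, slice_id=7)
```

Since the slice ID travels inside the link itself, no extra table lookup is needed to find which slice (and hence which disk) a file lives on, which is why the overhead stays at a few cycles.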
Solar wrote: What is the performance impact of having to look up this table repeatedly on every write access, possibly read access too?
You don't look it up on every write/read; you only use it implicitly. The reads/writes refer to the location on the disk; you only need to know which disk it is (a constant-time lookup) and how large it is (same table, which you had to check anyway). The overhead will be in the region of a few cycles. Compared to disk seek times, that's very small.
Solar wrote: Are you aware that, if you have multiple disks, one disk failing can render all data in a system corrupted / useless as random slices are lost?
Yes and no; that's not true here. Each slice saves a FULL copy of the database describing the files stored, so each slice can be mounted separately from the others, just with some files' contents missing. See it as the file table being entirely present but the stream table being incomplete. This way, if you lose a slice, a bunch of slices and/or complete physical disks, you still hold all the file information. This was a design goal from day 1, actually.
Adding to that, we are still considering adding bits (see also: flags) for keeping multiple streams containing the exact same data, or some form of multi-storage (using multiple disks to store a certain amount of data, with the loss of up to N disks allowed), so data can be secured without a hardware/software RAID system, which would operate at the partition (or disk) level.
Re:Filesystem design
DF: thing about message size...
Solar wrote: Have you considered the performance impact when section A and section B get slices assigned alternatingly, scattering both A's and B's data all across the available disks?
You can defragment a disk in itself, and you can also let the OS block manager reserve a certain number of blocks for the section that allocates them, to prevent this exact case. However, there will always be worst cases, and of the worst cases this one is neither the FS design's responsibility (imo) nor the most likely. People in Windows doing a bunch of slow multi-file copies, yes, that would fit the bill. How many people do that?
As a note, been thinking about it (mainly Pype though, we didn't agree on this yet) and I consider this to be an implementation issue. If you want the section to be contiguous, you're probably paying attention to the wrong thing anyway. similar to current cases with multiple partitions for /home, /usr etc., some files in each are used most. If you sort by those (say, documents around the text editor) you're a lot more likely to get high performance than just sorting by directory. People don't really sort many files, only programs do.
Solar wrote: Regarding date / time keeping - what is the advantage, compared to e.g. a Linux system having /usr, /var etc. mounted without atime and /home, /etc with atime?
There's not much advantage, but there's not much disadvantage either. The advantage is that you can win some speed by marking a lot of reference files in your /home (I know I have loads of hobby OSes in there) as non-timed, saving loads of writes.
Solar wrote: As for the versioning, I'm completely with you there - versioning should be integral part of the FS instead of some application on top.
Good to hear. I've been considering this for only a few weeks now; ask Pype for more details. A thread on the Clicker forum has been opened about this specific topic too.
Off to bed now with my fairy
Re:Filesystem design
zloba wrote: what is a section, its meaning?
Also answered in your PM, which I sent and then lost :$
Second try:
In the old days, you had a disk. It needed to be split into separate parts so you could put multiple OSes on it. Somebody thus thought up cutting the disk into partitions: contiguous, non-overlapping areas where you could dump your information.
Some time later, people were using just Windows, or just Linux, or just X, so the point of partitioning went away. Also, people had multiple partitions and were annoyed by their inability to grow, shrink, or stay out of the way, and by their having to be a multiple of the cluster size; all sorts of problems. To solve this, you need a system that lets you simply state "I want area A to be up to 10GB, area B to be up to 32.5GB, area C to be up to 10GB, and I want area A to have at least 7.5GB available. My disk is 40GB." with the OS working out the details.
In this particular case, you first reserve 7.5GB for the first area, then hand out the rest to whoever asks first. If the user tries to make any area larger than would fit, deny it and cap it at the maximum.
Now, there's a different problem. You can see quite easily that this user interface (idea) is impossible (or very, very slow; see also Partition Magic) with partitions, hence it hasn't been realised (efficiently). If, however, you don't divide the disk into fixed portions when it is first used, but only divide it up as the space is actually used (and take space back when it's freed!), you can assign the available space to whoever asks first.
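The worked example above (a 40GB disk, a guaranteed 7.5GB minimum for area A, per-area caps, first-come for the rest) can be sketched as a simple allocator. All names here are illustrative:

```python
def make_disk(total, areas):
    """areas: name -> (min_reserved, max_cap), sizes in GB."""
    reserved = sum(mn for mn, mx in areas.values())
    assert reserved <= total, "reservations exceed disk"
    return {
        "free": total - reserved,   # the shared first-come pool
        "areas": {n: {"min": mn, "max": mx, "used": 0}
                  for n, (mn, mx) in areas.items()},
    }

def allocate(state, name, amount):
    a = state["areas"][name]
    if a["used"] + amount > a["max"]:
        return False                # deny growth beyond the cap
    # Space up to the reservation is guaranteed; anything beyond it
    # comes out of the shared pool, first come first served.
    guaranteed_left = max(a["min"] - a["used"], 0)
    from_pool = max(amount - guaranteed_left, 0)
    if from_pool > state["free"]:
        return False
    state["free"] -= from_pool
    a["used"] += amount
    return True

disk = make_disk(40.0, {"A": (7.5, 10.0), "B": (0.0, 32.5), "C": (0.0, 10.0)})
ok_b = allocate(disk, "B", 30.0)    # B grabs from the shared pool
ok_a = allocate(disk, "A", 7.5)     # A's reservation is still honored
ok_c = allocate(disk, "C", 5.0)     # pool is exhausted (only 2.5 left)
```

The point of the sketch: B filling up early never eats into A's guaranteed 7.5GB, which is exactly what fixed partitions can't do without slow resizing.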
This does pose a new problem. Since each block potentially belongs to anybody, a filesystem crash is disastrous, as it brings down all those areas. A lower level of safety is needed, hence we introduce a free-list and a used-list, plus a general-purpose log "file". In the log, a record is kept of the form "sector 1234 becomes ASDHAALSDGH, sector 1519 becomes SHALKDSJ and sector 2908 becomes AHDSGFLKSJD", which can be written in a single go. The sectors themselves are then adjusted, and after that's done, the log entry can go away. This means that all writes to FS-level structures must go through the log, while user writes don't (the log is just for file system integrity). The log sits below the file level, and even below the area level, since it maintains their integrity.
Having solved this, we can consider the options of the new layer idea:
We propose a layer in which the physical disk (partition) is subdivided into logical units called "blocks", each 4096 bytes (4 Kbytes) in size. These blocks are managed through a free-list for the entire disk (partition), a bad-list for the entire disk (partition) and a used-list for each area, called a "slice". Each slice has a certain number of blocks reserved for it (not specific blocks, mind you! just a certain number of blocks!) and a certain upper bound on the size it can reach. Each slice can be as large as you want it to be (potentially up to the disk size or 2^64 * 4 Kbytes, whichever is lower... probably the disk size).
We then propose a second layer, right on top of this one and tightly connected to it, in which any number of slices is bound together to form a "section". These sections are the usable user-level areas, each assigned a "node" or a "drive letter" or whatever your operating system considers a sub-drive (previously a partition) to be. Sections are limited by the limitations of their slices. Because a section contains multiple slices, it can offer advanced multi-disk services that were previously limited in concept to the advanced and expensive multi-disk solutions called RAID arrays. Similar setups can be achieved using slices and sections for the purposes of data backup and storage.
These two layers are the basis of the entire system of *FS (read: StarFS).
In the slices discussion, each time "disk" is mentioned it's followed by "(partition)". For compatibility reasons, the filesystem can work as a partition on an existing disk, but it can also take the entire disk. The first 32 Kbytes of the disk/partition are reserved for boot code, which must contain startup logic capable of using the file system. The first sector holds the location of the filesystem relative to the start of the disk, which may be 0 (when it covers the entire disk). Having multiple partitions that are all *FS is pretty pointless, unless they're physically distinct (not back-to-back) on the disk.
Small thing that just popped up: a file is currently required to be entirely within one slice. Should I scratch that? This way it's more crash-proof; the other way is more versatile (imagine all slices being nearly full) and gives less cryptic errors. Any ideas?
Re:Filesystem design
Candy
thanks for the additional info, i read through it and i have a much better understanding of it than before. i'll read it more.
Candy wrote: a file is now required to be in one slice entirely. Should I scratch that? This way it's more crash-proof, the other way is more versatile (imagine all slices being nearly full) and gives less cryptic errors. Any ideas?
imagine me having 2 1G disks and trying to store a 1.5G file, and not being allowed to - just because it is more reliable..
my reaction: wtf? GRRRRRR!!!
so i'd say, it shouldn't be a predefined limit.
instead, you might want to offer some options about how a file should be distributed among slices, such as:
-keep entirely on a given disk
-mirror-duplicate on N disks
-striped
-etc.
p.s.
Candy wrote: and gives less cryptic errors.
what makes you give cryptic errors anyway?
if one of the slices fails, you report that as such, instead of something obscure like "read error" (let the user figure that one out - what? where? what do i do now to recover/fix that?)
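The per-file distribution options listed above could be expressed as a placement policy chosen per file. This sketch is purely illustrative; the design doesn't define these names:

```python
from enum import Enum

class Placement(Enum):
    SINGLE_SLICE = "keep entirely on one slice/disk"
    MIRROR = "duplicate full copies on N slices"
    STRIPE = "spread extents round-robin across slices"

def place_extents(policy, extents, slices, copies=2):
    """Return a mapping extent -> list of slice IDs, per the chosen policy."""
    if policy is Placement.SINGLE_SLICE:
        return {e: [slices[0]] for e in extents}
    if policy is Placement.MIRROR:
        return {e: slices[:copies] for e in extents}
    if policy is Placement.STRIPE:
        return {e: [slices[i % len(slices)]] for i, e in enumerate(extents)}
    raise ValueError("unknown policy")

# Three extents striped across two slices:
plan = place_extents(Placement.STRIPE, ["ext0", "ext1", "ext2"], [0, 1])
```

With such a policy bit per file (or per section), the 1.5G-file-on-two-1G-disks case becomes a configuration choice rather than a hard error.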
Re:Filesystem design
zloba wrote: imagine me having 2 1G disks and trying to store a 1.5G file, and not being allowed to - just because it is more reliable.. my reaction: wtf? GRRRRRR!!! so i'd say, it shouldn't be a predefined limit. instead, you might want to offer some options about how a file should be distributed among slices, such as:
- keep entirely on a given disk
- mirror-duplicate on N disks
- striped
- etc.
Hm... that's an idea... Yet I don't think this should be the problem... it could be made a bit in the section configuration: whether it should always, usually or never try to keep things on one disk. Considering the better aspects, you can then split a file up over multiple disks for higher speeds. It would enforce a design change, though: the SliceID then moves from the stream# table to the stream#/extent# table, since each extent can be in its own slice.
zloba wrote: p.s. what makes you give cryptic errors anyway? if one of the slices fails, you report that as such, instead of something obscure like "read error" (let the user figure that one out - what? where? what do i do now to recover/fix that?)
This was more a "how do I report that even though he has 4GB free over 4 disks, he can't store a 1.5GB file" sort of error. Users can't grasp the idea that they can't do that. I think the split-up is the better choice even though it gives less efficiency; the reliability aspect is offset by preferring same-slice extents. Anybody have a comment on this?
Re:Filesystem design
Candy wrote: Each slice keeps a full file table for all files on the section, yet each slice only keeps data streams for a subset of them (which might be all of them nonetheless).
is keeping a full file table for the section a good idea?
imagine having a section with millions of small files, spread over a number of relatively small slices.
first, storing a full file table in each slice would introduce a nasty space overhead, as it grows huge - to the point of overflowing smaller slices.
second, the issue of updates. suppose you add a file to a section and store it in some slice - now you have to update file tables in the rest of the N slices of that section..
third, what's the use keeping file tables in a slice, with files not in that slice? it's the job of the section - to keep information about all of its files, in some sufficiently reliable way. mirroring it on each slice seems redundant and inefficient.
Re:Filesystem design
zloba wrote: is keeping a full file table for the section a good idea? ... mirroring it on each slice seems redundant and inefficient.
Well... the idea is that if any slice is lost, the remaining slices can still construct a full file list, overview, and manageability. The user wouldn't see a difference in appearance, which helps familiarity with the computer (something that helps even seasoned veterans), and he would be able to tell which files weren't there (by a red file name or some other OS-dependent indication). It also helps OS implementation by not requiring the developer to figure out how to merge the tables or where to get them; they're just everywhere. Combine this with the idea of a single slice per section per disk and it's still pretty efficient. The file table will contain at most around 500,000 records (current estimate), each about 64 bytes, which amounts to 32 megabytes. That should not be significant for the slices of an OS section, or a program or user section for that matter; who has up to 500,000 files in one section?
Which also brings me to a second point: the use of a section. Each section holds data for ONE system (which is not the OS; that is a disk-file-system section), ONE program, ONE OS or ONE user. There can be shared sections, if users create them by giving up a bit of their quota for the shared section. In any case, there are a lot of sections, each containing one limited set of files that is logical for a given user, program or set of users. Sections with more than 5000-10000 files should therefore be very unlikely. If you can create them practically for free, you might as well use them as a protective advantage, right?
The sections are thus tagged to indicate their type: program section, user section, group section (new), system section (fs) or OS section (boot). This isn't in the design yet, but it will be sometime soon... when I figure out how.
Re:Filesystem design
Candy wrote: Which also brings me to a second point about that, the use of a section. Each section holds data for ONE system (which is not the OS, that is a disk-file-system section), ONE program, ONE OS or ONE user.
this is one of the best ideas, it will certainly go into whatever system i may end up implementing. (your "applications" rant also contains some ideas)
here's how i see it.
the way programs are currently installed is flawed:
-ambiguity and conflicts, when installing multiple instances or different versions of an application or different applications
-DLL hell, where all DLLs go into a shared folder, and uniqueness is not a good bet.
-what business does one app have reading, much less writing to another app's program files or user's files?
an app should be installed into its own "program" filesystem and not see anything beyond it. this filesystem may be within a user's quota, or global, executable by all users.
then, it may need storage for the user's settings and/or data, obviously within the user's quota. no more apps refusing to work under a non-privileged user.
if the user wants to open a file with the app, they can "feed" the file to the app via a system-provided "open file" dialog. (this also adds the possibility of accessing more than just normal files - such as HTTP documents - with the OS or another app encapsulating the retrieval of the document, and the app not giving a damn about it, much less including a full HTTP implementation)
am i confused or what?
Re:Filesystem design
Well... nice to know somebody actually found it without me hard-pointing him there... and that you actually read it too ;D

zloba wrote: this is one of the best ideas, it will certainly go into whatever system i may end up implementing. (your "applications" rant also contains some ideas)
zloba wrote: here's how i see it.
the way programs are currently installed is flawed:
-ambiguity and conflicts, when installing multiple instances or different versions of an application or different applications

Yes. Now, this doesn't directly fix that, but it does kind of allow multiple versions to coexist.

zloba wrote: -DLL hell, where all DLLs go into a shared folder, and uniqueness is not a good bet.

Yes. Now, I think the idea of sharing all libraries is way overrated: most people have dozens of programs from just as many vendors, who might share just 2 or 3 libraries between them. Out of thousands of libraries, that isn't enough to warrant the names being unique, let alone the contents compatible. Keep your DLLs to yourself until they are proven identical (and no, the date isn't enough).
*mental note*: It might be a nice idea to make certain files hashed files, as in, you also store the file's hash for quick identity checks. Not sure about this, but it just popped up, and it would also make for a quick virus check (if the hash is only modifiable by the OS).
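Both points, "identical until proven otherwise" for DLLs and the virus-check note, reduce to the same mechanism: store a content hash with the file record and compare hashes instead of names or dates. A minimal sketch, using SHA-256 as an assumed choice of hash:

```python
import hashlib

def content_hash(data: bytes) -> str:
    # Hash of the file contents; dates and names play no role.
    return hashlib.sha256(data).hexdigest()

dll_a = b"\x4d\x5alibrary code v1"
dll_b = b"\x4d\x5alibrary code v1"   # same bytes, different vendor and date
dll_c = b"\x4d\x5alibrary code v2"

print(content_hash(dll_a) == content_hash(dll_b))  # True: proven identical, safe to share
print(content_hash(dll_a) == content_hash(dll_c))  # False: keep them separate
```

For the virus-check use, the same stored hash is recomputed on read: a mismatch means the file changed behind the OS's back.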
zloba wrote: -what business does one app have reading, much less writing to another app's program files or user's files?

Very true, but this is actually Pype's idea. You should read (or have read) his notes too; he's got some very good things in there (which explains our cooperation in the first place).
zloba wrote: an app should be installed into its own "program" filesystem and not see anything beyond it. this filesystem may be within a user's quota, or global, executable by all users.

Not entirely agreed. I consider locally installed programs to be more of an operating system thing, similar to what current systems do. I strongly prefer a program to be taken from the global "programs" slice into its own slice (a special case), so that the total amount of space for programs in general can be limited and each program can be checked upon, but nobody is penalized for having a bunch of programs installed. This is something that annoys me at school: people using MS Office and IE have 0 bytes of programs they want to install, while those with OpenOffice and Firefox (hello, yes, that'd be me!) need up to 200-250 MB in their program folder. Which we, needless to say, don't get at school. A bunch of us have decided not to use those computers anymore, for lack of usability.
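The accounting rule argued for in that reply can be sketched as a tiny ledger: bytes in a user's own section count against their quota, while a program pulled from the global "programs" slice into its own slice does not. The class, numbers and names are made up purely for illustration.

```python
class QuotaLedger:
    def __init__(self, user_quota_bytes):
        self.user_quota_bytes = user_quota_bytes
        self.user_bytes = 0      # counted against this user's quota
        self.program_bytes = 0   # accounted globally, not per user

    def store(self, size, *, program=False):
        if program:
            # Installed programs draw on the global programs budget,
            # so nobody is penalized for what they have installed.
            self.program_bytes += size
        else:
            if self.user_bytes + size > self.user_quota_bytes:
                raise OSError("user quota exceeded")
            self.user_bytes += size

MB = 1024 * 1024
ledger = QuotaLedger(user_quota_bytes=50 * MB)
ledger.store(250 * MB, program=True)  # an OpenOffice-sized install: fine
ledger.store(1 * MB)                  # a user document: counted normally
print(ledger.user_bytes // MB)        # 1: only user data hit the quota
```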
Do you know the feeling when the forum complains three times in one thread about your message being too long?