Filesystem design

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 12, 2005 3:32 am

Okay. I've worked a bit more on the "extensible attributes" framework to get something readable. Here it is:

If you consider a file (e.g. a file entry in a directory or an inode), you may classify the information stored about it as follows:

* information that tells where the datablocks of the file actually are.
* information that tells how the file should be used, who made it, when, etc. and that makes sense globally throughout the system -- which i'll call attributes.
* information that tells more about the file (its class, its original source, its "importance" for the user, when it was originally created if created locally, etc) but which is of no use for the system itself, which i'll call metadata.

Some of the metadata for a file can be made very small ... Even the "mp3 author" can become compact if you use an external table to map an "authorID" to its unicode string. Other metadata are relatively large (like the comments and the original source). What i'd advocate for (and implement as a part of *FS if we come to an agreement on that with Candy) is the ability for the system to handle both attributes and metadata transparently.

Transparently means that the song's author remains the same if you move or rename the file. Most frameworks nowadays fail to offer this (just using the media library of newer winamp proves you shouldn't move your MP3s ever ... RealOne is even worse)

Let's say now i wish to have a selection that automatically shows "all the songs from author X" based on a directory that'd contain "all the songs". If i have "song author" as a part of the real data (such as an ID3 tag), it means i have to crawl the directory, open all files (descending the indirect index blocks?), check ID3 tag, perform string compare, close files. Rather slow & pathetic.

Storing "author" as a field in an additional "metadata stream" about the file will not really help either.

Now what about storing the author's ID (just an int) within the inode itself ? first of all, checking the inode is enough to tell if we keep the file or not for the current selection.
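To make the idea concrete, here is a minimal sketch in Python. The `Inode` record, its `author_id` field and the external `AUTHORS` table are all hypothetical names made up for illustration; nothing here is taken from an actual *FS layout.

```python
from dataclasses import dataclass

# Hypothetical external table mapping a compact author ID to its Unicode name.
AUTHORS = {1: "Nirvana", 2: "Björk"}

@dataclass
class Inode:
    number: int
    name: str
    author_id: int  # small integer kept in the inode itself (assumption)

def select_by_author(directory, author_id):
    """Return the names of files whose inode carries the given author ID.

    Only inode records are inspected; no file data is opened or parsed.
    """
    return [ino.name for ino in directory if ino.author_id == author_id]

all_songs = [
    Inode(10, "lithium.mp3", 1),
    Inode(11, "hyperballad.xm", 2),
    Inode(12, "polly.wav", 1),
]
print(AUTHORS[1], select_by_author(all_songs, 1))
```

The selection works the same for the .xm and .wav files as for the MP3, because the comparison is an integer compare on the inode and never touches an ID3 tag.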

Questions still pending:

* how do we tell what should be kept at inode and what shouldn't ?
* could the inode just be a "cache" for any last-recently-used metadata&attributes ?
* could we just keep the latest "accept" or "denied" rules for the file in the inode and the more complex (?) decision rule in the "metadata stream" ?
* how much overhead does the fact that "owner" is a key instead of a hard-coded location involve ? the hardest part of it was to read the data out of the disk anyway ... once in cache ...

zloba · Post by **zloba** » Wed Jan 12, 2005 5:58 am

about attributes - some random food for thought:

that "Author" attribute: is it not an integral part of the document? meaning, if you were to lose it, you lose information and not just some faceless data that gets slapped onto the file by the OS (such as timestamps, chmod or whatever).

how would you like it if, in the process of copying files around, you lost all ID3 tags from them, because of incompatibilities between filesystems or protocols? (this routinely happens with other attributes, such as timestamps, access rights and even filename's lower/upper-casedness, which you'd think should matter)

what exactly should be "metadata" as opposed to just "data"? or why shouldn't there be 3 streams, or even 4?
"contents that are actually used", as opposed to "information about the contents"?
what about, for example, a text document with author, comments, changelog, etc? is a comment any less valuable than the main text?

or maybe should these FS- and OS-related attributes (as opposed to those that are an integral part of a document) be handled separately?

the problem with .doc, .mp3 and all other files is that there are so many unrelated formats. to display attributes, you need to (1) have some idea about their meaning and structure and, worst of all, (2) support reading each file format.

for (1), are there actually so many unique types of attributes? or can they be classified and standardized by a number of standards?
-author who created the document
-artist, album etc. for music/video
-changelog - who, when, what, why (not just timestamps)
-comments - who, when, what..
-etc etc

for (2), making them uniformly accessible would solve the problem.

why not make general SQL-style queries into the filesystem? (this is pretty much what Pype wants), such as "gimme all files where artist is XYZ"..
now, can you do that trick with unix "locate" command? is there a good reason why not?
perhaps you can, with an external utility to read mp3 tags, but it's ugly and a particular case. and why should a user be skilled in using regular expressions? which is a rather obscure concept (and disgusting regardless)..

about timestamps (or even more than one kind - creation, access, modification):
various combinations are supported by filesystems, but what good are they?
do we want to just know when a file was last changed? how useful is it?
(and when the file changes again, the old modification timestamp is lost - is that good?)
or do we want to know when _each_ change happened, and exactly how? maybe even with optional comments from the user, such as "spellchecked, removed typos..." ?
or, for example, you could search to see what you changed on a given date. normally this information is lost, and it's taken for granted.

in conclusion, attributes matter and should be given more thought. the way FSes and OSes support chosen attributes is arbitrary, half-assed and misplaced.

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 12, 2005 7:13 am

okay, seems like i didn't make myself as clear as i'd like to...

that "Author" attribute: is it not an integral part of the document?

for some specific document formats, it is. For others it isn't. I'm thinking of raw WAV files, .xm, .it, .s3m, .mod or any more exotic format: they're all "audio:music" just as MP3 is, so they should be able to carry an "author" metadata just like MP3 does, but unfortunately, there's no place in the data *itself* where that information should go.

If the system is able to abstract the metadata as being "author" regardless of the actual file format, you can operate more easily on "all documents that come from 'author'" and not simply "all MP3 with a given value in ID3 tag" ...

for (1), are there actually so many unique types of attributes? or can they be classified and standardized by a number of standards?

My suggestion for this is that the system can receive new file classes which can describe their attributes and methods and which can inherit from other classes. I have a testbed here that works on all sorts of text documents (html, raw text, pdf, ps (gzipped or not), etc) and for which i can say "this text document is actually an article", which automatically extracts values of "abstract", "bibliographic items", etc.

why not make general SQL-style queries into the filesystem?

It could be sql-like, it could be "doclocate author=leduc" or a nice search box with a grey-bearded wizard looking in a huge spellbook or whatever. The real issue is that the content of the metadatabase *needs* to be hard-linked with the actual file contents so that you can retrieve the file you're dealing with.

The reason why i didn't extend the feature to source code atm, for instance, is that 99% of my text editors will *not* reuse the same file: instead they use the old inode as a "backup" file and create a new inode with the newly edited file ... which defeats the whole purpose, unfortunately
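As an illustration of what "hard-linked with the actual file" could mean in practice, here is a rough Python sketch that keys a metadata table by file identity instead of by path. The `library` dictionary and the `file_id` helper are hypothetical; `os.stat` and `os.rename` are the standard library calls, and the sketch assumes a filesystem where a rename within one volume keeps the inode number (the usual POSIX behaviour).

```python
import os
import tempfile

def file_id(path: str):
    """Identify a file by (device, inode) rather than by its name."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

library = {}  # hypothetical metadatabase: file identity -> metadata

with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "track01.mp3")
    new = os.path.join(tmp, "lullaby.mp3")
    with open(old, "wb") as f:
        f.write(b"fake mp3 data")

    library[file_id(old)] = {"author": "unknown", "comment": "bedtime only"}
    os.rename(old, new)               # a rename done "by another program"
    print(library.get(file_id(new)))  # looked up by identity, not by name
```

The same sketch also shows the editor problem described above: a program that writes a fresh file and swaps it in place of the old one produces a new inode, so the lookup by identity would miss.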

how would you like it if, in the process of copying files around, you lost all ID3 tags from them, because of incompatibilities between filesystems or protocols? (this routinely happens with other attributes, such as timestamps, access rights and even filename's lower/upper-casedness, which you'd think should matter)

simply by stating that as soon as different storage technologies are involved, you no longer *copy* the file but instead *import/exchange/export* it.
The oversimplified export procedure would be to copy the "main stream" into a file on the target location and to use an alternate file for recording the attributes that cannot be applied on that location (for instance, writing an XML file describing the file's attributes would perfectly fit the "make a backup CD" export)
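A minimal sketch of such an export in Python, using only the standard library. The function name `export_file`, the `.attrs.xml` sidecar suffix and the XML element names are assumptions chosen for the example, not part of any agreed format.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def export_file(source: Path, target_dir: Path, attributes: dict) -> Path:
    """Copy the main stream and describe the attributes in an XML sidecar."""
    target = target_dir / source.name
    shutil.copyfile(source, target)  # main data stream only

    root = ET.Element("file", name=source.name)
    for key, value in attributes.items():
        ET.SubElement(root, "attribute", name=key).text = str(value)

    sidecar = target_dir / (source.name + ".attrs.xml")  # hypothetical naming
    ET.ElementTree(root).write(str(sidecar), encoding="utf-8", xml_declaration=True)
    return sidecar

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    song = base / "song.wav"
    song.write_bytes(b"RIFF fake audio")
    backup = base / "backup"
    backup.mkdir()
    sidecar = export_file(song, backup, {"author": "leduc", "class": "audio:music"})
    print(sidecar.read_text(encoding="utf-8"))
```

An import step would do the reverse: read the sidecar and re-apply whatever attributes the target system supports.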

Importing and exporting is a fileclass-defined method. When you install a plugin that makes your system know what MP3 actually is, you're supposed to provide a "welcome-audio-mp3" tool as well which will (for instance) inspect the ID3 tag and write the appropriate attributes. There are also other standard tools performing encoding/decoding from the upper fileclass (generic audio)...

zloba · Post by **zloba** » Wed Jan 12, 2005 7:17 am

Pype:

Storing "author" as a field in an additional "metadata stream" about the file will not really help either.

can you explain why not? if it's standardized, kept in a standard structured storage and uniformly accessible...

Now what about storing the author's ID (just an int) within the inode itself ? first of all, checking the inode is enough to tell if we keep the file or not for the current selection.

and what will you do with that ID? what does this have to do with the filesystem, other than optimization for a Very Specific Case?

<edit: text removed - never mind, i somewhat misunderstood Pype's idea of inodes>

if you have an mp3 database so huge that it makes a difference, and if you want to search it at top speed, why not build an indexing database or whatever on top of that?

zloba · Post by **zloba** » Wed Jan 12, 2005 7:59 am

to clarify myself:

why not make general SQL-style queries into the filesystem?

i meant that just as an example of what could be done with uniform attribute handling, not as a main method of access to attributes.

>that "Author" attribute: is it not an integral part of the document?
for some specific document format, it is. For others it isn't.

i know, that was meant as an example of an attribute. most attributes are useful for some files but not others.

Pype
i generally agree.

Candy · Post by **Candy** » Wed Jan 12, 2005 11:07 am

Taking a few potshots at some comments I don't agree with:

As a general usability design ideal: try to reduce the number of passwords the user has to remember. Password encrypted files are stupid: either people just use their normal login password, or they use such easy passwords that you'll brute force them in minutes. Either way it's a bad idea.

Read the encryption design. The login design.

As a part of the filesystem on which the entire system runs (ideally) you get ONE password and ONE username for the entire system. It's constructed in such a way (see above) that it's not possible to access the files through anything but the first hash of the password, and it's impossible to get the first hash of the password from the disk (only the second one is stored). The password is not kept in memory in plain form, making it infeasible to search through memory for the sequence that constructs the password (i.e., the unix "strings" program would very probably miss it).

The password hash is used to access all encrypted files and all encrypted group files that the user should be able to access normally, but now encrypted. The user DOES NOT enter any more passwords. The user can use his UUID (universally unique identifier) as a number, together with some form of online authentication not yet thought out to authenticate online and then use whatever he wants, again without password.
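A rough sketch of the two-hash idea in Python, for readers who want to see the shape of it. The encryption and login designs referred to above are not reproduced in this thread, so the choice of PBKDF2 for the first hash, SHA-256 for the second, the salt and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac

def first_hash(password: str, salt: bytes) -> bytes:
    """Derive the key that would unlock the files; never written to disk."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

def second_hash(key: bytes) -> bytes:
    """Hash of the first hash; the only value stored on disk."""
    return hashlib.sha256(key).digest()

def login(password: str, salt: bytes, stored: bytes):
    """Return the file key if the password matches, otherwise None."""
    key = first_hash(password, salt)
    if hmac.compare_digest(second_hash(key), stored):
        return key
    return None

salt = b"per-user-salt"  # hypothetical; would be kept next to the stored hash
stored = second_hash(first_hash("correct horse", salt))

print(login("correct horse", salt, stored) is not None)
print(login("wrong guess", salt, stored) is None)
```

Someone who reads `stored` from the disk can verify a guess but cannot derive the file key from it without inverting the second hash, which is the property described above.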

The user has one password. The computer does all the rest. It's not hackable locally, it should not be hackable remotely (in the future).

Personally I think that it is enough to be able to read the filesystem from one OS, as long as (1) you can mount local fs from a booting cdrom, and (2) it supports network filetransfers with other OS, or can mount FAT.

The first is a nice thought. The second one is unacceptable. You can't assume everybody has a network in any form, and you can certainly not blame my attempts to abolish FAT. It's patented, trash, unreliable and badly designed, and yet you want to require others to use it? What's the point of a new design at all then?

But that's beyond the scope of the filesystem, really

Uh... I disagree?

The file system is the first thing people use that can use passwords. If it has a single way to store passwords for them (hashed, of course) the rest of the system can profit from that.

New thought: plain system support. Pretty much a distilled version of what you guys suggested here all along, and it might even help kick FAT's butt some more...

Same as the other, no dates, no incremental files, all sections are single-slice, stuff like that. Very simple to load, very easy to implement (for others) and easy to be compatible with. Now that's something for file exchange...

Still, I do see a lot of use for a "common" file storage system for user files too, and I do consider these filesystem-related.

Can we, for clarity of the discussion, separate "attributes", things that actually modify the file or its use patterns, and "metadata", stuff that's normally an attribute but does not change its use patterns or its contents? Attributes need to be predefined and consistent with all file system implementers. If somebody changes the meaning of one of them, they can't interoperate.

For attributes, you only have to know the type of attribute. You predefine a bunch of types (date, int, real, string) and allow the user to pick a type and a content. You could implement all of these as a two-way lookup table, one with a key of the file name (inode #, something like that) and the other the value itself. You can then quickly summarize a bunch of files and find files with a property between X and Y etc. Seeing as most of these are sorted streams with an X and Y for searching, I propose a form of B+ tree for storage.
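The two-way lookup with range searches can be sketched as follows. This is only an in-memory stand-in: a sorted Python list with `bisect` plays the role the B+ tree would play on disk, and the `AttributeIndex` class and its method names are made up for the example.

```python
import bisect

class AttributeIndex:
    """Two-way lookup for one typed attribute (e.g. an integer bit rate)."""

    def __init__(self):
        self.by_inode = {}  # inode number -> value
        self.by_value = []  # sorted list of (value, inode number)

    def set(self, inode: int, value) -> None:
        if inode in self.by_inode:
            # Drop the old (value, inode) pair before inserting the new one.
            old = (self.by_inode[inode], inode)
            self.by_value.pop(bisect.bisect_left(self.by_value, old))
        self.by_inode[inode] = value
        bisect.insort(self.by_value, (value, inode))

    def get(self, inode: int):
        return self.by_inode.get(inode)

    def between(self, low, high):
        """Inode numbers whose value v satisfies low <= v <= high."""
        start = bisect.bisect_left(self.by_value, (low, -1))
        end = bisect.bisect_right(self.by_value, (high, float("inf")))
        return [inode for _, inode in self.by_value[start:end]]

bitrate = AttributeIndex()
bitrate.set(10, 128)
bitrate.set(11, 320)
bitrate.set(12, 192)
print(bitrate.between(128, 192))  # → [10, 12]
```

Both directions stay cheap: `get` answers "what is this file's value" and `between` answers "which files fall in this range" without scanning every file.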

As for the implementation overhead, I was considering making the "hard" parts public domain code snippets anybody could use for free and claim their own. Say, the password hashing stuff, the encryptor/decryptor, date calculations in <whatever date format> etc.

Next post coming, I think I'm hitting the limit again.

Candy · Post by **Candy** » Wed Jan 12, 2005 11:24 am

Are there actually so many unique types of attributes?

Extrapolate the previous few things and come to the conclusion that it must be "yes".

1920, music had an artist, title, instruments and location.
1950, music had an artist, title, record company, instruments and location.
1980, music had an artist, title, record company, instruments, distribution method (tape, record) and location
2000, music had an artist, title, record company, manager, instruments, distribution method (tape, record, mp3, cd, dvd-audio, ra), bit rate, location, genre.
2030, music had an artist, title, record company, manager, instruments, distribution method (tape, record, mp3, cd, dvd-audio, ogg, wma, aac, ra), bit rate, location, genre and what else? Can you predict that this is definitely everything?

As the people in 1980 couldn't predict this stuff, you can therefore not predict what's happening in 25 years. So, you can't possibly define all attribute types.

in conclusion, attributes matter and should be given more thought. the way FSes and OSes support chosen attributes is arbitrary, half-assed and misplaced.

This was an attempt to define a few attributes that would at least suffice until the first version of the FS is in place. These things you mention above this comment are already in place, and optional for all files.

The reason why i didn't extend the feature to source code atm, for instance, is that 99% of my text editors will *not* reuse the same file: instead they use the old inode as a "backup" file and create a new inode with the newly edited file ... which defeats the whole purpose, unfortunately

Just for the record, the mechanism through which this will work has been thought out and involves creating a new inode and updating two others for the older versions. The inode number containing the actual file MUST stay the same, if just for filesystem integrity (say, hard links?).

if you have an mp3 database so huge that it makes a difference, and if you want to search it at top speed, why not build an indexing database or whatever on top of that?

If you have a number of files with attributes, that you probably want to search through, why separate it from the filesystem? Seems like the place it should be to me.

For a simple note, we are still looking for other people supporting *FS. Not to be evangelising too much, but having a "universal file system" is pretty pointless if we are the only two in the universe using it. The goal is for it to support the features a modern-day user level file system should support for user files. Lesser versions of these can be made from this version, with the easiest for OS and system sections. Also, "sharing sections" might be an idea but I consider it a bad one.

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 12, 2005 11:25 am

zloba wrote: if you have an mp3 database so huge that it makes a difference, and if you want to search it at top speed, why not build an indexing database or whatever on top of that?

That's very precisely what "media library" part of Winamp and realone do ... the trick is that if the FS is not aware of that DB existence, the DB will get wrecked as soon as a file is moved/renamed from another program.

<story>
That's a real-life example. My fairy had a +320 collection of MP3 full of fairies stuff plus a few of our fav' cds backups ... she then wished to split both so that it'll be "tidied up" and that i don't hear child-sleepwell-spell songs when i wish to have her CPU making music for my ears...
That was not a good idea at all. For the next 2 weeks, her system complained about missing files (even if they were present 3 times in the list) and miserably failed to play anything but the same 5 songs after a third week. Solution ? i had to wipe out the whole DB and then insert *.mp3 --recursive again in the software (thus re-indexing the whole stuff)

How good is this ?
</story>

Now think of what would've happened if the program was not only offering metadata stored as ID3 tags, but also things like user preferences (don't play nirvana after 9 pm ...), personal comments, href to lyrics.com pages, etc ??

All that precious info would have had to be entered again.

Candy · Post by **Candy** » Wed Jan 12, 2005 11:32 am

Oh yes, can comment on the MP3 subject too. Some OS using FAT kind of lost my MP3 directory and uh... forgot to put those files anywhere. Next scandisk (ok, that's pretty much telling) revealed about 350 unknown file fragments, with a total size of a gigabyte. All rounded up to the next 32K block of course

Solution was to write a VB program that read back from the end to the next TAG, and to extract the information using that info to rename the file. Saved me 300 renames and a lot of boring work.

zloba · Post by **zloba** » Wed Jan 12, 2005 11:46 am

Candy

>Are there actually so many unique types of attributes?
Extrapolate the previous few things and come to the conclusion that it must be "yes".

actually, this demonstrates my point nicely.
observe how each successive set of fields is pretty much a superset of the previous ones.
so what you do is define each of these into standards.
then, as you come up with a new standard that encompasses the previous one, you can upgrade the file attributes version and add new fields. you can even keep the file backwards-compatible with a little care.

and the best part is, none of these even have to exist at the time the basic FS standard is defined, because the existence of these standards has no effect on the FS design, yet the FS will be ready for these with extensible attributes.

also, as far as "type of attributes" is concerned, these are not unique standards but successive versions of one standard, backwards compatibility and all..

Pype

That's very precisely what "media library" part of Winamp and realone do ... the trick is that if the FS is not aware of that DB existence, the DB will get wrecked as soon as a file is moved/renamed from another program.

well then, why not make it work nicely? i think it's still possible to keep those in sync without too much work.
why not make the FS aware of the Media Library, like as a plugin? or maybe even make the Media Library into an FS of its own.

Candy · Post by **Candy** » Wed Jan 12, 2005 11:56 am

zloba wrote:
>Are there actually so many unique types of attributes?
Extrapolate the previous few things and come to the conclusion that it must be "yes".
actually, this demonstrates my point nicely.
observe how each successive set of fields is pretty much a superset of the previous ones.
so what you do is define each of these into standards.
then, as you come up with a new standard that encompasses the previous one, you can upgrade the file attributes version and add new fields. you can even keep the file backwards-compatible with a little care.

and the best part is, none of these even have to exist at the time the basic FS standard is defined, because the existence of these standards has no effect on the FS design, yet the FS will be ready for these with extensible attributes.

also, as far as "type of attributes" is concerned, these are not unique standards but successive versions of one standard, backwards compatibility and all..

That assumes you can always define those new fields before somebody else decides to "add those to your software to save you the trouble": three people do the same thing in slightly incompatible ways, and chaos ensues.

It's a nice idea, but it's not practical. Also, I don't think this is any better than not defining any metadata fields.

That's very precisely what "media library" part of Winamp and realone do ... the trick is that if the FS is not aware of that DB existence, the DB will get wrecked as soon as a file is moved/renamed from another program.
well then, why not make it work nicely? i think it's still possible to keep those in sync without too much work.
why not make the FS aware of the Media Library, like as a plugin? or maybe even make the Media Library into an FS of its own.

Why would you keep the media library separate from the file system if it's all about /files/ ? Integrate it as well as you can and let the FS driver do the work of keeping it in sync. Saves you work.

Also, especially do NOT "make other programs work nicely with X". As soon as you single out any given program you will give a bad signal to all the others. Either define your own (and try to get others to conform, say, by being way ahead of them) or do not integrate it. Don't complete others' work.

zloba · Post by **zloba** » Wed Jan 12, 2005 12:12 pm

Candy

That assumes you can always define those new fields before somebody else decides to "add those to your software to save you the trouble": three people do the same thing in slightly incompatible ways, and chaos ensues.

It's a nice idea, but it's not practical. Also, I don't think this is any better than not defining any metadata fields.

standards should be managed and coordinated. otherwise indeed chaos results. think of how it's done with RFCs.

Why would you keep the media library separate from the file system if it's all about /files/ ? Integrate it as well as you can and let the FS driver do the work of keeping it in sync.

not exactly separate. i mean for them to work together, an FS module to manage files, a Media Library module to organize and classify them..

Candy · Post by **Candy** » Wed Jan 12, 2005 12:25 pm

zloba wrote: standards should be managed and coordinated. otherwise indeed chaos results. think of how it's done with RFCs.

Without manpower to coordinate and manage it (remember, we all have daytime jobs or studies) we have to choose to either predefine it all (increasing the overhead) or make it completely undefined but within a bunch of rigid lines. I think most people can determine that the bit rate would be an integer and that the creation date would be a date. Might add a "percentage" to it, but that should suffice. Within those restrictions people have a single good choice (merge int and real btw, to number) and they will probably make it. It's the best uncoordinated effort we can do.

Time to call for a vote on the metadata. Do we:

1. Predefine all fields, predefine types, predefine names and types to apply it to?
2. Predefine fields, types and names, but not the filetypes it applies to?
3. Predefine only types.
4. Leave everything very much open (variant-type).

My personal vote goes to number 3.
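For what it's worth, option 3 could look something like the following Python sketch: the set of types is fixed, while attribute names stay free-form. The `AttrType` enum, the parser table and `make_attribute` are hypothetical names; int and real are merged into one "number" type as suggested above.

```python
from datetime import date
from enum import Enum

class AttrType(Enum):
    DATE = "date"
    NUMBER = "number"  # int and real merged into a single type
    STRING = "string"

# One converter per predefined type; anything else is rejected.
_PARSERS = {
    AttrType.DATE: lambda raw: raw if isinstance(raw, date) else date.fromisoformat(raw),
    AttrType.NUMBER: float,
    AttrType.STRING: str,
}

def make_attribute(name: str, attr_type: AttrType, raw):
    """Validate a freely named attribute against one of the predefined types."""
    return (name, attr_type, _PARSERS[attr_type](raw))

print(make_attribute("bit rate", AttrType.NUMBER, 128))
print(make_attribute("creation date", AttrType.DATE, "2005-01-12"))
```

A value that does not fit its declared type (say, a bit rate of "fast") raises `ValueError`, which is about all the enforcement option 3 offers.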

not exactly separate. i mean for them to work together, an FS module to manage files, a Media Library module to organize and classify them..

How about a general "media library module" that is also a library for all other sorts of information about files? Just a metadata database?

I don't see why you want to separate these two so badly. Is there a reason I don't understand or see? For user files this is only a bit of overhead if it can be used, otherwise it's no overhead (you can't spend much time managing no information).

*edit: Yay! I made reply 42!

Back to topic though, the filesystem design. I'll update it asap and post a new version as soon as it's done.

I'm currently fighting with the idea of how to integrate the file system type... will put on music and think about it the rest of the evening.

Pype.Clicker · Post by **Pype.Clicker** » Thu Jan 13, 2005 3:53 am

Candy wrote:
Time to call for a vote on the metadata. Do we:

1. Predefine all fields, predefine types, predefine names and types to apply it to?
2. Predefine fields, types and names, but not the filetypes it applies to?
3. Predefine only types.
4. Leave everything very much open (variant-type).

My personal vote goes to number 3.

We certainly have to predefine types (e.g. is the content a block ID, an integer, a part of a stream, etc), imho.
We probably also have to predefine basic fields such as file size, file class identifier and whatever the system requires to access the data correctly.

So my policy would be "specify types, specify 'standard' attributes names, types and semantics, leave everything else open".

If we're going for metadata==attributes, i suggest a part of the attributes is left for class-specific stuff. From a generic *FS program, the "document::author" will be an opaque "class-specific attribute #4" of "class #1234". That's enough for class 1234-aware software to know what it is and for generic software to know that it's something it shouldn't try to understand.
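A small Python sketch of how that split could look from the two kinds of software. The `Attribute` record, the `CLASS_1234_NAMES` table and `describe` are hypothetical; only the numbers 1234 and 4 and the name "document::author" come from the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    class_id: int  # file class that owns the attribute
    number: int    # attribute number within that class
    value: bytes   # opaque payload for software that doesn't know the class

# Hypothetical schema, known only to class-1234-aware software.
CLASS_1234_NAMES = {4: "document::author"}

def describe(attr: Attribute, known_classes: dict) -> str:
    """Render an attribute, decoding it only if its class is understood."""
    names = known_classes.get(attr.class_id)
    if names is None or attr.number not in names:
        return f"class-specific attribute #{attr.number} of class #{attr.class_id} (opaque)"
    return f"{names[attr.number]} = {attr.value.decode('utf-8')}"

attr = Attribute(1234, 4, "leduc".encode("utf-8"))
print(describe(attr, {}))                        # generic *FS program
print(describe(attr, {1234: CLASS_1234_NAMES}))  # class-1234-aware program
```

The generic program can still copy, export or delete the attribute as a blob; it just never tries to interpret it.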

Whether access control fields should be *FS-wide or OS-specific probably depends on what we think is fair. If a file has been restricted to user X on system Y, what is system Z expected to allow ?? Imho, the only part of access control attributes (readable, writable, owner, etc) that has to be defined at *FS level is "other" permissions: the "World" is the only thing that will be common to both OSes.

zloba · Post by **zloba** » Thu Jan 13, 2005 7:00 am

the closest thing to how i want it is

4. Leave everything very much open (variant-type).

except i'm not sure what this "variant-type" business means.
i'd just say "leave everything open", defining only general categories of attributes and data (the filesystem model), while at the same time, defining additional standards for whatever you care to design and implement at the moment (security model, whatever else).

you can ignore my opinion. if i get to do it, i'll test my ideas anyway, it's where a prototype could prove very useful. you may think something is a good idea, but try using it and see how it works out.
why try so hard to decide everything at once?

If we're going for metadata==attributes

i'm starting to think that these should be separate.
that is, metadata, which is an integral part of the document, vs. attributes, which are not and which may be system-dependent.
