Filesystem design
Filesystem design
I've been thinking about designing a FS. And I've come up with a simple but (i think) good idea.
Like with FAT there is a BPB but its very small it is:-
char [2] - Jump code
char [4] - Magic - "CSFS" (cimicis server file system)
char [8] - Disk Name
short - Number of root directory entries
double - number of sectors on disk
(maybe - double - serial number)
After the bootsector there will be some sectors which record which sectors are in use. 1 bit per sector. So this "table" is <No. of sectors> / 8 / 512 sectors long.
Then comes the root dir. This is 5 sectors long (allowing 80 32byte entries)
The directory entries are 32bytes long and are structured as follows:-
char [19] - Filename
double - First Sector
double - Size
double - Edit Date (seconds since ???????)
char - Flags
The flags byte is:-
Bit 0 - Set if read only, clear if write allowed
Bit 1 - Set if hidden
Bit 2 - Set if system file
Bit 3 - Set if file, clear if directory
Bit 4 - Set if executable
Bit 5 - Set if file is a driver module
Bit 6 - Set if file is the kernel
Bit 7 - Reserved = 1
Therefore for the kernel the flags byte is 0xFF
The bootsector recognises the kernel by the flags being all set. The filename can be used as a version number string or whatever.
The rest. At the start sector there are 2 doubles. One specifies the length of the block and the second the start sector of the next block. If it is the last block it is set to 0. (this means that block sizes are variable)
May have forgotten some stuff but I'm sure someone will ask for it if I have
Pete
Like with FAT there is a BPB but its very small it is:-
char [2] - Jump code
char [4] - Magic - "CSFS" (cimicis server file system)
char [8] - Disk Name
short - Number of root directory entries
double - number of sectors on disk
(maybe - double - serial number)
After the bootsector there will be some sectors which record which sectors are in use. 1 bit per sector. So this "table" is <No. of sectors> / 8 / 512 sectors long.
Then comes the root dir. This is 5 sectors long (allowing 80 32byte entries)
The directory entries are 32bytes long and are structured as follows:-
char [19] - Filename
double - First Sector
double - Size
double - Edit Date (seconds since ???????)
char - Flags
The flags byte is:-
Bit 0 - Set if read only, clear if write allowed
Bit 1 - Set if hidden
Bit 2 - Set if system file
Bit 3 - Set if file, clear if directory
Bit 4 - Set if executable
Bit 5 - Set if file is a driver module
Bit 6 - Set if file is the kernel
Bit 7 - Reserved = 1
Therefore for the kernel the flags byte is 0xFF
The bootsector recognises the kernel by the flags being all set. The filename can be used as a version number string or whatever.
The rest. At the start sector there are 2 doubles. One specifies the length of the block and the second the start sector of the next block. If it is the last block it is set to 0. (this means that block sizes are variable)
May have forgotten some stuff but I'm sure someone will ask for it if I have
Pete
Re:Filesystem design
No, just...eww...no.
Sorry if that sounds harsh, but designing a decent filesystem is a large project in its own right. It isn't something you can do by just getting inspired and jotting down a few notes.
Sorry if that sounds harsh, but designing a decent filesystem is a large project in its own right. It isn't something you can do by just getting inspired and jotting down a few notes.
Re:Filesystem design
Looks like you've reinvented FAT, but not quite as well.
Putting the 'next block' link in the block itself is a pain: it's much easier to dedicate the whole block to data and put the links all in one place. This is what FAT does.
What's the use in variable-length blocks?
I agree with Curufir: effective file system design is independent of OS design. You can have a file system without its own OS, and you can have an OS without its own file system. I suggest you either stick with somebody else's FS (I recomment FAT for simplicity and compatibility, or ext2 for niceness), or make a proper go at designing a database I mean file system.
Putting the 'next block' link in the block itself is a pain: it's much easier to dedicate the whole block to data and put the links all in one place. This is what FAT does.
What's the use in variable-length blocks?
I agree with Curufir: effective file system design is independent of OS design. You can have a file system without its own OS, and you can have an OS without its own file system. I suggest you either stick with somebody else's FS (I recomment FAT for simplicity and compatibility, or ext2 for niceness), or make a proper go at designing a database I mean file system.
Re:Filesystem design
But I thought that the disk access times would be lower if the info was sstored at the beginning of the block. So there are less seek times. That's also the advantage of variable size blocks.
Pete
Pete
Re:Filesystem design
but making an initial cache with all those info things is harder since you must seek around the disk. Look at NTFS.Pete wrote: But I thought that the disk access times would be lower if the info was sstored at the beginning of the block. So there are less seek times. That's also the advantage of variable size blocks.
Variable size blocks: call them extents and people here will think you've made a better design. It's the same.
Direntries: they need lots more information. PS: think about using inodes.
Things I think you should look at: NTFS design, Ext2 design, LogFS design, Modern OSes (Tanenbaum), Orange Book specification (USA DoD).
They all contain hints at ways to make your FS better, and they all contain their own flaws .
HTH, Candy
Re:Filesystem design
I've tried reading the design and implementation of ext2 but its too long. Is there a summary of the filesystem somewhere.
The idea was that there was no need for a cache of the "FAT" table
Pete
The idea was that there was no need for a cache of the "FAT" table
Pete
Re:Filesystem design
Caches are designed to speed up performance. Don't remove the cache, devise a way to make it more efficient / take less space. Using an extent instead of a block list is a good idea, it brings the storage of the disk space layout more similar to the usage of it.Pete wrote: The idea was that there was no need for a cache of the "FAT" table
As for the more peculiar choices, why 32-byte direntries? Why not CMA timestamps? (Create, Modify, Access) Why only 8 flags? Why no longer names than 19 chars? How do you handle extensions (.TXT things) ? What do you do with multiple links to one file, if someone wants to?
Re:Filesystem design
I wrote a bit about choosing a file system for Mobius. There's an overview of ext2 there, as well as a link to Dave Poirier's ext2-doc project.
http://mobius.sourceforge.net/documenta ... al/mfs.php
http://mobius.sourceforge.net/documenta ... al/mfs.php
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:Filesystem design
i see few "design" in this ... merely implementation arbitrary choices ... Imho, unless you plan to do something CompletelyDifferent (tm) with your files that cannot be achieved with SomethingThatAlreadyExists, there is no real reason to come with YetAnotherFilesystem.
One should keep in mind that beyond blocks organization, there are a lot of API stuff that will depend on your file system design (for instance, it's especially tricky to port an application that requires C/M/A times on a file system that doesn't provide this facility ...)
One should keep in mind that beyond blocks organization, there are a lot of API stuff that will depend on your file system design (for instance, it's especially tricky to port an application that requires C/M/A times on a file system that doesn't provide this facility ...)
Re:Filesystem design
This was only an idea. I probally won't use it in my OS. I almost definiately won't following your points
Pete
Pete
Re:Filesystem design
I'm also designing a new filesystem. It's different from others, imo, partly because of the structure, partly because of the dynamic metadata handling (which imo is a proper design from the guesses we made here combined).
! Concepts
A number of separate spaces, called Entities. The entities for a disk are stored in the Root Entity Table, from cluster 8 through 15. This limits the number of root entities to 1024, however, from the point of usage this is a reasonable maximum. In some future version it might be extended easily, it's a file itself too.
A number of inodes per disk. Each disk has a number of inodes proportional to its size, but the amount can grow. The inode number itself is a structure that takes both the local inode number (being 48 bits wide) and the global disk ID (being 16 bits wide) in account, to come out to an inode number that is valid in the computer system, and unique in the system.
All special structures are themselves files, easing the updating of them by normal software (as opposed to using special hacks). They appear in the system entity, and are only to be found there. They map directly to the clusters that they are supposed to be, they're not an image of them.
Each inode contains only the created and modified-dates. It also contains the ACL for the file, some file-specific flags (compression, encryption, double-extent) and a typecode. The rest of the inode is filled with an extentlist. If the double-extent flag is set, the extent(s) pointed to by the inode are themselves extentlists for the file. If your disk is so full the file cannot fit in a small number of segments, you might as well suffer the consequences. The inode is only 64 bytes, leaving only 32 for extents (in the worst case, meaning only a single normal extent can be used).
The directory entries are more complex than with linux filesystems. As it is not interesting for a file link to know when the inode was last seen by anybody, it's more natural to know when it was last looked at by this file link. So, the accessed timestamp is moved here. The deleted timestamp is also placed here. If the dtime is in the past, the file link (and possibly inode) should be deleted instantly. If it is in the future, a future delete should be scheduled by the OS, so the file will be deleted at that moment. Not sure whether this is a useful feature, but it does seem like a smart idea to me.
They also contain the inode number (64-bit) and the name pointer (to a name buffer, not designed yet).
Each disk (or partition.....) contains a number of files and its own system entity. The system entity contains these files:
* BOOT - cl0-7
* INODES - cl16-xx
* RET -cl8-15
* CONFIG - random cl
* BAD -random cl
* FREE - random cl
Config contains a config file specifying what disks need to be present too to make the system work. Bad points to an existing block (or existing blocks), that contains extents of bad blocks. Free points to an existing block (same as bad), but this contains extents of free blocks. Both BAD and FREE can also be indirected one step more so they can actually contain all free segments (possibly needing defragmentation, but that's a module thing).
The extents are like in NTFS, with a slight twist.
Extents are for example 22 00 10 00 20 for an extent at 2000 with size 1000 (you'll see those two if you think little endian), and the 22 stands for 2-byte size spec plus 2-byte offset spec
Since this left both slack space existant, and didn't exhaust the code space, I decided to add one to it.
29 00 10 10 00 01
this one contains a block at clusternumber 1000, with it containing from 010 to 020 the last 10h bytes from this file. This is a construction that does not need to appear, it can appear if the OS decides the file is very unlikely to change, and doesn't need to waste space.
Some more thought going in tomorrow, hoping to get some constructive criticism on this. Followed up by a post about the metadata thing, this one is full (forum definition of full).
! Concepts
A number of separate spaces, called Entities. The entities for a disk are stored in the Root Entity Table, from cluster 8 through 15. This limits the number of root entities to 1024, however, from the point of usage this is a reasonable maximum. In some future version it might be extended easily, it's a file itself too.
A number of inodes per disk. Each disk has a number of inodes proportional to its size, but the amount can grow. The inode number itself is a structure that takes both the local inode number (being 48 bits wide) and the global disk ID (being 16 bits wide) in account, to come out to an inode number that is valid in the computer system, and unique in the system.
All special structures are themselves files, easing the updating of them by normal software (as opposed to using special hacks). They appear in the system entity, and are only to be found there. They map directly to the clusters that they are supposed to be, they're not an image of them.
Each inode contains only the created and modified-dates. It also contains the ACL for the file, some file-specific flags (compression, encryption, double-extent) and a typecode. The rest of the inode is filled with an extentlist. If the double-extent flag is set, the extent(s) pointed to by the inode are themselves extentlists for the file. If your disk is so full the file cannot fit in a small number of segments, you might as well suffer the consequences. The inode is only 64 bytes, leaving only 32 for extents (in the worst case, meaning only a single normal extent can be used).
The directory entries are more complex than with linux filesystems. As it is not interesting for a file link to know when the inode was last seen by anybody, it's more natural to know when it was last looked at by this file link. So, the accessed timestamp is moved here. The deleted timestamp is also placed here. If the dtime is in the past, the file link (and possibly inode) should be deleted instantly. If it is in the future, a future delete should be scheduled by the OS, so the file will be deleted at that moment. Not sure whether this is a useful feature, but it does seem like a smart idea to me.
They also contain the inode number (64-bit) and the name pointer (to a name buffer, not designed yet).
Each disk (or partition.....) contains a number of files and its own system entity. The system entity contains these files:
* BOOT - cl0-7
* INODES - cl16-xx
* RET -cl8-15
* CONFIG - random cl
* BAD -random cl
* FREE - random cl
Config contains a config file specifying what disks need to be present too to make the system work. Bad points to an existing block (or existing blocks), that contains extents of bad blocks. Free points to an existing block (same as bad), but this contains extents of free blocks. Both BAD and FREE can also be indirected one step more so they can actually contain all free segments (possibly needing defragmentation, but that's a module thing).
The extents are like in NTFS, with a slight twist.
Extents are for example 22 00 10 00 20 for an extent at 2000 with size 1000 (you'll see those two if you think little endian), and the 22 stands for 2-byte size spec plus 2-byte offset spec
Since this left both slack space existant, and didn't exhaust the code space, I decided to add one to it.
29 00 10 10 00 01
this one contains a block at clusternumber 1000, with it containing from 010 to 020 the last 10h bytes from this file. This is a construction that does not need to appear, it can appear if the OS decides the file is very unlikely to change, and doesn't need to waste space.
Some more thought going in tomorrow, hoping to get some constructive criticism on this. Followed up by a post about the metadata thing, this one is full (forum definition of full).
Re:Filesystem design
The metadata system works by having a number of modules, defined by the OS structures (higher level), that take a file of type X, and apply *some* algorithm to get to a statistic S. This statistic can then be cached for all files that have a statistic S, and for the files with a statistic of type S an index can be made. They can thus also be searched. Files can have more than one statistic of S coming out of that algorithm, and of course, can have none.
This allows you to sort your documents on page count, word count, read complexity, relevance to <subject>, find any document that uses the word combination "obsessive compulsive disorder" or the acronym "OCD", stuff like that.
Also, this is done without explicitly putting in the OS any sort of statistic. The OS only allows you to add modules that convert a filetype to a statistic, and allows you to make indexes on those statistics (alphabetically, by size, etc).
This solves the n most common problems with metadata
- You don't have to keep it up to date. Most people can't bring it up to adjust an ID3 tag if it's bad, so why would they do this instead? It's not needed here.
- The program that knows most does the statistics. The OS doesn't have to learn DOC or PDF to search it, and any new file type will be able to have their contents indexed without having to wait for me.
- It's up to date instantly. You don't have to refresh or do anything weird, if the cache is older than the files, the cache will not be used. Simple as that, and it works like a charm.
End of metadata story. The cache is stored in every directory that has a cache (if no statistics, why make an empty cache?) under the name of $metadata_cache. All indexes are stored as $index_<name>.
------
I need to get these things posted on my website instead of littering the forum with them. Excuse me for this rant.
This allows you to sort your documents on page count, word count, read complexity, relevance to <subject>, find any document that uses the word combination "obsessive compulsive disorder" or the acronym "OCD", stuff like that.
Also, this is done without explicitly putting in the OS any sort of statistic. The OS only allows you to add modules that convert a filetype to a statistic, and allows you to make indexes on those statistics (alphabetically, by size, etc).
This solves the n most common problems with metadata
- You don't have to keep it up to date. Most people can't bring it up to adjust an ID3 tag if it's bad, so why would they do this instead? It's not needed here.
- The program that knows most does the statistics. The OS doesn't have to learn DOC or PDF to search it, and any new file type will be able to have their contents indexed without having to wait for me.
- It's up to date instantly. You don't have to refresh or do anything weird, if the cache is older than the files, the cache will not be used. Simple as that, and it works like a charm.
End of metadata story. The cache is stored in every directory that has a cache (if no statistics, why make an empty cache?) under the name of $metadata_cache. All indexes are stored as $index_<name>.
------
I need to get these things posted on my website instead of littering the forum with them. Excuse me for this rant.
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:Filesystem design
Integrating and automating the use of metadata is also one of my primary goals for Clicker's IFS. I'm currently developing a Linux-based emulation of the features i wish to store metadata such as PDF document's title, bibliography list etc. and searching them quite efficiently (by simply grep'ing the collection of metadata stores) while still hiding all these stuff from the user.
The document are mimetype-classified when you insert it in the system and you can later "cast" it to a specific sub-class (like trying to find the bibliography when meta-cast article mypaper.pdf is called) and the system might complain with things like "found 'Chapter 1' in mypaper.pdf; likely to be a book. If you're sure this is a paper, retry using the --force flag".
The next step is to integrate g00gling to the system (allowing an auto-lookup for lyrics when you will meta-cast song madonna-latest-hit.mp3)
The document are mimetype-classified when you insert it in the system and you can later "cast" it to a specific sub-class (like trying to find the bibliography when meta-cast article mypaper.pdf is called) and the system might complain with things like "found 'Chapter 1' in mypaper.pdf; likely to be a book. If you're sure this is a paper, retry using the --force flag".
The next step is to integrate g00gling to the system (allowing an auto-lookup for lyrics when you will meta-cast song madonna-latest-hit.mp3)
Re:Filesystem design
Oh, sometimes, ranting helps for it loosens the knots sitting at crucial places in /dev/brain (thx pype).
Your Metadata idea resembles a bit of Apple HFS+ (Ressource fork&Data fork - just a whisp of this kind )
Would be nice to see this in action but be careful: *especially* joe average in his helpless mistrust might come to the idea not to trust something which takes the thinking for him - and he doesn't see where it has it's brain.
By the way, what exactly do you mean with "statistics?" in terms of metadata? I 'm a bit confused about this part. It's ok, the os doesn't need to know about file types and sort of ... this is for special modules as far as I understand it. and these modules then do the actual searching/combining/page or line counting (something like a wc?)?
@pype: Joe Average is of course happy with stuff hidden away from him, but will it make sarah programatzki happy, if she can't delve into the systems depths? May be you are right, may be there are more of Joe Average than of Sarah Programatzki outta there.
Your Metadata idea resembles a bit of Apple HFS+ (Ressource fork&Data fork - just a whisp of this kind )
Would be nice to see this in action but be careful: *especially* joe average in his helpless mistrust might come to the idea not to trust something which takes the thinking for him - and he doesn't see where it has it's brain.
By the way, what exactly do you mean with "statistics?" in terms of metadata? I 'm a bit confused about this part. It's ok, the os doesn't need to know about file types and sort of ... this is for special modules as far as I understand it. and these modules then do the actual searching/combining/page or line counting (something like a wc?)?
@pype: Joe Average is of course happy with stuff hidden away from him, but will it make sarah programatzki happy, if she can't delve into the systems depths? May be you are right, may be there are more of Joe Average than of Sarah Programatzki outta there.
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:Filesystem design
don't worry for Sarah! so far, she has "meta-geek-browse" to see the different 'forks' as unix files in a virtual directory and "meta-geek-edit" where she can see the whole meta-data for a given file expressed as an XML file. As XML will likely to be the default format for exporting metadata (on a CDROM, for instance), any class of document will have to provide a metadata marshaller/unmarshaller to convert internal stuff into XML and those meta-geek stuff will certainly be kept for the production version ...