File System Design.

BASICFreak · Post by **BASICFreak** » Fri Oct 16, 2015 7:16 pm

I've been offline and not updating GIT for the past week or so, instead I have been designing and testing a new File System.

My understanding of a File System may be a little flawed, as the only one I implemented thus far has been FAT, so bare with me just a little bit.

The design I have come up with take some points from FAT, EXT, and ELF - along with some personal assumptions.

The goal of this file system is quickly being able to open a file without searching through a directory tree (although you still could) - meaning that after initializing cache you only have one cluster to read before the actual file data. The way this works is through a hash table which hashes the full file path and associates said hash with the file block.

Before I go any further here are the data structures:

Code: Select all

typedef struct BFS_InfoBlock {
		uint8_t JumpCode[3];
		uint8_t SectorsPerCluster;
		char BFS_Signiture[4]; // "BFS1"
		uint16_t BytesPerSector;
		uint32_t TotalClusters;
		uint32_t HashFileBlock;
		uint32_t RootFileBlock;
		uint32_t StringTableFileBlock;
		uint32_t BATCluster;
		uint16_t ClustersPerBAT;
		char VolumeLabel[32];
		uint8_t BootCode[446];
		uint16_t BootSigniture;
	} __attribute__((packed)) BFSib_t, *BFSib_p;

	typedef uint32_t* BAT_p; // Binary Allocation Table

	typedef struct BFS_FileBlock {
		uint32_t Parrent;
		uint32_t ClusterChain[116];
		uint32_t Flags;
		uint16_t UID;
		uint16_t GID;
		uint32_t TimeStampCreated;
		uint32_t TimeStampModified;
		uint32_t HashValue[3];
		uint64_t SizeInBytes;
		uint32_t StringBlockOffset;
		uint32_t SlaveChainList; // FFFFFFFF = appended in this cluster, 0 = none
	} __attribute__((packed)) BFSfb_t, *BFSfb_p;

	typedef struct BFS_SlaveBlock {
		uint32_t ClusterChain[127];
		uint32_t SlaveChainList; // FFFFFFFF = appended in this cluster, 0 = none
	} __attribute__((packed)) BFSsb_t, *BFSsb_p;

	typedef struct BFS_DirectoryEntry {
		uint32_t FileBlockCluster;
		uint32_t StringBlockOffset;
		uint32_t Flags;
		uint16_t UID;
		uint16_t GID;
	} __attribute__((packed)) BFSde_t, *BFSde_p;

	typedef struct BFS_DirectoryTable {
		BFSde_t DirectoryEntry[32];
	} __attribute__((packed)) BFSdt_t, *BFSdt_p;

	typedef struct BFS_HashEntry {
		uint32_t Hash[3];
		uint32_t FileBlockCluster;
	} __attribute__((packed)) BFShe_t, *BFShe_p;

	typedef struct BFS_HashTable {
		BFShe_t HashEntry[32];
	} __attribute__((packed)) BFSht_t, *BFSht_p;

	typedef char* BFSst_p; // String Table

And the Hash function:

Code: Select all

uint8_t temp;
hash[0] = 2147481863;
hash[1] = 1147440961;
hash[2] = 4180001983;
while((temp = (uint8_t)*input++)) {
	hash[0] += temp;
	hash[1] -= temp;
	hash[2] /= temp;
	hash[0] ^= hash[1];
	hash[1] ^= hash[2];
	hash[2] *= hash[0];
}

Currently I have found no conflicts with my hash method (I hashed the full path to all 1.5 million files on my internal HDDs with 0 collisions. [which took ~700 ms - took longer to search for collisions...])

And am currently about 20-30% done with the utility for accessing the FS in *nix. (Able to create an image and read existing files - though cannot add new files...)

The hash table, string table, and root directory are stored as files on the disk - so they can be dynamically allocated (couldn't decide on a fixed size and dynamic seems better IMO)

The current design allows for a disk anywhere from 4KB to 32TB with 1 to 64 Sectors per cluster respectfully - though it would be easy to increase this to 16PiB but it wastes a-lot of space (128MB at 32TB). (note at 4KB you have no free space on disk as it is all allocated for the FS)

So to access a file you should:
1) Hash the full path
2) Search the hash table
3) Read file block
4) read file clusters...

To add a file you should:
1) Hash the full path of the sub directory
2) Search hash table
(if not found ERROR or search DIR tree)
3) Create new file block
4) Write file to disk (while setting Cluster Number in file block / slave block(s))
5) Add string (file name) to string table
6) Read sub dir file block
7) Read sub dir clusters (usually only need the last cluster)

Find blank entry <- Smiley not supposed to be there...
9) Insert new entry.
A) Hash the full path of the new file
B) Insert Hash Table Entry

The root directory is stored in the Hash Table as "/", all other directories are stored as "/dir" (no ending '/')

So, to the main point of this post...

Since my knowledge of File Systems is scarce, I would really appreciate feedback on this design.

Am I missing something important?
Do I have something I shouldn't?
etc...?

NOTE: I got side-tracked a few times writing this... So it may be missing some [important] info.

Brendan · Post by **Brendan** » Fri Oct 16, 2015 10:23 pm

Hi,

Random thoughts...

There's no real need to have much/any information in the first sector; and the first sector is typically much more important for boot code. Also, if you ever plan to put the file system on a floppy (or a device emulating a floppy) then you're going to get conflicts with the BPB.

A hash needs to be large so that you don't get too many collisions when there's millions of files; but it also needs to be small enough that that you can have a hash table in memory (e.g. "myHashTable[hash]"). Computers won't have enough RAM to support 96-bit hashes for at least several thousand years. In practice the file system driver is only going to be able to use a fraction of the full hash (e.g. "myHashTable[hash & 0x000FFFFF]") and the remaining bits of the hash only waste disk space.

"Parrent" should be "Parent".

I don't know what your timestamps are; but 32-bits is too small. Either the precision isn't enough (e.g. "seconds since the epoch"), or the range isn't enough (e.g. "microseconds since the epoch").

When you're allocating a new cluster (for whatever reason) and the disk is almost full, is there any way to avoid doing a linear search through all (4 billion?) Binary Allocation Table entries? What if there are no free clusters at all?

If you delete a file, do you end up with a hole in both the hash table and the string table; and if so how do you find suitable free areas/holes within the hash table and string table efficiently next time you create a file (especially the string file where the entries are variable length)?

How do you ensure that the file system doesn't become corrupted if/when (e.g.) power fails in the middle of any operation (or is unplugged by a careless user, or the OS or file system code or storage device driver crashes at the wrong time)? For example:

Hash the full path of the sub directory
Search hash table to make sure sub directory exists
If power fails before this point, everything is fine
Find and allocate a new file block.
If power fails at this point you end up with lost clusters (is there a way to find/reclaim them later or do they become permanently wasted?)
Write file to disk (while setting Cluster Number in file block / slave block(s))
If power fails at this point you end up with multiple lost clusters
Add string (file name) to string table
If power fails at this point you end up with multiple lost clusters and an entry in the string table that doesn't correspond to a valid file.
Read sub dir file block
Read sub dir clusters (usually only need the last cluster)
Insert new entry.
How does "insert" work (and are you sure you didn't mean "append")? If you need to relocate existing entries to insert a new entry before them, then you risk wiping out entire sub directory contents if power fails at the wrong time.
If power fails after this is finished successfully; you've got a file and directory entry but no entry in the hash table; so the file would mostly exist but couldn't be found.
Hash the full path of the new file and insert Hash Table Entry
How does "insert" work (and are you sure you didn't mean "append" here too)?

How do you ensure that the file system doesn't become corrupted if/when a sector fails (e.g. flash device starts getting close to its "max. number of writes")? Is there any redundancy in critical data structures? For example, if part of the cluster allocation table becomes unreadable is it possible to recover the files?

For "rotating disk" storage devices; if you create a lot of small files will the disk heads be constantly seeking between binary allocation table, hash table, string table, directories and file data spread everywhere throughout the disk (and not near each other to make seeks less expensive)?

How do you plan to minimise fragmentation (of files, hash table, string table, etc), and/or efficiently defragment the file system?

If all names (including the root directory) begin with a "/" character, what is the point of storing this character or including it in hashes?

Assuming strings/names are 100 bytes each on average, does this mean you can't store more than 43 million files (regardless of how huge the partition is) because the string table would be larger than 4 GiB and the "StringBlockOffset" field is only 32 bits?

Cheers,

Brendan

BASICFreak · Post by **BASICFreak** » Fri Oct 16, 2015 11:37 pm

I'll reply to what I have the answer for right now, and the rest when I find the answer...

Brendan wrote:There's no real need to have much/any information in the first sector; and the first sector is typically much more important for boot code. Also, if you ever plan to put the file system on a floppy (or a device emulating a floppy) then you're going to get conflicts with the BPB.

I'm fairly sure only MS file systems use BPB (and now EFI - which I have no support for anyways)

http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/determining-filesystem-type.html wrote:Linux mis-uses the Microsoft Data partition, by creating volumes in it that do not have valid BPBs. This is a fault in Linux. (EXT2/EXT3/EXT4 partitions could easily be given BPBs. In actual fact, the mke2fs tool has possessed the code for creating valid BPBs since 1996, when it was added by Matthieu Willm. It has simply been conditionally compiled out and relegated to an OS/2 porting section, of all things.).

Brendan wrote:A hash needs to be large so that you don't get too many collisions when there's millions of files; but it also needs to be small enough that that you can have a hash table in memory (e.g. "myHashTable[hash]"). Computers won't have enough RAM to support 96-bit hashes for at least several thousand years. In practice the file system driver is only going to be able to use a fraction of the full hash (e.g. "myHashTable[hash & 0x000FFFFF]") and the remaining bits of the hash only waste disk space.

Well seeing how my Hash Table uses 30.5MB for 1 million entries, any computer since 1999 should be able to cash at least 3 Million entries (91.5MB) without filling the RAM but 1/6th the way (saying 512MB in 1999 [I had 768MB in '99])
And seeing how all 5 of my PC's internal HDDs have a total of 1.5 million files - about 3.2TB including 2 windows installs, lubuntu, and many many games, backups, and codes... And seeing how even my test system has 4GB of ram...

But my design choices here was 32-bit hash, 96-bit hash, or 224-bit hash (the 96-bit seems safest)

Also explain why in practice it can only use a fraction of the hash... I don't get that one.

"Parrent" should be "Parent".

Possibly, but it's just a variable name...

I don't know what your timestamps are; but 32-bits is too small. Either the precision isn't enough (e.g. "seconds since the epoch"), or the range isn't enough (e.g. "microseconds since the epoch").

Well I was going to post my timestamp, but I realized it didn't work when I placed it in the reply box...

When you're allocating a new cluster (for whatever reason) and the disk is almost full, is there any way to avoid doing a linear search through all (4 billion?) Binary Allocation Table entries? What if there are no free clusters at all?

Currently there is no way to avoid searching the entire table, but it is planned soon to have a first free cluster variable. And upon finding no free clusters you will get a response of cluster number 0, which cannot be used as it is the boot cluster - and then you are just SOL, just like any other FS.

Skip a few... (I'll get back to them when I have the answer, I'm tired and cannot fully think clearly...)

For "rotating disk" storage devices; if you create a lot of small files will the disk heads be constantly seeking between binary allocation table, hash table, string table, directories and file data spread everywhere throughout the disk (and not near each other to make seeks less expensive)?

If you are using directory's and string table then there may be a decent bit of seeking, though if you are looking for a file 90% of the time you know the full path - in which case the hash table will point directly to the file block. (and if the hash table is in RAM you only have one read before File Data.) The only time I expect you to actually have to search the directories is for and only for directory listings (though if you are not doing recursive listing you can get the directory file block from hash table - and even if you are you should get the "root" of the listing from the hash table.)

How do you plan to minimise fragmentation (of files, hash table, string table, etc), and/or efficiently defragment the file system?

For files, I (will) attempt to allocate clusters in a row where possible to keep the file together. As for the tables it may likely need to go through a defragmenter every now and then to keep the clusters in order (or close) - though you could allocate more than one cluster to start with, which should help at least some.

If all names (including the root directory) begin with a "/" character, what is the point of storing this character or including it in hashes?

I felt like it. Plus my VFS (or my old one - haven't made one on this kernel...) passes the path with the leading '/'.

Assuming strings/names are 100 bytes each on average, does this mean you can't store more than 43 million files (regardless of how huge the partition is) because the string table would be larger than 4 GiB and the "StringBlockOffset" field is only 32 bits?

Good point... Especially since the max size for the string table itself is 64-bit...

And yes by insert I meant append.

Anyways Brendan you make some good points, and I will defiantly reflect on them all. These are the exact reasons I post before fully implementing an entire system.

Brendan · Post by **Brendan** » Sat Oct 17, 2015 3:07 am

Hi,

BASICFreak wrote:
Brendan wrote:There's no real need to have much/any information in the first sector; and the first sector is typically much more important for boot code. Also, if you ever plan to put the file system on a floppy (or a device emulating a floppy) then you're going to get conflicts with the BPB.
I'm fairly sure only MS file systems use BPB (and now EFI - which I have no support for anyways)

For USB flash the firmware on some systems uses the existence or non-existence of a BPB in the first sector of the disk to determine if it should pretend the device is a floppy or hard disk. For actual floppies you don't need a BPB; but if you don't have one Windows whines about the disk being unformatted and prompts the user to format it (and users don't understand, decide that Microsoft must be right and that your OS must be broken, and format the disk).

Mostly; you want a BPB in the first sector of an unpartitioned disk to prevent these sorts of problems (where first sector of an unpartitioned disk is also the first sector of the file system). Ext2/3/4 is mostly only used on partitioned hard drives where it isn't a problem.

BASICFreak wrote:
Brendan wrote:A hash needs to be large so that you don't get too many collisions when there's millions of files; but it also needs to be small enough that that you can have a hash table in memory (e.g. "myHashTable[hash]"). Computers won't have enough RAM to support 96-bit hashes for at least several thousand years. In practice the file system driver is only going to be able to use a fraction of the full hash (e.g. "myHashTable[hash & 0x000FFFFF]") and the remaining bits of the hash only waste disk space.
Well seeing how my Hash Table uses 30.5MB for 1 million entries, any computer since 1999 should be able to cash at least 3 Million entries (91.5MB) without filling the RAM but 1/6th the way (saying 512MB in 1999 [I had 768MB in '99])
And seeing how all 5 of my PC's internal HDDs have a total of 1.5 million files - about 3.2TB including 2 windows installs, lubuntu, and many many games, backups, and codes... And seeing how even my test system has 4GB of ram...

But my design choices here was 32-bit hash, 96-bit hash, or 224-bit hash (the 96-bit seems safest)

For a 32-bit hash to be actually useful (assuming 64-bit pointers - e.g. "pointer_to_first_entry_in_linked_list_for_hash = hashTable[hash]") you'd need to devote 32 GiB of RAM for the hash table alone. That's more RAM than most computers have. It makes no difference if there's 3000 files or 3 billion files.

For a 96-bit hash to be actually useful you'd need to wait for Intel to increase virtual address space size (from the current 48-bit virtual addresses up to a minimum of 100-bit virtual addresses). Then you're going to need to wait for RAM prices to drop significantly just so the wealthiest country on the planet can afford to buy the 576460752303423488 TiB of RAM to use for the hash table.

BASICFreak wrote:Also explain why in practice it can only use a fraction of the hash... I don't get that one.

Try doing "entry *myArray[ (1 << 96) ];" in a C compiler and see what happens. More practical is "entry *myArray[ ARRAY_SIZE ];" and then "pointer_to_first_entry_in_linked_list_for_hash = myArray[hash % ARRAY_SIZE];", but that just means most of the hash is being pointlessly wasted. It's fine (smart even) to have the hash a little bigger than necessary (so that you can use a larger array one day) but 96-bit is much much more than "a little bigger than necessary".

BASICFreak wrote:
"Parrent" should be "Parent".
Possibly, but it's just a variable name...

I honestly don't care if you fix any of the things I've mentioned. I've only mentioned them to help you improve the file system and its documentation.

BASICFreak wrote:
When you're allocating a new cluster (for whatever reason) and the disk is almost full, is there any way to avoid doing a linear search through all (4 billion?) Binary Allocation Table entries? What if there are no free clusters at all?
Currently there is no way to avoid searching the entire table, but it is planned soon to have a first free cluster variable. And upon finding no free clusters you will get a response of cluster number 0, which cannot be used as it is the boot cluster - and then you are just SOL, just like any other FS.

If you mean that instead of doing this:

Code: Select all

    allocated_cluster = expensive_linear_search();

..you're going to do this:

Code: Select all

    allocated_cluster = first_free_cluster;
    first_free_cluster = expensive_linear_search();

..then let me suggest a linked list of free clusters, like this:

Code: Select all

    allocated_cluster = first_free_cluster;
    first_free_cluster = BAT[allocated_cluster];

Of course this won't help fragmentation problems; which means it's extremely likely that there's a faster/better way.

BASICFreak wrote:
For "rotating disk" storage devices; if you create a lot of small files will the disk heads be constantly seeking between binary allocation table, hash table, string table, directories and file data spread everywhere throughout the disk (and not near each other to make seeks less expensive)?
If you are using directory's and string table then there may be a decent bit of seeking, though if you are looking for a file 90% of the time you know the full path - in which case the hash table will point directly to the file block. (and if the hash table is in RAM you only have one read before File Data.) The only time I expect you to actually have to search the directories is for and only for directory listings (though if you are not doing recursive listing you can get the directory file block from hash table - and even if you are you should get the "root" of the listing from the hash table.)

If you're creating a lot of small files; then for every file you create you have to write the file's data, modify the BAT, insert/append to the hash table, insert/append to the string table, and modify the directory. You might also have to modify the BAT just to increase the size of the hash table, string table or directory before you add the new file's hash/string/directory entry. This means that you need to seek many times to create one file; and it's likely (especially when the disk gets close to being full) that none of these things will be close to each other and the heads will be bouncing between the middle/end of the disk (where file data is) and the very start of the disk (where BAT is).

If you have a look at (e.g.) ext2 you'll see they spread metadata all over the disk to minimise seeks times. You can steal the same basic idea; and (e.g.) have the part of the BAT that keeps track of clusters 1000 to 1099 immediately before cluster 1000, the part of the BAT that keeps track of clusters 1100 to 1199 immediately before cluster 1100, the part of the BAT that keeps track of clusters 1200 to 1299 immediately before cluster 1200, and so on; so that the part of the BAT you need is always very close to the cluster you're using.

Cheers,

Brendan

onlyonemac · Post by **onlyonemac** » Sat Oct 17, 2015 10:17 am

Clever idea, but it pretty much does away with the concept of file paths at the lower levels. An interesting idea though, to index the files by the hashes of their paths, essentially "compressing" the path and eliminating the need to walk file trees.

Another (although somewhat unrelated) idea that I have heard of is to refer to the files by their own hashes, although in that case there are no file paths (so yeah, very unrelated).

BASICFreak · Post by **BASICFreak** » Sun Oct 18, 2015 3:21 pm

Brendan,

You make many good points, my design is s***, I'm back off to the drawing board.

I did some calculations on my design and though the hash table would not be as large as you say there are many many other issues with wasted space that I did not foresee, especially when the drive is smaller - a 512GB drive would require far more space than a 32TB drive, and the 512GB could hold more files than the 32TB... So yea, terrible design.

I will likely be back with an improved design - and I would really appreciate your opinion on it.

Also I hope my sleep deprived sarcasm about the variable name did not offend you too much, it seemed to come off worse than I wanted it to... (possibly just due to the

instead of

)

Thanks again,

B!

Brendan · Post by **Brendan** » Sun Oct 18, 2015 8:58 pm

Hi,

BASICFreak wrote:You make many good points, my design is s***, I'm back off to the drawing board.

The design isn't really s***; but when you design anything new (and non-trivial) there's always a lot of different things flying around inside your mind, there's always some things you overlook, and always one or more revisions because of that.

BASICFreak wrote:I will likely be back with an improved design - and I would really appreciate your opinion on it.

I'd be very tempted to split the file system into 2 pieces; where the lower level piece handles free space and either "sequences of bytes" or "sequences of blocks" (sort of like "malloc()" but for disk space not memory); and the higher level piece handles directories, file names, hash tables, etc (and uses the lower level piece to allocate/resize/delete disk space).

This makes it a easier to design/implement the file system (which is hard, especially when you want performance, robustness and efficient use of space) - basically; split a harder problem into 2 easier problems. It also means you might be able to have different "lower level pieces" with very different allocation/optimisation strategies (e.g. one for rotating hard disks, and one for USB flash and SSD where fragmentation and seek times aren't as important).

BASICFreak wrote:Also I hope my sleep deprived sarcasm about the variable name did not offend you too much, it seemed to come off worse than I wanted it to... (possibly just due to the instead of )

I don't get offended that easily.

Cheers,

Brendan

OSDev.org

File System Design.

File System Design.

Re: File System Design.

Re: File System Design.

Re: File System Design.

Re: File System Design.

Re: File System Design.

Re: File System Design.