Filesystem design: B-trees

nakst · Post by **nakst** » Wed Dec 26, 2018 2:21 pm

I've been reading about ext4 and NTFS. They both seem to use variants of B-trees to index their directories to improve file lookup time.
I was wondering whether it would make sense to use a single B-tree for the entire volume which contains every file and folder, using a hash of their full path and filename as keys. Therefore when a file is opened, the filesystem driver does not need to traverse every leading directory, but only a single B-tree.

So...

Why do these filesystems index at directory level with filename keys, rather than at the volume level with path+filename keys?
What are the advantages and disadvantages of doing this?
I think the answer is no, but would a (large) hash table be more efficient?

Thanks.

MichaelFarthing · Post by **MichaelFarthing** » Thu Dec 27, 2018 4:14 am

A hash system would hinder a search on a partial name - a useful facility.
You could, of course, provide a hash search just as one option.

thepowersgang · Post by **thepowersgang** » Thu Dec 27, 2018 4:17 am

Smaller maps are more time efficient to search (both due to being faster to load, and having less items to check), and since filesystems are for storage mediums where space isn't at a premium, it's more important to optimise for faster accesses (and notably avoiding needing to load lots of data from the disk)

Having each directory store its contents also improves locality. If you've just opened one file from a directory, then it's likely you're going to need another one sooon. If the first file's record is within the same block as the second, then it will already be loaded into the disk cache.

There's also the desire to be able to (cheaply) list a directory's contents, which necessitates grouping each directory's entries.

nakst · Post by **nakst** » Thu Dec 27, 2018 5:37 am

thepowersgang wrote:There's also the desire to be able to (cheaply) list a directory's contents, which necessitates grouping each directory's entries.

Apologies that my question wasn't very clear.
I am intending to have the B-tree for path to file/directory lookup, not to replace the directory entries. Each directory still has list of directory entries containing filenames and inodes* (allowing fast directory enumeration), and the nodes of B-tree point to these directory entries.

*I actually am thinking about storing inode data inline in the directory. For a file with multiple hard links, there will be one "master" entry that contains the inode information, and that the other hard links point to. If the master link is deleted then another hard link is chosen as the master, and the inode information is moved.

Korona · Post by **Korona** » Fri Jan 11, 2019 1:46 am

thepowersgang wrote:There's also the desire to be able to (cheaply) list a directory's contents, which necessitates grouping each directory's entries.

That's an operation that can also be supported with a global b-tree: If the tree is sorted by path depth (number of slashes) first, and then by full file system path, the contents of a directory are placed next to each other.

nakst · Post by **nakst** » Sun Jan 13, 2019 6:31 am

After implementing a single global file B-tree in my new filesystem, I have found the following problems:

You can't implement symbolic links, since you'd have to lookup each leading directory in the tree to check if was a link (which defaults the purpose of having the single tree, as the whole point was to only need 1 tree search operation to open a file)
You can't implement directory permissions, since you'd have to lookup each leading directory
You can't move directories without iterating over every child, removing them from the index tree, recalculating the hash of their path, and inserting them into the tree again.

With these problems, I've decided to switch to a more standard per-directory index tree. I'd be interested to hear if anyone has solutions for these problems though!

Korona · Post by **Korona** » Sun Jan 13, 2019 9:44 am

I think you have to accept that you need to look at each component of the path. IMHO that is not the goal that a single global B-tree tries to solve. Rather, I'd see them as a solution to fragmentation problem: if you store one B-tree per directory, you end of with lots of wasted blocks (because all directory entries fit into the root of the B-tree). The directory-move problem is solved by the "order by depth, then path" strategy that I described above.

Another issue though: Why are you inserting hashes into the tree instead of dealing directly with filenames?

nakst · Post by **nakst** » Sun Jan 13, 2019 11:37 am

Korona wrote:I think you have to accept that you need to look at each component of the path. IMHO that is not the goal that a single global B-tree tries to solve. Rather, I'd see them as a solution to fragmentation problem: if you store one B-tree per directory, you end of with lots of wasted blocks (because all directory entries fit into the root of the B-tree).

Does the root of the B-tree have to be in a separate block to the rest of the directory metadata? With 1KB directory entries, I have roughly 900 bytes of leftover space for inline data, in which I could fit about 28 keys (with an 8 byte hash). Looking at my Linux partition, 90% of folders have less than 28 files. So maybe a B-tree per directory isn't that much of a problem?

Korona wrote:The directory-move problem is solved by the "order by depth, then path" strategy that I described above.

I don't understand how this works, sorry. Surely you still have to move all the keys to a different place in the tree, and update their path?

Korona wrote:Another issue though: Why are you inserting hashes into the tree instead of dealing directly with filenames?

I'm assuming using fixed size keys is simpler than variable size paths/filenames.

Korona · Post by **Korona** » Sun Jan 13, 2019 11:59 am

nakst wrote:
Korona wrote:I think you have to accept that you need to look at each component of the path. IMHO that is not the goal that a single global B-tree tries to solve. Rather, I'd see them as a solution to fragmentation problem: if you store one B-tree per directory, you end of with lots of wasted blocks (because all directory entries fit into the root of the B-tree).
Does the root of the B-tree have to be in a separate block to the rest of the directory metadata? With 1KB directory entries, I have roughly 900 bytes of leftover space for inline data, in which I could fit about 28 keys (with an 8 byte hash). Looking at my Linux partition, 90% of folders have less than 28 files. So maybe a B-tree per directory isn't that much of a problem?

Yeah, but the idea would be to not require a full block per directory, exactly because most folders do not contain many files.

nakst wrote:
Korona wrote:The directory-move problem is solved by the "order by depth, then path" strategy that I described above.
I don't understand how this works, sorry. Surely you still have to move all the keys to a different place in the tree, and update their path?

If you tree is ordered by path depth first (i.e. number of slashes in a path) and the path itself second, all entries from a given directory are next to each other. The other would be:
foo/bar1
foo/bar2
foo/baz/bazbar1
foo/baz/bazbar2
and so on. Because the entries are adjacent in the tree, to move the full directory (including subdirectories), you only have to: (1) find the B-tree node that contains the first file in the directory, (2) find the B-tree node that contains the last file in the directory, (3) move the entries out of the first and last block to their new location and (4) rewrite the blocks in between the first and the last block, without actually copying data. You could store path names relative to some (suitably chosen) parent B-tree node to compress the paths and avoid making changes to the paths during a move.

I did not look into this in detail though, so those are only some rough suggestions. It might make sense to look into btrfs that does store the whole FS in a single B-tree.

nyc · Post by **nyc** » Mon Jan 13, 2020 2:43 pm

Apart from the global B trees, it might make sense to use thresholds to switch between algorithms at different directory sizes. Hashing has some limits with searches on partial keys, though I think POSIX-like system call API's don't really do much with that. There are variations on tries, hash tries, which use open-addressed hash tables in lieu of arrays for mostly-empty nodes that are good for sparse key spaces I've heard of being used in external algorithms for databases, though one can use one level open-addressed hash tables in a pinch. I have rarely seen filesystems use hash tables or hash tries and am likely not experienced enough with filesystems to say why. I've only seen them used as external i.e. on-disk indexing structures in databases.

bzt · Post by **bzt** » Mon Jan 13, 2020 4:12 pm

nakst wrote:I was wondering whether it would make sense to use a single B-tree for the entire volume which contains every file and folder

The HFS+ did exactly that. It had 3 B-trees, one for the filenames, one for the extents (allocation table-like info), and one for the extended attributes. It was good for many years, but now Apple has obsoleted that in favor of APFS (not sure if that has a single B-tree catalog too).

So I guess you could do some performance tests on a HFS+ volume to see if a single B-tree worth it. Also, the source code of hfsplus is publicly available, so you can peek how to implement this efficiently.

Cheers,
bzt

OSDev.org

Filesystem design: B-trees

Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees

Re: Filesystem design: B-trees