v0.2.1 is almost finished. The main changes are:
- nros: Added a "hard" memory limit, which helps avoid huge memory usage by forcing the main task to stop once in a while.
It's not perfect and there is one case where the limit can still be exceeded, but it already helps a great deal.
Do note that the total memory usage of the FUSE driver depends on how many dentries the kernel decides to keep in cache.
- fuse: the driver is *almost* fully concurrent, though I will probably postpone full concurrency to v0.3. See below for an explanation.
- fuse: implement statfs() properly.
Also a few fixes:
- tool: fix the user-provided record size setting being ignored.
- nros: fix erroneous object bitmap index calculation.
Due to this bug, in combination with the previous tool bug, any filesystem with more than 512K objects is likely corrupt.
I failed to find this previously as the fuzzer, due to corpus length limits, wasn't able to create enough objects.
I've increased the limit accordingly and will keep increasing it over time.
However, I'm unsatisfied with the current design, and fixing it will require breaking my promise to guarantee backwards compatibility.
Vision
My vision for this filesystem is to have a relatively simple design that can reasonably be implemented from scratch by any single OS developer while providing some modern features such as compression, CoW ...
The on-disk format itself is indeed quite simple. There are few data structures and the structures themselves are compact. The specification is no more than 770 lines. I do not expect it to become much larger either.
The implementation, especially that of the object store, is becoming far more complex than I originally intended. While I've managed to make it *almost* work while still being concurrent, I believe it can be simplified and potentially made more performant by changing some of the data structures.
Reflection on the current design
To efficiently deal with compression & encryption the filesystem employs its own cache. Initially this was a simple mapping of LBA to "unpacked" records with a writeback buffer on top. Later this was replaced with a cache
that uses the "bubble up" approach.
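For reference, that original scheme was roughly the following (a minimal sketch of the idea with made-up names, not the actual nros code):
Code:
class RecordCache:
    """Maps LBA -> unpacked (decompressed, decrypted) record data."""
    def __init__(self, store):
        self.store = store   # backing object store / block device (hypothetical interface)
        self.cache = {}      # lba -> bytearray of unpacked record data
        self.dirty = set()   # lbas sitting in the writeback buffer

    def read(self, lba):
        if lba not in self.cache:
            self.cache[lba] = self.store.read_and_unpack(lba)
        return self.cache[lba]

    def write(self, lba, data):
        self.cache[lba] = bytearray(data)
        self.dirty.add(lba)  # packed & written out later

    def flush(self):
        for lba in self.dirty:
            self.store.pack_and_write(lba, self.cache[lba])
        self.dirty.clear()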
Making the bubble-up approach work proved difficult, as objects can be resized at any time, which may increase or decrease the depth of an object's tree. Shrinking was especially difficult, as data may need to be destroyed.
At first, I attempted to solve this by iterating the tree from top to bottom, but this conflicted with the bubble up mechanism and was abandoned.
Then I considered simply writing zeros to all leaves past the new length, but this is impractical for very large objects (multiple exabytes).
I finally ended up making a function `write_zeros`, which can walk up the tree when it finds a region that is entirely zeros (zero regions take up no disk space).
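To illustrate the idea, here is a simplified in-memory model in the same spirit as `write_zeros` (a hypothetical sketch, not the actual nros code, which walks up the tree instead): if the range to zero covers an entire subtree, the whole subtree is dropped rather than every leaf being touched, since an all-zero region costs no disk space.
Code:
def zero_range(node, node_size, start, end):
    """Zero bytes [start, end) in a subtree spanning node_size bytes.
    Returns the pruned subtree; None means "all zeros", which costs no space."""
    if node is None or (start <= 0 and end >= node_size):
        return None                          # entire subtree zeroed: drop it outright
    if isinstance(node, bytearray):          # leaf record
        lo, hi = max(start, 0), min(end, node_size)
        node[lo:hi] = bytes(hi - lo)
        return node
    child_size = node_size // len(node)      # parent record: visit only touched children
    for i, child in enumerate(node):
        lo, hi = start - i * child_size, end - i * child_size
        if lo < child_size and hi > 0:
            node[i] = zero_range(child, child_size, lo, hi)
    return node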
Even with `write_zeros` the extra records are present in the cache until they are evicted. Eviction of dirty records is an expensive operation, so such records are moved to a "pseudo object" and evicted in the background.
However, a task may already be working with a record that has been moved to a pseudo object. To solve this, each entry has a pointer to a "key" that is updated whenever the entry is moved. Tasks can take a reference to this key to keep track of the entry's real location. While this works well, it is difficult to reason about, and that difficulty has manifested as various concurrency bugs where a task did not properly track an entry's location.
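The indirection looks roughly like this (a hypothetical sketch, not the actual nros types):
Code:
class EntryKey:
    """Shared, mutable location of a cache entry; updated whenever the entry moves."""
    def __init__(self, object_id, offset):
        self.object_id, self.offset = object_id, offset

class EntryMap:
    def __init__(self):
        self.entries = {}  # (object_id, offset) -> (data, EntryKey)

    def insert(self, object_id, offset, data):
        key = EntryKey(object_id, offset)
        self.entries[(object_id, offset)] = (data, key)
        return key         # tasks hold on to the key, not the raw location

    def move_to_pseudo_object(self, key, pseudo_id, slot):
        data, _ = self.entries.pop((key.object_id, key.offset))
        key.object_id, key.offset = pseudo_id, slot   # every holder of the key sees the move
        self.entries[(key.object_id, key.offset)] = (data, key)

    def get(self, key):
        return self.entries[(key.object_id, key.offset)][0]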
Recently, in preparation for proper concurrency, I've added various locks to prevent e.g. concurrent resizes, writes during a transaction commit ...
The one thing I forgot to account for in these locks was the tracking of moved entries. Worse, due to entry moves, a single lock may need to somehow lock multiple objects. At this point I became convinced the current design is too complex.
Redesigning the object store
After some consideration, I decided to make three major changes:
1. Reduce LBA size to 39 bits
This reduces the size of a reference (more in 2.), which allows packing more references in a single object or parent record, in turn reducing tree depth and hence latency.
I settled on 39 bits since:
- 32 bits is too little IMO. With 32 bits and 4K blocks you can address up to 16TiB of data. There are already disks that are larger than that, not to mention many setups feature multiple disks.
39 bits with 4K blocks allows addressing up to 2PiB of data; I'm confident that by the time anyone reaches that, they will be using either multiple filesystems or some "enterprise" filesystem.
- 39 bits leaves 25 bits for the record length, which allows records up to 16MiB (hey, I get to fix another design mistake: you cannot store the number 2^24 in just 24 bits).
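Once the hash and such move out of the reference (see 2. below), the LBA and length together fit in a single 64-bit word. Packing could look like this (a sketch; the field order and exact encoding are my assumptions):
Code:
LBA_BITS, LEN_BITS = 39, 25             # 39 + 25 = 64 bits per reference

def pack_reference(lba, length):
    assert 0 <= lba < 1 << LBA_BITS     # block address: up to 2 PiB with 4K blocks
    assert 0 <= length <= 1 << 24       # record length in bytes: up to 16MiB, and 2**24 fits in 25 bits
    return lba | (length << LBA_BITS)

def unpack_reference(ref):
    return ref & ((1 << LBA_BITS) - 1), ref >> LBA_BITS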
2. Add headers to records
The header can contain the hash, the compression algorithm used ... which then do not need to be stored in the reference. This allows shrinking the reference further to just 64 bits.
This will lead to some slack for large files with incompressible data, though I don't believe this is a major issue: with the default record size of 128K and 4K blocks, up to 3.125% of space may be wasted due to this scheme (one extra 4K block for every 32 blocks of data). In practice, I believe this is negligible compared to other sources of slack or other kinds of space waste.
Incidentally, this makes the old caching scheme (with the simple LBA mapping) practical again. Much time has been wasted and many lessons have been learned *sigh*.
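As a rough sketch of what such a header might look like (only the hash and the compression algorithm come from the description above; the field sizes, extra fields and ordering are made up for illustration):
Code:
import struct

# Hypothetical record header layout: 16-byte hash, compression id, flags,
# reserved padding, unpacked length. None of this is the actual nros format.
RECORD_HEADER = struct.Struct("<16s B B H I")

def pack_record_header(hash16, compression_id, unpacked_len):
    return RECORD_HEADER.pack(hash16, compression_id, 0, 0, unpacked_len)

def unpack_record_header(buf):
    hash16, compression_id, _flags, _reserved, unpacked_len = RECORD_HEADER.unpack_from(buf)
    return hash16, compression_id, unpacked_len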
3. Multiple trees per object
This is the biggest change and is what motivated the previous changes. Having multiple trees per object and giving each a different depth avoids the need to be able to grow or shrink trees, avoiding all the complexity described above.
Originally I was going to have 3 trees per object with depths of 0, 1 and 3 respectively. This would have worked out to 128 bytes per object and (IMO) sufficiently large maximum file sizes.
However, I realized I could keep the size of each object at 64 bytes and have 4 trees per object by reducing the size of each reference to 8 bytes.
This does impose smaller size limits on objects though. The maximum size of each object can be calculated with
Code:
def max_size_of_tree(record_size_p2, depth):
    return 2 ** (record_size_p2 + (record_size_p2 - 3) * (depth - 1))
def max_size_of_object(record_size_p2):
    return sum(max_size_of_tree(record_size_p2, d) for d in range(1, 5))
which yields the following limits (for a few possible record sizes):
- Record size of 2**9: ~130MiB
- Record size of 2**12: ~513GiB
- Record size of 2**14: ~128TiB
- Record size of 2**17 (default): ~512PiB
- Record size of 2**19: ~128EiB, in practice 16EiB due to the range of a 64 bit integer.
I believe these limits are satisfactory, especially since very small record sizes like 512 bytes should be uncommon as they are useless in practice.
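To make the scheme a bit more concrete, a byte offset within an object could be mapped onto one of the four fixed-depth trees like this, reusing `max_size_of_tree` from above (a sketch; how nros actually indexes the trees may differ):
Code:
def locate(record_size_p2, offset):
    """Return (tree_index, offset_within_tree) for a byte offset in an object."""
    for depth in range(1, 5):                  # the four trees have fixed depths 1..4
        size = max_size_of_tree(record_size_p2, depth)
        if offset < size:
            return depth - 1, offset
        offset -= size                         # skip past the smaller trees
    raise ValueError("offset exceeds the maximum object size")

With the default record size (2**17) the depth-1 tree covers the first 128KiB and the depth-2 tree the next 2GiB, so small objects never need the deeper trees.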
Redesigning the directory structure
While this doesn't strictly need an update and works fine as is, I believe it could be better.
In particular, I do not like the fact that each directory consists of 3 separate objects (entry list, hash map and heap). I think putting everything in a single object would be cleaner and simpler.
It also currently has a requirement that the objects' IDs are consecutive. This was originally intended to allow fetching the three objects simultaneously, though in hindsight this is a useless micro-optimization.
---
Comments are welcome. I'd like to hear what others think, since it helps me find more mistakes I may have made.