v0.2.1 is almost finished. The main changes are:
- nros: Added a "hard" memory limit, which helps avoid huge memory usage by forcing the main task to stop once in a while.
It's not perfect and there is one case where the limit can still be exceeded, but it already helps a great deal.
Do note that the total memory usage of the FUSE driver depends on how many dentries the kernel decides to keep in cache.
- fuse: the driver is *almost* fully concurrent, though I will probably postpone full concurrency to v0.3. See below for an explanation.
- fuse: implement statfs() properly.
Also a few fixes:
- tool: fix the user-provided record size setting being ignored.
- nros: fix erroneous object bitmap index calculation.
Due to this bug, in combination with the previous tool bug, any filesystem with more than 512K objects is likely corrupt.
I failed to find this previously as the fuzzer, due to corpus length limits, wasn't able to create enough objects.
I've increased the limit accordingly and will keep increasing it over time.
However, I'm unsatisfied with the current design, and fixing it will require breaking my promise to guarantee backwards compatibility.
Vision
My vision for this filesystem is to have a relatively simple design that can reasonably be implemented from scratch by any single OS developer while providing some modern features such as compression, CoW ...
The on-disk format itself is indeed quite simple. There are few data structures and the structures themselves are compact. The specification is no more than 770 lines. I do not expect it to become much larger either.
The implementation, especially that of the object store, is becoming far more complex than I originally intended. While I've managed to make it *almost* work while still being concurrent, I believe it can be simplified and potentially made more performant by changing some of the data structures.
Reflection on the current design
To efficiently deal with compression & encryption the filesystem employs its own cache. Initially this was a simple mapping of LBA to "unpacked" records with a writeback buffer on top. Later this was replaced with a cache
that uses the "bubble up" approach.
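For reference, that original scheme was roughly the following (a minimal sketch of the idea with made-up names, not the actual nros code):
Code:
class RecordCache:
    """Maps LBA -> unpacked (decompressed, decrypted) record data."""
    def __init__(self, store):
        self.store = store   # backing object store / block device (hypothetical interface)
        self.cache = {}      # lba -> bytearray of unpacked record data
        self.dirty = set()   # lbas sitting in the writeback buffer

    def read(self, lba):
        if lba not in self.cache:
            self.cache[lba] = self.store.read_and_unpack(lba)
        return self.cache[lba]

    def write(self, lba, data):
        self.cache[lba] = bytearray(data)
        self.dirty.add(lba)  # packed & written out later

    def flush(self):
        for lba in self.dirty:
            self.store.pack_and_write(lba, self.cache[lba])
        self.dirty.clear()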
Making the bubble-up approach work proved difficult, as objects can be resized at any time, which may increase or decrease the depth of an object's tree. Shrinking was especially difficult, as data may need to be destroyed.
At first, I attempted to solve this by iterating the tree from top to bottom, but this conflicted with the bubble up mechanism and was abandoned.
Then I considered simply writing zeros to all leaves past the new length, but this is impractical for very large objects (multiple exabytes).
I finally ended up making a function `write_zeros`, which can walk up the tree when it finds a region that is entirely zeros (zero regions take up no disk space).
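To illustrate the idea, here is a simplified in-memory model in the same spirit as `write_zeros` (a hypothetical sketch, not the actual nros code, which walks up the tree instead): if the range to zero covers an entire subtree, the whole subtree is dropped rather than every leaf being touched, since an all-zero region costs no disk space.
Code:
def zero_range(node, node_size, start, end):
    """Zero bytes [start, end) in a subtree spanning node_size bytes.
    Returns the pruned subtree; None means "all zeros", which costs no space."""
    if node is None or (start <= 0 and end >= node_size):
        return None                          # entire subtree zeroed: drop it outright
    if isinstance(node, bytearray):          # leaf record
        lo, hi = max(start, 0), min(end, node_size)
        node[lo:hi] = bytes(hi - lo)
        return node
    child_size = node_size // len(node)      # parent record: visit only touched children
    for i, child in enumerate(node):
        lo, hi = start - i * child_size, end - i * child_size
        if lo < child_size and hi > 0:
            node[i] = zero_range(child, child_size, lo, hi)
    return node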
Even with `write_zeros` the extra records are present in the cache until they are evicted. Eviction of dirty records is an expensive operation, so such records are moved to a "pseudo object" and evicted in the background.
However, a task may already be working with a record that has been moved to a pseudo object. To solve this, each entry has a pointer to a "key" that is updated whenever the entry is moved. Tasks can take a reference to this key to keep track of the entry's real location. While this works well, it is difficult to reason about, and that difficulty has manifested as various concurrency bugs where a task did not properly track an entry's location.
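The indirection looks roughly like this (a hypothetical sketch, not the actual nros types):
Code:
class EntryKey:
    """Shared, mutable location of a cache entry; updated whenever the entry moves."""
    def __init__(self, object_id, offset):
        self.object_id, self.offset = object_id, offset

class EntryMap:
    def __init__(self):
        self.entries = {}  # (object_id, offset) -> (data, EntryKey)

    def insert(self, object_id, offset, data):
        key = EntryKey(object_id, offset)
        self.entries[(object_id, offset)] = (data, key)
        return key         # tasks hold on to the key, not the raw location

    def move_to_pseudo_object(self, key, pseudo_id, slot):
        data, _ = self.entries.pop((key.object_id, key.offset))
        key.object_id, key.offset = pseudo_id, slot   # every holder of the key sees the move
        self.entries[(key.object_id, key.offset)] = (data, key)

    def get(self, key):
        return self.entries[(key.object_id, key.offset)][0]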
Recently, in preparation for proper concurrency, I've added various locks to prevent e.g. concurrent resizes, writes during a transaction commit ...
The one thing I forgot to account for in these locks was the tracking of moved entries. Worse, due to entry moves, a single lock may need to somehow lock multiple objects. At this point I became convinced the current design is too complex.
Redesigning the object store
After some consideration, I decided to make three major changes:
1. Reduce LBA size to 39 bits
This reduces the size of a reference (more in 2.), which allows packing more references in a single object or parent record, in turn reducing tree depth and hence latency.
I settled on 39 bits since:
- 32 bits is too little IMO. With 32 bits and 4K blocks you can address up to 16TiB of data. There are already disks that are larger than that, not to mention many setups feature multiple disks.
39 bits with 4K blocks allows addressing up to 2PiB of data; I'm confident that by the time anyone reaches that, they will be using either multiple filesystems or some "enterprise" filesystem.
- 39 bits leaves 25 bits for the record length, which allows records up to 16MiB (hey, I get to fix another design mistake: you cannot store the number 2^24 in just 24 bits).
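Once the hash and such move out of the reference (see 2. below), the LBA and length together fit in a single 64-bit word. Packing could look like this (a sketch; the field order and exact encoding are my assumptions):
Code:
LBA_BITS, LEN_BITS = 39, 25             # 39 + 25 = 64 bits per reference

def pack_reference(lba, length):
    assert 0 <= lba < 1 << LBA_BITS     # block address: up to 2 PiB with 4K blocks
    assert 0 <= length <= 1 << 24       # record length in bytes: up to 16MiB, and 2**24 fits in 25 bits
    return lba | (length << LBA_BITS)

def unpack_reference(ref):
    return ref & ((1 << LBA_BITS) - 1), ref >> LBA_BITS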
2. Add headers to records
The header can contain the hash, the compression algorithm used ... which then do not need to be stored in the reference. This allows shrinking the reference further to just 64 bits.
This will lead to some slack for large files with incompressible data, though I don't believe this is a major issue: with the default record size of 128K and 4K blocks, up to 3.125% of space may be wasted due to this scheme (one extra 4K block for every 32 blocks of data). In practice, I believe this is negligible compared to other sources of slack or other kinds of space waste.
Incidentally, this makes the old caching scheme (with the simple LBA mapping) practical again. Much time has been wasted and many lessons have been learned *sigh*.
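As a rough sketch of what such a header might look like (only the hash and the compression algorithm come from the description above; the field sizes, extra fields and ordering are made up for illustration):
Code:
import struct

# Hypothetical record header layout: 16-byte hash, compression id, flags,
# reserved padding, unpacked length. None of this is the actual nros format.
RECORD_HEADER = struct.Struct("<16s B B H I")

def pack_record_header(hash16, compression_id, unpacked_len):
    return RECORD_HEADER.pack(hash16, compression_id, 0, 0, unpacked_len)

def unpack_record_header(buf):
    hash16, compression_id, _flags, _reserved, unpacked_len = RECORD_HEADER.unpack_from(buf)
    return hash16, compression_id, unpacked_len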
3. Multiple trees per object
This is the biggest change and is what motivated the previous changes. Having multiple trees per object and giving each a different depth avoids the need to be able to grow or shrink trees, avoiding all the complexity described above.
Originally I was going to have 3 trees per object with depths of 0, 1 and 3 respectively. This would have worked out to 128 bytes per object and (IMO) sufficiently large maximum file sizes.
However, I realized I could keep the size of each object at 64 bytes and have 4 trees per object by reducing the size of each reference to 8 bytes.
This does impose smaller size limits on objects though. The maximum size of each object can be calculated with
Code:
def max_size_of_tree(record_size_p2, depth):
    return 2 ** (record_size_p2 + (record_size_p2 - 3) * (depth - 1))
def max_size_of_object(record_size_p2):
    return sum(max_size_of_tree(record_size_p2, d) for d in range(1, 5))
which yields the following limits (for a few possible record sizes):
- Record size of 2**9: ~130MiB
- Record size of 2**12: ~513GiB
- Record size of 2**14: ~128TiB
- Record size of 2**17 (default): ~512PiB
- Record size of 2**19: ~128EiB, in practice 16EiB due to the range of a 64 bit integer.
I believe these limits are satisfactory, especially since very small record sizes like 512 bytes should be uncommon as they are useless in practice.
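To make the scheme a bit more concrete, a byte offset within an object could be mapped onto one of the four fixed-depth trees like this, reusing `max_size_of_tree` from above (a sketch; how nros actually indexes the trees may differ):
Code:
def locate(record_size_p2, offset):
    """Return (tree_index, offset_within_tree) for a byte offset in an object."""
    for depth in range(1, 5):                  # the four trees have fixed depths 1..4
        size = max_size_of_tree(record_size_p2, depth)
        if offset < size:
            return depth - 1, offset
        offset -= size                         # skip past the smaller trees
    raise ValueError("offset exceeds the maximum object size")

With the default record size (2**17) the depth-1 tree covers the first 128KiB and the depth-2 tree the next 2GiB, so small objects never need the deeper trees.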
Redesigning the directory structure
While this doesn't strictly need an update and works fine as is, I believe it could be better.
In particular, I do not like the fact that each directory consists of 3 separate objects (entry list, hash map and heap). I think putting everything in a single object would be cleaner and simpler.
It also currently has a requirement that the objects' IDs are consecutive. This was originally intended to allow fetching the three objects simultaneously, though in hindsight this is a useless micro-optimization.
---
Comments are welcome. I'd like to hear what others think, since it helps me find more mistakes I may have made.