Need advice for designing a simple file system

Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Need advice for designing a simple file system

Post by Brendan »

pauldinhqd wrote:
Brendan wrote:If you want to develop a new high performance file system; I'd start by splitting it in half - a lower level part and a higher level part.

The lower level part would mainly manage "linked lists of extents" - allocating new extents, freeing them, resizing them, reading them and writing them. Naturally it'd also manage free space, handle things like bad sectors, etc; and it'd probably be the best place to do things like "sparse lists of extents" (used for sparse files support), encryption, compression, etc.

The higher level part relies on the lower level part. A file's data is just a linked list of extents. Directories are linked lists of extents that contain directory entries instead of file data. More advanced things like snapshots and forks are just more linked lists of extents.

If you split things like this you should be able to test the lower level part on its own; and could possibly have several "lower level parts" and several "higher level parts" and combine them in various ways. For example, perhaps one "lower level part" that tries harder to avoid moving extents for SSD and USB flash, and another "lower level part" that tries harder to avoid fragmented extents for hard drives; and perhaps one "higher level part" designed to be efficient for lots of small files and one designed to be efficient for lots of large files (plus maybe a stable version of each and a beta/development version of each).
These are great points; however, I'm still a bit confused about whether linked lists
can give good performance, because, say, when we need to find a file by its
name and path, it seems that we have to scan 'linearly' through certain linked lists,
and what if these linked lists contain lots of items :?:

Is it possible to allow an O(log N) file search on these linked lists :?:
What happens when we seek to a byte in a very large file :?:
The "linked lists of extents" (or "LLoE" to save me some typing) would be used for storing arbitrary things.

If it makes it easier, think of it like "malloc()". The "malloc()" function just gives you a pointer you can use to access "n bytes of storage in RAM". It doesn't care what you store in the allocated space, and you can use it to store hash tables, btrees, arrays, structures or anything else. You also don't need to care if the storage space is actually contiguous or not - it might be many separate physical pages that look contiguous due to paging. Of course "malloc()" is just part of a set of functions, and you can "free()", "realloc()", etc too.

In the same way, the lower level gives you an "LLoE ID" (instead of a pointer) that you can use to access "n bytes of storage on disk" (instead of in RAM). It doesn't care what you store in the allocated space, and you can use it to store hash tables, btrees, arrays, structures or anything else. You also don't need to care if the storage space is actually contiguous or not - it might be many separate physical sectors (or extents) that look like contiguous space because the lower level does some translation to make offsets within the LLoE look contiguous even if they actually aren't. Of course the lower level would let you allocate LLoEs, free them, resize them, etc.
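
To make the analogy concrete, here's a rough sketch of what the lower level's interface might look like in C. Everything below (the names, the types, the exact operations) is invented for illustration - one possible shape, not a specification:

Code: Select all

/* Hypothetical interface for the lower level ("linked lists of extents").
 * All names and signatures here are illustrative only. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t lloe_id_t;   /* handle to one "linked list of extents" */

/* Allocate a new LLoE able to hold "size" bytes; returns 0 on failure. */
lloe_id_t lloe_alloc(uint64_t size);

/* Free every extent belonging to the LLoE. */
int lloe_free(lloe_id_t id);

/* Grow or shrink an LLoE; the lower level may add, drop or move extents,
 * but byte offsets within the LLoE remain stable. */
int lloe_resize(lloe_id_t id, uint64_t new_size);

/* Read or write "len" bytes at a byte offset within the LLoE. The lower
 * level translates offsets to (extent, sector), so callers never see
 * whether the underlying storage is contiguous or fragmented. */
int lloe_read(lloe_id_t id, uint64_t offset, void *buf, size_t len);
int lloe_write(lloe_id_t id, uint64_t offset, const void *buf, size_t len);

The higher level would store file data, directory entries, snapshots, forks, etc. in these, without ever caring how the extents are laid out.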


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
pauldinhqd
Member
Posts: 37
Joined: Tue Jul 12, 2011 9:14 am
Location: Hanoi

Re: Need advice for designing a simple file system

Post by pauldinhqd »

I'm designing the FS like this:
- Divide the whole storage space into equal-sized blocks (aka clusters)
- Treat each file as a chain of blocks (a true 2-way linked list)
- The first block of a file's chain points to the first block of the next file
- The very first block in storage heads the free-space block chain

Other details are in the attached picture.
However, in this design, I still have problems with:
(1) Finding a file by its name and path
(2) Seeking to a byte in a file

The implementation uses linked lists, so scanning through each list
from the first block to the last block in a file's chain is a pain :?
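
A minimal sketch of the on-disk block this design seems to imply (field names and widths are guesses from the description, not taken from the attachment):

Code: Select all

/* One fixed-size block in the chain design described above.
 * Field names and widths are guesses for illustration only. */
#include <stdint.h>

#define BLOCK_SIZE 4096u

struct block {
    uint64_t prev;       /* previous block in this file's chain, 0 if none */
    uint64_t next;       /* next block in this file's chain, 0 if none     */
    uint64_t next_file;  /* in a file's first block only: first block of
                            the next file (0 otherwise)                    */
    uint8_t  data[BLOCK_SIZE - 3 * sizeof(uint64_t)];
};

Seeking to byte N then means walking about N / 4072 links, one disk read each, which is exactly problem (2).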
Attachments
renesta.png
AMD Sempron 140
nVidia GTS 450
Transcend DDR2 2x1
LG Flatron L1742SE
pauldinhqd
Member
Posts: 37
Joined: Tue Jul 12, 2011 9:14 am
Location: Hanoi

Re: Need advice for designing a simple file system

Post by pauldinhqd »

berkus wrote: Why not read the design documents of existing filesystems (simple ones would be ext2 and befs; complex but very flexible ones are reiser3/4, jfs, xfs, zfs and btrfs)? I mean read and understand them - it would make designing your own filesystem much easier, and you will see you cannot design anything better and will stick with one of the existing ones.
Yes, the right thing to do! :)
I found a book by Dominic Giampaolo called Practical File System Design
and I'm reading it:
http://www.nobius.org/~dbg/practical-fi ... design.pdf
AMD Sempron 140
nVidia GTS 450
Transcend DDR2 2x1
LG Flatron L1742SE
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Need advice for designing a simple file system

Post by Brendan »

Hi,
berkus wrote:Why not store file names in a patricia trie, or in a btree indexed by file uuids plus a separate hash for filenames (basically any kind of index with O(log n) or faster search times, preferably NOT using string compares)? For storing file blocks, extents could work quite well, especially if you do not let them fragment too much (i.e. for a typical file the storage would be described by a single record [start cluster, end cluster]).
I'd assume that the normal optimisations you'd do for "things in RAM" may not work well for "things on disk", where access time is far greater.

For example, in RAM I might be tempted to try a hash table - maybe a 256-entry array of pointers to the first entry, for (up to) 256 linked lists of entries. On disk, that might cause the drive's heads to bounce all over the place (fetch the 256-entry array, then fetch each sector touched while following the linked list of entries). An array of entries stored in contiguous sectors could be a lot faster - it would minimise the number of (relatively painful) seeks at the expense of (mostly irrelevant) CPU time used for searching.

I'd also assume it would depend a lot on what sort of caching is done at higher levels.

For example, if an application asks the VFS to open the file "/foo/bar/hello.txt"; then the VFS might ask the file system for the entire "foo" directory and cache that information in a patricia trie in RAM, then ask the file system for the entire "foo/bar" directory and cache that information in a patricia trie in RAM too. In this case, the file system itself never does any searching - the file system would only be fetching entire directories, and would therefore want to pack as many directory entries into the least number of (contiguous) sectors as it can.
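
As a sketch of that split (struct vfs_dir, struct trie, trie_find(), trie_insert() and fs_read_dir() are all hypothetical helpers - the point is only where the searching happens):

Code: Select all

/* The caching split described above: the file system only ever fetches
 * whole directories; all name searching happens in the VFS's in-RAM
 * structures. Every name here is hypothetical. */
struct vfs_dir;                                /* cached directory contents */
extern struct trie *vfs_dir_cache;             /* path -> vfs_dir, in RAM   */
struct vfs_dir *trie_find(struct trie *t, const char *key);
void trie_insert(struct trie *t, const char *key, struct vfs_dir *dir);
struct vfs_dir *fs_read_dir(const char *path); /* one bulk fetch from disk  */

struct vfs_dir *vfs_get_dir(const char *path)
{
    struct vfs_dir *dir = trie_find(vfs_dir_cache, path);
    if (dir != NULL)
        return dir;                 /* warm cache - no disk access at all */

    dir = fs_read_dir(path);        /* whole directory in one go, ideally
                                       packed into few contiguous sectors */
    trie_insert(vfs_dir_cache, path, dir);
    return dir;
}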

Also, if caching is done at higher levels (e.g. "write-through" caching for everything implemented in the VFS) then you can assume that the VFS would normally only ask the file system to fetch data that hasn't been fetched before. Basically, the file system has to be designed for "cold cache"; as it'd be wrong to assume that anything it needs will be cached anywhere (including "cached by the lower level storage device driver"). In a similar way, it'd be pointless for the file system itself to cache most things (directory entries, file data) as the VFS would cache it anyway and the VFS won't ask for anything that the file system has cached.

As an alternative example, if the higher levels don't cache anything at all then you have the exact opposite situation - you want to cache as much as you can at the file system level, and you'd want to optimise the file system design for searches.

I guess what I'm saying is that maybe it'd be best to design the VFS and/or higher levels first, so that you can know what you're optimising the file system for.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: Need advice for designing a simple file system

Post by Yoda »

One? Saving space?? You are joking, probably :D.
The primary target was the performance of searching for an entry by name. When a directory (for example, a browser cache) contains thousands of files, searching for a file without a B-tree means comparing the given name with the directory entries one by one until you find it or conclude that it doesn't exist. In the latter case you need to check all of your thousands of file names before you can state "no such file". A B-tree lets you reach either conclusion in O(log N) time instead of O(N) - for N = 1000 entries, that's on the order of ten comparisons instead of up to a thousand.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Need advice for designing a simple file system

Post by Brendan »

Hi,
Yoda wrote:One? Saving space?? You are joking, probably :D.
The primary target was the performance of searching for an entry by name. When a directory (for example, a browser cache) contains thousands of files, searching for a file without a B-tree means comparing the given name with the directory entries one by one until you find it or conclude that it doesn't exist. In the latter case you need to check all of your thousands of file names before you can state "no such file". A B-tree lets you reach either conclusion in O(log N) time instead of O(N) - for N = 1000 entries, that's on the order of ten comparisons instead of up to a thousand.
Let's make up some numbers. Let's say that seeking to a different sector costs an average of 10 ms, reading one sector costs 1 ms, and doing one "strcmp()" costs 1 us. Let's also say that on average a file's name is 10 bytes and for each directory entry there's another 6 bytes of data (file permissions, some sort of "ID of first extent").

From this we can calculate that a linear search of 1000 directory entries stored as contiguous sectors would cost 10 ms for one seek, 32 ms to read 32 sectors (16000 bytes), then 1 ms for doing "strcmp()" 1000 times. Total cost is 43 ms.

For a B-tree, if each node has an average of 4 child nodes, to search 1000 entries we'd need to look at 4 parent nodes and a child node. That's 50 ms for seeks and 5 ms for reading sectors. That's slower than a simple linear search.

Of course this assumes there's no caching. If you add caching to this, for the linear search you'd read all 1000 entries and (to cache them) you might store them in a B-tree in RAM. Because touching anything causes the complete directory to be added to the cache, you never need to look at the disk again for that directory. If the directory is stored on disk as a B-tree, because you've only touched 4 parent nodes and one child node you've got almost none of the data into the cache for later searches and you have to do a lot more seeking and reading. Basically, for "linear" we're looking at maybe 45 ms to load everything into the cache and almost nothing after that, while for "B-tree on disk" we're maybe looking at 55 ms for the first access, 40 ms for the second access, 30 ms for the third access, etc. It's going to be excruciatingly slow (maybe 5 times slower than linear search).

To improve the remarkably bad performance of searching "B-tree on disk" (caused by excessive seeking, etc) you could read the entire B-tree into the cache the first time you search for anything in the directory. Because a B-tree leaves space unused for future insertions and has more overhead (e.g. references to child nodes, etc) you might have to read 50 sectors instead of 32; so instead of taking 10 ms for one seek and 32 ms for reading sectors it's going to cost 10 ms for one seek and 50 ms for reading sectors - that's still about 40% slower than "linear".
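
Those made-up numbers are easy to play with; the snippet below just restates the arithmetic above (change the constants to match your own drive):

Code: Select all

/* Back-of-envelope model of the costs above: seek = 10 ms, one sector
 * read = 1 ms, one strcmp() = 1 us; 1000 directory entries of 16 bytes
 * each, 512-byte sectors. */
#include <stdio.h>

int main(void)
{
    const double seek = 10.0, read = 1.0, cmp = 0.001;      /* all in ms */
    const int entries = 1000, entry_size = 16, sector = 512;

    int sectors = (entries * entry_size + sector - 1) / sector;   /* 32 */

    double linear          = seek + sectors * read + entries * cmp;
    double btree_scattered = 5 * (seek + read);  /* ~5 nodes, 5 seeks   */
    double btree_packed    = seek + 50 * read;   /* whole tree, 1 seek  */

    printf("linear, contiguous: %.0f ms\n", linear);           /* 43 ms */
    printf("B-tree, scattered:  %.0f ms\n", btree_scattered);  /* 55 ms */
    printf("B-tree, contiguous: %.0f ms\n", btree_packed);     /* 60 ms */
    return 0;
}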

Basically, "Big-O" notation lies - it only makes sense if the operations you're comparing cost the same, but operations almost never cost the same, and "O(n)" fast operations is better than "O(log n)" slow operations for reasonable values of 'n'.

The "B-tree on disk" approach only makes sense if the directory is idiotically massive (e.g. so big that it doesn't make sense to cache it in RAM) - e.g. many millions of files in the directory, not thousands. This is far from normal - no sane person is ever likely to have such a high number of files in a directory, and no sane person would expect decent performance if they did. The "B-tree on disk" approach is only good for irrelevant pathological cases.

I'll let you decide if I'm playing the Devil's advocate or not... ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
JackScott
Member
Posts: 1031
Joined: Thu Dec 21, 2006 3:03 am
Location: Hobart, Australia

Re: Need advice for designing a simple file system

Post by JackScott »

Brendan wrote:The "B-tree on disk" approach only makes sense if the directory is idiotically massive (e.g. so big that it doesn't make sense to cache it in RAM) - e.g. many millions of files in the directory, not thousands. This is far from normal - no sane person is ever likely to have such a high number of files in a directory, and no sane person would expect decent performance if they did. The "B-tree on disk" approach is only good for irrelevant pathological cases.
I can't remember why he said this, but one of the guys at Oracle who has been working on btrfs said to use XFS in this case (for storing lots of big data etc), and that btrfs was more useful in the root filesystem scenario.

Might be worth looking at his talk from LCA2012 (it's also on YouTube if you don't feel like downloading): http://mirror.internode.on.net/pub/linu ... _btrfs.ogv
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: Need advice for designing a simple file system

Post by Yoda »

Brendan wrote:I'll let you decide if I'm playing the Devil's advocate or not... ;)
Of course, you are :D.
First of all, I didn't say that the introduction of B-trees gave high performance in practice. I just noted the developers' primary intention.
Second, you made the assumption that B-tree records are scattered across the disk while linear ones are placed in adjacent records. But AFAIK most practical implementations of B-trees try to locate records close to each other.
Third, the relative performance of different filesystems may vary greatly over time, depending on the technology available. For example, RISC architectures outperform CISC at one scale of die complexity and clock speed but lose at others. The same goes for filesystems: seek time, rotation speed, data transfer speed, caching, and CPU and memory performance all contribute differently to FS performance, and one FS or another may win in different circumstances. For example, for SSDs and flash drives the seek time and rotational delay (two components of latency) are zero, and your calculations won't hold.
Fourth, the question of real performance can only be settled in practice by tests and benchmarks. But even then we should keep in mind the quality of the FS driver implementation and its primary intention - performance or reliability. As for me, I don't know for sure which approach is better but, IMHO, a well-implemented B-tree should win in most cases. I want to write an FS benchmark, but it seems that it is not an easy task and needs a series of additional investigations.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
turdus
Member
Posts: 496
Joined: Tue Feb 08, 2011 1:58 pm

Re: Need advice for designing a simple file system

Post by turdus »

Yoda wrote:I want to write an FS benchmark, but it seems that it is not an easy task and needs a series of additional investigations.
Why not use Bonnie++? http://www.coker.com.au/bonnie++/
Bonnie++ is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance.
...
The main program tests database type access to a single file (or a set of files if you wish to test more than 1G of storage), and it tests creation, reading, and deleting of small files which can simulate the usage of programs such as Squid, INN, or Maildir format email.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Need advice for designing a simple file system

Post by Brendan »

Hi,
Yoda wrote:
Brendan wrote:I'll let you decide if I'm playing the Devil's advocate or not... ;)
Of course, you are :D.
Am I?
Yoda wrote:First of all, I didn't say that the introduction of B-trees gave high performance in practice. I just noted the developers' primary intention.
Good intentions?
Yoda wrote:Second, you made the assumption that B-tree records are scattered across the disk while linear ones are placed in adjacent records. But AFAIK most practical implementations of B-trees try to locate records close to each other.
Initially (third and fourth paragraphs) I assumed the B-tree is scattered. However, my fifth paragraph focused on a B-tree stored entirely in contiguous sectors. Please note that this is the "best case": even if you only read some of the sectors (rather than all of them) you're looking at rotational latency - basically, all the sectors that pass under the drive's heads could have been read at almost no extra cost while you're waiting for the specific/desired sector to come around.
Yoda wrote:Third, the relative performance of different filesystems may vary greatly over time, depending on the technology available. For example, RISC architectures outperform CISC at one scale of die complexity and clock speed but lose at others. The same goes for filesystems: seek time, rotation speed, data transfer speed, caching, and CPU and memory performance all contribute differently to FS performance, and one FS or another may win in different circumstances. For example, for SSDs and flash drives the seek time and rotational delay (two components of latency) are zero, and your calculations won't hold.
We're primarily looking at hard disks, for the simple reason that they're ubiquitous. The underlying characteristics (e.g. expensive seek times compared to sequential reads) have remained since their introduction in the late 1950s.

For less common cases (e.g. tape drives, SSD and flash) I agree that there are different problems and different solutions, and that I've made no attempt to consider these differences (and will continue to make no attempt to consider them, as these less common cases are less common, and any discussion of them detracts from the point I'm trying to make).
Yoda wrote:Fourth, the question of real performance can only be settled in practice by tests and benchmarks. But even then we should keep in mind the quality of the FS driver implementation and its primary intention - performance or reliability. As for me, I don't know for sure which approach is better but, IMHO, a well-implemented B-tree should win in most cases. I want to write an FS benchmark, but it seems that it is not an easy task and needs a series of additional investigations.
The problem with file system benchmarks is that you'd need to compare apples with apples. You'd need to test file systems that are essentially identical except for the "packed/linear" vs. B-tree difference; you couldn't compare a file system with one set of features and one sector allocation strategy that happens to use "packed/linear" against another file system with a completely different set of features and a completely different allocation strategy that happens to use B-trees. Even if you can do "apples vs. apples" tests, the results are going to be heavily influenced by the design of any higher or lower levels (especially where caching is done); and they aren't going to be useful as a generalisation anyway.


For reference; I don't know if I'm playing the Devil's advocate or not. What I am doing is saying that assumptions (like "B-trees are good") are useless without knowing more about the likely access patterns, the likely file sizes, where caching happens, storage device characteristics, etc.

Basically:
Brendan wrote:I guess what I'm saying is that maybe it'd be best to design the VFS and/or higher levels first, so that you can know what you're optimising the file system for.
However, I do suspect that B-trees are nowhere near as good as people assume; and I also suspect that people are misled by "O(1) is better than O(log n), which is better than O(n)" over-simplifications.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: Need advice for designing a simple file system

Post by Yoda »

turdus wrote:Why not use Bonnie++?
Thank you for that link. I've noted it and will examine it thoroughly when my set of filesystem drivers is ready.
But I meant that true research for benchmarks should include, for example, the following considerations:
- What kind of access occurs with what frequency?
- What is the distribution of file sizes in most installations?
- What is the distribution of the number of files per directory, and what is the average depth of a directory tree in practice?
- and so on...
I mean that it would be a good idea to analyze a sample set of practical installations of different systems for different applications. But of course we may use fully synthetic (and relatively simple) tests as a first approach.
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
Yoda
Member
Posts: 255
Joined: Tue Mar 09, 2010 8:57 am
Location: Moscow, Russia

Re: Need advice for designing a simple file system

Post by Yoda »

Brendan wrote:For reference; I don't know if I'm playing the Devil's advocate or not. What I am doing is saying that assumptions (like "B-trees are good") are useless without knowing more about the likely access patterns, the likely file sizes, where caching happens, storage device characteristics, etc.
So do I. I just noted what the developers had in mind when they designed B-tree based filesystems, nothing more. And I think their assumptions had a transparent basis: 10 comparisons are better than 1024, and reading 10 sectors is better than reading 64. I also suspect that with a proper implementation a B-tree's performance may be no worse than linear. That's all!
Yet Other Developer of Architecture.
OS Boot Tools.
Russian national OSDev forum.
pauldinhqd
Member
Posts: 37
Joined: Tue Jul 12, 2011 9:14 am
Location: Hanoi

Re: Need advice for designing a simple file system

Post by pauldinhqd »

I've just finished reading Dominic's book about file systems,
and with all the discussions above, I believe I should use a linked list
for free space allocation, and a B-tree for directory structures.

Free space allocation with a linked list is faster than with a bitmap, I believe
(when I need a block for storing data, I just take one right off the head of the list -
see the sketch below). A linked list does mean fragmentation, but future storage seems
to be truly random-access instead of sequential CHS.
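
Something like this is what I have in mind for grabbing a free block (struct block, read_block() and write_block() are placeholders for the real layout and disk driver):

Code: Select all

/* Take a block from the head of the free chain: O(1), no bitmap scan.
 * The struct and driver calls below are placeholders. */
#include <stdint.h>

struct block { uint64_t prev, next, next_file; /* ...data... */ };

void read_block(uint64_t blkno, struct block *b);         /* disk driver */
void write_block(uint64_t blkno, const struct block *b);  /* disk driver */

uint64_t alloc_block(void)
{
    struct block head, grabbed;

    read_block(0, &head);           /* block 0 heads the free chain */
    uint64_t blkno = head.next;
    if (blkno == 0)
        return 0;                   /* no free blocks left          */

    read_block(blkno, &grabbed);
    head.next = grabbed.next;       /* unlink the first free block  */
    write_block(0, &head);
    return blkno;
}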

Directory structures with a B-tree allow a search time of O(n),
where n is the length of the full path to the file; while directory structures
with linked lists (without hashing) require a search time of O(n*m), where m is the
total number of files stored on the drive (the worst case being that all files are stored
under root and the file being searched for is the last one).

This big-O is based on the number of byte-to-byte comparisons.
Am I getting this right :?:
AMD Sempron 140
nVidia GTS 450
Transcend DDR2 2x1
LG Flatron L1742SE
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: Need advice for designing a simple file system

Post by bluemoon »

pauldinhqd wrote:A linked list does mean fragmentation, but future storage seems
to be truly random-access instead of sequential CHS.
Data is accessed randomly; however, the TOC or other metadata is usually fetched into the OS at once, or sequentially.
A directory's contents are an example: since each entry is only a few bytes, you may just read the whole directory into the system, or even group child directory entries together (like a hot zone), which may span a few sequential storage blocks. This may sound odd, but reading 16 sectors in one go may be just as fast as reading 1 sector, and even faster than reading 3 sectors in 3 goes (as in walking 3 nodes of a tree). You need to account for seek time and disk-IO preparation overhead. For SSDs the seek time is zero, but the IO preparation overhead still holds.
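
With the made-up costs from earlier in the thread (10 ms per seek, 1 ms per sector) the difference is easy to see:

Code: Select all

/* 16 sectors in one sequential go vs. 3 scattered tree-node reads,
 * using the thread's made-up costs (seek = 10 ms, sector = 1 ms). */
double one_big_read    = 10.0 + 16 * 1.0;    /* 26 ms */
double three_node_hops = 3 * (10.0 + 1.0);   /* 33 ms */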

Tip: re-read what Brendan said.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Need advice for designing a simple file system

Post by gerryg400 »

Directory structures with a B-tree allow a search time of O(n), where n is the length of the full path to the file
No. That implies the search time is independent of the number of entries in each directory. It's not: looking up each path component still costs O(log m) in a directory of m entries, so a k-component path costs O(k log m), not O(n) in the path length alone.
If a trainstation is where trains stop, what is a workstation ?