I am a self-taught CS student and i was reading the book : The DESIGN OF THE UNIX OPERATING SYSTEM. In chapter 4, the author writes about the internal representation of a file system. In that version, the directory structure is just unordered array of { filename, inode} structs. This implies that insertion time complexity will be O(1) but search time complexity is O(N) (Linear search). In order to improve the search time complexity, the author suggests using K-Ary trees or Hashtables (Actually, it is not a suggestion but it is a question to the reader to design such a system). K-Ary trees will make a O(logk(N)) complexity, whereas Hashtables will have a best case of O(1) and an average case of O(K) (Amortized linear search).
I was thinking how to design the Hashtable and the only idea that comes to my mind is using a linear probing scheme. Each directory block will be organized as a Hashtable of struct {filename, inode}. In case of a collision, a linear search will be tempted until we find a free slot otherwise the block is full and a new block is allocated to the directory and that entry will be inserted in the Hashtable of that new block.
For the design with N-Ary trees, i really don't have a clear idea how to do that. The reason is because we have no order between the nodes (Order of insertion). If it is a binary search tree or a K-Ary heap, then we could just implement it as an array but there will be a cost associated with every insertion and deletion (That looks like a lot of overhead) and in the case we want to keep the tree balanced that's really another level of overhead. I hear things about using B-trees in file systems (Btrfs) but i don't know if they are used in this context.
What do you propose as a design idea using those different data structures ? And are there any possible flaws with the ideas i mentioned above ? Thanks.
NB: i quote the question from the previously mentioned book : "11 . Design a directory structure that improves the efficiency of searching for path names by avoiding the linear search. Consider two techniques: hashing and n -ary trees."
Designing a directory structure using n-ary trees|hashtables
Re: Designing a directory structure using n-ary trees|hashta
Hi,
From my perspective; the VFS layer should be used as a cache, to cache both directory information and file data; and when there's a "VFS cache miss" on a directory the VFS would ask the file system for the entire directory's contents (and not just one thing in that directory). For example, if someone opens the file "/foo/bar/baz.txt" but the directory "/foo/bar" isn't cached in VFS, then VFS would ask the file system to fetch the entire "/foo/bar" directory (and not just the entry for "/foo/bar/baz.txt" alone). This is partly because it simplifies the VFS code (either all directory entries are cached or no directory entries are cached), partly because it's likely that something else in the same directory will be needed soon anyway, and partly because fetching the entire directory is likely to take a similar amount of time as fetching one entry in the directory (because the "on disk" structures used by a lot of file systems put all directory entries together in the same sectors of the disk). Of course the VFS would also cache a few "file system specific location values" for each file and directory on behalf of the file system (e.g. where one of these values might be a "starting cluster number" for FAT, or an "starting LBA of the file's data" for ISO9660, or "LBA for the list of extents" for ext2, or ... ; and where the other might be where the directory information is stored). That way when a file's data isn't cached in VFS but the file's directory info is; when the VFS asks the file system to read or write from the file the VFS can tell the corresponding file system its own "file system specific location values", and the file system code doesn't need to find or cache any information about where a file is itself.
This also means that the data structures used by VFS and the data structures used by file systems have extremely different usage. Specifically, (except for finding free space) the file system's code code will never need to search for or find anything itself, and the file system's "on disk" data structures can be designed knowing that the file system's code code will never need to search for anything - e.g. storing directory entries as a linear array of entries, or a linked list of entries, or (my preference) a linked list of arrays of entries.
For a full example, assuming nothing is cached by VFS; if a process asks to open the file "/foo/bar/baz.txt", the VFS would:
Cheers,
Brendan
I'm not sure if you're talking about the internal representation (in RAM) used by the VFS, or the internal representation (on disk) used by a specific file system.afr0ck wrote:I am a self-taught CS student and i was reading the book : The DESIGN OF THE UNIX OPERATING SYSTEM. In chapter 4, the author writes about the internal representation of a file system. In that version, the directory structure is just unordered array of { filename, inode} structs. This implies that insertion time complexity will be O(1) but search time complexity is O(N) (Linear search). In order to improve the search time complexity, the author suggests using K-Ary trees or Hashtables (Actually, it is not a suggestion but it is a question to the reader to design such a system). K-Ary trees will make a O(logk(N)) complexity, whereas Hashtables will have a best case of O(1) and an average case of O(K) (Amortized linear search).
From my perspective; the VFS layer should be used as a cache, to cache both directory information and file data; and when there's a "VFS cache miss" on a directory the VFS would ask the file system for the entire directory's contents (and not just one thing in that directory). For example, if someone opens the file "/foo/bar/baz.txt" but the directory "/foo/bar" isn't cached in VFS, then VFS would ask the file system to fetch the entire "/foo/bar" directory (and not just the entry for "/foo/bar/baz.txt" alone). This is partly because it simplifies the VFS code (either all directory entries are cached or no directory entries are cached), partly because it's likely that something else in the same directory will be needed soon anyway, and partly because fetching the entire directory is likely to take a similar amount of time as fetching one entry in the directory (because the "on disk" structures used by a lot of file systems put all directory entries together in the same sectors of the disk). Of course the VFS would also cache a few "file system specific location values" for each file and directory on behalf of the file system (e.g. where one of these values might be a "starting cluster number" for FAT, or an "starting LBA of the file's data" for ISO9660, or "LBA for the list of extents" for ext2, or ... ; and where the other might be where the directory information is stored). That way when a file's data isn't cached in VFS but the file's directory info is; when the VFS asks the file system to read or write from the file the VFS can tell the corresponding file system its own "file system specific location values", and the file system code doesn't need to find or cache any information about where a file is itself.
This also means that the data structures used by VFS and the data structures used by file systems have extremely different usage. Specifically, (except for finding free space) the file system's code code will never need to search for or find anything itself, and the file system's "on disk" data structures can be designed knowing that the file system's code code will never need to search for anything - e.g. storing directory entries as a linear array of entries, or a linked list of entries, or (my preference) a linked list of arrays of entries.
For "on disk data structures", to me this sounds horrible (especially for "rotating disk" hard drives where seeks are very expensive). If you're fetching the entire directory from disk you don't want the disk heads to bounce all over the place (as you seek to each different hashtable entry and follow the linked list of entries with the same hash). Of course a file system designed for "rotating disk" (where the main consideration is minimising seeks) is very different to a file system designed for SSD/NVRAM (where seeks are irrelevant).afr0ck wrote:I was thinking how to design the Hashtable and the only idea that comes to my mind is using a linear probing scheme. Each directory block will be organized as a Hashtable of struct {filename, inode}. In case of a collision, a linear search will be tempted until we find a free slot otherwise the block is full and a new block is allocated to the directory and that entry will be inserted in the Hashtable of that new block.
"No order" just means that you're free to choose whatever order you feel like (e.g. maybe alphabetical order of file name, maybe numerical order of creation time, etc), and as soon as you make a choice it's ordered.afr0ck wrote:For the design with N-Ary trees, i really don't have a clear idea how to do that. The reason is because we have no order between the nodes (Order of insertion).
For VFS (where you're caching entries in RAM, don't care about seeks, and do care about things like CPU data cache misses and TLB misses); I'd optimise data structures for the "everything cached" (because a VFS cache miss will involve actual IO and will be slow regardless of how the VFS's data structures are optimised). I'd probably start with a single large (about half the size of L3 cache, or a few MiB) hash table based on the full path not the file name (e.g. "hash = getHash("foo/bar/baz.txt")" and not "hash = getHash("baz.txt")"); so that (when there's no collision, and if the entry is in the VFS's cache) you can go directly to the entry with the least CPU data cache misses and TLB misses. Directories (which would only be accessed for things like "ls", and wouldn't be used for things like "stat" or "open" because you can go directly to the right entry in those cases) would probably just be a simple array of pointers (because you'd never need to search them).afr0ck wrote:If it is a binary search tree or a K-Ary heap, then we could just implement it as an array but there will be a cost associated with every insertion and deletion (That looks like a lot of overhead) and in the case we want to keep the tree balanced that's really another level of overhead. I hear things about using B-trees in file systems (Btrfs) but i don't know if they are used in this context.
What do you propose as a design idea using those different data structures ? And are there any possible flaws with the ideas i mentioned above ? Thanks.
NB: i quote the question from the previously mentioned book : "11 . Design a directory structure that improves the efficiency of searching for path names by avoiding the linear search. Consider two techniques: hashing and n -ary trees."
For a full example, assuming nothing is cached by VFS; if a process asks to open the file "/foo/bar/baz.txt", the VFS would:
- Use the single large hash table to try to find the entry for "/foo/bar/baz.txt", and realise it's not in the cache
- Use the single large hash table to try to find the entry for "/foo/bar", and realise it's not in the cache
- Use the single large hash table to try to find the entry for "/foo", and realise it's not in the cache
- Ask the file system to fetch the entire "/foo" directory and add all that to the cache (while remembering where the entry for "/foo/bar" is when you add that to the cache)
- Ask the file system to fetch the entire "/foo/bar" directory and add all that to the cache (while remembering where the entry for "/foo/bar/baz.txt" is when you add that to the cache)
- Once the entry for "/foo/bar/baz.txt" is found (or not found because it doesn't exist) do whatever you need to do to open the file (create a file handle structure), possibly including telling the file system code to create the file if the file doesn't already exist.
- Use the single large hash table to try to find the entry for "/foo/bar/baz.txt"
- Once the entry for "/foo/bar/baz.txt" is found (or not found because it doesn't exist) do whatever you need to do to open the file (create a file handle structure), possibly including telling the file system code to create the file if the file doesn't already exist.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Designing a directory structure using n-ary trees|hashta
Hi again,
Sorry for being too late, i was a little bit busy & i didn't want to just give a quick answer without a reasonable analysis of the suggestions.
However, I've been thinking about using BTrees & they seem to give a great result because BTrees never change the position of an inserted element, all it does is creating new nodes if there's no space in the parent. And guess what ! The BTree doesn't consume any additional space if implemented as an array so only offsets will be needed to travel through the nodes and a bunch of #defines will do a great work (I guess the code of a BTree implementation will be so fast and tight, cache friendly and will have very minimal overhead).
Sorry for being too late, i was a little bit busy & i didn't want to just give a quick answer without a reasonable analysis of the suggestions.
I do understand that the VFS is an interface that On-Disk filesystems should respect by implementing a set of operations required by the OS and the operations return data structures that the OS can understand (For example, the OS can ask the fs to read an inode & return an array of `struct inode` regardless of the underlying organization implemented by that particular fs). However, in my discussion, it is assumed that there's no distinction between the two layers (You can say it is a kind of a direct mapping where the underlying fs has the same organization as the VFS & the VFS just serves as a cache & not an interface).Brendan wrote:I'm not sure if you're talking about the internal representation (in RAM) used by the VFS, or the internal representation (on disk) used by a specific file system.
Actually, the hashtable doesn't have any linked lists, it's implemented as an array, the only overhead is the hash function. Beside that, the hashtable is the internal organization of the directory's data block. That means it doesn't index the data blocks themselves but only the entries inside the data block. Once you grab the inode of the corresponding directory, you will have to loop through all the data blocks (Using the block array inside the inode) but once you are inside the data block (that is already cached in the block cache), you will have to use a hash function to find the entry of the file your looking for (Which looks faster that linear search).Brendan wrote:For "on disk data structures", to me this sounds horrible (especially for "rotating disk" hard drives where seeks are very expensive). If you're fetching the entire directory from disk you don't want the disk heads to bounce all over the place (as you seek to each different hashtable entry and follow the linked list of entries with the same hash). Of course a file system designed for "rotating disk" (where the main consideration is minimising seeks) is very different to a file system designed for SSD/NVRAM (where seeks are irrelevant).
That's really a great problem. If we use an alphabetical order, it will be so hard to maintain the tree ordered & every insertion will cost us a O(N) time and sometimes you really need to do some huge work by swapping the entries that are adjacent until you reach the end of the tree. When you are searching you will need O(nLogn) because once you have determined that the entry you looking for is less then a certain parent, you will have to compare it with all it's children & you can't directly go the level below. Numerical order of creation time is not useful at all because you when you are searching for a file (Let's say "/foo/bar") you don't actually provide its creation time but its name & the OS has no idea how to get the creation time of a file by providing only it's name (That creation time will be used later to search inside the tree).Brendan wrote:"No order" just means that you're free to choose whatever order you feel like (e.g. maybe alphabetical order of file name, maybe numerical order of creation time, etc), and as soon as you make a choice it's ordered.
However, I've been thinking about using BTrees & they seem to give a great result because BTrees never change the position of an inserted element, all it does is creating new nodes if there's no space in the parent. And guess what ! The BTree doesn't consume any additional space if implemented as an array so only offsets will be needed to travel through the nodes and a bunch of #defines will do a great work (I guess the code of a BTree implementation will be so fast and tight, cache friendly and will have very minimal overhead).
It looks like a very interesting idea (Having a multiple level cache could make things faster) but i don't really find it very useful to cache the whole path name. Why not only the filename ? Because you are finally looking for the same file. I think you get identical results by just carefully coding the hashtable with enough space, a good collision resolution strategy and a fairly well distributed hash function.Brendan wrote: I'd probably start with a single large (about half the size of L3 cache, or a few MiB) hash table based on the full path not the file name (e.g. "hash = getHash("foo/bar/baz.txt")" and not "hash = getHash("baz.txt")"); so that (when there's no collision, and if the entry is in the VFS's cache) you can go directly to the entry with the least CPU data cache misses and TLB misses
Re: Designing a directory structure using n-ary trees|hashta
Hi,
If that's correct; then I don't see much problem, other than needing something to manage (allocate, free, resize, de-fragment) the area that contains directory information that doesn't break the "lists/arrays of entries that share the same hash value".
Note that "array" doesn't really mean anything on its own (array of what?). For a silly example, you could say that RAM is just an array of bits and therefore everything in RAM is stored in a single array. If you mean an array of "struct inode" then you'll be wasting space because each entry in an array has the same size - e.g. if maximum file name length is 256 bytes and average file name length is 16 bytes, then you'd be wasting an average of 240 bytes per file.
Note that for modern computers cache misses typically cost around 150 cycles each; and at the start of every system call you can assume that most/all of kernel's data was probably evicted from cache (replaced by user-space data). Halving the number of cache misses could improve performance by 50% (and TLB misses are similar/worse); especially for unpredictable access patterns (like hash table lookups and trees) where pre-fetching can't be used to mitigate the costs.
Cheers,
Brendan
So you'd have (e.g.) 5 MiB of data stored on disk (for a directory containing several thousand files); where that data might contain a 1 MiB hash table followed by 4 MiB of directory entries. When creating a new file (and checking that it doesn't already exist), you'd load the entire 5 MiB into RAM, calculate a hash value from the new file's name, and use the hash table (in the first 1 MiB) to find the start of a list of entries that had the same hash value (where this list of entries is stored as an array and not a linked list), then do a linear search of the array to see if the file name exists in the list of files that had the same hash value?afr0ck wrote:Actually, the hashtable doesn't have any linked lists, it's implemented as an array, the only overhead is the hash function. Beside that, the hashtable is the internal organization of the directory's data block. That means it doesn't index the data blocks themselves but only the entries inside the data block. Once you grab the inode of the corresponding directory, you will have to loop through all the data blocks (Using the block array inside the inode) but once you are inside the data block (that is already cached in the block cache), you will have to use a hash function to find the entry of the file your looking for (Which looks faster that linear search).Brendan wrote:For "on disk data structures", to me this sounds horrible (especially for "rotating disk" hard drives where seeks are very expensive). If you're fetching the entire directory from disk you don't want the disk heads to bounce all over the place (as you seek to each different hashtable entry and follow the linked list of entries with the same hash). Of course a file system designed for "rotating disk" (where the main consideration is minimising seeks) is very different to a file system designed for SSD/NVRAM (where seeks are irrelevant).
If that's correct; then I don't see much problem, other than needing something to manage (allocate, free, resize, de-fragment) the area that contains directory information that doesn't break the "lists/arrays of entries that share the same hash value".
I think most file systems that use trees use a single tree for the entire file system, with everything in "alphabetical" (numerical) order of (fully qualified) file name, so that "foo/bar/bar" and "foo/bar/baz" follow the same nodes from the root of the tree up until you get to the last letter. In other words, you naturally end up with a sub-tree of things starting with "foo/bar/" (which contains a sub-sub-tree of things that start with "foo/bar/b", which contains...).afr0ck wrote:That's really a great problem. If we use an alphabetical order, it will be so hard to maintain the tree ordered & every insertion will cost us a O(N) time and sometimes you really need to do some huge work by swapping the entries that are adjacent until you reach the end of the tree. When you are searching you will need O(nLogn) because once you have determined that the entry you looking for is less then a certain parent, you will have to compare it with all it's children & you can't directly go the level below. Numerical order of creation time is not useful at all because you when you are searching for a file (Let's say "/foo/bar") you don't actually provide its creation time but its name & the OS has no idea how to get the creation time of a file by providing only it's name (That creation time will be used later to search inside the tree).Brendan wrote:"No order" just means that you're free to choose whatever order you feel like (e.g. maybe alphabetical order of file name, maybe numerical order of creation time, etc), and as soon as you make a choice it's ordered.
Assume you had a directory containing several thousand files but then deleted all of them except one file; and the info for the file that wasn't deleted was at the end of the space and never changes its position; so now you have a large amount of wasted space before the only entry at the end of the space. Is "large amount of wasted space" what you mean when you say "doesn't consume additional space"?afr0ck wrote:However, I've been thinking about using BTrees & they seem to give a great result because BTrees never change the position of an inserted element, all it does is creating new nodes if there's no space in the parent. And guess what ! The BTree doesn't consume any additional space if implemented as an array so only offsets will be needed to travel through the nodes and a bunch of #defines will do a great work (I guess the code of a BTree implementation will be so fast and tight, cache friendly and will have very minimal overhead).
Note that "array" doesn't really mean anything on its own (array of what?). For a silly example, you could say that RAM is just an array of bits and therefore everything in RAM is stored in a single array. If you mean an array of "struct inode" then you'll be wasting space because each entry in an array has the same size - e.g. if maximum file name length is 256 bytes and average file name length is 16 bytes, then you'd be wasting an average of 240 bytes per file.
With a single global hash table you do one hash table look-up to find (the list containing) "foo/bar/baz/woot/hello.txt"; which works out to a worst case of one cache miss. With a separate hash table for each directory you need to do 5 times as many hash table look-ups to find (the list containing) "foo/bar/baz/woot/hello.txt"; which works out to a worst case of 5 cache misses.afr0ck wrote:It looks like a very interesting idea (Having a multiple level cache could make things faster) but i don't really find it very useful to cache the whole path name. Why not only the filename ? Because you are finally looking for the same file. I think you get identical results by just carefully coding the hashtable with enough space, a good collision resolution strategy and a fairly well distributed hash function.Brendan wrote: I'd probably start with a single large (about half the size of L3 cache, or a few MiB) hash table based on the full path not the file name (e.g. "hash = getHash("foo/bar/baz.txt")" and not "hash = getHash("baz.txt")"); so that (when there's no collision, and if the entry is in the VFS's cache) you can go directly to the entry with the least CPU data cache misses and TLB misses
Note that for modern computers cache misses typically cost around 150 cycles each; and at the start of every system call you can assume that most/all of kernel's data was probably evicted from cache (replaced by user-space data). Halving the number of cache misses could improve performance by 50% (and TLB misses are similar/worse); especially for unpredictable access patterns (like hash table lookups and trees) where pre-fetching can't be used to mitigate the costs.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.