
Filesystem interface design

Posted: Mon Oct 23, 2006 10:51 pm
by proxy
OK, I have a fairly functional VFS design which supports both in-memory file systems and actual on-disk file systems. It even supports namespaces, so things that don't live in the traditional Unix-style fs are still accessible through URL-like prefixes (e.g. "devices://mouse" or "processes://12453").

I have come across something I don't like. EXT2/3, my reference disk-based fs implementation, is designed such that to get a given directory entry, you need an inode and an index within that inode. First you locate the inode on disk, which gives you an "array" of dentries that are not of equal size (the size is determined by the rec_len member of each element). Then you loop, reading each entry, until you have found the element at your index.

This bothers me because to get a given element, say the 5th entry in a directory, I need to read the 4 entries before it to get there. This can result in a lot of redundant reads from disk!
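To make it concrete, the walk looks roughly like this (the field names are my paraphrase of the on-disk ext2 layout and nth_entry() is just a stand-in for my real code, so treat it as a sketch):

Code: Select all

#include <stddef.h>
#include <stdint.h>

// On-disk ext2/3 directory entry header (the name bytes follow immediately).
// Field layout is from memory of the ext2 docs, so treat it as illustrative.
struct Ext2DirEntry {
    uint32_t inode;      // inode number of this entry (0 == unused slot)
    uint16_t rec_len;    // total length of this record, including padding
    uint8_t  name_len;   // length of the name that follows
    uint8_t  file_type;  // type hint (when the filetype feature is enabled)
    // char  name[];     // name_len bytes, not NUL terminated
} __attribute__((packed));

// There is no way to jump straight to entry N in a directory block: you have
// to hop rec_len by rec_len from the start, counting live entries as you go.
const Ext2DirEntry *nth_entry(const uint8_t *block, size_t block_size, size_t n) {
    size_t offset = 0, i = 0;
    while (offset + sizeof(Ext2DirEntry) <= block_size) {
        const Ext2DirEntry *de =
            reinterpret_cast<const Ext2DirEntry *>(block + offset);
        if (de->rec_len == 0)
            break;                 // corrupt block, bail out
        if (de->inode != 0) {      // skip deleted/unused slots
            if (i == n)
                return de;
            ++i;
        }
        offset += de->rec_len;     // the only way forward is this record's rec_len
    }
    return 0;                      // n is past the end of this block
}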

So, I was looking at how Linux might deal with this, and it seems that perhaps it does not... at least not directly. The primary interface is "getdents", which also returns an array of non-uniform directory entries. This at least gives me the impression that Linux returns whole directory listings at a time to user space, leaving it up to user space to parse and store the data efficiently if needed.
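i.e. the classic user-space loop, something like this (the struct is roughly what the getdents man page describes, so don't quote me on the exact details):

Code: Select all

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

// Rough shape of how user space consumes a getdents result: a byte buffer of
// variable-length records that you step through using d_reclen.
struct linux_dirent {
    unsigned long  d_ino;      // inode number
    unsigned long  d_off;      // offset to the next dirent
    unsigned short d_reclen;   // length of this record
    char           d_name[1];  // name (NUL terminated, record is padded)
};

void list_directory(int fd) {
    char buf[4096];
    for (;;) {
        long nread = syscall(SYS_getdents, fd, buf, sizeof(buf));
        if (nread <= 0)
            break;  // 0 == end of directory, < 0 == error
        for (long bpos = 0; bpos < nread; ) {
            const linux_dirent *d =
                reinterpret_cast<const linux_dirent *>(buf + bpos);
            printf("%s\n", d->d_name);
            bpos += d->d_reclen;   // again, no random access: hop record to record
        }
    }
}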

So my question really is: how would you guys organize around this seemingly inefficient technique? I am toying with having my FileSystem objects return std::list<DirectoryEntry> objects (I have a fairly conformant libstdc++/libc in kernel land :)). These could easily be passed up to user space efficiently, given that callers want a whole directory listing at a time, and iterating a given directory listing would be easy for the caller: just use a list iterator.
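Something like this is what I'm picturing (DirectoryEntry's fields here are just a guess at what I'd want, nothing is settled):

Code: Select all

#include <stdint.h>
#include <list>
#include <string>

// Sketch of the VFS-facing interface if a FileSystem hands back whole
// directory listings at a time. Names and fields are illustrative only.
struct DirectoryEntry {
    uint32_t    inode;  // inode number of the entry
    uint8_t     type;   // file, directory, symlink, ...
    std::string name;   // entry name copied out of the on-disk record
};

class FileSystem {
public:
    virtual ~FileSystem() {}

    // Read the entire directory identified by 'dir_inode' in one shot; the
    // caller just iterates the returned list instead of asking for entry N.
    virtual std::list<DirectoryEntry> read_directory(uint32_t dir_inode) = 0;
};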

Keep in mind that I am talking mostly about the kernel talking to the VFS, not user space; if I can do it efficiently in the kernel, I believe I can give this information to user space efficiently as well.

The first fault that comes to mind with this idea is: what if there are a LOT of entries in the directory? Is it worth creating a huge list then? Caching would help with redundant reads and such, but it may become an issue once I have more complete write support, at least in terms of added complexity. As far as caching goes, I am thinking I could store a certain number of megabytes worth of previously built std::list objects; any write to the associated inode would invalidate its listing instantly (causing a refetch from disk). When I hit the upper limit of the cache, the oldest listing gets pushed out. I feel this would be fairly efficient given a decent hit rate.
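The cache I have in mind is something like this (very rough; the byte accounting and eviction policy are hand-waved, and DirectoryEntry is the same hypothetical struct as above):

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <list>
#include <map>
#include <string>

struct DirectoryEntry { uint32_t inode; uint8_t type; std::string name; };

// Rough sketch of a size-bounded cache of directory listings keyed by the
// directory's inode. Any write to that inode calls invalidate(); when the
// byte budget is exceeded, the oldest listing gets pushed out.
class DirListingCache {
public:
    explicit DirListingCache(size_t max_bytes) : max_bytes_(max_bytes), used_bytes_(0) {}

    // Returns the cached listing, or 0 if we don't have one.
    const std::list<DirectoryEntry> *lookup(uint32_t dir_inode) const {
        std::map<uint32_t, Entry>::const_iterator it = cache_.find(dir_inode);
        return it == cache_.end() ? 0 : &it->second.listing;
    }

    void insert(uint32_t dir_inode, const std::list<DirectoryEntry> &listing) {
        invalidate(dir_inode);
        size_t bytes = estimate_bytes(listing);
        while (used_bytes_ + bytes > max_bytes_ && !age_order_.empty())
            invalidate(age_order_.front());     // evict the oldest listing
        Entry e; e.listing = listing; e.bytes = bytes;
        cache_[dir_inode] = e;
        age_order_.push_back(dir_inode);
        used_bytes_ += bytes;
    }

    // The write path calls this whenever the directory's inode is modified,
    // forcing the next lookup to refetch from disk.
    void invalidate(uint32_t dir_inode) {
        std::map<uint32_t, Entry>::iterator it = cache_.find(dir_inode);
        if (it == cache_.end()) return;
        used_bytes_ -= it->second.bytes;
        age_order_.remove(dir_inode);
        cache_.erase(it);
    }

private:
    struct Entry { std::list<DirectoryEntry> listing; size_t bytes; };

    static size_t estimate_bytes(const std::list<DirectoryEntry> &l) {
        size_t n = 0;
        for (std::list<DirectoryEntry>::const_iterator i = l.begin(); i != l.end(); ++i)
            n += sizeof(DirectoryEntry) + i->name.size();
        return n;
    }

    size_t max_bytes_, used_bytes_;
    std::map<uint32_t, Entry> cache_;   // dir inode -> cached listing
    std::list<uint32_t> age_order_;     // oldest directory inode at the front
};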

Just looking for some thoughts :)
proxy

Posted: Mon Oct 23, 2006 11:41 pm
by spix
getdents is part of the SUS, and since Linux is a Unix clone, it follows that design.

How would you access a file from a filesystem with an inode? Most filesystems use a string of characters to represent files, and inodes are a device for the filesystem to refer to the specifics of the file, i.e. dentry points to inode points to file blocks.

If, say, you knew how many files were in a directory, and you knew all their filenames, and the record lengths were the same so you could jump to an entry immediately, I think you will find that by the time you have obtained all the information above, you've already used up the time it would have taken to use a getdents-style approach.

Perhaps if you could invent a system where the string of characters representing the file was actually the inode reference as well, that might work.

Sorry, that probably doesn't help much.

Posted: Tue Oct 24, 2006 9:18 am
by proxy
Unfortunately, this is not really the type of answer I was looking for ;)

Firstly, if you have the inode number of a file, you are mostly done looking it up, since all inodes have a fixed location on disk. Secondly, you can almost never guarantee that dirents will have uniform reclens, so you pretty much always have to start at the first one and iterate to the one you want.

Now, ext2/3 has a directory structure layout which is very similar to a getdents result; I believe this is no coincidence...

My question really is: aside from copying the UNIX-style getdents, how else would people do this?

proxy

Posted: Tue Oct 24, 2006 9:38 am
by spix
proxy wrote:Secondly, you can almost never guarantee that dirents will have uniform reclens
Are we talking about ext2/3 or filesystems in general? Sure, if ext2/3, but you could easily ensure reclens are the same: you just pad the record with zeros. In fact, if you look into C library implementations you'll find (at least with newlib) that your dirent structures are uniform in size regardless of the filesystem; the name is padded out to the maximum name length (which varies between OSes).
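i.e. something along these lines, where every record ends up the same size because the name field is padded to a maximum length (the 255 here is just a stand-in for whatever NAME_MAX the OS picks):

Code: Select all

#include <string.h>

// Illustrative only: a C-library-side directory record where every entry is
// the same size because the name is padded with zeros to the maximum length.
enum { MAX_NAME_LEN = 255 };

struct fixed_dirent {
    unsigned long  ino;                    // inode number
    unsigned short reclen;                 // always sizeof(fixed_dirent)
    char           name[MAX_NAME_LEN + 1]; // zero-padded past the real name
};

void fill_entry(fixed_dirent &d, unsigned long ino, const char *name) {
    d.ino = ino;
    d.reclen = sizeof(fixed_dirent);
    memset(d.name, 0, sizeof(d.name));
    strncpy(d.name, name, MAX_NAME_LEN);   // leaves at least one NUL at the end
}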

But that's not really the point.

I think you would have to design a new filesystem if you want to use a different approach (which is what I meant by some way of representing the inode with the actual filename). If you want to use ext2/3 for compatibility, you are going to want to read sequentially, and if you are reading from a block device this isn't a huge performance penalty, especially if you are using preallocation and goals so that you get all the blocks near one another.

Sorry, that's not much help either.

Posted: Tue Oct 24, 2006 11:14 am
by proxy
I was referring to the actual directory entry structures within the ext2/3 filesystem, not the slightly abstracted version that you get by calling getdents.

I have little to no control over what it looks like on disk if I want to support ext2/3, since I can't guarantee that I will only look at filesystems produced by me. Besides, what's the point of implementing an existing filesystem if my version is unable to read other people's implementations, heh?

My current implementation works... it's more that I am not happy with the internals at the moment; I find it to be a messy and inefficient implementation.

I think you are kinda missing my question though. Right now my in-kernel VFS interface revolves around iterators which boil down to inode/index pairs: the inode refers to the owning directory and the index is which entry within the directory. In order to get at index n, I have to iterate under the hood from index 0 to index n in order, since the entries are not uniform in size. If ext2/3 directories contained a table of offsets or something at a fixed location relative to the directory's inode, then this could be made a non-issue, but it would be wasteful of space.
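Concretely, the current design amounts to something like this (not my real code, just the shape of it; fs_read_dirent_by_scan() stands in for the ext2 walk):

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <string>

struct DirectoryEntry { uint32_t inode; uint8_t type; std::string name; };

// Shape of the current design: an iterator is an (inode, index) pair, and
// dereferencing it forces a scan of the on-disk records from entry 0 up to
// 'index_', because the records are variable length.
class DirectoryIterator {
public:
    DirectoryIterator(uint32_t dir_inode, size_t index)
        : dir_inode_(dir_inode), index_(index) {}

    DirectoryIterator &operator++() { ++index_; return *this; }

    DirectoryEntry operator*() const {
        return fs_read_dirent_by_scan(dir_inode_, index_);
    }

private:
    // Placeholder for the ext2 code: reads the directory's data blocks and
    // hops rec_len by rec_len, counting entries, until it reaches 'index'.
    static DirectoryEntry fs_read_dirent_by_scan(uint32_t dir_inode, size_t index) {
        (void)dir_inode; (void)index;
        return DirectoryEntry();   // stub for the sketch
    }

    uint32_t dir_inode_;
    size_t   index_;
};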

So the question is: if I have to iterate anyway, perhaps I should present the data differently to the kernel? As previously mentioned, I was thinking about abstracting the concept of a "directory" into a std::list<DirectoryEntry> object, so that you request a whole directory listing at a time. But this could be expensive with large directory listings. I could mitigate that by "remembering" recently requested directory listings, but I don't feel like this is an ideal solution.

I am not currently interested in making my own FS or my own variant of an existing one; I am trying to have a conforming and compatible ext2/3 implementation in my VFS design. Really, the crux of the issue is: given how ext2/3 is organized, what are some options to redesign my VFS around it?

proxy

Posted: Tue Oct 24, 2006 12:20 pm
by Brendan
Hi,
proxy wrote:Right now my in-kernel VFS interface revolves around iterators which boil down to inode/index pairs: the inode refers to the owning directory and the index is which entry within the directory. In order to get at index n, I have to iterate under the hood from index 0 to index n in order, since the entries are not uniform in size.
Could your in-kernel VFS structures be modified to be a fixed size?

Code: Select all

struct {
    char * address_of_actual_name_string_in_dynamically_allocated_space;
    unsigned int hash_of_name_string_to_reduce_string_comparisons;
    unsigned int offset_of_directory_entry_on_the_disk_in_case_its_needed;
    /* ... other fixed size stuff ... */
};
Just an idea...


Cheers,

Brendan

Posted: Tue Oct 24, 2006 1:32 pm
by proxy
Well, yes and no; of course, once I get the data, I can store it however I want.
The tricky part is efficiently getting the data...but you did give me an idea.

The problem is that ext2/3 is what it is; no changing that. But I could have my ext2/3 code keep caches of useful information, such as inode/table pairs associating a given inode with a list of offsets to its directory entries (obviously this would have to be invalidated on write).
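Roughly what I mean, as a sketch (the real thing would hang off my ext2 code, and the names here are made up):

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <map>
#include <vector>

// Sketch of the idea: for each directory inode, remember the on-disk byte
// offset of every directory entry from the last full walk. Getting entry N
// then becomes one seek+read instead of a scan from entry 0, until a write
// to that directory invalidates its table.
class DirOffsetCache {
public:
    // Record the offsets collected during a full walk of directory 'dir_inode'.
    void remember(uint32_t dir_inode, const std::vector<uint64_t> &offsets) {
        tables_[dir_inode] = offsets;
    }

    // Returns true and fills in 'offset' if we still hold a valid table.
    bool lookup(uint32_t dir_inode, size_t index, uint64_t &offset) const {
        std::map<uint32_t, std::vector<uint64_t> >::const_iterator it =
            tables_.find(dir_inode);
        if (it == tables_.end() || index >= it->second.size())
            return false;
        offset = it->second[index];
        return true;
    }

    // Any write touching the directory must call this.
    void invalidate(uint32_t dir_inode) { tables_.erase(dir_inode); }

private:
    std::map<uint32_t, std::vector<uint64_t> > tables_;  // dir inode -> entry offsets
};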

We'll see how this works out; any other ideas would be nice, of course.

proxy