
Identical file detection in file system design

Posted: Thu Aug 05, 2010 5:33 am
by Candy
Hi all,

I'm trying to incorporate identical file detection into my file system design, so that it implicitly merges identical files (and marks them COW) to save space. That means you can put the same file in 10 places and pay for it only once - or use storage structures other than directory-based ones.

The files are then identified by their hash. That works out fine for all the merging and COWing activities, except for one tiny bit: when you modify a huge file, its hash needs to be kept up to date. Recalculating the full hash is very expensive, especially since you only did a write of a 4 kB block somewhere and expect it to take time somehow related to writing 4 kB of data.

My thought about a solution is to make a composite hash. Usually you take a file, hash the entire thing and store that. Hashing each part separately and then combining those hashes (say, with XOR) makes modification cheap: XOR out the old block's hash and XOR in the new one. There's a downside in that swapped blocks won't change the composite, so hash collisions become very likely.
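Roughly, in C (FNV-1a as a stand-in for a real hash, and all names made up, just to illustrate the composition and the weakness):

Code:
#include <stdint.h>
#include <stdio.h>

/* Stand-in block hash: 64-bit FNV-1a. A real design would use a
 * cryptographic hash; this only illustrates the composition. */
static uint64_t block_hash(const uint8_t *data, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    uint8_t a[4] = "AAAA", b[4] = "BBBB";

    /* Composite hash = XOR of per-block hashes. Updating block i is cheap:
     * composite ^= old_block_hash; composite ^= new_block_hash; */
    uint64_t file1 = block_hash(a, 4) ^ block_hash(b, 4); /* blocks A,B */
    uint64_t file2 = block_hash(b, 4) ^ block_hash(a, 4); /* blocks B,A */

    /* Weakness: swapping blocks leaves the composite unchanged. */
    printf("file1 == file2: %s\n", file1 == file2 ? "yes (collision)" : "no");
    return 0;
}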

The second idea I had (and the one I still think will work) is hashing each block with an IV (an initial input to the hash) equal to its block number. That way, moving a block changes that block's hash, so moves won't weaken the composite. I don't know of other problems with this method, but it seems I'm the first to think of it.
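The same sketch with the block number mixed in first (again FNV-1a as a stand-in, names made up):

Code:
#include <stdint.h>
#include <stdio.h>

/* Same stand-in FNV-1a, but seeded with the block number so that the
 * same data at a different offset hashes differently. */
static uint64_t block_hash_iv(uint64_t block_no, const uint8_t *data, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    /* Mix the block number in first (the "IV"). */
    for (int i = 0; i < 8; i++) {
        h ^= (block_no >> (8 * i)) & 0xff;
        h *= 1099511628211ULL;
    }
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    uint8_t a[4] = "AAAA", b[4] = "BBBB";

    uint64_t file1 = block_hash_iv(0, a, 4) ^ block_hash_iv(1, b, 4); /* A,B */
    uint64_t file2 = block_hash_iv(0, b, 4) ^ block_hash_iv(1, a, 4); /* B,A */

    /* Now a swap changes the composite, while an in-place write of block i
     * still only costs two block hashes (XOR out old, XOR in new). */
    printf("file1 == file2: %s\n", file1 == file2 ? "yes" : "no (swap detected)");
    return 0;
}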

Does anybody see a weakness with it? Any thoughts?

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 6:37 am
by NickJohnson
Just an idea: why not implement this at the block level instead of the file level? I'm not sure how all the COW metadata would be stored, but if it works, it would also solve your large-file problem (and probably remove even more redundancy).

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 8:32 am
by Candy
berkus wrote:Calculate hashes of larger blocks, like 128 kB or more. The hash of these hashes, in order, will be the file's master hash.
I was hoping for a directly-updateable mechanism without extra intermediates. I don't see why this would work better.

I agree that it would also work, and without potentially introducing the weaknesses that my approach might. I prefer my own solution though :-)
berkus wrote:Blocks are addressed by their hashes (content-addressable filesystem, like git storage), so different blocks will be in different places and identical blocks will be in the same place, COWing naturally. The file tree could be addressed via hashes too, making it auto-COW-able (e.g. a tree hash is a hash of all file hashes contained therein).
I see no advantage in stretching the hashing down to blocks; files only is good enough. Tree hashes have the downside that if you so much as touch a file deep down the tree, you have to update every block up to the root just to record the modified atime. I have a way to make versioning work in the filesystem without this problem; details will follow (if there's interest, of course) when it works.
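For concreteness, a minimal in-memory sketch of file-level deduplication by content hash (flat table, fixed-size contents and FNV-1a are stand-ins for an on-disk index, variable-length files and a real hash; all names made up):

Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CONTENT_SIZE 16   /* stand-in for a whole file's contents */
#define MAX_FILES    64

/* One stored physical file, addressed by its content hash. */
struct stored_file {
    uint64_t hash;
    uint8_t  data[CONTENT_SIZE];
    int      refcount;            /* how many logical files point at it */
};

static struct stored_file store[MAX_FILES];
static int store_len;

static uint64_t hash_content(const uint8_t *data)
{
    uint64_t h = 1469598103934665603ULL;   /* 64-bit FNV-1a */
    for (int i = 0; i < CONTENT_SIZE; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Store a file's contents, deduplicating by hash; returns the hash. */
static uint64_t put_file(const uint8_t *data)
{
    uint64_t h = hash_content(data);
    for (int i = 0; i < store_len; i++) {
        if (store[i].hash == h && memcmp(store[i].data, data, CONTENT_SIZE) == 0) {
            store[i].refcount++;           /* identical content: share it */
            return h;
        }
    }
    store[store_len].hash = h;
    memcpy(store[store_len].data, data, CONTENT_SIZE);
    store[store_len].refcount = 1;
    store_len++;
    return h;
}

int main(void)
{
    uint8_t contents[CONTENT_SIZE] = "hello world....";

    uint64_t h1 = put_file(contents);   /* first logical file          */
    uint64_t h2 = put_file(contents);   /* second one shares the entry */

    printf("same hash: %s, refcount: %d, stored copies: %d\n",
           h1 == h2 ? "yes" : "no", store[0].refcount, store_len);
    return 0;
}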

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 8:47 am
by Owen
Do people still run their filesystems in atime mode? Most Unices now mount them in relatime mode by default (that is, only update the atime if it is older than or equal to the modification time). Windows now does the same too.

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 9:47 am
by Candy
Owen wrote:Do people still run their filesystems in atime mode? Most Unices now mount them in relatime mode by default (that is, only update the atime if it is older than or equal to the modification time). Windows now does the same too.
The point is also valid for logfiles, or for pathological cases such as while true; do rm file; touch file; done. Rewriting the root directory all the time sounds like a bad idea.

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 12:43 pm
by Ready4Dis
Why not just store something like a 'file updated since last hash' flag, and run something in the background that calculates the hash for files that have been updated but not hashed yet? The benefit is that programs that generate huge files (like video recording, or ripping DVDs) will still run just as fast; once they are done and there are free CPU cycles, a hash will be generated.
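Roughly (illustrative names only, no real I/O or threading - just the dirty-flag idea):

Code:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-core file record: hash is only trusted when !dirty. */
struct file_rec {
    const char *name;
    uint64_t    hash;
    bool        dirty;   /* set on every write, cleared after rehash */
};

/* Placeholder for a full-file rehash (would stream the file's blocks). */
static uint64_t rehash_file(const struct file_rec *f)
{
    (void)f;
    return 0x1234;       /* dummy value for the sketch */
}

/* Called from the write path: cheap, no hashing at all. */
static void on_write(struct file_rec *f)
{
    f->dirty = true;
}

/* Called from an idle-time/background task. */
static void background_hasher(struct file_rec *files, int n)
{
    for (int i = 0; i < n; i++) {
        if (files[i].dirty) {
            files[i].hash  = rehash_file(&files[i]);
            files[i].dirty = false;
        }
    }
}

int main(void)
{
    struct file_rec files[1] = { { "capture.avi", 0, false } };
    on_write(&files[0]);             /* fast path during recording */
    background_hasher(files, 1);     /* later, when the CPU is idle */
    printf("%s hash=%llx dirty=%d\n", files[0].name,
           (unsigned long long)files[0].hash, files[0].dirty);
    return 0;
}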

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 1:22 pm
by Owen
Candy wrote:
Owen wrote:Do people still run their filesystems in atime mode? Most Unices now mount them in relatime mode by default (that is, only update the atime if it is older than or equal to the modification time). Windows now does the same too.
The point is also valid for logfiles, or for pathological cases such as while true; do rm file; touch file; done. Rewriting the root directory all the time sounds like a bad idea.
Hmm. Is there any way you could manage lazy updates? Perhaps you need a way of storing separate journal entries for directory updates, which can be applied more lazily? You would only need to apply them when the directory is evicted from the cache, or when performing a journal replay (i.e. at boot).

EDIT: Actually, it occurs to me that if your files are constructed of immutable blocks, those block writes don't need to pass through the journal: allocate a new block, write into that, then place the metadata updates in the journal.
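Something like this ordering (all functions here are hypothetical stubs, just to show the sequence):

Code:
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stubs: in a real filesystem these would hit the disk. */
static uint64_t alloc_block(void)                        { return 42; }
static void write_block(uint64_t blk, const void *data)  { (void)blk; (void)data; }
static void journal_metadata(uint64_t file_id, uint64_t old_blk, uint64_t new_blk)
{
    printf("journal: file %llu, block %llu -> %llu\n",
           (unsigned long long)file_id,
           (unsigned long long)old_blk,
           (unsigned long long)new_blk);
}

/* Because blocks are immutable, the data write can never corrupt existing
 * state, so only the small metadata update needs journal protection. */
static void cow_write(uint64_t file_id, uint64_t old_blk, const void *data)
{
    uint64_t new_blk = alloc_block();            /* 1. allocate a fresh block */
    write_block(new_blk, data);                  /* 2. write data, no journal */
    journal_metadata(file_id, old_blk, new_blk); /* 3. journal the pointer    */
}

int main(void)
{
    char buf[4096] = "new contents";
    cow_write(7, 13, buf);
    return 0;
}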

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 2:00 pm
by Candy
Owen wrote:Hmm. Is there any way you could manage lazy updates? Perhaps you need a way of storing separate journal entries for directory updates, which can be applied more lazily? You would only need to apply them when the directory is evicted from the cache, or when performing a journal replay (i.e. at boot).
Lazy updates would defeat the point of having a file's hash readily accessible. Many systems want to know when a file has changed, and right now they use timestamps for that. That means that when a file is replaced with itself (saving an unchanged file), all those systems will reload it. Make doesn't really care whether you wrote to a file; it cares whether you changed it. It should use the hash - but right now that's a bit expensive. It should be cheaper.
EDIT: Actually, it occurs to me that if your files are constructed of immutable blocks, those block writes don't need to pass through the journal: allocate a new block, write into that, then place the metadata updates in the journal.
That's actually a good point. I'll think about that overnight...

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 2:02 pm
by Candy
Ready4Dis wrote:Why not just store something like a 'file updated since last hash' flag, and run something in the background that calculates the hash for files that have been updated but not hashed yet? The benefit is that programs that generate huge files (like video recording, or ripping DVDs) will still run just as fast; once they are done and there are free CPU cycles, a hash will be generated.
The hash system I outlined in my opening post does that too, and in real time rather than as an after-the-fact idle-time calculation. It also works on regularly updated huge files such as big databases, which are multi-gigabyte and never sit still long enough to take a hash of them. If versioned, they'd create new versions faster than you could possibly keep up - while this system can.

Re: Identical file detection in file system design

Posted: Thu Aug 05, 2010 11:24 pm
by Candy
berkus wrote:
Candy wrote:Tree hashes have the downside that if you so much as touch a file deep down the tree, you have to update every block up to the root just to record the modified atime. I have a way to make versioning work in the filesystem without this problem; details will follow (if there's interest, of course) when it works.
Yes, please; the btrfs guys couldn't, so I'm wondering how you did.
The filesystem starts with a list of raw files enumerated by their hash. I haven't decided how best to make this quickly accessible, but I guess a hash table will suffice. The layer above this describes logical files, which may or may not refer to the same physical file. Physical files are never changed (only dropped/deleted); logical files can be modified if not marked versioned. Versioned logical files (which is what you wanted to know about) work by taking the original file and modifying it. The original is then stored as an inverse diff against the new version. That way the most recent version is always stored as a full file and the previous version as a delta (if the original has multiple references, it's copied instead of being converted to a diff). With the algorithm from the opening post you can update the hash of the old file into the hash of the new file.
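A much-simplified sketch of the reverse-delta part (equal-length versions and a bytewise XOR delta stand in for a real binary diff; names are made up):

Code:
#include <stdint.h>
#include <stdio.h>

#define LEN 8

/* The newest version is stored in full; the older version is kept only as
 * an inverse delta that reconstructs it from the newest version. */
static void make_reverse_delta(const uint8_t *old_v, const uint8_t *new_v,
                               uint8_t *delta)
{
    for (int i = 0; i < LEN; i++)
        delta[i] = old_v[i] ^ new_v[i];
}

static void apply_reverse_delta(const uint8_t *new_v, const uint8_t *delta,
                                uint8_t *old_v)
{
    for (int i = 0; i < LEN; i++)
        old_v[i] = new_v[i] ^ delta[i];
}

int main(void)
{
    uint8_t v1[LEN + 1] = "version1";   /* original version              */
    uint8_t v2[LEN + 1] = "version2";   /* modified version, stored full */
    uint8_t delta[LEN], recovered[LEN + 1] = { 0 };

    make_reverse_delta(v1, v2, delta);          /* keep only this delta for v1 */
    apply_reverse_delta(v2, delta, recovered);  /* walk back to v1 on demand   */

    printf("recovered: %s\n", (char *)recovered);
    return 0;
}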

The directories are a layer that contains simple entries. Each entry describes which logical file was present in that directory, along with the start and end time of that record. Adding a new file only updates this record and the hashes of the directory and the file, but not the directories above it.
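A possible shape for such a record (field names and layout are made up for illustration):

Code:
#include <stdint.h>
#include <string.h>

/* One directory entry: which logical file lived under this name, and
 * during which time window that mapping was valid (end_time == 0 means
 * the entry is current). */
struct dir_record {
    uint8_t  file_hash[32];   /* hash identifying the logical file      */
    uint64_t start_time;      /* when this name -> file mapping began   */
    uint64_t end_time;        /* when it was replaced/removed, 0 = open */
    char     name[256];
};

/* Replacing a file closes the old record and opens a new one; only this
 * directory's own hash changes, not the hashes of its parents. */
void replace_entry(struct dir_record *old_rec, struct dir_record *new_rec,
                   const uint8_t new_hash[32], uint64_t now)
{
    old_rec->end_time = now;                       /* close the old mapping */
    memcpy(new_rec->file_hash, new_hash, 32);      /* open a new one        */
    memcpy(new_rec->name, old_rec->name, sizeof new_rec->name);
    new_rec->start_time = now;
    new_rec->end_time   = 0;
}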

This has the implication that the directory hash covers the direct contents of the directory, not its recursive contents, so you can't use it as an atime replacement (which btrfs probably has as a requirement). Instead, I intend to figure out some kind of tree-delta tag that makes the OS write all changes in the tree to a file, which your make system can then read and empty. You should be able to add multiple such tags to do that N times, for example if you have make, a website with data, and something else that all need to know.

Re: Identical file detection in file system design

Posted: Sun Aug 15, 2010 5:52 am
by nikito
I would do this: writes mark files as changed; an equality test needs to recompute the hash only if the change bit is set, and updating the hash clears the change bit. This could also require some extra logic at mount/unmount time.

Re: Identical file detection in file system design

Posted: Wed Aug 18, 2010 6:47 am
by MasterLee
I don't think the idea is good at all. For example, I have one big file, say 100 MB, and I make a copy of it. Your file system will merge both copies, so we have two directory entries pointing at the same data. Then I modify a tiny part of the copy. And then what? Is the original also altered? Or does the file system have to make the real copy at that point? In short, when I make a copy of a blob of data I want a copy of that blob, not an extra link to the original data. I can make hard links myself if I want to.

Re: Identical file detection in file system design

Posted: Wed Aug 18, 2010 10:54 am
by Combuster
MasterLee wrote:I don't think the idea is good at all. For example, I have one big file, say 100 MB, and I make a copy of it. Your file system will merge both copies, so we have two directory entries pointing at the same data. Then I modify a tiny part of the copy. And then what? Is the original also altered? Or does the file system have to make the real copy at that point? In short, when I make a copy of a blob of data I want a copy of that blob, not an extra link to the original data. I can make hard links myself if I want to.
You missed the point there: copying of the actual data is not performed until one of the files is written to. Until then, there are two distinct files that share disk sectors for as long as they can while keeping up the appearance of separate files. That's different from linking, which only gives the same data two names, so altering one actually changes the other.
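In sketch form (illustrative names, a single fixed-size block per file, just to show the copy-before-write decision):

Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8

/* Shared block with a reference count; two deduplicated files point here. */
struct block {
    uint8_t data[BLOCK_SIZE];
    int     refcount;
};

struct file {
    struct block *blk;     /* the (possibly shared) block this file uses */
};

/* Write path: if the block is shared, give this file a private copy first.
 * The other file keeps seeing the old data - unlike a hard link. */
static void cow_write(struct file *f, const uint8_t *new_data, struct block *fresh)
{
    if (f->blk->refcount > 1) {
        f->blk->refcount--;                    /* detach from shared block */
        memcpy(fresh->data, f->blk->data, BLOCK_SIZE);
        fresh->refcount = 1;
        f->blk = fresh;                        /* now private to this file */
    }
    memcpy(f->blk->data, new_data, BLOCK_SIZE);
}

int main(void)
{
    struct block shared = { "original", 2 }, spare;
    struct file a = { &shared }, b = { &shared };   /* deduplicated copies */

    cow_write(&b, (const uint8_t *)"modified", &spare);

    printf("a: %.8s, b: %.8s\n",
           (const char *)a.blk->data, (const char *)b.blk->data);
    return 0;
}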

Re: Identical file detection in file system design

Posted: Wed Aug 18, 2010 1:48 pm
by MasterLee
Combuster wrote:
MasterLee wrote:I don't think the idea is good at all. For example, I have one big file, say 100 MB, and I make a copy of it. Your file system will merge both copies, so we have two directory entries pointing at the same data. Then I modify a tiny part of the copy. And then what? Is the original also altered? Or does the file system have to make the real copy at that point? In short, when I make a copy of a blob of data I want a copy of that blob, not an extra link to the original data. I can make hard links myself if I want to.
You missed the point there: copying of the actual data is not performed until one of the files is written to. Until then, there are two distinct files that share disk sectors for as long as they can while keeping up the appearance of separate files. That's different from linking, which only gives the same data two names, so altering one actually changes the other.
I had seen that point/problem. Think about editing a large uncompressed high-resolution video file of a Chinese military parade. Unfortunately a streaker ran through the parade and some dissidents showed anti-government banners, so now the censor has to put black boxes over them. First he makes a copy, since the original is needed later by the secret agency, and he is surprised that the copy finishes in nearly no time. He opens his video editor, advances the timeline to the appearance of the streaker, sets the black boxes and saves. Even though the video editing software only saves the changed frames, the save takes ages. After waiting an hour he can continue with the protesters.
It would only make sense when, for example, programs ship with similar data: say two games share the same physics engine as a dynamic link library. When one game patches the engine, a real copy is created; then, some days later, when the second game is updated, the library files are merged again by duplicate file detection. But that problem only appears where no central package management is available.
Otherwise it could only appear when two users of the same computer/storage space download the same file.

But when the change occurs, I must copy the whole file anyway. Why not rehash the whole file during the copy? Besides, I wouldn't rely on hash values alone.