please help comment on my plan for disk storage
I've been using an EXT2 disk image loaded by GRUB (as a boot module). It works well and has allowed me to load programs built against a minimal C layer (printf, etc.) to test out various syscalls and features of the kernel.
Now I'm thinking of moving towards a real libc (looking at newlib) and feel that more persistent storage is probably a prerequisite for that, especially since the longer-term plan is to port GCC.
So below is my high-level plan for supporting disk storage. I hope to get some comments from you experts so I don't make "costly to fix later" decisions early on.
Device layer:
PIO ATA for now (blocks any caller)
DMA support later (yields the caller and resumes it from the ISR; maybe also remap the interrupt to the CPU the caller is on; probably also needs some request queuing)
Interface would be "R/W n bytes from byte offset m from/to disk x"
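Roughly, I'm imagining an interface shaped like this (just a sketch; the names and types are placeholders, nothing is settled):

```c
/* Hypothetical device-layer interface (sketch only). */
#include <stdint.h>
#include <stddef.h>

typedef int disk_id_t;

/* Read/write 'len' bytes starting at byte offset 'off' on disk 'disk'.
 * Returns 0 on success, a negative value on error. */
int disk_read(disk_id_t disk, uint64_t off, void *buf, size_t len);
int disk_write(disk_id_t disk, uint64_t off, const void *buf, size_t len);
```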
Cache layer:
Indexed by logical (byte) offset into the disk; use an AVL tree for lookup and a linked list for true LRU replacement (rough sketch of the per-line structure below)
Any cache miss goes down to the Device layer; the minimal granularity is one "cache line" of 64KB; the write policy is write-back
The AVL and linked-list structures will be pre-allocated at boot (mount?); the cache lines themselves will come from the main physical heap at runtime
High usage of the main physical heap will force this cache to write back and invalidate some lines based on LRU
At shutdown (umount?) the whole cache will be written back. Might also write back without invalidating periodically?
Interface would be the same as the Device layer (so the cache can be bypassed at first / when debugging)
This layer also needs to handle page table mapping of the cache lines, so it can copy between the buffer given by the caller and the buffer representing disk content. If the cache is bypassed, the Device layer needs to do this, or there needs to be a 'mapper' layer that sits on top of either the cache or the device layer.
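In terms of data structures, each cache line would look roughly like this (a sketch only; avl_node and list_node here are stand-ins for whatever intrusive node types the kernel already has):

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_LINE_SIZE (64 * 1024)

/* Stand-ins for the kernel's existing intrusive node types. */
struct avl_node  { struct avl_node *left, *right; int balance; };
struct list_node { struct list_node *prev, *next; };

/* One 64KB cache line: lives in the AVL tree (keyed by its line-aligned
 * byte offset into the disk) and on the LRU list at the same time. */
struct cache_line {
    uint64_t         disk_off;   /* 64KB-aligned byte offset into the disk  */
    void            *data;       /* 64KB buffer from the main physical heap */
    bool             dirty;      /* must be written back before eviction    */
    struct avl_node  tree_node;  /* lookup by offset                        */
    struct list_node lru_node;   /* moved to the head on every access       */
};
```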
File system layer:
The plan is to expand the EXT2 support to handle writes.
Probably some naive way to linearly allocate space without regard for fragmentation for now.
The interface is inode based right now, but the plan is to switch to a purely file-name-based one, like "R/W n bytes from byte offset m from/to the file named X", so that when a caller writes a file it doesn't have to decide which inodes to write to.
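Interface-wise I'm picturing something like this (again just a sketch with placeholder names):

```c
#include <stdint.h>
#include <stddef.h>

/* Read/write 'len' bytes at byte offset 'off' of the file at 'path'.
 * The file system layer resolves the path to inodes and blocks
 * internally, so the caller never sees inode numbers.
 * Returns the number of bytes transferred, or a negative error. */
long fs_read(const char *path, uint64_t off, void *buf, size_t len);
long fs_write(const char *path, uint64_t off, const void *buf, size_t len);
```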
Host:
Plan on making the file system layer also runnable on a host so it (especially the write parts) can be easily unit tested.
It can also be used as the core of a tool to package/verify/extract the disk image, and maybe even to 'de-fragment' it.
Future work:
Support more interesting devices like USB mass storage (another implementation of the Device layer interface).
Some sort of overlay support (like a live CD) that sits between the file system layer and the cache layer to filter writes into a separate ramdisk, so it can safely run on real machines without too much fear of damaging anything.
Re: please help comment on my plan for disk storage
xeyes wrote:
> PIO ATA for now (blocks any caller)

It sounds like either way you block the caller. Is your PIO driver polling instead of using IRQs?

xeyes wrote:
> DMA support later (yields the caller and resumes it from the ISR; maybe also remap the interrupt to the CPU the caller is on; probably also needs some request queuing)

Modern storage hardware can queue multiple requests (but not when you're using IDE compatibility).

xeyes wrote:
> Interface would be "R/W n bytes from byte offset m from/to disk x"

Are you allowing accesses that aren't aligned to block boundaries?

Don't forget a partition layer.
Re: please help comment on my plan for disk storage
Thanks!

Octocontrabass wrote:
> It sounds like either way you block the caller. Is your PIO driver polling instead of using IRQs?

Yes, the caller has to wait for the disk either way, but with DMA the CPU is not blocked and can run something else from its run queue.

I don't have the PIO driver yet, but the plan is to poll, as a shorthand to get things going before normal operation (DMA mode) comes online. Is it valuable to support interrupts for PIO instead? As I understand it, the CPU needs to loop over port I/O anyway?

Octocontrabass wrote:
> Modern storage hardware can queue multiple requests (but not when you're using IDE compatibility).

Will look into HW queues. I was thinking of a software queue where each ISR would try to set up the next request and point the interrupt at its originating CPU (so that CPU will take over and do the same thing when the next interrupt happens, and the chain goes on until the queue is used up).

Octocontrabass wrote:
> Are you allowing accesses that aren't aligned to block boundaries?

Probably, since in normal operation (with the cache inline) the cache will "absorb" any alignment issue due to its 64KB line size. I'll put in something to read across blocks as needed for the caller when running without the cache. Is it a bad idea to allow unaligned access?

Octocontrabass wrote:
> Don't forget a partition layer.

Good point about partitions, I totally ignored that :( I guess it is still logical enough that it should be a layer right below the file system, and I should probably start with the good old MBR?
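Something like this is what I have in mind for the MBR case (a rough sketch; the offsets are the classic MBR partition table layout):

```c
#include <stdint.h>

/* Classic MBR layout: four 16-byte partition entries at offset 446,
 * boot signature 0x55 0xAA at offset 510. Fields are little-endian. */
struct mbr_part {
    uint8_t  status;        /* 0x80 = bootable                        */
    uint8_t  chs_first[3];  /* CHS of first sector (ignored, use LBA) */
    uint8_t  type;          /* 0x00 = unused, 0x83 = Linux, ...       */
    uint8_t  chs_last[3];
    uint32_t lba_start;     /* first sector of the partition          */
    uint32_t sector_count;  /* length in sectors                      */
} __attribute__((packed));

/* 'sector0' is the first 512 bytes of the disk, already read by the
 * device layer. Copies used entries into out[4]; returns the count,
 * or -1 if there is no valid MBR signature. */
static int mbr_parse(const uint8_t *sector0, struct mbr_part *out)
{
    if (sector0[510] != 0x55 || sector0[511] != 0xAA)
        return -1;

    int used = 0;
    for (int i = 0; i < 4; i++) {
        const struct mbr_part *p =
            (const struct mbr_part *)(sector0 + 446 + i * 16);
        if (p->type != 0x00)
            out[used++] = *p;
    }
    return used;
}
```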
Re: please help comment on my plan for disk storage
xeyes wrote:
> I don't have the PIO driver yet, but the plan is to poll, as a shorthand to get things going before normal operation (DMA mode) comes online. Is it valuable to support interrupts for PIO instead? As I understand it, the CPU needs to loop over port I/O anyway?

The CPU needs to wait for the disk to be ready before it can loop over port I/O. If you use interrupts, the CPU can do something useful while waiting for the disk. Whether or not that's valuable depends on how many PIO commands you'll be using once you have DMA working. (Some ancient hardware doesn't support DMA.)

xeyes wrote:
> Is it a bad idea to allow unaligned access?

I think so. It's a lot of extra work for the device layer, and you're not going to use it at all once the cache layer works.

xeyes wrote:
> I guess it is still logical enough that it should be a layer right below the file system, and I should probably start with the good old MBR?

Yes, but don't forget about GPT.
Re: please help comment on my plan for disk storage
Thanks again!

Octocontrabass wrote:
> The CPU needs to wait for the disk to be ready before it can loop over port I/O. If you use interrupts, the CPU can do something useful while waiting for the disk. Whether or not that's valuable depends on how many PIO commands you'll be using once you have DMA working. (Some ancient hardware doesn't support DMA.)

I'll perhaps give interrupt-driven PIO a shot. It seems like the control flow is the same as DMA (PIO does the copy in a loop after seeing the interrupt, while DMA is notified that the copy is done or has failed after seeing the interrupt), so unless it causes lots of debugging trouble it shouldn't take that long to implement.

Haha, my actual worry is that my code is too ancient compared to the hardware, as all of my machines have NVMe disks, which as I understand it is some sort of PCIe protocol (a.k.a. too advanced for me right now). So all the ATA code would be for virtual machines only. But gotta start from somewhere, right?

Octocontrabass wrote:
> I think so. It's a lot of extra work for the device layer, and you're not going to use it at all once the cache layer works.

I didn't quite get where the extra complexity comes from. Even if all accesses are aligned, the Device layer would still have to be able to read/write across blocks/sectors, right? The caller can issue a request such as "read 8MB for me". So unaligned access is just inefficient, as it might have to read extra data and throw it away, but it isn't much more complex than supporting a big access.

Also, I'm trying to keep the two configurations (with or without cache) at parity from the interface's point of view, so that issues can be quickly isolated with a #if or a runtime setting change. So I'm hesitant to let the upper layers know whether there's a cache (any access is fine) or no cache (only aligned access is allowed).

Octocontrabass wrote:
> Yes, but don't forget about GPT.

I thought GPT means UEFI, which means x64? But I'm planning to stick with x86 until I have a meaningful load that forces me to run out of either virtual or physical space.
Re: please help comment on my plan for disk storage
xeyes wrote:
> I didn't quite get where the extra complexity comes from. Even if all accesses are aligned, the Device layer would still have to be able to read/write across blocks/sectors, right? The caller can issue a request such as "read 8MB for me". So unaligned access is just inefficient, as it might have to read extra data and throw it away, but it isn't much more complex than supporting a big access.

Unaligned reads aren't too difficult, but what about unaligned writes?

xeyes wrote:
> I thought GPT means UEFI, which means x64?

Nope! It's just a partition table, you can use it on any hard disk.
Re: please help comment on my plan for disk storage
Octocontrabass wrote:
> Unaligned reads aren't too difficult, but what about unaligned writes?

Isn't it the same as how a CPU cache works? Like:
1. Read out the full sector(s) around the unaligned write
2. Modify the written parts (while keeping the remainder the same)
3. Write the whole sector(s) back?
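i.e. at the device layer, roughly this (a sketch; disk_read_sectors/disk_write_sectors are placeholders for whatever sector-granular primitives end up existing):

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Placeholder sector-granular primitives assumed to exist somewhere
 * in the device layer. */
int disk_read_sectors(int disk, uint64_t lba, void *buf, uint32_t count);
int disk_write_sectors(int disk, uint64_t lba, const void *buf, uint32_t count);

/* Unaligned write confined to a single sector, done as read-modify-write. */
static int write_unaligned(int disk, uint64_t byte_off,
                           const void *src, uint32_t len)
{
    uint8_t  sector[SECTOR_SIZE];
    uint64_t lba    = byte_off / SECTOR_SIZE;
    uint32_t offset = (uint32_t)(byte_off % SECTOR_SIZE);

    if (offset + len > SECTOR_SIZE)
        return -1;                                    /* keep the sketch simple */

    if (disk_read_sectors(disk, lba, sector, 1) != 0) /* 1. read the sector     */
        return -1;
    memcpy(sector + offset, src, len);                /* 2. modify the parts    */
    return disk_write_sectors(disk, lba, sector, 1);  /* 3. write it back       */
}
```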
Octocontrabass wrote:
> Nope! It's just a partition table, you can use it on any hard disk.

Thanks for correcting me! GPT -> UEFI is apparently a Windows boot limitation, and with GRUB I guess it isn't an issue.
Re: please help comment on my plan for disk storage
xeyes wrote:
> Isn't it the same as how a CPU cache works? Like:
> 1. Read out the full sector(s) around the unaligned write
> 2. Modify the written parts (while keeping the remainder the same)
> 3. Write the whole sector(s) back?

Steps 1 and 3 are already reading/writing full sectors. As you'll have to read/write at block boundaries anyway, you might as well make the interface block based (blocks can be variable size, though). Filesystems are already defined in terms of blocks (or sectors). I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?

The main difference between the cache interface and the underlying block interface is that you don't necessarily want the blocks to transparently persist in memory. Say you're writing out the contents of a newly written file. The file contents are already cached in your filesystem cache, and your reads/writes to them can be via memory-mapped regions.

But when you want to persist that newly written data to disk, you actually want to pass in the physical buffer for the data to write, and as you already have it in your cache, you can just lock the page and point the buffer at the physical memory. Your buffer is transient, and can be discarded once the transfer is complete.
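In other words, something with roughly this shape (just an illustrative sketch, not a real API):

```c
#include <stdint.h>

/* A block-oriented device interface: the caller always transfers whole
 * blocks and supplies the (transient) buffer itself, so nothing below
 * this layer decides how long data stays in memory. */
struct blockdev {
    uint32_t block_size;    /* e.g. 512 or 4096 bytes */
    uint64_t block_count;
    int (*read_blocks)(struct blockdev *dev, uint64_t first_block,
                       void *buf, uint32_t count);
    int (*write_blocks)(struct blockdev *dev, uint64_t first_block,
                        const void *buf, uint32_t count);
};
```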
Re: please help comment on my plan for disk storage
thewrongchristian wrote:
> I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?

Some filesystems have what they call "fragments", but it's something of an advanced feature and I haven't heard of it for years. I think it probably was most useful on systems with a large block size and many small files. It may be somewhat obsolete as wider addresses allow smaller blocks to be used efficiently.
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
Re: please help comment on my plan for disk storage
eekee wrote:
> Some filesystems have what they call "fragments", but it's something of an advanced feature and I haven't heard of it for years. I think it probably was most useful on systems with a large block size and many small files.

If you're thinking of FFS (aka UFS), fragments are still multiples of the underlying sector size, generally 1024 or 2048 bytes (assuming 512-byte sectors) with a block size of 8K. FFS uses fragments as the smallest unit of allocation. By preference, BSD allocated entire blocks (8K) where appropriate, and reserved fragments for small files or the tail end of larger files.
Some filesystems get round the external fragmentation by storing file tails in the inode itself. I think ReiserFS does this, for example.
Re: please help comment on my plan for disk storage
thewrongchristian wrote:
> If you're thinking of FFS (aka UFS), fragments are still multiples of the underlying sector size, generally 1024 or 2048 bytes (assuming 512-byte sectors) with a block size of 8K. FFS uses fragments as the smallest unit of allocation. By preference, BSD allocated entire blocks (8K) where appropriate, and reserved fragments for small files or the tail end of larger files.

That makes sense, thanks for detailing it. It improves space-efficiency without going as far as read-modify-write on disk sectors.

thewrongchristian wrote:
> Some filesystems get round the external fragmentation by storing file tails in the inode itself. I think ReiserFS does this, for example.

I think you're right. I nearly mentioned ReiserFS because it was very fast with small files, but couldn't quite remember what it did. Around '08, I was using Source Mage Linux -- its repository was all tiny shell scripts. Tasks on the whole repo were visibly very much faster with ReiserFS than ext3. (If you're thinking otherwise, Debian introduced a patch which spoiled it because some people demanded a 1-partition install: no separate partition for /boot. In my subjective experience, this patch removed almost all of ReiserFS's speed advantage over ext3.)
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
Re: please help comment on my plan for disk storage
thewrongchristian wrote:
> Steps 1 and 3 are already reading/writing full sectors. As you'll have to read/write at block boundaries anyway, you might as well make the interface block based (blocks can be variable size, though). Filesystems are already defined in terms of blocks (or sectors). I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?

Thanks. Actually my implementation of EXT2 does from time to time peek at small areas of the disk, such as metadata, and later might start to poke small areas to update metadata or even as a way to implement user writes. "Small" here means less than the block or even sector size.

I mentioned a few posts above that this won't be efficient until the cache is inline, but it should still function correctly (i.e. it can read/write small chunks of the disk). Once the cache is inline, its inefficiency would hopefully be largely hidden, or I might have to start thinking about a better cache or specialized caches for metadata.

By "I don't know of a filesystem which doesn't already read/write complete blocks", do you mean that all the "real" implementations of EXT2 restrict themselves to 4KB (or whatever the block size) reads and writes, even if they just want to read a single dentry or write maybe a single line of text?

Would also be glad to hear your insights on a more efficient way to do non-block/sector-sized write access that doesn't involve the "read modify write" steps outlined above, though.
Re: please help comment on my plan for disk storage
xeyes wrote:
> By "I don't know of a filesystem which doesn't already read/write complete blocks", do you mean that all the "real" implementations of EXT2 restrict themselves to 4KB (or whatever the block size) reads and writes, even if they just want to read a single dentry or write maybe a single line of text?

I've only looked at two implementations of ext2, at a cursory level, and they both read only at block level.

xeyes wrote:
> Would also be glad to hear your insights on a more efficient way to do non-block/sector-sized write access that doesn't involve the "read modify write" steps outlined above, though.

Well, there isn't a more efficient way. You have to read blocks, at least for the time being, as most mass storage devices are or look like block devices.
Of course, there's no technical harm in hiding all this from the block device consumer (like your filesystem) and mapping the device to memory using the same mechanism as your cache (presumably demand paging based.) It also makes loopback devices a cinch, as they can just use the provided image file as is, without an intermediate layer.
But having a different API helps delineate device IO and will make you think about how and when you read/write data.
It's an issue I've grappled with, but in the end, I went with a separate block level interface basically because the lifecycle of the block can be more easily and explicitly controlled.
Re: please help comment on my plan for disk storage
thewrongchristian wrote:
> I've only looked at two implementations of ext2, at a cursory level, and they both read only at block level.
>
> Well, there isn't a more efficient way. You have to read blocks, at least for the time being, as most mass storage devices are or look like block devices.

Thanks for the context. The write part of EXT2 has quite a bit more detail than I thought (it didn't help that I assumed rm -r was a file system feature), and it took until now to get the basic functions (folders, files and soft links) working somewhat stably in single-threaded tests.
But through the process I do see the point of larger-sized I/O better now. Otherwise, certain operations like inode I/O probably wouldn't play well with the generic cache I had in mind (which isn't mapping based, but more like large chunks of the disk kept in memory, with memcpy on demand), as any inode I/O would cause the cache to read/write a lot of unrelated data.
Probably going to need special caches for dentries (thinking about hashing the path for lookup), inodes, and the group bitmaps (that one is obviously page sized). Still not too sure about inode cache granularity though, as one page can hold 32 inodes and it doesn't seem likely that more than a few of the 32 would actually be needed.
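For the dentry cache I'm picturing something hash based, roughly like this (a sketch; the hash choice and bucket count are arbitrary):

```c
#include <stdint.h>

/* With default ext2 parameters (4KB blocks, 128-byte inodes) one page
 * holds 4096 / 128 = 32 inodes, so caching whole inode pages mostly
 * pulls in unrelated inodes. A path-hash keyed dentry cache entry
 * might look like this: */
struct dentry_cache_entry {
    uint64_t path_hash;    /* e.g. FNV-1a of the absolute path   */
    uint32_t inode_no;     /* resolved inode number              */
    uint32_t refcount;
    struct dentry_cache_entry *next;  /* chain within the bucket */
};

#define DENTRY_BUCKETS 1024
static struct dentry_cache_entry *dentry_buckets[DENTRY_BUCKETS];

static uint64_t path_hash_fnv1a(const char *path)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    for (; *path; path++) {
        h ^= (uint8_t)*path;
        h *= 1099511628211ULL;              /* FNV-1a prime        */
    }
    return h;
}
```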