please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 5:34 pm
by xeyes
I've been using an EXT2 disk image loaded by grub (as a boot module). It works well and allowed me to load programs built against a minimal C layer (printf, etc) to test out various syscalls and features of the kernel.
Now I'm thinking of moving towards a real libc (looking at newlib) and feel that more persistent storage is probably a prerequisite for that, especially since the longer-term plan is to port GCC.
So below is my high level plan for supporting disk storage. Hope to get some comments from you experts so I don't make "costly to fix later" decisions early on.
Device layer:
PIO ATA for now (blocks any caller)
DMA support later (yields any caller and resumes the caller in ISR, maybe also remap interrupt to the CPU the caller is on, probably also needs some request queuing)
Interface would be "R/W n bytes from byte offset m from/to disk x"
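Roughly, the device layer interface I have in mind would look something like this (just a sketch; names and types are placeholders):
Code:
/* Device layer: byte-granular R/W against a whole disk.
 * Returns 0 on success, negative on error. Names are placeholders. */
#include <stddef.h>
#include <stdint.h>

int disk_read(int disk, uint64_t byte_off, void *buf, size_t nbytes);
int disk_write(int disk, uint64_t byte_off, const void *buf, size_t nbytes);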
Cache layer:
Indexed by logical (byte) offset into the disk; an AVL tree for lookup and a linked list for true LRU replacement
Any cache miss goes down to the Device layer; the minimum granularity is one "cache line" of 64KB, and the write policy is write-back
The AVL and LL structures will be pre-allocated at boot (mount?); the "cache lines" will come from the main physical heap at runtime
High pressure on the main physical heap will force this cache to "write back and invalidate" some lines based on LRU
At shutdown (umount?) the whole cache will be written back. Might also do periodic write-back without invalidation?
Interface would be the same as Device layer (so the cache can be bypassed at first/when debugging)
This layer also needs to handle page-table mapping of the "cache lines" so that it can copy between the buffer given by the caller and the buffer representing disk content. If the cache is bypassed, the device layer needs to do this, or there needs to be a 'mapper' layer that sits on top of either the cache or the device layer.
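A rough sketch of the bookkeeping I'm picturing for this layer (field names made up):
Code:
#include <stdint.h>

#define CACHE_LINE_SIZE (64 * 1024)   /* 64KB per "cache line" */

/* One cached 64KB chunk of the disk, keyed by its aligned byte offset. */
struct cache_line {
    uint64_t           disk_off;    /* byte offset, CACHE_LINE_SIZE aligned */
    void              *data;        /* 64KB buffer from the physical heap   */
    int                dirty;       /* needs write-back before eviction     */
    struct cache_line *avl_left;    /* AVL node, keyed by disk_off          */
    struct cache_line *avl_right;
    int                avl_height;
    struct cache_line *lru_prev;    /* doubly linked LRU list: head = most  */
    struct cache_line *lru_next;    /* recently used, tail = evict first    */
};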
File system layer:
The plan is to expand the EXT2 code to support writes.
Probably some naive way to linearly allocate space without regard to fragmentation for now.
The interface is inode-based right now but the plan is to switch to a purely filename-based one, like "R/W n bytes from byte offset m from/to file named X", so that a caller writing a file doesn't have to decide which inodes to write to.
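In other words, something along these lines (again just a sketch):
Code:
#include <stddef.h>
#include <stdint.h>

/* File system layer: name-based interface, callers never see inode numbers.
 * Returns bytes transferred, or negative on error. */
long fs_read(const char *path, uint64_t byte_off, void *buf, size_t nbytes);
long fs_write(const char *path, uint64_t byte_off, const void *buf, size_t nbytes);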
Host:
Plan on making the file system layer also runnable on a host so it (esp. the write parts) can be easily unit tested.
It can also be used as the core of a tool to package/verify/extract the disk image, and maybe even be able to 'de-fragment' it.
Future work:
Support more interesting devices like USB mass storage (another implementation of the Device layer interface).
Some sort of overlay support (like a live CD) that sits between the file system layer and the cache layer to filter writes into a separate ramdisk, so it can safely run on real machines without too much fear of damaging stuff
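For the overlay idea, I'm picturing something like this (very rough sketch; all helpers are made-up names, and overlap/granularity handling is omitted):
Code:
#include <stddef.h>
#include <stdint.h>

int  disk_read_lower(uint64_t off, void *buf, size_t n);    /* backing image   */
int  ramdisk_read(uint64_t off, void *buf, size_t n);
int  ramdisk_write(uint64_t off, const void *buf, size_t n);
int  overlay_has(uint64_t off, size_t n);    /* was this range written before? */
void overlay_mark(uint64_t off, size_t n);   /* remember a written range       */

/* Writes never touch the backing disk; reads come from the ramdisk only
 * for ranges that were written earlier. */
int overlay_read(uint64_t off, void *buf, size_t n)
{
    if (overlay_has(off, n))
        return ramdisk_read(off, buf, n);
    return disk_read_lower(off, buf, n);
}

int overlay_write(uint64_t off, const void *buf, size_t n)
{
    overlay_mark(off, n);
    return ramdisk_write(off, buf, n);
}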
Re: please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 7:22 pm
by Octocontrabass
xeyes wrote:PIO ATA for now (blocks any caller)
DMA support later (yields any caller and resumes the caller in ISR, maybe also remap interrupt to the CPU the caller is on, probably also needs some request queuing)
It sounds like either way you block the caller. Is your PIO driver polling instead of using IRQs?
Modern storage hardware can queue multiple requests (but not when you're using IDE compatibility).
xeyes wrote:Interface would be "R/W n bytes from byte offset m from/to disk x"
Are you allowing accesses that aren't aligned to block boundaries?
Don't forget a partition layer.
Re: please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 7:43 pm
by xeyes
Thanks!
Octocontrabass wrote:
xeyes wrote:PIO ATA for now (blocks any caller)
DMA support later (yields any caller and resumes the caller in ISR, maybe also remap interrupt to the CPU the caller is on, probably also needs some request queuing)
It sounds like either way you block the caller. Is your PIO driver polling instead of using IRQs?
Modern storage hardware can queue multiple requests (but not when you're using IDE compatibility).
Yes, the caller has to wait for the disk either way; with DMA the CPU is not blocked and can run something else from its run queue.
I don't have the PIO driver yet, but the plan is to poll, as a shorthand to get things going before normal operation (DMA mode) comes online. Is it valuable to support interrupts for PIO instead? As I understand it, the CPU needs to loop over port I/O anyway?
Will look into HW queues, I was thinking of a software queue where each ISR would try to set up the next request and point the interrupt to its originating CPU (so it will take over and do the same thing when the next interrupt happens and the chain goes on until the queue is used up)
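Roughly what I mean by chaining through the ISR (just a sketch; all the helper names here are hypothetical):
Code:
#include <stddef.h>

struct disk_req {
    struct disk_req *next;
    int              cpu;      /* CPU the requesting thread runs on */
    void            *waiter;   /* thread to wake on completion      */
    /* ... LBA, buffer, length, direction ... */
};

void wake_thread(void *waiter);        /* scheduler hook (placeholder)         */
void route_irq_to_cpu(int cpu);        /* interrupt re-routing (placeholder)   */
void start_dma(struct disk_req *req);  /* program the controller (placeholder) */

static struct disk_req *req_head;      /* guarded by a spinlock in practice    */

void disk_isr(void)
{
    struct disk_req *done = req_head;  /* request that just completed */
    req_head = done->next;

    wake_thread(done->waiter);         /* resume the blocked caller */

    if (req_head != NULL) {
        route_irq_to_cpu(req_head->cpu);  /* next interrupt lands on its CPU */
        start_dma(req_head);
    }
}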
Octocontrabass wrote:
xeyes wrote:Interface would be "R/W n bytes from byte offset m from/to disk x"
Are you allowing accesses that aren't aligned to block boundaries?
Don't forget a partition layer.
Probably. In normal operation (with the cache inline) the cache will "absorb" any alignment issues thanks to its 64KB line size, and I'll put in something to read across blocks as needed for the caller when running without the cache. A bad idea to allow unaligned access?
Good point about partitions, I totally ignored that :( I guess it's still logical enough that it should be a layer right below the file system, and I should probably start with the good old MBR?
Re: please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 8:53 pm
by Octocontrabass
xeyes wrote:I don't have the PIO driver yet, but the plan is to poll, as a shorthand to get things going before normal operation (DMA mode) comes online. Is it valuable to support interrupts for PIO instead? As I understand it, the CPU needs to loop over port I/O anyway?
The CPU needs to wait for the disk to be ready before it can loop over port I/O. If you use interrupts, the CPU can do something useful while waiting for the disk.
Whether or not that's valuable depends on how many PIO commands you'll be using once you have DMA working. (Some ancient hardware doesn't support DMA.)
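For a polling PIO transfer, the wait is roughly this (sketch only; status bits per the ATA spec, inb() is whatever port-read primitive you have, error/timeout handling omitted):
Code:
#include <stdint.h>

#define ATA_SR_BSY 0x80   /* drive is busy                */
#define ATA_SR_DRQ 0x08   /* data ready to be transferred */

uint8_t inb(uint16_t port);   /* your port-read primitive (placeholder) */

/* The CPU spins here doing nothing useful until the drive is ready;
 * with IRQs this wait can happen off-CPU instead. */
static void ata_wait_drq(uint16_t status_port)
{
    uint8_t status;
    do {
        status = inb(status_port);
    } while ((status & ATA_SR_BSY) || !(status & ATA_SR_DRQ));
}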
xeyes wrote:A bad idea to allow unaligned access?
I think so. It's a lot of extra work for the device layer, and you're not going to use it at all once the cache layer works.
xeyes wrote:I guess it's still logical enough that it should be a layer right below the file system, and I should probably start with the good old MBR?
Yes, but don't forget about GPT.
Re: please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 9:55 pm
by xeyes
Thanks again!
Octocontrabass wrote:
xeyes wrote:I don't have the PIO driver yet, but the plan is to poll, as a shorthand to get things going before normal operation (DMA mode) comes online. Is it valuable to support interrupts for PIO instead? As I understand it, the CPU needs to loop over port I/O anyway?
The CPU needs to wait for the disk to be ready before it can loop over port I/O. If you use interrupts, the CPU can do something useful while waiting for the disk.
Whether or not that's valuable depends on how many PIO commands you'll be using once you have DMA working. (Some ancient hardware doesn't support DMA.)
Haha, my actual worry is that my code is too ancient compared to the hardware, as all of my machines have NVMe disks, which as I understand it is some sort of PCIe protocol (a.k.a. too advanced for me for now). So all the ATA code would be for virtual machines only. But gotta start somewhere, right?
I'll perhaps give interrupt-driven PIO a shot. It seems like the control flow is the same as with DMA (PIO does the copy in a loop after seeing the interrupt, while DMA gets notified that the copy is done or failed after seeing the interrupt), so unless it causes lots of debugging trouble it shouldn't take that long to implement.
Octocontrabass wrote:
xeyes wrote:A bad idea to allow unaligned access?
I think so. It's a lot of extra work for the device layer, and you're not going to use it at all once the cache layer works.
Didn't quite get where the extra complexity comes from. Even if all accesses are aligned, the Device layer would still have to be able to read/write across blocks/sectors, right? The caller can issue a request such as "read 8MB for me". So unaligned access is just inefficient, as it might have to read extra data and throw it away, but not much more complex than supporting a big access?
Also, I'm trying to keep the two paths (with or without the cache) at parity from the interface's point of view so that issues can be quickly isolated with a #if or a runtime setting change. Because of that, I'm hesitant to let the upper layers know whether there's a cache (so they can issue any access) or no cache (so they must do aligned accesses only).
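Something like this is what I mean by keeping parity (sketch; function names are placeholders):
Code:
#include <stddef.h>
#include <stdint.h>

int cache_read(int disk, uint64_t off, void *buf, size_t n);   /* cache layer  */
int disk_read(int disk, uint64_t off, void *buf, size_t n);    /* device layer */

static int g_use_cache = 1;   /* flip to 0 to debug the device path directly */

/* Upper layers always call this; they never know whether the cache is inline. */
int block_read(int disk, uint64_t off, void *buf, size_t n)
{
    return g_use_cache ? cache_read(disk, off, buf, n)
                       : disk_read(disk, off, buf, n);
}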
Octocontrabass wrote:
xeyes wrote:I guess it's still logical enough that it should be a layer right below the file system, and I should probably start with the good old MBR?
Yes, but don't forget about GPT.
I thought GPT means UEFI which means x64? But I'm planning to stick with x86 until I have a meaningful workload that forces me to run out of either virtual or physical address space.
Re: please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 10:22 pm
by Octocontrabass
xeyes wrote:Didn't quite get where the extra complexity comes from. Even if all accesses are aligned, the Device layer would still have to be able to read/write across blocks/sectors, right? The caller can issue a request such as "read 8MB for me". So unaligned access is just inefficient, as it might have to read extra data and throw it away, but not much more complex than supporting a big access?
Unaligned reads aren't too difficult, but what about unaligned writes?
xeyes wrote:I thought GPT means UEFI which means x64?
Nope! It's just a partition table, you can use it on any hard disk.
Re: please help comment on my plan for disk storage
Posted: Tue Jan 05, 2021 11:28 pm
by xeyes
Octocontrabass wrote:
xeyes wrote:Didn't quite get where the extra complexity comes from. Even if all accesses are aligned, the Device layer would still have to be able to read/write across blocks/sectors, right? The caller can issue a request such as "read 8MB for me". So unaligned access is just inefficient, as it might have to read extra data and throw it away, but not much more complex than supporting a big access?
Unaligned reads aren't too difficult, but what about unaligned writes?
Isn't it the same as how a CPU cache works? Like:
1. Read out the full sector(s) around the unaligned write
2. Modify the written parts (while keeping the remainder the same)
3. Write the whole sector(s) back?
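In code it would be roughly this (sketch; SECTOR_SIZE and the sector_read/sector_write helpers are placeholders, and it assumes n > 0):
Code:
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

int sector_read(int disk, uint64_t lba, void *buf, size_t nsectors);
int sector_write(int disk, uint64_t lba, const void *buf, size_t nsectors);

int write_unaligned(int disk, uint64_t byte_off, const void *src, size_t n)
{
    uint64_t first = byte_off / SECTOR_SIZE;
    uint64_t last  = (byte_off + n - 1) / SECTOR_SIZE;
    size_t   span  = (size_t)(last - first + 1) * SECTOR_SIZE;
    uint8_t  tmp[span];                                       /* VLA, sketch only */

    if (sector_read(disk, first, tmp, last - first + 1))      /* 1. read          */
        return -1;
    memcpy(tmp + (byte_off - first * SECTOR_SIZE), src, n);   /* 2. modify        */
    return sector_write(disk, first, tmp, last - first + 1);  /* 3. write back    */
}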
Octocontrabass wrote:
xeyes wrote:I thought GPT means UEFI which means x64?
Nope! It's just a partition table, you can use it on any hard disk.
Thanks for correcting me! GPT -> UEFI is apparently a Windows boot limitation, and with GRUB I guess it isn't an issue.
Re: please help comment on my plan for disk storage
Posted: Wed Jan 06, 2021 3:46 pm
by thewrongchristian
xeyes wrote:Octocontrabass wrote:
Unaligned reads aren't too difficult, but what about unaligned writes?
Isn't it the same as how a CPU cache works? Like:
1. Read out the full sector(s) around the unaligned write
2. Modify the written parts (while keeping the remainder the same)
3. Write the whole sector(s) back?
Steps 1 and 3 are already reading/writing full sectors. As you'll have to read/write at block boundaries anyway, you might as well make the interface block based (blocks can be variable size though.) Filesystems are already defined in terms of blocks (or sectors). I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?
The main difference between the cache interface and the underlying block interface is you don't necessarily want the blocks to transparently persist in memory. Say you're writing out the contents of a newly written file. The file contents are already cached in your filesystem cache, and your reads/write to them can be via memory mapped regions.
But when you want to persist that newly written data to disk, you actually want to pass in the physical buffer for the data to write, and as you already have it in your cache, you can just lock the page and point the buffer at the physical memory. Your buffer is transient, and can be discarded once the transfer is complete.
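In interface terms, something like this (sketch; names are arbitrary):
Code:
#include <stddef.h>
#include <stdint.h>

struct block_dev;   /* opaque device handle (placeholder) */

/* Block-based interface: the consumer (filesystem/cache) supplies the buffer,
 * e.g. a locked page from its own cache, and this layer just transfers whole
 * blocks into/out of it. The buffer is transient as far as this layer cares. */
int blk_read(struct block_dev *dev, uint64_t block_no, void *buf, size_t nblocks);
int blk_write(struct block_dev *dev, uint64_t block_no, const void *buf, size_t nblocks);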
Re: please help comment on my plan for disk storage
Posted: Wed Jan 06, 2021 6:39 pm
by eekee
thewrongchristian wrote:Steps 1 and 3 are already reading/writing full sectors. As you'll have to read/write at block boundaries anyway, you might as well make the interface block based (blocks can be variable size though.) Filesystems are already defined in terms of blocks (or sectors). I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?
Some filesystems have what they call "fragments", but it's something of an advanced feature and I haven't heard of it for years. I think it probably was most useful on systems with a large block size and many small files. It may be somewhat obsolete as wider addresses allow smaller blocks to be used efficiently.
Re: please help comment on my plan for disk storage
Posted: Wed Jan 06, 2021 7:06 pm
by thewrongchristian
eekee wrote:thewrongchristian wrote:Steps 1 and 3 are already reading/writing full sectors. As you'll have to read/write at block boundaries anyway, you might as well make the interface block based (blocks can be variable size though.) Filesystems are already defined in terms of blocks (or sectors). I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?
Some filesystems have what they call "fragments", but it's something of an advanced feature and I haven't heard of it for years. I think it probably was most useful on systems with a large block size and many small files. It may be somewhat obsolete as wider addresses allow smaller blocks to be used efficiently.
If you're thinking of FFS (aka UFS), fragments are still multiples of the underlying sector size, generally 1024 or 2048 bytes (assuming 512 byte sectors) with a block size of 8K. FFS uses fragments as the smallest unit of allocation. By preference, BSD allocated entire blocks (8K) where appropriate, and reserved fragments for small files or the tail end of larger files.
Some filesystems get round internal fragmentation by storing file tails in the inode itself. I think ReiserFS does this, for example.
Re: please help comment on my plan for disk storage
Posted: Thu Jan 07, 2021 7:36 am
by eekee
thewrongchristian wrote:If you're thinking of FFS (aka UFS), fragments are still multiples of the underlying sector size, generally 1024 or 2048 bytes (assuming 512 byte sectors) with a block size of 8K. FFS uses fragments as the smallest unit of allocation. By preference, BSD allocated entire blocks (8K) where appropriate, and reserved fragments for small files or the tail end of larger files.
That makes sense, thanks for detailing it. It improves space-efficiency without going as far as read-modify-write on disk sectors.
thewrongchristian wrote:Some filesystems get round the external fragmentation by storing file tails in the inode itself. I think ReiserFS does this, for example.
I think you're right. I nearly mentioned ReiserFS because it was very fast with small files, but couldn't quite remember what it did. Around '08, I was using Source Mage Linux -- its repository was all tiny shell scripts. Tasks on the whole repo were visibly very much faster with ReiserFS than ext3. (If you're thinking otherwise, Debian introduced a patch which spoiled it because some people demanded a 1-partition install: no separate partition for /boot. In my subjective experience, this patch removed almost all of ReiserFS's speed advantage over ext3.)
Re: please help comment on my plan for disk storage
Posted: Thu Jan 07, 2021 8:52 pm
by xeyes
thewrongchristian wrote:xeyes wrote:Octocontrabass wrote:
Unaligned reads aren't too difficult, but what about unaligned writes?
Isn't it the same as how a CPU cache works? Like:
1. Read out the full sector(s) around the unaligned write
2. Modify the written parts (while keeping the remainder the same)
3. Write the whole sector(s) back?
Steps 1 and 3 are already reading/writing full sectors. As you'll have to read/write at block boundaries anyway, you might as well make the interface block based (blocks can be variable size though.) Filesystems are already defined in terms of blocks (or sectors). I don't know of a filesystem which doesn't already read/write complete blocks, so why duplicate the effort in a layer underneath the filesystem?
The main difference between the cache interface and the underlying block interface is you don't necessarily want the blocks to transparently persist in memory. Say you're writing out the contents of a newly written file. The file contents are already cached in your filesystem cache, and your reads/write to them can be via memory mapped regions.
But when you want to persist that newly written data to disk, you actually want to pass in the physical buffer for the data to write, and as you already have it in your cache, you can just lock the page and point the buffer at the physical memory. Your buffer is transient, and can be discarded once the transfer is complete.
Thanks. Actually my implementation of EXT2 does from time to time peek at small areas of the disk, such as metadata, and later might start to poke small areas to update metadata or even as a way to implement user writes. "Small" here means less than block or even sector size.
I mentioned a few posts above that this won't be efficient until the cache is inline, but it should still function correctly (i.e. it can read/write small chunks of the disk). Once the cache is inline its inefficiency would hopefully be largely hidden, or I might have to start thinking about a better cache or specialized caches for metadata.
By "I don't know of a filesystem which doesn't already read/write complete blocks" you meant all the "real" implementations of EXT2 restrict themselves to only do 4KB (or whatever the block size) reads and writes even if they just want to read a single dentry or write maybe a single lined text?
Would also be glad to hear your insights on a more efficient method to do non block/sector sized write access that doesn't involve the "read modify write" steps outlined though.
Re: please help comment on my plan for disk storage
Posted: Fri Jan 08, 2021 7:30 pm
by thewrongchristian
xeyes wrote:
By "I don't know of a filesystem which doesn't already read/write complete blocks" you meant all the "real" implementations of EXT2 restrict themselves to only do 4KB (or whatever the block size) reads and writes even if they just want to read a single dentry or write maybe a single lined text?
I've only looked at two implementations of ext2, at a cursory level, and they both read only at block level.
xeyes wrote:
Would also be glad to hear your insights on a more efficient method to do non block/sector sized write access that doesn't involve the "read modify write" steps outlined though.
Well, there isn't a more efficient way. You have to read blocks, at least for the time being as most mass storage devices are or look like block devices.
Of course, there's no technical harm in hiding all this from the block device consumer (like your filesystem) and mapping the device to memory using the same mechanism as your cache (presumably demand paging based.) It also makes loopback devices a cinch, as they can just use the provided image file as is, without an intermediate layer.
But having a different API helps delineate device IO and will make you think about how and when you read/write data.
It's an issue I've grappled with, but in the end, I went with a separate block level interface basically because the lifecycle of the block can be more easily and explicitly controlled.
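Roughly the shape of what I mean by explicit lifecycle control (sketch, in the spirit of the classic Unix buffer cache; names are placeholders):
Code:
#include <stdint.h>

struct block_dev;   /* opaque device handle (placeholder) */

/* The consumer explicitly acquires a block, marks it dirty if it modified it,
 * and releases it; the cache decides when it is written back or evicted. */
struct buf {
    uint64_t  block_no;
    void     *data;      /* block-sized buffer */
    int       dirty;
    int       refcount;  /* pinned while > 0   */
};

struct buf *block_get(struct block_dev *dev, uint64_t block_no);  /* read + pin */
void        block_dirty(struct buf *b);  /* schedule for write-back             */
void        block_put(struct buf *b);    /* unpin; eviction is now allowed      */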
Re: please help comment on my plan for disk storage
Posted: Sat Jan 16, 2021 1:36 am
by xeyes
thewrongchristian wrote:xeyes wrote:
By "I don't know of a filesystem which doesn't already read/write complete blocks" you meant all the "real" implementations of EXT2 restrict themselves to only do 4KB (or whatever the block size) reads and writes even if they just want to read a single dentry or write maybe a single lined text?
I've only looked at two implementations of ext2, at a cursory level, and they both read only at block level.
xeyes wrote:
Would also be glad to hear your insights on a more efficient method to do non block/sector sized write access that doesn't involve the "read modify write" steps outlined though.
Well, there isn't a more efficient way. You have to read blocks, at least for the time being as most mass storage devices are or look like block devices.
Of course, there's no technical harm in hiding all this from the block device consumer (like your filesystem) and mapping the device to memory using the same mechanism as your cache (presumably demand paging based.) It also makes loopback devices a cinch, as they can just use the provided image file as is, without an intermediate layer.
But having a different API helps delineate device IO and will make you think about how and when you read/write data.
It's an issue I've grappled with, but in the end, I went with a separate block level interface basically because the lifecycle of the block can be more easily and explicitly controlled.
Thanks for the context. The write side of EXT2 has quite a bit more detail than I thought (it didn't help that I assumed rm -r was a file system feature), and it took until now to get the basic functions (folders, files, and soft links) working somewhat stably in single-threaded tests.
But through the process I do see the point of larger-sized I/O better now. Otherwise certain operations like inode I/O probably wouldn't play too well with the generic cache I had in mind (which isn't mapping-based but more like large chunks of the disk kept in memory, with memcpy on demand), as any inode I/O would cause the cache to read/write a lot of unrelated data.
I'm probably going to need special caches for dentries (thinking about hashing the path for lookup), inodes, and the group bitmaps (this one is obviously page-sized). I'm still not too sure about inode cache granularity though, as one page can hold 32 inodes and it doesn't seem likely that more than a few of the 32 would actually be needed.
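For the dentry cache, I'm thinking of something along these lines (sketch; the sizes, the hash choice, and the fixed-length path field are just placeholders):
Code:
#include <stdint.h>
#include <string.h>

#define DCACHE_BUCKETS 256

/* Path-hashed dentry cache: hash the full path, chain entries per bucket. */
struct dentry {
    struct dentry *next;        /* hash-chain link               */
    char           path[256];   /* full path, fixed size for now */
    uint32_t       inode_no;    /* what the lookup resolves to   */
};

static struct dentry *dcache[DCACHE_BUCKETS];

static uint32_t path_hash(const char *path)   /* FNV-1a, 32-bit */
{
    uint32_t h = 2166136261u;
    while (*path) {
        h ^= (uint8_t)*path++;
        h *= 16777619u;
    }
    return h;
}

struct dentry *dcache_lookup(const char *path)
{
    struct dentry *d = dcache[path_hash(path) % DCACHE_BUCKETS];
    while (d != NULL && strcmp(d->path, path) != 0)
        d = d->next;
    return d;   /* NULL on a miss */
}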