unacceptably slow filesystem access - possible causes?

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

unacceptably slow filesystem access - possible causes?

Post by mariuszp »

My OS currently has 2 different storage device drivers - for IDE and for AHCI. In Bochs, filesystem reads/writes are fast enough not to cause a problem, but for some reason in VirtualBox it is so slow that it is practically unusable. It takes about 20 minutes to copy a file of slightly over 700 KB from CD to hard drive, which works out to an average transfer speed of about 600 bytes per second. The same file takes about 1-2 seconds to copy in Bochs.

The abstraction works as follows: the filesystem driver performs operations on a filesystem image, which is essentially just a regular file. Physical drives (/dev/sda, /dev/sdb, ...) are presented as regular files with random access (64-bit addressing is enough for 16 EB of data, so that is not a problem). To abstract away the "blockiness" of the drive, when such a file is opened, the file descriptor buffers one sector. If you seek and then read/write, it loads the sector that contains the position you are accessing, modifies it in the buffer, and, when switching to a different sector or closing the drive-file, flushes the sector to disk (if it was modified).
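In rough C, that buffered drive-file might look like the sketch below. The backing "disk" is replaced by a plain byte array so the sketch is self-contained; all names (drivefile_t, df_pwrite, ...) are made up for illustration, not taken from any real driver.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* The "disk" is just a byte array here so the sketch can run anywhere;
   a real driver would issue sector commands to the device instead. */
typedef struct {
    uint8_t *disk;         /* simulated drive contents */
    uint64_t cur_sector;   /* sector currently held in buf */
    int      loaded;       /* is buf valid? */
    int      dirty;        /* was buf modified since load? */
    uint8_t  buf[SECTOR_SIZE];
} drivefile_t;

static void df_flush(drivefile_t *df) {
    if (df->loaded && df->dirty)
        memcpy(df->disk + df->cur_sector * SECTOR_SIZE, df->buf, SECTOR_SIZE);
    df->dirty = 0;
}

static void df_load(drivefile_t *df, uint64_t sector) {
    if (df->loaded && df->cur_sector == sector)
        return;                     /* already buffered */
    df_flush(df);                   /* write back the old sector if dirty */
    memcpy(df->buf, df->disk + sector * SECTOR_SIZE, SECTOR_SIZE);
    df->cur_sector = sector;
    df->loaded = 1;
}

/* Byte-granular write at an arbitrary offset, crossing sectors as needed. */
static void df_pwrite(drivefile_t *df, uint64_t pos, const void *src, size_t len) {
    const uint8_t *p = src;
    while (len > 0) {
        uint64_t sector = pos / SECTOR_SIZE;
        size_t   off    = pos % SECTOR_SIZE;
        size_t   chunk  = SECTOR_SIZE - off;
        if (chunk > len) chunk = len;
        df_load(df, sector);        /* read-modify-write through the buffer */
        memcpy(df->buf + off, p, chunk);
        df->dirty = 1;
        pos += chunk; p += chunk; len -= chunk;
    }
}
```

Note that a write which crosses a sector boundary causes a flush-and-reload per sector, which is exactly the pattern that gets expensive when each sector operation has high fixed cost.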

To implement sector reads/writes, there is a command queue in which the OS stores a direction, the sector data (or a buffer, if reading), and the sector number. A single thread started by the HDD driver then reads from that queue and executes the commands.
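A minimal version of such a command queue might look like this, with the device again stood in for by a byte array and the driver thread reduced to a single "service one command" function (all names are hypothetical):

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define QUEUE_LEN   32

enum cmd_dir { CMD_READ, CMD_WRITE };

typedef struct {
    enum cmd_dir dir;
    uint64_t     sector;
    uint8_t     *buf;       /* destination (read) or source (write) */
} disk_cmd_t;

typedef struct {
    disk_cmd_t cmds[QUEUE_LEN];
    unsigned   head, tail;  /* producer appends at tail, worker consumes at head */
} cmd_queue_t;

static int cq_push(cmd_queue_t *q, disk_cmd_t c) {
    unsigned next = (q->tail + 1) % QUEUE_LEN;
    if (next == q->head) return -1;      /* queue full */
    q->cmds[q->tail] = c;
    q->tail = next;
    return 0;
}

/* One iteration of the driver thread's loop: pop a command and execute it
   against `disk` (a plain array standing in for the real device). */
static int cq_service_one(cmd_queue_t *q, uint8_t *disk) {
    if (q->head == q->tail) return 0;    /* nothing queued */
    disk_cmd_t *c = &q->cmds[q->head];
    if (c->dir == CMD_READ)
        memcpy(c->buf, disk + c->sector * SECTOR_SIZE, SECTOR_SIZE);
    else
        memcpy(disk + c->sector * SECTOR_SIZE, c->buf, SECTOR_SIZE);
    q->head = (q->head + 1) % QUEUE_LEN;
    return 1;
}
```

In a real kernel the push and the service loop run on different threads, so the queue needs a lock or lock-free indices plus a way to sleep until a command arrives; that machinery is omitted here.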

I have yet to test on real hardware.

I am just wondering whether, in your judgement, this design is terribly flawed, and why it's a problem in VirtualBox but not in Bochs. I'm also looking for ideas on how to diagnose these performance issues, because I cannot find any way to optimise this.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: unacceptably slow filesystem access - possible causes?

Post by Brendan »

Hi,
mariuszp wrote:I am just wondering if this design is terribly flawed in your judgement, and why it's a problem in VirtualBox but not in Bochs. Also looking for ideas on how to diagnose those performance issues because I cannot find any way to optimise this.
First step is to determine where the performance problem comes from. Is it transferring data to/from disk, in the disk device driver, in higher levels, in the app itself? Maybe it's a problem with something that has nothing to do with disk IO at all ("Oops, local APIC timer is set to 2 MHz instead of 2 KHz on VirtualBox and it's doing 2 million task switches per second!").

My guess is that the majority of the problem is transferring data to/from disk. More specifically; my guess is that you're using PIO and hardware virtualisation.

The reason is that this is a "known bad" combination. What happens is you read a dword from an IO port, and this causes a "hypervisor trap" and an expensive switch from "guest state" to "hypervisor state", then the VM fetches a dword and shoves it in a guest register, then there's an expensive switch from "hypervisor state" back to "guest state". DMA/bus mastering with single sectors would reduce the overhead and probably make it about 100 times faster. DMA/bus mastering with larger reads/writes (e.g. 4 KiB) would be better still. Of course Bochs doesn't have the overhead of switching between guest and hypervisor states, so it doesn't suffer so much from PIO.

Note that I said "the majority of the problem". 1200 bytes per second (600 bytes per second reading from CD plus 600 bytes per second writing to disk) is still slower than PIO alone can explain; so (if my guess is right) I'd expect there's also something else adding to the problem.

Small "1 sector only" transfers and a lack of prefetching/caching would be my next guess; especially for CD, which is laid out as one big spiral: you've got a small amount of time between the end of one 2048-byte sector and the start of the next, and if you take too long you end up with an expensive seek operation.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

Re: unacceptably slow filesystem access - possible causes?

Post by mariuszp »

My AHCI driver uses DMA and faces exactly the same problems.

EDIT: It appears to be rather inefficient in accessing a physical drive (on real hardware) too; takes a few seconds to create a directory.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: unacceptably slow filesystem access - possible causes?

Post by Brendan »

Hi,
mariuszp wrote:My AHCI driver uses DMA and faces exactly the same problems.

EDIT: It appears to be rather inefficient in accessing a physical drive (on real hardware) too; takes a few seconds to create a directory.
Then, first step is still determining where the performance problem comes from.

For example; maybe you can get "current time" at various places within the code that creates a directory and see which piece/s take the longest amount of time; then do similar for those piece/s to determine which piece/s of them are taking the longest time; then... It shouldn't take that long to be able to say "It's spending most of its time doing .....".

Alternatively; you can add "statistic gathering" everywhere. For example, (if you're not doing these things already) you could add support for keeping track of "CPU time consumed by thread/task" (for each thread/task), "time spent idle" and "number of task switches" in the scheduler; add "number of bytes read" and "number of bytes written" to the disk driver; add "IRQs received" (for each IRQ) in the kernel; etc. From these statistics you should be able to get a better idea of what is happening (and not just for this problem - these sorts of statistics are useful for a wide variety of things, and are gathered by every OS for that reason).
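In C, such statistics gathering can be as simple as a struct of counters bumped from the scheduler, disk driver and IRQ handlers (layout and names invented here; a real kernel would update these atomically or per-CPU):

```c
#include <stdint.h>

/* Global counters, bumped from the scheduler, the disk driver and the
   IRQ dispatch path; dumped on demand to see where time is going. */
typedef struct {
    uint64_t task_switches;
    uint64_t idle_ticks;
    uint64_t bytes_read;
    uint64_t bytes_written;
    uint64_t irqs[16];      /* per-IRQ receive counts */
} kstats_t;

static kstats_t kstats;

static void stat_task_switch(void)       { kstats.task_switches++; }
static void stat_disk_read(uint64_t n)   { kstats.bytes_read += n; }
static void stat_disk_write(uint64_t n)  { kstats.bytes_written += n; }
static void stat_irq(int irq)            { if (irq >= 0 && irq < 16) kstats.irqs[irq]++; }
```

Even this crude version immediately answers questions like "am I really doing 2 million task switches per second?" or "how many bytes did the driver actually move while that copy took 20 minutes?".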


Cheers,

Brendan
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

Re: unacceptably slow filesystem access - possible causes?

Post by mariuszp »

OK, so I have set up a test where the storage device interface prints a message whenever it attempts to read/write a sector, along with how long the operation takes.

It performs a lot of reads, each taking approximately 60-65 ms (one read = 1 sector). It often reads consecutive sectors one by one, so perhaps it would be a good idea to make the SD file load multiple sectors into its buffer in case more are needed. There are also sectors it re-reads many times, such as the root directory and the fragment tables.

What is a typical way to implement disk caching? I'm thinking of having a block cache that would store for example 64 separate "blocks" (a block being a few sectors each), and it would be like a ring buffer to which every "block" read from the disk is pushed, with old ones being overwritten, and a block is pushed even if it was read from the ring itself (that way, blocks that are read often will stay in the ring). Is this a good method?
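The ring you describe could be sketched like this (C, with the sizes from your description; all names invented). One wrinkle worth noticing: re-pushing a block that is already in the ring leaves a stale duplicate in its old slot unless you also invalidate that slot, which is one reason most kernels use an LRU list rather than a plain ring:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  4096
#define RING_SLOTS  64

typedef struct {
    uint64_t block;                 /* block number, UINT64_MAX = empty slot */
    uint8_t  data[BLOCK_SIZE];
} ring_entry_t;

typedef struct {
    ring_entry_t slot[RING_SLOTS];
    unsigned     next;              /* slot the next push overwrites */
} block_ring_t;

static void ring_init(block_ring_t *r) {
    for (unsigned i = 0; i < RING_SLOTS; i++)
        r->slot[i].block = UINT64_MAX;
    r->next = 0;
}

/* Push a block; the oldest slot is overwritten (FIFO eviction). */
static void ring_push(block_ring_t *r, uint64_t block, const uint8_t *data) {
    r->slot[r->next].block = block;
    memcpy(r->slot[r->next].data, data, BLOCK_SIZE);
    r->next = (r->next + 1) % RING_SLOTS;
}

/* Linear scan; returns a pointer to the cached data, or NULL on a miss. */
static const uint8_t *ring_lookup(block_ring_t *r, uint64_t block) {
    for (unsigned i = 0; i < RING_SLOTS; i++)
        if (r->slot[i].block == block)
            return r->slot[i].data;
    return NULL;
}
```

The linear scan is fine for 64 slots, but if the cache grows (and it should, per the replies below about using all free RAM) a hash table keyed on block number becomes necessary.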
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: unacceptably slow filesystem access - possible causes?

Post by Brendan »

Hi,
mariuszp wrote:What is a typical way to implement disk caching? I'm thinking of having a block cache that would store for example 64 separate "blocks" (a block being a few sectors each), and it would be like a ring buffer to which every "block" read from the disk is pushed, with old ones being overwritten, and a block is pushed even if it was read from the ring itself (that way, blocks that are read often will stay in the ring). Is this a good method?
I think everyone does it differently.

My way is to cache file data in the VFS layer (so that if the data is cached already, processes can read files, get directory info, etc. without any of the overhead of file systems or storage device drivers being involved). For file system meta-data (e.g. the "cluster allocation table" in FAT) it's cached by the file system code (in whatever format the file system code prefers - e.g. FAT file system might convert "12-bit and 16-bit cluster allocation table" into 32-bit per entry to avoid duplicating code). There is no low level block cache (e.g. in storage device drivers) because there's no point in caching the same data in multiple places.

On top of this, you'd want pre-fetching somewhere (in the VFS). For example; if a process is sequentially reading a file (very common), it's easy to detect this and pre-fetch the file's data into the (VFS) file data cache before it's needed. Ideally (with DMA) it all happens in parallel - while the process is decoding/parsing "block N", data for "block N+1" is being pre-fetched/transferred from disk to RAM by hardware/DMA, and (hopefully) the process never has to wait for IO.
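Detecting a sequential reader really is cheap; a per-open-file hint along these lines would do (hypothetical helper, not from any real VFS):

```c
#include <stdint.h>

/* Kept per open file; tracks where a sequential reader would continue. */
typedef struct {
    uint64_t next_expected;   /* offset a sequential read would start at */
    int      streak;          /* consecutive sequential reads seen */
} read_hint_t;

/* Called on every read. Returns nonzero once the access pattern looks
   sequential, at which point the caller can queue a pre-fetch of the
   data just past off+len. */
static int should_prefetch(read_hint_t *h, uint64_t off, uint64_t len) {
    int seq = (off == h->next_expected);
    h->streak = seq ? h->streak + 1 : 0;
    h->next_expected = off + len;
    return h->streak >= 2;    /* two back-to-back sequential reads */
}
```

The threshold of two consecutive reads is arbitrary; the point is that the hint costs a comparison and two stores per read, while a correct guess hides an entire disk access behind the process's own work.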

Also note that there's no sane reason to limit the amount of RAM used by caches. You want to use as much RAM as possible (to improve performance as much as possible). Free RAM is RAM that isn't being used to improve performance, and you should do whatever you can to minimise the amount of free/wasted RAM. Instead you want to do "if a process needs RAM and there's none free, steal it back from file system cache and give it to the process".

Finally; because of pre-fetching and other things you want "IO priorities". Something that's urgent (e.g. reading from swap space) should be done before reading from a file that's needed now, which should be done before worrying about pre-fetching. Typically this ends up being some sort of IO scheduler (to schedule disk controller time and not CPU time) in the storage device driver/s, that's designed to compromise between honouring IO priority and maximising throughput (e.g. minimising seek times). Note that in some cases, some of the IO scheduling may be done by the hardware itself (e.g. NCQ).
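A toy version of such an IO scheduler, with three strict priority levels (a real one would also reorder within a level to minimise seeks, e.g. an elevator; all names invented):

```c
#include <stdint.h>
#include <stddef.h>

/* Urgent = swap-in, normal = demand read, prefetch = speculative. */
enum io_prio { IO_URGENT, IO_NORMAL, IO_PREFETCH, IO_NLEVELS };

typedef struct io_req {
    uint64_t       sector;
    struct io_req *next;
} io_req_t;

typedef struct {
    io_req_t *head[IO_NLEVELS], *tail[IO_NLEVELS];
} io_sched_t;

static void io_submit(io_sched_t *s, io_req_t *r, enum io_prio p) {
    r->next = NULL;
    if (s->tail[p]) s->tail[p]->next = r; else s->head[p] = r;
    s->tail[p] = r;
}

/* The driver thread calls this to pick its next request: always serve
   the highest non-empty priority level first. */
static io_req_t *io_next(io_sched_t *s) {
    for (int p = 0; p < IO_NLEVELS; p++) {
        if (s->head[p]) {
            io_req_t *r = s->head[p];
            s->head[p] = r->next;
            if (!s->head[p]) s->tail[p] = NULL;
            return r;
        }
    }
    return NULL;
}
```

Strict priority like this can starve prefetch requests under heavy load, which is usually acceptable (prefetches are optional by definition) but is worth knowing about.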


Cheers,

Brendan
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

Re: unacceptably slow filesystem access - possible causes?

Post by mariuszp »

How does your virtual memory manager manage this then? Do you just map all available physical memory pages to the disk cache and then just release them to processes when necessary?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: unacceptably slow filesystem access - possible causes?

Post by Brendan »

Hi,
mariuszp wrote:How does your virtual memory manager manage this then? Do you just map all available physical memory pages to the disk cache and then just release them to processes when necessary?
The VFS is mostly a process just like any other. It asks the kernel to make areas of its virtual address space "allocate on demand", just like any process would.

When the kernel is running out of free physical RAM, it sends a "free some RAM" message (including how critical the situation is) out to processes that have requested it. VFS responds by freeing "less important" cached file data. A file system might do the same for its meta-data cache. Web browser might respond by dumping cached web page resources. DNS system might do the same for its domain name cache. A word processor might have enough data to allow the user to undo the last 5000 things they've done, and might respond by reducing that so the user can only undo the last 3000 things they've done.

If that doesn't help enough, the kernel sends out more "free some RAM" messages saying the situation is more critical.

Swap space ties in with this. If the situation is only minor, then "extremely unlikely to be needed" stuff might be sent to swap space to free up RAM; and as the situation gets worse the OS starts sending "more likely to be needed" stuff to swap space. If the OS ever reaches "worst case" (no processes able to give up some RAM) it's relying on swap space alone.
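The "free some RAM" broadcast can be sketched as a list of reclaim callbacks with escalating severity. Here it's reduced to synchronous function calls instead of IPC messages, purely so the idea fits in one self-contained block; all names are invented:

```c
#include <stddef.h>

enum mem_pressure { MEM_LOW, MEM_CRITICAL };

/* Each subsystem (VFS file cache, FS metadata cache, ...) registers a
   callback that frees what it can and reports how many bytes it released. */
typedef size_t (*reclaim_fn)(enum mem_pressure, void *ctx);

#define MAX_RECLAIMERS 8

typedef struct {
    reclaim_fn fn[MAX_RECLAIMERS];
    void      *ctx[MAX_RECLAIMERS];
    int        count;
} reclaim_list_t;

static int reclaim_register(reclaim_list_t *l, reclaim_fn fn, void *ctx) {
    if (l->count >= MAX_RECLAIMERS) return -1;
    l->fn[l->count] = fn;
    l->ctx[l->count] = ctx;
    l->count++;
    return 0;
}

/* Ask everyone to free RAM; escalate to MEM_CRITICAL only if the first,
   gentler pass didn't release enough. Returns total bytes freed. */
static size_t reclaim_memory(reclaim_list_t *l, size_t wanted) {
    size_t freed = 0;
    for (int i = 0; i < l->count && freed < wanted; i++)
        freed += l->fn[i](MEM_LOW, l->ctx[i]);
    for (int i = 0; i < l->count && freed < wanted; i++)
        freed += l->fn[i](MEM_CRITICAL, l->ctx[i]);
    return freed;
}

/* Example reclaimer: pretends to drop 100 bytes of cache under mild
   pressure and 400 more when things get critical. */
static size_t demo_reclaim(enum mem_pressure p, void *ctx) {
    (void)ctx;
    return p == MEM_LOW ? 100 : 400;
}
```

In the message-passing version described above, the two passes become two rounds of messages and the kernel waits for (or times out on) the replies before escalating.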


Cheers,

Brendan
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: unacceptably slow filesystem access - possible causes?

Post by gerryg400 »

I've been experimenting with a slightly different idea, where FS processes create and manage their page caches in kernel space using system calls. Basically, an FS process calls CreateMemObject to create a cache area that will be used as the page cache for a file. When process X calls mmap on that file, the FS process calls MmapMemObject to connect the cache area to the address space of process X. After that, the mm in the kernel can handle page faults quite easily: if a page is missing from the cache of the file, the mm sends a message to the FS to fill that page (a system call is provided for that). There are also system calls that the FS can use to send part of the cache to processes that do a POSIX read().

The advantage of this method is that the mm in the kernel knows which pages are being used and how often, and can reclaim pages easily without needing to consult the various FS processes.

I've only implemented some parts of this so far so the idea is not fully developed.

We've drifted off topic a little but it's good to see there are other people writing microkernels.
If a trainstation is where trains stop, what is a workstation ?
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

Re: unacceptably slow filesystem access - possible causes?

Post by mariuszp »

Well, my kernel is monolithic and the various layers are actually loaded into the kernel, so I'll have to approach it slightly differently. Perhaps expanding the block cache ring during high I/O traffic?
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: unacceptably slow filesystem access - possible causes?

Post by gerryg400 »

mariuszp wrote:Well, my kernel is monolithic and the various layers are actually loaded into the kernel, so I'll have to approach it slightly differently. Perhaps expanding the block cache ring during high I/O traffic?
Oops. Sorry, for some reason I thought you were working on a microkernel.
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

Re: unacceptably slow filesystem access - possible causes?

Post by mariuszp »

gerryg400 wrote:
mariuszp wrote:Well, my kernel is monolithic and the various layers are actually loaded into the kernel, so I'll have to approach it slightly differently. Perhaps expanding the block cache ring during high I/O traffic?
Ooops. Sorry, for some reason thought you were working on a microkernel.
Well, I did work on a microkernel, but that was a few years ago; it was absolutely terrible because I really didn't have the experience, so it has been abandoned. You might have seen me post questions about it.
onlyonemac
Member
Posts: 1146
Joined: Sat Mar 01, 2014 2:59 pm

Re: unacceptably slow filesystem access - possible causes?

Post by onlyonemac »

I think you need to buffer more than one sector. Disks work best when you read a long stretch of sectors in one go, rather than reading sectors one-by-one. But you don't want to buffer just a contiguous block of sectors; buffer, say, the 16 most recently used sectors. Then when you're e.g. reading a directory, the system will be a lot faster at walking the tree because it won't have to re-read sectors it has already read.
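A 16-entry most-recently-used sector cache along those lines might look like the sketch below (LRU eviction via a use-stamp; all names invented). Unlike the plain ring discussed earlier, a hit refreshes the entry in place, so there are never duplicate copies of a sector:

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define CACHE_WAYS  16

typedef struct {
    uint64_t sector;     /* UINT64_MAX = empty slot */
    uint64_t stamp;      /* last-use tick, for LRU eviction */
    uint8_t  data[SECTOR_SIZE];
} sec_ent_t;

typedef struct {
    sec_ent_t ent[CACHE_WAYS];
    uint64_t  tick;
} sec_cache_t;

static void sc_init(sec_cache_t *c) {
    memset(c, 0, sizeof *c);
    for (int i = 0; i < CACHE_WAYS; i++)
        c->ent[i].sector = UINT64_MAX;
}

static uint8_t *sc_lookup(sec_cache_t *c, uint64_t sector) {
    for (int i = 0; i < CACHE_WAYS; i++)
        if (c->ent[i].sector == sector) {
            c->ent[i].stamp = ++c->tick;   /* refresh on hit: stays "recent" */
            return c->ent[i].data;
        }
    return NULL;
}

static void sc_insert(sec_cache_t *c, uint64_t sector, const uint8_t *data) {
    int victim = 0;
    for (int i = 1; i < CACHE_WAYS; i++)   /* evict the least recently used */
        if (c->ent[i].stamp < c->ent[victim].stamp)
            victim = i;
    c->ent[victim].sector = sector;
    c->ent[victim].stamp  = ++c->tick;
    memcpy(c->ent[victim].data, data, SECTOR_SIZE);
}
```

With 16 entries the linear scans are negligible; frequently revisited sectors like the root directory keep getting their stamps refreshed and so never fall out.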
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing
mariuszp
Member
Posts: 587
Joined: Sat Oct 16, 2010 3:38 pm

Re: unacceptably slow filesystem access - possible causes?

Post by mariuszp »

OK, I have improved the performance somewhat. I buffer "pages" from disk (4 KB each) and read from the RAM cache before accessing the disk; it issues far fewer reads to the disk. I still need to optimise the writes (currently I just invalidate the cached pages that contain the sector being written; I plan to make it update the cache while writing to disk in parallel). I'll see how that goes.

Overall the difference is already very noticeable.
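The write-path change planned above (patch the cached page in place and queue the disk write, instead of invalidating) could look something like this (sketch only; the queued disk write is a hypothetical call, and the cache is reduced to a single page slot):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    uint64_t page_no;          /* disk page number this slot caches */
    uint8_t  data[PAGE_SIZE];
} cpage_t;

/* Write-through: on a cache hit, patch the cached copy so later reads see
   the new bytes, and (in a real kernel) queue the same bytes for the disk.
   Returns 1 on a hit; on a miss (or a write spanning past the page) the
   caller falls back to its existing invalidate path. */
static int write_through(cpage_t *pg, uint64_t page_no,
                         uint64_t off, const void *src, size_t len) {
    if (pg->page_no != page_no || off + len > PAGE_SIZE)
        return 0;
    memcpy(pg->data + off, src, len);   /* keep the cache current */
    /* queue_disk_write(page_no, off, src, len);   -- hypothetical */
    return 1;
}
```

The nice property is that reads immediately after a write stay cache hits, instead of forcing a re-read of a page you just had in RAM.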
BASICFreak
Member
Posts: 284
Joined: Fri Jan 16, 2009 8:34 pm
Location: Louisiana, USA

Re: unacceptably slow filesystem access - possible causes?

Post by BASICFreak »

Most, if not all, disk controllers allow a read (or write) of one cylinder (aka track), which IIRC on an IDE drive is usually 0x3F (63 sectors per track).

I would (if space permits) hold a cache of the partition's root directory, and a cache of relevant FS info (e.g. FAT has a File Allocation Table which should be at least partly cached, IMO).

On an IDE disk with 63 sectors per track and 512-byte sectors, you load about 31.5 KB at a time from the disk.
If that cylinder only contains (say) 24KB of requested information, you still saved time compared to reading in smaller chunks: that would be 6 reads at 4KB each, each with the overhead of the disk controller, DMA, and IRQs. Instead you send one read cluster command, set up DMA (if using it) once, and get one IRQ when it is finished (success or error).


The disk is slow (anywhere from 1-100MB/s, with exceptions); you never want to wait on it when that can be avoided. RAM, on the other hand, is quick (e.g. DDR2-800 is about 4GB/s [it's what's in my system - that's why it's the example]).

Now, for a write that doesn't cover a whole cylinder you have a few choices - but I would recommend writing in one burst: if you need to write 24KB into a 32KB cylinder, do it with a normal sector-count write. Just remember not to cross cylinder bounds on writes or reads - it's usually a bad thing :D

(You could also read the cylinder, modify it, and write it - The choice is yours)




Best Regards,


B!

P.S. If I used "cluster" at any point, I absolutely meant "cylinder" :shock:
BOS Source Thanks to GitHub
BOS Expanded Commentary
Both under active development!
Sortie wrote:
  • Don't play the role of an operating systems developer, be one.
  • Be truly afraid of undefined [behavior].
  • Your operating system should be itself, not fight what it is.