advice on debugging 'random' filesystem issue?
Posted: Fri Jan 29, 2021 9:36 pm
I've been working on the filesystem for the last few weeks and ran into an issue that I'd appreciate some advice on debugging:
At a high level, my setup is plain EXT2 on top of bus-master (BM) DMA accessing an ATA drive. More details are in viewtopic.php?f=1&t=39488. There wasn't much deviation from the 'plan', except that the disk cache has been split into 2 levels: the last-level cache is generic and catches all reads/writes before they go down to disk, while the upper level is specific to inode and block group metadata. Normal data block accesses bypass the upper level and only use the last level.
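To make the routing concrete, here is a rough user-space model of the two cache levels in Python. It is only a sketch, not the kernel code: the class names, the dict standing in for the disk, the write-back policy, and the sync order (upper level first, then last level) are all assumptions made for illustration.

```python
from enum import Enum


class Kind(Enum):
    INODE = "inode"
    GROUP_DESC = "group_desc"
    DATA = "data"


class LastLevelCache:
    """Generic block cache: every read/write passes through here before disk."""

    def __init__(self, disk):
        self.disk = disk      # dict of block number -> bytes, stands in for the drive
        self.blocks = {}      # cached block contents
        self.dirty = set()    # blocks modified but not yet written back (assumed write-back)

    def read(self, blockno):
        if blockno not in self.blocks:
            self.blocks[blockno] = self.disk.get(blockno, b"")
        return self.blocks[blockno]

    def write(self, blockno, data):
        self.blocks[blockno] = data
        self.dirty.add(blockno)

    def sync(self):
        for blockno in sorted(self.dirty):
            self.disk[blockno] = self.blocks[blockno]
        self.dirty.clear()


class MetadataCache:
    """Upper level: holds only inode and block group metadata."""

    def __init__(self, lower):
        self.lower = lower
        self.entries = {}
        self.dirty = set()

    def read(self, blockno):
        if blockno not in self.entries:
            self.entries[blockno] = self.lower.read(blockno)
        return self.entries[blockno]

    def write(self, blockno, data):
        self.entries[blockno] = data
        self.dirty.add(blockno)

    def sync(self):
        # Push dirty metadata down one level; it reaches disk on the lower sync.
        for blockno in sorted(self.dirty):
            self.lower.write(blockno, self.entries[blockno])
        self.dirty.clear()


class CacheStack:
    def __init__(self, disk):
        self.last = LastLevelCache(disk)
        self.meta = MetadataCache(self.last)

    def _level(self, kind):
        # Data blocks bypass the upper level; metadata goes through it.
        return self.last if kind is Kind.DATA else self.meta

    def read(self, kind, blockno):
        return self._level(kind).read(blockno)

    def write(self, kind, blockno, data):
        self._level(kind).write(blockno, data)

    def sync(self):
        self.meta.sync()
        self.last.sync()
```

In this model, dirty metadata that has not been synced exists only in the upper level, so nothing below it (last level or disk) sees it until `sync()` completes.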
The workload is still a multithreaded test that keeps creating folders/empty files until the disk is full, then does an rm -rf on the whole tree. The threads compete with each other for disk space, so each creates a random number of folders/files per iteration, and that count is checked when the tree is deleted (I extended the rmdir syscall so it can rm -rf directly and return the number of items deleted).
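The shape of that test, and the invariant it checks, looks roughly like the Python model below. This is a sketch under assumptions, not the actual test: `FakeDisk`, `worker`, `run_iteration`, the batch size of 1 to 16, and the per-thread seeds are all made up for illustration, and a single lock stands in for whatever synchronization the real filesystem uses.

```python
import random
import threading


class FakeDisk:
    """Stand-in for the partition: a fixed number of free slots."""

    def __init__(self, capacity):
        self.free = capacity
        self.items = []
        self.lock = threading.Lock()

    def create(self, name):
        """Create one folder/file; returns False once the disk is full."""
        with self.lock:
            if self.free == 0:
                return False
            self.free -= 1
            self.items.append(name)
            return True

    def rm_rf(self):
        """Delete everything and return the number of items deleted."""
        with self.lock:
            deleted = len(self.items)
            self.items.clear()
            self.free += deleted
            return deleted


def worker(disk, tid, rng, created, full):
    # Create random-sized batches until some thread hits a full disk.
    while not full.is_set():
        for _ in range(rng.randint(1, 16)):
            if not disk.create(f"t{tid}-{created[tid]}"):
                full.set()
                break
            created[tid] += 1  # each thread only touches its own slot


def run_iteration(capacity=1000, nthreads=4, seed=0):
    disk = FakeDisk(capacity)
    created = [0] * nthreads
    full = threading.Event()
    threads = [
        threading.Thread(
            target=worker,
            args=(disk, t, random.Random(seed + t), created, full),
        )
        for t in range(nthreads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    deleted = disk.rm_rf()
    return sum(created), deleted


if __name__ == "__main__":
    created, deleted = run_iteration()
    # The check: everything the threads created must be what rm -rf removed.
    assert created == deleted, (created, deleted)
```

A mismatch between the two counts is the signal that something was lost or double-counted somewhere between creation and deletion.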
The issue: I left the computer running this test overnight in Qemu, and sometime during the night it suffered a serious failure that not only locked up the sync syscall (called by the test when something fails or when it otherwise exits), but pretty much wiped out the partition (the partition itself is still there, but I can't ls / or cd into anything). Because sync locked up, I wasn't able to pull much info out of the disk image either (some metadata was stuck in memory).
Without changing any code, I started running the same thing again thinking "it will repro soon", but it hasn't happened again after about 6 hours.
I'm sure something is not right, and the symptom is also scary (wiping out the metadata of a partition :( ), but I'm also a bit lost on how to track this down in the filesystem. It may even have originated elsewhere, as I was also monkeying around with remote (not same CPU) task resume and mutexes last night, but those usually fail in very different ways or get caught by asserts before things go very wrong.
I'd appreciate any advice on debugging this kind of issue ("random", hard to repro, severe consequences), as my feeling is that this one is the beginning rather than the end as I keep adding stuff.