I've been working on the filesystem part for the last few weeks and ran into an issue that I'd appreciate some advice on debugging:
At a high level, my setup is plain ext2 on top of a bus-mastering DMA ATA driver. More details are in viewtopic.php?f=1&t=39488. There wasn't too much deviation from the 'plan', except that the disk cache has been split into two levels: the last-level cache is generic and catches all reads/writes before they go down to disk, while the upper level is specific to inode and block-group metadata. Normal data block accesses bypass the upper level and only use the last-level cache (roughly as in the sketch below).
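For what it's worth, here is a minimal, self-contained sketch of how I picture that routing; all the names (llc_read, meta_cache, fs_read_block) are illustrative stand-ins, not my actual code, and the "LLC" here is just a fake disk read.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 1024
#define META_SLOTS 64

struct meta_slot {
    uint32_t lba;
    int      valid;
    uint8_t  data[BLOCK_SIZE];
};

static struct meta_slot meta_cache[META_SLOTS];

/* Stand-in for the generic last-level cache / disk path. */
static void llc_read(uint32_t lba, void *buf)
{
    memset(buf, (int)(lba & 0xff), BLOCK_SIZE);   /* fake disk contents */
}

/* Metadata reads (inodes, block group descriptors) go through the
 * upper cache; plain data blocks go straight to the last level. */
static void fs_read_block(uint32_t lba, void *buf, int is_metadata)
{
    if (!is_metadata) {
        llc_read(lba, buf);
        return;
    }

    struct meta_slot *slot = &meta_cache[lba % META_SLOTS];
    if (!slot->valid || slot->lba != lba) {       /* miss: fill from the LLC */
        llc_read(lba, slot->data);
        slot->lba = lba;
        slot->valid = 1;
    }
    memcpy(buf, slot->data, BLOCK_SIZE);
}

int main(void)
{
    uint8_t buf[BLOCK_SIZE];
    fs_read_block(100, buf, 0);   /* data block: bypasses the upper level     */
    fs_read_block(42, buf, 1);    /* metadata: filled into the upper level    */
    fs_read_block(42, buf, 1);    /* second read: served from the upper level */
    printf("first byte of metadata block 42: %u\n", buf[0]);
    return 0;
}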
The workload is still a multithreaded test that keeps creating folders/empty files until the disk is full, then rm -rf's the whole tree. The threads compete with each other for disk space, so each iteration they create a random number of folders/files, and that count is checked when the tree is deleted (I extended the rmdir syscall so it can rm -rf directly and return the number of items deleted). A rough model of one iteration is sketched below.
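Roughly, each thread does something like the following. This is only a portable model: my extended rm -rf rmdir syscall is stood in for by an nftw()-based helper that also returns the number of entries it removed, and the sizes are made up.

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

static long removed;   /* one tree per process in this model */

static int rm_cb(const char *path, const struct stat *sb, int type, struct FTW *ftw)
{
    (void)sb; (void)type; (void)ftw;
    if (remove(path) == 0)
        removed++;
    return 0;
}

/* rm -rf `root` and report how many entries were deleted. */
static long rm_rf_count(const char *root)
{
    removed = 0;
    nftw(root, rm_cb, 16, FTW_DEPTH | FTW_PHYS);
    return removed;
}

int main(void)
{
    const char *root = "testtree";
    long created = 0;

    if (mkdir(root, 0755) == 0)
        created++;

    /* Create a random number of directories with one empty file each,
     * stopping as soon as creation fails (e.g. the disk fills up). */
    long target = 1 + rand() % 1000;
    for (long i = 0; i < target; i++) {
        char dir[64], file[80];
        snprintf(dir, sizeof dir, "%s/d%ld", root, i);
        if (mkdir(dir, 0755) != 0)
            break;
        created++;
        snprintf(file, sizeof file, "%s/f", dir);
        int fd = open(file, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            break;
        close(fd);
        created++;
    }

    long deleted = rm_rf_count(root);
    printf("created=%ld deleted=%ld %s\n", created, deleted,
           created == deleted ? "OK" : "MISMATCH");
    return created == deleted ? 0 : 1;
}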
The issue: I left the computer running this test overnight in QEMU, and sometime during the night it suffered a serious failure that not only locked up the sync syscall (called by the test when something fails or when it otherwise exits), but pretty much wiped out the partition (not the partition itself, but I can't ls / or cd into anything). Since sync locked up, I wasn't able to pull much info out of the disk image either (some metadata was stuck in memory).
Without changing any code, I started running the same thing again, thinking "it will repro soon", but it hasn't happened again after six hours or so.
I'm sure something is not right, and the symptom is scary (wiping out the metadata of a partition :( ), but I'm also a bit lost on how to track this down in the filesystem. It may even have originated elsewhere, since I was also monkeying around with remote (different-CPU) task resume and mutexes last night, but those usually fail in very different ways or get caught by asserts before things go very wrong.
Would appreciate any advice on debugging this kind of issue ("random", hard to reproduce, severe consequences), as my feeling is that this one is the beginning rather than the end as I keep adding stuff.
advice on debugging 'random' filesystem issue?
Re: advice on debugging 'random' filesystem issue?
You could write debug prints to serial or network to be logged by another computer. I'm far from a filesystem expert, but that's what I'd do with any issue where it's difficult to see what's going on.
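For example, something along these lines, which is close to the usual polled COM1 output code on the wiki (the outb/inb helpers are the standard inline-asm port I/O wrappers); in QEMU you can pass "-serial file:debug.log" or "-serial stdio" so the log survives even if the guest locks up:

#include <stdint.h>

#define COM1 0x3F8

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

void serial_init(void)
{
    outb(COM1 + 1, 0x00);   /* disable serial interrupts         */
    outb(COM1 + 3, 0x80);   /* enable DLAB to set the divisor    */
    outb(COM1 + 0, 0x03);   /* divisor low byte: 38400 baud      */
    outb(COM1 + 1, 0x00);   /* divisor high byte                 */
    outb(COM1 + 3, 0x03);   /* 8 data bits, no parity, 1 stop    */
    outb(COM1 + 2, 0xC7);   /* enable and clear FIFOs            */
}

void serial_putc(char c)
{
    while ((inb(COM1 + 5) & 0x20) == 0)   /* wait for transmit buffer empty */
        ;
    outb(COM1, (uint8_t)c);
}

void serial_puts(const char *s)
{
    while (*s)
        serial_putc(*s++);
}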
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
Re: advice on debugging 'random' filesystem issue?
eekee wrote:You could write debug prints to serial or network to be logged by another computer. I'm far from a filesystem expert, but that's what I'd do with any issue where it's difficult to see what's going on.
Thanks, yes, this is all on QEMU and I'm using code lifted from the wiki here to print to serial.
I've fixed more race conditions in the cache, but there must be more, since it ran okay for about 50 hours and then went haywire again.
I also noticed a strange asymmetry among the CPUs: the test thread on one of them blocks itself thousands of times per second, while the other CPUs only see this a few hundred times per second. As far as I can tell, and from manually following them, their load and sequences are pseudorandom but symmetrical.
Maybe it's a good time to move on to other features and deal with these when they become more obvious issues.
Would still like to hear advice or pointers on tracking down these 'random' issues though.
Re: advice on debugging 'random' filesystem issue?
I'm glad I'm not the only one. Back in my last OS, I would have a bug in the scheduler, fix it, only for it to go nuts a few hours later. I assume it's probably race conditions. It may even be a bug in the scheduler that just hasn't come to light yet. I plan on writing my full OS for uniprocessor first, then making my kernel preemptible, then adding SMP support. I hope this will limit strange bugs.
Re: advice on debugging 'random' filesystem issue?
nexos wrote:I'm glad I'm not the only one. Back in my last OS, I would have a bug in the scheduler, fix it, only for it to go nuts a few hours later. I assume it's probably race conditions. It may even be a bug in the scheduler that just hasn't come to light yet. I plan on writing my full OS for uniprocessor first, then making my kernel preemptible, then adding SMP support. I hope this will limit strange bugs.
Turning off MP is an interesting idea. Any thoughts on what you're going to do when things fail once you turn SMP back on, though?