What is your longest bug?

Posted: Thu Apr 27, 2017 6:18 pm
by SpyderTL
After fighting with a particularly annoying EHCI issue for nearly 4 days, I'm trying to think of the longest time it's taken me to find and fix a bug. This one feels like the longest right now, but I'm sure I've spent over a week or two on one issue before. I could probably search back through the forums and find out.

What about you guys? What was your "white whale" bug that eluded you the longest? Or maybe still eludes you...

Re: What is your longest bug?

Posted: Thu Apr 27, 2017 7:41 pm
by eryjus
I will never forget it. 3 months. It was for work, not the personal stuff, and it was an ERP suite. It was all centered around memory leaks. My business locked me in a room for 2 weeks of that and slipped pizza under the door.

Re: What is your longest bug?

Posted: Thu Apr 27, 2017 8:16 pm
by hgoel
I've been dealing with SMP bugs for a while. They've been different bugs, but they all fall under the banner of 'SMP doesn't work, find and fix the race conditions'.

Re: What is your longest bug?

Posted: Thu Apr 27, 2017 9:19 pm
by SpyderTL
I just looked back through my old posts, and realized that one of my first questions on this forum in 2013 was about how to set up queues properly for OHCI controllers. And the last question I posted, yesterday, was about how to set up EHCI queues properly.

I've really come a long way...

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 12:44 am
by Solar
I remember a bug in the printf() implementation for PDCLib that drove me to distraction. You can still see the comment here.

I hadn't checked in the source before I got this working, but the code I had gave almost correct results. I had been trying to track this bug down for weeks (not full time, of course, this has always been a spare-time endeavour). And I did spend virtually all of the Breakpoint 2006 demo party staring at the code, stepping through the debugger and generally tearing my hair out.

And then, after three days of drinking, eating junk food, and staring at the screen in frustration, it struck me like... well... #-o :oops: :evil: -- I was adding to the wrong variable...

(Unfortunately I hadn't checked in the previous version of the source yet, as I wanted it to work before checking in, so I cannot show you the diff.)

I had a similar issue at the office once, where I did spend almost two weeks full-time trying to nail down a bug, which turned out to be something along the lines of a sign error. I was ashamed to report this to my superior. But he smiled and said:

"Every bug is trivial... once you've found it."

Any similarity between that utterance and my signature is not coincidental.

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 1:15 am
by SpyderTL
Solar wrote:...Breakpoint 2006 demo party...
8)

That's probably the worst thing about living in the U.S. I've always wanted to go to one of those.

I did get to watch some of Revision 2017 a few weeks ago live on Twitch. Almost like being there...

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 1:58 am
by Kevin
I haven't looked at it for a few years now and I probably won't fix it any more, but the most puzzling OS-Dev bug for me is still in my sis900 driver. I have one computer with an early revision where it works just fine. And I have another computer with a newer revision on-board and that one works fine, too - but only as long as you never send a packet that is larger than 128 bytes (exactly 128 is fine). If you do, the NIC is dead and won't receive or send anything any more. I spent quite some hours on it back then and I still don't know what the problem is. It doesn't really make any sense to me, but I'm not usually using this test computer any more, so whatever. But I'm sure it would be one of the trivial bugs once I would have found it.

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 5:40 am
by sleephacker
My second longest bug was an issue with my GDT when I tried switching to PMode for the first time; I think it took me about one or two weeks to solve. I originally had my 32-bit PMode data selector at offset 0x08 and my code selector at 0x10, as opposed to CS = 0x08 and DS = 0x10, which is what almost all tutorials and code examples used, but it shouldn't matter as long as you put the right descriptor in the right register. After switching it around, the jump to PMode worked but loading DS didn't. It took me ages to figure out because both entries worked when placed at offset 0x08 but neither did when placed at offset 0x10... So it couldn't be a problem with the selector structures themselves, or at least that's what I thought for at least a week. At some point I decided to go through it character by character, comparing it to everything else I could find on the internet, and it turned out to be a typo that caused the length of my descriptor structure to be off by one byte, which is why the second descriptor never worked: it wasn't at the offset it was supposed to be.
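For illustration only, here is a minimal C sketch of a check that catches this class of bug at compile time, assuming the usual 8-byte x86 descriptor layout. The struct, its field names, and the `gdt_selector` helper are hypothetical and are not taken from the original code, which is not shown in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One GDT entry in the usual x86 layout (field names are illustrative). */
struct gdt_entry {
    uint16_t limit_low;        /* limit bits 0..15 */
    uint16_t base_low;         /* base bits 0..15 */
    uint8_t  base_mid;         /* base bits 16..23 */
    uint8_t  access;           /* present, DPL, type */
    uint8_t  flags_limit_high; /* flags in the high nibble, limit bits 16..19 */
    uint8_t  base_high;        /* base bits 24..31 */
};

/* If a typo makes the entry 7 or 9 bytes long, the build fails here
 * instead of the second descriptor silently landing at the wrong offset. */
_Static_assert(sizeof(struct gdt_entry) == 8, "GDT entry must be exactly 8 bytes");

/* Null descriptor, flat ring-0 code segment, flat ring-0 data segment. */
const struct gdt_entry gdt[3] = {
    {0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00},
    {0xFFFF, 0x0000, 0x00, 0x9A, 0xCF, 0x00},
    {0xFFFF, 0x0000, 0x00, 0x92, 0xCF, 0x00},
};

/* Byte offset of entry `index` from the start of the table. For a ring-0
 * GDT selector this offset is also the selector value. */
uint16_t gdt_selector(size_t index)
{
    assert(index < sizeof(gdt) / sizeof(gdt[0]));
    return (uint16_t)((const uint8_t *)&gdt[index] - (const uint8_t *)gdt);
}
```

With this layout, `gdt_selector(1)` and `gdt_selector(2)` give the 0x08 and 0x10 values mentioned above. The same size check can be done in an assembler by comparing the table length against a multiple of 8.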

My longest bug still hasn't been properly fixed, but I found a way to at least make it work. When I reboot in bochs (only in bochs) using either the PS/2 or the port 0xcf9 method (but not when triple faulting or pressing 'reset'), the RTC doesn't fire interrupts anymore. The way I made this work is by doing an int 28h, which is mapped to IRQ 8 (the RTC interrupt); after doing that it worked. The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 5:57 am
by alexfru
Kevin wrote:I haven't looked at it for a few years now and I probably won't fix it any more, but the most puzzling OS-Dev bug for me is still in my sis900 driver. I have one computer with an early revision where it works just fine. And I have another computer with a newer revision on-board and that one works fine, too - but only as long as you never send a packet that is larger than 128 bytes (exactly 128 is fine). If you do, the NIC is dead and won't receive or send anything any more. I spent quite some hours on it back then and I still don't know what the problem is. It doesn't really make any sense to me, but I'm not usually using this test computer any more, so whatever. But I'm sure it would be one of the trivial bugs once I would have found it.
I've read the errata for a few of Intel's one- and ten-gigabit chips and seen their DPDK code (in places contradicting the chip documentation, e.g. doing exactly what the doc says not to) and learned that the whole thing is a big mess. I also had an interesting bug in which the ring buffer would sometimes get stuck before completing the first round if data was arriving fast. But if I did some extra flushing or something of the sort during that first round only, everything would then just work. Still not sure if it was some odd caching issue, as the workaround was found sufficient at the time.

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 6:04 am
by alexfru
sleephacker wrote:When I reboot in bochs (only in bochs) using either the PS/2 or the port 0xcf9 method (but not when triple faulting or pressing 'reset'), the RTC doesn't fire interrupts anymore. The way I made this work is by doing an int 28h, which is mapped to IRQ 8 (the RTC interrupt); after doing that it worked. The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
What if you just do iret without doing int 0x28?

I had a weird PS/2 mouse problem years ago. I could never properly disable it on one PC. I don't remember if the PC hung or if the mouse couldn't be re-enabled again afterwards or it never got disabled. Things seemed to work on other PCs, though. Never got to the bottom of it.

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 6:23 am
by Octocontrabass
sleephacker wrote:The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
When you initialize the RTC, do you acknowledge its pending interrupts by reading status register C?

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 6:57 am
by sleephacker
Octocontrabass wrote:
sleephacker wrote:The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
When you initialize the RTC, do you acknowledge its pending interrupts by reading status register C?
It all makes sense now!

My int 28h handler reads status register C, which is why int 28h fixed it. I removed the int 28h and put a read from status reg C in the initialisation, and now it works!

Thank you!
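For readers who want to see the shape of that fix, here is a minimal sketch, assuming the standard CMOS/RTC index and data ports (0x70/0x71) and status register C at index 0x0C. Port I/O is replaced by a small in-memory simulation so the snippet can run in user space; `sim_cmos`, `port_out`, `port_in`, `cmos_read`, `cmos_write` and `rtc_init` are hypothetical names, not sleephacker's code. On real hardware the two port functions would be actual `out`/`in` instructions.

```c
#include <assert.h>
#include <stdint.h>

#define CMOS_INDEX 0x70  /* register select port */
#define CMOS_DATA  0x71  /* data port */
#define RTC_REG_B  0x0B  /* status register B: bit 6 enables the periodic interrupt */
#define RTC_REG_C  0x0C  /* status register C: pending interrupt flags */

/* Simulated CMOS so the sketch runs outside a kernel (assumption). */
uint8_t sim_cmos[128];
uint8_t sim_index;

void port_out(uint16_t port, uint8_t value)
{
    if (port == CMOS_INDEX)
        sim_index = value & 0x7F;
    else if (port == CMOS_DATA)
        sim_cmos[sim_index] = value;
}

uint8_t port_in(uint16_t port)
{
    if (port != CMOS_DATA)
        return 0xFF;
    uint8_t value = sim_cmos[sim_index];
    if (sim_index == RTC_REG_C)
        sim_cmos[RTC_REG_C] = 0;  /* reading register C clears its flags */
    return value;
}

uint8_t cmos_read(uint8_t reg)
{
    assert(reg < 128);
    port_out(CMOS_INDEX, reg);
    return port_in(CMOS_DATA);
}

void cmos_write(uint8_t reg, uint8_t value)
{
    assert(reg < 128);
    port_out(CMOS_INDEX, reg);
    port_out(CMOS_DATA, value);
}

void rtc_init(void)
{
    /* Enable the periodic interrupt (bit 6 of register B). */
    cmos_write(RTC_REG_B, cmos_read(RTC_REG_B) | 0x40);

    /* Acknowledge anything left pending from before the reboot. Until
     * register C has been read, the RTC will not raise IRQ 8 again. */
    (void)cmos_read(RTC_REG_C);
}
```

The read of register C during initialisation does the job that the int 28h handler was doing by accident.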

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 10:48 am
by Korona
The hardest bug I remember is not related to OS development. It was a subtle race condition at work. We have a program that solves a certain computationally hard problem. It does this by running multiple iterations of some algorithm and that algorithm itself is distributed over many compute nodes. In extremely rare cases (like once in a billion) it was possible for one of the compute nodes to prove that the whole problem was infeasible without taking into account all other computations. Thus the code looked like this:

Code:

while(problem_not_solved()) {
    // Lots of precomputation, multiple code paths of MPI calls to set everything up.

    while(any_work_left()) {
        if(do_work() == GLOBALLY_INFEASIBLE)
            break;
    }
    wait_for_other_nodes();

    // Multiple MPI code paths to collect results and postprocess them.
}
The program would work mostly correctly but hang sometimes (like once per 500 runs or so). I spent days debugging the precomputation and postprocessing code and the actual do_work() procedure. Just to illustrate how difficult this code was to debug: it typically ran on ~160 cores concurrently (we had dual-socket Xeon nodes with 10 cores/socket), and the outer loop ran some thousands of times per invocation while the inner loop ran billions of times per invocation.

In the end, the problem was that if the break statement was executed, the work queue would not be emptied, which prevented wait_for_other_nodes() from completing. However, there was a load balancer that moved work between multiple nodes, so the bug would usually go undetected because a node's work queue was still indirectly drained by other nodes. Unless all of them triggered the bug simultaneously! The fix was just to clear the local work queue before the break.
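As a rough single-node illustration of that fix (not the real program): in the sketch below the work queue is reduced to a counter, MPI and the load balancer are left out, and the barrier is modelled as "can only be passed with an empty local queue". All names other than `any_work_left`, `do_work` and `GLOBALLY_INFEASIBLE` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum work_result { WORK_OK, GLOBALLY_INFEASIBLE };

/* Stand-in for a node's local work queue: only the item count matters here. */
struct work_queue { size_t pending; };

bool any_work_left(const struct work_queue *q) { return q->pending > 0; }

void clear_work_queue(struct work_queue *q) { q->pending = 0; }

/* Takes one item off the queue. The item that leaves `infeasible_at` items
 * behind is the one that proves global infeasibility. */
enum work_result do_work(struct work_queue *q, size_t infeasible_at)
{
    assert(q->pending > 0);
    q->pending--;
    return q->pending == infeasible_at ? GLOBALLY_INFEASIBLE : WORK_OK;
}

/* Runs the inner loop and reports whether this node could then get through
 * wait_for_other_nodes(), modelled as "local queue is empty". */
bool run_inner_loop(struct work_queue *q, size_t infeasible_at, bool apply_fix)
{
    while (any_work_left(q)) {
        if (do_work(q, infeasible_at) == GLOBALLY_INFEASIBLE) {
            if (apply_fix)
                clear_work_queue(q);  /* the fix: drain before leaving the loop */
            break;
        }
    }
    return !any_work_left(q);
}
```

Starting with 10 items and `infeasible_at` set to 5, the unfixed loop stops with 5 items still queued and the node would be stuck at the barrier; with `apply_fix` set, the queue is empty when the loop exits.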

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 11:38 am
by osdever
U365 had the worst bug I ever encountered. It was a memory bug: memory was overlapping. My team and I were forced to rewrite about 1/5 of the whole OS: the memory manager, the file system... It was a nightmare. The bug was open for more than a year... Now it's finally fixed. Fully.

Re: What is your longest bug?

Posted: Fri Apr 28, 2017 12:41 pm
by AMenard
My longest bug?

It was a millipede... I called him Tony :-)