What is your longest bug?

SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

What is your longest bug?

Post by SpyderTL »

After fighting with a particularly annoying EHCI issue for nearly 4 days, I'm trying to think of the longest time it's taken me to find and fix a bug. This one feels like the longest right now, but I'm sure I've spent over a week or two on one issue before. I could probably search back through the forums and find out.

What about you guys? What was your "white whale" bug that eluded you the longest? Or maybe still eludes you...
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
eryjus
Member
Posts: 286
Joined: Fri Oct 21, 2011 9:47 pm
Libera.chat IRC: eryjus
Location: Tustin, CA USA

Re: What is your longest bug?

Post by eryjus »

I will never forget it. 3 months. It was for work, not the personal stuff, and it was an ERP suite. It was all centered around memory leaks. My business locked me in a room for 2 weeks of that and slipped pizza under the door.
Adam

The name is fitting: Century Hobby OS -- At this rate, it's gonna take me that long!
Read about my mistakes and missteps with this iteration: Journal

"Sometimes things just don't make sense until you figure them out." -- Phil Stahlheber
hgoel
Member
Posts: 89
Joined: Sun Feb 09, 2014 7:11 pm
Libera.chat IRC: hgoel
Location: Within a meter of a computer

Re: What is your longest bug?

Post by hgoel »

I've been dealing with SMP bugs for a while. They've been different bugs, but they all fall under the banner of 'SMP doesn't work, find and fix the race conditions'.
"If the truth is a cruel mistress, than a lie must be a nice girl"
Working on Cardinal
Find me at #Cardinal-OS (irc://chat.freenode.net:6697/Cardinal-OS) on freenode!
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: What is your longest bug?

Post by SpyderTL »

I just looked back through my old posts and realized that one of my first questions on this forum, in 2013, was about how to set up queues properly for OHCI controllers. And the last question I posted, yesterday, was about how to set up EHCI queues properly.

I've really come a long way...
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: What is your longest bug?

Post by Solar »

I remember a bug in the printf() implementation for PDCLib that drove me to distraction. You can still see the comment here.

I hadn't checked in the source before I got this working, but the code I had gave almost correct results. I had been trying to track this bug down for weeks (not full time, of course, this has always been a spare-time endeavour). And I did spend virtually all of the Breakpoint 2006 demo party staring at the code, stepping through the debugger and generally tearing my hair out.

And then, after three days of drinking, eating junk food, and staring at the screen in frustration, it struck me like... well... #-o :oops: :evil: -- I was adding to the wrong variable...

(Unfortunately I hadn't checked in the previous version of the source yet, as I wanted it to work before checking in, so I cannot show you the diff.)

I had a similar issue at the office once, where I did spend almost two weeks full-time trying to nail down a bug, which turned out to be something along the lines of a sign error. I was ashamed to report this to my superior. But he smiled and said:

"Every bug is trivial... once you've found it."

Any similarity between that remark and my signature is not coincidental.
Every good solution is obvious once you've found it.
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: What is your longest bug?

Post by SpyderTL »

Solar wrote:...Breakpoint 2006 demo party...
8)

That's probably the worst thing about living in the U.S. I've always wanted to go to one of those.

I did get to watch some of Revision 2017 a few weeks ago live on Twitch. Almost like being there...
Kevin
Member
Posts: 1071
Joined: Sun Feb 01, 2009 6:11 am
Location: Germany

Re: What is your longest bug?

Post by Kevin »

I haven't looked at it for a few years now and I probably won't fix it any more, but the most puzzling OS-Dev bug for me is still in my sis900 driver. I have one computer with an early revision where it works just fine. And I have another computer with a newer revision on-board and that one works fine, too - but only as long as you never send a packet that is larger than 128 bytes (exactly 128 is fine). If you do, the NIC is dead and won't receive or send anything any more. I spent quite some hours on it back then and I still don't know what the problem is. It doesn't really make any sense to me, but I'm not usually using this test computer any more, so whatever. But I'm sure it would be one of the trivial bugs once I had found it.
Developer of tyndur - community OS of Lowlevel (German)
sleephacker
Member
Posts: 97
Joined: Thu Aug 06, 2015 6:41 am
Location: Netherlands

Re: What is your longest bug?

Post by sleephacker »

My second longest bug was an issue with my GDT when I tried switching to PMode for the first time. I think it took me about one or two weeks to solve. I originally had my 32-bit PMode data selector at offset 0x08 and my code selector at 0x10, as opposed to CS = 0x08 and DS = 0x10, which is what almost all tutorials and code examples used, but it shouldn't matter as long as you put the right descriptor in the right register. After switching them around, the jump to PMode worked but loading DS didn't. It took me ages to figure out, because both entries worked when placed at offset 0x08 but neither did when placed at offset 0x10... So it couldn't be a problem with the selector structures themselves; at least that's what I thought for at least a week. At some point I decided to go through it character by character, comparing it to everything else I could find on the internet, and it turned out to be a typo that caused the length of my descriptor structure to be off by one byte. That is why the second descriptor never worked: it wasn't at the offset it was supposed to be.
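
A compile-time size check is one way this kind of typo could be caught early. What follows is only a minimal sketch in C, not the code from the kernel described above: the field layout is the standard 8-byte x86 segment descriptor, while the names `gdt_entry`, `gdt_selector`, and the `GDT_*` indices are hypothetical. The table order mirrors the original layout (data at 0x08, code at 0x10).

```c
#include <assert.h>
#include <stdint.h>

/* One segment descriptor; the CPU expects exactly 8 bytes per entry. */
struct gdt_entry {
    uint16_t limit_low;    /* limit bits 0..15 */
    uint16_t base_low;     /* base bits 0..15 */
    uint8_t  base_mid;     /* base bits 16..23 */
    uint8_t  access;       /* present bit, privilege level, type */
    uint8_t  flags_limit;  /* flags in the high nibble, limit bits 16..19 */
    uint8_t  base_high;    /* base bits 24..31 */
};

/* Breaks the build if a field is added, dropped, or given the wrong width. */
static_assert(sizeof(struct gdt_entry) == 8,
              "GDT descriptor must be exactly 8 bytes");

/* Hypothetical table order: null descriptor, then data, then code. */
enum { GDT_NULL = 0, GDT_DATA = 1, GDT_CODE = 2, GDT_COUNT = 3 };

struct gdt_entry gdt[GDT_COUNT];

/* Byte offset of an entry inside the table. For ring 0 and the GDT
 * (RPL = 0, TI = 0) this offset is also the selector value. */
uint16_t gdt_selector(unsigned index)
{
    return (uint16_t)((const char *)&gdt[index] - (const char *)gdt);
}
```

Deriving the selector from the table itself, rather than hard-coding 0x08 and 0x10, means a mis-sized entry shows up as a wrong selector value instead of a silent fault when loading DS.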

My longest bug still hasn't been properly fixed, but I found a way to at least make it work. When I reboot in bochs (only in bochs) using either the PS/2 or the port 0xcf9 method (but not when triple faulting or pressing 'reset'), the RTC doesn't fire interrupts anymore. The way I made this work is by doing an int 28h, which is mapped to IRQ 8 (the RTC interrupt); after doing that it worked. The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
alexfru
Member
Posts: 1111
Joined: Tue Mar 04, 2014 5:27 am

Re: What is your longest bug?

Post by alexfru »

Kevin wrote:I haven't looked at it for a few years now and I probably won't fix it any more, but the most puzzling OS-Dev bug for me is still in my sis900 driver. I have one computer with an early revision where it works just fine. And I have another computer with a newer revision on-board and that one works fine, too - but only as long as you never send a packet that is larger than 128 bytes (exactly 128 is fine). If you do, the NIC is dead and won't receive or send anything any more. I spent quite some hours on it back then and I still don't know what the problem is. It doesn't really make any sense to me, but I'm not usually using this test computer any more, so whatever. But I'm sure it would be one of the trivial bugs once I had found it.
I've read the errata for a few of Intel's one- and ten-gigabit chips and looked at their DPDK code, which in places contradicts the chip documentation (e.g. doing exactly what the doc says not to do), and learned that the whole thing is a big mess. I also had an interesting bug where the ring buffer would sometimes get stuck before completing the first round if data was arriving fast. But if I did some extra flushing or something of the sort during that first round only, everything would then just work. I'm still not sure whether it was some odd caching issue, as the workaround was found sufficient at the time.
alexfru
Member
Posts: 1111
Joined: Tue Mar 04, 2014 5:27 am

Re: What is your longest bug?

Post by alexfru »

sleephacker wrote:When I reboot in bochs (only in bochs) using either the PS/2 or the port 0xcf9 method (but not when triple faulting or pressing 'reset'), the RTC doesn't fire interrupts anymore. The way I made this work is by doing an int 28h, which is mapped to IRQ 8 (the RTC interrupt); after doing that it worked. The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
What if you just do iret without doing int 0x28?

I had a weird PS/2 mouse problem years ago. I could never properly disable it on one PC. I don't remember if the PC hung or if the mouse couldn't be re-enabled again afterwards or it never got disabled. Things seemed to work on other PCs, though. Never got to the bottom of it.
Octocontrabass
Member
Posts: 5521
Joined: Mon Mar 25, 2013 7:01 pm

Re: What is your longest bug?

Post by Octocontrabass »

sleephacker wrote:The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
When you initialize the RTC, do you acknowledge its pending interrupts by reading status register C?
sleephacker
Member
Posts: 97
Joined: Thu Aug 06, 2015 6:41 am
Location: Netherlands

Re: What is your longest bug?

Post by sleephacker »

Octocontrabass wrote:
sleephacker wrote:The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...
When you initialize the RTC, do you acknowledge its pending interrupts by reading status register C?
It all makes sense now!

My int 28h handler reads status register C, which is why int 28h fixed it. I removed the int 28h and put a read from status reg C in the initialisation, and now it works!

Thank you!
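
For anyone hitting the same symptom, the fix can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the kernel discussed here: the CMOS ports 0x70/0x71 and status register C at index 0x0C are the standard PC values, but `port_out`, `port_in`, and the `sim_*` state are hypothetical stand-ins that simulate the hardware so the sketch runs in user space. In a real kernel they would be replaced by actual `out`/`in` instructions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMOS_INDEX_PORT 0x70  /* selects a CMOS/RTC register */
#define CMOS_DATA_PORT  0x71  /* reads or writes the selected register */
#define RTC_STATUS_C    0x0C  /* interrupt flags; reading clears them */

/* --- Simulated hardware (hypothetical, replaces real port I/O) --- */
static uint8_t sim_index;
static uint8_t sim_status_c = 0xC0;  /* IRQF and PF left set by a warm reboot */

void port_out(uint16_t port, uint8_t value)
{
    if (port == CMOS_INDEX_PORT)
        sim_index = value;
}

uint8_t port_in(uint16_t port)
{
    if (port == CMOS_DATA_PORT && sim_index == RTC_STATUS_C) {
        uint8_t flags = sim_status_c;
        sim_status_c = 0;  /* the read itself acknowledges the interrupt */
        return flags;
    }
    return 0;
}

/* The RTC raises no further IRQ 8 while flags in register C are pending. */
bool sim_rtc_can_fire(void)
{
    return sim_status_c == 0;
}

/* --- Driver side --- */

/* Acknowledge whatever is pending and return the flags that were set. */
uint8_t rtc_ack_pending(void)
{
    port_out(CMOS_INDEX_PORT, RTC_STATUS_C);
    return port_in(CMOS_DATA_PORT);
}

void rtc_init(void)
{
    /* Rate and interrupt-enable setup would go here. */
    (void)rtc_ack_pending();  /* discard flags left over from before the reset */
}
```

This also matches why a bare EOI did not help: the EOI only talks to the interrupt controller, while the pending flag lives in the RTC and is cleared by the register C read.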
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: What is your longest bug?

Post by Korona »

The hardest bug I remember is not related to OS development. It was a subtle race condition at work. We have a program that solves a certain computationally hard problem. It does this by running multiple iterations of some algorithm and that algorithm itself is distributed over many compute nodes. In extremely rare cases (like once in a billion) it was possible for one of the compute nodes to prove that the whole problem was infeasible without taking into account all other computations. Thus the code looked like this:

Code:

while(problem_not_solved()) {
    /* Lots of precomputation, multiple code paths of MPI calls to set everything up. */

    while(any_work_left()) {
        if(do_work() == GLOBALLY_INFEASIBLE)
            break; /* leaves the local work queue non-empty */
    }
    wait_for_other_nodes();

    /* Multiple MPI code paths to collect results and postprocess them. */
}
The program would mostly work correctly but hang sometimes (like once per 500 runs or so). I spent days debugging the precomputation and postprocessing code and the actual do_work() procedure. Just to illustrate how difficult this code was to debug: it typically ran on ~160 cores concurrently (we had dual-socket Xeon nodes with 10 cores/socket), and the outer loop ran some thousands of times per invocation while the inner loop ran billions of times per invocation. In the end the problem was that if the break statement was executed, the work queue would not be emptied, which prevented wait_for_other_nodes() from completing. However, there was a load balancer that moved work between nodes, so the bug would usually go undetected because a node's work queue was still indirectly drained by other nodes. Unless all of them triggered the bug simultaneously! The fix was just to clear the local work queue before the break.
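
The shape of the fix can be sketched as a single-node toy model. This is an assumption-laden illustration, not the real program: there is no MPI and no load balancer here, and `struct work_queue`, `clear_work_queue`, `wait_would_complete`, `run_inner_loop`, and the `infeasible_at` trigger are all hypothetical names made up to show the early-exit path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum work_result { WORK_OK, GLOBALLY_INFEASIBLE };

struct work_queue {
    size_t pending;        /* items still queued on this node */
    size_t infeasible_at;  /* toy trigger: remaining count at which infeasibility is proven */
};

bool any_work_left(const struct work_queue *q)
{
    return q->pending > 0;
}

/* Consume one item and report whether it proved global infeasibility. */
enum work_result do_work(struct work_queue *q)
{
    assert(q->pending > 0);
    q->pending--;
    return q->pending == q->infeasible_at ? GLOBALLY_INFEASIBLE : WORK_OK;
}

void clear_work_queue(struct work_queue *q)
{
    q->pending = 0;
}

/* Stand-in for the barrier: it can only complete once the local queue is empty. */
bool wait_would_complete(const struct work_queue *q)
{
    return q->pending == 0;
}

/* Inner loop from the post; apply_fix selects the corrected early exit. */
void run_inner_loop(struct work_queue *q, bool apply_fix)
{
    while (any_work_left(q)) {
        if (do_work(q) == GLOBALLY_INFEASIBLE) {
            if (apply_fix)
                clear_work_queue(q);  /* drain before leaving the loop */
            break;
        }
    }
}
```

With `apply_fix` set to false, a queue that starts with 5 items and hits the trigger at 3 remaining is left non-empty, which is the state that blocked the barrier whenever no other node drained it.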
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
osdever
Member
Posts: 492
Joined: Fri Apr 03, 2015 9:41 am

Re: What is your longest bug?

Post by osdever »

U365 had the worst bug I ever encountered. It was a memory bug: memory was overlapping. My team and I were forced to rewrite about 1/5 of the whole OS: the memory manager, the file system... It was a nightmare. The bug was open for more than a year... Now it's finally fixed. Fully.
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
AMenard
Member
Posts: 67
Joined: Mon Aug 25, 2014 1:27 pm

Re: What is your longest bug?

Post by AMenard »

My longest bug?

It was a millipede... I called him Tony :-)