Strange multitasking bug

BMW · Post by **BMW** » Tue Dec 10, 2013 10:48 pm

I have been chasing this bug for days without success. When I run more than one process or thread, a GPF (general protection fault) with error code 0 is raised. The strange part is that the GPF occurs at a random time: the two usermode processes run fine for roughly 1-10 seconds before it happens.

Here are a few facts about the issue:

The GPF is generated at random times.
The GPF only occurs when running more than one process or thread - one process with one thread works fine.
The GPF occurs on the IRET instruction. According to the Intel manual, a GPF with error code 0 there means either "the return code or stack segment selector is NULL or the return instruction pointer is not within the return code segment limit."
If I set the timer frequency (which the scheduler uses) to something a lot lower (was 200 Hz, change to 20 Hz), it seems the GPF does not occur (at least not after ~30 seconds).
If I set the timer frequency to something a lot higher (was 200 Hz, change to 1000 Hz), I start getting page faults at varying EIPs and with varying error codes.

I cannot see why either of the error conditions described in the Intel manual would occur, except perhaps through stack or heap corruption. I will be investigating these two possibilities. The second error condition (the return instruction pointer is not within the return code segment limit) is unlikely, as all my segments span the full 4 GiB address space.

I have tried debugging by stepping through with GDB attached to QEMU, but I cannot reproduce the bug.

You can find the code for my OS on GitHub here, and the scheduler part is here.

Any insight would be much appreciated, as I am at my wits' end with this annoying bug.

bwat · Post by **bwat** » Wed Dec 11, 2013 2:11 am

I've got nothing specific for you, but here are three suggestions you might try if you're at a dead end.

1) Put a canary on your stack.
2) Perform a general sanity check before switching to a process/thread. Check that the registers are OK and that the canary is alive.
3) Log context switches with a context dump (any interesting registers and OS globals) of both the outgoing and the incoming process/thread. You can sometimes identify patterns with this.

Good luck.
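Suggestions 1 and 2 could look something like the sketch below. It is only an illustration: `struct thread`, `KSTACK_WORDS`, `STACK_CANARY` and both function names are made up for the example and are not taken from the OS in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_CANARY 0xDEADC0DEu /* arbitrary marker value (assumption) */
#define KSTACK_WORDS 1024u       /* hypothetical stack size in 32-bit words */

/* Hypothetical per-thread structure; a real kernel will have its own layout. */
struct thread {
    uint32_t kstack[KSTACK_WORDS]; /* stack grows down from the end of the array */
    uint32_t saved_esp_index;      /* index in kstack where the saved context starts */
};

/* Plant the canary at the lowest stack address, where an overflow lands first. */
void thread_init_canary(struct thread *t)
{
    assert(t != NULL);
    t->kstack[0] = STACK_CANARY;
}

/* Return 1 if the thread looks safe to switch to, 0 otherwise. */
int thread_is_sane(const struct thread *t)
{
    if (t == NULL)
        return 0;
    if (t->kstack[0] != STACK_CANARY)
        return 0; /* canary overwritten: overflow or a stray write */
    if (t->saved_esp_index == 0 || t->saved_esp_index >= KSTACK_WORDS)
        return 0; /* saved stack pointer is outside the usable stack */
    return 1;
}
```

The scheduler would call `thread_is_sane()` on the incoming thread just before switching, and halt or log when it returns 0.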

iansjack · Post by **iansjack** » Wed Dec 11, 2013 2:47 am

Write a GPF handler that just halts the processor. That way you can inspect the state of the stack when the error occurs. My guess would be that some runaway sequence of function calls/interrupts is leading to stack overflow. Your experiments would seem to indicate a flaw in the timer interrupt routine.

BMW · Post by **BMW** » Wed Dec 11, 2013 2:53 am

Ok, thank you everyone for the excellent suggestions.

@bwat I tried your idea of sanity checks, and have found what the problem is - CS = 0. Now I need to find why it is occurring.

@iansjack excellent idea of the GPF handler halting immediately, this will assist in the next step.

BMW · Post by **BMW** » Wed Dec 11, 2013 5:17 am

I decided to check whether the CS on the stack (the one that gets popped by iret) was 0. When it is found to be 0, I push 0xCCCCCCCC and halt the CPU so that I can perform a stack dump. This revealed that the portion of the stack used for the registers was all zero - see picture: (0xCCCCCCCC is what I pushed)
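A check of this kind might be written roughly as follows. This is a sketch under assumptions: `struct iret_frame` and the function names are hypothetical, the kernel is assumed to run in ring 0, and the frame layout is the one IRET pops in 32-bit protected mode when returning to a less privileged ring.

```c
#include <stdint.h>

/* Words popped by IRET when returning to an outer ring, lowest address first. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
};

/* A selector is null when its index and table-indicator bits are all zero;
 * the low two bits only hold the requested privilege level. */
int selector_is_null(uint32_t sel)
{
    return (sel & 0xFFFCu) == 0;
}

/* Return 1 if the frame can plausibly be handed to IRET, 0 otherwise. */
int iret_frame_is_sane(const struct iret_frame *f)
{
    if (selector_is_null(f->cs))
        return 0; /* the case seen in this thread: CS == 0 */
    /* SS is only popped when returning to a ring above 0 (assumes a ring 0 kernel). */
    if ((f->cs & 3u) != 0 && selector_is_null(f->ss))
        return 0;
    return 1;
}
```

Calling this on the frame just before `iret`, and halting when it fails, catches the corruption one step before the GPF would fire.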

I am stuck as to how to proceed from here. How can I find what caused the stack corruption?

iansjack · Post by **iansjack** » Wed Dec 11, 2013 5:28 am

That sort of error can be tough to locate, but it does look as if something is writing to the wrong area of memory.

Some debuggers allow you to set a breakpoint when a particular area of memory is written to, or even when it attains a particular value. In the past I have used AMD's SimNow (unfortunately, now only available for Linux) to track down errors like this. It's a bit of a pain for regular use as it is much slower than, say, QEMU, but it does allow debugging at that sort of level - I'm not sure that Bochs or QEMU/GDB allow this. Even when you find the area of code that is overwriting the stack, you probably have quite an audit trail to try and track down. It's a question of setting more and more breakpoints and carefully inspecting what is happening at each stage. But at least you will be working with a glimmer of light rather than completely in the dark. And then, suddenly, it will be obvious and you will think "Why didn't I think of that"; until the next bug....

bwat · Post by **bwat** » Wed Dec 11, 2013 5:40 am

Either do what iansjack said, or add more frequent sanity checks and keep narrowing it down until you catch the culprit: check at every context switch, every system call, every library call.

Is it always the same process/thread that gets its stack trashed?

BMW · Post by **BMW** » Wed Dec 11, 2013 3:09 pm

In timer_handler() I put a print and halt. It appears that timer_handler() is being called before interrupts are enabled?

If you look at my kmain() (https://github.com/BrettMW/lithiumos/bl ... el_entry.c) you can see that interrupts are enabled after the processes are added. However the print and halt code in the timer is executed before the processes are added. I have tried adding a disable_interrupts() at the very start of kmain, and it still happened.

EDIT: Ok, I've sorted out the issue of the timer being called before interrupts were enabled (load_gdt disabled and then re-enabled interrupts).

cyr1x · Post by **cyr1x** » Wed Dec 11, 2013 3:31 pm

I had a quick look at your source code and found something which might be an issue.
You don't initialize esp in the thread's 'regs' structure. The problem is that popad will pop esp and overwrite it with whatever value is in the structure.

Hope my guess helps
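For illustration, fully initializing a new thread's saved register block could look like the sketch below. Everything in it is an assumption: the `struct regs` layout (PUSHAD order followed by an IRET frame), the selector values 0x1B and 0x23, and the function name are hypothetical and not taken from the OS in question. Whether the saved `esp` slot is actually consumed depends on how the context is restored, but giving it a defined value costs nothing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical saved-context layout: PUSHAD order (lowest address first),
 * followed by the frame that IRET pops when returning to user mode. */
struct regs {
    uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax;
    uint32_t eip, cs, eflags, user_esp, ss;
};

#define USER_CS 0x1Bu        /* assumed ring 3 code selector */
#define USER_SS 0x23u        /* assumed ring 3 data/stack selector */
#define EFLAGS_INIT 0x202u   /* IF set, plus the always-one reserved bit 1 */

/* Give every field a defined value so nothing stale is ever restored. */
void regs_init_user(struct regs *r, uint32_t entry, uint32_t stack_top)
{
    memset(r, 0, sizeof *r);
    r->esp = stack_top;      /* the slot pointed out above */
    r->eip = entry;
    r->cs = USER_CS;
    r->eflags = EFLAGS_INIT;
    r->user_esp = stack_top;
    r->ss = USER_SS;
}
```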

BMW · Post by **BMW** » Wed Dec 11, 2013 4:59 pm

'tis fixed!

The vmmngr_switch_pdirectory() function disabled and then re-enabled interrupts, so interrupts were enabled in the middle of a context switch. This allowed a second context switch to start during the first one, causing all sorts of problems.
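A common way to avoid this class of bug is to save the previous interrupt state and restore it, rather than unconditionally re-enabling interrupts. The sketch below only models the idea so that it runs anywhere: `sim_interrupt_flag` stands in for the CPU's interrupt flag, and all names are hypothetical. In a real kernel the two helpers would wrap `pushf`/`cli` and `popf`.

```c
#include <stdbool.h>

/* Stand-in for the CPU interrupt flag (EFLAGS.IF); simulation only. */
bool sim_interrupt_flag = true;

typedef bool irq_state_t;

/* Disable interrupts and return the state they were in before. */
irq_state_t irq_save(void)
{
    irq_state_t old = sim_interrupt_flag;
    sim_interrupt_flag = false;
    return old;
}

/* Put interrupts back the way the caller found them. */
void irq_restore(irq_state_t old)
{
    sim_interrupt_flag = old;
}

/* Hypothetical stand-in for a function like vmmngr_switch_pdirectory(). */
void switch_pdirectory_sketch(void)
{
    irq_state_t state = irq_save();
    /* ... load the new page directory here ... */
    irq_restore(state); /* stays disabled if the caller had it disabled */
}
```

Called from inside a context switch (interrupts already off), the function leaves them off; called from ordinary code, it turns them back on.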

And thanks @cyr1x, I think that was a bug as well, I have fixed that.

Thank you @bwat and @iansjack, your debugging skills helped me a lot in this and I have learnt a lot.

thepowersgang · Post by **thepowersgang** » Wed Dec 11, 2013 11:02 pm

Your root problem there is binding context switches to interrupts.

BMW · Post by **BMW** » Wed Dec 11, 2013 11:18 pm

thepowersgang wrote:Your root problem there is binding context switches to interrupts

What do you mean?

EDIT: Do you mean this?

thepowersgang (in another thread) wrote:2. Task switching shouldn't be bound to the timer. Instead, write a generic 'reschedule' function that can be called by anything (say, in the idle loop of a task) and then once you get all that working, add a timer hook to force a switch if a task is taking too long.

thepowersgang · Post by **thepowersgang** » Wed Dec 11, 2013 11:25 pm

Yes (sorry for not mentioning that full explanation here). Modern systems are very IO-driven, and preemption is a (relatively) rare case. Binding task switches only to the timer means that lots of CPU time is wasted waiting for the timer to fire and de-schedule a waiting task.
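As a rough sketch of that structure (a generic `reschedule()` that anything can call, plus a timer hook that only forces a switch when a task overstays), something like the following could work. The task table, `QUANTUM_TICKS`, and the round-robin policy are assumptions made for the example, and the actual register and stack switching is left out.

```c
#define MAX_TASKS 4
#define QUANTUM_TICKS 5 /* hypothetical time slice, in timer ticks */

enum task_state { TASK_UNUSED, TASK_READY, TASK_BLOCKED };

struct task {
    enum task_state state;
};

struct task tasks[MAX_TASKS];
int current_task = 0;
int ticks_left = QUANTUM_TICKS;

/* Generic entry point: pick the next ready task, round-robin.
 * A real kernel would switch stacks and registers here. */
void reschedule(void)
{
    for (int i = 1; i <= MAX_TASKS; i++) {
        int candidate = (current_task + i) % MAX_TASKS;
        if (tasks[candidate].state == TASK_READY) {
            current_task = candidate;
            ticks_left = QUANTUM_TICKS;
            return;
        }
    }
    /* Nothing is ready: keep the current task (a real kernel would idle). */
}

/* Called when the running task has to wait, e.g. for I/O. */
void task_block(void)
{
    tasks[current_task].state = TASK_BLOCKED;
    reschedule(); /* give up the CPU immediately, no waiting for the timer */
}

/* Timer hook: only forces a switch when the time slice is used up. */
void timer_tick(void)
{
    if (--ticks_left <= 0)
        reschedule();
}
```

With this split, a blocking task hands over the CPU at once, and the timer acts only as a backstop against tasks that never block.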

bwat · Post by **bwat** » Thu Dec 12, 2013 3:50 am

thepowersgang wrote: Modern systems are very IO-driven, and preemption is a (relatively) rare case.

No! I/O is a source of many preemptions.

Consider this example:

Code:

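/* I/O bound: spends most of its time blocked. */
void high_prio_process_entrypoint(void)
{
  for(;;)
  {
    slow_blocking_io_operation();
  }
}

/* CPU bound: never blocks. */
void low_prio_process_entrypoint(void)
{
  unsigned int i = 0;  /* initialized and unsigned, so the endless increment is well defined */

  for(;;)
  {
    i++;
  }
}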

Imagine a system with only two processes executing the code above: one low priority process executing low_prio_process_entrypoint, and one high priority process executing high_prio_process_entrypoint. Now, with static priority scheduling, i.e. the priorities are fixed and are not changed by the scheduler, we can see that the only reason the low priority process executes is because of I/O. The blocking operation allows the lower priority process to run. When the high priority process is unblocked, it preempts the low priority process, as the scheduler identifies it as the highest priority process which is ready to run.

Consider a purely CPU bound system similar to the above example:

Code:

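/* CPU bound: never blocks. */
void high_prio_process_entrypoint(void)
{
  unsigned int i = 0;  /* initialized and unsigned, so the endless increment is well defined */

  for(;;)
  {
    i++;
  }
}

/* CPU bound: never blocks. */
void low_prio_process_entrypoint(void)
{
  unsigned int i = 0;

  for(;;)
  {
    i++;
  }
}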

Now there are no I/O operations as both processes are CPU bound, and there is no preemption.

You may say that this is an artificial example, and yes it is, but even in a real system with many I/O bound processes, the relative priority of these processes will mean that a higher priority process that unblocks will preempt a lower priority process that is running (at least the idle process). The very fact that the higher priority process was blocked means that a lower priority process was allowed to run when it moved into the blocked state.
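The selection rule described above can be sketched in a few lines. This is a minimal model under assumptions: `struct proc`, the state names and `pick_next()` are invented for the example, and a larger number means a higher priority.

```c
#include <stddef.h>

enum proc_state { PROC_READY, PROC_BLOCKED };

struct proc {
    int priority;          /* static: the scheduler never changes it */
    enum proc_state state;
};

/* Return the index of the highest priority ready process, or -1 if none. */
int pick_next(const struct proc *procs, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (procs[i].state != PROC_READY)
            continue;
        if (best < 0 || procs[i].priority > procs[best].priority)
            best = (int)i;
    }
    return best;
}
```

While the high priority process is blocked on I/O, `pick_next()` returns the low priority one; as soon as the I/O completes and the state goes back to `PROC_READY`, the next call returns the high priority process again, which is the preemption.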

iansjack · Post by **iansjack** » Thu Dec 12, 2013 4:06 am

Doesn't a context switch happen automatically when a process blocks (for whatever reason) in a well-designed OS? The idea of having to call it within an idle loop, or the user process calling it directly in some other way, smacks of Windows 3 to me rather than modern design. To my mind you have a timer-driven context switch as the norm (allowing you to closely control the scheduling algorithm) and then also context switch whenever a process is blocking (e.g. waiting for I/O, waiting for a timer to expire, or whatever). Relying on processes to trigger scheduling (with an override in the case of "long" delays) seems like anarchy.

OSDev.org
