Strange multitasking bug

BMW
Member
Posts: 286
Joined: Mon Nov 05, 2012 8:31 pm
Location: New Zealand

Strange multitasking bug

Post by BMW »

I have been chasing this bug for days without success. Whenever I run more than one process or thread, a GPF (general protection fault) with error code 0 is generated. The strange part is that the GPF occurs at a random time: the two user-mode processes run fine for roughly 1-10 seconds before it hits.

Here are a few facts about the issue:
  • The GPF is generated at random times.
  • The GPF only occurs when running more than one process or thread - one process with one thread works fine.
  • The GPF occurs on the IRET instruction. According to the Intel manual, a GPF with error code 0 on IRET means either "the return code or stack segment selector is NULL or the return instruction pointer is not within the return code segment limit."
  • If I set the timer frequency (which the scheduler uses) to something a lot lower (was 200 Hz, changed to 20 Hz), the GPF does not seem to occur (at least not within ~30 seconds).
  • If I set the timer frequency to something a lot higher (was 200 Hz, changed to 1000 Hz), I start getting page faults with varying EIPs and error codes.
I cannot see why either of the error conditions described in the Intel manual would occur, except perhaps through stack or heap corruption. I will be investigating these two possibilities. The second error condition (the return instruction pointer is not within the return code segment limit) is unlikely, as all my segments span the full 4 GiB address space.

I have tried debugging by stepping through with GDB attached to QEMU, but I cannot reproduce the bug.

You can find the code for my OS on github here and the scheduler part is here.

Any insight would be much appreciated, as I am at my wits' end with this annoying bug.
Currently developing Lithium OS (LiOS).

Recursive paging saves lives.
"I want to change the world, but they won't give me the source code."
bwat
Member
Posts: 359
Joined: Fri Jul 03, 2009 6:21 am

Re: Strange multitasking bug

Post by bwat »

I've got nothing specific for you, but here are three suggestions you might try if you're at a dead end.

1) Put a canary on your stack.
2) Perform a general sanity check before switching to a process/thread. Check that the registers are OK and that the canary is alive.
3) Log context switches with a context dump (any interesting registers and OS globals) of from and to processes/threads. You can sometimes identify patterns with this.
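
The three suggestions above could be sketched roughly as follows. This is only a minimal illustration, not code from LiOS: `struct thread`, `STACK_WORDS`, `STACK_CANARY`, `switch_log` and the function names are all made up for the example, and a real kernel would log far more state than a thread id and a stack pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS  1024u        /* hypothetical per-thread kernel stack size */
#define STACK_CANARY 0xDEADC0DEu  /* arbitrary marker value */
#define LOG_SLOTS    64u          /* size of the context-switch ring log */

/* Hypothetical thread control block; field names are illustrative only. */
struct thread {
    int       id;
    uint32_t  stack[STACK_WORDS]; /* stack grows down towards stack[0] */
    uintptr_t saved_sp;           /* stack pointer saved at the last switch */
};

struct switch_record { int from_id; int to_id; uintptr_t to_sp; };

static struct switch_record switch_log[LOG_SLOTS];
static unsigned switch_log_count;

/* 1) Plant the canary in the lowest word, the first one an overflow hits. */
static void thread_plant_canary(struct thread *t)
{
    assert(t != NULL);
    t->stack[0] = STACK_CANARY;
}

/* 2) Sanity check: canary intact and saved stack pointer inside this
 *    thread's own stack (above the canary, at most one past the end). */
static bool thread_is_sane(const struct thread *t)
{
    uintptr_t low  = (uintptr_t)&t->stack[1];
    uintptr_t high = (uintptr_t)&t->stack[STACK_WORDS];

    if (t->stack[0] != STACK_CANARY)
        return false;
    return t->saved_sp >= low && t->saved_sp <= high;
}

/* 3) Record each switch in a ring buffer that can be dumped after a fault. */
static void log_switch(const struct thread *from, const struct thread *to)
{
    struct switch_record *r = &switch_log[switch_log_count % LOG_SLOTS];

    r->from_id = from->id;
    r->to_id   = to->id;
    r->to_sp   = to->saved_sp;
    switch_log_count++;
}
```

The idea would be to call `thread_is_sane()` on the incoming thread just before every switch and halt on the first failure, so the log still shows the last few switches that led up to it.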

Good luck.
Every universe of discourse has its logical structure --- S. K. Langer.
iansjack
Member
Posts: 4711
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Strange multitasking bug

Post by iansjack »

Write a GPF handler that just halts the processor. That way you can inspect the state of the stack when the error occurs. My guess would be that some runaway sequence of function calls/interrupts is leading to stack overflow. Your experiments would seem to indicate a flaw in the timer interrupt routine.

Re: Strange multitasking bug

Post by BMW »

Ok, thank you everyone for the excellent suggestions.

@bwat I tried your idea of sanity checks and have found what the problem is: CS = 0. Now I need to find out why it is occurring.

@iansjack excellent idea of the GPF handler halting immediately, this will assist in the next step.

Re: Strange multitasking bug

Post by BMW »

I decided to check whether the CS value on the stack (the one that gets popped by iret) was 0. When it is 0, I push 0xCCCCCCCC and halt the CPU so that I can take a stack dump. The dump revealed that the portion of the stack used for the saved registers was all zero (0xCCCCCCCC is the marker I pushed):
[Image: stack dump]

I am stuck as to how to proceed from here. How can I find what caused the stack corruption?
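
For readers following along, the check described above might look something like this in C. It is a sketch under assumptions, not the actual LiOS code: `struct iret_frame` assumes a return to user mode on 32-bit x86 (so ESP and SS are on the stack as well), and the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Words popped by IRET when returning to user mode on 32-bit x86:
 * EIP, CS, EFLAGS, then ESP and SS because the privilege level changes. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t user_esp;
    uint32_t ss;
};

/* A selector is null when its index and table-indicator bits are zero.
 * Only the low 16 bits of the stack slot are meaningful. */
static bool selector_is_null(uint32_t slot)
{
    return (slot & 0xFFFCu) == 0;
}

/* Return true when neither CS nor SS in the frame is a null selector,
 * i.e. IRET would not raise #GP(0) for that particular reason. */
static bool iret_frame_selectors_ok(const struct iret_frame *f)
{
    assert(f != NULL);
    return !selector_is_null(f->cs) && !selector_is_null(f->ss);
}
```

Calling `iret_frame_selectors_ok()` right before the IRET, and halting when it returns false, stops the machine while the corrupted frame is still on the stack.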

Re: Strange multitasking bug

Post by iansjack »

That sort of error can be tough to locate, but it does look as if something is writing to the wrong area of memory.

Some debuggers allow you to set a breakpoint when a particular area of memory is written to, or even when it attains a particular value. In the past I have used AMD's SimNow (unfortunately, now only available for Linux) to track down errors like this. It's a bit of a pain for regular use as it is much slower than, say, qemu, but it does allow debugging at that sort of level - I'm not sure that Bochs or qemu/gdb allow this. Even when you find the area of code that is overwriting the stack, you probably have quite an audit trail to follow. It's a question of setting more and more breakpoints and carefully inspecting what is happening at each stage. But at least you will be working with a glimmer of light rather than completely in the dark. And then, suddenly, it will be obvious and you will think "Why didn't I think of that"; until the next bug....

Re: Strange multitasking bug

Post by bwat »

Either do what iansjack said, or add more frequent sanity checks and narrow it down further and further until you catch the culprit: check at every context switch, every system call, every library call.

Is it always the same process/thread that gets its stack trashed?

Re: Strange multitasking bug

Post by BMW »

In the timer_handler() I put a print and halt. It appears that the timer_handler is being called before interrupts are enabled????

If you look at my kmain() (https://github.com/BrettMW/lithiumos/bl ... el_entry.c) you can see that interrupts are enabled after the processes are added. However the print and halt code in the timer is executed before the processes are added. I have tried adding a disable_interrupts() at the very start of kmain, and it still happened.

EDIT: Ok, I've sorted out the issue of the timer firing before interrupts were enabled (load_gdt disabled and then re-enabled interrupts :evil: )
cyr1x
Member
Posts: 207
Joined: Tue Aug 21, 2007 1:41 am
Location: Germany

Re: Strange multitasking bug

Post by cyr1x »

I had a quick look at your source code and probably found something which might be an issue.
You don't initialize esp in the thread's 'regs' structure. The problem is that popad will pop esp and overwrite it with whatever value is in the structure.

Hope my guess helps [-o<
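
As a rough illustration of what a fully initialized register block could look like, here is a sketch of a PUSHAD-ordered structure. The struct and function names are hypothetical and are not taken from the LiOS 'regs' structure. One detail worth keeping in mind: according to the Intel manual, POPAD ignores the ESP value stored in the block and simply steps over that slot, so filling it in is mostly a matter of keeping the saved state well defined and easy to read in a dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Register block in the order POPAD reads it, lowest address first.
 * PUSHAD pushes EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI, so EDI ends up
 * at the lowest address. */
struct pushad_regs {
    uint32_t edi, esi, ebp;
    uint32_t esp;   /* slot exists, but POPAD skips it instead of loading ESP */
    uint32_t ebx, edx, ecx, eax;
};

static_assert(sizeof(struct pushad_regs) == 32,
              "a PUSHAD frame is eight dwords");
static_assert(offsetof(struct pushad_regs, esp) == 12,
              "the ESP slot is the fourth dword");

/* Hypothetical initializer: every new thread starts from a fully defined
 * register block rather than whatever the allocator left behind. */
static void pushad_regs_init(struct pushad_regs *r, uint32_t initial_esp)
{
    assert(r != NULL);
    memset(r, 0, sizeof *r);
    r->esp = initial_esp;
}
```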

Re: Strange multitasking bug

Post by BMW »

:D
'tis fixed!

The vmmngr_switch_pdirectory() function disabled and then re-enabled interrupts. That left interrupts enabled in the middle of a context switch, so a second context switch could start while the first was still in progress, causing all sorts of problems.
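
A common way to avoid this class of bug is to save and restore the previous interrupt state instead of enabling interrupts unconditionally. The sketch below only models the idea on a host: `interrupts_enabled` stands in for the CPU's interrupt flag, and `irq_save()`, `irq_restore()`, `switch_address_space()` and `context_switch()` are hypothetical names, not functions from LiOS.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the CPU interrupt flag so the sketch runs on a host.
 * A kernel would read EFLAGS.IF and use CLI/STI instead. */
static bool interrupts_enabled = true;

/* Disable interrupts and report whether they were enabled before. */
static bool irq_save(void)
{
    bool was_enabled = interrupts_enabled;
    interrupts_enabled = false;
    return was_enabled;
}

/* Put back the state captured by irq_save(); never enables unconditionally. */
static void irq_restore(bool was_enabled)
{
    interrupts_enabled = was_enabled;
}

/* Hypothetical helper in the spirit of vmmngr_switch_pdirectory(). */
static void switch_address_space(void)
{
    bool saved = irq_save();
    /* ... load the new page directory here ... */
    irq_restore(saved);   /* stays disabled if the caller had them disabled */
}

/* Hypothetical context switch that must run with interrupts off throughout. */
static void context_switch(void)
{
    bool saved = irq_save();

    switch_address_space();
    /* With a bare "enable interrupts" in the helper, this would fire. */
    assert(!interrupts_enabled);

    /* ... swap stacks and registers here ... */
    irq_restore(saved);
}
```

Because the helper restores rather than enables, it is safe to call both from code that runs with interrupts on and from inside a critical section.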

And thanks @cyr1x, I think that was a bug as well, I have fixed that.

Thank you @bwat and @iansjack, your debugging skills helped me a lot in this and I have learnt a lot.
thepowersgang
Member
Posts: 734
Joined: Tue Dec 25, 2007 6:03 am
Libera.chat IRC: thePowersGang
Location: Perth, Western Australia

Re: Strange multitasking bug

Post by thepowersgang »

Your root problem there is binding context switches to interrupts :)
Kernel Development, It's the brain surgery of programming.
Acess2 OS (c) | Tifflin OS (rust) | mrustc - Rust compiler
Currently Working on: mrustc

Re: Strange multitasking bug

Post by BMW »

thepowersgang wrote:Your root problem there is binding context switches to interrupts :)
What do you mean?

EDIT: Do you mean this?
thepowersgang (in another thread) wrote:2. Task switching shouldn't be bound to the timer. Instead, write a generic 'reschedule' function that can be called by anything (say, in the idle loop of a task) and then once you get all that working, add a timer hook to force a switch if a task is taking too long.

Re: Strange multitasking bug

Post by thepowersgang »

Yes (sorry for not mentioning that full explanation here). Modern systems are very IO-driven, and preemption is a (relatively) rare case. Binding task switches only to the timer means that lots of CPU time is wasted waiting for the timer to fire and de-schedule a waiting task.
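
To make the idea concrete, here is a minimal sketch of a generic reschedule function with the timer acting only as a backstop. Everything in it is an assumption made for illustration (the `tasks` table, `QUANTUM_TICKS`, the round-robin policy, the function names); a real scheduler would also switch stacks and address spaces.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS     4
#define QUANTUM_TICKS 10u   /* hypothetical time slice length in timer ticks */

enum task_state { TASK_UNUSED, TASK_READY, TASK_BLOCKED };

static enum task_state tasks[MAX_TASKS];
static size_t   current_task;
static unsigned ticks_left = QUANTUM_TICKS;

/* Pick the next READY task round-robin. Callable from anywhere: a blocking
 * call, an explicit yield, or the timer hook below. */
static void reschedule(void)
{
    ticks_left = QUANTUM_TICKS;     /* whoever runs next gets a fresh slice */
    for (size_t step = 1; step <= MAX_TASKS; step++) {
        size_t candidate = (current_task + step) % MAX_TASKS;
        if (tasks[candidate] == TASK_READY) {
            current_task = candidate;
            return;
        }
    }
    /* No READY task: stay put (a real kernel would run an idle task). */
}

/* Called by a task that is about to wait for I/O: give up the CPU now
 * instead of burning the rest of the time slice. */
static void block_current(void)
{
    tasks[current_task] = TASK_BLOCKED;
    reschedule();
}

/* Timer hook: only forces a switch once the quantum is used up. */
static void timer_tick(void)
{
    assert(ticks_left > 0);
    if (--ticks_left == 0)
        reschedule();
}
```

With this split, a task that blocks hands over the CPU immediately through `block_current()`, and `timer_tick()` only matters for tasks that never block.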

Re: Strange multitasking bug

Post by bwat »

thepowersgang wrote: Modern systems are very IO-driven, and preemption is a (relatively) rare case.
No! I/O is a source of many preemptions.

Consider this example:

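Code:

/* Assumed to be provided elsewhere: blocks the caller until slow I/O completes. */
void slow_blocking_io_operation(void);

void high_prio_process_entrypoint(void)
{
  for(;;)
  {
    slow_blocking_io_operation();
  }
}

void low_prio_process_entrypoint(void)
{
  /* volatile and unsigned: the loop is kept and wrap-around is well defined */
  volatile unsigned int i = 0;

  for(;;)
  {
    i++;
  }
}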
Imagine a system with only two processes executing the code above: one low priority process executing low_prio_process_entrypoint, and one high priority process executing high_prio_process_entrypoint. Now, with static priority scheduling, i.e. the priorities are fixed and are not changed by the scheduler, we can see that the only reason the low priority process executes is because of I/O. The blocking operation allows the lower priority process to run. When the high priority process is unblocked it preempts the low priority process, as the scheduler identifies it as the highest priority process which is ready to run.

Consider a purely CPU bound system similar to the above example:

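Code:

void high_prio_process_entrypoint(void)
{
  /* volatile and unsigned: the loop is kept and wrap-around is well defined */
  volatile unsigned int i = 0;

  for(;;)
  {
    i++;
  }
}

void low_prio_process_entrypoint(void)
{
  volatile unsigned int i = 0;

  for(;;)
  {
    i++;
  }
}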
Now there are no I/O operations as both processes are CPU bound, and there is no preemption.

You may say that this is an artificial example, and yes it is, but even in a real system with many I/O bound processes the relative priority of these processes means that a higher priority process that unblocks will preempt a lower priority process that is running (at least the idle process). The very fact that the higher priority process was blocked means that a lower priority process was allowed to run when it moved into the blocked state.
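
The scheduling decision in the two-process example can be sketched like this. It is a toy model under stated assumptions: `struct proc`, the convention that a larger number means higher priority, and both function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct proc {
    int  priority;   /* larger value means higher priority (an assumption) */
    bool ready;      /* false while blocked, e.g. waiting for I/O */
};

/* Static priority scheduling: return the index of the highest priority
 * ready process, or -1 if nothing is ready. */
static int pick_highest_ready(const struct proc *procs, size_t count)
{
    int best = -1;

    for (size_t i = 0; i < count; i++) {
        if (!procs[i].ready)
            continue;
        if (best < 0 || procs[i].priority > procs[best].priority)
            best = (int)i;
    }
    return best;
}

/* Called when I/O completes for process `woken`: mark it ready and report
 * whether the scheduler would now pick someone other than `running`. */
static bool unblock_preempts(struct proc *procs, size_t count,
                             size_t woken, int running)
{
    assert(woken < count);
    procs[woken].ready = true;
    return pick_highest_ready(procs, count) != running;
}
```

In the first example, the high priority process blocking is what lets the low priority one run, and its I/O completing is what preempts it again; in the CPU-bound example neither event ever happens.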

Re: Strange multitasking bug

Post by iansjack »

Doesn't a context switch happen automatically when a process blocks (for whatever reason) in a well-designed OS? The idea of having to call it within an idle loop, or the user process calling it directly in some other way, smacks of Windows 3 to me rather than modern design. To my mind you have a timer-driven context switch as the norm (allowing you to closely control the scheduling algorithm) and then also context switch whenever a process is blocking (e.g. waiting for I/O, waiting for a timer to expire, or whatever). Relying on processes to trigger scheduling (with an override in the case of "long" delays) seems like anarchy.