General Protection Fault in VirtualBox Only

Geometrian · Post by **Geometrian** » Mon Dec 03, 2012 12:01 am

Hi,

I am getting a GPF when running in VirtualBox. It happens almost immediately: as soon as there is a GDT and an IDT, it faults. I checked this by enabling and disabling interrupts at various points and then hanging.

However, the problem does not happen at all in Bochs. In fact, as far as I can tell, Bochs isn't reporting any problem of any kind.

What could be causing this?

Thanks,

JackScott · Post by **JackScott** » Mon Dec 03, 2012 12:16 am

It would help if you could provide us with links to your source code, a disk image we could run, and (if possible; I'm not sure whether VirtualBox does this) a file of debugging output. That would make things much easier all around.

xenos · Post by **xenos** » Mon Dec 03, 2012 1:13 am

VirtualBox has only very limited debugging possibilities - have a look at http://www.virtualbox.org/manual/ch12.html for some help.

Maybe you could explain in more detail how you figured out when exactly the GPF happens? And as JackScott already suggested, provide some code, at least the part where the GPF seems to happen.

Geometrian · Post by **Geometrian** » Mon Dec 03, 2012 3:21 am

Hi,

My OS is now hosted on Google Code: http://code.google.com/p/ianmallett-moss/source/checkout. There are a lot of configuration scripts that are pretty specific to my machine, but the source is current and the prebuilt build/disk_img.bin exhibits the problem in VirtualBox, but not in Bochs. Most of the comments are fairly current, or what's happening is obvious in context.

I had tried to track the problem down by enabling and disabling interrupts. In the most extreme test, as soon as I entered the kernel, I disabled interrupts immediately, set up the GDT and IDT, then reenabled them; the kernel immediately jumps to the designated ISR, printing a nice debug message on the screen telling me it's a GPF. So, no, I wasn't able to pinpoint the problem--as soon as the IDT existed, it was being used.

I don't know what exactly happens when no IDT is available (the CPU is using the IVT instead? --which does nothing? --maybe?) so the problem might be happening previously.

Thanks!

xenos · Post by **xenos** » Mon Dec 03, 2012 6:04 am

If you have a working GPF handler, you can use it to figure out the reason for the GPF, i.e., you can check the faulting instruction and the error code to see whether it was an internal / external event, access violation etc.

Geometrian · Post by **Geometrian** » Mon Dec 03, 2012 11:46 am

XenOS wrote:If you have a working GPF handler, you can use it to figure out the reason for the GPF, i.e, you can check the faulting instruction and the error code to see whether it was an internal / external event, access violation etc.

The error code is 5126 (decimal). Where can I find what that means?

xenos · Post by **xenos** » Mon Dec 03, 2012 11:50 am

Geometrian wrote:Where can I find what that means?

The Intel docs have a chapter devoted to interrupts - and explain the error code pretty well.

araxestroy · Post by **araxestroy** » Mon Dec 03, 2012 11:51 am

Geometrian wrote:The error code is 5126 (decimal). Where can I find what that means?

GPF error codes are the selector index in which the fault occurred plus some extra information.

Geometrian · Post by **Geometrian** » Mon Dec 03, 2012 12:23 pm

Blacklight wrote:GPF error codes are the selector index in which the fault occurred plus some extra information.

Hmmm, okay, so the error code is 5126 -> 0x1406 -> 0001010000000 11 0

-Internal exception (i.e. from the OS)
-Selector Index references a descriptor in the IDT
-Selector index is 0x0280 -> 640, though I think it might actually be LSB first: 0x0041-> 65

Brendan · Post by **Brendan** » Mon Dec 03, 2012 12:44 pm

Hi,

Geometrian wrote:
Blacklight wrote:GPF error codes are the selector index in which the fault occurred plus some extra information.
Hmmm, okay, so the error code is 5126 -> 0x1406 -> 0001010000000 11 0

-Internal exception (i.e. from the OS)
-Selector Index references a descriptor in the IDT
-Selector index is 0x0280 -> 640, though I think it might actually be LSB first: 0x0041-> 65

Erm, no. It's "little endian", which means the bytes on the stack would've been 0x06, 0x14, 0x00, 0x00; but it's still the 32-bit value 0x00001406, and you can't just swap one group of 4-bits with another in the hope it might make more sense.

Basically, the error code you got is impossible - either you got the wrong value from the stack, or the code you used to display it is buggy.
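To make the decoding above concrete, here is a minimal sketch in C of how a selector-format error code splits into its fields, following the layout in the Intel documentation (bit 0 EXT, bit 1 IDT, bit 2 TI, bits 3 to 15 index). The names `selector_error_t`, `le32`, `decode_selector_error` and `selector_error_is_plausible` are hypothetical and are not taken from the poster's kernel. The plausibility check only covers the one rule used in this thread: an index that refers to the IDT cannot exceed 255.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of a selector-format error code (hypothetical struct name). */
typedef struct {
    bool     ext;    /* bit 0: event was external to the program */
    bool     idt;    /* bit 1: index refers to a gate in the IDT */
    bool     ti;     /* bit 2: GDT (0) or LDT (1); only meaningful when idt is clear */
    uint16_t index;  /* bits 3..15: descriptor or gate index */
} selector_error_t;

/* Reassemble a 32-bit value from four bytes stored in little-endian order. */
uint32_t le32(const uint8_t b[4]) {
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Split an error code into its fields. */
selector_error_t decode_selector_error(uint32_t code) {
    selector_error_t e;
    e.ext   = (code & 0x1u) != 0;
    e.idt   = (code & 0x2u) != 0;
    e.ti    = (code & 0x4u) != 0;
    e.index = (uint16_t)((code >> 3) & 0x1FFFu);
    return e;
}

/* Rough sanity check: the IDT has at most 256 gates, so an IDT index
 * above 255 cannot come from the CPU. Other rules are not checked here. */
bool selector_error_is_plausible(uint32_t code) {
    selector_error_t e = decode_selector_error(code);
    return !(e.idt && e.index > 255);
}
```

With the bytes 0x06, 0x14, 0x00, 0x00 this gives 0x00001406, which decodes to EXT clear, IDT set and index 640. That is the reason the value cannot be a genuine error code.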

Cheers,

Brendan

Geometrian · Post by **Geometrian** » Mon Dec 03, 2012 2:23 pm

Brendan wrote:Basically, the error code you got is impossible - either you got the wrong value from the stack, or the code you used to display it is buggy.

I checked the display code with some hardcoded values--it works perfectly, so it's likely the former.

The ISRs' code is taken from an example:

Code:

;Stub for an ISR which does NOT pass its own error code (adds a dummy errcode byte)
%macro ISR_NOERRCODE 1
	[GLOBAL isr%1]
	isr%1:
		cli            ;Disable interrupts
		push  byte 0   ;Push a dummy error code
		push  byte %1  ;Push the interrupt number

		jmp  isr_common
%endmacro
;Stub for an ISR which passes its own error code
%macro ISR_ERRCODE 1
	[GLOBAL isr%1]
	isr%1:
		cli            ;Disable interrupts
		push  byte %1  ;Push the interrupt number

		jmp  isr_common
%endmacro

ISR_NOERRCODE  0
ISR_NOERRCODE  1
ISR_NOERRCODE  2
ISR_NOERRCODE  3
ISR_NOERRCODE  4
ISR_NOERRCODE  5
ISR_NOERRCODE  6
ISR_NOERRCODE  7
ISR_ERRCODE    8
ISR_NOERRCODE  9
ISR_ERRCODE   10
ISR_ERRCODE   11
ISR_ERRCODE   12
ISR_ERRCODE   13
ISR_ERRCODE   14
ISR_NOERRCODE 15
ISR_NOERRCODE 16
ISR_NOERRCODE 17
ISR_NOERRCODE 18
ISR_NOERRCODE 19
ISR_NOERRCODE 20
ISR_NOERRCODE 21
ISR_NOERRCODE 22
ISR_NOERRCODE 23
ISR_NOERRCODE 24
ISR_NOERRCODE 25
ISR_NOERRCODE 26
ISR_NOERRCODE 27
ISR_NOERRCODE 28
ISR_NOERRCODE 29
ISR_NOERRCODE 30
ISR_NOERRCODE 31

;This saves the processor state, sets up for kernel mode segments, calls the C-level fault handler, and finally restores the stack frame.
isr_common:
	pusha  ;Pushes edi,esi,ebp,esp,ebx,edx,ecx,eax

	mov   ax, ds  ;Lower 16-bits of eax = ds.
	push  eax    ;save the data segment descriptor

	mov  ax, 0x10  ;Load the kernel data segment descriptor
	mov  ds, ax
	mov  es, ax
	mov  fs, ax
	mov  gs, ax

	call  isr_handler
	;jmp $

	pop  ebx     ;Reload the original data segment descriptor
	mov  ds, bx
	mov  es, bx
	mov  fs, bx
	mov  gs, bx

	popa  ;Pops edi,esi,ebp...

	add  esp, 8  ;Cleans up the pushed error code and pushed ISR number

	sti

	iret  ;Pops EIP, CS and EFLAGS (plus ESP and SS only when returning to a different privilege level)

Code:

typedef struct registers {
	uint32 ds;                                     //Data segment selector
	uint32 edi, esi, ebp, esp, ebx, edx, ecx, eax; //Pushed by pusha
	uint32 int_no, err_code;                       //Interrupt number and error code (if applicable)
	uint32 eip, cs, eflags, useresp, ss;           //Pushed by the processor automatically
} registers_t;

extern "C" void isr_handler(registers_t regs) {
	CONSOLE::Console::draw(5,5,"received interrupt: ");
	CONSOLE::Console::draw(5,6,get_interrupt_description(regs.int_no));
	CONSOLE::Console::draw(5,7,regs.err_code);
	CONSOLE::Console::draw(5,8,5126u); //just a test to demonstrate the console (it works)
	CONSOLE::Console::draw(5,9,"DONE");
}

Through some experimentation, I have found that the error code sometimes varies from run to run--but always in the 5,000s.
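One thing that can be checked without a debugger is whether the `registers_t` layout really matches the order in which the stub pushes things. The following is a minimal sketch, assuming 32-bit protected mode and that `uint32` is a plain 32-bit unsigned type (`uint32_t` is substituted here). The expected offsets are derived from the push order in `isr_common` above; the static assertions are an addition and are not part of the posted code.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct registers {
    uint32_t ds;                                     /* pushed last by isr_common */
    uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax; /* pushed by pusha */
    uint32_t int_no, err_code;                       /* pushed by the stub (and by the CPU for some exceptions) */
    uint32_t eip, cs, eflags;                        /* always pushed by the CPU */
    uint32_t useresp, ss;                            /* only pushed by the CPU on a privilege-level change */
} registers_t;

/* The last value pushed sits at the lowest address, so ds comes first. */
_Static_assert(offsetof(registers_t, ds)       == 0,  "ds must be at offset 0");
_Static_assert(offsetof(registers_t, edi)      == 4,  "pusha block starts after ds");
_Static_assert(offsetof(registers_t, eax)      == 32, "eax is the first register pusha pushes");
_Static_assert(offsetof(registers_t, int_no)   == 36, "interrupt number follows the pusha block");
_Static_assert(offsetof(registers_t, err_code) == 40, "error code sits just above int_no");
_Static_assert(offsetof(registers_t, eip)      == 44, "CPU-pushed frame starts at eip");
_Static_assert(sizeof(registers_t)             == 64, "sixteen 32-bit slots, no padding");
```

If the layout is right and `err_code` still looks wrong, the value on the stack itself is suspect. One possibility is that no error code was pushed for that vector, in which case the slot would hold whatever the CPU pushed next.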

Brendan · Post by **Brendan** » Mon Dec 03, 2012 7:20 pm

Hi,

You're going to need to learn how to debug.

Start by getting hold of a decent emulator with a good debugger (e.g. Bochs with its inbuilt debugger enabled); and put a breakpoint or something (even just "jmp $") as the first instruction in those macros that create the interrupt stubs. Once you've done that the OS will stop when any interrupt occurs, and you can use the debugger to examine the raw data that the CPU put on the interrupt handler's stack (before any of your normal interrupt handling has a chance to mess any of it up).

This alone should tell you exactly what causes the first interrupt/exception.
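When reading raw values off the stack it helps to print them in hexadecimal, because the bit fields of an error code or selector are hard to see in decimal. A minimal freestanding sketch follows; `format_hex32` is a hypothetical helper name, and how the resulting string reaches the screen depends on your own console code.

```c
#include <stdint.h>

/* Writes "0x" followed by eight hex digits and a terminating NUL.
 * The caller must supply a buffer of at least 11 bytes. */
void format_hex32(uint32_t value, char out[11]) {
    static const char digits[] = "0123456789ABCDEF";
    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 8; ++i) {
        /* Most significant nibble first. */
        out[2 + i] = digits[(value >> (28 - 4 * i)) & 0xFu];
    }
    out[10] = '\0';
}
```

For example, the decimal value 5126 comes out as the string 0x00001406.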

Next; disable all optimisation and see if the problem goes away (hopefully it won't). Then you want to disassemble the code (especially for the "isr_handler()" function) to use as a reference; and single-step through the code one instruction at a time (from the start of the interrupt stub until you get all the way back to the IRET) with the debugger while checking *everything* contains what you think it should at each step. In general, the more complex a high level language and compiler is the more likely it is that it's doing something you wouldn't have expected.

Alternatively

Consider the following code:

Code:

void foo1(void) {
    bork(1);
}

void foo2(void) {
    bork(2);
}

void foo3(void) {
    bork(3);
}

void bork(int number) {
    switch(number) {
    case 1:
        bar1();
        break;
    case 2:
        bar2();
        break;
    case 3:
        bar3();
        break;
    }
}

void bar1(void) {
    // Special code specifically designed to handle the first case
}

void bar2(void) {
    // Special code specifically designed to handle the second case
}

void bar3(void) {
    // Special code specifically designed to handle the third case
}

Is this code brilliant, or a retarded joke? Can you think of a way to "optimise out" the entirely stupid "switch(number) { ... }" and the entire "bork()" function?

Now think about what your stubs and "common" interrupt handler will actually need to do once you start having different behaviour for different exception handlers. For example, the debugging exception and breakpoint exception will eventually send something to a debugger (e.g. GDB); the general protection fault handler (and probably a few others) might send signals to the process that caused the problem; the page fault handler is going to have a whole pile of virtual memory management stuff banged into it (things like "copy on write", swap space support, memory mapped files, etc); the invalid opcode exception handler is going to have a lot of code to determine what the instruction was and emulate it (so that code designed for a more recent CPU still runs on older CPUs). The NMI, double fault and machine check exceptions are going to need some very special handling. The device not available exception might contain support for "delayed FPU/MMX/SSE state save" logic. Then you're probably going to have radically different interrupt handlers for things like the kernel API, and spurious IRQs, and the scheduler's timer, and maybe IPIs (Inter-Processor Interrupts sent from other CPUs), and maybe performance monitoring, and maybe thermal status.

Also note that some of those interrupt stubs will want to use "trap gates", and some will want "interrupt gates" (and some might want "task gates"). The interrupt stub for the page fault handler should save CR2 as soon as possible (in case a second page fault occurs and trashes the first page fault's CR2). When you start looking at adding support for debuggers you'll realise that half of them (those corresponding to "fault class" exceptions) will want to clear the RF flag to avoid issues. So...

Is the idea of having a "common interrupt handler" any less idiotic than the "bork()" function in my example above?

There will be no common code in your "common interrupt handler", except for maybe some kernel panic code that is nowhere near the critical path. Unless you're only writing a tutorial (and therefore don't have a reason to care if the code is sane or not as long as it helps explain things); there's no point having a "common interrupt handler" for anything other than actual IRQs. Instead, just have a "kernel panic" function that anything (including exception handlers) can call; and keep all of the very different exception handlers (and other interrupts) separate. Note: if your kernel detects something "impossible" (for a simple example, maybe it's an attempt to release a re-entrancy lock that hasn't been acquired) then it could call the "kernel panic" function even though no exception (or interrupt) was involved at all.
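As a rough illustration of that structure, here is a hosted sketch in C with separate handlers that share nothing except a panic routine. All names (`kernel_panic`, `general_protection_handler`, `page_fault_handler`, `release_lock`, `panic_count`, `last_panic_reason`) are hypothetical. In a real kernel `kernel_panic` would print diagnostics and halt; here it only records the reason so that the sketch can run as an ordinary program.

```c
#include <stdint.h>

/* Recorded instead of halting, purely so the sketch runs hosted. */
const char *last_panic_reason = 0;
int panic_count = 0;

/* Shared by anything that detects an unrecoverable condition. */
void kernel_panic(const char *reason) {
    last_panic_reason = reason;
    ++panic_count;
}

/* Each exception gets its own handler; at this stage both just panic,
 * but they are free to diverge later without touching each other. */
void general_protection_handler(uint32_t error_code) {
    (void)error_code;  /* a fuller handler would decode and report this */
    kernel_panic("general protection fault");
}

void page_fault_handler(uint32_t error_code, uint32_t cr2) {
    (void)error_code;
    (void)cr2;         /* faulting address, saved early by the stub */
    kernel_panic("page fault");
}

/* A panic that involves no exception at all: an "impossible" state. */
void release_lock(int *held) {
    if (!*held) {
        kernel_panic("released a lock that was not held");
        return;
    }
    *held = 0;
}
```

Each IDT entry would then point at a stub that calls only its own handler, so no dispatch on the interrupt number is needed.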

As a general rule of thumb; it's a waste of time fixing code if that code needs to be redesigned/rewritten anyway. Is your exception handling code worth fixing?

Cheers,

Brendan

linguofreak · Post by **linguofreak** » Tue Dec 04, 2012 11:30 am

Brendan wrote:Hi,

You're going to need to learn how to debug.

Start by getting hold of a decent emulator with a good debugger (e.g. Bochs with its inbuilt debugger enabled);

The OP has already stated that the problem doesn't show up in Bochs. The problem only occurs in Virtual Box (see the thread title), so he's limited to the facilities available there in diagnosing it.

Brendan · Post by **Brendan** » Tue Dec 04, 2012 11:47 am

Hi,

linguofreak wrote:
Brendan wrote:Start by getting hold of a decent emulator with a good debugger (e.g. Bochs with its inbuilt debugger enabled);
The OP has already stated that the problem doesn't show up in Bochs. The problem only occurs in Virtual Box (see the thread title), so he's limited to the facilities available there in diagnosing it.

VirtualBox has its own inbuilt debugger, and also lets you attach GDB to it.

Cheers,

Brendan

Geometrian · Post by **Geometrian** » Tue Dec 04, 2012 5:43 pm

Brendan wrote:Consider the following code:

...

Is this code brilliant, or a retarded joke? Can you think of a way to "optimise out" the entirely stupid "switch(number) { ... }" and the entire "bork()" function?

Now think about what your stubs and "common" interrupt handler will actually need to do once you start having different behaviour for different exception handlers. . . . Is the idea of having a "common interrupt handler" any less idiotic than the "bork()" function in my example above?

I realize that a common exception handler isn't necessarily a good design in a mature OS, but I would like to point out that I don't have a file system, protected mode disk IO, processes, a full C library, dynamic memory, let alone a GUI. I can barely get keyboard input. It does not make sense to spend time overengineering a collection of industry-quality individualized interrupt handlers when I can't even get a trivial example to work properly!

And while I do grasp that it's important to not misdesign something from the beginning, you'll notice that there aren't any functions analogous to your bari()--I am deliberately redirecting all interrupts to the same place because I want to print the same kind of diagnostic information about each. And while I got that idea from a tutorial, I think as a design principle, at this stage, it's fundamentally sound and actually ideal for learning what's happening.

Brendan wrote:Start by getting hold of a decent emulator with a good debugger (e.g. Bochs with its inbuilt debugger enabled);

linguofreak wrote:The OP has already stated that the problem doesn't show up in Bochs. The problem only occurs in Virtual Box (see the thread title)

Thank you.

Brendan wrote:. . . and put a breakpoint or something (even just "jmp $") as the first instruction in those macros that create the interrupt stubs. Once you've done that the OS will stop when any interrupt occurs, and you can use the debugger to examine the raw data that the CPU put on the interrupt handler's stack (before any of your normal interrupt handling has a chance to mess any of it up).

This is a good idea. I had hoped someone might immediately spot the problem a priori, but if no one has any other suggestions, I will try attacking VirtualBox's debugging facilities again.

Thanks,

OSDev.org
