Triple fault when enabling paging

Post by IRBMe »

I return again with yet another problem, after battling with paging all day long.

I followed this tutorial today:

"Implementing Basic Paging" - http://www.osdever.net/tutorials/paging.php?the_id=43

The first error took me hours of debugging to figure out. What the author doesn't mention in his tutorial is that if you use the addresses he suggests and follow his code, you end up bumping into VRAM! So when you try to read back the page table address, you get 0xffffffff (on Bochs) or 0x00000000 (on VMware)!

So I tried relocating my page tables to the 1MB boundary (0x100000) and that seemed to work, in the sense that the page table entries now look correct (0x0003, 0x1003, 0x2003, 0x3003 etc). The page directory contains 0x101003 as its first entry, and 0x0 (not-present) for all the other entries.

Following the tutorial, I set up paging to map the first 4MB of memory to their real physical addresses. If I understand correctly, I should then be able to use the first 4MB of memory with paging turned on without getting any page faults. This was a bit of a major assumption since I haven't yet set up any interrupt tables or interrupt handlers.
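A minimal sketch of the identity mapping described above, written so that it compiles on an ordinary host: the tables are plain arrays passed in by the caller rather than fixed physical addresses. The names `build_identity_map`, `PAGE_PRESENT` and `PAGE_WRITABLE` are illustrative and not taken from the tutorial, and `table_phys` stands for the physical address where the page table would live (0x101000 in this post).

```c
#include <stdint.h>

#define ENTRIES        1024u   /* entries per page directory or page table */
#define PAGE_SIZE      4096u
#define PAGE_PRESENT   0x1u    /* bit 0: entry is present */
#define PAGE_WRITABLE  0x2u    /* bit 1: read/write */

/* Fill one page table so that virtual page i maps to physical page i
 * (covering the first 4MB), then point directory entry 0 at that table.
 * Every other directory entry is left not-present. */
void build_identity_map(uint32_t *dir, uint32_t *table, uint32_t table_phys)
{
    for (uint32_t i = 0; i < ENTRIES; i++) {
        table[i] = (i * PAGE_SIZE) | PAGE_PRESENT | PAGE_WRITABLE;
        dir[i]   = PAGE_WRITABLE;            /* present bit clear */
    }
    dir[0] = table_phys | PAGE_PRESENT | PAGE_WRITABLE;
}
```

In a real kernel `dir` and `table` would be pointers to the page-aligned physical locations themselves, and those locations must be covered by the mapping if they are to be touched after paging is on.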

However, right after turning on paging, the CPU triple faults with a page fault, which I didn't expect since I'm not accessing anything beyond the first 4MB.

The message in bochs is this:
"Message: exception(): 3rd (14) exception with no resolution"

I thought maybe the A20 line hadn't been enabled properly, but I verified this by writing a byte to 0x100000 then reading 0x000000 and checking that they were not the same - they weren't so the A20 line seems fine.

Could anybody verify that this shouldn't happen, and if so, possibly suggest why it might?

My code is attached.

Post by IRBMe »

Reasons for a page fault:

(1) You try to access a page from user mode, which is marked as supervisor.

- All my pages have the supervisor bit set, but I'm the kernel and I don't even have such a thing as user mode yet. So this can't be the problem.

(2) You try to write to a read only page.

- All my pages are set to read/write access. So this can't be the problem either.

(3) You try to access a page marked as "not present".

- I have my entire first page table (the first 4MB) marked as present, and I don't access memory above the 4MB mark. So I fail to see why this would be a problem either.
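The three cases above correspond to the low three bits of the error code the CPU pushes for exception 14, so a page fault handler (once one exists) can tell them apart. The following is only a sketch: the type `pf_cause` and the function `decode_page_fault` are hypothetical names, and it ignores the higher error-code bits.

```c
#include <stdint.h>

/* Low bits of the error code pushed for exception 14 (page fault). */
#define PF_ERR_PRESENT 0x1u  /* 0 = page not present, 1 = protection violation */
#define PF_ERR_WRITE   0x2u  /* 0 = read access,      1 = write access */
#define PF_ERR_USER    0x4u  /* 0 = supervisor mode,  1 = user mode */

typedef struct {
    int not_present;  /* case (3): the page was marked not present */
    int was_write;    /* the faulting access was a write */
    int from_user;    /* the faulting access came from user mode */
} pf_cause;

pf_cause decode_page_fault(uint32_t error_code)
{
    pf_cause c;
    c.not_present = (error_code & PF_ERR_PRESENT) == 0;
    c.was_write   = (error_code & PF_ERR_WRITE) != 0;
    c.from_user   = (error_code & PF_ERR_USER) != 0;
    return c;
}
```

An error code of 0, for example, decodes as a supervisor-mode read of a not-present page.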


So my question is, why am I getting a page fault? If you need more source other than the paging code attached in the above message, feel free to ask and I can upload the rest (there isn't really much more)

Post by Brendan »

Hi,

What does Bochs say (at the end of bochsout.txt), especially the value in CR2?

Also you could halt the CPU just before paging is enabled and use Bochs debugger to examine the physical memory (and registers like CR3) to make sure everything is correct.

I'd start with CR3 and then check the page directory, then the page table (walk through it the same as the CPU would).

If you get a page fault immediately after your "movl %0, %%cr0" then I'd guess that the CPU can't read your code anymore.
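The walk suggested above (CR3, then the page directory, then the page table) can be sketched in C against a simulated copy of physical memory. This assumes plain 32-bit paging with 4KB pages (no PAE, no 4MB pages); `walk_page_tables` and the `mem` array are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define PAGE_PRESENT 0x1u
#define FRAME_MASK   0xFFFFF000u

/* `mem` stands in for physical memory starting at address 0 and holding
 * `mem_bytes` bytes. Returns 1 and stores the physical address in *phys
 * when both levels are present; returns 0 when the CPU would fault
 * (or when an entry lies outside the simulated memory). */
int walk_page_tables(const uint32_t *mem, uint32_t mem_bytes,
                     uint32_t cr3, uint32_t vaddr, uint32_t *phys)
{
    /* Level 1: bits 31..22 of the address index the page directory. */
    uint32_t pde_addr = (cr3 & FRAME_MASK) + (vaddr >> 22) * 4u;
    if (mem_bytes < 4u || pde_addr > mem_bytes - 4u)
        return 0;
    uint32_t pde = mem[pde_addr / 4u];
    if (!(pde & PAGE_PRESENT))
        return 0;

    /* Level 2: bits 21..12 index the page table named by the PDE. */
    uint32_t pte_addr = (pde & FRAME_MASK) + ((vaddr >> 12) & 0x3FFu) * 4u;
    if (pte_addr > mem_bytes - 4u)
        return 0;
    uint32_t pte = mem[pte_addr / 4u];
    if (!(pte & PAGE_PRESENT))
        return 0;

    /* Bits 11..0 are the offset within the 4KB frame. */
    *phys = (pte & FRAME_MASK) | (vaddr & 0xFFFu);
    return 1;
}
```

Running the same steps by hand in the Bochs debugger amounts to reading the entry at each of the two computed addresses and checking its present bit.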


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by Pype.Clicker »

Just a thought ... is your data segment zero-based? If not, the address you use to program the tables and the address you load into CR3 are *not* the same ...

Post by IRBMe »

Well I wrote debugging code which does the following just before turning on paging:

Writes out the first 3 entries of the page directory to the screen.

Writes out the first 3 entries of the page table to screen.

Tries setting the cr3 register and writing out its value to the screen.

All of the values seem perfectly fine:

Page Directory: 0x101003 0x000002 0x000002
Page Table: 0x000003 0x001003 0x002003
CR3 Register: 0x100000

I'm not quite sure how I would use the bochs debugger to debug this. Perhaps set a "watch read" on 0x100000 or something? I dunno.

From bochsout I have:

Code:

00001310831p[CPU  ] >>PANIC<< exception(): 3rd (14) exception with no resolution
00001310831i[SYS  ] Last time is 1092310929
00001310831i[CPU  ] protected mode
00001310831i[CPU  ] CS.d_b = 32 bit
00001310831i[CPU  ] SS.d_b = 32 bit
00001310831i[CPU  ] | EAX=e0000011  EBX=60000011  ECX=00000520  EDX=0000000f
00001310831i[CPU  ] | ESP=0008ffd8  EBP=0008ffd8  ESI=0000069a  EDI=00000005
00001310831i[CPU  ] | IOPL=0 NV UP DI NG NZ NA PE NC
00001310831i[CPU  ] | SEG selector     base    limit G D
00001310831i[CPU  ] | SEG sltr(index|ti|rpl)     base    limit G D
00001310831i[CPU  ] |  DS:0010( 0002| 0|  0) 00000000 000fffff 1 1
00001310831i[CPU  ] |  ES:0010( 0002| 0|  0) 00000000 000fffff 1 1
00001310831i[CPU  ] |  FS:0010( 0002| 0|  0) 00000000 000fffff 1 1
00001310831i[CPU  ] |  GS:0010( 0002| 0|  0) 00000000 000fffff 1 1
00001310831i[CPU  ] |  SS:0010( 0002| 0|  0) 00000000 000fffff 1 1
00001310831i[CPU  ] |  CS:0008( 0001| 0|  0) 00000000 000fffff 1 1
00001310831i[CPU  ] | EIP=000100d7 (000100d7)
00001310831i[CPU  ] | CR0=0xe0000011 CR1=0x00000000 CR2=0x00000040
00001310831i[CPU  ] | CR3=0x00100000 CR4=0x00000000
00001310831i[     ] restoring default signal behavior
00001310831i[CTRL ] quit_sim called with exit code 1

So at least I know now it's not something obvious and stupid I've done, otherwise I guess you guys (and the various other people on IRC etc) would have spotted it by now.

I think what I might have to do is go learn interrupt handling today and set up an interrupt handler for the page fault. Then I can see more details, like what caused it (not present? not enough privileges? no write access?).

Although, that CR2=0x00000040 looks strange to me. That would be the address which was accessed that caused the page fault right? hm...strange since all my code is loaded well above that.

Well I guess I'll go learn interrupt handling now. Of course if anybody has any more ideas, please share.

And thanks for at least taking a look.

Post by Brendan »

Hi,
IRBMe wrote: I'm not quite sure how I would use the bochs debugger to debug this. Perhaps set a "watch read" on 0x100000 or something? I dunno.
First you need to stop execution at the right spot. Putting a "for(;;)" just before turning paging on with "WriteCR0()" will do. Then go to the console screen and press "control + c" to stop Bochs inside the endless loop (you'll get a command prompt from Bochs). Now, try "info cpu" and see what it says for CR3. Then try "x /32 0x100000", which will (hopefully) display your page directory contents, followed by "x /32 0x101000" for the first page table.

If all that looks good, shift the endless loop to just after paging is enabled and see if Bochs crashes or not.
IRBMe wrote: So at least I know now it's not something obvious and stupid I've done, otherwise I guess you guys (and the various other people on IRC etc) would have spotted it by now.
I wouldn't have noticed much - I only had a brief look (it's written in C & I'm an assembly programmer). I'm also starting to realise that it's better to teach someone how to fix their own bugs, rather than giving an answer like "change <something> on line <something> to <something> and it'll work"... There's a saying - something like "give a man a fish and he'll eat for a day, but teach a man to catch fish and he'll eat until he's old and senile..." :)
IRBMe wrote: Although, that CR2=0x00000040 looks strange to me. That would be the address which was accessed that caused the page fault right? hm...strange since all my code is loaded well above that.
It is strange and it's also probably wrong. Bochs gave me the same CR2=0x00000040 when I completely messed up PAE paging (which I found out Bochs only pretends to support, after much confusion). I'd ignore CR2 as I think it's a bug in Bochs. A normal page fault is handled by Bochs correctly though, so I'm going to guess that something's fairly messed up (I'm still thinking the instruction after the "movl %0, %%cr0" can't be accessed after paging is initialized).
IRBMe wrote: I think what I might have to do is go learn interrupt handling today and set up an interrupt handler for the page fault. Then I can see more details, like what caused it (not present? not enough privileges? no write access?).
If I'm right and the CPU/Bochs can't access your code when paging is enabled, then it probably won't be able to access your exception handlers and/or stack either. In this case very good exception handlers still won't help.

To me everything I've seen so far looks good (apart from the minor/unrelated "CR0=0xe0000011" indicating that the CPU caches are disabled). IMHO there's only 3 ways to find your bug. The first way is to step through all your paging structures with Bochs (as described above). The next way is to find something wrong in the C source (which no-one's been able to do so far). The other way is to go through a disassembly of your binary with pen, paper and magnifying glass. This last method will find problems caused by the compiler (e.g. compiler bugs or unexpected padding/alignment of unsigned longs).
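For reference, the remark about "CR0=0xe0000011" can be checked by picking the value apart bit by bit. The bit positions below are the architectural ones; the helper names `cr0_caches_disabled` and `cr0_enable_caches` are hypothetical, and this sketch only manipulates the value without touching the register.

```c
#include <stdint.h>

#define CR0_PE (1u << 0)    /* protected mode enable */
#define CR0_ET (1u << 4)    /* extension type */
#define CR0_NW (1u << 29)   /* not write-through */
#define CR0_CD (1u << 30)   /* cache disable */
#define CR0_PG (1u << 31)   /* paging enable */

/* Nonzero when the cache-disable bit is set in the given CR0 value. */
int cr0_caches_disabled(uint32_t cr0)
{
    return (cr0 & CR0_CD) != 0;
}

/* Returns the CR0 value with CD and NW cleared, other bits unchanged. */
uint32_t cr0_enable_caches(uint32_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}
```

With the value from the log, 0xe0000011 breaks down into PG, CD, NW, ET and PE; clearing CD and NW would give 0x80000011.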


Cheers,

Brendan

Post by IRBMe »

Well I think I've found out a good bit more about the problem.

I have these 2 global variables:

Code:

unsigned long *page_directory = (unsigned long *) PAGE_DIRECTORY_ADDRESS;
unsigned long *page_table = (unsigned long *) PAGE_TABLE_ADDRESS;

with PAGE_DIRECTORY_ADDRESS = 0x100000 and PAGE_TABLE_ADDRESS = 0x101000.

Now, when I write to page_directory like so:

Code:

page_directory[0] = 0x1234;
page_directory[1] = 0xabcd;

...then I examine memory at 0x100000. And sure enough, there's 0x1234 0xabcd. OK, great!

Now I repeat for page_table:

Code:

page_table[0] = 0x1234;
page_table[1] = 0xabcd;

...then I examine memory at 0x101000. And... uh oh! It's 0x0000 0x0000!

If I read the values back with code like so:

Code:

value = page_table[0];
value2 = page_table[1];

Then sure enough, value and value2 have the original values I assigned: 0x1234 and 0xabcd respectively.

Something's not right there!

Furthermore, if I create a LOCAL variable and write to that:

Code:

unsigned long *p = (unsigned long *) PAGE_TABLE_ADDRESS;

p[0] = 0x1234;
p[1] = 0xabcd;

Then I examine memory at 0x101000, and sure enough there's 0x1234 and 0xabcd!

If, after doing that, I read back my page_table variable again:

Code:

value = page_table[0];
value2 = page_table[1];

They still contain 0!

So it seems that the global variable:

Code:

unsigned long * page_table = (unsigned long *) PAGE_TABLE_ADDRESS;

is NOT the same as the local variable:

Code:

unsigned long * p = (unsigned long *) PAGE_TABLE_ADDRESS;

Yet it seems to work OK with the "page_directory" global variable.

This is the root of my page table problem. Due to that weird error, my page_table variable (which should point to 0x101000) contains the correct page table. But the actual address 0x101000, which the first page directory entry points to, is full of 0's!

Now what in the heck could cause that?!

Post by Pype.Clicker »

Okay, so your very first page is not present (which is better if you want to catch null pointers) and 0x40 is on that page, so I suppose some null or misinitialized pointer is being used here (possibly with an offset).

The real question is "how far is EIP=000100d7 (000100d7) from the paging initialization thing".

Indeed, objdump -drS yourkernel.o might be the best option for you ...
Then look where you are (in the ASM and in the C source) and *then* wonder how you get there...

Post by IRBMe »

hahaha I can't f***ing believe it!

Broken code:

Code:

unsigned long *page_directory   = (unsigned long *) PAGE_DIRECTORY_ADDRESS;
unsigned long *page_table      = (unsigned long *) PAGE_TABLE_ADDRESS;

Fixed code:

Code:

unsigned long *page_directory   = (unsigned long *) (PAGE_DIRECTORY_ADDRESS);
unsigned long *page_table      = (unsigned long *) (PAGE_TABLE_ADDRESS);

I forgot the brackets around the defines, so it turned into:

Code:

unsigned long *page_table      = (unsigned long *) PAGE_DIRECTORY_ADDRESS + PAGE_DIRECTORY_SIZE;

instead of

Code:

unsigned long *page_table      = (unsigned long *) (PAGE_DIRECTORY_ADDRESS + PAGE_DIRECTORY_SIZE);

So, it was more of a glaringly obvious error than I thought (only not that obvious really).

*still laughing in disbelief*

Thanks for the help though. ;)
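To spell out why the missing brackets matter: a cast binds more tightly than "+", so the unbracketed form first converts the directory address to a pointer and then adds PAGE_DIRECTORY_SIZE elements, not bytes. Assuming PAGE_DIRECTORY_SIZE is 0x1000 and unsigned long is 4 bytes on the target, that would put page_table at 0x104000 rather than 0x101000. The sketch below reproduces the effect on a host; the `arena` array, the `_BAD`/`_GOOD` macro names and the two helper functions are made up for illustration, and `uint32_t` stands in for the 4-byte unsigned long.

```c
#include <stdint.h>

/* Hypothetical stand-in for the memory at the page directory address,
 * so the pointer arithmetic below stays inside a real object. */
static uint32_t arena[0x2000];

#define PAGE_DIRECTORY_ADDRESS ((uintptr_t)arena)
#define PAGE_DIRECTORY_SIZE    0x1000u

/* Unbracketed, like the broken define. */
#define PAGE_TABLE_ADDRESS_BAD  PAGE_DIRECTORY_ADDRESS + PAGE_DIRECTORY_SIZE
/* Bracketed, so it behaves as a single value. */
#define PAGE_TABLE_ADDRESS_GOOD (PAGE_DIRECTORY_ADDRESS + PAGE_DIRECTORY_SIZE)

/* Byte distance from the directory to where the broken pointer lands. */
uintptr_t bad_offset(void)
{
    /* Expands to ((uint32_t *) base) + 0x1000u: the cast is applied
     * first, so the addition is scaled by sizeof(uint32_t). */
    uint32_t *p = (uint32_t *) PAGE_TABLE_ADDRESS_BAD;
    return (uintptr_t)p - (uintptr_t)arena;
}

/* Byte distance from the directory to where the fixed pointer lands. */
uintptr_t good_offset(void)
{
    /* The addition happens on integers, then the sum is cast. */
    uint32_t *p = (uint32_t *) PAGE_TABLE_ADDRESS_GOOD;
    return (uintptr_t)p - (uintptr_t)arena;
}
```

Here `bad_offset()` comes out four times larger than `good_offset()`, which matches writes through page_table never showing up at 0x101000.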

Post by Pype.Clicker »

For further steps, make sure that your #defines can be used as a single token.

like #define X (Y+Z) ...

It will work better than having to use (X) instead of X everywhere in the code ...
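A small illustration of that advice, using made-up values for Y and Z: the same trap shows up with ordinary arithmetic, not only with casts. The names `X_LOOSE`, `X_SAFE` and the two functions are hypothetical.

```c
#define Y 4
#define Z 2

#define X_LOOSE Y + Z      /* expands to bare tokens */
#define X_SAFE  (Y + Z)    /* expands to one bracketed value */

/* Expands to 2 * 4 + 2, so the multiplication only applies to Y. */
int twice_loose(void)
{
    return 2 * X_LOOSE;
}

/* Expands to 2 * (4 + 2), which is what the name X suggests. */
int twice_safe(void)
{
    return 2 * X_SAFE;
}
```

With the brackets inside the define, every use site can write X without having to remember to wrap it.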

Post by IRBMe »

yup that's how I did it. Live and learn, I believe is the expression ;)

Post by Pype.Clicker »

In the same "folder of recommendations", it is suggested that you have ".H" for purely declarative (i.e. no code) interfaces and ".C" for "pure code" (no structure definitions, no #defines or anything of the sort) ...

It eases the "modularization" of your code: Video.H describes everything other components need in order to access the functions/abstractions offered by the video module, and Video.C is the code itself, compiled separately and linked with the rest, so that the clear_screen() function is present only once in your project.

And by having Video.C include Video.H, you can check that the implementation stays consistent with the interface ...

Post by Dreamsmith »

Pype.Clicker wrote:in the same "folder of recommendations", it is suggested that you have ".H" for pure-declarative (i.e. no code) interfaces and .C for "pure code" (no structure definitions, no #defines or whatsoever) ...
And in the alternate view department: Your *.c files will (or should at least) have structure definitions and #defines for anything that ought not be externally referenced. Your *.h files should contain as little as possible. This prevents other code from messing around in bits it shouldn't. It's called "encapsulation" and is one of the things, along with "inheritance" and "polymorphism", that many people mistakenly believe you need C++ for but, in fact, is quite useful in plain old ANSI C.

The latter two are only an issue if you're using object oriented design, but encapsulation is always a good idea. I shudder to think of projects where all the nitty-gritty details of a particular .c file's implementation are available for all to see in its .h file. Don't declare any structures there at all if you can avoid it (and you usually can). Don't declare global variables there either. Instead of this:

Code:

extern volatile int systemTicks;

Do this:

Code:

inline int getSystemTicks()
{
    extern volatile int systemTicks;
    return systemTicks;
}

It compiles the exact same way, but it sure makes changing things down the road a lot easier. You get the efficiency of global variables without the headaches that way.

Encapsulation. Gotta love it...
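One caveat when putting such an accessor in a header: under C99 rules a plain "inline" definition does not by itself provide an external definition, so a call the compiler decides not to inline can fail to link. A common variant is "static inline", sketched below. This is an assumption about how one might write it, not the code from the post, and the definition of systemTicks is included here only so the example stands alone; in a real project it would live in exactly one .c file.

```c
/* Stand-in for the single definition that would live in one .c file,
 * where a timer interrupt handler would update it. */
volatile int systemTicks = 0;

/* Header-style accessor. `static` gives each translation unit its own
 * copy, so no separate external definition is needed. */
static inline int getSystemTicks(void)
{
    extern volatile int systemTicks;
    return systemTicks;
}
```

Callers only ever see getSystemTicks(), so the variable behind it can later be renamed or replaced without touching them.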

Post by Colonel Kernel »

I didn't think "inline" was part of ANSI C...
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!

Post by Dreamsmith »

Colonel Kernel wrote: I didn't think "inline" was part of ANSI C...
Yup, it is. It was introduced in the C99 standard.