Paging problem on real hardware

Posted: Fri Sep 14, 2012 5:23 am
by itportal
Hello forum,

I have been pulling my hair out over this one for some time, so I hope you can come up with ideas for a solution. The situation: I have a 32-bit kernel that switches to 32-bit mode, uses paging, enables the APs, and sets up all the required data structures; the kernel receives interrupts on all cores. I have been developing and testing under bochs and qemu for some time, and everything works fine there.

The problem: several days ago I tested the whole thing on real hardware, and there is some kind of problem with paging. Right after enabling paging, the kernel stops producing any VGA text output and seems not to execute any further instructions at all. However, the CPU does not reset. This happens right after enabling paging:

Code:

inline void enablePaging(){
    paging_enabled = true;
    asm volatile("mov ebx, cr0        \n"
                 "or  ebx, 0x80000000 \n"
                 "mov cr0, ebx        \n"
                 : : : "ebx");
}
I debugged under bochs and all pages are mapped correctly (including the video RAM); cr3 is set before the enablePaging function is called. When I do not enable paging, the kernel starts successfully on real hardware as well. At first I thought it might be a problem with initializing the BSS section, but the kernel initializes it and I even checksummed it.

I am really out of ideas about what might be wrong. Has any of you encountered such a problem, or do you have any ideas what it could be? Could it be something connected to caching? I noticed that cr0 on the emulators shows that the cache is disabled.
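
For the "all pages are mapped correctly (including the video RAM)" check, here is a minimal sketch in C of the index arithmetic for classic 32-bit (non-PAE) paging with 4 KiB pages. It assumes an identity mapping of the first 4 MiB and the standard VGA colour text buffer at 0xB8000; the thread does not show the kernel's actual page tables, and the names `pd_index`, `pt_index`, `identity_map_first_4mib`, and `translate` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE    4096u
#define PTE_PRESENT  0x1u   /* bit 0 of a page-table entry */
#define PTE_WRITABLE 0x2u   /* bit 1 of a page-table entry */
#define NOT_MAPPED   0xFFFFFFFFu  /* hypothetical "no translation" marker */

/* Page-directory index: top 10 bits of the linear address. */
static inline uint32_t pd_index(uint32_t vaddr) { return vaddr >> 22; }

/* Page-table index: middle 10 bits of the linear address. */
static inline uint32_t pt_index(uint32_t vaddr) { return (vaddr >> 12) & 0x3FFu; }

/* Fill one page table so that linear 0..4 MiB maps to the same physical range. */
static void identity_map_first_4mib(uint32_t table[1024]) {
    for (uint32_t i = 0; i < 1024; i++)
        table[i] = (i * PAGE_SIZE) | PTE_PRESENT | PTE_WRITABLE;
}

/* Software walk of that single table; this sketch only covers the first 4 MiB. */
static uint32_t translate(const uint32_t table[1024], uint32_t vaddr) {
    assert(pd_index(vaddr) == 0);
    uint32_t pte = table[pt_index(vaddr)];
    if (!(pte & PTE_PRESENT))
        return NOT_MAPPED;
    return (pte & ~0xFFFu) | (vaddr & 0xFFFu);
}
```

Under these assumptions, 0xB8000 lands in page-directory entry 0 and page-table entry 0xB8, so a missing or non-present entry there would explain vanishing text output.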

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 6:20 am
by shikhin
Hi,
itportal wrote:This happens right after enabling paging.
I don't see how anything should go wrong *right after enabling paging* on real hardware, but work on an emulator?

The only problem I see is that you might not be invalidating the TLB correctly, and the problem might occur after enabling paging, though not exactly *right* after enabling it.

Regards,
Shikhin

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 6:34 am
by itportal
Well, I am trying to produce text output after enabling paging and then leave the kernel in an endless loop (so there are no page table changes). That said, for the sake of simplicity I do a TLB flush on each mapping/unmapping of virtual memory, and CR3 is loaded right before paging is enabled.

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 6:43 am
by bluemoon
You have an SMP kernel, so I'm sure you know how to narrow down the problem by elimination.
Try to rule things out by running on another machine (yes, it may somehow work),
then try just the bootstrap and enabling paging, without higher half, without SMP, etc.

Real hardware is usually more sensitive to bugs, tricks, hacks, and default state (e.g. zeroed RAM in an emulator), and things that can be left uninitialized in an emulator may be mandatory on the real thing.

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 7:58 am
by itportal
I am trying to narrow the problem down, but as I mentioned - I am out of ideas.

So, a small update: interrupts are enabled and I can successfully receive and process them. The keyboard is working and prints a character on key down / key up. I also get timer interrupts. However, the kernel seems to be stuck on one particular instruction. Here is a disassembly of the enablePaging function:

Code:

000004ce <enablePaging>:
 4ce:	55                   	push   ebp
 4cf:	89 e5                	mov    ebp,esp
 4d1:	53                   	push   ebx
 4d2:	0f 20 c3             	mov    ebx,cr0
 4d5:	81 cb 00 00 00 80    	or     ebx,0x80000000
 4db:	0f 22 c3             	mov    cr0,ebx
 4de:	c6 05 00 00 00 00 01 	mov    BYTE PTR ds:0x0,0x1
 4e5:	5b                   	pop    ebx
 4e6:	c9                   	leave  
 4e7:	c3                   	ret    
It seems that the kernel is stuck on the mov BYTE PTR ds:0x0,0x1 instruction. All interrupts I get after paging is enabled point to this as the EIP of the interrupted code. However, I do not get any exceptions from this instruction.

The instruction should set the paging_enabled variable to 1. I moved the instruction after the actual write to cr0. The code here says ds:0x0, but it is not linked. After linking, the correct address is being generated and bochs sets the value correctly.

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 8:09 am
by bluemoon
itportal wrote:I moved the instruction after the actual write to cr0.
As a side note, you can do this to properly tell gcc that you clobbered memory:

Code:

inline void enablePaging(){
    asm volatile("mov ebx, cr0        \n"
                 "or  ebx, 0x80000000 \n"
                 "mov cr0, ebx        \n"
                 : : : "ebx", "memory");
    paging_enabled = true;
}
gcc should place the paging_enabled = true after write to cr0.

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 8:19 am
by itportal
bluemoon wrote:As a side note, you can do this to properly tell gcc that you clobbered memory:
gcc should place the paging_enabled = true after write to cr0.
But I don't clobber the memory here. Adding "memory" to the clobber list does not change the asm output.

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 8:34 am
by bluemoon
itportal wrote:It seems that the kernel is stuck on the mov BYTE PTR ds:0x0,0x1 instruction. All interrupts I get after paging is enabled point to this as the EIP of the interrupted code. However, I do not get any exceptions from this instruction.
This reminds me of a bug I once encountered: my #PF handler had one path that returned without doing anything, so the CPU seemed to get stuck on one faulting instruction.
Are you sure there is no exception?
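
One way to answer that question is to make every path through the #PF handler report something before it returns. The sketch below decodes the low five bits of the page-fault error code as the x86 architecture defines them, so the handler can print what kind of access faulted (the faulting linear address itself is in CR2). `struct pf_info` and `decode_pf_error` are hypothetical names; the thread does not show the actual handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of the error code the CPU pushes for a page fault (#PF). */
struct pf_info {
    bool present;      /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;        /* bit 1: 1 = write access, 0 = read access */
    bool user;         /* bit 2: 1 = fault happened in user mode */
    bool reserved_bit; /* bit 3: 1 = reserved bit set in a paging structure */
    bool instr_fetch;  /* bit 4: 1 = fault on an instruction fetch */
};

/* Split the raw error code into its flag bits; higher bits are ignored here. */
static void decode_pf_error(uint32_t err, struct pf_info *out) {
    assert(out != NULL);
    out->present      = (err & 0x01u) != 0;
    out->write        = (err & 0x02u) != 0;
    out->user         = (err & 0x04u) != 0;
    out->reserved_bit = (err & 0x08u) != 0;
    out->instr_fetch  = (err & 0x10u) != 0;
}
```

A handler that prints these fields and then halts, instead of silently returning, cannot hide a fault loop of the kind described above.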

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 9:28 am
by Combuster
itportal wrote:But I don't clobber the memory here.
That would only be true if you actually map all 4GB of the address space 1:1 to their original locations? I'm sure there are things that were once visible in memory that are now gone.

Re: Paging problem on real hardware

Posted: Fri Sep 14, 2012 9:42 am
by itportal
Combuster wrote:That would only be true if you actually map all 4GB of the address space 1:1 to their original locations? I'm sure there are things that were once visible in memory that are now gone.
That's true, I got the point and added "memory" to the clobber list.

I think I solved the problem: it has nothing directly to do with paging at all. It seems that the init count value of my lapic timer is calculated incorrectly, so it fires timer interrupts so quickly that the kernel code cannot make progress because it is interrupted all the time. Why this shows up exactly after paging is enabled, and not right after the lapic timer is enabled, is still a mystery to me, but it seems fine now.

Thanks for the help!
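
For reference, a minimal sketch in C of how a lapic timer init count can be derived from a calibration run, with a floor so that a bad measurement cannot produce an interrupt storm like the one described above. It assumes the timer was left counting down for a known interval (measured with some other clock, for example the PIT) and that the number of elapsed ticks was read back; `lapic_initial_count`, its parameters, and `LAPIC_MIN_COUNT` are hypothetical and not taken from the poster's kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lower bound: below this the timer would fire too often. */
#define LAPIC_MIN_COUNT 1000u

/*
 * elapsed_ticks: lapic timer ticks counted during the calibration window
 * calib_ms:      length of the calibration window in milliseconds
 * target_hz:     desired timer interrupt frequency
 * Returns the value to write to the timer's initial-count register.
 */
static uint32_t lapic_initial_count(uint32_t elapsed_ticks,
                                    uint32_t calib_ms,
                                    uint32_t target_hz) {
    assert(calib_ms > 0 && target_hz > 0);

    /* 64-bit intermediate so elapsed_ticks * 1000 cannot overflow. */
    uint64_t ticks_per_sec = (uint64_t)elapsed_ticks * 1000u / calib_ms;
    uint64_t count = ticks_per_sec / target_hz;

    if (count < LAPIC_MIN_COUNT)
        count = LAPIC_MIN_COUNT;   /* clamp a suspiciously small result */
    if (count > UINT32_MAX)
        count = UINT32_MAX;        /* the register is 32 bits wide */
    return (uint32_t)count;
}
```

With 1,000,000 ticks measured over 10 ms and a 100 Hz target, this gives an init count of 1,000,000; a calibration that measured only a handful of ticks is clamped to the floor rather than programmed as-is.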