
Can I turn off paging in kernel mode?

Posted: Sun Nov 04, 2007 5:55 am
by Craze Frog
Is it possible to turn off paging in kernel mode and turn it on again when I run a process, without a big performance hit? Basically it would act like the entire computer's physical memory was identity mapped, without having to spend a huge load of memory on page tables. Is there anything wrong with this?

Posted: Sun Nov 04, 2007 8:20 am
by frank
The kinda short answer is yes, as long as the kernel is identity mapped (located at the same physical and virtual addresses).

Every time you turn paging on or off, or write to CR3, the processor flushes the TLBs. I guess that as long as you have the kernel identity mapped there would be no real problem with turning paging on and off, except for the trouble of dealing with application pages. When you turn off paging, all those nice pages mapped at 0x40000000-0x40010000 now end up being pages somewhere in physical memory. The pages could be fragmented: the page at 0x40005000 could be lower in physical memory than the page at 0x40000000, and stuff like that.

Do you have any big advantages to be gained from turning paging on and off? The cost of page tables and page directories is really negligible as long as you share the kernel page tables with all processes.

Posted: Sun Nov 04, 2007 8:45 am
by Craze Frog
Do you have any big advantages to be gained from turning paging on and off?
Simplicity and memory savings at the same time. I don't need to worry about which addresses are physical and which are virtual in the kernel, since all are physical. This would greatly simplify page table management, since page directories and frames are referenced by physical address. So I don't need to map the page table onto itself or keep a double array of virtual and physical memory pointers.

Of course I could do that by identity-mapping all available memory, but that would take a whopping 4 MB for 4 GB of memory.

Edit: Or are you saying that the kernel should be identity mapped into every process (using the same page table) with no separate page directory for the kernel?

Posted: Sun Nov 04, 2007 11:56 am
by JAAman
Edit: Or are you saying that the kernel should be identity mapped into every process (using the same page table) with no separate page directory for the kernel?
most OSs dont have a separate page directory for the kernel, and simply map it into each process. whenever you turn paging on/off, you will need to be in an identity-mapped space -- whether this happens in its own page directory or within each process space. the simple answer is to identity-map it into each process: then you dont need a separate page directory, and you dont need to worry about running the same code from multiple addresses (as you would if it was identity-mapped in its own space, but not in individual process spaces)

Posted: Sun Nov 04, 2007 12:36 pm
by Brendan
Hi,
Craze Frog wrote:
Do you have any big advantages to be gained from turning paging on and off?
Simplicity and memory savings at the same time. I don't need to worry about which addresses are physical and which are virtual in the kernel, since all are physical.
Every address from user-space will be a linear address. This includes any pointers used for kernel API function arguments, the user space ESP and exception handlers (e.g. CR2 in the page fault handler).

To make it worse, if (for example) an application passes a pointer (e.g. a pointer to a data structure or string) to a kernel function, then that kernel function will need to convert the linear address into a physical address for the first byte, and then convert again for each page boundary that the structure crosses (while keeping track of how many bytes are left in the current page). A kernel API function that does something simple (like "strcpy()") quickly becomes a complicated mess.

If you've got a "kernel space" mapped into every address space, then you still need to convert linear addresses to physical addresses, but only within the linear memory manager, and only for aligned pages (no need to continually check for page boundaries).
Craze Frog wrote:This would greatly simplify page table management, since page directories and frames are referenced by physical address. So I don't need to map the page table onto itself or keep a double array of virtual and physical memory pointers.
If you insert the page directory into itself as a page table (to create a "page table entry mapping"), then you can find the page table entry from a linear address with:

Code:

page_table_entry_address = mapping_address + ( (linear_address & 0xFFFFF000) >> 10);
If you try to do this without paging enabled, then it'd be something like:

Code:

page_directory_entry = *(CR3 + ( (linear_address & 0xFFC00000) >> 20) );
page_table_entry = *( (page_directory_entry & 0xFFFFF000) + ( (linear_address & 0x003FF000) >> 10) );
It doesn't look simpler to me...
Craze Frog wrote:Of course I could do that by identity-mapping all available memory, but that would take a whopping 4 mb for 4 gb memory.
There's no point identity mapping 4 GB - it'd be just like leaving paging disabled. ;)


Cheers,

Brendan

Posted: Sun Nov 04, 2007 2:43 pm
by Craze Frog
Every address from user-space will be a linear address. This includes any pointers used for kernel API function arguments, the user space ESP and exception handlers (e.g. CR2 in the page fault handler).
I don't think I will need to pass strings or structures larger than what can fit in the registers to the kernel. System calls like "open" will be handled by a userspace server. The kernel only manages address spaces, threads and hardware access permissions. So I don't need to receive pointers or stack pointers.

About CR2, I can't see why it's a problem. If I understood it correctly, it will in fact be easier if CR2 is the virtual address and paging is turned off than it would be if paging was turned on.
Let's say the page in VIRTUAL memory 651264 gets swapped out. The process now tries to read it. The page fault handler now gets executed.
Here's the point: Since the memory is swapped out, there is no physical memory associated with the virtual memory, so the kernel couldn't have read/written to it even if it had paging turned on.
The only thing the kernel needs to do is to load the page from disk, put that page somewhere in physical memory, and update the page table with the pointer to physical memory (see why it's easier without paging?).
[...] then you can find the page table entry from a linear address with:
I don't exactly understand what your code is doing there. This is what I'm thinking (gotta be something wrong with it):
Let's say I already have the physical address of the page directory (it needs to be stored somewhere, and since the directory should never be accessed from userspace, it could just as well be stored as the physical address), then I'd just do something like

Code:

ppageTable = ppageDirectory[linear_address/0x1000/1024];
(is it necessary to zero the last bits by a bitmask?)

Also, I can't see how it's possible to do dynamic memory allocation in the kernel in a clean way, when the kernel is mapped into every process. It looks like you need to set aside a part of the address space for kernel allocations, which doesn't make it very dynamic in my opinion.

Posted: Sun Nov 04, 2007 3:19 pm
by Craze Frog
But still, will it cause a performance problem?

Posted: Mon Nov 05, 2007 1:07 pm
by JAAman
Let's say the page in VIRTUAL memory 651264 gets swapped out. The process now tries to read it. The page fault handler now gets executed.
Here's the point: Since the memory is swapped out, there is no physical memory associated with the virtual memory, so the kernel couldn't have read/written to it even if it had paging turned on.
The only thing the kernel needs to do is to load the page from disk, put that page somewhere in physical memory, and update the page table with the pointer to physical memory (see why it's easier without paging?).
no i dont see how this is easier without paging -- actually its harder...

the kernel doesnt need to do anything except write to the address, if it wants to use an address that is currently swapped out -- when it fails to write, the CPU will automatically call the code which loads it into memory at the appropriate virtual address (the physical address is irrelevant, and only needs to be marked as in-use). then the kernels request is handled, and the portion of the kernel which tried to access that out-of-memory section doesnt even know it wasnt in memory

on the other hand, with physical memory, the kernel will have to find an available address, then modify the pointer its trying to write through to match whatever address the page was loaded at -- after manually detecting that it wasnt already in memory -- and if its actual kernel code that was swapped out? well, that complicates it even more...
Let's say I already have the physical address of the page directory (it needs to be stored somewhere, and since the directory should never be accessed from userspace, it could just as well be stored as the physical address), then I'd just do something like
accessing the page directory from within virtual memory is easy -- with physical memory you have to know where in memory its stored, but with virtual memory you dont -- the page tables for the current process will always be in the same place (most people put it right at the top of the address space) and reading this will give you the physical address associated with any page...
Also, I can't see how it's possible to do dynamic memory allocation in the kernel in a clean way, when the kernel is mapped into every process. It looks like you need to set aside a part of the address space for kernel allocations, which doesn't make it very dynamic in my opinion.
it can be done different ways, but usually the address space is divided into 2 parts (either 2/2 or 3/1), where the upper portion is allocated into kernel memory, and the lower portion is allocated into process space -- the memory doesnt exist until it is allocated -- setting aside part of the address space for kernel allocations doesnt make it any less dynamic...

Posted: Mon Nov 05, 2007 11:44 pm
by Brendan
Hi,
Craze Frog wrote:
Every address from user-space will be a linear address. This includes any pointers used for kernel API function arguments, the user space ESP and exception handlers (e.g. CR2 in the page fault handler).
I don't think I will need to pass strings or structures larger than what can fit in the registers to the kernel. System calls like "open" will be handled by a userspace server. The kernel only manages address spaces, threads and hardware access permissions. So I don't need to receive pointers or stack pointers.
Some ideas...

For my OS, each process has a name and each thread has a name, and software can ask the kernel for details of all running processes/threads (e.g. a list including name, used CPU time, memory usage, etc). In this case the thread tells the kernel the linear address and size of a buffer and the kernel fills it with information. This also means you need to supply the name of a thread when it's spawned, and there are kernel functions to change a process' name or a thread's name. Spawning a new process is a little different (you need to supply a string of command line arguments instead).

When a thread crashes, my OS searches linear memory (the process' code) looking for the first debugging marker before EIP, so that the "blue screen of death" can say which file the bug is in. One version even disassembled the faulty instruction if it could, and displayed the top 32 dwords on the stack (at the linear address from the thread's ESP).

For me, the general protection fault handler looks at EIP to see which instruction (if any) caused the problem, and if it's an I/O port instruction it does permission checks and emulates the instruction. If you support virtual80x86 then you'll probably need to do some instruction emulation too. The same applies to the invalid opcode handler (emulate the instruction and pretend the CPU supports it instead of crashing). Instructions can be multiple bytes split across different pages, and aren't much different to strings.

For the next version of my OS, I'm planning a kernel API extension where a thread can ask the kernel to do a list of kernel API functions. The idea is that the CPU can go from CPL=3 to CPL=0, do any number of kernel functions, and then return from CPL=0 to CPL=3 (it reduces the overhead of many CPL=3 -> CPL=0 -> CPL=3 switches). To use it, the thread tells the kernel where the address of the list is and the kernel steps through each entry in the list. Note: if you're trashing TLB entries every time the kernel is called, then you'll want to consider using something similar!

I also strongly recommend that applications never use the CPUID instruction and instead ask the kernel for the information, so that the OS can correct the buggy crud that CPUID returns (if CPUID is supported) and give applications reliable and consistent information. This means the thread tells the kernel the linear address of a buffer and the kernel writes about 150 bytes of data into it.

I'm supporting up to 255 CPUs. This means that CPU affinity masks need to be 256 bits wide, and the scheduler functions used by threads to get and set CPU affinity need to use structures in linear memory (it's too large for four 32-bit registers). Note: I use EAX for the kernel API function number and returned status, and ECX and EDX are trashed by SYSENTER/SYSEXIT, which leaves EBX, ESI, EDI and EBP for function arguments and returned data.

For any/all of these things you'd be converting between linear addresses and physical addresses. You might not do some of these things, but you might do other things I didn't think of...
[...] then you can find the page table entry from a linear address with:
I don't exactly understand what your code is doing there. This is what I'm thinking (gotta be something wrong with it):
Let's say I already have the physical address of the page directory (it needs to be stored somewhere, and since the directory should never be accessed from userspace, it could just as well be stored as the physical address), then I'd just do something like

Code:

ppageTable = ppageDirectory[linear_address/0x1000/1024];
(is it necessary to zero the last bits by a bitmask?)

Using arrays (instead of pointers), it'd look more like:

Code:

pageDirectoryEntry = ppageDirectory[linear_address/0x1000/1024];
ppageTable = ppageDirectoryEntry & 0xFFFFF000;   // Remove read/write, present, dirty, accessed, etc flags
ppageTableEntry = &ppageTable[(linear_address/0x1000) & 0x3FF];
Craze Frog wrote:But still, will it cause a performance problem?
The advantage is that each process can use almost 4 GB of address space, as you wouldn't need to use (for example) half of each address space for the kernel. This sounds good, but isn't that useful in practice - most processes that need more than 2 GB probably need more than 4 GB anyway. Note: I say "almost" here because you need to have the GDT, a TSS, the IDT, interrupt handling stubs and kernel API entry and exit points mapped into all address spaces.

The main disadvantage (for performance) is that you trash the TLB every time the kernel does anything. The TLB misses alone could add up to 1000 cycles of overhead when a kernel function is called or an IRQ occurs (unless you happen to switch address spaces, in which case the TLB needs to be flushed anyway).

There are other disadvantages though. If you ever support PAE (or PSE36) then the kernel won't be able to access any RAM above 4 GB; it'll be harder to port the OS to long mode; you won't be able to do some NUMA optimization tricks; you couldn't use "allocate on demand" to simplify the kernel's memory allocation; you won't get page faults when the kernel has bugs (harder to debug the kernel)...


Cheers,

Brendan

Posted: Tue Nov 06, 2007 8:31 am
by Craze Frog
1000 cycles sounds a bit on the heavy side, but thanks for the information.
you couldn't use "allocate on demand" to simplify the kernel's memory allocation;
:shock: That's exactly the thing I want paging off to easily do without wasting 2GB of address space.

Posted: Tue Nov 06, 2007 11:29 am
by Brendan
Hi,
Craze Frog wrote:1000 cycles sounds a bit on the heavy side, but thanks for the information.
1000 cycles is a rough estimate.

You can't enable paging and return from CPL=0 to CPL=3 at the same time. When returning from the kernel to user space you'd enable paging first and cause one TLB miss at EIP and another TLB miss at ESP. Then you'd return to user space, and get a TLB miss at the new EIP and another TLB miss at the new ESP, and more TLB misses for any data accesses the thread does.

For each TLB miss the CPU may need to access RAM to get the page directory entry and then access RAM again to get the page table entry. With 5 TLB misses it's up to 10 RAM accesses.

With modern CPUs there's a huge difference between CPU speed and RAM speed - each RAM access can easily cost up to 150 cycles. For single-CPU systems having 10 RAM accesses can add up to 1500 cycles. For SMP or DMA other CPUs or devices may also be accessing RAM at the same time making your RAM accesses slower (need to wait for bandwidth first). For NUMA, the RAM you're accessing could be a few "hops" away (in RAM chips that are further from the CPU rather than in RAM chips that are close to the CPU).

Of course the data the CPU needs for a TLB miss could be in the CPU's cache. This would speed things up a lot. For example, if everything happens to be in the L2 data cache it might only cost 20 cycles to access it, or 200 cycles for 10 TLB misses.

If both our OSs have a simple function (e.g. to allocate or free a page of RAM) and the function's code takes 100 cycles, then for my OS it might take an average of 150 cycles (with the CPL=3 -> CPL=0 -> CPL=3 switching), and for your OS it might take an average of 600 cycles (with the CPL=3 -> CPL=0 -> CPL=3 switching and the TLB misses). My OS would be 4 times faster, even though the function's code is the same. For large/slow kernel functions the overhead has less effect (e.g. a complex kernel function that takes 1000 cycles might cost 1050 cycles in my OS and 1500 cycles in your OS - not as much difference).

Most kernel functions in a micro-kernel are simple and fast (usually, spawning a new thread, creating a new process and doing task switches are the only things the kernel does that take much CPU time). If your OS is 4 times slower for most kernel functions then you'd have to wonder if it's worth doing.
Craze Frog wrote:
you couldn't use "allocate on demand" to simplify the kernel's memory allocation;
:shock: That's exactly the thing I want paging off to easily do without wasting 2GB of address space.
I use allocation on demand for some areas in kernel space - it means that the kernel doesn't need to check if there's RAM at an address or not. If RAM is already there (very likely) then you avoid the need to check, which saves CPU time, and if RAM is not there (the first time the page is accessed) you get the extra overhead of a page fault. Over time it averages out to a performance increase while also simplifying the kernel's code. Without paging the page fault handler can't allocate RAM when the kernel accesses a "not present" address - you'd have to have "do I need to allocate RAM first" checks in your critical paths that are usually unnecessary.

It's very likely that all of your life you've been using OSs with monolithic kernels (e.g. Windows, Linux, etc) and have never seen a process that has run out of linear address space. A micro-kernel doesn't need all the device drivers, etc in kernel space, and should be able to leave more of the linear address space for processes, so something that you've probably never seen would be even less likely. Who cares about using 1 GB (or 512 MB or 2 GB) of the linear address space for the kernel (almost no-one?), and who cares about kernel functions being significantly slower (almost everyone?)....


Cheers,

Brendan

Posted: Thu Nov 08, 2007 11:27 am
by Craze Frog
Ok, so I basically need to have paging on all the time and let the kernel run in all address spaces. But how the **** do I allocate memory then?

Allocating frames is easy enough, just check for a free spot in the bitmap. Assigning a frame to a page table entry is simple as well, provided that the page table entry exists.

But what if the page table doesn't exist? Then it needs to be allocated. There's just a slight problem: Even if I allocate a frame for it, I can't access that frame until I map it somewhere into virtual memory, and to do that I need to allocate a page table. :shock: Goto start.

Posted: Thu Nov 08, 2007 11:37 am
by Combuster
Have the last entry of each page directory map to itself. Exercise for the reader what the consequences of that action are. :wink:

Posted: Thu Nov 08, 2007 11:38 am
by AJ
OK,

1. Map the last page directory entry to the physical address of the page directory. Think about this - the space from 0xFFC00000 to 0xFFFFFFFF now contains all your page tables in *virtual ram* (note you have not actually lost any *physical* ram by doing this).

2. As the last entry in the page directory is itself, that means that the page directory is accessible in virtual ram at 0xFFFFF000 (the last 4k).

3. When you need to map a new page table, allocate the physical ram, then map it in to the correct location in the page directory. This means that, for mapping 0x00000000 - 0x00400000, you will put the physical address of the page table, plus the present and read/write flags, in the first PDE at 0xFFFFF000.

4. Take time to think about this again (as it takes some getting your head around).

5. Your page *table* is now accessible at location 0xFFC00000 - 0xFFC00FFF. You can now add PTEs to your page table.

Cheers,
Adam

Posted: Thu Nov 08, 2007 12:03 pm
by Craze Frog
Ehm, I can't map the page directory onto itself until it's allocated. I can't allocate memory for the page directory without allocating a page. I can't allocate a page without mapping the page directory onto itself. That's the problem.

So basically I need to do some static setup of stuff.