Hi guys,
I've lately been fixing up my old OS attempt (Mattise) while I have spare minutes (I'm kind of on hold on the Pedigree project thanks to other commitments such as school).
So far I've done things like rewrite the physical frame allocator to work off a stack instead of a bitmap, and almost completely rewrite the process code to use process queues instead of static arrays. I've also fixed a lot of bugs and mistakes I made.
However, the one thing I still can't seem to get right is paging. Every time my OS crashes it's thanks to a page fault, never anything else. So I decided to post here and ask some questions about designing and implementing a decent paging system.
A good first example is my process creation code. To create the process and load the file contents (ELF), I need to write to the virtual address in the new page directory. However, I can't simply write to the physical pages returned by my frame allocator; they need to be mapped into an address space first. The double mapping can be extremely confusing at times and uses up extra pages I could use elsewhere. Is there a better way to pull this off, other than COW or something similar?
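The "double mapping" described here boils down to something like the sketch below: temporarily map the destination frame into the current address space, write through that window, unmap it, then install the frame at its final address in the new page directory. The helper names (alloc_frame, map_temp, unmap_temp, map_page_in) are made up for illustration and are not Mattise's actual API.

Code:
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical helpers - not Mattise's real API. */
extern uintptr_t alloc_frame(void);                 /* physical frame allocator          */
extern void     *map_temp(uintptr_t phys);          /* map a frame at a scratch address  */
extern void      unmap_temp(void *virt);
extern void      map_page_in(void *new_pd, uintptr_t virt, uintptr_t phys, uint32_t flags);

/* Copy up to one page of ELF data to 'virt' in the (not currently active)
 * address space described by 'new_pd'. */
static void copy_page_to_new_space(void *new_pd, uintptr_t virt,
                                   const void *src, size_t len)
{
    uintptr_t frame  = alloc_frame();       /* frame that will belong to the new space */
    void     *window = map_temp(frame);     /* reachable from the *current* space      */

    memset(window, 0, PAGE_SIZE);
    memcpy(window, src, len < PAGE_SIZE ? len : PAGE_SIZE);
    unmap_temp(window);                     /* scratch mapping no longer needed        */

    /* Install the frame at its final address in the new page directory. */
    map_page_in(new_pd, virt, frame, 0x7 /* present | rw | user */);
}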
As I write this, I have started considering passing an extra argument to the kernel-based thread entry function, which could then load the binary into the address space. Is this a good idea? In my kernel-based thread entry function the CPU is still in ring 0, so there wouldn't be any privilege problems.
Another example is, ironically, my page mapping code, which internally calls itself. At the moment, I check whether the page directory to map into is the current page directory - if so, then I use the "map the page directory to itself" trick. This works well. However, there do come times when I need to map into a different page directory than the current one - mainly when doing things like creating and putting data onto new thread stacks. This can't be solved by the "do it in the thread entry function" method because the thread entry function uses this stack.
In this page mapping function I have to map the page directory into the current address space, then map the page table as well. This does cause extra overhead and I'm sure there's a better way to do this.
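For reference, the "map the page directory to itself" trick reduces to address arithmetic once the directory's physical address sits in its own last entry. Below is a minimal sketch for plain 32-bit paging, assuming the self-map lives in slot 1023 (the slot number is an implementation choice); it only helps for the current address space, which is exactly why mapping into a different page directory needs extra work.

Code:
#include <stdint.h>

#define RECURSIVE_SLOT   1023u
#define PAGE_TABLES_BASE 0xFFC00000u   /* RECURSIVE_SLOT << 22                       */
#define PAGE_DIR_VIRT    0xFFFFF000u   /* PAGE_TABLES_BASE + RECURSIVE_SLOT * 0x1000 */

/* Pointer to the page directory entry covering 'virt' in the current space. */
static inline uint32_t *pde_for(uint32_t virt)
{
    return (uint32_t *)PAGE_DIR_VIRT + (virt >> 22);
}

/* Pointer to the page table entry covering 'virt' in the current space.
 * Only valid if the page table for that region is present. */
static inline uint32_t *pte_for(uint32_t virt)
{
    return (uint32_t *)PAGE_TABLES_BASE + (virt >> 12);
}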
I'm willing to take the time to rewrite a fair bit if it'll make it easier to do things in the future, so your advice and ideas would be greatly appreciated. Thanks in advance.
Re: Doing Paging Right
Hi,
This is all stuff I have found tricky in the past and have never really been satisfied with the answer, but perhaps if I go through my "semi-solutions" and previous errors, you may at least get a different perspective!
pcmattman wrote: A good first example is my process creation code...

My philosophy here has always been to remain in the creating (host) process' task space for as little time as possible. In my first test kernel, I did this by reserving some virtual memory in kernel space specifically for creating new PDs (I'll continue to use the term PD for brevity, but if you are using x86_64, the same applies to PML4s). The problem was that this artificially limits the number of tasks that can be created simultaneously.

What I now do instead is have an aligned malloc function, so that a new PD is created in the kernel heap (which is huge on x86-64!), and a special free function which does a forced un-map, so that the page allocated in the kernel heap for creating the new PD cannot get re-used by the kernel heap(!). This same combination of functions is used for creating the new PD, the ring 0 stack and the "process control region". The new PD consists of an exact copy of kernel space (substituting the new PD for the old one in the PD self-mapping) and an empty user region. The "process control region" is the only area which is then mapped into the new PD before the new PD is deallocated from the kernel heap. This region contains items such as the initial process stack.
Once that lot is done, I'm ready to switch to the new task and all EXE loading is done entirely from the new task space.
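A rough sketch of what that could look like: build the new PD in the kernel heap, copy the kernel-space entries, patch the self-map entry to point at the new directory, then give the heap address back with a forced unmap. The helper names (kmalloc_aligned, virt_to_phys, kfree_forced_unmap, current_page_dir) and the 3 GiB/1 GiB split are assumptions for illustration, not AJ's actual code.

Code:
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE        4096u
#define ENTRIES_PER_PD   1024u
#define KERNEL_PDE_START 768u          /* assuming a 3 GiB user / 1 GiB kernel split */
#define RECURSIVE_SLOT   1023u         /* PD self-map entry                          */

/* Hypothetical helpers - not AJ's real functions. */
extern void     *kmalloc_aligned(size_t size, size_t align);  /* page-aligned heap alloc               */
extern uintptr_t virt_to_phys(const void *virt);
extern void      kfree_forced_unmap(void *virt);              /* frees + unmaps, address never re-used */
extern uint32_t *current_page_dir(void);                      /* e.g. via the self-map                 */

uintptr_t create_address_space(void)
{
    uint32_t *new_pd      = kmalloc_aligned(PAGE_SIZE, PAGE_SIZE);
    uint32_t *cur_pd      = current_page_dir();
    uintptr_t new_pd_phys = virt_to_phys(new_pd);

    memset(new_pd, 0, PAGE_SIZE);                  /* empty user region */

    /* Exact copy of kernel space... */
    for (unsigned i = KERNEL_PDE_START; i < ENTRIES_PER_PD; i++)
        new_pd[i] = cur_pd[i];

    /* ...except the self-map entry, which must point at the *new* directory. */
    new_pd[RECURSIVE_SLOT] = (uint32_t)new_pd_phys | 0x3;     /* present | rw */

    /* Release the heap address with a forced unmap so the heap can never hand
     * this virtual address out again while the frame is in use as a PD. */
    kfree_forced_unmap(new_pd);

    return new_pd_phys;                            /* value to load into CR3 */
}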
pcmattman wrote: As I write this, I have started considering passing an extra argument to the kernel-based thread entry function, which could then load the binary into the address space. Is this a good idea?

Personally, I prefer lazy-loading within the context of the new task - I think a system feels more responsive with quick load times. Of course, lazy loading is less efficient for programs that use a lot of their binary straight away. I guess a more advanced system could intelligently select the type of loading to use.

pcmattman wrote: However, there do come times when I need to map into a different page directory than the current one - mainly when doing things like creating and putting data onto new thread stacks. This can't be solved by the "do it in the thread entry function" method because the thread entry function uses this stack.

For your example, I don't see why this is a problem. You create a minimal stack for each new task, and after that all memory maps are done in the context of the new task. Perhaps if you use shared memory areas for e.g. IPC and DLL sharing this could come back to haunt you, but I think that for "general purpose" code you should be able to avoid mapping one process' PDs and PTs into another's.
Sorry this isn't nearly in-depth enough but hopefully there are some ideas in there. I generally find that Brendan gives some excellent advice in this kind of implementation mind-bender so hopefully he'll be around. If I have more time, I'll get back later!
Cheers,
Adam
Re: Doing Paging Right
Thanks AJ, you've given some good points to ponder. It's nice having some ideas to get me thinking about how I could do it in my kernel.
Re: Doing Paging Right
Hi,
pcmattman wrote: A good first example is my process creation code...

AJ wrote: My philosophy here has always been to remain in the creating (host) process' task space for as little time as possible.

The last time I did this, I had 2 key structures. One structure describes a thread, including things like the thread name, process ID, CPU time used by the thread so far, thread priority, an area to save FPU/MMX/SSE state, and the thread's kernel stack. The other structure describes a process, including things like the process name, the amount of CPU time used by threads that have terminated, and a list of "thread structures" that belong to the process.

During boot, the boot code constructs a "process structure" for the kernel and a "thread structure" for each CPU, so that immediately after boot the kernel looks just like a multi-threaded process. The kernel's multi-threaded process has its own virtual address space ("kernel space").

This means that at any time I can spawn a kernel thread that belongs to the kernel process, and I mainly only need to worry about creating a new thread structure in kernel space. When the new kernel thread gets CPU time it can create a "process structure" and change its owner (so that it belongs to the new process and doesn't belong to the "kernel process" anymore). Then the thread can create/allocate its own new virtual address space (e.g. allocate a new page directory and copy the kernel's page tables into it). At this point I've basically got a new thread and a new process with nothing in user space. From here the thread can allocate its own CPL=3 stack, parse an executable file's header, load/map the executable, and then jump to the executable's entry point.
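A sketch of that flow, with every structure, helper and name (spawn_kernel_thread, create_address_space, load_and_map_elf, and so on) being a hypothetical placeholder rather than Brendan's real code:

Code:
#include <stdint.h>

struct process;
struct thread;

/* Hypothetical placeholders - not Brendan's real code. */
extern struct thread  *spawn_kernel_thread(void (*entry)(void *), void *arg);
extern struct thread  *current_thread(void);
extern struct process *create_process_struct(const char *name);
extern void            thread_set_owner(struct thread *t, struct process *p);
extern uintptr_t       create_address_space(void);   /* new PD with kernel tables copied in */
extern void            switch_address_space(uintptr_t pd_phys);
extern void            map_user_stack(void);         /* CPL=3 stack                         */
extern void            load_and_map_elf(const char *path, uintptr_t *entry_point);
extern void            enter_user_mode(uintptr_t entry_point);

/* Runs as a kernel thread that still belongs to the kernel process. */
static void process_bootstrap(void *arg)
{
    const char *path = arg;

    struct process *proc = create_process_struct(path);
    thread_set_owner(current_thread(), proc);      /* no longer a kernel-process thread */

    switch_address_space(create_address_space());  /* fresh, empty user space           */
    map_user_stack();

    uintptr_t entry;
    load_and_map_elf(path, &entry);                /* all loading done in *this* space  */
    enter_user_mode(entry);                        /* does not return                   */
}

void exec_new_process(char *path)
{
    /* The heavy lifting happens later, in the new thread's own context. */
    spawn_kernel_thread(process_bootstrap, path);
}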
Also note that the same concept ("do all memory management from the correct thread") applies to freeing stuff (e.g. when a process or thread terminates itself or is terminated by the kernel). You shouldn't ever need to access pages in a different address space.
However, there may be situations where you need to access page directory entries in a different address space (e.g. allocating a new kernel page table). To solve this problem I map every page directory into kernel space (so that each address space costs me 4 KiB of "kernel space"). This doesn't cost much RAM though (just like the "map the page directory to itself" trick, it costs you space but doesn't cost you RAM). For "plain 32-bit paging", 1024 address spaces cost 4 KiB of RAM for the page table and 4 MiB of "kernel space".
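As a sketch, the "map every page directory into kernel space" idea can be as simple as reserving one 4 MiB window and indexing it by an address-space ID. The window base address and helper name below are made up for illustration:

Code:
#include <stdint.h>

#define PD_WINDOW_BASE  0xFF000000u   /* assumed 4 MiB kernel-space window             */
#define MAX_ADDR_SPACES 1024u         /* 1024 * 4 KiB = 4 MiB of kernel space, backed  */
                                      /* by a single 4 KiB page table of real RAM      */

extern void map_kernel_page(uintptr_t virt, uintptr_t phys, uint32_t flags);  /* hypothetical */

/* Virtual address at which address space 'id' has its page directory mapped. */
static inline uint32_t *page_dir_of(unsigned id)
{
    return (uint32_t *)(PD_WINDOW_BASE + id * 4096u);
}

/* Called once when address space 'id' is created. */
void register_address_space(unsigned id, uintptr_t pd_phys)
{
    map_kernel_page(PD_WINDOW_BASE + id * 4096u, pd_phys, 0x3 /* present | rw */);
}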
AJ wrote: Personally, I prefer lazy-loading within the context of the new task - I think a system feels more responsive with quick load times. Of course, lazy loading is less efficient for programs that use a lot of their binary straight away. I guess a more advanced system could intelligently select the type of loading to use.

The main problem with lazy loading (or memory mapped executable files) is that the disk heads tend to bounce around everywhere while the executable is starting, which can hurt performance more than it helps (especially if the disk heads are bouncing around everywhere trying to find DLLs too).

The other thing to be careful of with lazy loading is that some disks (e.g. floppy, CD-ROM, USB flash, hot-plug hard drives, etc.) are either unreliable (read errors) or removable. If the kernel needs to load a page from the executable file (so that the process can continue running) but the kernel can't access the executable file anymore, then you're screwed - there's no graceful way to handle the error (e.g. you can't tell the process it failed to load and let the process handle the problem in a way that's appropriate for that process). The only thing you can do is terminate the process (e.g. AFAIK Windows treats it like a page fault and does a "blue screen of death"). I'd guess a more advanced system would have an "enable/disable lazy loading" flag in the executable header (for performance reasons), and the kernel would ignore this flag unless the executable file is on a removable disk (for reliability reasons).
Also, if lazy loading is enabled it'd be nice if the kernel would/could prefetch pages from disk. For example, if there's free RAM and the disk drive has nothing better to do, load pages from disk in case they're needed (so you don't need to get a page fault and then wait for the page to come from disk while the process is trying to get work done).
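For completeness, the lazy-loading path being discussed might look like the page-fault handler sketch below. All names are illustrative, and the error path reflects the point above: once the backing file is gone there is no graceful recovery, so the process is killed.

Code:
#include <stdint.h>
#include <stdbool.h>

struct vm_region;    /* file handle, file offset, length, protection, ... */

/* Hypothetical helpers - names are illustrative only. */
extern struct vm_region *find_region(uintptr_t fault_addr);
extern void             *map_zeroed_page(uintptr_t virt);
extern bool              read_page_from_file(struct vm_region *r, uintptr_t fault_addr, void *dest);
extern void              kill_current_process(const char *reason);

void handle_demand_fault(uintptr_t fault_addr)
{
    struct vm_region *region = find_region(fault_addr);
    if (!region) {
        kill_current_process("segmentation fault");   /* not a lazily-loaded area */
        return;
    }

    void *page = map_zeroed_page(fault_addr & ~(uintptr_t)0xFFF);
    if (!read_page_from_file(region, fault_addr, page)) {
        /* Backing store gone or unreadable: there is nothing sensible to
         * report back to the faulting instruction, so kill the process. */
        kill_current_process("executable no longer readable");
        return;
    }
    /* On success, just return: the faulting instruction is restarted. */
}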
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Doing Paging Right
pcmattman wrote: In this page mapping function I have to map the page directory into the current address space, then map the page table as well. This does cause extra overhead and I'm sure there's a better way to do this.

Mapping the page directory and page tables into the address space is normal. I keep the whole 4 MB page table area at the end of each address space, where the last page is simultaneously the page directory and the page table for that whole area. The page tables for the global memory region (where the kernel is located) are shared by all processes. When a new process is created, one page is allocated and mapped into a special page frame; it becomes the page directory for the new process. After this it is filled in (global entries are copied from the current page directory, and the last entry is produced from the new page directory's address), some pages are allocated for the new local kernel stack, and a new local thread structure with a special starting address is created. Process initialization continues in the new context.
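A small sketch of the scheme described in this post: global (kernel) entries are copied into every new directory, and the last entry points back at the directory itself. The helper names and the 768-entry kernel boundary are assumptions for illustration only.

Code:
#include <stdint.h>
#include <string.h>

#define ENTRIES      1024u
#define LAST_ENTRY   1023u
#define KERNEL_FIRST 768u             /* assumed start of the global (kernel) region */

/* Hypothetical helpers. */
extern uint32_t *map_into_scratch_slot(uintptr_t phys);   /* the "special page frame" */
extern uint32_t *current_page_dir(void);

void init_new_page_dir(uintptr_t new_pd_phys)
{
    uint32_t *new_pd = map_into_scratch_slot(new_pd_phys);
    uint32_t *cur_pd = current_page_dir();

    /* Empty local (user) region. */
    memset(new_pd, 0, KERNEL_FIRST * sizeof(uint32_t));

    /* Global entries are copied, so every process shares the kernel's page tables. */
    memcpy(&new_pd[KERNEL_FIRST], &cur_pd[KERNEL_FIRST],
           (LAST_ENTRY - KERNEL_FIRST) * sizeof(uint32_t));

    /* The last entry maps the directory over itself, so the top 4 MB of the
     * space becomes the page-table area and its last page is the directory. */
    new_pd[LAST_ENTRY] = (uint32_t)new_pd_phys | 0x3;        /* present | rw */
}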
If you have seen bad English in my words, tell me what's wrong, please.
Re: Doing Paging Right
The idea of process initialization in the new context rather than in a different context seems to be a popular solution - and I can understand why. I'm not entirely sure why I didn't think of it initially.
I guess I've needed to rewrite my process/thread creation code for a while now; I've just been trying to avoid it. Now looks like a perfect time to tear it apart and do things differently.
Thanks guys for the ideas and advice - it means I can get just a little bit closer to a working kernel (which can run NASM and binutils - already ported, just not running right because of these sorts of problems).
EDIT: I just got my kernel compiling with all the warning flags set (I don't know why I ever avoided -Wall). After I release this version, I'm planning on following the advice here and doing context initialization in the new context rather than in some other context. It'll make the code far easier to read, understand and debug. Thanks Brendan and AJ for all the assistance.
- Owen
Re: Doing Paging Right
The way I do it is that my kernel has three regions at the very top of its address space (well, four, but the very top one is unimportant for this).
The first is the current process' page directory, so that it is always mapped.
The second is a page directory entry for mapping another process' page directory temporarily, e.g. when mapping a page into another process (or when a PDE needs to be allocated in kernel space and needs to be added to all processes). This mapping must only be held while the VMM's locks are held.
The third is an area I call the "Transient Mapping Table". Simply, the TMT is a PDE into which the kernel can map 1024 pages temporarily (no explicit time limit is placed on this, but it's assumed anyone writing kernel-land code has a reasonable definition of "temporary").
Whenever I need to create a process, I request a page from the physical allocator, map it into the TMT, zero it, unmap it, then map it into the temporary page directory location. I then memcpy the kernel-space page mappings into it, and allocate its first thread's kernel stack in the kernel heap. I put any information required for process startup on the kernel stack, then enter the process into the scheduler.
Having a dedicated area to map individual pages is very useful, and allows you to optimize its structure fully for speed (rather than size).
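A sketch of what a Transient Mapping Table along these lines could look like: one reserved PDE's worth of kernel space, a PTE array backing it, and simple used-slot bookkeeping. The names, the base address and the (single-CPU, lock-free) bookkeeping are illustrative only; a real kernel would need locking here.

Code:
#include <stdint.h>
#include <stddef.h>

#define TMT_BASE   0xFF800000u        /* assumed PDE reserved for the TMT */
#define TMT_SLOTS  1024u
#define PAGE_SIZE  4096u

extern uint32_t tmt_page_table[TMT_SLOTS];   /* PTEs backing the TMT's PDE (hypothetical) */
extern void     invlpg(uintptr_t virt);      /* flush one TLB entry                       */

static uint8_t tmt_used[TMT_SLOTS];          /* no locking shown - SMP needs a spinlock   */

/* Map a physical page into a free transient slot and return its virtual address. */
void *tmt_map(uintptr_t phys)
{
    for (unsigned i = 0; i < TMT_SLOTS; i++) {
        if (!tmt_used[i]) {
            tmt_used[i] = 1;
            tmt_page_table[i] = (uint32_t)phys | 0x3;    /* present | rw */
            uintptr_t virt = TMT_BASE + i * PAGE_SIZE;
            invlpg(virt);
            return (void *)virt;
        }
    }
    return NULL;                             /* out of transient slots */
}

/* Release a transient mapping obtained from tmt_map(). */
void tmt_unmap(void *virt)
{
    unsigned i = (unsigned)(((uintptr_t)virt - TMT_BASE) / PAGE_SIZE);
    tmt_page_table[i] = 0;
    tmt_used[i] = 0;
    invlpg((uintptr_t)virt);
}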