Brendan wrote:
while not returning from "do_a_task_switch()" into a known state may be a little more complex, it's unnecessary and there are (or can be) reasons not to.
Unnecessary to do what? Return to an unknown state?
I'd better define what I mean by "unknown state". I agree that, at run time, in the absence of weird subtle bugs like the EOI problem, things will work OK with your scheme. When you switch from thread A to thread B, thread B returns from do_a_task_switch() seemingly in the same state as when it entered it. However, from the kernel developer's point of view, it is now impossible to statically analyze the control flow inside the kernel just by looking at the code. Every time you see a call to do_a_task_switch(), you, as the developer reading the code, have no idea what will happen next (in terms of time, not in terms of what will happen on that thread at some point in the future when it gets switched to again).
Brendan wrote:
How about the page fault handler? The page fault itself causes you to enter the kernel, but if the page fault handler needs to load a page of data from disk (e.g. from swap space) then you have to do a thread switch before you return from the page fault handler.

IMO that's an odd way to implement any handler. I think of each handler/system call/interrupt as an event that potentially changes the state of a thread. At a high level, the control flow of an OS is a bunch of related concurrent event-driven state machines. I'd prefer to model this in a more controlled manner, rather than relying on the state of some thread's kernel stack to "remember" what was happening when that thread was blocked for whatever reason.
In the page fault example in particular, I'd handle "hard" page faults (those requiring disk access) by putting the thread in a "page-fault-blocked" state and putting it on an appropriate queue. Next I'd set up a message to send to the appropriate file system or disk driver. Then I'd run the scheduler, which would pick the next task to run. The page fault handler would just iret to a new thread, like all the other handlers. Eventually, the in-page I/O request will complete, and the "page-fault-blocked" thread will be awoken. I see no reason to freeze the thread's state within the kernel itself at the moment you decide to block it.
Brendan wrote:
But there are exceptions to this. One example would be a "get a message" system call, where the task blocks if no messages are currently available for the task. In this case "do_a_task_switch" is often called after checking if any messages are available, but before the "important thing" (or getting the message) takes place, and certainly not as the last thing before iret.
That's an example of a case where your scheme magically works because copying a message to the receiving thread's buffer is not really a critical operation the way sending an EOI is (i.e. -- it's something that could happen before iret, or could not, either way it doesn't bring the system down).
In my scheme, the thread makes a "get a message" system call. Immediately, its most essential context (i.e. -- excluding FPU state and the system call parameter registers) is saved. This means that when its context is eventually restored, it will wake up seemingly at the moment that it made the system call, rather than somewhere in the middle of the kernel. This means that whoever delivers the message to that thread, whenever that happens, can copy the message to the thread's buffer and then switch to that thread on the way out of the kernel. If you were to read the code for the "get message" system call, you could tell statically where control would go right up until the iret.
What is the advantage of your scheme...? Conversely, what is the disadvantage of my scheme? For that matter, which is more conventional?
Brendan wrote:
At the time I had (unrelated) design problems with level triggered interrupts, and (more recently) interrupt latency problems that caused IRQ8 to be missed on some computers, which led to the OS locking up/waiting forever (although this could be attributed to setting the RTC periodic timer's frequency too fast).
Did allowing nested interrupts fix this? If so, how? I briefly forgot that the kernel itself ought to deal with the timer in its own ISRs.
