Hi,
rwosdev wrote:So this is the second method...
rwosdev wrote:1. Have a separate kernel stack set up and mapped in the SAME virtual location per-process (i.e. per-collection of threads, not per-thread)
The second method is "one kernel stack (per CPU)". You don't have multiple kernel stacks mapped at the same virtual location, you just have one kernel stack.
For this case you'd do something like this (without worrying about things like FPU/MMX/SSE/AVX again):
Code: Select all
exit_kernel:
    cli
    mov esi,[current_task_TCB]      ;esi = address of the task control block for the task we're returning to

    ;Change virtual address space

    mov eax,[esi+TCB.cr3]
    mov cr3,eax

    ;Prepare for IRET
    ; (this assumes DS/ES/FS/GS already hold user data segment selectors - reload them here if the kernel changed them)

    mov eax,[esi+TCB.esp]
    mov ebx,[esi+TCB.eflags]
    mov ecx,[esi+TCB.eip]
    push dword 0x00000023           ;User data segment with RPL=3 (for SS)
    push eax                        ;User-space ESP
    push ebx                        ;User-space EFLAGS (with the IF flag set, so IRETD re-enables IRQs)
    push dword 0x0000001B           ;User code segment with RPL=3 (for CS; RPL must be 3 to return to CPL=3)
    push ecx                        ;User-space EIP

    ;Load user-space state

    mov eax,[esi+TCB.eax]
    mov ebx,[esi+TCB.ebx]
    mov ecx,[esi+TCB.ecx]
    mov edx,[esi+TCB.edx]
    mov edi,[esi+TCB.edi]
    mov ebp,[esi+TCB.ebp]
    mov esi,[esi+TCB.esi]

    ;Return to user-space

    iretd
While a CPU is running user-space code its kernel stack is empty. When something causes a switch to CPL=0 the CPU switches to that CPU's kernel stack and pushes some stuff onto it; then the kernel has to do the reverse of the above to shift all of the user-space state into the task's "task control block", which removes the information that the CPU pushed (return SS:ESP, return EFLAGS, return EIP) so that the kernel stack is empty again.
For example, for a page fault exception it might be:
Code: Select all
page_fault_exception:
    push edi                        ;Save the original EDI so EDI can be used to hold the TCB address
    mov edi,[current_task_TCB]
    pop dword [edi+TCB.edi]         ;Move the saved EDI into the task's TCB

    ;Save the rest of the user-space state

    mov [edi+TCB.eax],eax
    mov [edi+TCB.ebx],ebx
    mov [edi+TCB.ecx],ecx
    mov [edi+TCB.edx],edx
    mov [edi+TCB.esi],esi
    mov [edi+TCB.ebp],ebp
    pop dword [edi+TCB.errorCode]   ;Error code pushed by the CPU
    mov eax,cr2
    mov [edi+TCB.cr2],eax           ;Linear address that caused the page fault
    pop dword [edi+TCB.eip]         ;Return EIP pushed by the CPU
    add esp,4                       ;Remove return CS
    pop dword [edi+TCB.eflags]      ;Return EFLAGS pushed by the CPU
    pop dword [edi+TCB.esp]         ;Return ESP pushed by the CPU
    add esp,4                       ;Remove return SS

    ;Add "handle page fault for this task" joblette to the kernel's prioritised queue/s of things to do

    mov eax,JOBLETTE_TYPE_PAGE_FAULT
    mov ebx,edi                     ;ebx = 1st piece of joblette data (address of TCB for task that needs its page fault handled)
    call add_new_joblette

    ;** At this point, all user-space state has been saved, the kernel stack is empty, and nothing useful is in any of the registers **

    ;Enter the kernel's "do whatever joblette is most important" loop

    sti
    jmp kernel_entry
Note that because there's only one kernel stack (per CPU) the "ESP0" field in the TSS never changes - instead it's set once during boot (e.g. during the AP CPU startup sequence).
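For a concrete picture, a minimal sketch of that one-time setup might look like this (the labels this_cpu_TSS and this_cpu_kernel_stack_top, and the TSS structure offsets, are assumptions for illustration):
Code: Select all
;Hypothetical one-time setup, done once per CPU during boot
; (this_cpu_TSS and this_cpu_kernel_stack_top are assumed names)
setup_kernel_stack:
    mov eax,[this_cpu_kernel_stack_top]
    mov [this_cpu_TSS+TSS.esp0],eax             ;ESP0 = top of this CPU's kernel stack (offset 4 in a 32-bit TSS)
    mov dword [this_cpu_TSS+TSS.ss0],0x00000010 ;SS0 = kernel data segment (offset 8 in a 32-bit TSS)
    ret
After this (and after loading the task register with that CPU's TSS selector) ESP0 is simply never touched again.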
While the kernel is running it only changes the "current_task_TCB" variable (which affects the task that the kernel would return to if the kernel has nothing more important to do).
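For example (a minimal sketch, assuming a TCB.priority field where a larger number means higher priority), when a task unblocks the kernel might do nothing more than this:
Code: Select all
;Hypothetical sketch: esi = address of the TCB for a task that just unblocked
; (the TCB.priority field and "larger number = higher priority" are assumptions)
task_unblocked:
    mov eax,[esi+TCB.priority]
    mov edi,[current_task_TCB]
    cmp eax,[edi+TCB.priority]
    jbe .done                       ;Not higher priority, so leave current_task_TCB alone
    mov [current_task_TCB],esi      ;Kernel will return to the unblocked task when it exits
.done:
    ret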
For the kernel's prioritised queue/s of things to do ("joblettes"): the kernel might do many joblettes for many different tasks (and for itself); then, when there are no more joblettes to do (and maybe also when a task that is ready to run has a higher priority than the highest priority remaining joblette), the kernel would leave its "joblette loop" and jump to the "exit_kernel" code above.
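A very rough sketch of that loop might look like this (get_highest_priority_joblette and dispatch_joblette are assumed helpers, not code from earlier):
Code: Select all
;Hypothetical sketch of the kernel's "do whatever joblette is most important" loop
; (get_highest_priority_joblette and dispatch_joblette are assumed helpers; a real
; loop might also leave early when a ready task has higher priority than all
; remaining joblettes)
kernel_entry:
    call get_highest_priority_joblette  ;eax = joblette type (zero if there are none), ebx = joblette data
    test eax,eax
    je exit_kernel                      ;No joblettes left; return to the task in current_task_TCB
    call dispatch_joblette              ;Handle one joblette (e.g. JOBLETTE_TYPE_PAGE_FAULT)
    jmp kernel_entry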
rwosdev wrote:Am I mostly correct?
No; either it's very wrong for "kernel stack per task" or it's very wrong for "one kernel stack (per CPU)".
I think (and hope) you're using "kernel stack per task" and not the second method ("one kernel stack (per CPU)"); and in that case you've ignored everything I said last time and still think that IRQs have something to do with task switching, when they do not.
I also think you're making a second mistake - thinking that the scheduler's timer (PIT) is the only thing that causes task switches. For all operating systems most task switches are caused by tasks blocking (e.g. because they have to wait for data from disk, from network, from user, from another task, etc) and by tasks unblocking (e.g. because the data that they were waiting for arrived).
For "kernel stack per task" you'd implement a low level "go to task" routine (like the example code I provided last time); then you'd implement a "find task to switch to and switch to it" routine, which figures out which task should get CPU time next and then calls the "go to task" routine. When a task blocks and you have to find another task to run, you'd call the "find task to switch to and switch to it" routine. When a task unblocks that has a higher priority than the currently running task, you'd call the "go to task" routine directly (bypassing the "find task to switch to and switch to it" code) so that the higher priority task preempts the currently running task immediately. When the currently running task has used too much CPU time, you'd call the "find task to switch to and switch to it" routine.
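As a sketch of how those routines might fit together (switch_to_task stands for the low level "go to task" routine from the earlier post and is assumed to take the target task's TCB address in esi; get_highest_priority_ready_task and TCB.priority are assumptions):
Code: Select all
;Hypothetical sketch; switch_to_task is the low level "go to task" routine
; (assumed to take the target task's TCB address in esi)
schedule:                                ;The "find task to switch to and switch to it" routine
    call get_highest_priority_ready_task ;Assumed helper; returns a TCB address in esi
    jmp switch_to_task                   ;Tail-call the low level "go to task" routine

block_task:                              ;Called when the current task has to wait for something
    ;...move the current task onto the relevant waiting list/queue here...
    jmp schedule                         ;Find another task to run

unblock_task:                            ;esi = TCB of the task that unblocked
    ;...move the task back onto the "ready to run" structures here...
    mov edi,[current_task_TCB]
    mov eax,[esi+TCB.priority]
    cmp eax,[edi+TCB.priority]           ;Assumes a larger number means higher priority
    jbe .done
    jmp switch_to_task                   ;Preempt immediately, bypassing "schedule"
.done:
    ret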
For that last part: the timer IRQ would occur; the timer interrupt handler would do stuff (wake up sleeping tasks, update a "ticks since boot" variable, etc); then (optionally, only for some kinds of scheduling) it would check if the currently running task has used too much CPU time and call the "find task to switch to and switch to it" routine if it has; then IRET. The timer interrupt handler would not do any of the actual task switching itself, would not save or load user-space state, and would not touch CR3.
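For example (a minimal sketch for the PIC; wake_sleeping_tasks, ticks_since_boot and time_slice_remaining are assumed names, and schedule is the "find task to switch to and switch to it" routine sketched above):
Code: Select all
;Hypothetical sketch of a timer IRQ handler for "kernel stack per task"
timer_IRQ_handler:
    pushad                              ;Save registers (the helpers may clobber them)
    inc dword [ticks_since_boot]
    call wake_sleeping_tasks            ;Move tasks whose sleep expired back to "ready to run"
    mov al,0x20
    out 0x20,al                         ;Send EOI to the master PIC
    ;Optionally, only for some kinds of scheduling:
    sub dword [time_slice_remaining],1
    jnz .done
    call schedule                       ;May switch tasks; returns when this task runs again
.done:
    popad
    iretd
Note that the EOI is sent before any task switch can happen, so that the next timer IRQ isn't blocked while some other task is running.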
Cheers,
Brendan