Task switching both ring-0 and ring-3 tasks

naegelejd · Post by **naegelejd** » Thu Nov 13, 2014 12:29 pm

My kernel is 32-bit x86. I have a working multitasking system derived from old GeekOS code that works in Ring-0 only. I am adapting it to work in Ring-3, user-mode. Every page of memory in use by my kernel is mapped to 0xC0000000 and above, with the User privilege bit set so I can worry about switching page directories later.

What important pieces of information do I need to know to make the task switching code work for "threads" running in both ring-0 and ring-3?

The switching occurs in the PIT interrupt handler in Ring-0. I've already added an additional (kernel) stack to each task that is patched into the TSS on each task switch. I believe the CPU also pushes SS and ESP onto the stack on interrupt entry (and IRET pops them again), but only on a privilege level change (Ring-3 to Ring-0). If a ring-0 task is running and is preempted, IRET won't pop these two values off the stack, so they shouldn't be pushed. I'm assuming the initial task stack and the thread switching code need to do slightly different things depending on whether the task is running in Ring-0 or Ring-3. Does this make sense?

Brendan · Post by **Brendan** » Fri Nov 14, 2014 1:10 am

naegelejd wrote: Does this make sense?

Yes; and no.

The PIT interrupt handler is only one of the many reasons for a task switch; and (on real work-loads) often it's the least important. Other reasons include the currently running task blocking for any reason (e.g. waiting for IO) or terminating; or a higher priority task unblocking (e.g. the IO it was waiting for occurred) or being spawned and pre-empting a currently running (lower priority) task.

In general; something (kernel API call, IRQ, exception) causes a switch from CPL=3 to CPL=0, then the kernel does its thing, then returns from CPL=0 back to CPL=3; but this has nothing to do with task switching. While the kernel is doing its thing (and running at CPL=0), the kernel may or may not decide to do a task switch; but this has nothing to do with switching between CPL=3 and CPL=0. Task switches are always from "task running kernel code at CPL=0" to "different task running kernel code at CPL=0".

Now let's think about task switching. There are only 2 cases:

Kernel knows exactly which task it should switch to (e.g. a higher priority task just unblocked and is pre-empting a lower priority task). For this case you need some sort of "switch_to_task(task_ID)" function.
Kernel does not know which task it should switch to (e.g. the currently running task just blocked, or used its time slice). In this case your scheduler would have some sort of "decide_which_task_to_switch_to(void)" function that would decide which task to switch to and would then call the "switch_to_task(task_ID)" function.

It is best to implement the "switch_to_task(task_ID)" function first, and test it. To test it you could have 2 kernel threads that just switch to each other (e.g. task 1 does "switch_to_task(2);" and task 2 does "switch_to_task(1);"). After that works properly; then implement the "decide_which_task_to_switch_to(void)" function, and test it. To test it you could just have 2 kernel threads that both call "decide_which_task_to_switch_to();".

After you know that both of these things work correctly; then implement the timer IRQ handler. All it does is call the "decide_which_task_to_switch_to(void)" function.
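
As a rough illustration of that split, here is a minimal sketch in C that can be compiled and tested outside a kernel. Everything in it other than the two function names from the post is an assumption: the ready list, the round-robin policy, and the fact that "switch_to_task" only records the decision instead of really switching stacks.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8

/* Hypothetical ready list: IDs of runnable tasks, visited round-robin. */
static int ready[MAX_TASKS];
static size_t ready_count = 0;
static size_t next_index = 0;

/* Hypothetical bookkeeping: which task is running, how many switches happened. */
static int current_task = 0;
static int switch_count = 0;

/* Stand-in for the real (assembly) switch; it only records the decision. */
void switch_to_task(int task_id)
{
    if (task_id == current_task)
        return;                 /* already running it, nothing to do */
    current_task = task_id;
    switch_count++;
}

/* Hypothetical helper that adds a task to the ready list. */
void make_ready(int task_id)
{
    assert(ready_count < MAX_TASKS);
    ready[ready_count++] = task_id;
}

/* The kernel does not know which task to run next, so pick one
 * (plain round-robin here) and hand it to switch_to_task(). */
void decide_which_task_to_switch_to(void)
{
    if (ready_count == 0)
        return;                 /* nothing else is runnable */
    int next = ready[next_index];
    next_index = (next_index + 1) % ready_count;
    switch_to_task(next);
}
```

The point of the layering is that the timer IRQ, "yield()" and blocking calls can all funnel into "decide_which_task_to_switch_to()", while code that already knows the target calls "switch_to_task()" directly.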

Cheers,

Brendan

naegelejd · Post by **naegelejd** » Fri Nov 14, 2014 12:31 pm

I need to elaborate a little more on my existing scheduling code. I incorrectly stated that the task switching occurs in the PIT interrupt handler. The PIT interrupt handler invokes my scheduler if the current thread has outlived its timeslice. I have a "runnable" thread queue, "sleep" queue, and wait queues for threads/resources. Threads can invoke the scheduler using yield, etc. The scheduler disables interrupts, switches tasks, then enables interrupts again. So I essentially have a great multi-threading system.

I see what you mean: the Ring-0 to Ring-3 switch and vice versa are actually unrelated to task switching. I guess my issue is understanding how to set up the stack of a new thread and whether anything different needs to happen when jumping back to Ring-3.

A new thread is created with a C function accepting a single argument. I then set up the thread's stack to look like this:

arg(s) to main thread function
address of shutdown_thread (thread cleanup code)
address of main thread function
eflags (0)
CS (ring-0: 0x08)
address of launch_thread (initialization function that just enables interrupts)
fake error code
fake interrupt #
eax, ecx, edx, ebx, ebp, esi, edi (all 0s)
DS (ring-0: 0x10)
ES, FS, GS (all 0s)

then the thread is just added to the "runnable" queue.

Then in my switch_to_thread function, which accepts a pointer to the new thread, the following happens:

change the stack from this:

pointer to new thread
return address

to:

pointer to new thread
eflags
CS (ring-0: 0x08)
return address
fake error code
fake interrupt #
registers (eax, ecx, edx, ebx, ebp, esi, edi, ds, es, fs, gs)

then:

save thread's stack pointer inside current thread struct
clear current thread's "num ticks" field
update the "global current thread" using the "new thread" argument (esp+64)
change esp to saved esp in new thread struct
update esp in the tss to new thread's kernel stack
pop all registers from stack
pop fake error code and fake interrupt #
iret

So, for threads intended to run in Ring-3, don't I need to set up the stack differently, e.g. also push SS/ESP and use ring-3 CS/DS (0x1b, 0x23)?

Brendan · Post by **Brendan** » Sat Nov 15, 2014 4:55 am

Hi,

naegelejd wrote: A new thread is created with a C function accepting a single argument. I then set up the thread's stack to look like this:

arg(s) to main thread function
address of shutdown_thread (thread cleanup code)
address of main thread function
eflags (0)
CS (ring-0: 0x08)
address of launch_thread (initialization function that just enables interrupts)
fake error code
fake interrupt #
eax, ecx, edx, ebx, ebp, esi, edi (all 0s)
DS (ring-0: 0x10)
ES, FS, GS (all 0s)

then the thread is just added to the "runnable" queue.

When switching from a thread running at CPL=0 to another thread running at CPL=0; things like the segment registers are always the same for kernel code (e.g. CS is always 0x0008, DS is always 0x0010, etc). They do not need to be saved or restored during a task switch. The stack might look like this:

eax, ecx, edx, ebx, esp (pushed by PUSHAD, ignored by POPAD), ebp, esi, edi
return EIP for whoever called the kernel's "goto_thread(thread)" function

The "goto_thread(thread_data *thread)" function might look like this:

Code:

;Input
; eax = address of thread's data
;
;WARNING: Interrupts must be disabled (and re-enabled) by caller.

goto_thread:
    cmp [currentThread],eax             ;Are we already running the thread?
    jne .doSwitch                       ; no
    ret                                 ; yes, do nothing
    
.doSwitch:
    pushad                              ;Save previous thread's general registers
    mov ebx,[currentThread]             ;ebx = address of previous thread's data
    mov [ebx+thread_data.saved_esp],esp ;Save previous thread's ESP

    mov ebx,[eax+thread_data.saved_cr3] ;ebx = new thread's CR3
    mov ecx,cr3                         ;ecx = current CR3
    cmp ebx,ecx                         ;Is CR3 different?
    je .l1                              ; no, don't reload it (avoid TLB flush)
    mov cr3,ebx                         ; yes, load new thread's CR3
.l1:
    mov [currentThread],eax             ;Update the address of currently running thread's data
    mov esp,[eax+thread_data.saved_esp] ;Load new thread's ESP
    popad                               ;Load new thread's general registers
    ret                                 ;Everything done!

Note: This is simplified a little to make it easier to understand. For a real OS I'd be very tempted to do things like (e.g.) update the amount of time the previous task has used, and I'd also want to worry about saving and loading FPU/MMX/SSE/AVX state (either in the task switch, or in the "delayed FPU/MMX/SSE/AVX state saving" way involving the TS flag and "device not available" exception). Also the single "currentThread" global variable wouldn't work for multi-CPU (and you might want to use GS for "CPU local data" and something like "[gs:currentThreadForThisCPU]").

Now; when creating a new thread, the new thread's kernel stack must contain whatever the "goto_thread(thread)" function takes off of the stack. This means that when creating the new thread its kernel stack would contain:

eax, ecx, edx, ebx, esp (a dummy slot that POPAD skips), ebp, esi, edi
return EIP (this would be the address of your "launch_thread" initialisation code)

Your "launch_thread" initialisation code could enable IRQs, allocate/create the thread's "CPL=3 stack", put whatever it likes on that CPL=3 stack and/or in general purpose registers, and "return" from CPL=0 to CPL=3 using either a RETF or IRET or maybe SYSRET or SYSEXIT (it doesn't matter much which).
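
For the IRET variant, the values that "launch_thread" would push before executing IRET could be described roughly as below. This is a sketch, not anyone's actual code: the struct and function names are hypothetical, the selectors 0x1b and 0x23 are the ring-3 values mentioned earlier in this thread, and 0x202 is simply EFLAGS with IF set so the new thread starts with interrupts enabled. The data segment registers would also need to be loaded with the ring-3 data selector before the IRET.

```c
#include <assert.h>
#include <stdint.h>

/* Ring-3 selectors as quoted in this thread (GDT index 3 and 4, RPL 3). */
#define USER_CS 0x1Bu
#define USER_DS 0x23u
/* EFLAGS with IF (bit 9) set plus the reserved, always-one bit 1. */
#define EFLAGS_IF_SET 0x202u

/* The five dwords IRET pops when returning to an outer privilege level,
 * listed from the lowest address (top of stack) upward. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
};

/* Hypothetical helper: fill in a frame that drops to CPL=3 at 'entry'. */
struct iret_frame make_user_iret_frame(uint32_t entry, uint32_t user_stack_top)
{
    struct iret_frame f;
    f.eip = entry;
    f.cs = USER_CS;
    f.eflags = EFLAGS_IF_SET;
    f.esp = user_stack_top;
    f.ss = USER_DS;
    /* Both selectors must request privilege level 3. */
    assert((f.cs & 3u) == 3u && (f.ss & 3u) == 3u);
    return f;
}
```

A thread that stays at CPL=0 needs none of this, which is why the difference between kernel and user threads can live entirely in "launch_thread" rather than in the task switch itself.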

If you need to pass data from the "spawn_thread()" code to the "launch_thread()" code (e.g. the "address of CPL=3 function" that your "launch_thread" code will need); then the simplest way is for the "spawn_thread()" code to put it on the stack so that the "goto_thread()" code pops it (using the POPAD) into a general purpose register, where the "launch_thread()" code can expect to find it in that register.
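
A sketch of what "spawn_thread()" might do to prepare that kernel stack is shown below. The function name, the slot enum and the choice of EAX as the register carrying the CPL=3 entry point are all assumptions made for illustration; the only fixed part is the layout, which has to match what POPAD followed by RET will consume.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Slots counted upward from the saved ESP, in the order POPAD then RET
 * consume them. POPAD skips the ESP slot. */
enum {
    SLOT_EDI, SLOT_ESI, SLOT_EBP, SLOT_ESP_IGNORED,
    SLOT_EBX, SLOT_EDX, SLOT_ECX, SLOT_EAX,
    SLOT_RET_EIP,
    FRAME_WORDS
};

/* Hypothetical helper: lay out a new thread's kernel stack so that the
 * first goto_thread() to it "returns" into launch_thread with the CPL=3
 * entry point in EAX. 'stack_top' points one past the highest word.
 * The return value is what would be stored as the thread's saved ESP. */
uint32_t *build_initial_kernel_stack(uint32_t *stack_top,
                                     uint32_t launch_thread_addr,
                                     uint32_t user_entry)
{
    assert(stack_top != NULL);
    uint32_t *sp = stack_top - FRAME_WORDS;
    for (size_t i = 0; i < FRAME_WORDS; i++)
        sp[i] = 0;                          /* all registers start as 0 */
    sp[SLOT_EAX] = user_entry;              /* popped into EAX by POPAD */
    sp[SLOT_RET_EIP] = launch_thread_addr;  /* popped by RET */
    return sp;
}
```

Because the frame has the same shape as the one left behind by a pre-empted thread, the switch code never needs to know whether a thread has run before.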

Please note that the "launch_thread" initialisation code is not a normal function - it does not return to its caller and should leave the CPL=0 stack empty. To be more clear, I would not attempt to write any part of the "launch_thread" initialisation code or any part of the "goto_thread(thread)" code in C (but this does not mean you can't use C calling conventions for either of them).

Cheers,

Brendan

naegelejd · Post by **naegelejd** » Tue Nov 18, 2014 11:01 am

Hi again,

Thanks for your tips. I push/pop segment registers because my "goto_task" function looks a lot like my IRQ handler, since I perform task switches when the PIT handler determines that preemption is necessary. I save and restore segment registers in my IRQ handler so they work in Ring-3 (so DS, etc. are saved as 0x23, changed to 0x10, then restored as 0x23). So a task switch can occur in two ways:

1. Manually, e.g. yield(). This saves EFLAGS, CS and EIP so it looks like an interrupt occurred, then saves a fake error code and interrupt number, then the general purpose and segment registers, because this is what my IRQ handler does. Then it switches "threads" (thread data, stack, CR3, etc.) just like in your "goto_thread" example. Lastly, it pops the segment registers, general purpose registers, fake interrupt number and error code, then just does an IRET.
2. Automatically, e.g. preempted. When the PIT IRQ occurs the base IRQ handler calls the PIT handler, which checks if preemption should occur. If so, it sets a flag named "g_need_reschedule" then returns to the base IRQ handler so that the PIC end-of-interrupt can be sent. Then, if "g_need_reschedule" is true, the IRQ handler switches threads and performs an IRET.
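
The ordering in the second case (PIT handler sets a flag, EOI is sent, and only then does the switch happen) can be sketched in C as follows. This is a simulation rather than kernel code: "pic_send_eoi" and "schedule" are stubs that only log a character, and the three-tick timeslice is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define TIMESLICE_TICKS 3   /* arbitrary value for the sketch */

static bool g_need_reschedule = false;
static int ticks_used = 0;

/* Event log so the order of EOI ('E') and switch ('S') can be inspected. */
static char event_log[32];
static int event_count = 0;

static void log_event(char c)
{
    assert(event_count < (int)sizeof event_log);
    event_log[event_count++] = c;
}

/* Stub: a real kernel would write the EOI command to the PIC here. */
static void pic_send_eoi(void) { log_event('E'); }

/* Stub: a real kernel would pick a thread and switch to it here. */
static void schedule(void)
{
    ticks_used = 0;
    log_event('S');
}

/* Only decides whether pre-emption is due; it never switches itself. */
void pit_handler(void)
{
    if (++ticks_used >= TIMESLICE_TICKS)
        g_need_reschedule = true;
}

/* Base IRQ handler: the EOI always goes out before any task switch. */
void irq0_handler(void)
{
    pit_handler();
    pic_send_eoi();
    if (g_need_reschedule) {
        g_need_reschedule = false;
        schedule();
    }
}
```

Sending the EOI first matters because the thread being switched to may not return through this handler for a long time.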

This means any thread that is either "new" or sitting on the "ready" queue has a stack that looks like an interrupt just occurred with all registers pushed.

I understand your idea for "returning" from CPL=0 to CPL=3 in the "launch_thread". Essentially, just set up the CPL=3 stack and do something like the "Getting to Ring 3" wiki example.

I'm more than happy to simplify my task switching code to look like your "goto_thread" function (which looks like it doesn't treat switching like interrupt handling), but I'm not sure how to do so and still use the PIT handler for preemption. One issue is that new threads don't return to the caller of "goto_thread"... they jump straight to "launch_thread".

Brendan · Post by **Brendan** » Tue Nov 18, 2014 11:19 am

Hi,

naegelejd wrote: I'm more than happy to simplify my task switching code to look like your "goto_thread" function (which looks like it doesn't treat switching like interrupt handling), but I'm not sure how to do so and still use the PIT handler for preemption. One issue is that new threads don't return to the caller of "goto_thread"... they jump straight to "launch_thread".

For new threads, "goto_thread" thinks it's returning to the caller, but it's actually "returning" to the launch_thread code (and when creating the new thread, you'd setup the new thread's kernel stack to make that work).

Cheers,

Brendan

naegelejd · Post by **naegelejd** » Tue Nov 18, 2014 11:50 am

Right, that's what it does now because I set its stack up like this:

arg(s) to main thread function
address of shutdown_thread (thread cleanup code)
address of main thread function
eflags (0)
CS (ring-0: 0x08)
address of launch_thread
fake error code
fake interrupt #
eax, ecx, edx, ebx, ebp, esi, edi (all 0s)
DS, ES, FS, GS (ring-0: 0x10)

so "launch_thread" is the EIP popped by IRET. "launch_thread" then returns to "main_thread_function", etc. This all works fine.

I'm interested in whether you'd use an interrupt handler for preemption. If so, how would "goto_task" be called since a switch inside an interrupt handler requires the target thread's stack to contain at least EIP, CS, and EFLAGS, in addition to registers because the handler ends with an IRET?

Brendan · Post by **Brendan** » Tue Nov 18, 2014 4:59 pm

Hi,

naegelejd wrote: I'm interested in whether you'd use an interrupt handler for preemption.

I'd never directly use an interrupt handler for pre-emption.

For "indirectly"...

Let's start by assuming the kernel API could be anything (e.g. SYSCALL and/or SYSENTER and/or call gate and/or software interrupt) where the kernel supports all of them and applications use whichever they like - e.g. the slowest and smallest option (software interrupt) where code size is more important, and the fastest option where speed matters. For all of them the kernel (internally) would just have a table of function pointers. For the SYSCALL handler it'd have code to "call [table+eax*4]" and do SYSRET; for the call gate handler it'd have code to do "call [table+eax*4]" and do RETF; etc.
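
In C terms, the shared table behind all of those entry points might look something like this sketch. The two API functions, the error code and the single-argument signature are made up for illustration; each entry stub (SYSCALL, call gate, software interrupt) would end up doing the equivalent of "kernel_api_dispatch" and then return with its own instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical signature shared by every kernel API function. */
typedef int32_t (*kernel_api_fn)(uint32_t arg);

#define ERR_BAD_FUNCTION (-1)

/* Two made-up API functions, just to populate the table. */
static int32_t api_get_version(uint32_t arg) { (void)arg; return 1; }
static int32_t api_double(uint32_t arg) { return (int32_t)(arg * 2u); }

/* The one table that every entry mechanism indexes with EAX. */
static const kernel_api_fn api_table[] = {
    api_get_version,   /* function 0 */
    api_double,        /* function 1 */
};
#define API_COUNT (sizeof api_table / sizeof api_table[0])

/* C equivalent of "call [table+eax*4]", with a bounds check added. */
int32_t kernel_api_dispatch(uint32_t function_number, uint32_t arg)
{
    if (function_number >= API_COUNT)
        return ERR_BAD_FUNCTION;
    assert(api_table[function_number] != NULL);
    return api_table[function_number](arg);
}
```

The bounds check is not in the one-line description above, but a function number coming from CPL=3 cannot be trusted, so some form of it is needed.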

Let's also assume that the kernel supports "batch functions"; where if your application needs to use 10 different kernel API functions it can build a list of input/output parameters and call a special "do batch" kernel API function; which would loop through the provided list (doing each kernel API function in the list, one at a time). Note: This is a very good idea, to reduce the overhead of the CPL=3 -> CPL=0 -> CPL=3 switching.
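
A "do batch" function along those lines could be sketched as below. The entry layout, the function names and the two sample API functions are assumptions, and a real kernel would also have to validate that the list itself lies in memory the caller is allowed to access.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BATCH_ERR_BAD_FUNCTION (-1)

/* Hypothetical list entry: inputs filled in by the application,
 * result written back by the kernel. */
struct batch_entry {
    uint32_t function_number;
    uint32_t arg;
    int32_t result;
};

typedef int32_t (*batch_fn)(uint32_t arg);

/* Two made-up API functions so the loop has something to call. */
static int32_t batch_add_one(uint32_t arg) { return (int32_t)(arg + 1u); }
static int32_t batch_negate(uint32_t arg) { return -(int32_t)arg; }

static const batch_fn batch_table[] = { batch_add_one, batch_negate };
#define BATCH_API_COUNT (sizeof batch_table / sizeof batch_table[0])

/* Runs each entry in order, one at a time, after a single transition
 * into the kernel. Returns how many entries named a valid function. */
size_t kernel_api_do_batch(struct batch_entry *list, size_t count)
{
    assert(list != NULL || count == 0);
    size_t completed = 0;
    for (size_t i = 0; i < count; i++) {
        if (list[i].function_number >= BATCH_API_COUNT) {
            list[i].result = BATCH_ERR_BAD_FUNCTION;
            continue;
        }
        list[i].result = batch_table[list[i].function_number](list[i].arg);
        completed++;
    }
    return completed;
}
```

With this shape, ten API calls cost one CPL=3 -> CPL=0 -> CPL=3 round trip instead of ten.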

Now...

If the current task is being pre-empted because it called a "yield()" kernel API function, then there's no guarantee that any interrupt was involved at all. The kernel API (whatever it is) would do its "call [table+eax*4]" which would probably call that "decide_which_task_to_switch_to(void)" function I mentioned earlier, which would call the "goto_thread()" function. The "goto_thread()" function only sees a normal function call.

If the current task is being pre-empted because it used all of the CPU time it was given; then this isn't really any different to the task voluntarily calling "yield()". The result is the same - e.g. the timer IRQ calls the "decide_which_task_to_switch_to(void)" function, which would call the "goto_thread()" function. The "goto_thread()" function only sees a normal function call.

If the current task is blocked (e.g. it called "sleep()" or "load_huge_file_from_slow_floppy()" or something); then it still ends up the same. Something ends up calling the "decide_which_task_to_switch_to(void)" function, which would call the "goto_thread()" function. The "goto_thread()" function only sees a normal function call.

If the current task is being pre-empted because a higher priority task unblocked; then this is different. Something (e.g. a device driver) calls something (e.g. a kernel "unblock_thread(thread)" function), which calls the "goto_thread()" function directly. The "goto_thread()" function only sees a normal function call.

Under no circumstances does the "goto_thread()" function ever see anything other than a normal function call (where creating a new thread emulates a normal function return, so that "goto_thread()" doesn't see anything different).

naegelejd wrote: If so, how would "goto_task" be called since a switch inside an interrupt handler requires the target thread's stack to contain at least EIP, CS, and EFLAGS, in addition to registers because the handler ends with an IRET?

For a simple example, imagine an interrupt handler like this:

Code:

void interruptHandler(void) {
    foo();
    bar();

    if(something) {
        do_something();
    }
}

When the interrupt handler starts, the return EIP, return CS and return EFLAGS are on the stack. If it doesn't call "do_something()" then this return information is still on the stack when the function exits. If it does call "do_something()" then the return information is still on the stack when the function exits. It simply makes no difference whether "do_something()" is called or not.

Now imagine if "do_something()" looks like this:

Code:

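unsigned int k = 0;    /* unsigned, so the arithmetic wraps instead of overflowing */

void do_something(void) {
    k = (k + 54321) * 1234567;
    if ((k & 3) == 0) {    /* parenthesised, because == binds tighter than & */
        goto_thread(4);
    }
}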

In this case, the return EIP, return CS and return EFLAGS are still on the stack, exactly where they always were, when the interrupt handler exits. It still makes no difference whether "do_something()" is called or not.

Note: Technically, if this was an IRQ handler then it'd probably need to send its EOI before calling "do_something()", so it would need to care a tiny little bit if a task switch can happen or not.

Cheers,

Brendan

naegelejd · Post by **naegelejd** » Tue Nov 25, 2014 8:44 am

Hi,

Thanks for all your help. I was able to simplify my task switching code to just push/pop general purpose registers and do a normal "RET". The one thing that was tripping me up was how to handle a task switch after a PIT IRQ and still send the PIC end-of-interrupt, but there are many ways to do it.

I also managed to get "usermode" tasks working after figuring out the necessary differences between the kernel and user thread stacks at initialization time, and where to IRET to CPL=3, etc.

Thanks again.

OSDev.org
