Hello again, and yes, I'm an idiot, and I promise this is going to be the last question I ask about multitasking.
Finally, I finished a simple multitasking implementation. I simply push all regs and eflags, save the current task's stack, switch to the next task's stack, then pop eflags and all regs, and lastly return with ret.
As the context switch code always runs in ring 0, I don't need to save segment registers etc., but what will happen when a ring 3 or virtual 8086 mode task runs for the first time? It never switches to ring 3, as there is no iret or retf, and it destroys the stack, as the initial stack setup for ring 3 and virtual 8086 mode tasks is very different from what switch_task expects.
A similar problem applies to virtual 8086 tasks. popf doesn't reload eflags completely, so virtual 8086 tasks never switch to virtual 8086 mode.
iret pops and completely reloads eflags but iret expects a stack like this:
[eflags]
[cs]
[eip]
As call switch_task pushes eip first, I can't put eflags and cs before eip without losing a register (probably eax). (Actually that isn't a problem, as eax isn't preserved in the C calling convention, but I don't want to do "black magic".)
Concurrent NMIs are delivered to the CPU one by one. IRET signals to the NMI circuitry that another NMI can now be delivered. No other instruction can do this signalling. If NMIs could preempt execution of other NMI ISRs, they would be able to cause a stack overflow, which rarely is a good thing.
Thanks in advance.
Last edited by Agola on Fri Aug 18, 2017 8:54 am, edited 1 time in total.
While creating a ring0 task, you allocate space for a stack and make it look like this:
[eip]
[eax]
[ecx]
[edx]
[ebx]
[useless esp]
[ebp]
[esi]
[edi]
[eflags]
..and when the scheduler gives it CPU time everything on the stack looks right, so (after the task switch) the CPU starts executing (in ring0) at whatever you set for "[eip]".
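The layout above can be sketched in C as a helper that pre-fills a new task's stack. This is only a sketch under stated assumptions: the name build_initial_stack, the zeroed register values and the caller-supplied initial eflags are illustrative and not from the thread, and it assumes the switch code restores with popf, popa, ret in that order.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: fill the top of a freshly allocated stack so that it
 * looks as if the task had already been switched away from once.
 *
 * stack_top      points one past the highest usable word (stacks grow down).
 * entry_eip      32-bit address where the task should start executing.
 * initial_eflags value the switch code will load with POPF (an assumption;
 *                which bits you want set depends on your kernel).
 *
 * Returns the value to store as the task's saved stack pointer.
 */
uint32_t *build_initial_stack(uint32_t *stack_top, uint32_t entry_eip,
                              uint32_t initial_eflags)
{
    uint32_t *sp = stack_top;

    *--sp = entry_eip;      /* [eip]         consumed by RET            */
    *--sp = 0;              /* [eax]         consumed by POPA (last)    */
    *--sp = 0;              /* [ecx]                                    */
    *--sp = 0;              /* [edx]                                    */
    *--sp = 0;              /* [ebx]                                    */
    *--sp = 0;              /* [useless esp] POPA skips this slot       */
    *--sp = 0;              /* [ebp]                                    */
    *--sp = 0;              /* [esi]                                    */
    *--sp = 0;              /* [edi]         consumed by POPA (first)   */
    *--sp = initial_eflags; /* [eflags]      consumed by POPF           */

    assert(stack_top - sp == 10); /* ten 32-bit slots in total */
    return sp;
}
```

The returned pointer is what the scheduler would load into esp before running the restore sequence; nothing in it says whether the code at entry_eip stays in ring0 or later drops to ring3.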
While creating a ring3 task, you allocate space for its kernel stack and make it look like this:
[eip]
[eax]
[ecx]
[edx]
[ebx]
[useless esp]
[ebp]
[esi]
[edi]
[eflags]
..and when the scheduler gives it CPU time everything on the (kernel) stack looks right, so (after the task switch) the CPU starts executing (in ring0) at whatever you set for "[eip]". The kernel code at that address might do something to switch to ring3 (e.g. pushing some more values on the stack and doing IRET or RETF, or doing a SYSRET, or a SYSEXIT); and the kernel code at that address might be an executable loader that finishes creating a virtual address space and loads an executable file into it and then does something to switch to ring3 that "returns" to the executable's entry point.
Of course you could add more to the original kernel stack in some cases if it's convenient to do so.
For example; while creating a ring3 task, you could make it look like this:
[pointer_to_command_line_args]
[pointer_to_executable_file_name]
[eip]
[eax]
[ecx]
[edx]
[ebx]
[useless esp]
[ebp]
[esi]
[edi]
[eflags]
And after the task switch the CPU would start executing the kernel code at "[eip]", and the stack would still contain the extra stuff and look like this:
[pointer_to_command_line_args]
[pointer_to_executable_file_name]
..which might make it easier to start a new process.
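To see why the extra values are still there afterwards, here is a small user-space simulation of what the restore half of the switch consumes. It is a sketch, not kernel code: simulate_restore and struct restored_context are made-up names, and it assumes the restore sequence is popf, popa, ret.

```c
#include <assert.h>
#include <stdint.h>

/* What the (simulated) CPU state looks like after popf / popa / ret. */
struct restored_context {
    uint32_t eflags;
    uint32_t edi, esi, ebp, ebx, edx, ecx, eax;
    uint32_t eip;
    const uint32_t *esp_after_ret; /* where esp points once RET is done */
};

/* sp points at the saved eflags slot, i.e. the lowest address of the frame. */
struct restored_context simulate_restore(const uint32_t *sp)
{
    struct restored_context c;

    c.eflags = *sp++; /* popf */
    c.edi = *sp++;    /* popa pops edi, esi, ebp ...            */
    c.esi = *sp++;
    c.ebp = *sp++;
    sp++;             /* ... discards the saved esp slot ...    */
    c.ebx = *sp++;    /* ... then ebx, edx, ecx, eax            */
    c.edx = *sp++;
    c.ecx = *sp++;
    c.eax = *sp++;
    c.eip = *sp++;    /* ret */
    c.esp_after_ret = sp;
    return c;
}
```

With the twelve-slot layout above, esp_after_ret[0] is pointer_to_executable_file_name and esp_after_ret[1] is pointer_to_command_line_args, so the code at "[eip]" can pick them up directly from its stack.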
In the same way, if you felt like it, you could add some extra stuff to make it easier (for kernel code at "[eip]" that is executed after the task switch happens) to switch to virtual8086 mode.
The key point here is that the scheduler and its task switch code don't need to know or care about any of this. As far as the context switch code is concerned, there are no differences between any of the different possibilities. It simply has no reason to care what the code at "[eip]" is, doesn't care if that code might switch to ring3, doesn't care if that code finishes loading a new process, doesn't care if that code switches to virtual8086 mode, etc.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Aaahhhh, finally I understood context switching. Thanks very much.
What about leaving an interrupt handler with ret? Will leaving an interrupt with ret trash the NMI execution? If so can I use iret for context switch instead of ret?
Do I have to save and restore the eflags register? And is saving and restoring eflags with just pushf and popf enough? Will that reload the necessary flags for the next task?
Agola wrote: What about leaving an interrupt handler with ret? Will leaving an interrupt with ret trash the NMI execution? If so can I use iret for context switch instead of ret?
Leaving an interrupt handler with RET can't work without ugly stack fix-ups and other problems. For NMI, if you leave the NMI handler with RET then NMI will remain blocked until something else does an IRET (so you'll probably miss NMIs until it's too late), and if you do leave the NMI handler with an IRET then it won't matter much how you leave other interrupt handlers.
Note that this has nothing to do with task switching whatsoever; and there's never a sane reason to want to leave an interrupt handler with RET. You can think of an interrupt handler like this:
someInterruptHandler:
    push whatever registers you use
    ; Do some stuff
    ; Call some other kernel functions without having any reason to care if they do/don't end up causing a task switch
    pop whatever registers you used
    iretd
Agola wrote: Do I have to save and restore the eflags register?
When tasks are running in ring0, could any important flags be different for different tasks?
The arithmetic flags (carry, overflow, etc) are expected to be trashed and therefore aren't important. The direction flag (for "string instructions" like REP MOVSD, etc) is defined by most ABIs as "must be clear during function calls" and therefore probably shouldn't be important either. A task can't be using virtual8086 mode while running kernel code so that flag can't be important either (it's guaranteed to be clear).
However, for some OSs some important flags might be different for different tasks while running in ring0. There are only a few of these that I can think of - the trap flag (if you support single-step debugging of tasks while they're in ring0), both "virtual interrupt" flags (if you use that feature for anything, and depending on how), and potentially the ID flag (if a task switch might happen while some code is trying to determine if CPUID is supported).
Don't forget that typically EFLAGS is saved when the CPU switches from ring3 to ring0 and then restored when the CPU switches from ring0 to ring3. This means that flags that only matter when running at ring3 (the IOPL field, the alignment check enable/disable flag, the virtual8086 mode flag) can be safely trashed/ignored by all code running at ring0 (including the task switch code).
Basically; it's entirely possible to have a kernel that doesn't need to save/restore EFLAGS during task switches, and it's also entirely possible to have a kernel that must save/restore EFLAGS during task switches (and it's possible that both kernels are for the same OS, just with different features enabled at compile time).
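For reference, the flags discussed above can be written down as C bit masks. The bit positions are the architectural ones; the two grouping macros and the helper eflags_differ_for_switch are hypothetical names that only mirror the classification given in this post (IF is deliberately left out of both groups, since it isn't classified here).

```c
#include <assert.h>
#include <stdint.h>

/* Architectural EFLAGS bit positions (32-bit x86). */
#define EFLAGS_CF   (1u << 0)   /* carry (arithmetic, expected to be trashed) */
#define EFLAGS_ZF   (1u << 6)   /* zero  (arithmetic, expected to be trashed) */
#define EFLAGS_TF   (1u << 8)   /* trap flag, single-stepping                 */
#define EFLAGS_IF   (1u << 9)   /* interrupt enable                           */
#define EFLAGS_DF   (1u << 10)  /* direction flag for string instructions     */
#define EFLAGS_IOPL (3u << 12)  /* I/O privilege level field (two bits)       */
#define EFLAGS_VM   (1u << 17)  /* virtual8086 mode                           */
#define EFLAGS_AC   (1u << 18)  /* alignment check enable                     */
#define EFLAGS_VIF  (1u << 19)  /* virtual interrupt flag                     */
#define EFLAGS_VIP  (1u << 20)  /* virtual interrupt pending                  */
#define EFLAGS_ID   (1u << 21)  /* CPUID detection flag                       */

/* Flags the post says only matter at ring3, so ring0 code may ignore them. */
#define EFLAGS_RING3_ONLY     (EFLAGS_IOPL | EFLAGS_AC | EFLAGS_VM)

/* Flags the post says might differ between tasks while in ring0. */
#define EFLAGS_MAYBE_PER_TASK (EFLAGS_TF | EFLAGS_VIF | EFLAGS_VIP | EFLAGS_ID)

/* Nonzero if two saved EFLAGS values differ in a bit from the per-task group. */
int eflags_differ_for_switch(uint32_t a, uint32_t b)
{
    return ((a ^ b) & EFLAGS_MAYBE_PER_TASK) != 0;
}
```

A kernel that never uses any bit in EFLAGS_MAYBE_PER_TASK falls into the "doesn't need to save/restore EFLAGS" case; one that supports, say, single-stepping kernel code does not.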
Agola wrote:And is saving and restoring eflags with just pushf and popf enough? Will saving and restoring eflags with just pushf and popf reload the necessary flags for next task?
Yes.
Cheers,
Brendan
The problem is that you view switch_task as a procedure that on its own takes you from any context to any other context, i.e. one that will take you from a kernel mode context and land you directly in user mode somewhere. You can implement it that way, but the code will be complicated and stuffed with branches, carrying far too much responsibility (including iret and ret return paths, as you noticed). It will be coupled with the implementation details of other kernel facilities of all kinds, like interrupt handlers and thread creation routines.
Instead, kernels have a function that does something which is more appropriately named switch_kernel_call_stack (vs. your switch_task per se). switch_kernel_call_stack cannot take you to user mode, cannot switch the processor mode, cannot exit interrupt handlers. It only switches the current kernel call stack and performs a normal return. The caller then has the responsibility to perform any work that the new thread context additionally requires. If switch_kernel_call_stack returns into an interrupt handler, that handler will iret. If that interrupt handler's iret lands eip in kernel code, this will most likely resume a kernel thread or an incomplete system call. If it lands in user mode, it will resume an interrupted user thread. In the special (and rather convoluted) case that switch_kernel_call_stack returns into a user thread setup routine, the latter will have to complete the user space thread creation (i.e. the part which has to be performed inside the kernel anyway). I hope that this will not confuse you, but another way to think about it is to say that you should be implementing user space and kernel space preemption on top of cooperative multitasking. switch_kernel_call_stack implements the classic cooperative call stack switching, and interrupt routines facilitate the preemption mechanics.
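As a user-space analogy for "cooperative call stack switching", the POSIX ucontext functions behave much like the switch_kernel_call_stack described above: swapcontext is simply called, and it returns later on whichever stack was switched back to. This is only an illustration (it assumes a libc that provides <ucontext.h>, such as glibc, and swapcontext saves more state than a kernel routine would need); worker, record and run_cooperative_demo are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static char trace[16];
static size_t trace_len;

/* Append one character to the trace so the execution order can be inspected. */
static void record(char c)
{
    if (trace_len < sizeof trace - 1) {
        trace[trace_len++] = c;
        trace[trace_len] = '\0';
    }
}

/* Runs on its own stack and yields back to the main context once. */
static void worker(void)
{
    record('b');
    swapcontext(&worker_ctx, &main_ctx); /* cooperative yield */
    record('d');
    /* Returning here resumes uc_link, i.e. main_ctx. */
}

/* Switches between the two call stacks and returns the recorded order. */
const char *run_cooperative_demo(void)
{
    trace_len = 0;
    trace[0] = '\0';

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    record('a');
    swapcontext(&main_ctx, &worker_ctx); /* "returns" after the worker yields */
    record('c');
    swapcontext(&main_ctx, &worker_ctx); /* "returns" after the worker ends   */
    record('e');
    return trace;
}
```

From the caller's point of view each swapcontext call is just a function that takes a long time to return, which is the property the post attributes to switch_kernel_call_stack; preemption would be layered on top by having an interrupt handler make the same call.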
Your question about eflags again stems from the fact that you are trying to stuff too much responsibility into switch_task, which as I said should be semantically more like switch_kernel_call_stack. From now on, I will talk about switch_kernel_call_stack instead, to stress that point. As Brendan already stated, the flags indicate transient state, which is relevant when you are interrupted midway through your code (hence why iret restores it), but is not relevant when you perform an explicit call. Why? Because it is not callee saved state. And switch_kernel_call_stack has not interrupted anyone. It has been called. Thus it needs to save and restore not all possible registers, but only callee saved registers. It may be called from inside an interrupt handler, but it is not its job to deal with that directly for the most part. The only relevant exception is dealing carefully with the interrupt flag (i.e. IF), because you need to avoid stack overflow hazards, as I already mentioned in another post. The point is that when you call a function, you don't expect it to restore ZF or SF. The arithmetic flags in FLAGS are not callee saved state. IF must be preserved, and the additional EFLAGS bits will not change from one kernel thread context to another. And again, switch_kernel_call_stack is just a normal call, as far as its caller is concerned. It will return much later, but when it returns, it will behave just like a normal routine returning from its job.
Regarding new user threads, it will be good to understand how control is passed into user mode with iret. This happens for various reasons. It can happen because a system call completes (if int 0x80 is used), when an interrupt handler ends, or it can even happen deliberately at any point in the kernel code. The kernel can perform an iret from any place, to call out into user space if it so desires. To do so, it will push ss, esp, eflags, cs, eip, and then perform iret. This will launch whatever user procedure the kernel wants on whatever user stack the kernel wishes. This is very specific, but it illustrates that iret is the general mechanism for starting user mode code, whether in order to resume it, to call out into it, or to initially start the user thread.
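The five values pushed before that iret can be written as a C struct, listed in the order iret pops them (eip at the lowest address, ss at the highest). This is a sketch under assumptions: make_user_iret_frame is a hypothetical helper, the selectors passed in depend entirely on your own GDT, and starting the thread with IF set is a choice made here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame consumed by IRET when returning to a less privileged ring,
 * in ascending address order (pushed in the reverse order: ss first). */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
};

#define EFLAGS_RESERVED1 (1u << 1) /* bit 1 of EFLAGS always reads as 1 */
#define EFLAGS_IF        (1u << 9) /* interrupts enabled in user mode   */
#define RPL_USER         3u        /* requested privilege level 3       */

/* code_sel and data_sel are the GDT selectors of your user segments
 * (hypothetical values; they are whatever your GDT layout defines). */
struct iret_frame make_user_iret_frame(uint32_t entry, uint32_t user_stack_top,
                                       uint16_t code_sel, uint16_t data_sel)
{
    struct iret_frame f;

    f.eip = entry;
    f.cs = (uint32_t)code_sel | RPL_USER;
    f.eflags = EFLAGS_RESERVED1 | EFLAGS_IF;
    f.esp = user_stack_top;
    f.ss = (uint32_t)data_sel | RPL_USER;
    return f;
}
```

The same frame shape serves all three cases mentioned above: resuming a user thread, calling out into user space, and starting a user thread for the first time.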
So, we can assume that you need to arrange for switch_kernel_call_stack to return into some run_user_thread type of routine. For simplicity, let's also consider that the thread creator has populated the kernel stack beneath the return address to run_user_thread with the user context. That is, let's assume that below the callee saved context that switch_kernel_call_stack restores (which the thread creator must also populate) lies the address of run_user_thread (which switch_kernel_call_stack returns into), and below that are the eip, cs, eflags, esp, and ss for the initial user context. run_user_thread can consist of a simple iret in this case. You may want to distribute more work here to offload the creator, but this will be more complicated and thus I avoid it deliberately.
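A sketch of what the thread creator might write onto the new kernel stack under the assumptions in this paragraph. The function name, the choice of four callee saved slots (ebx, esi, edi, ebp as in the usual 32-bit C calling convention) and their zero values are illustrative; the real order has to match whatever your switch_kernel_call_stack pops.

```c
#include <assert.h>
#include <stdint.h>

enum { IRET_WORDS = 5, CALLEE_SAVED_WORDS = 4 };

/*
 * Hypothetical thread-creation helper.
 *
 * kstack_top            one past the highest usable word of the kernel stack
 * run_user_thread_addr  address switch_kernel_call_stack will return into
 * iret_frame            eip, cs, eflags, esp, ss in the order IRET pops them
 *
 * Returns the initial saved kernel stack pointer for the new thread.
 */
uint32_t *prepare_user_thread_stack(uint32_t *kstack_top,
                                    uint32_t run_user_thread_addr,
                                    const uint32_t iret_frame[IRET_WORDS])
{
    uint32_t *sp = kstack_top;
    int i;

    /* Push the user context so that eip ends up at the lowest address. */
    for (i = IRET_WORDS - 1; i >= 0; i--)
        *--sp = iret_frame[i];

    /* Return address consumed by the RET in switch_kernel_call_stack. */
    *--sp = run_user_thread_addr;

    /* Callee saved registers that switch_kernel_call_stack will pop. */
    for (i = 0; i < CALLEE_SAVED_WORDS; i++)
        *--sp = 0;

    return sp;
}
```

When the new thread is first switched to, the callee saved slots are popped, ret jumps to run_user_thread, and esp is left pointing at the five-word frame, so a bare iret there enters user mode.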
Last, but not least, I am not talking about process creation, but thread creation. Process creation requires that you set-up a new address space, allocate a user stack, start an image loader somewhere, etc. This requires a lot of additional thought about how the work will be distributed between kernel space and user space libraries, and whether the fork-exec model or the create-from-executable model will be used.
Edit: I noticed that we are avoiding the issue of populating the GS descriptor. This is the only rather exotic functionality that switch_kernel_call_stack may have to perform if you use gs based addressing to access the per-cpu/per-thread data like most kernels do. It is technically a callee saved state, although the notion of descriptors is not explicitly referred to when talking about ABIs.
Thank you very much again, that answers almost all of my questions.
I asked this because switch_task actually does exactly the same thing. If it's called within an interrupt handler, it leaves the interrupt and loads the next task's eip with ret. What will happen if the task that is switched to doesn't execute an iret? An interrupt handler will be left with ret, and iret won't be executed until switching to a task that will execute an iret. Won't that affect NMIs?
Agola wrote: I asked this because switch_task actually does exactly the same thing. If it's called within an interrupt handler, it leaves the interrupt and loads the next task's eip with ret. What will happen if the task that is switched to doesn't execute an iret? An interrupt handler will be left with ret, and iret won't be executed until switching to a task that will execute an iret. Won't that affect NMIs?
It will only affect NMI if the NMI handler does a task switch itself.
If you want to call yield() from your IRQ #0 handler, you should modify switchTask() to work with interrupts and iret.
I've fixed that problem (by deleting the sentence in that wiki page).
Cheers,
Brendan
Thanks!
That made me very very happy. Marked thread as solved.
Edit: I finished implementing it and it works really well! I'm really happy because now I can call schedule and yield from anywhere I want, and I no longer need to stick to interrupt handlers to switch tasks. That is really cool!