8infy wrote: Hmm thanks, I think I see what you mean. So basically you don't even wait for an IRQ to start a new task, you call schedule which switches the stack and then you start directly executing the new task and when an IRQ happens and you're inside that task the IRQ has already set up the context for iret, right?
Pretty much. See, I have a different kernel stack for each task (kernel or user task doesn't matter). If an IRQ arrives while I'm in userspace, then the IRET frame will be on top of the stack. If an IRQ arrives while I'm inside the kernel, then the IRET frame will be in the middle of the stack somewhere, and it doesn't really matter where. And yes, if a user space task is performing calculations until its time is up and it is preempted by the timer IRQ, then the timer IRQ will set the timeout flag, which will make the return-to-userspace code call schedule(), and then the task is basically sidelined with the IRET frame and whatever else is needed to return to userspace on stack.
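To make the preemption path concrete, here is a minimal sketch of the idea. All names in it (struct task, current_task, return_to_userspace_work, schedule_calls) are made up for illustration and are not from my kernel, and schedule() is only a stub that counts calls and clears the flag, where the real one would switch kernel stacks.

```c
#include <stdbool.h>

/* Hypothetical per-task state; the field name is an assumption. */
struct task {
    volatile bool timeout;  /* set by the timer IRQ once the time slice is used up */
};

static struct task *current_task;  /* hypothetical pointer to the running task */
static int schedule_calls;         /* exists only so the sketch can be observed */

/* Stand-in for the real schedule(), which would switch kernel stacks.
 * Clearing the flag here is an assumption of this sketch. */
static void schedule(void)
{
    schedule_calls++;
    current_task->timeout = false;
}

/* Timer IRQ handler: no task switch here, it only notes that time is up. */
static void timer_irq(void)
{
    current_task->timeout = true;
}

/* Runs on every return to userspace, before the registers are restored. */
static void return_to_userspace_work(void)
{
    if (current_task->timeout)
        schedule();  /* task is parked here, IRET frame still on its kernel stack */
}
```

The point of the split is that the IRQ handler itself never switches tasks. The switch happens at one well-defined place on the way out of the kernel.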
8infy wrote: But I'm still confused about the userland example, does sysret_fast_path set the iret frame?
No. I should mention that my struct regs already contains an iret frame at the top. That frame will not be used, however. The definition is this:
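#include <stdint.h>

struct regs {
    /* general-purpose registers, saved by the kernel entry code */
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t di, si, dx, cx, bx, ax, bp;
    /* error code, syscall number, or negative IRQ number, then the IRET frame */
    uint64_t code, ip, cs, flags, sp, ss;
};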
The last line is an IRET frame, and the "code" member corresponds to either the error code in exceptions, or else the syscall number in system calls or the negative interrupt number in IRQs. This way, all causes of entering the kernel can create exactly the same register image on stack, and therefore anyone can find out the registers of any task at any time, simply by looking at the top of the kernel stack. That helps with debugging.
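As a rough illustration of "looking at the top of the kernel stack", the sketch below finds the register image of a task that entered the kernel from userspace. The helper name task_user_regs and the idea of passing the stack top as a plain pointer are assumptions for this example. The static assertions only spell out the layout that follows from the definition above.

```c
#include <stddef.h>
#include <stdint.h>

struct regs {
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t di, si, dx, cx, bx, ax, bp;
    uint64_t code, ip, cs, flags, sp, ss;
};

/* 21 members of 8 bytes each, no padding: the entry code pushes exactly this. */
_Static_assert(sizeof(struct regs) == 21 * sizeof(uint64_t),
               "struct regs must be 21 quadwords");
/* The IRET frame (ip, cs, flags, sp, ss) starts right after "code". */
_Static_assert(offsetof(struct regs, ip) == 16 * sizeof(uint64_t),
               "IRET frame must start at quadword 16");

/* Hypothetical helper: for an entry from userspace, the register image
 * occupies the highest addresses of the task's kernel stack. */
static struct regs *task_user_regs(void *kernel_stack_top)
{
    return (struct regs *)kernel_stack_top - 1;
}
```

This only works because every entry path builds the same image. If system calls, exceptions and IRQs each pushed something different, a debugger would first have to know how the task got into the kernel.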
sysret_fast_path is a symbol in the syscall code, which expects RSP to point to this structure, and merely copies all members of this structure into their respective registers (except for SP). Then, it also discards the "code" member. Then, if CX happens to equal IP (which is still on stack), and R11 happens to equal "flags" (which is still on stack), then the return to userspace can be done with the sysret instruction. Else something fishy is going on, but my code will just perform an iret. If that iret faults, I'll tell the user process about it in the form of a SIGSEGV.
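The check at the end is easier to see in C than in assembly. The following is only a sketch of the comparison described above. The function name can_use_sysret is invented for this example, and the real thing is done in the assembly exit path, not in a C function. A full implementation may need further checks that I have not shown.

```c
#include <stdbool.h>
#include <stdint.h>

struct regs {
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t di, si, dx, cx, bx, ax, bp;
    uint64_t code, ip, cs, flags, sp, ss;
};

/* SYSRET loads RIP from RCX and RFLAGS from R11. It can therefore only
 * reproduce the saved state if the saved CX equals the saved IP and the
 * saved R11 equals the saved flags. Otherwise fall back to IRET. */
static bool can_use_sysret(const struct regs *r)
{
    return r->cx == r->ip && r->r11 == r->flags;
}
```

After a normal syscall instruction the two pairs match automatically, because syscall itself put the return address into RCX and the flags into R11. They only differ if something modified the saved registers in the meantime, for example a signal handler setup or a debugger.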
Now, you can't copy this completely, since you also have to deal with segment registers, which goes beyond what I showed here. Indeed, the full version of schedule() also saves and restores the FS and GS base. On top of that, you must deal with switching DS and ES when entering and leaving kernel space. That is only a small deviation from what I did here, though.