TSS and software switching
Hi,
I'm just starting to work on my multitasking code and I have a few questions:
- Do I understand correctly that if I do software switching I only need 1 TSS per core, that stores the kernel stack pointer of the current thread and the kernel data selector?
- Are those^ the only values fetched from the TSS when going from ring3 -> ring0?
- Is updating the kernel stack pointer in TSS the correct way to go when switching between threads?
- Do I need to reload the TSS by doing ltr each time I change the kernel stack pointer or is it updated automatically?
- Also, do I understand the multitasking pipeline correctly:
1. Create a "fake" initial task with the correct values on its stack, so we can switch from and to it later.
2. Whenever an IRQ fires we switch the task by changing the kernel stack pointer in the TSS and the current kernel esp, to the new task's kernel esp. (also storing the current stack top in the task state)
3. Since this new esp already contains the stuff necessary to do an IRET to this task we don't change anything else.
4. repeat 2-3 (this is oversimplified of course not taking priority / time slices, etc into account)
Thanks
Re: TSS and software switching
The TSS simply contains the threads kernel stack, kernel ss, cs, ds, es, fs, and gs. When an IRQ or ISR occurs, then the CPU will do a hardware task switch. It will fetch those values from the TSS. Each time you switch tasks in software, you place that thread's kernel stack in the TSS. The TSS is per core. Each core has its own TSS. Hopefully this answers your questions.
Re: TSS and software switching
nexos wrote: The TSS simply contains the threads kernel stack, kernel ss, cs, ds, es, fs, and gs. When an IRQ or ISR occurs, then the CPU will do a hardware task switch. It will fetch those values from the TSS. Each time you switch tasks in software, you place that thread's kernel stack in the TSS. The TSS is per core. Each core has its own TSS. Hopefully this answers your questions.
Careful, that's not quite right.
In an interrupt/trap gate, only the ss:esp is retrieved from the TSS (and then only for a privilege level change), cs:eip comes from the interrupt/trap gate, and ds, es, fs, gs are all untouched. The old cs, eip and eflags are stored on the kernel stack, and if a privilege level change occurred, the old ss and esp as well.
Else, if this interrupt specifies a task gate, then the existing CPU state is stored in the current TSS, and the task gate selects the new TSS from which to load the new state (including all the TSS state,) and stores a back pointer to the previous interrupted TSS in the new TSS. Useful for double fault handlers where the stack state can't be relied on. But hardware TSS based task switching is otherwise not useful nor a portable concept, so I think most stay away from it. This sounds more like what you're describing.
So, in summary, when switching kernel stacks, the location of the bottom (high address) of the new thread's kernel stack needs to be stored in the current per-core TSS. When user code transitions to kernel mode via an interrupt, the kernel stack pointer will be initialised from this TSS entry for ss:esp. The rest of the TSS can be ignored.
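As a rough illustration of that summary, here is a minimal sketch in C for the 32-bit case the original question is about. The struct follows the architectural 32-bit TSS layout; the helper name tss_set_kernel_stack and the idea of calling it from the scheduler are assumptions made for this example, not code from the thread. Since the CPU reads ss0:esp0 from the in-memory TSS on each ring 3 -> ring 0 transition, writing the field should be enough, and ltr should only be needed once when the TSS is first installed.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* 32-bit TSS layout (104 bytes). With software task switching only
 * ss0/esp0 (and optionally iomap_base) matter; the rest stays unused. */
struct tss32 {
    uint32_t prev_task_link;
    uint32_t esp0;   /* stack pointer loaded on a ring 3 -> ring 0 entry */
    uint32_t ss0;    /* stack segment loaded together with esp0 */
    uint32_t esp1, ss1, esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t trap;
    uint16_t iomap_base;
};

static_assert(offsetof(struct tss32, esp0) == 4, "esp0 must be at offset 4");
static_assert(sizeof(struct tss32) == 104, "a 32-bit TSS is 104 bytes");

/* Hypothetical helper: the scheduler would call this right after picking
 * the next thread. kstack_top is the high end of that thread's kernel
 * stack. ss0 is set once at boot and left alone here. */
void tss_set_kernel_stack(struct tss32 *tss, uint32_t kstack_top)
{
    tss->esp0 = kstack_top;
}
```

In a real kernel the struct would be the per-core TSS referenced by the GDT descriptor that was loaded with ltr at boot; here it is just a plain object so the sketch can be compiled and checked in isolation.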
Re: TSS and software switching
8infy wrote: - Do I understand correctly that if I do software switching I only need 1 TSS per core, that stores the kernel stack pointer of the current thread and the kernel data selector?
Yes. And the IO permission bitmap, if you allow userspace to perform IO.
8infy wrote: - Are those^ the only values fetched from the TSS when going from ring3 -> ring0?
Yes.
8infy wrote: - Is updating the kernel stack pointer in TSS the correct way to go when switching between threads?
Yes.
8infy wrote: - Also, do I understand the multitasking pipeline correctly:
1. Create a "fake" initial task with the correct values on its stack, so we can switch from and to it later.
2. Whenever an IRQ fires we switch the task by changing the kernel stack pointer in the TSS and the current kernel esp, to the new task's kernel esp. (also storing the current stack top in the task state)
3. Since this new esp already contains the stuff necessary to do an IRET to this task we don't change anything else.
4. repeat 2-3 (this is oversimplified of course not taking priority / time slices, etc into account)
No.
Separate the ideas of switching threads and getting an IRQ. I pilfered this idea from Linux and I'm sticking to it: Every task gets some task information flags. The code responsible for the kernel-to-userspace transition will check two of these flags: Was a signal received or did the task time out? The timer IRQ merely sets the timeout flag, but it is the kernel-to-user code responsible for acting on it. Also, everything in the kernel not in interrupt context (so syscall context or kernel thread) can just call schedule() at any time. Anyway, if either of these flags is set, another function is called in the kernel. That function will react to signals. But importantly for you, it will also react to a timeout by just calling schedule(). And it is that function that will pick the next task and perform the actual task switch by pushing all non-volatile registers, switching the stack, then reloading all the registers and returning. Since the stack was switched, it returns to a different context, but it will return later to the context we just left, and restore those registers.
For the actual task switch, in addition to switching the GPRs, yes, you might need to switch FS and GS, too (you do switch DS and ES at the kernel/user boundary, right? Unless you use 64-bit mode, where literally no-one cares), as well as the kernel stack pointer in the TSS, and the CR3 if it is a user task. And not to forget the FPU/SSE/AVX unit (XSAVE is your friend). And maybe the debug registers.
For the initial task, you just set up your scheduler structures and act as if this was a kernel task performing the initialization of the system. After that, the world is your oyster. If you wanted to know how to set up a new task, I would simply put on its stack enough data for schedule() to "restore" a healthy kernel task, but set the return address such that when schedule() returns, it will be into a special routine initializing the new task. That way, no special handling for uninitialized tasks is necessary.
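The flag scheme described above (the timer IRQ only marks the task, the kernel-to-user path acts on it) can be sketched roughly as follows. This is a userland simulation under stated assumptions: the names TIF_TIMEOUT, TIF_SIGPENDING, timer_irq, handle_signals and prepare_return_to_user are invented for the example, and schedule() is a stub that only counts calls.

```c
#include <assert.h>

/* Hypothetical flag bits; the thread only says that two such flags exist. */
enum { TIF_TIMEOUT = 1 << 0, TIF_SIGPENDING = 1 << 1 };

struct task { unsigned flags; };

static struct task boot_task;
struct task *current = &boot_task;

int schedule_calls;    /* stub bookkeeping instead of a real task switch */
int signals_handled;

void schedule(void) { schedule_calls++; }

void handle_signals(struct task *t)
{
    signals_handled++;
    t->flags &= ~(unsigned)TIF_SIGPENDING;
}

/* Timer IRQ handler: only records that the time slice is used up. */
void timer_irq(void)
{
    assert(current);
    current->flags |= TIF_TIMEOUT;
}

/* Runs on every kernel -> user transition, after the interrupt or
 * syscall itself has been handled. This is where the switch happens. */
void prepare_return_to_user(void)
{
    while (current->flags & (TIF_TIMEOUT | TIF_SIGPENDING)) {
        if (current->flags & TIF_SIGPENDING)
            handle_signals(current);
        if (current->flags & TIF_TIMEOUT) {
            current->flags &= ~(unsigned)TIF_TIMEOUT;
            schedule();   /* in a kernel: returns when we are picked again */
        }
    }
}
```

The point of the split is that the IRQ handler never switches stacks itself, so there is only one place in the kernel where a task can be suspended.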
Carpe diem!
Re: TSS and software switching
I see what you mean, but new tasks also have to be initialized with their stack filled with data enough to do an IRET anyways, right? I just don't see how schedule() and IRET go together, maybe you could explain it for me if you don't mind.
Re: TSS and software switching
No, an IRET is not really in the cards here. Not directly, anyway. The following are in different files (and different directories):
Code:
struct task *current;

void schedule(void) {
    struct task *next = pick_next_task();
    if (next != current)
        arch_switch_task(next);
}
Code:
void arch_switch_task(struct task *next) {
    struct task *this = current;
    if (this->flags & TIF_FPU)
        save_fpu(this);
    if (this->flags & TIF_DEBUG)
        disable_dr();
    /* Leaves "this" hanging on its own stack and continues on next's stack. */
    switch_task_asm(&this->arch.sp, next->arch.sp);
    /* Back on this task's stack: some other task has just picked "this"
     * as its "next", and "current" still names that other task. */
    if (current->flags & TIF_EXIT)
        current->flags |= TIF_FREE;
    current = this;
    if (this->flags & TIF_DEBUG)
        restore_dr(this);
    if (this->flags & TIF_FPU)
        restore_fpu(this);
    if (!(this->flags & TIF_KERNEL))
        load_cr3(this->arch.cr3);
}
Code:
switch_task_asm:
    # Save the callee-saved registers on the outgoing task's stack.
    pushq %rbp
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq %rsp, (%rdi)    # store the old stack pointer through the first argument
    movq %rsi, %rsp      # switch to the incoming task's stack
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbx
    popq %rbp
    movq %cr0, %rax      # control registers need a 64-bit operand in long mode
    btsq $3, %rax        # set CR0.TS
    movq %rax, %cr0
    retq
See, this code really doesn't care about whether the next task is initialized. It just switches to the stack. This way of jumping to another stack and leaving the current one hanging is also why arch_switch_task() has to restore the stuff of "this" instead of "next". By the time that code runs, "this" has been selected as "next" in some other task.
So, how do you start a new task? Simple: You put 7 words on the stack of the new task to get a frame that can be picked up by switch_task_asm.
Code:
extern void init_ktask_asm(void);

int new_kernel_task(void (*routine)(void*), void *arg) {
    /* [alloc new stack and task structure] */
    uint64_t *newstack = you_dont_ask_and_i_dont_tell();
    newstack -= 7;
    newstack[0] = (uint64_t)routine;         /* popped into r15 */
    newstack[1] = (uint64_t)arg;             /* popped into r14 */
    newstack[2] = (uint64_t)new_task;        /* popped into r13 */
    newstack[5] = 0;                         /* popped into rbp */
    newstack[6] = (uint64_t)init_ktask_asm;  /* return address for retq */
    new_task->arch.sp = (uint64_t)newstack;
    return 0;
}

noreturn void init_ktask(struct task *this, void (*routine)(void*), void *arg) {
    current = this;
    sti();
    routine(arg);
    this->flags |= TIF_EXIT;
    make_not_runnable(this);
    for (;;)
        schedule();
}
Code:
init_ktask_asm:
    sti
    movq %r13, %rdi      # this
    movq %r15, %rsi      # routine
    movq %r14, %rdx      # arg
    call init_ktask
    ud2
For user tasks, exiting is taken care of by user space. And so, for a user task, I actually do only have to prepare a stack for a sysret. But I use the normal syscall exit code for this:
Code:
int new_user_task(uint64_t ip, uint64_t sp) {
    /* [alloc new stack and task structure] */
    struct regs *regs = task_regs(new_task);
    regs->dx = 0;
    regs->cs = USER_CS;
    regs->cx = regs->ip = ip;
    regs->r11 = regs->flags = USER_FLAGS;
    regs->ss = USER_DS;
    regs->sp = sp;
    uint64_t *newstack = (uint64_t*)regs;
    /* regs is the lowest address of the register image, so the 7-word
     * switch frame goes directly below it; after retq, RSP points at regs. */
    newstack -= 7;
    newstack[0] = (uint64_t)new_task;        /* popped into r15 */
    newstack[6] = (uint64_t)init_utask;      /* return address for retq */
    new_task->arch.sp = (uint64_t)newstack;
    return 0;
}
Code:
init_utask:
    movq %r15, current
    xorl %eax, %eax
    jmp sysret_fast_path
And sysret_fast_path will simply restore registers, perform a sanity check, set SP and sysret. This is possible since the ABI does not specify the values of any of the registers except RSP and RDX, so I can put into RCX and R11 whatever I want. As I said, schedule can only ever be called from syscall context, kernel thread context, or late interrupt context (the interrupt is done, and only performing a return to user space is left to do), so sysret is permissible (there is a problem with using sysret from nested interrupts).
Re: TSS and software switching
Hmm thanks, I think I see what you mean. So basically you don't even wait for an IRQ to start a new task, you call schedule which switches the stack and then you start directly executing the new task and when an IRQ happens and you're inside that task the IRQ has already set up the context for iret, right? But I'm still confused about the userland example, does sysret_fast_path set the iret frame? That's kind of the most important piece here. Sorry, maybe I'm just dumb, it's hard for me to understand what's going on here exactly, especially since you're doing x64... Anyways, I'll read it a few more times and try to comprehend whatever is going on.
Re: TSS and software switching
8infy wrote: Hmm thanks, I think I see what you mean. So basically you don't even wait for an IRQ to start a new task, you call schedule which switches the stack and then you start directly executing the new task and when an IRQ happens and you're inside that task the IRQ has already set up the context for iret, right?
Pretty much. See, I have a different kernel stack for each task (kernel or user task doesn't matter). If an IRQ arrives while I'm in userspace, then the IRET frame will be on top of the stack. If an IRQ arrives while I'm inside the kernel, then the IRET frame will be in the middle of the stack somewhere, and it doesn't really matter where. And yes, if a user space task is performing calculations until its time is up and it is preempted by the timer IRQ, then the timer IRQ will set the timeout flag, which will make the return-to-userspace code call schedule(), and then the task is basically sidelined with the IRET frame and whatever else is needed to return to userspace on stack.
8infy wrote: But I'm still confused about the userland example, does sysret_fast_path set the iret frame?
No. I should mention that my struct regs already contains an iret frame at the top. That frame will not be used, however. The definition is this:
Code:
struct regs {
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t di, si, dx, cx, bx, ax, bp;
    uint64_t code, ip, cs, flags, sp, ss;
};
The last line is an IRET frame, and the "code" member corresponds to either the error code in exceptions, or else the syscall number in system calls or the negative interrupt number in IRQs. This way, all causes of entering the kernel can create exactly the same register image on stack, and therefore anyone can find out the registers of any task at any time, simply by looking at the top of the kernel stack. That helps with debugging.
sysret_fast_path is a symbol in the syscall code, which expects RSP to point to this structure, and merely copies all members of this structure into their respective registers (except for SP). Then, it also discards the "code" member. Then, if CX happens to equal IP (which is still on stack), and R11 happens to equal "flags" (which is still on stack), then the return to userspace can be done with the sysret instruction. Else something fishy is going on, but my code will just perform an iret. If that iret faults, I'll tell the user process about it in the form of a SIGSEGV.
Now, you can't copy this completely, since you also have to deal with segment registers beyond what I showed here: the full version of schedule() also saves and restores the FS and GS base, and you must deal with switching DS and ES when entering and leaving kernel space. Though that is only a small deviation from what I did here.
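The sysret-or-iret decision described above can be written down as a small predicate. This is only a sketch of the two comparisons named in the post, using the struct regs definition given there; the function name can_use_sysret is made up for this example, and the post's actual sanity check may cover more than this.

```c
#include <assert.h>   /* static_assert */
#include <stdbool.h>
#include <stdint.h>

/* Register image as defined in the post; the last line is the IRET frame. */
struct regs {
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t di, si, dx, cx, bx, ax, bp;
    uint64_t code, ip, cs, flags, sp, ss;
};

static_assert(sizeof(struct regs) == 21 * sizeof(uint64_t),
              "register image is expected to be 21 packed words");

/* Hypothetical helper: SYSRET takes RIP from RCX and RFLAGS from R11, so
 * it resumes the saved context only if those registers already match the
 * IRET frame. Otherwise the caller would fall back to iret. */
bool can_use_sysret(const struct regs *r)
{
    return r->cx == r->ip && r->r11 == r->flags;
}
```

This also shows why new_user_task() sets regs->cx together with regs->ip and regs->r11 together with regs->flags: a freshly created task then passes the check on its first return to user space.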
Re: TSS and software switching
Thanks. Do you think it's a good approach if my userland tasks first get directed to a userland bootstrap function (via ret in schedule()) that basically only does an IRET from a pre-initialized IRET frame? So all userland tasks' stack comes preinitialized with registers on it enough to do a schedule() followed by an iret.
Re: TSS and software switching
8infy wrote: Do you think it's a good approach if my userland tasks first get directed to a userland bootstrap function (via ret in schedule()) that basically only does an IRET from a pre-initialized IRET frame? So all userland tasks' stack comes preinitialized with registers on it enough to do a schedule() followed by an iret
You do you. But that is pretty much exactly what I'm doing. I'm only using sysret_fast_path because that code already had to exist. And of the two ways to return to userspace, sysret is the faster one. It clobbers two registers at least, but that is not a concern at process startup.
Re: TSS and software switching
Maybe that's a better way. I'm not that familiar with the differences between sysret and iret, I'm assuming it has something to do with fast syscalls and stuff, is it applicable to the non-longmode x86?
The reason why I'm asking this is because I literally started figuring out how multitasking works a few days ago, and the ideas that I'm suggesting could be absolutely stupid/broken. I don't 100% understand your approach because I don't know how sysret differs from iret and I'm not sure what you mean by `because that code already had to exist`.
Re: TSS and software switching
8infy wrote: I'm not that familiar with the differences between sysret and iret, I'm assuming it has something to do with fast syscalls and stuff, is it applicable to the non-longmode x86?
x86 has two fast system call instructions: syscall and sysenter. They both have a reverse operation called sysret and sysexit, respectively.
SYSCALL was AMD's invention. They put it in AMD CPUs and on those, it works well in long mode and legacy mode. SYSENTER, on the other hand, was Intel's invention. They put that in their CPUs, and so SYSENTER works well in long mode and legacy mode on Intel CPUs. However, since there is no entity on the planet more petty than two competing companies, both Intel and AMD have completely kneecapped their implementations of the other company's innovation. And so, on an AMD CPU, SYSENTER will only work in legacy mode, and on an Intel CPU, SYSCALL will only work in long mode. For you, this means you actually can use SYSENTER on a modern CPU since that should be supported. It also means detecting support for SYSENTER and SYSCALL requires you to identify the CPU vendor. And no, I don't think Cyrix ever invented yet another mechanism. Or Centaur. Or VIA. Or... you know, there are a lot of x86 CPU vendors.
So yeah, these things can be supported in legacy mode. Now you just have to detect support correctly. Of note is that Linux will not use SYSCALL on x86 legacy mode even on AMD CPUs, because something didn't work right. They do however use it in long compatibility mode (so 64-bit kernel, 32-bit processes).
8infy wrote: I don't know how sysret differs from iret and I'm not sure what you mean by `because that code already had to exist`.
For our purposes, a minor technicality. SYSRET and IRET mainly differ in the former being faster, but clobbering two registers, and not switching stacks; you have to do that yourself. You know how IRET reads the information on the return context from the stack, right? Well SYSRET does not do this. SYSRET sets CS and SS to the selector values found in the STAR MSR, sets RIP to RCX and RFLAGS to R11. That is why RCX and R11 cannot contain arbitrary values, and why SYSRET is not suitable to return from interrupts. Also, since the stack is not switched, there is a tiny amount of time between loading RSP and executing the SYSRET that an interrupt might arrive while in kernel mode with a user stack. That problem can be solved by disabling interrupts before attempting that. There might be an NMI, but I can set up the NMI handler to get a guaranteed kernel stack.
The sysret code already had to exist, because I needed it to return from syscall. I wrote the syscall handler first, and then the multitasking code afterwards. The syscall handler was basically an outgrowth of the interrupt handler code I needed first.
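To make the remark about the STAR MSR a little more concrete, here is a small sketch of how the selectors for a 64-bit SYSRET are derived from it. The field positions and the +16/+8 offsets are as I understand them from the vendor manuals; the helper names, the selftest function and the example GDT layout (0x08 kernel code, 0x18 as SYSRET base) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

struct sysret_selectors { uint16_t cs, ss; };

/* Build a STAR value: bits 47:32 hold the SYSCALL selector base,
 * bits 63:48 the SYSRET selector base. The low 32 bits are left zero. */
uint64_t make_star(uint16_t syscall_base, uint16_t sysret_base)
{
    return ((uint64_t)sysret_base << 48) | ((uint64_t)syscall_base << 32);
}

/* Selectors a SYSRET to 64-bit user code loads: CS = base + 16 and
 * SS = base + 8, both with the requested privilege level forced to 3. */
struct sysret_selectors sysret64_selectors(uint64_t star)
{
    uint16_t base = (uint16_t)(star >> 48);
    struct sysret_selectors s;
    s.cs = (uint16_t)((base + 16) | 3);
    s.ss = (uint16_t)((base + 8) | 3);
    return s;
}

/* Example layout (an assumption): user data at 0x20, 64-bit user code
 * at 0x28, which gives a SYSRET base of 0x18. */
void star_selftest(void)
{
    struct sysret_selectors s = sysret64_selectors(make_star(0x08, 0x18));
    assert(s.cs == 0x2B);
    assert(s.ss == 0x23);
}
```

The practical consequence is that the GDT has to be ordered to fit these fixed offsets, which is one more difference from IRET, where the selectors simply come from the stack frame.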
Re: TSS and software switching
Ok thanks a lot, that makes sense. So if you didn't have access to sysret, would you do what I suggested in my previous post (the preset iret frame for a new user thread)?
Re: TSS and software switching
nullplan wrote:
Code:
int new_kernel_task(void (*routine)(void*), void *arg) {
    /* [alloc new stack and task structure] */
    uint64_t *newstack = you_dont_ask_and_i_dont_tell();
    newstack -= 7;
    newstack[0] = (uint64_t)routine;
    newstack[1] = (uint64_t)arg;
    newstack[2] = (uint64_t)new_task;
    newstack[5] = 0;
    newstack[6] = (uint64_t)init_ktask_asm;
    new_task->arch.sp = (uint64_t)newstack;
}
noreturn void init_ktask(struct task *this, void (*routine)(void*), void *arg) { ... }
I am not understanding this... Aren't the parameters to init_ktask() passed in rdi, rsi and rdx? Seeding the stack with the parameters and calling switch_task_asm() means the values you set up on the stack will end up in non-volatile registers (rbp, rbx, r12-r15).
Re: TSS and software switching
Oh, I just noticed you aren't returning to init_ktask() directly but to init_ktask_asm(). I am guessing this is a trampoline that moves the parameters to the right registers.
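That guess can be checked on paper by modelling the seeded frame and the trampoline in plain C. The sketch below is a simulation, not kernel code: struct switch_frame, struct call_args, seed_kernel_frame and trampoline_args are names invented here, and they only mirror the pop order of switch_task_asm, the stores in new_kernel_task() and the three moves in init_ktask_asm as posted.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

/* The 7 words of a new task's stack, in the order switch_task_asm
 * consumes them: six pops, then the return address taken by retq. */
struct switch_frame {
    uint64_t r15, r14, r13, r12, rbx, rbp;
    uint64_t ret;
};

static_assert(sizeof(struct switch_frame) == 7 * sizeof(uint64_t),
              "the frame is exactly 7 words");

/* First three integer argument registers of the System V AMD64 ABI. */
struct call_args { uint64_t rdi, rsi, rdx; };

/* Mirrors new_kernel_task(): newstack[0..2], [5] and [6]. */
struct switch_frame seed_kernel_frame(uint64_t routine, uint64_t arg,
                                      uint64_t task, uint64_t trampoline)
{
    struct switch_frame f = {0};
    f.r15 = routine;     /* newstack[0] */
    f.r14 = arg;         /* newstack[1] */
    f.r13 = task;        /* newstack[2] */
    f.rbp = 0;           /* newstack[5] */
    f.ret = trampoline;  /* newstack[6] */
    return f;
}

/* Mirrors init_ktask_asm: moves the popped callee-saved registers into
 * the argument registers for init_ktask(this, routine, arg). */
struct call_args trampoline_args(const struct switch_frame *f)
{
    struct call_args a;
    a.rdi = f->r13;  /* this */
    a.rsi = f->r15;  /* routine */
    a.rdx = f->r14;  /* arg */
    return a;
}
```

Feeding arbitrary marker values through seed_kernel_frame() and trampoline_args() shows the task pointer arriving in rdi, the routine in rsi and the argument in rdx, which matches the parameter order of init_ktask().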