TSS and software switching

8infy · Post by **8infy** » Thu Jun 25, 2020 2:44 am

Hi,

I'm just starting to work on my multitasking code and I have a few questions:

- Do I understand correctly that if I do software switching I only need 1 TSS per core, that stores the kernel stack pointer of the current thread and the kernel data selector?
- Are those^ the only values fetched from the TSS when going from ring3 -> ring0?
- Is updating the kernel stack pointer in TSS the correct way to go when switching between threads?
- Do I need to reload the TSS by doing ltr each time I change the kernel stack pointer or is it updated automatically?
- Also, do I understand the multitasking pipeline correctly:
1. Create a "fake" initial task with the correct values on its stack, so we can switch from and to it later.
2. Whenever an IRQ fires we switch the task by changing the kernel stack pointer in the TSS and the current kernel esp, to the new task's kernel esp. (also storing the current stack top in the task state)
3. Since this new esp already contains the stuff necessary to do an IRET to this task we don't change anything else.
4. repeat 2-3 (this is oversimplified of course not taking priority / time slices, etc into account)

Thanks

nexos · Post by **nexos** » Thu Jun 25, 2020 5:27 am

The TSS simply contains the thread's kernel stack, kernel ss, cs, ds, es, fs, and gs. When an IRQ or ISR occurs, then the CPU will do a hardware task switch. It will fetch those values from the TSS. Each time you switch tasks in software, you place that thread's kernel stack in the TSS. The TSS is per core. Each core has its own TSS. Hopefully this answers your questions.

thewrongchristian · Post by **thewrongchristian** » Thu Jun 25, 2020 10:25 am

nexos wrote:The TSS simply contains the thread's kernel stack, kernel ss, cs, ds, es, fs, and gs. When an IRQ or ISR occurs, then the CPU will do a hardware task switch. It will fetch those values from the TSS. Each time you switch tasks in software, you place that thread's kernel stack in the TSS. The TSS is per core. Each core has its own TSS. Hopefully this answers your questions.

Careful, that's not quite right.

In an interrupt/trap gate, only the ss:esp is retrieved from the TSS (and then only for a privilege level change), cs:eip comes from the interrupt/trap gate, and ds, es, fs, gs are all untouched. The old eflags, cs and eip are stored on the kernel stack and, if a privilege level change occurred, the old ss and esp as well.
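
As a rough illustration of what ends up on the kernel stack, here is a minimal C sketch of the 32-bit frame, assuming an interrupt without an error code. The struct name `iret_frame32` and the helper `frame_from_user()` are made up for this example. The CPU pushes ss, esp, eflags, cs, eip in that order, so eip sits at the lowest address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Frame pushed by the CPU for an interrupt/trap gate in 32-bit mode,
 * lowest address first. Hypothetical name, for illustration only. */
struct iret_frame32 {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    /* The next two are only present if the privilege level changed. */
    uint32_t user_esp;
    uint32_t user_ss;
};

static_assert(sizeof(struct iret_frame32) == 20, "five 32-bit words");

/* The RPL bits of the saved cs tell us whether user_esp/user_ss exist. */
static bool frame_from_user(const struct iret_frame32 *f)
{
    return (f->cs & 3u) == 3u;
}
```

A handler that touches `user_esp` or `user_ss` without checking this first would read unrelated stack contents when the interrupt arrived in kernel mode.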

Otherwise, if the interrupt specifies a task gate, the existing CPU state is stored in the current TSS, and the task gate selects the new TSS from which to load the new state (all of it), storing a back pointer to the interrupted TSS in the new TSS. That is useful for double fault handlers, where the stack state can't be relied on. But hardware TSS-based task switching is otherwise neither useful nor portable, so I think most stay away from it. This sounds more like what you're describing.

So, in summary, when switching kernel stacks, the location of the bottom (high address) of the new thread's kernel stack needs to be stored in the current per-core TSS. When user code transitions to kernel mode via an interrupt, the kernel stack pointer will be initialised from the ss0:esp0 fields of this TSS. The rest of the TSS can be ignored.
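
A minimal sketch of that update in C, assuming the standard 104-byte 32-bit TSS layout. The helper name `tss_set_kernel_stack()` is hypothetical. Since the CPU reads ss0:esp0 from the TSS in memory at the time of the privilege change, writing the field should be enough; `ltr` only has to be executed once per core when the TSS is installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit TSS. With software switching only ss0/esp0 (and iomap_base,
 * if user space may do port I/O) are of interest. */
struct tss32 {
    uint32_t prev_task_link;
    uint32_t esp0;              /* loaded into esp on ring 3 -> ring 0 */
    uint32_t ss0;               /* loaded into ss on ring 3 -> ring 0 */
    uint32_t esp1, ss1, esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t trap;
    uint16_t iomap_base;
};

static_assert(sizeof(struct tss32) == 104, "32-bit TSS is 104 bytes");
static_assert(offsetof(struct tss32, esp0) == 4, "esp0 lives at offset 4");

/* Hypothetical helper: call it from the context-switch path with the
 * high end of the next thread's kernel stack. */
static void tss_set_kernel_stack(struct tss32 *tss, uint32_t kstack_top)
{
    tss->esp0 = kstack_top;
}
```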

nullplan · Post by **nullplan** » Thu Jun 25, 2020 1:16 pm

8infy wrote:- Do I understand correctly that if I do software switching I only need 1 TSS per core, that stores the kernel stack pointer of the current thread and the kernel data selector?

Yes. And the IO permission bitmap, if you allow userspace to perform IO.

8infy wrote:- Are those^ the only values fetched from the TSS when going from ring3 -> ring0?

Yes.

8infy wrote:- Is updating the kernel stack pointer in TSS the correct way to go when switching between threads?

Yes.

8infy wrote:- Also, do I understand the multitasking pipeline correctly:
1. Create a "fake" initial task with the correct values on its stack, so we can switch from and to it later.
2. Whenever an IRQ fires we switch the task by changing the kernel stack pointer in the TSS and the current kernel esp, to the new task's kernel esp. (also storing the current stack top in the task state)
3. Since this new esp already contains the stuff necessary to do an IRET to this task we don't change anything else.
4. repeat 2-3 (this is oversimplified of course not taking priority / time slices, etc into account)

No.
Separate the ideas of switching threads and getting an IRQ. I pilfered this idea from Linux and I'm sticking to it: Every task gets some task information flags. The code responsible for the kernel-to-userspace transition will check two of these flags: Was a signal received or did the task time out? The timer IRQ merely sets the timeout flag, but it is the kernel-to-user code responsible for acting on it. Also, everything in the kernel not in interrupt context (so syscall context or kernel thread) can just call schedule() at any time. Anyway, if either of these flags is set, another function is called in the kernel. That function will react to signals. But importantly for you, it will also react to a timeout by just calling schedule(). And it is that function that will pick the next task and perform the actual task switch by pushing all non-volatile registers, switching the stack, then reloading all the registers and returning. Since the stack was switched, it returns to a different context, but it will return later to the context we just left, and restore those registers.
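
A minimal sketch of that split, under assumptions: the flag names `TIF_TIMEOUT` and `TIF_SIGNAL`, and the functions `exit_to_user_work()` and `handle_signals()`, are made up for illustration, and `schedule()` is a stub that only counts calls. A real kernel would also need to update the flags atomically, since the IRQ can race with the code that clears them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TIF_TIMEOUT (1u << 0)   /* hypothetical: time slice used up */
#define TIF_SIGNAL  (1u << 1)   /* hypothetical: signal pending */

struct task { uint32_t flags; };

static struct task boot_task;
static struct task *current = &boot_task;

static unsigned schedule_calls;  /* stub bookkeeping, not a real scheduler */
static unsigned signal_calls;

static void schedule(void) { schedule_calls++; }
static void handle_signals(struct task *t) { (void)t; signal_calls++; }

/* Timer IRQ body: it only records the timeout, it never switches tasks. */
void timer_irq(void)
{
    current->flags |= TIF_TIMEOUT;
}

/* Runs on the kernel-to-user return path, after the interrupt is done. */
void exit_to_user_work(void)
{
    struct task *t = current;
    assert(t != NULL);
    if (t->flags & TIF_SIGNAL) {
        t->flags &= ~TIF_SIGNAL;
        handle_signals(t);
    }
    if (t->flags & TIF_TIMEOUT) {
        t->flags &= ~TIF_TIMEOUT;
        schedule();              /* the actual task switch happens here */
    }
}
```

The point of the design is that the IRQ handler stays short and the stack switch always happens at a well-defined place.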

For the actual task switch, in addition to switching the GPRs, yes, you might need to switch FS and GS, too (you do switch DS and ES at the kernel/user boundary, right? Unless you use 64-bit mode, where literally no-one cares), as well as the kernel stack pointer in the TSS, and the CR3 if it is a user task. And not to forget the FPU/SSE/AVX unit (XSAVE is your friend). And maybe the debug registers.

For the initial task, you just set up your scheduler structures and act as if this was a kernel task performing the initialization of the system. After that, the world is your oyster. If you wanted to know how to set up a new task, I would simply put on its stack enough data for schedule() to "restore" a healthy kernel task, but set the return address such that when schedule() returns, it will be into a special routine initializing the new task. That way, no special handling for uninitialized tasks is necessary.

8infy · Post by **8infy** » Thu Jun 25, 2020 1:43 pm

I see what you mean, but new tasks also have to be initialized with their stack filled with data enough to do an IRET anyways right? I just don't see how schedule() and IRET go together, maybe you could explain it for me if you don't mind.

nullplan · Post by **nullplan** » Thu Jun 25, 2020 10:12 pm

No, an IRET is not really in the cards here. Not directly, anyway. The following are in different files (and different directories):

Code: Select all

struct task *current;
void schedule(void) {
  struct task *next = pick_next_task();
  if (next != current)
    arch_switch_task(next);
}

Code: Select all

void arch_switch_task(struct task* next) {
  struct task *this = current;
  if (this->flags & TIF_FPU)
    save_fpu(this);
  if (this->flags & TIF_DEBUG)
    disable_dr();
  switch_task_asm(&this->arch.sp, next->arch.sp);
  if (current->flags & TIF_EXIT)
    current->flags |= TIF_FREE;
  current = this;
  if (this->flags & TIF_DEBUG)
    restore_dr(this);
  if (this->flags & TIF_FPU)
    restore_fpu(this);
  if (!(this->flags & TIF_KERNEL))
    load_cr3(this->arch.cr3);
}

Code: Select all

switch_task_asm:
  pushq %rbp
  pushq %rbx
  pushq %r12
  pushq %r13
  pushq %r14
  pushq %r15
  movq %rsp, (%rdi)
  movq %rsi, %rsp
  popq %r15
  popq %r14
  popq %r13
  popq %r12
  popq %rbx
  popq %rbp
  movq %cr0, %rax     # control registers are 64 bits wide in long mode
  btsq $3, %rax       # set CR0.TS
  movq %rax, %cr0
  retq

See, this code really doesn't care about whether the next task is initialized. It just switches to the stack. This way of jumping to another stack and leaving the current one hanging is also why arch_switch_task() has to restore the stuff of "this" instead of "next". By the time that code runs, "this" has been selected as "next" in some other task.

So, how do you start a new task? Simple: You put 7 words on the stack of the new task to get a frame that can be picked up by switch_task_asm.

Code: Select all

extern void init_ktask_asm(void);
int new_kernel_task(void (*routine)(void*), void* arg) {
  [alloc new stack and task structure]
  uint64_t *newstack = you_dont_ask_and_i_dont_tell();
  newstack -= 7;
  newstack[0] = (uint64_t)routine;
  newstack[1] = (uint64_t)arg;
  newstack[2] = (uint64_t)new_task;
  newstack[5] = 0;
  newstack[6] = (uint64_t)init_ktask_asm;
  new_task->arch.sp = (uint64_t)newstack;
}
noreturn void init_ktask(struct task *this, void (*routine)(void*), void *arg) {
  current = this;
  sti();
  routine(arg);
  this->flags |= TIF_EXIT;
  make_not_runnable(this);
  for (;;)
    schedule();
}

Code: Select all

init_ktask_asm:
  sti
  movq %r13, %rdi
  movq %r15, %rsi
  movq %r14, %rdx
  call init_ktask
  ud2
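
To make the frame layout concrete, here is a minimal C sketch that assumes the pop order of switch_task_asm above. The struct name `switch_frame` and the helper `prepare_kernel_frame()` are hypothetical and not part of the code above; unlike the original, the sketch also zeroes r12 and rbx.

```c
#include <assert.h>
#include <stdint.h>

/* One slot per pop in switch_task_asm, lowest address first, followed
 * by the return address consumed by retq. */
struct switch_frame {
    uint64_t r15, r14, r13, r12, rbx, rbp;
    uint64_t ret;
};

static_assert(sizeof(struct switch_frame) == 7 * sizeof(uint64_t),
              "the frame is seven words");

/* Hypothetical helper: carve a frame just below stack_top and return
 * the value to store as the task's saved stack pointer. */
static uint64_t *prepare_kernel_frame(uint64_t *stack_top, uint64_t entry,
                                      uint64_t routine, uint64_t arg,
                                      uint64_t task)
{
    struct switch_frame *f = (struct switch_frame *)stack_top - 1;
    f->r15 = routine;   /* init_ktask_asm moves r15 into rsi */
    f->r14 = arg;       /* init_ktask_asm moves r14 into rdx */
    f->r13 = task;      /* init_ktask_asm moves r13 into rdi */
    f->r12 = 0;
    f->rbx = 0;
    f->rbp = 0;         /* zero frame pointer ends the backtrace chain */
    f->ret = entry;     /* retq jumps here, e.g. init_ktask_asm */
    return (uint64_t *)f;
}
```

The values travel in callee-saved registers because those are the only ones switch_task_asm restores; the trampoline then moves them into the argument registers.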

For user tasks, exiting is taken care of by user space. And so, for a user task, I actually do only have to prepare a stack for a sysret. But I use the normal syscall exit code for this:

Code: Select all

int new_user_task(uint64_t ip, uint64_t sp) {
  [alloc new stack and task structure]
  struct regs* regs = task_regs(new_task);
  regs->dx = 0;
  regs->cs = USER_CS;
  regs->cx = regs->ip = ip;
  regs->r11 = regs->flags = USER_FLAGS;
  regs->ss = USER_DS;
  regs->sp = sp;
  uint64_t *newstack = (uint64_t*)regs;
  newstack -= 7;  /* 7-word switch frame directly below regs, so retq leaves RSP at regs */
  newstack[0] = (uint64_t)new_task;
  newstack[6] = (uint64_t)init_utask;
  new_task->arch.sp = (uint64_t)newstack;
}

Code: Select all

init_utask:
  movq %r15, current
  xorl %eax, %eax
  jmp sysret_fast_path

And sysret_fast_path will simply restore registers, perform a sanity check, set SP and sysret. This is possible since the ABI does not specify the values of any of the registers except RSP and RDX, so I can put into RCX and R11 whatever I want. As I said, schedule can only ever be called from syscall context, kernel thread context, or late interrupt context (the interrupt is done, and only performing a return to user space is left to do), so sysret is permissible (there is a problem with using sysret from nested interrupts).

8infy · Post by **8infy** » Fri Jun 26, 2020 12:24 am

Hmm thanks, I think I see what you mean. So basically you don't even wait for an IRQ to start a new task, you call schedule which switches the stack and then you start directly executing the new task and when an IRQ happens and you're inside that task the IRQ has already set up the context for iret, right? But I'm still confused about the userland example, does sysret_fast_path set the iret frame? That's kind of the most important piece here.

Sorry maybe I'm just dumb, it's hard for me to understand what's going on here exactly, especially since you're doing x64... Anyways, I'll read it a few more times and try to comprehend whatever is going on

nullplan · Post by **nullplan** » Fri Jun 26, 2020 6:12 am

8infy wrote:Hmm thanks, I think I see what you mean. So basically you don't even wait for an IRQ to start a new task, you call schedule which switches the stack and then you start directly executing the new task and when an IRQ happens and you're inside that task the IRQ has already set up the context for iret, right?

Pretty much. See, I have a different kernel stack for each task (kernel or user task doesn't matter). If an IRQ arrives while I'm in userspace, then the IRET frame will be on top of the stack. If an IRQ arrives while I'm inside the kernel, then the IRET frame will be in the middle of the stack somewhere, and it doesn't really matter where. And yes, if a user space task is performing calculations until its time is up and it is preempted by the timer IRQ, then the timer IRQ will set the timeout flag, which will make the return-to-userspace code call schedule(), and then the task is basically sidelined with the IRET frame and whatever else is needed to return to userspace on stack.

8infy wrote: But i'm still confused about the userland example, does sysret_fast_path set the iret frame?

No. I should mention that my struct regs already contains an iret frame at the top. That frame will not be used, however. The definition is this:

Code: Select all

struct regs {
  uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
  uint64_t di, si, dx, cx, bx, ax, bp;
  uint64_t code, ip, cs, flags,  sp, ss;
};

The last line is an IRET frame, and the "code" member corresponds to either the error code in exceptions, or else the syscall number in system calls or the negative interrupt number in IRQs. This way, all causes of entering the kernel can create exactly the same register image on stack, and therefore anyone can find out the registers of any task at any time, simply by looking at the top of the kernel stack. That helps with debugging.

sysret_fast_path is a symbol in the syscall code, which expects RSP to point to this structure, and merely copies all members of this structure into their respective registers (except for SP). Then, it also discards the "code" member. Then, if CX happens to equal IP (which is still on stack), and R11 happens to equal "flags" (which is still on stack), then the return to userspace can be done with the sysret instruction. Else something fishy is going on, but my code will just perform an iret. If that iret faults, I'll tell the user process about it in the form of a SIGSEGV.
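
The layout and the sysret check described above could be sketched in C like this. The struct is copied from the definition above; `regs_at_stack_top()` and `can_use_sysret()` are hypothetical helpers, and the first one assumes the kernel stack was empty when the task entered the kernel from user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct regs {
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t di, si, dx, cx, bx, ax, bp;
    uint64_t code, ip, cs, flags, sp, ss;  /* code + hardware IRET frame */
};

static_assert(sizeof(struct regs) == 21 * sizeof(uint64_t), "21 words");
static_assert(offsetof(struct regs, ip) == 16 * sizeof(uint64_t),
              "IRET frame starts right after the code word");

/* Hypothetical: the register image of a task that entered the kernel
 * from user space sits at the very top of its kernel stack. */
static struct regs *regs_at_stack_top(void *kstack_top)
{
    return (struct regs *)kstack_top - 1;
}

/* SYSRET takes RIP from RCX and RFLAGS from R11, so it is only usable
 * when the saved image already agrees with that. Otherwise use IRET. */
static bool can_use_sysret(const struct regs *r)
{
    return r->cx == r->ip && r->r11 == r->flags;
}
```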

Now, you can't copy this completely, since you also have to deal with segment registers beyond what I showed here (indeed, the full version of schedule() also saves and restores the FS and GS base), but you must deal with switching DS and ES when entering and leaving kernel space. Though that is only a small deviation from what I did here.

8infy · Post by **8infy** » Fri Jun 26, 2020 7:00 am

Thanks

Do you think it's a good approach if my userland tasks first get directed to a userland bootstrap function (via ret in schedule()) that basically only does an IRET from a pre-initialized IRET frame? So every userland task's stack comes preinitialized with enough registers on it to do a schedule() followed by an iret.

nullplan · Post by **nullplan** » Fri Jun 26, 2020 9:30 am

8infy wrote:Do you think it's a good approach if my userland tasks first get directed to a userland bootstrap function (via ret in schedule()) that basically only does an IRET from a pre-initialized IRET frame? So all userland tasks' stack comes preinitialized with registers on it enough to do a schedule() followed by an iret

You do you. But that is pretty much exactly what I'm doing. I'm only using sysret_fast_path because that code already had to exist. And of the two ways to return to userspace, sysret is the faster one. It clobbers two registers at least, but that is not a concern at process startup.

8infy · Post by **8infy** » Fri Jun 26, 2020 10:29 am

Maybe that's a better way. I'm not that familiar with the differences between sysret and iret, I'm assuming it has something to do with fast syscalls and stuff, is it applicable to the non-longmode x86?
The reason why I'm asking this is because I literally started figuring out how multitasking works a few days ago, and the ideas that I'm suggesting could be absolutely stupid/broken. I don't 100% understand your approach because I don't know how sysret differs from iret and I'm not sure what you mean by `because that code already had to exist`.

nullplan · Post by **nullplan** » Fri Jun 26, 2020 11:20 am

8infy wrote:I'm not that familiar with the differences between sysret and iret, I'm assuming it has something to do with fast syscalls and stuff, is it applicable to the non-longmode x86?

x86 has two fast system call instructions: syscall and sysenter. They both have a reverse operation called sysret and sysexit, respectively.

SYSCALL was AMD's invention. They put it in AMD CPUs and on those, it works well in long mode and legacy mode. SYSENTER, on the other hand, was Intel's invention. They put that in their CPUs, and so SYSENTER works well in long mode and legacy mode on Intel CPUs. However, since there is no entity on the planet more petty than two competing companies, both Intel and AMD have completely kneecapped their implementations of the other company's innovation. And so, on an AMD CPU, SYSENTER will only work in legacy mode, and on an Intel CPU, SYSCALL will only work in long mode. For you, this means you actually can use SYSENTER on a modern CPU since that should be supported. It also means detecting support for SYSENTER and SYSCALL requires you to identify the CPU vendor. And no, I don't think Cyrix ever invented yet another mechanism. Or Centaur. Or VIA. Or... you know, there are a lot of x86 CPU vendors.

So yeah, these things can be supported in legacy mode. Now you just have to detect support correctly. Of note is that Linux will not use SYSCALL on x86 legacy mode even on AMD CPUs, because something didn't work right. They do however use it in long compatibility mode (so 64-bit kernel, 32-bit processes).
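
As a complement to the vendor check, the feature bits themselves can be sketched as below. This is only a sketch: the macro and function names are made up, and the functions take the EDX value as a parameter, so executing CPUID and checking the vendor string are left to the caller. The bit positions are CPUID leaf 1 EDX bit 11 (SEP) and CPUID leaf 0x80000001 EDX bit 11 (SYSCALL/SYSRET).

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_EDX_SEP            (1u << 11)  /* SYSENTER/SYSEXIT */
#define CPUID_80000001_EDX_SYSCALL (1u << 11)  /* SYSCALL/SYSRET */

/* leaf1_edx: EDX as returned by CPUID with EAX = 1. */
static bool has_sysenter(uint32_t leaf1_edx)
{
    return (leaf1_edx & CPUID_1_EDX_SEP) != 0;
}

/* ext1_edx: EDX as returned by CPUID with EAX = 0x80000001. */
static bool has_syscall(uint32_t ext1_edx)
{
    return (ext1_edx & CPUID_80000001_EDX_SYSCALL) != 0;
}
```

Whether a set bit means the instruction is usable in the current mode still depends on the vendor, as described above.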

8infy wrote:I don't know how sysret differs from iret and I'm not sure what you mean by `because that code already had to exist`.

For our purposes, a minor technicality. SYSRET and IRET mainly differ in the former being faster, but clobbering two registers, and not switching stacks; you have to do that yourself. You know how IRET reads the information on the return context from the stack, right? Well SYSRET does not do this. SYSRET sets CS and SS to the selector values found in the STAR MSR, sets RIP to RCX and RFLAGS to R11. That is why RCX and R11 cannot contain arbitrary values, and why SYSRET is not suitable to return from interrupts. Also, since the stack is not switched, there is a tiny amount of time between loading RSP and executing the SYSRET that an interrupt might arrive while in kernel mode with a user stack. That problem can be solved by disabling interrupts before attempting that. There might be an NMI, but I can set up the NMI handler to get a guaranteed kernel stack.
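
For reference, this is roughly how the selectors fall out of the STAR MSR for a 64-bit SYSRET, as a C sketch. The type `selector_pair` and the function `sysret64_selectors()` are hypothetical names, and the sketch covers only the return to 64-bit user code (a SYSRET to compatibility mode uses the base selector itself for CS).

```c
#include <stdint.h>

struct selector_pair {
    uint16_t cs;
    uint16_t ss;
};

/* Bits 63:48 of STAR hold the base selector for SYSRET. For a return
 * to 64-bit code the CPU uses base + 16 for CS and base + 8 for SS,
 * both with RPL forced to 3. */
static struct selector_pair sysret64_selectors(uint64_t star)
{
    uint16_t base = (uint16_t)(star >> 48);
    struct selector_pair p;
    p.cs = (uint16_t)((base + 16u) | 3u);
    p.ss = (uint16_t)((base + 8u) | 3u);
    return p;
}
```

This is why the GDT has to be laid out with the user data descriptor directly before the 64-bit user code descriptor if SYSRET is going to be used.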

The sysret code already had to exist, because I needed it to return from syscall. I wrote the syscall handler first, and then the multitasking code afterwards. The syscall handler was basically an outgrowth of the interrupt handler code I needed first.

8infy · Post by **8infy** » Fri Jun 26, 2020 11:45 am

Ok thanks a lot, that makes sense. So if you didn't have access to sysret, would you do what I suggested in my previous post (the preset iret frame for a new user thread)?

kzinti · Post by **kzinti** » Fri Jun 26, 2020 11:55 am

nullplan wrote:

Code: Select all

int new_kernel_task(void (*routine)(void*), void* arg) {
  [alloc new stack and task structure]
  uint64_t *newstack = you_dont_ask_and_i_dont_tell();
  newstack -= 7;
  newstack[0] = (uint64_t)routine;
  newstack[1] = (uint64_t)arg;
  newstack[2] = (uint64_t)new_task;
  newstack[5] = 0;
  newstack[6] = (uint64_t)init_ktask_asm;
  new_task->arch.sp = (uint64_t)newstack;
}
noreturn void init_ktask(struct task *this, void (*routine)(void*), void *arg) {
...
}

I am not understanding this... Aren't the parameters to init_ktask() passed in rdi, rsi and rdx? Seeding the stack with the parameters and calling switch_task_asm() means the values you set up on the stack will end up in non-volatile registers (rbp, rbx, r12-r15).

kzinti · Post by **kzinti** » Fri Jun 26, 2020 12:02 pm

Oh, I just noticed you aren't returning to init_ktask() directly but to init_ktask_asm(). I am guessing this is a trampoline that moves the parameters to the right registers.
