OSDev.org

Posted: **Thu Mar 25, 2021 12:40 pm**

I'm trying to work on an alternative to using swapgs to set up a kernel stack when syscall is used. There are known difficulties with swapgs, such as making sure that you always swap back properly to restore the userspace gs base, plus there's an entire swapgs-specific Spectre exploit.

Frankly, I'm a little annoyed at the way it was implemented. They provided a symmetrical solution to an asymmetrical problem. It might just have been better to create a new segment register that's only accessible at CPL 0, which could be preset to each CPU's data struct, and have a new opcode prefix sequence to access it.

The way I'm trying it instead is to use rip-relative addressing to access the CPU struct instead of addressing like %gs:0, %gs:8, etc. It's working fine, but scaling it to anything more than 1 CPU is a little messy. The idea I had was that you could use dedicated linear address mappings for each CPU's syscall entry point and per-CPU struct. The code addresses would all point to the same physical memory address, but the CPU struct mappings would have to point to separate physical addresses. The rip-relative addressing should work everywhere as long as the offset between the code and data linear addressing is consistent.
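The layout described above can be checked with a little address arithmetic. What follows is only a sketch: the base addresses, the stride, and every name in it (`STUB_BASE`, `DATA_BASE`, `CPU_STRIDE`, `stub_va`, `percpu_va`) are hypothetical and not taken from any real kernel. It shows that when the stub aliases and the data pages use the same per-CPU stride, the distance from stub to data is identical on every CPU, so one rip-relative displacement in the shared code reaches that CPU's own struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: every value here is an assumption for illustration. */
#define STUB_BASE  0xFFFFFFFF80000000ULL /* CPU 0's alias of the entry code  */
#define DATA_BASE  0xFFFFFFFF80001000ULL /* CPU 0's private per-CPU page     */
#define CPU_STRIDE 0x2000ULL             /* linear-address spacing per CPU   */
#define MAX_CPUS   64u

/* Linear address at which CPU `cpu` maps the shared syscall entry code.
 * All of these aliases would point at the same physical page. */
static uint64_t stub_va(unsigned cpu)
{
    assert(cpu < MAX_CPUS);
    return STUB_BASE + (uint64_t)cpu * CPU_STRIDE;
}

/* Linear address at which CPU `cpu` maps its own per-CPU struct.
 * Each of these would point at a different physical page. */
static uint64_t percpu_va(unsigned cpu)
{
    assert(cpu < MAX_CPUS);
    return DATA_BASE + (uint64_t)cpu * CPU_STRIDE;
}

/* Distance from the start of the stub to the start of the data page.
 * The real displacement is measured from the end of the accessing
 * instruction, which only shifts this by a constant inside the stub. */
static int64_t stub_to_data_delta(unsigned cpu)
{
    return (int64_t)(percpu_va(cpu) - stub_va(cpu));
}

/* A rip-relative operand carries a signed 32-bit displacement. */
static int fits_rip_disp32(int64_t delta)
{
    return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

With these assumed constants the delta is 0x1000 for every CPU. In a real setup each CPU would presumably also load its own stub address into its syscall target MSR, so that the entry code runs from the alias that sits next to that CPU's data.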

The downside, of course, is a little more memory consumption, and more effort to set up, but the simplicity of addressing and avoiding swapgs's problems seems more than worth it.

Any thoughts? Am I asking for trouble trying to deviate from the norm?

Posted: **Thu Mar 25, 2021 1:18 pm**

My understanding is that if you don't use swapgs, you'll have to read IA32_GS_BASE and save it on syscall entry (and restore it before sysret). Similar to how you would save/restore %fs on ia32.

The downside would be that it is far less efficient in terms of CPU cycles. Or that's what I remember anyway.

It was tricky to understand and get the swapgs logic working, but when you don't get it right, it tends not to work at all. So it's not that hard to debug / get right in the end.

And yes the asymmetry was driving me nuts. What a bad design.

Posted: **Thu Mar 25, 2021 1:22 pm**

kzinti wrote: You could just save %gs on syscall entry

How and where? That's the issue that led to swapgs. You don't have a kernel stack yet. Plus, there's a tiny performance hit changing segment registers, which defeats the whole point of fast syscall.

Posted: **Thu Mar 25, 2021 1:24 pm**

Yes my bad, I updated my answer to say that IA32_GS_BASE needs to be saved. But yeah, how do you even do that since you need one location per CPU.

I just bit the bullet and used swapgs. I think it's the best way to go about it unfortunately.

A bug I had, which took me a while to figure out, was that I had forgotten to use swapgs when first entering user mode (that is, when entering it without returning from a syscall).

Posted: **Thu Mar 25, 2021 2:11 pm**

The only way I see to avoid swapgs is to have the kernel stack be at a fixed address. Which probably doesn't work with multiple processes and definitely doesn't work for multiple processors if those addresses are referring to the same physical addresses. So they would have to be different for each task. Makes task switching more complicated. Also, tasks cannot share pointers to stack memory anymore, which would throw a wrench in my kernel, since I need different tasks to share objects on their stacks a lot, but if you don't, that would be a workable solution. However, even then you need to ensure the fixed address is within the last two GB of address space for the sign extension on the 32-bit displacements to work out.
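The sign-extension constraint at the end of this post can be made concrete. The sketch below is illustrative only, and the helper names (`reachable_by_disp32`, `sign_extend_disp32`, `disp32_for`) are made up for the example. It models how a 32-bit absolute displacement is widened to 64 bits and which addresses survive that round trip.

```c
#include <assert.h>
#include <stdint.h>

/* What address generation does with a 32-bit displacement in 64-bit
 * mode: the value is sign-extended to 64 bits. */
static uint64_t sign_extend_disp32(uint32_t disp)
{
    return (disp & 0x80000000u) ? (0xFFFFFFFF00000000ULL | disp)
                                : (uint64_t)disp;
}

/* An absolute address is encodable as a disp32 only if sign-extending
 * its low 32 bits reproduces it. That is true for the lowest 2 GiB
 * (normally user space, so of little use here) and the highest 2 GiB. */
static int reachable_by_disp32(uint64_t addr)
{
    return addr <= 0x000000007FFFFFFFULL ||
           addr >= 0xFFFFFFFF80000000ULL;
}

/* Displacement to encode for a fixed kernel stack slot at `addr`. */
static uint32_t disp32_for(uint64_t addr)
{
    assert(reachable_by_disp32(addr));
    return (uint32_t)addr; /* keeps the low 32 bits */
}
```

For example, a fixed slot at 0xFFFFFFFFFFFFE000 is reachable, while anything at the bottom of the canonical upper half (0xFFFF800000000000) is not and would need a register-based or rip-relative access instead.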

Posted: **Thu Mar 25, 2021 3:04 pm**

swapgs can be tamed by distinguishing three cases:

If you know that you're interrupting userspace (e.g., on syscalls), you just swapgs unconditionally. This is the only case in which you have to load the stack from GS, and that can be exploited in the following cases.
If the interrupt is due to an IRQ or a synchronous exception (e.g., page fault), you know that the interrupted code is not in a syscall/IRQ trampoline and you check the previous CS to see if you need to swapgs. This assumes that exceptions never interrupt exception entry code, but hey, if that happens, you're likely doomed anyway.
Otherwise, you are in an asynchronous exception (NMI or MCE). In this case, you are on an IST stack anyway; in particular, you are at a known offset compared to the top of the stack. You can just embed the kernel GS above the top of the stack and manually change the GS base to this value (while saving the previous GS base).
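The three cases can be written down as a small decision function. This is a sketch under stated assumptions rather than anyone's actual entry code: the enum names and `gs_action_on_entry` are hypothetical, and a real kernel would make this decision in assembly before touching any per-CPU data. The only hardware fact relied on is that the low two bits of the saved CS give the privilege level of the interrupted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* How the kernel was entered (hypothetical classification). */
enum entry_kind {
    ENTRY_SYSCALL,               /* always comes from user mode          */
    ENTRY_IRQ_OR_SYNC_EXCEPTION, /* IRQ, page fault, and similar         */
    ENTRY_NMI_OR_MCE             /* asynchronous, runs on an IST stack   */
};

/* What the entry path should do about the GS base. */
enum gs_action {
    GS_SWAPGS,             /* user GS base is live: execute swapgs        */
    GS_LEAVE,              /* kernel GS base is already live              */
    GS_LOAD_FROM_IST_SLOT  /* save old base, load the value stored above
                              the top of the IST stack                    */
};

/* The RPL bits of the saved CS are nonzero if user mode was interrupted. */
static bool cs_is_user(uint16_t saved_cs)
{
    return (saved_cs & 3u) != 0;
}

static enum gs_action gs_action_on_entry(enum entry_kind kind,
                                         uint16_t saved_cs)
{
    switch (kind) {
    case ENTRY_SYSCALL:
        return GS_SWAPGS;                 /* case 1: unconditional        */
    case ENTRY_IRQ_OR_SYNC_EXCEPTION:
        return cs_is_user(saved_cs) ? GS_SWAPGS : GS_LEAVE; /* case 2    */
    case ENTRY_NMI_OR_MCE:
        return GS_LOAD_FROM_IST_SLOT;     /* case 3: never trust swapgs   */
    }
    assert(!"unknown entry kind");
    return GS_LEAVE;
}
```

The exit path would mirror whatever was chosen on entry: swapgs again if it was executed, nothing if the base was left alone, or a restore of the saved base in the NMI/MCE case.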

Posted: **Thu Mar 25, 2021 3:55 pm**

I have to admit I'm a little confused by the use of swapgs for interrupts and exceptions. You definitely need to do something for syscalls, because no automatic stack switch occurs, but other events change the stack for you, either to rsp0 in the TSS, or by using an IST.

If you stick to using swapgs solely for syscalls, then I guess it's relatively simple. If you return from the syscall, do swapgs again before sysret. If you don't return immediately (say, the task terminates, or requests a device access that suspends the process), then swapgs to restore MSR_GSBASE before branching off somewhere else (scheduler, etc.)
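The pairing rule is easy to see in a toy model. The sketch below simulates swapgs in plain C as an exchange between the active GS base and the kernel GS base MSR; the struct, the function names, and the two example addresses are all hypothetical. It also reproduces the first-entry bug mentioned earlier in the thread, where user mode is entered for the first time without a swapgs.

```c
#include <assert.h>
#include <stdint.h>

/* Example values only; neither address comes from a real system. */
#define PERCPU_BASE 0xFFFFFFFF80001000ULL /* kernel's per-CPU struct */
#define USER_GS     0x00007F0000001000ULL /* user-space GS base      */

/* Toy model of the two registers that swapgs exchanges. */
struct cpu_model {
    uint64_t gs_base;        /* base currently used for %gs accesses */
    uint64_t kernel_gs_base; /* the value parked in the MSR          */
};

static void swapgs(struct cpu_model *c)
{
    assert(c);
    uint64_t tmp = c->gs_base;
    c->gs_base = c->kernel_gs_base;
    c->kernel_gs_base = tmp;
}

/* Kernel running after boot: per-CPU base live, user value parked. */
static struct cpu_model boot_state(void)
{
    struct cpu_model c = { PERCPU_BASE, USER_GS };
    return c;
}

/* First transition to user mode, which is not a return from a syscall. */
static void enter_user_first_time(struct cpu_model *c, int remember_swapgs)
{
    if (remember_swapgs)
        swapgs(c);
}

static void syscall_entry(struct cpu_model *c) { swapgs(c); }
static void syscall_exit(struct cpu_model *c)  { swapgs(c); }
```

In this model a correct first entry leaves `USER_GS` live in user mode, and each syscall entry/exit pair returns to that state. If the first swapgs is skipped, the next syscall entry swaps the wrong way round and the kernel ends up running with `USER_GS` as its base.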

Still, I don't like the scenario where you somehow could return to user space with the wrong gs base. You could check it every time, but that adds overhead.

Posted: **Thu Mar 25, 2021 4:22 pm**

You want GS not only to obtain the syscall stack but also to access per-CPU data. Hence, you also need to make sure that GS is correct in IRQ and exception handlers.

You never return to user space with a wrong GS base.

Posted: **Thu Mar 25, 2021 4:30 pm**

Korona wrote: You want GS not only to obtain the syscall stack but also to access per-CPU data. Hence, you also need to make sure that GS is correct in IRQ and exception handlers.

Well, sure, but the reason you use swapgs for syscall is because you normally can't easily access the per-CPU data without modifying registers, and you don't have a stack yet. In an IRQ or exception handler, you can push registers, access the per-CPU data, and then pop them back.

If the reason is to avoid having two different mechanisms to access the per-CPU data, then you could use the %gs:X method just long enough to swap stacks, and use the stack-based method after that.

I just don't like the weirdness I'm seeing where OSes like Linux use swapgs for exceptions and then have to dig through the stack to skip swapgs if it's not appropriate, then figure out whether to swap it back. It all seems so hackish to me. I can understand having to do hackish things because of mistakes Intel made in 1978. We shouldn't have to do it in this century.

Posted: **Thu Mar 25, 2021 5:17 pm**

If you have an alternative per-CPU mechanism based on per-CPU entry stubs, just use that to obtain the syscall stack. But IMHO it is questionable whether per-CPU entry stubs are a better solution than swapgs (per-CPU stubs are certainly more tedious to implement). (Although, yes, swapgs is a horrible design; I think everybody will agree on that -- even Intel and AMD are fixing the syscall/interrupt model in upcoming CPUs.)

You do not need to dig through any stacks to swapgs correctly, except in the case of asynchronous exceptions, i.e., NMI and MCE. For everything else, you can just check the CS value of the iret frame to see if you interrupted kernel mode or not.

Posted: **Thu Mar 25, 2021 5:42 pm**

Korona wrote: If you have an alternative per-CPU mechanism based on per-CPU entry stubs, just use that to obtain the syscall stack. But IMHO it is questionable whether per-CPU entry stubs are a better solution than swapgs (per-CPU stubs are certainly more tedious to implement).

It's more tedious to set up, I admit. But setup happens when nothing is on the line, and I like the idea that once it's set up you can use it in a simple and straightforward manner, and there's no way to leak it into user space by accident. I still think a CPL 0-only register would be much better.

Korona wrote: even Intel and AMD are fixing the syscall/interrupt model in upcoming CPUs.

Oh, good, a FOURTH system call mechanism.

I hope they're fixing exceptions, too. That's part of what I was referring to as Intel's 1978 mistakes. They should have pushed the exception/interrupt number on the stack with the return address/flags/etc. so you don't need 256 separate stubs just to identify which interrupt may have fired by mistake.

Posted: **Fri Mar 26, 2021 2:36 am**

That's what's happening. It's not a new syscall mechanism, it replaces the IDT entirely.

Posted: **Fri Mar 26, 2021 6:59 pm**

This entire mechanism still seems overly complicated. From what I can tell, it also only works in IA32-E mode, not in long mode (you have to set CR4.FRED [bit 31] to one, but that only appears to enable it in emulated 32-bit mode and not in long mode). I'm not really sure what the point of this is if it only works in IA32-E mode, especially since nearly every OS runs solely in long mode now.

Posted: **Fri Mar 26, 2021 7:51 pm**

"IA-32e mode" is Intel's name for long mode.

AMD proposes fixing the stupid design mistakes in the existing system rather than replacing it entirely. It's unclear whether having two competing proposals is a good thing.

Posted: **Fri Mar 26, 2021 8:14 pm**

Octocontrabass wrote: "IA-32e mode" is Intel's name for long mode.

AMD proposes fixing the stupid design mistakes in the existing system rather than replacing it entirely. It's unclear whether having two competing proposals is a good thing.

They're "fixing" it by rolling back part of the entire purpose of syscall, which is fast entry. Writing an exception-like frame to the new stack seems like the wrong thing to do.

alternatives to swapgs
