How do modern OSs handle the CPU regs of multiple processes?

xeyes · Post by **xeyes** » Tue Aug 09, 2022 2:28 am

nullplan wrote:
xeyes wrote: I guess people tend to say return because of the flexibility of IRET (and to a lesser degree, far return) on x86.
It is the same on many architectures. My beloved PowerPC has "rfi" (return from interrupt), and that instruction is used twice for each syscall and interrupt. Because someone decided that all exceptional conditions ought to turn off the MMU, so the first-level exception handlers have to find a kernel stack to write their info to and then use "rfi" to turn on the MMU again and transition to the kernel handler in the same instruction. And then afterwards it uses "rfi" to return to the user space code.

However, this is a detail. I tend to see execution on a CPU as something in which userspace resides on the topmost stack frame. Any interrupt or syscall transitions to kernel space and pushes the outermost stack frame onto the kernel stack (on PowerPC this just happens in software while the MMU is off, but that is incidental), and finally the CPU will return to userspace. Scheduling means switching stacks, and the low-level switch function does nothing but save the non-volatile registers before doing so. On thread exit the same thing happens, except the task is marked with a flag that means it will never be scheduled in again. And on startup, initialization partly entails constructing an initial task stack. The funny thing is, you never need to clean up the stack entirely when running the userspace task for the first time: just construct an IRET frame and perform an IRET (or maybe just SYSRET), and the next time the kernel is called, the stack starts over at the place named in the TSS.

Thanks for sharing the interesting detail. So the handlers (or part of them) have to be identity mapped because they run across the MMU on/off boundary? I wonder whether this is because the architecture dates from a period when most CPUs didn't have MMUs? The newer Power architecture (like POWER10) probably doesn't have the same limitation anymore?

nullplan wrote:The funny thing is, you never need to clean up the stack entirely when running the userspace task for the first time

Why do you ever need to clean it up? Is it because of using some fast syscall instructions that don't switch stacks during a syscall? Otherwise, if there's always a stack switch, the user-level code won't be able to look at it, so there's no need to clean up, right?

nullplan · Post by **nullplan** » Tue Aug 09, 2022 11:21 pm

xeyes wrote: Thanks for sharing the interesting detail. So the handlers (or part of them) have to be identity mapped because they run across the MMU on/off boundary? I wonder whether this is because the architecture dates from a period when most CPUs didn't have MMUs? The newer Power architecture (like POWER10) probably doesn't have the same limitation anymore?

The handlers also have fixed addresses (e.g. Data Access Exception is at 0x300, External Interrupt is at 0x500, etc.), so in practice, most operating systems simply copy themselves to the start of the address space. The handlers are all in the first 12kB, and why make it more complicated than it has to be? When the kernel takes control of the system, all of RAM is free, so it might as well move itself to address 0. Linux has a linear mapping of physical address 0 to the 3GB line, so translating between physical and virtual is pretty simple in that range.
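To make the "pretty simple" translation concrete, here is a minimal sketch of a linear mapping that places physical address 0 at the 3GB line. The names `KERNEL_BASE`, `LINEAR_WINDOW`, `phys_to_virt` and `virt_to_phys` are made up for this illustration, and the 1GB window is simply what is left between 3GB and 4GB in a 32-bit address space; a real kernel may map less than that.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only: physical 0 appears at the 3GB line. */
#define KERNEL_BASE   UINT32_C(0xC0000000)  /* hypothetical name for the 3GB line */
#define LINEAR_WINDOW UINT32_C(0x40000000)  /* at most 1GB fits between 3GB and 4GB */

/* Physical -> kernel virtual: a single addition inside the linear window. */
static inline uint32_t phys_to_virt(uint32_t phys)
{
    assert(phys < LINEAR_WINDOW);
    return phys + KERNEL_BASE;
}

/* Kernel virtual -> physical: the matching subtraction. */
static inline uint32_t virt_to_phys(uint32_t virt)
{
    assert(virt >= KERNEL_BASE);
    return virt - KERNEL_BASE;
}
```

With this layout, a handler copied to physical 0x300 is also visible to the kernel at `phys_to_virt(0x300)`, i.e. 0xC0000300, once translation is back on.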

As to POWER10, I kind of doubt it. I couldn't find an architecture manual, but I looked at one for PPC64 not too long ago and found that it too will turn off the DR and IR bits in the MSR. I'm guessing they want to keep compatibility, so not all the OSes have to be redeveloped.
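For readers who want to see the two uses of "rfi" laid out, here is a toy C model of the sequence, not real kernel code. The mask values are the classic 32-bit PowerPC MSR bits; `struct cpu_model`, `take_exception` and `enter_kernel_handler` are names invented for this sketch, and real hardware copies only some MSR bits into SRR1 and clears more bits than the four shown.

```c
#include <assert.h>
#include <stdint.h>

/* MSR bit masks as on classic 32-bit PowerPC. */
#define MSR_EE UINT32_C(0x8000)  /* external interrupts enabled */
#define MSR_PR UINT32_C(0x4000)  /* problem state, i.e. user mode */
#define MSR_IR UINT32_C(0x0020)  /* instruction address translation */
#define MSR_DR UINT32_C(0x0010)  /* data address translation */

/* Toy model of the few registers involved (hypothetical struct). */
struct cpu_model { uint32_t pc, msr, srr0, srr1; };

/* What the hardware does on an exception, heavily simplified. */
static void take_exception(struct cpu_model *cpu, uint32_t vector)
{
    assert(vector < 0x3000);      /* vectors live in the first 12kB */
    cpu->srr0 = cpu->pc;          /* where to resume */
    cpu->srr1 = cpu->msr;         /* machine state to restore */
    cpu->msr &= ~(MSR_EE | MSR_PR | MSR_IR | MSR_DR);  /* MMU off, supervisor */
    cpu->pc = vector;
}

/* rfi: load PC and MSR from SRR0/SRR1 in a single step. */
static void rfi(struct cpu_model *cpu)
{
    cpu->pc = cpu->srr0;
    cpu->msr = cpu->srr1;
}

/* First rfi: jump to the kernel handler and turn translation on at once.
 * The caller must have saved the original SRR0/SRR1 before this point. */
static void enter_kernel_handler(struct cpu_model *cpu, uint32_t handler)
{
    cpu->srr0 = handler;
    cpu->srr1 = MSR_IR | MSR_DR;  /* translation on, still supervisor */
    rfi(cpu);
}
```

The second rfi is then just `rfi()` again after the saved user SRR0/SRR1 values have been reloaded from the kernel stack.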

xeyes wrote: Why do you ever need to clean it up? Is it because of using some fast syscall instructions that don't switch stacks during a syscall? Otherwise, if there's always a stack switch, the user-level code won't be able to look at it, so there's no need to clean up, right?

I guess that was easy to misunderstand. What I meant is that there is no need to always return exactly the way you came in. The initial task needs to construct its own IRET frame, and it can do so at the bottom of its stack rather than the top. This might mean that a few stack frames are left active when the IRET happens, but it doesn't matter, because the stack will start over at the very top next time the kernel is called. So that's what I meant by "clean up".
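As a rough illustration of "construct its own IRET frame", here is a sketch for x86-64 long mode, where IRETQ pops five 8-byte values. The selector values `USER_CS` and `USER_SS` are assumptions that depend entirely on how your GDT is laid out, and `build_iret_frame` is a name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Layout IRETQ expects in 64-bit mode, lowest address first. */
struct iret_frame {
    uint64_t rip;     /* user entry point */
    uint64_t cs;      /* user code selector, RPL 3 */
    uint64_t rflags;  /* flags to load */
    uint64_t rsp;     /* user stack pointer */
    uint64_t ss;      /* user stack selector, RPL 3 */
};
static_assert(sizeof(struct iret_frame) == 40, "five 8-byte slots");

/* Hypothetical selectors: the values depend on the GDT in use. */
#define USER_CS UINT64_C(0x23)
#define USER_SS UINT64_C(0x1B)

#define RFLAGS_FIXED UINT64_C(0x002)  /* bit 1 always reads as 1 */
#define RFLAGS_IF    UINT64_C(0x200)  /* interrupts enabled in user mode */

/* Fill in a frame wherever the caller wants it: top of the kernel stack,
 * somewhere lower down, or a scratch area outside any stack. */
static void build_iret_frame(struct iret_frame *f,
                             uint64_t entry, uint64_t user_sp)
{
    f->rip = entry;
    f->cs = USER_CS;
    f->rflags = RFLAGS_FIXED | RFLAGS_IF;
    f->rsp = user_sp;
    f->ss = USER_SS;
}
```

The launch code would then point RSP at the frame and execute IRETQ. Nothing below or above the frame needs unwinding, because the next entry from user mode loads the kernel stack pointer from the TSS again.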

xeyes · Post by **xeyes** » Wed Aug 10, 2022 8:39 pm

nullplan wrote: The handlers also have fixed addresses (e.g. Data Access Exception is at 0x300, External Interrupt is at 0x500, etc.), so in practice, most operating systems simply copy themselves to the start of the address space. The handlers are all in the first 12kB, and why make it more complicated than it has to be? When the kernel takes control of the system, all of RAM is free, so it might as well move itself to address 0. Linux has a linear mapping of physical address 0 to the 3GB line, so translating between physical and virtual is pretty simple in that range.

That certainly works. Flexibility in placing the handlers isn't that essential, especially if the MMU will be turned off on exception anyway.

nullplan wrote: As to POWER10, I kind of doubt it. I couldn't find an architecture manual, but I looked at one for PPC64 not too long ago and found that it too will turn off the DR and IR bits in the MSR. I'm guessing they want to keep compatibility, so not all the OSes have to be redeveloped.

Seems like a big performance hit to turn the MMU off on every exception. At least on x86, turning paging off invalidates all TLB entries and paging-structure caches. But maybe PPC doesn't invalidate them the same way.

nullplan wrote:
xeyes wrote: Why do you ever need to clean it up? Is it because of using some fast syscall instructions that don't switch stacks during a syscall? Otherwise, if there's always a stack switch, the user-level code won't be able to look at it, so there's no need to clean up, right?
I guess that was easy to misunderstand. What I meant is that there is no need to always return exactly the way you came in. The initial task needs to construct its own IRET frame, and it can do so at the bottom of its stack rather than the top. This might mean that a few stack frames are left active when the IRET happens, but it doesn't matter, because the stack will start over at the very top next time the kernel is called. So that's what I meant by "clean up".

Got you. Yup, it doesn't matter where the frame is. It can even be in a dedicated 'launch area' that isn't part of any stack.

nullplan · Post by **nullplan** » Thu Aug 11, 2022 12:16 am

xeyes wrote: Seems like a big performance hit to turn the MMU off on every exception. At least on x86, turning paging off invalidates all TLB entries and paging-structure caches. But maybe PPC doesn't invalidate them the same way.

Well, it obviously doesn't (since it's designed to turn off the MMU all the time). The details don't matter here, but suffice it to say that PPC has had the idea of a PID-tagged address space since long before the Pentium III. No, the real performance hit is the weird processing that the first-level handlers have to do to find the kernel stack and save the registers there. I saw a talk once by someone who massively improved system call performance by simplifying that part of Linux.

OSDev.org
