OSDev.org

Posted: **Thu Jan 03, 2019 3:27 pm**

This is the summarized pseudocode related to the assembler code I use in my x86_64 SMP microkernel to handle interrupts:

Code: Select all

push scratch registers to the stack
make space for thread save and load pointers in the stack
call c_code // handles interrupts in C (passing %rsp as the first argument) // sets thread_save_pointer and thread_load_pointer if needed
if thread_save_pointer // if there is need to save state to thread_save_pointer
  copy saved stack header from the stack to thread_save_pointer
  copy saved scratch registers from the stack to thread_save_pointer
  copy current preserved registers to thread_save_pointer (the registers should have the same value as before the call, because of the calling convention)
if thread_load_pointer // if there is need to load state from thread_load_pointer
  discard old values from the stack
  push stack header from thread_load_pointer to the stack
  restore scratch registers from thread_load_pointer
  restore preserved registers from thread_load_pointer
  iretq
// it reaches here if there is no need to load
discard load and save pointers from the stack
pop scratch registers from the stack
discard interrupt number and error code from the stack
iretq

It is 'optimized' in the sense that it only pushes scratch registers to the stack, as the 'preserved' ones get preserved. Then it calls C/C++ code, and at the end saves/loads them if we have to switch thread (determined in C code). It preserves the control flow as if it were a function call.

But then I learned, in this thread: viewtopic.php?f=15&t=33393 that a proper way of handling, in kernel mode, the reads and writes to and from userspace, is to do a basic check on the pointer and then to perform the reads or writes and let the processor issue exceptions (page faults) if the memory is not readable/writable. As opossed to doing an exhaustive checking, which would mean to parse the page tables and to do some expensive locking.

I realized that to support exceptions (page faults) in kernel mode while accessing user memory, there is need to save or push to the stack also the preserved registers, at least in syscalls, because they will be needed in case the SIGSEGV signal gets catched (the thread should get the same state as before). This seems so because, after the syscall places its data on the kernel stack, and the page fault exception (nested) places its data too, we cannot return from the exception (it would repeat itself) to restore the preserved registers. Then if we do not have saved the preserved registers, they could have been modified (and saved in the stack, but, at unknown places), so there would be no easy way of getting them. If the process was to be terminated, that is not an issue, but if the process catches the signal, it is.

This would mean that I will need to push all registers, at least in syscalls, and that I no longer can preserve the control flow as before: in some cases I have to discard two or more kernel stack frames at once, instead of let it shrink completely, in reverse order as it grew.

So I am writing this to ask whether I reasoned correctly or, on the contrary, there is a better way of doing things.

Posted: **Thu Jan 03, 2019 4:26 pm**

Your principal problem is that you are mixing two things in that code that don't belong together. Interrupt code should be about saving and restoring processor state. Adding state saving and restoration is just asking for trouble.

Here's how I do it. The actual interrupt entry point will save all volatile registers, call the appropriate C handler function with the stack pointer as argument, then restore the registers and perform an iret. The handler function can change the stack image of those registers if necessary. If a page fault or a GPF happens in kernel mode, first I try to correct the issue (there are spurious page faults, or maybe the page table needs updating). If that wasn't it, I walk through an exception table I create at link time. If the exception happened at a place where I expected it, the return address in the iret frame is overwritten with the handler address. If that also fails, that's a panic.

This way, if an exception happens in kernel mode, all registers are restored to their previous value, but control is transfered to the exception handler. For userspace transfers, this will be a small block that returns failure. The failure bubbles up to the system call and the user sees an EFAULT return.

Signal handling is done by saving the state (as far as it exists) on the user stack and modifying the iret frame (or lret frame in my case).

Now, task switching is at the other end. I have a function called taskswitch(), which takes one argument, namely a pointer to the next task. It is written in C, but is arch dependent. On AMD64, it will call a function called taskswitch_asm() which takes two arguments: A pointer to the stack pointer in the current task structure, and a pointer to the stack pointer in the next task structure. That one is written in assembly and it does this:

Code: Select all

taskswitch_asm:
  push all preserved registers (including flags)
  movq %rsp, (%rdi)
  movq (%rsi), %rsp
  pop all preserved registers
  ret

The C based taskswitch() function does the fluff around that: Updating the current thread pointer, changing address space, testing for signals, etc.

It is rarely necessary to suspend a task from interrupt context, but it does happen for pure compute threads. That is the only time I run the scheduler in interrupt context (after the EOI, of course). Shouldn't be a problem, since flags are saved and restored. Otherwise only syscalls switch tasks.

Posted: **Fri Jan 04, 2019 7:35 am**

nullplan wrote: Here's how I do it. The actual interrupt entry point will save all volatile registers, call the appropriate C handler function with the stack pointer as argument, then restore the registers and perform an iret. The handler function can change the stack image of those registers if necessary. If a page fault or a GPF happens in kernel mode, first I try to correct the issue (there are spurious page faults, or maybe the page table needs updating). If that wasn't it, I walk through an exception table I create at link time. If the exception happened at a place where I expected it, the return address in the iret frame is overwritten with the handler address. If that also fails, that's a panic.

This way, if an exception happens in kernel mode, all registers are restored to their previous value, but control is transfered to the exception handler. For userspace transfers, this will be a small block that returns failure. The failure bubbles up to the system call and the user sees an EFAULT return.

Thanks for your reply.

I understand that in a syscall there is no need to save scratch registers, as they are caller save. Only they might need to be cleared before returning.

If userspace passes an invalid pointer in a syscall, is the usual behaviour to return EFAULT instead of raising SIGSEGV?

I do not fully understand whether your example would work in all cases. I came to think that in order to be safe, there is need to save all registers if we do not return from each pushed frame. Let me write an example:
- userspace calls write() with a pointer to unreadable or unmapped memory
- kernel pushes the scratch registers and runs the handler
- inside the handler, there is need for many registers and a preserved one is needed, so it is pushed to the stack and it is used/modified (unlikely but might happen)
- the handler then calls a function to read the data passed in write()
- the function produces a page fault
- the page fault handler does not have a way of restoring that one preserved register, because if it returns to the same address, it will produce a page fault again, and if it does not return to the same address, the register will not be restored because the pop instruction will not be executed and the original value is in an unknown place in the stack.

Please correct me if I am wrong.

Posted: **Fri Jan 04, 2019 4:59 pm**

Yes, EFAULT is the usual thing. SIGSEGV is for the case when userspace accesses an invalid address.

That last step is the stumbling block. The routine that accesses the userspace pointer is special, in that it has an exception table entry. Looks like this:

Code: Select all

copy_from_user:
  # input: rdi - kernel dst pointer
  # rsi - user src pointer
  # rdx - length
  # output: rax - 0 for success, 1 for fault
  movq %rdx, %rcx
  shrq $3, %rcx
  andl $7, %edx
  testq %rcx, %rcx
  jz 4f
1: rep movsq
4: movl %edx, %ecx
2: rep movsb
  xorl %eax, %eax
  ret
3: # thankfully, no stack to clean anywhere
  xorl %eax, eax
  incl %eax
  ret

.section "extable", "aw"
.long 1b, 3b
.long 2b, 3b

That last part is the exception table. This way only works because my kernel definitely runs in the last two GB of address space (kernel code model), therefore the code addresses of the link time functions must fit into 32 bits (with sign extension).

The page fault handler then saves the volatile registers, calls the handler function:

Code: Select all

page_fault:
  push volatile registers
  movq %rsp, %rdi
  call do_page_fault
  pop volatile registers
  addq $8, %rsp # remove error code
  iretq

The page fault handler function gets as argument a pointer to a structure containing all the registers, the error code, and the iret frame. It checks for diverse spurious errors and returns early if those are true. Otherwise for the kernel, it walks through the exception table:

Code: Select all

  extern int __begin_extable[], __end_extable[];
  for (int *p = __begin_extable; p < __end_extable; p += 2)
    if (regs->rip == (intptr_t)*p) {
      regs->rip = (intptr_t)p[1];
      return;
    }

So this way, the page fault handler restores all registers back to the values they had when the exception hit. But since rip is redirected, the iret instruction does not return to the faulting instruction, but to the exception handler (which sets rax to 1 and returns from the function). Then the cleanup code from that whole stuff resets all of its registers to their former values, until the return from syscall happens, at which point only rax is changed (set to -EFAULT).

By the way, the syscall should not change any register besides rax, because of info leak potential.

If copy_from_user() becomes more complicated later, the exception handler would have to restore any changed callee saved register. Think of the exception table as a conditional jump and the rest should follow from that.

Posted: **Sat Jan 05, 2019 6:59 am**

Thank you very much! Now I know how to do it, without pushing preserved registers. The key was the exception tables, which I did not know about.

I guess that the same procedure would work also for copying _to_ user.

If it were a read() instead of a write(), I guess that in case of EFAULT, it is allowed to leave the user memory in an undefined state (half copied or something).

nullplan wrote: By the way, the syscall should not change any register besides rax, because of info leak potential.

I prefer to save them, but maybe one could also do not save them and set them to zero.

Edit: I thought that, if there are many entries in the exception table, it might be sorted at boot, and use a binary search instead of a linear search, in order to be more efficient

Posted: **Sat Jan 05, 2019 1:31 pm**

rod wrote:Thank you very much! Now I know how to do it, without pushing preserved registers. The key was the exception tables, which I did not know about.

Before the exception tables, Linux used to check every address against its own data structures. But it was a lot of work and something the CPU does, anyway, so they changed it. And I copied the idea.

rod wrote:I guess that the same procedure would work also for copying _to_ user.

That particular one, yes, but this was just a lazy first attempt. For instance, the pointers are all misaligned (or at least, alignment is unchecked), which might be negligible, might be horrible, depending on the particular processor (benchmarks of misaligned memory access, done over a 10 year interval, return cyclical results

)

rod wrote:If it were a read() instead of a write(), I guess that in case of EFAULT, it is allowed to leave the user memory in an undefined state (half copied or something).

Interesting point, actually. read() and write() are only supposed to fail if the transfer did not work at all and not a single byte was transmitted. Otherwise they are supposed to return short. But on the other hand, we could argue that the transfer to/from userspace is treated atomically in this instance. And yes, they can trash the user memory. As long as no other memory is trashed, anyway.

rod wrote:I prefer to save them, but maybe one could also do not save them and set them to zero.

Save them, clear them, it's similar effort. You probably already have routines/macros in place to save registers, since that is needed for interrupts to work, so just using the same macros for syscalls is less effort, and less code. And I tell you, the code you didn't write is guaranteed to be bug-free.

rod wrote: Edit: I thought that, if there are many entries in the exception table, it might be sorted at boot, and use a binary search instead of a linear search, in order to be more efficient

Actually that's the point of making the table writable. But a linear search was easier to write in that post. Though, I'm thinking of redoing the startup. I used to dislike self-modifying code, but all the function pointers I'm using are becoming cumbersome. I think I might go for the "alternatives" mechanism after all. In which case I will need to write the code section during startup. So placing the exception table in the code segment, then sorting it after applying all the binary patches, and then applying kernel-mode write-protection might be the better option (the more "write-once" data in that segment, the better for overall safety).

OSDev.org

Do push the preserved registers in the interrupt handler?

Do push the preserved registers in the interrupt handler?

Re: Do push the preserved registers in the interrupt handler

Re: Do push the preserved registers in the interrupt handler

Re: Do push the preserved registers in the interrupt handler

Re: Do push the preserved registers in the interrupt handler

Re: Do push the preserved registers in the interrupt handler