yr wrote:If any changes to the context are applied on resuming, then we have a potential security and stability risk
No, we don't. The mcontext contains the context of the userspace thread servicing the signal. If a system call was interrupted, the mcontext will contain registers attesting to that fact: it will either have IP point to the SYSCALL instruction with all the registers set up for system call entry, or it will have IP point to the instruction following SYSCALL with the return register containing the result.
The important thing here, which I only learned after piecing it together for a few years, is that signals will not be delivered "at any time", as I had always thought. One simple implementation of signals is to deliver them only on resuming user space. You have your kernel entry code, and for both interrupts and syscalls, it will at some point enter a kernel function and then return. Before actually returning, however, if you are returning to user space, just test whether there are signals outstanding against the current thread. If so, call another kernel function to handle those.
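To make that concrete, here is a minimal sketch of such a check. All names in it (struct thread, deliver_signal, exit_to_user_check) are made up for illustration and are not taken from any real kernel, and deliver_signal is only a stand-in for the code that would build the signal frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread state; the names are illustrative only. */
struct thread {
    uint64_t pending;   /* bit n set: signal n is outstanding */
    uint64_t blocked;   /* bit n set: signal n is currently masked */
    int delivered;      /* counts deliveries, for demonstration only */
};

/* Stand-in for the code that would set up the handler invocation. */
static void deliver_signal(struct thread *t, int sig)
{
    assert(sig >= 0 && sig < 64);
    t->pending &= ~(UINT64_C(1) << sig);
    t->delivered++;
}

/* Called at the end of every interrupt and syscall path, just before
 * the kernel entry code returns to whatever was interrupted. */
static void exit_to_user_check(struct thread *t, bool returning_to_user)
{
    if (!returning_to_user)
        return;         /* we interrupted kernel code: deliver nothing */

    uint64_t ready = t->pending & ~t->blocked;
    for (int sig = 0; sig < 64 && ready; sig++) {
        if (ready & (UINT64_C(1) << sig)) {
            deliver_signal(t, sig);
            ready &= ~(UINT64_C(1) << sig);
        }
    }
}
```

The point of the design is that the handler only ever sees a clean user space register image, never a half-finished kernel state.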
The ability to change the mcontext is pretty important. musl libc, for instance, uses it in its implementation of thread cancellation to detect whether the cancel signal was received before or after a system call. If it was before or during the system call, the IP register in the mcontext is set to somewhere else, so that the cancellation is acted on immediately. But if the IP already points even to the first instruction following the syscall, the cancellation is not acted upon.
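The decision boils down to a range check on the saved IP. The following is a simplified sketch of that idea, not musl's actual code: struct cancel_region and redirect_if_cancellable are hypothetical names, and a real handler would read and write the IP through the ucontext_t passed to the signal handler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of the cancellable syscall stub.
 * The range is [begin, end), where 'end' is the address of the
 * instruction immediately following the syscall instruction. */
struct cancel_region {
    uintptr_t begin;    /* first instruction of the stub */
    uintptr_t end;      /* first instruction after the syscall */
    uintptr_t cancel;   /* code that acts on the cancellation */
};

/* Returns the IP the signal handler should leave in the mcontext. */
static uintptr_t redirect_if_cancellable(const struct cancel_region *cr,
                                         uintptr_t pc)
{
    assert(cr->begin < cr->end);
    if (pc >= cr->begin && pc < cr->end)
        return cr->cancel;  /* before or during the syscall: cancel now */
    return pc;              /* syscall already completed: leave it alone */
}
```

Because an interrupted syscall leaves the IP on the syscall instruction, the "during" case falls inside the range automatically.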
yr wrote:So how can we restart the system call within the kernel? Seems like we have to return to user space and have the user space system call wrapper handle the restart.
Yes. That is one reason why, for syscalls with a timeout, Linux likes to write the remaining time back into the arguments. The idea is that the registers show the userspace state anyway, so if a system call has to be restarted, all that is needed is to restore the return register to its original value and back off the IP register a few bytes to the syscall instruction again. On x86:
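regs->ax = regs->orig_ax; /* put back the value ax held on syscall entry */
regs->ip -= 2;            /* step back onto the two-byte syscall instruction */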
That works because all system call instructions available on x86 (int 0x80, sysenter, syscall) are exactly two bytes in length. It is also the reason I gave up on using call gates as a possibility: The only far call instruction available in 64-bit mode is the indirect far call, which means there is a memory reference in the instruction, so the instruction can be between 2 and 15 bytes long. And x86 machine code is awkward enough to parse forwards; parsing it backwards is, in general, impossible.
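For context, this is roughly how the rewind might sit in the signal delivery path. It is only a sketch under assumptions: struct regs is cut down to three fields, and ERR_RESTART and maybe_restart_syscall are hypothetical names for an internal "please restart" result and the helper that reacts to it.

```c
#include <assert.h>
#include <stdint.h>

#define SYSCALL_INSN_LEN 2   /* every x86 syscall instruction is two bytes */
#define ERR_RESTART      512 /* hypothetical internal "restart me" code */

/* Hypothetical, heavily reduced saved user register image. */
struct regs {
    int64_t  ax;        /* syscall number on entry, result on exit */
    int64_t  orig_ax;   /* copy of ax taken at syscall entry */
    uint64_t ip;        /* points just past the syscall instruction */
};

/* Returns 1 if the image was rewound so that user space will execute
 * the syscall instruction again, 0 if the result is left as it is. */
static int maybe_restart_syscall(struct regs *r)
{
    if (r->ax != -ERR_RESTART)
        return 0;

    assert(r->ip >= SYSCALL_INSN_LEN);
    r->ax = r->orig_ax;             /* restore the syscall number */
    r->ip -= SYSCALL_INSN_LEN;      /* back onto the syscall instruction */
    return 1;
}
```

Since the other argument registers were never clobbered, nothing else has to be restored for the second attempt.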
Anyway, that register image will be put on the user stack, and then the image on the kernel stack will be manipulated to safely invoke the signal handler. On return from the signal handler, a special system call called sigreturn will be run, and that will restore the user space state from the register image on the user stack. If userspace did something bad to that image, it can only manipulate things to get access to what it already has. Only the segments need explicit checking. If exceptions are generated as a result, at some point the process is just terminated.
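As an illustration of the segment check, here is a minimal sketch that only looks at the privilege bits of the selectors. struct frame, selector_is_user and sigreturn_restore are hypothetical names, the frame is reduced to four fields, and a real implementation would have to copy the image from user memory first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RPL_MASK 3u  /* low two bits of an x86 selector: requested privilege level */
#define RPL_USER 3u  /* ring 3 */

/* Hypothetical subset of a saved register image. */
struct frame {
    uint64_t ip, sp;
    uint16_t cs, ss;
};

static bool selector_is_user(uint16_t sel)
{
    return (sel & RPL_MASK) == RPL_USER;
}

/* Copies the user-supplied image into the kernel's return frame.
 * Returns false if a selector is not a ring 3 selector; the caller
 * would then terminate the process instead of returning to it. */
static bool sigreturn_restore(struct frame *kernel, const struct frame *user)
{
    assert(kernel && user);
    if (!selector_is_user(user->cs) || !selector_is_user(user->ss))
        return false;   /* never return to user code with kernel segments */
    *kernel = *user;
    return true;
}
```

The IP and stack pointer are copied without any check, on the reasoning that a bad value there only crashes the process itself.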