Questions about signals implementation

yr · Post by yr » Mon Dec 14, 2020 10:08 pm

I'm in the process of adding support for signals in my kernel, and I have a couple of questions about the design implications of the POSIX API.

According to the spec, the extended signature for a signal handler is:

Code:

void handler(int signo, siginfo_t* info, void* context);

where context is actually a pointer to type ucontext_t, which has at least the following members:

Code:

typedef struct ucontext_t
{
    struct ucontext_t* uc_link;
    sigset_t uc_sigmask;
    stack_t uc_stack;
    mcontext_t uc_mcontext;
} ucontext_t;

The uc_mcontext member is a machine-specific representation of the interrupted context, and the signal handler can examine and modify its contents (at the cost of portability). If any changes to the context are applied on resuming, then we have a potential security and stability risk: if the interrupted context is in the kernel (e.g., in a system call), the signal handler can cause arbitrary kernel code to be executed. So either we treat the uc_mcontext member as read-only (i.e., just pass in a copy and ignore any changes), or we ensure that the interrupted context is always in user space and validate that this is still the case before resuming. Treating the context as read-only prevents the execution of arbitrary kernel code, but it still exposes addresses and register contents from the kernel, which seems undesirable. So we have to pass in the interrupted context from user space, which pretty much requires that signals are delivered at the point where the kernel would return to user space (there are other reasons for this as well). However, this seems to complicate restarting of system calls.

Consider a scenario where a thread is waiting inside the kernel in some system call, and it gets interrupted because of a signal. In order to deliver the signal, we need to save the user context, which requires exiting the system call and returning up the kernel call stack until all the user space register values are restored. At this point, we can save the user context and invoke the signal handler. When the signal handler returns back to the kernel, if SA_RESTART was specified in sigaction() flags when the handler was specified, then we need to restart the interrupted system call. However, since we unwound the kernel stack in order to deliver the signal, there is no generic way to restart properly (each system call will require its own restart logic). So how can we restart the system call within the kernel? Seems like we have to return to user space and have the user space system call wrapper handle the restart. Am I overlooking something?
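For concreteness, the user-space side of this scenario might look like the sketch below: a three-argument handler installed through sigaction() with SA_SIGINFO and SA_RESTART. The names on_signal, install_handler and last_signo are made up for illustration; only the POSIX calls themselves are real.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Illustrative global: the last signal number seen by the handler. */
static volatile sig_atomic_t last_signo = 0;

/* Extended handler; `context` actually points to a ucontext_t. */
static void on_signal(int signo, siginfo_t *info, void *context)
{
    (void)info;
    (void)context;  /* cast to ucontext_t * to inspect uc_mcontext */
    last_signo = signo;
}

/* Install on_signal for `signo`. Returns 0 on success, -1 on error. */
static int install_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;            /* three-argument form */
    sa.sa_flags = SA_SIGINFO | SA_RESTART;  /* ask for syscall restart */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

With SA_RESTART set, a blocking call interrupted by this handler is expected to be restarted by the system rather than fail with EINTR, which is exactly the case the question is about.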

nullplan · Post by **nullplan** » Mon Dec 14, 2020 10:34 pm

yr wrote:If any changes to the context are applied on resuming, then we have a potential security and stability risk

No we don't. The mcontext contains the context of the userspace thread servicing the signal. If a system call was interrupted, the mcontext will contain registers attesting to that fact, so it will either have IP point to the SYSCALL instruction and all the registers set up for system call entry, or it will have IP point to the instruction following SYSCALL and the return register containing the result.

The important thing here, which I only learned after piecing it together for a few years, is that signals will not be delivered "at any time", as I always thought. One simple implementation of signals is to deliver them only on resuming user space. You have your kernel entry code, and both for interrupts and syscalls, it will at some point enter a kernel function and then return. Before actually doing that, however, just test if there are signals outstanding against the current thread if you are returning to user space. And if so, call another kernel function to handle those.
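A minimal sketch of that return-path check, under some assumptions: struct thread, struct trap_frame and all function names are hypothetical, signals are tracked as one bit per signal number, and x86 is assumed for the privilege-level test. Unblockable signals such as SIGKILL are ignored here for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread signal state; bit n-1 stands for signal n. */
struct thread {
    uint64_t pending;   /* posted but not yet delivered */
    uint64_t blocked;   /* masked by the thread */
};

/* Hypothetical saved register frame; only the field needed here. */
struct trap_frame {
    uint16_t cs;        /* code segment selector at trap time */
};

/* On x86 the low two bits of CS give the privilege level; 3 is user. */
static bool returning_to_user(const struct trap_frame *tf)
{
    return (tf->cs & 3) == 3;
}

/* Lowest-numbered pending, unblocked signal, or 0 if there is none. */
static int next_deliverable_signal(const struct thread *t)
{
    uint64_t ready = t->pending & ~t->blocked;

    for (int signo = 1; signo <= 64; signo++)
        if (ready & (UINT64_C(1) << (signo - 1)))
            return signo;
    return 0;
}

/* Meant to run at the end of every interrupt, exception and syscall path. */
static int exit_to_user_check(struct thread *t, const struct trap_frame *tf)
{
    if (!returning_to_user(tf))
        return 0;       /* we interrupted kernel code: deliver nothing */

    int signo = next_deliverable_signal(t);
    if (signo)
        t->pending &= ~(UINT64_C(1) << (signo - 1));
    return signo;       /* the caller would now build the handler frame */
}
```

Because the check only fires when the saved frame is a user-mode frame, the context handed to the handler can never describe kernel state.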

The ability to change the mcontext is pretty important. musl libc for instance uses that in its implementation of thread cancellation to detect whether the cancel signal was received before or after a system call. If it was before or during the system call, the IP register is set to some place else to act on the cancellation immediately. But if it is set even to the first instruction following the syscall, the cancellation is not acted upon.
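The decision the cancel handler makes can be sketched as a pure function on the saved program counter. This is not musl's actual code: struct cancel_region and cancel_fixup_pc are hypothetical names, and a real handler would read the PC from, and write it back to, the architecture-specific field of uc_mcontext.

```c
#include <stdint.h>

/* Hypothetical addresses describing the cancellable syscall stub. */
struct cancel_region {
    uintptr_t begin;    /* first instruction of the stub */
    uintptr_t end;      /* first instruction after the syscall instruction */
    uintptr_t cancel;   /* code that acts on the cancellation */
};

/*
 * Given the interrupted PC, return the PC to resume at.
 * Inside [begin, end): the syscall has not completed, so redirect.
 * Anywhere else (including exactly `end`): leave the PC alone.
 */
static uintptr_t cancel_fixup_pc(const struct cancel_region *r, uintptr_t pc)
{
    if (pc >= r->begin && pc < r->end)
        return r->cancel;
    return pc;
}
```

An interrupted syscall whose IP was backed up for restarting falls inside the region too, which is how the "during the system call" case ends up treated like the "before" case.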

yr wrote:So how can we restart the system call within the kernel? Seems like we have to return to user space and have the user space system call wrapper handle the restart.

Yes. That is one reason why for syscalls with timeout, Linux likes to write the remaining time back into the arguments. The idea is that the registers show the userspace state anyway, so if a system call has to be restarted, then all that is needed is to restore the return register to its original value and back off the IP register a few bytes to the syscall instruction again. In x86:

Code:

regs->ax = regs->orig_ax;
regs->ip -= 2;

That is because all system call instructions allowed in x86 are exactly two bytes in length. It is also the reason I gave up on using call gates as a possibility: The only far call instruction available in 64-bit mode is the indirect far call, and that means there is a memory reference in the instruction, and so the instruction can be between 2 and 15 bytes long. And x86 machine code is awkward enough to parse forwards, parsing it backwards is, in general, impossible.
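Putting the two lines above into context, a restart decision could look roughly like the following sketch. The struct layout, the function name and the two error constants are assumptions for illustration (the "restart" code is kernel-internal and never meant to reach user space); only the rewind itself comes from the snippet above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values for this sketch only. */
#define KERN_ERESTART     512   /* internal: "interrupted, may be restarted" */
#define KERN_EINTR        4     /* what user space sees if not restarted */
#define SYSCALL_INSN_LEN  2     /* int 0x80, sysenter, syscall: two bytes each */

/* Hypothetical saved user registers. */
struct user_regs {
    int64_t  ax;        /* syscall return value */
    int64_t  orig_ax;   /* syscall number as saved at entry */
    uint64_t ip;        /* user IP, just past the syscall instruction */
};

/* Intended to run before the user context is saved for the handler. */
static void fixup_interrupted_syscall(struct user_regs *regs, bool sa_restart)
{
    if (regs->ax != -KERN_ERESTART)
        return;                         /* completed or failed normally */

    if (sa_restart) {
        regs->ax = regs->orig_ax;       /* put the syscall number back */
        regs->ip -= SYSCALL_INSN_LEN;   /* point at the syscall instruction */
    } else {
        regs->ax = -KERN_EINTR;         /* caller gets EINTR */
    }
}
```

If the fixup runs before the register image is copied to the user stack, returning from the handler re-executes the syscall instruction with no extra logic in the kernel.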

Anyway, that register image will be put on user stack, then on kernel stack the image will be manipulated to safely invoke the signal handler. On return from the signal handler, a special system call called sigreturn will be run, and that will restore user space state with the register image on user stack. If the userspace did something bad here, it can only manipulate things to get access to what it already has. Only the segments need explicit checking. If exceptions are generated, at some point, the process is just terminated.
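A rough sketch of the checking sigreturn might do on the image it reads back from the user stack. The selector values, the struct and the function name are hypothetical. The flags sanitizing goes beyond the segment check mentioned above and is an assumption of this sketch: it stops the restored image from raising IOPL or disabling interrupts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-mode selectors; a real kernel uses its own GDT layout. */
#define USER_CS 0x23u
#define USER_SS 0x1bu

/* CF, PF, AF, ZF, SF, TF, DF, OF: flags user code may set freely. */
#define USER_FLAGS_MASK UINT64_C(0x0DD5)
#define FLAG_IF         UINT64_C(0x0200)   /* interrupts stay enabled */

/* Hypothetical register image as copied in from the user stack. */
struct user_context {
    uint64_t ip, sp, flags;
    uint16_t cs, ss;
};

/* Returns false if the image must be rejected (e.g. kill the process). */
static bool sanitize_sigreturn_context(struct user_context *uc)
{
    /* Never resume with anything but the user code and stack segments. */
    if (uc->cs != USER_CS || uc->ss != USER_SS)
        return false;

    /* Drop privileged bits such as IOPL and force IF on. */
    uc->flags = (uc->flags & USER_FLAGS_MASK) | FLAG_IF;
    return true;
}
```

IP and SP need no range check in this sketch: a bad value simply faults in user mode once the thread resumes, and the process is terminated as described above.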

rdos · Post by **rdos** » Tue Dec 15, 2020 4:49 am

nullplan wrote:
yr wrote:If any changes to the context are applied on resuming, then we have a potential security and stability risk
No we don't. The mcontext contains the context of the userspace thread servicing the signal. If a system call was interrupted, the mcontext will contain registers attesting to that fact, so it will either have IP point to the SYSCALL instruction and all the registers set up for system call entry, or it will have IP point to the instruction following SYSCALL and the return register containing the result.

The important thing here, which I only learned after piecing it together for a few years, is that signals will not be delivered "at any time", as I always thought. One simple implementation of signals is to deliver them only on resuming user space. You have your kernel entry code, and both for interrupts and syscalls, it will at some point enter a kernel function and then return. Before actually doing that, however, just test if there are signals outstanding against the current thread if you are returning to user space. And if so, call another kernel function to handle those.

The ability to change the mcontext is pretty important. musl libc for instance uses that in its implementation of thread cancellation to detect whether the cancel signal was received before or after a system call. If it was before or during the system call, the IP register is set to some place else to act on the cancellation immediately. But if it is set even to the first instruction following the syscall, the cancellation is not acted upon.

yr wrote:So how can we restart the system call within the kernel? Seems like we have to return to user space and have the user space system call wrapper handle the restart.
Yes. That is one reason why for syscalls with timeout, Linux likes to write the remaining time back into the arguments. The idea is that the registers show the userspace state anyway, so if a system call has to be restarted, then all that is needed is to restore the return register to its original value and back off the IP register a few bytes to the syscall instruction again. In x86:

Code:

regs->ax = regs->orig_ax;
regs->ip -= 2;

That is because all system call instructions allowed in x86 are exactly two bytes in length. It is also the reason I gave up on using call gates as a possibility: The only far call instruction available in 64-bit mode is the indirect far call, and that means there is a memory reference in the instruction, and so the instruction can be between 2 and 15 bytes long. And x86 machine code is awkward enough to parse forwards, parsing it backwards is, in general, impossible.

Anyway, that register image will be put on user stack, then on kernel stack the image will be manipulated to safely invoke the signal handler. On return from the signal handler, a special system call called sigreturn will be run, and that will restore user space state with the register image on user stack. If the userspace did something bad here, it can only manipulate things to get access to what it already has. Only the segments need explicit checking. If exceptions are generated, at some point, the process is just terminated.

When I studied this I concluded that it should be possible to deliver signals to userspace from the kernel by modifying the return SS:ESP and CS:EIP so that a call to the actual signal handler is made as soon as execution resumes in user space. A potential problem is how to force a running user thread into kernel space, but the debug flag in the TSS might be able to do that. Still, I never completed this. I actually consider the whole signal concept outdated legacy.

thewrongchristian · Post by **thewrongchristian** » Tue Dec 15, 2020 5:24 am

rdos wrote:
nullplan wrote: Anyway, that register image will be put on user stack, then on kernel stack the image will be manipulated to safely invoke the signal handler. On return from the signal handler, a special system call called sigreturn will be run, and that will restore user space state with the register image on user stack. If the userspace did something bad here, it can only manipulate things to get access to what it already has. Only the segments need explicit checking. If exceptions are generated, at some point, the process is just terminated.
When I studied this I concluded that it should be possible to deliver signals to userspace from the kernel by modifying the return SS:ESP and CS:EIP so that a call to the actual signal handler is made as soon as execution resumes in user space. A potential problem is how to force a running user thread into kernel space, but the debug flag in the TSS might be able to do that. Still, I never completed this. I actually consider the whole signal concept outdated legacy.

Signal delivery is handled on the return from kernel to user code, so why would you have to force a user thread into kernel space?

Synchronous signals, like SIGSEGV and SIGFPE, happen in response to user thread actions, and so will enter the kernel as a result of those actions. For example, a NULL pointer reference will trigger a CPU trap, which will be transformed in the kernel into SIGSEGV, and if the user process is catching these, it is handled at the point of returning from the trap.

Asynchronous signals, like SIGTERM, will be handled the next time the process enters the kernel for whatever reason. That reason might be system call, or it might be a hardware interrupt. But once the kernel gets control, the signal can be handled, and the kernel will necessarily be in control anyway as it is the kernel that posts the signal.

While signals might be crude in terms of what they can communicate, at least in comparison to general purpose message IPC, there is nothing that ties signal delivery implementations to any specific strategy. Your kernel can have a general purpose message passing IPC, and signals can be built on top of such a mechanism. It might remove the awkward kernel provided trampoline code, and put the onus on the user side of the code to retrieve and deliver signals, but all that can be hidden behind a POSIX looking signal API.

rdos · Post by **rdos** » Tue Dec 15, 2020 6:19 am

thewrongchristian wrote: Signal delivery is handled on the return from kernel to user code, so why would you have to force a user thread into kernel space?

The obvious reason is that your user application gets stuck in an infinite loop in userspace, and so you want to terminate it with CTRL-C. The problem is that the keyboard event might not happen in the context of the thread that is looping, or even on the same core. That's why you need to have some way of forcing the thread into kernel space.

I can handle this case by setting the debug trap flag for the thread which results in it being stopped in debug mode. Although, it would also be possible to chain a termination request to the application on the usermode return stack and restart it.

thewrongchristian wrote: Synchronous signals, like SIGSEGV and SIGFPE, happen in response to user thread actions, and so will enter the kernel as a result of those actions. For example, a NULL pointer reference will trigger a CPU trap, which will be transformed in the kernel into SIGSEGV, and if the user process is catching these, it is handled at the point of returning from the trap.

Those are exceptions and not signals in my context, as well as in the Win32 context.

thewrongchristian wrote: Asynchronous signals, like SIGTERM, will be handled the next time the process enters the kernel for whatever reason. That reason might be system call, or it might be a hardware interrupt. But once the kernel gets control, the signal can be handled, and the kernel will necessarily be in control anyway as it is the kernel that posts the signal.

Then your IRQ handlers and/or scheduler need to be aware of signals, which is not optimal.

thewrongchristian · Post by **thewrongchristian** » Tue Dec 15, 2020 7:02 am

rdos wrote:
thewrongchristian wrote: Signal delivery is handled on the return from kernel to user code, so why would you have to force a user thread into kernel space?
The obvious reason is that your user application gets stuck in an infinite loop in userspace, and so you want to terminate it with CTRL-C. The problem is that the keyboard event might not happen in the context of the thread that is looping, or even on the same core. That's why you need to have some way of forcing the thread into kernel space.

I can handle this case by setting the debug trap flag for the thread which results in it being stopped in debug mode. Although, it would also be possible to chain a termination request to the application on the usermode return stack and restart it.

I'm not sure what the problem is. If you're trying to kill a rogue user thread, whether using signals or some other mechanism, the kernel has to get control. If you're killing the thread from another core entirely, then that would obviously involve an inter-processor interrupt, which would put the target processor into kernel mode.

Whatever the context the signal is posted from (which may, as you say, be under the context of another process being interrupted by a keyboard interrupt) the kernel will be in control at that point, so it can interrupt the user thread whether we're in that user thread context or not.
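The posting side described here could be sketched as follows. Everything in it is hypothetical: the struct, post_signal, and the record_ipi stub that stands in for a real inter-processor interrupt. A real kernel would also need atomic updates or a lock around the pending mask.

```c
#include <stdint.h>

#define CPU_NONE (-1)

/* Hypothetical thread state; bit n-1 of `pending` stands for signal n. */
struct thread {
    uint64_t pending;
    int cpu;            /* CPU currently running the thread, or CPU_NONE */
};

typedef void (*ipi_fn)(int cpu);

/* Stub standing in for the hardware IPI: records the target CPU. */
static int last_ipi_cpu = CPU_NONE;

static void record_ipi(int cpu)
{
    last_ipi_cpu = cpu;
}

/*
 * Mark `signo` pending on `t`. If the target is running on another core,
 * interrupt that core so it enters the kernel and passes the
 * return-to-user check, where the signal is actually delivered.
 */
static void post_signal(struct thread *t, int signo, int this_cpu,
                        ipi_fn send_ipi)
{
    t->pending |= UINT64_C(1) << (signo - 1);
    if (t->cpu != CPU_NONE && t->cpu != this_cpu)
        send_ipi(t->cpu);
}
```

If the target is not running anywhere, nothing more is needed: it will see the pending bit the next time it is scheduled and heads back to user space.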

What problem is your debug trap solution trying to solve?

nullplan · Post by **nullplan** » Tue Dec 15, 2020 9:25 am

rdos wrote: The obvious reason is that your user application gets stuck in an infinite loop in userspace, and so you want to terminate it with CTRL-C. The problem is that the keyboard event might not happen in the context of the thread that is looping, or even on the same core. That's why you need to have some way of forcing the thread into kernel space.

It's called a tick, and it typically happens once every millisecond.

rdos wrote:Those are exceptions and not signals in my context, as well as in the Win32 context.

The OP is very obviously trying to build a POSIX compatible system. With POSIX signals. Your definitions are unlikely to help them.

rdos wrote:Then your IRQ handlers and/or scheduler need to be aware of signals, which is not optimal.

Specify optimal. In my case, I have to write that code once for every architecture, but signal handling is architecture dependent, anyway. And in my case, all interrupt, exception, and syscall return paths (tho only the slow path in the latter case) converge on the exact same code, and it is that code that tests for having to re-enter the kernel (which it does for signals and for scheduling). So what is the problem?

rdos · Post by **rdos** » Tue Dec 15, 2020 10:07 am

thewrongchristian wrote: I'm not sure what the problem is. If you're trying to kill a rogue user thread, whether using signals or some other mechanism, the kernel has to get control. If you're killing the thread from another core entirely, then that would obviously involve an inter-processor interrupt, which would put the target processor into kernel mode.

Well, I don't support it at this point at all. There is a lot of messy stuff involved in terminating a user application, so I require things to shut down by themselves in an orderly fashion. I do have a cleanup mechanism for handles and other user resources in place, so the only real issue is thread termination in multithreaded applications.

thewrongchristian wrote: Whatever the context the signal is posted from (which may, as you say, be under the context of another process being interrupted by a keyboard interrupt) the kernel will be in control at that point, so it can interrupt the user thread whether we're in that user thread context or not.

What problem is your debug trap solution trying to solve?

It's a solution that doesn't require IRQs, the scheduler or a common syscall entrypoint to check signals. I don't think such checks should be part of time-critical code. The debug trap is a specific IRQ that is not time-critical and thus could check for signals without affecting the performance of time-critical stuff.

rdos · Post by **rdos** » Tue Dec 15, 2020 10:14 am

nullplan wrote: It's called a tick, and it typically happens once every millisecond.

I assume you mean the preemption timer? I don't find it a good idea to burden that with checks for signals. Besides, you might want to kill a user process that doesn't make the preemption timer expire, and so only placing the check in the preemption timer will not do. You will need to place it in the scheduler.

nullplan wrote: Specify optimal. In my case, I have to write that code once for every architecture, but signal handling is architecture dependent, anyway. And in my case, all interrupt, exception, and syscall return paths (tho only the slow path in the latter case) converge on the exact same code, and it is that code that tests for having to re-enter the kernel (which it does for signals and for scheduling). So what is the problem?

I find that a pretty bad idea. Using a single entry-point into kernel is bad enough, but also burdening it with checking signals is even worse.

Gigasoft · Post by **Gigasoft** » Tue Dec 15, 2020 11:12 am

Using the debug trap flag sounds like a nice idea, but you'd have to take note of whether the next instruction is a PUSHF/PUSHFD/PUSHFQ, and handle the case where the next instruction causes an exception. Instead, in my OS, it simply swaps out the return address. If the interrupted thread is on another core (not implemented yet), execution will be synchronized using an IPI. An explicit check is only needed when returning from an exception, since the old return address is about to be overwritten.

nullplan · Post by **nullplan** » Tue Dec 15, 2020 12:12 pm

rdos wrote:I find that a pretty bad idea. Using a single entry-point into kernel is bad enough, but also burdening it with checking signals is even worse.

What is "bad"? What is "worse"? Are you talking about efficiency? If so, have you measured? Or are you just saying nay for the hell of it? Because I am sick to death of destructive criticism like that.

To make it perfectly clear: I don't care about speed. At all. Maybe, once it has been shown that an actual speed problem exists, will I investigate and rectify it, but beyond that, I care about correctness a hell of a lot more than about speed. In fact, it is correctness, then ease of understanding, and speed as a distant third. Right now, my code might not be the fastest implementation out there (in fact I know it isn't), but it is correct. Signals do get delivered unless blocked. Threads do get scheduled out, no matter what they do. And I don't have to break my IRQ handling system to do it. And this all happens in a single place that every user space thread must get past if it has run at all. What would you suggest that does that? And is it as easy to understand as this?

Optimizing for speed is a fool's errand these days, since CPUs work in extremely non-intuitive ways. I just saw a lecture by Andrei Alexandrescu in which he sped up a sorting algorithm by doing more work. It is complicated, it makes little sense, and it's a moving target. Most of all: The results are fleeting. Next CPU generation, all your hard work is going to be out of date and you have to start over. But if you make a hash of your source code in pursuit of that coveted last cycle to optimize away, that is here to stay. You are going to have to live with that mess you've made forever. If I can get to 90% of optimal but with readable source and understandable code and design, that is good enough.

yr · Post by yr » Tue Dec 15, 2020 9:36 pm

nullplan wrote:The important thing here, which I only learned after piecing it together for a few years, is that signals will not be delivered "at any time", as I always thought. One simple implementation of signals is to deliver them only on resuming user space. You have your kernel entry code, and both for interrupts and syscalls, it will at some point enter a kernel function and then return. Before actually doing that, however, just test if there are signals outstanding against the current thread if you are returning to user space. And if so, call another kernel function to handle those.

Yes, I realized this recently as well. And then discovered subsequently that this is explicitly described in "The Design of the UNIX Operating System" in the chapter on signals.

nullplan wrote:The ability to change the mcontext is pretty important. musl libc for instance uses that in its implementation of thread cancellation to detect whether the cancel signal was received before or after a system call.

That's interesting to know. Presumably that ties musl libc to the Linux-specific definition of mcontext_t.

nullplan wrote:
yr wrote:So how can we restart the system call within the kernel? Seems like we have to return to user space and have the user space system call wrapper handle the restart.
Yes. That is one reason why for syscalls with timeout, Linux likes to write the remaining time back into the arguments. The idea is that the registers show the userspace state anyway, so if a system call has to be restarted, then all that is needed is to restore the return register to its original value and back off the IP register a few bytes to the syscall instruction again. In x86:

Code:

regs->ax = regs->orig_ax;
regs->ip -= 2;

Thanks. This is a really helpful pointer.

nullplan · Post by **nullplan** » Wed Dec 16, 2020 12:07 pm

yr wrote:That's interesting to know. Presumably that ties musl libc to the Linux-specific definition of mcontext_t.

Somewhat. Since mcontext_t also depends on architecture, each architecture gets to define what the PC/IP field in the mcontext_t structure is called. And there are some that also have to set a second register. The idea is certainly portable. That said, musl explicitly only supports Linux, and using it for other operating systems always means some porting work.

OSDev.org
