The cost of a system call

gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: The cost of a system call

Post by gerryg400 »

My syscall handler starts out like this. I think it's safe as long as an NMI handler uses its own stack.

Code:

syscall_instr_handler:

        /* Get the kernel GS selector */
        swapgs
        /* Save rsp to scratch space */
        movq    %rsp,   %gs:16
        /* move onto the kernel stack */
        movq    %gs:0,  %rsp
        ...
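For reference, the %gs:0 and %gs:16 offsets above imply a small per-CPU block that the kernel GS base points at. A minimal sketch in C (the struct and field names are made up; only the offsets matter):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU block that the kernel GS base points at.
 * The field offsets must match the ones used in the assembly stub. */
struct percpu {
    uint64_t kernel_rsp;   /* %gs:0  - top of this CPU's kernel stack */
    uint64_t reserved;     /* %gs:8  */
    uint64_t scratch_rsp;  /* %gs:16 - scratch slot for the saved user rsp */
};
```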
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The cost of a system call

Post by Brendan »

Hi,
Cognition wrote:To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.
Task gates (in 32-bit kernels) would actually work. A second (or third, fourth, etc) NMI would find the NMI task's busy bit set and cause a general protection fault ("interrupt attempted to switch to a busy task"), where the general protection fault handler would end up using the NMI task's (safe) stack. The same would happen for machine check. NMI can interrupt the machine check task, and machine check can interrupt the NMI task. The general protection fault handler can use the error code to determine if the general protection fault was caused by a second NMI or a second machine check.

IST doesn't work though. The first NMI switches to a specific stack, and before the NMI handler can execute its first instruction a second NMI could occur and trash the return RIP of the first NMI handler. You can't do anything to prevent this in the NMI handler (because you didn't get a chance to execute a single instruction). The best you can do is check if the "return RSP" points to the NMI handler's special stack. If it does you know you're screwed - you can't return from the second NMI (because the first NMI's return RIP and return RSP are trashed) and you can't kill the process (because you can't know how many locks, etc the kernel may have been holding at the time the first NMI occurred). Doing a reset/reboot (and potentially losing any data in disk caches) is the only "slightly safe" option left.
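The "check if the return RSP points to the NMI handler's special stack" test can be sketched in C; the stack bounds here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds of the NMI handler's IST stack (one 4 KiB page). */
#define NMI_STACK_BASE 0xffffffff80040000ULL
#define NMI_STACK_TOP  (NMI_STACK_BASE + 0x1000ULL)

/* True if the return RSP in the interrupt frame points into the NMI stack
 * itself - meaning a second NMI interrupted the first handler before it
 * could execute a single instruction, and the first frame is trashed. */
static bool nmi_was_nested(uint64_t return_rsp)
{
    return return_rsp >= NMI_STACK_BASE && return_rsp < NMI_STACK_TOP;
}
```

If this returns true, the reset/reboot described above is the only "slightly safe" option left.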

I've been trying to think of a better way; but everything I think of ends up being worse. The only way that guarantees 100.000% reliability in long mode (rather than just 99.999%) is disabling SYSCALL/SYSRET and using something else. For 32-bit kernels, because a lot of CPUs (all Intel and older AMD) don't support SYSRET anyway, I'd be tempted to use the same solution (leave SYSCALL/SYSRET disabled) rather than bothering with the extra complexity of TSS shuffling.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
abhoriel
Posts: 9
Joined: Sun Dec 23, 2012 12:13 pm

Re: The cost of a system call

Post by abhoriel »

Cognition wrote:Generally if you were to use SYSCALL in long mode you simply swapgs and load in a known good pointer.

Code:

user_enter_syscall64:
     swapgs
     mov rax, [gs:KSTACK_OFFSET]
     mov [gs:USTACK_OFFSET], rsp
     mov rsp, rax
     ...
     mov rsp, [gs:USTACK_OFFSET]
     swapgs
     sysret
This is making the assumption you could at least clobber RAX initially as you'll probably return some value in it later. You could also do similar things for protected mode.

Code:

user_enter_syscall32:
   mov ax, PROC_SPECIFIC_DATA_SEG
   mov gs, ax
   mov eax, [gs:KSTACK_OFFSET]
   mov [ss:eax+4], esp
   mov esp, eax
   ...
   pop gs
   pop esp
   sysret
Here the user-space GS value is assumed to be determinable from some other structure (thread info, for example), which should work out since it's usually used for thread-specific data anyways. To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.
Obviously, NMIs are a big problem, but I don't see how this code is safe from normal maskable IRQs. In the syscall code, if an IRQ fires before %rsp has been set to the kernel stack, then the CPU will push everything it needs to push onto the *user* %rsp - which might not even point to a valid stack. This is because the interrupt does not cause a privilege change, so no RSP is loaded from the TSS. One can get round this particular problem by setting the IA32_FMASK MSR so that IF is cleared in %rflags when syscall is used. However, the same problem occurs just before sysret, where interrupts must be enabled, and the user %rsp must be restored.

For example, using the code from above:

Code:

user_enter_syscall64:
     swapgs
     mov rax, [gs:KSTACK_OFFSET]
     mov [gs:USTACK_OFFSET], rsp
     mov rsp, rax
     ...
     mov rsp, [gs:USTACK_OFFSET]
     ; <-------- What if an IRQ occurs here?
     swapgs 
     sysret
I suppose I misunderstand something, otherwise it would have been mentioned here already? I apologise for posting on a relatively old thread.
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

abhoriel wrote:In the syscall code, if an IRQ fires before %rsp has been set to the kernel stack,...
Therefore you set the interrupt mask in the syscall MSR (IA32_FMASK). I made this same mistake before.
abhoriel wrote:However, the same problem occurs just before sysret,
No, usually syscall stubs are not supposed to be re-entrant.
Furthermore, if an IRQ triggers inside the syscall (after IF is set), there will be no stack switch, so the IRQ runs on top of the assigned kstack.
Last edited by bluemoon on Sun Dec 23, 2012 1:58 pm, edited 1 time in total.
abhoriel
Posts: 9
Joined: Sun Dec 23, 2012 12:13 pm

Re: The cost of a system call

Post by abhoriel »

bluemoon wrote:
abhoriel wrote:In the syscall code, if an IRQ fires before %rsp has been set to the kernel stack,...
Therefore you set the interrupt mask in the syscall MSR (IA32_FMASK). I made this same mistake before.
abhoriel wrote:However, the same problem occurs just before sysret,
No, a syscall stub may or may not be re-entrant by design, and there are methods to handle the re-entrancy problem (e.g. counters, or diverting to different stacks).
Thanks for your reply! I'm not talking about re-entrancy specifically, just interrupts in general: what if an IRQ occurs after %rsp has been restored to the user stack, but before sysret? The CPU will push everything onto the user stack. How can this be prevented?
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

abhoriel wrote:what if an IRQ occurs after %rsp has been restored to the user stack, but before sysret? The CPU will push everything onto the user stack. How can this be prevented?
Disable interrupts before restoring rsp; the flags will be restored from r11 by sysret.
abhoriel
Posts: 9
Joined: Sun Dec 23, 2012 12:13 pm

Re: The cost of a system call

Post by abhoriel »

bluemoon wrote:
abhoriel wrote:what if an IRQ occurs after %rsp has been restored to the user stack, but before sysret? The CPU will push everything onto the user stack. How can this be prevented?
Disable interrupts before restoring rsp; the flags will be restored from r11 by sysret.
oh yeah, I'd forgotten that the flags were restored; my code was saving them manually :/
thanks a lot for the help.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: The cost of a system call

Post by Owen »

Alternatively, you could have just made this the user's job, which I would be tempted to do anyway to reduce the amount of time spent with interrupts disabled.

With regards to NMI/machine check, what is the probability of one being recoverable anyway? Aren't they pretty much double-fault-like situations, i.e. the machine is going down no matter what you do?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The cost of a system call

Post by Brendan »

Hi,
Owen wrote:With regards to NMI/machine check, what is the probability of one being recoverable anyway? Aren't they pretty much double-fault-like situations, i.e. the machine is going down no matter what you do?
If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.

For machine check exceptions, the entire point is to make it possible for people to figure out what went wrong. For example, your machine check exception handler might display a "blue screen of death" or append information to an error log somewhere or maybe send a packet to some sort of system health monitoring server on the network before the machine goes down. If you want people with hardware faults to assume the hardware is fine and that your software is to blame, then you can leave the machine check exception disabled (so that hardware faults end up causing triple faults that look exactly like kernel bugs).

Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shut down the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: The cost of a system call

Post by Owen »

Brendan wrote:If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
If a watchdog timer has expired, then the correct course of action is to reboot the system ASAP.

Brendan wrote:For machine check exceptions, the entire point is to make it possible for people to figure out what went wrong. For example, your machine check exception handler might display a "blue screen of death" or append information to an error log somewhere or maybe send a packet to some sort of system health monitoring server on the network before the machine goes down. If you want people with hardware faults to assume the hardware is fine and that your software is to blame, then you can leave the machine check exception disabled (so that hardware faults end up causing triple faults that look exactly like kernel bugs).
So, again, they're non-recoverable. You might wish to perform actions inside of them, but normal functioning has certainly ceased; returning to the previous code is not the appropriate course of action.
Brendan wrote:Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shut down the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.
If a CPU has faulty cache, then the entire contents of every other CPU's cache and main memory are suspect. I'd say the correct action would be to log this somewhere, except that logging it somewhere further risks the integrity of whatever device you log it to. The best approach is probably to immediately halt all cores and display an error message via whatever the appropriate channels are. If one core has faulty cache, this implies major issues with the processor anyhow.

So, both #NMI and #MC almost certainly signify situations where the current node is going down, in which case returning to the previous state of execution is certainly the worst course of action.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The cost of a system call

Post by Brendan »

Hi,
Owen wrote:
Brendan wrote:If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
If a watchdog timer has expired, then the correct course of action is to reboot the system ASAP.
If the NMI is used by the kernel to check if it has locked up or not, then if nothing has locked up you want to return to whatever was interrupted; and you wouldn't want to reboot the system ASAP.
Owen wrote:
Brendan wrote:Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shut down the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.
If a CPU has faulty cache, then the entire contents of every other CPU's cache and main memory are suspect. I'd say the correct action would be to log this somewhere, except that logging it somewhere further risks the integrity of whatever device you log it to. The best approach is probably to immediately halt all cores and display an error message via whatever the appropriate channels are. If one core has faulty cache, this implies major issues with the processor anyhow.
For which CPU/s?

Caches in Xeons use ECC; so it's very likely that the contents of other CPUs' caches and main memory are perfectly fine, because no (uncorrected) cache fault was detected earlier. In this case would you shut down ASAP (and lose unsaved data and potentially corrupt file systems, etc) to avoid an insignificant risk?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: The cost of a system call

Post by Owen »

Brendan wrote:Hi,
Owen wrote:
Brendan wrote:If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
If a watchdog timer has expired, then the correct course of action is to reboot the system ASAP.
If the NMI is used by the kernel to check if it has locked up or not, then if nothing has locked up you want to return to whatever was interrupted; and you wouldn't want to reboot the system ASAP.
You're fundamentally misunderstanding the way watchdog timers work (in particular, they only trigger if the system has failed to clear them sufficiently recently)
Brendan wrote:
Owen wrote:
Brendan wrote:Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shut down the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.
If a CPU has faulty cache, then the entire contents of every other CPU's cache and main memory are suspect. I'd say the correct action would be to log this somewhere, except that logging it somewhere further risks the integrity of whatever device you log it to. The best approach is probably to immediately halt all cores and display an error message via whatever the appropriate channels are. If one core has faulty cache, this implies major issues with the processor anyhow.
For which CPU/s?

Caches in Xeons use ECC; so it's very likely that the contents of other CPUs' caches and main memory are perfectly fine, because no (uncorrected) cache fault was detected earlier. In this case would you shut down ASAP (and lose unsaved data and potentially corrupt file systems, etc) to avoid an insignificant risk?


Cheers,

Brendan
A single uncorrectable ECC fault is likely to have no implications beyond the termination of the process which was running at the time (assuming the fault occurred in user space). If it occurred in kernel space, then all bets are off with regards to the consistency of whatever the kernel was working on, and therefore continued correct behaviour of the kernel is unlikely.

An ECC exception that occurs while the machine check handler is executing is, by definition, an ECC error in kernel space.
rdos
Member
Posts: 3306
Joined: Wed Oct 01, 2008 1:55 pm

Re: The cost of a system call

Post by rdos »

Owen wrote: You're fundamentally misunderstanding the way watchdog timers work (in particular, they only trigger if the system has failed to clear them sufficiently recently)
Good point. Additionally, watchdogs typically don't use NMI, but often do hardware resets directly. A well-written OS has no need for hardware watchdogs at all.
Owen wrote: A single uncorrectable ECC fault is likely to have no implications beyond the termination of the process which was running at the time (assuming the fault occurred in user space). If it occurred in kernel space, then all bets are off with regards to the consistency of whatever the kernel was working on, and therefore continued correct behaviour of the kernel is unlikely.

An ECC exception that occurs while the machine check handler is executing is, by definition, an ECC error in kernel space.
At least I have a different view of that. Kernel faults are of two types: faults in the scheduler are fatal and unrecoverable, and would immediately hang or reboot the system. Faults in kernel code that doesn't hold scheduler locks typically only affect a single device, and thus are not fatal, but could somewhat lock up certain hardware devices. Such faults could also be application-related, and no more severe than the same faults in user space.
rdos
Member
Posts: 3306
Joined: Wed Oct 01, 2008 1:55 pm

Re: The cost of a system call

Post by rdos »

Finally I've got syscall to work. There are some problems (especially for a kernel that cannot tolerate RSP being above 4G).

In order to make it work, I need to use IA32_FMASK to clear the interrupt flag (and the direction flag, just for safety). In the syscall entry I must load the kernel RSP before enabling interrupts. That handles everything except single-stepping over syscall. I could also clear the trap flag via IA32_FMASK, but then I would need to reload it after loading the kernel RSP, which seems unnecessary. Another solution that works (which I use now) is to reserve an IST stack for single-step.

As long as NMI doesn't switch to compatibility mode, there should be no problem with NMI. Using an IST stack for NMI should work if NMI does switch to compatibility mode.

Loading the correct kernel stack is a similar problem as for sysenter (I actually just removed the sysenter code again because I feel it is redundant now). It is possible to have the stack pointed to by FS or GS, but that would make it vulnerable to user-level tampering. I suppose swapgs could also be used, but that doesn't feel compelling either. I think I will instead do one syscall entry stub per core (dynamically allocated), and let the first instruction load the linear address of the processor core block. This structure contains the current thread's linear address, which contains the linear address of the kernel stack. That should be just as fast as the other alternatives, and needs no shared resource, nor any involvement of the scheduler.
rdos
Member
Posts: 3306
Joined: Wed Oct 01, 2008 1:55 pm

Re: The cost of a system call

Post by rdos »

Here is the new syscall entry handler (which is copied to a new memory area per core):

Code:

syscall_entry:
    mov r9,core_block_linear64          ; the address is patched at core initialization time
    mov r9d,[r9].ps_syscall_esp         ; this core's current kernel stack pointer
    xchg rsp,r9                         ; switch to kernel stack; user rsp ends up in r9
    push r9                             ; save user stack pointer
    push rcx                            ; save entry-point (syscall put the return RIP in rcx)
    push r11                            ; user rflags (syscall put them in r11)
    popfq                               ; this fixes trap-flag and enables ints again
    mov rcx,r8                          ; reload original ECX value (caller passed it in r8)

; do something

    mov r8,rcx
    pop rcx                             ; return RIP back into rcx for sysret
    cli                                 ; no interrupts while back on the user stack
    pop rsp                             ; restore user stack pointer
    db 48h                              ; REX.W prefix: makes this a 64-bit sysretq
    sysret
IA32_FMASK is set to 700h to disable ints, TF and DF. There is still a need for an IST stack for single-step, because otherwise the single-step trap just before sysret would be taken on the user stack.
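The 700h constant decomposes into the architectural RFLAGS bit positions; a quick sanity check in C (the macro names are mine):

```c
#include <stdint.h>

#define RFLAGS_TF 0x100ULL  /* bit 8:  trap flag */
#define RFLAGS_IF 0x200ULL  /* bit 9:  interrupt enable flag */
#define RFLAGS_DF 0x400ULL  /* bit 10: direction flag */

/* Every bit set in IA32_FMASK is cleared in RFLAGS when syscall executes;
 * masking IF, TF and DF together gives the 700h value. */
static uint64_t make_fmask(void)
{
    return RFLAGS_TF | RFLAGS_IF | RFLAGS_DF;
}
```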

IA32_LSTAR is programmed to a different address per core. The current kernel stack of the (long mode) thread is patched by the scheduler (read from the thread block and saved to the processor block (ps_syscall_esp)).

No general register is destroyed (however, syscall uses RCX, which is pretty bad). The code calling syscall must load r8 with rcx (or load the contents into r8 when the API uses ECX), and save r9 and r11 (or inform GCC that r8, r9 and r11 are used, plus all the general registers whose upper halves might become destroyed).