Code: Select all
syscall_instr_handler:
    /* Switch to the kernel GS base */
    swapgs
    /* Save the user rsp to scratch space */
    movq %rsp, %gs:16
    /* Move onto the kernel stack */
    movq %gs:0, %rsp
    ...
Cognition wrote: To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.
Task gates (in 32-bit kernels) would actually work. A second (or third, fourth, etc.) NMI would find the NMI task's busy bit set and cause a general protection fault ("interrupt attempted to switch to a busy task"), where the general protection fault handler would end up using the NMI task's (safe) stack. The same would happen for machine check. NMI can interrupt the machine check task, and machine check can interrupt the NMI task. The general protection fault handler can use the error code to determine if the general protection fault was caused by a second NMI or a second machine check.
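To make the last point concrete, here is a minimal sketch in C of how a general protection fault handler might inspect the error code. It assumes the CPU reports the selector of the busy TSS in the error code, and that the NMI and machine check tasks live in the GDT at the selectors `NMI_TSS_SELECTOR` and `MCE_TSS_SELECTOR`; both selector values and all names are hypothetical. A matching selector is only a hint, so a real handler would still need to rule out other causes.

```c
#include <stdint.h>

/* Layout of a selector-style exception error code. */
#define ERR_EXT        0x0001u  /* fault raised while delivering an external event */
#define ERR_IDT        0x0002u  /* index refers to an IDT gate */
#define ERR_TI         0x0004u  /* index refers to the LDT */
#define ERR_INDEX_MASK 0xFFF8u  /* selector index, scaled like a selector */

/* Hypothetical GDT selectors for the two dedicated tasks. */
#define NMI_TSS_SELECTOR 0x28u
#define MCE_TSS_SELECTOR 0x30u

enum gp_cause { GP_OTHER, GP_NESTED_NMI, GP_NESTED_MCE };

/* Guess why the general protection fault happened from its error code. */
static enum gp_cause classify_gp(uint32_t error_code)
{
    /* Only GDT selectors can name one of our TSS descriptors. */
    if (error_code & (ERR_IDT | ERR_TI))
        return GP_OTHER;

    switch (error_code & ERR_INDEX_MASK) {
    case NMI_TSS_SELECTOR: return GP_NESTED_NMI;
    case MCE_TSS_SELECTOR: return GP_NESTED_MCE;
    default:               return GP_OTHER;
    }
}
```

The EXT bit is deliberately ignored, since a nested NMI or machine check is an external event and may arrive with that bit set.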
Cognition wrote: Generally if you were to use SYSCALL in long mode you simply swapgs and load in a known good pointer.
Code: Select all
user_enter_syscall64:
    swapgs
    mov rax, [gs:KSTACK_OFFSET]
    mov [gs:USTACK_OFFSET], rsp
    mov rsp, rax
    ...
    mov rsp, [gs:USTACK_OFFSET]
    swapgs
    sysret
This is making the assumption you could at least clobber RAX initially as you'll probably return some value in it later. You could also do similar things for protected mode.
Code: Select all
user_enter_syscall32:
    mov ax, PROC_SPECIFIC_DATA_SEG
    mov gs, ax
    mov eax, [gs:KSTACK_OFFSET]
    mov [ss:eax+4], esp
    mov esp, eax
    ...
    pop gs
    pop esp
    sysret
Here the user space GS value is assumed to be determinable from some other structure (thread info for example), which should work out since it's usually used for thread specific data anyways. To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.

Obviously, NMIs are a big problem, but I don't see how this code is safe from normal maskable IRQs. In the syscall code, if an IRQ fires before %rsp has been set to the kernel stack, then the CPU will push everything it needs to push onto the *user* %rsp, which might not even point to a valid stack. This is because the interrupt does not cause a privilege change, so no RSP is loaded from the TSS. One can get round this particular problem by setting the FMASK MSR, such that IF is cleared in %rflags when syscall is used. However, the same problem occurs just before sysret, where interrupts must be enabled, and the user %rsp must be restored.
Code: Select all
user_enter_syscall64:
    swapgs
    mov rax, [gs:KSTACK_OFFSET]
    mov [gs:USTACK_OFFSET], rsp
    mov rsp, rax
    ...
    mov rsp, [gs:USTACK_OFFSET]
    ; <-------- What if an IRQ occurs here?
    swapgs
    sysret
abhoriel wrote: In the syscall code, if an IRQ fires before %rsp has been set to the kernel stack,...
Therefore you set the interrupt mask in the syscall FMASK MSR. I made this same mistake before.
abhoriel wrote: However, the same problem occurs just before sysret,
No, usually syscall stubs are not supposed to be re-entrant.
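As a rough illustration of the FMASK approach, the sketch below models in plain C what happens to the flags around syscall and sysret. The MSR number and flag bit positions are architectural; the function and struct names are made up for this example, and the sysret model is simplified (real hardware also sanitizes some reserved bits). A real kernel would write the mask with its own wrmsr wrapper during CPU setup.

```c
#include <stdint.h>

#define MSR_FMASK  0xC0000084u   /* syscall flag mask MSR */
#define RFLAGS_TF  (1ull << 8)   /* trap flag */
#define RFLAGS_IF  (1ull << 9)   /* interrupt enable flag */
#define RFLAGS_DF  (1ull << 10)  /* direction flag */

/* Value to write to MSR_FMASK: every bit set here is cleared on syscall. */
static uint64_t syscall_fmask(void)
{
    return RFLAGS_IF | RFLAGS_TF | RFLAGS_DF;
}

/* Flags state right after the syscall instruction. */
struct syscall_flags {
    uint64_t r11;     /* caller's RFLAGS, saved by the CPU */
    uint64_t rflags;  /* RFLAGS the kernel entry stub runs with */
};

/* Model of syscall: save RFLAGS in r11, then clear the masked bits. */
static struct syscall_flags flags_on_syscall(uint64_t user_rflags, uint64_t fmask)
{
    struct syscall_flags f = { user_rflags, user_rflags & ~fmask };
    return f;
}

/* Simplified model of sysret: RFLAGS is reloaded from r11. */
static uint64_t flags_on_sysret(uint64_t r11)
{
    return r11;
}
```

With IF in the mask, the entry stub runs with interrupts off until it has switched stacks, and the caller's IF comes back from r11 on sysret without the kernel saving the flags by hand.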
abhoriel wrote: In the syscall code, if an IRQ fires before %rsp has been set to the kernel stack,...
bluemoon wrote: Therefore you set the interrupt mask in syscall MSR. I made this same mistake before.
Thanks for your reply! I'm not talking about re-entrancy specifically, just interrupts in general: what if an IRQ occurs after %rsp has been set to the user stack, but before sysret? The CPU will push everything onto the user stack. How can this be prevented?
abhoriel wrote: However, the same problem occurs just before sysret,
No, a syscall stub may or may not be re-entrant by design, and there are methods to handle the re-entrancy problem (e.g., counters and diverting to different stacks).
abhoriel wrote: what if an IRQ occurs after %rsp has been set to the user stack, but before sysret? the CPU will push everything onto the user stack. how can this be prevented?
Disable interrupts before restoring rsp; the flags will be restored from r11 by sysret.
abhoriel wrote: what if an IRQ occurs after %rsp has been set to the user stack, but before sysret? the CPU will push everything onto the user stack. how can this be prevented?
bluemoon wrote: Disable interrupt before restore rsp, the flags will be restored from r11 after sysret.
Oh yeah, I'd forgotten that the flags are restored; my code was saving them manually :/
Owen wrote: With regards to NMI/machine check, what is the probability of one being recoverable anyway? Aren't they pretty much double fault-like situations, i.e. the machine is going down no matter what you do?
If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
Brendan wrote: If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
If a watchdog timer has expired, then the correct course of action is to reboot the system ASAP.
Brendan wrote: might display a "blue screen of death" or append information to an error log somewhere or maybe send a packet to some sort of system health monitoring server on the network before the machine goes down. If you want people with hardware faults to assume the hardware is fine and that your software is to blame, then you can leave the machine check exception disabled (so that hardware faults end up causing triple faults that look exactly like kernel bugs).
So, again, they're non-recoverable. You might wish to perform actions inside of them, but normal functioning has certainly ceased; returning to the previous code is not the appropriate course of action.
Brendan wrote: Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shutdown the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.
If a CPU has faulty cache, then the entire contents of every other CPU's cache and main memory are suspect. I'd say the correct action would be to log this somewhere, except that logging it somewhere further risks the integrity of whatever device you log it to. The best approach is probably to immediately halt all cores and display an error message via whatever the appropriate channels are. If one core has faulty cache, this implies major issues with the processor anyhow.
Brendan wrote: If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
Owen wrote: If a watchdog timer has expired, then the correct course of action is to reboot the system ASAP.
If the NMI is used by the kernel to check if it has locked up or not, then if nothing has locked up you want to return to whatever was interrupted; and you wouldn't want to reboot the system ASAP.
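A minimal sketch of the kind of lockup check described here, written in C. It assumes a periodic NMI and a heartbeat counter that the kernel bumps whenever it makes progress; the struct, the function names, and the `WATCHDOG_MAX_MISSED` threshold are all hypothetical. In a real kernel the counter would be per-CPU and accessed atomically.

```c
#include <stdint.h>

#define WATCHDOG_MAX_MISSED 3u  /* hypothetical: stalled NMI periods tolerated */

struct watchdog_state {
    uint64_t heartbeat;  /* bumped by the kernel, e.g. from its timer tick */
    uint64_t last_seen;  /* heartbeat value at the previous NMI */
    unsigned missed;     /* consecutive NMIs with no progress */
};

enum watchdog_action { WATCHDOG_RESUME, WATCHDOG_LOCKUP };

/* Called from normal kernel code to show it is still making progress. */
static void watchdog_touch(struct watchdog_state *w)
{
    w->heartbeat++;
}

/* Called from the periodic NMI: resume unless the kernel looks stuck. */
static enum watchdog_action watchdog_nmi(struct watchdog_state *w)
{
    if (w->heartbeat != w->last_seen) {
        /* Progress since the last NMI: return to the interrupted code. */
        w->last_seen = w->heartbeat;
        w->missed = 0;
        return WATCHDOG_RESUME;
    }
    if (++w->missed >= WATCHDOG_MAX_MISSED)
        return WATCHDOG_LOCKUP;
    return WATCHDOG_RESUME;
}
```

In the common case the handler returns `WATCHDOG_RESUME`, which is why this use of NMI has to be able to return safely to whatever was interrupted.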
Brendan wrote: Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shutdown the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.
Owen wrote: If a CPU has faulty cache, then the entire contents of every other CPU's cache and main memory are suspect. I'd say the correct action would be to log this somewhere, except that logging it somewhere further risks the integrity of whatever device you log it to. The best approach is probably to immediately halt all cores and display an error message via whatever the appropriate channels are. If one core has faulty cache, this implies major issues with the processor anyhow.
For which CPU/s?
Brendan wrote: If NMI is being used for a watchdog timer or something, then it's extremely likely that NMI will be recoverable.
Owen wrote: If a watchdog timer has expired, then the correct course of action is to reboot the system ASAP.
Brendan wrote: Hi, If the NMI is used by the kernel to check if it has locked up or not, then if nothing has locked up you want to return to whatever was interrupted; and you wouldn't want to reboot the system ASAP.
You're fundamentally misunderstanding the way watchdog timers work (in particular, they only trigger if the system has failed to clear them sufficiently recently).
Brendan wrote: Also note that "virtually all" machine check exceptions can't be corrected, but this doesn't mean that an OS can't recover. For example, if a machine check exception tells you the CPU's cache is faulty; then you could disable the CPU's caches, and/or shutdown the affected CPU and continue running with other CPUs; and terminate the process that was running at the time; and inform the user of the problem; and keep everything else running.
Owen wrote: If a CPU has faulty cache, then the entire contents of every other CPU's cache and main memory are suspect. I'd say the correct action would be to log this somewhere, except that logging it somewhere further risks the integrity of whatever device you log it to. The best approach is probably to immediately halt all cores and display an error message via whatever the appropriate channels are. If one core has faulty cache, this implies major issues with the processor anyhow.
Brendan wrote: For which CPU/s? Caches in Xeons use ECC; so it's very likely that the contents of other CPUs and main memory are perfectly fine because an (uncorrected) cache fault wasn't detected earlier. In this case would you shut down ASAP (and lose unsaved data and potentially corrupt file systems, etc.) to avoid an insignificant risk? Cheers, Brendan
A single uncorrectable ECC fault is likely to have no implications beyond the termination of the process which was running at the time (assuming the fault occurred in user space). If it occurred in kernel space, then all bets are off with regards to the consistency of whatever the kernel was working on, and therefore continued correct behaviour of the kernel is unlikely.
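The user space versus kernel space distinction could be sketched roughly as follows in C. This only illustrates one of the positions argued in this thread, not an agreed policy. The status bit positions are those of the architectural machine check bank status registers; the function name, the action enum, and the assumption that the handler has the interrupted code's saved CS at hand are all made up for the example.

```c
#include <stdint.h>

/* Bits of a machine check bank status register (IA32_MCi_STATUS). */
#define MCI_STATUS_VAL (1ull << 63)  /* the bank holds a valid error */
#define MCI_STATUS_UC  (1ull << 61)  /* error was not corrected */
#define MCI_STATUS_PCC (1ull << 57)  /* processor context may be corrupt */

enum mce_action { MCE_IGNORE, MCE_KILL_PROCESS, MCE_HALT };

/* Decide what to do about one bank, given the CS of the interrupted code. */
static enum mce_action mce_policy(uint64_t bank_status, uint16_t saved_cs)
{
    /* Nothing logged, or the hardware already corrected it. */
    if (!(bank_status & MCI_STATUS_VAL) || !(bank_status & MCI_STATUS_UC))
        return MCE_IGNORE;

    /* CPU state itself is suspect: nothing can be trusted. */
    if (bank_status & MCI_STATUS_PCC)
        return MCE_HALT;

    /* The low two bits of CS give the privilege level that was running. */
    if ((saved_cs & 3u) == 3u)
        return MCE_KILL_PROCESS;

    /* Uncorrected error while in the kernel: all bets are off. */
    return MCE_HALT;
}
```

A kernel that prefers to keep running after kernel-mode errors would replace the final `MCE_HALT` with something more selective, such as taking the affected CPU offline.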
Owen wrote: You're fundamentally misunderstanding the way watchdog timers work (in particular, they only trigger if the system has failed to clear them sufficiently recently)
Good point. Additionally, watchdogs typically don't use NMI, but often do hardware resets directly. A well-written OS has no need for hardware watchdogs at all.
Owen wrote: A single uncorrectable ECC fault is likely to have no implications beyond the termination of the process which was running at the time (assuming the fault occurred in user space). If it occurred in kernel space, then all bets are off with regards to the consistency of whatever the kernel was working on, and therefore continued correct behaviour of the kernel is unlikely.
At least I have a different view of that. Kernel faults are of two types. Faults in the scheduler are fatal and unrecoverable and would immediately hang or reboot the system. Faults in kernel code that doesn't hold scheduler locks typically only affect a single device, and thus are not fatal, but could lock up certain hardware devices. Such faults could also be application-related, and no more severe than the same faults in user space.
An ECC exception while a machine check exception was executing is, by definition, an ECC error in kernel space.
Code: Select all
syscall_entry:
    mov r9,core_block_linear64      ; the address is patched at core initialization time
    mov r9d,[r9].ps_syscall_esp
    xchg rsp,r9
    push r9                         ; save stack
    push rcx                        ; save entry-point
    push r11
    popfq                           ; this fixes trap-flag and enables ints again
    mov rcx,r8                      ; reload original ECX value
    ; do something
    mov r8,rcx
    pop rcx
    cli
    pop rsp
    db 48h                          ; REX.W prefix: makes the following sysret the 64-bit form
    sysret