CPU Bug mitigations
Posted: Wed Nov 28, 2018 10:43 am
Hi all,
Laughing at the misfortune of others is always cathartic, but it doesn't improve the situation. So I recently looked at the CPU Bugs wiki page and noticed that for most of them, no mitigations are listed at all. So I thought maybe we should add them.
So to start, I'll list the mitigations I know:
1. ESP not cleared
When returning from 64-bit or 32-bit mode to 16-bit mode, bits 31:16 of ESP aren't cleared. Bits 63:32 probably aren't, either, but you can't see those from 16-bit mode. Linux's solution is to leak those bits with pride! That is to say, they create a special tiny stack per CPU and switch to it before returning to anything that might be a 16-bit segment. That tiny stack (it needs only 48 bytes in 64-bit mode) is mapped read-only, since the only thing they do with it is write the IRET frame to it (using a writable address) and then IRET. If an exception arrives while the kernel is on that stack, it will immediately cause a page fault, and therefore a double fault. #DF is on an IST stack. The #DF handler will recognize the situation and set things up so it looks like a general protection fault came in from userspace.
This way, the bits are still leaking, but they only identify the CPU the code is running on, not the actual kernel stack address. And since the stacks are so small and the location is randomized on bootup, you could only find out about your CPU if the machine has more than 85 CPUs.
I just noticed: That means 16-bit userspace can never use the top 16 bits of ESP. Which is bad, because usually userspace can do whatever it wants with the registers and the kernel needn't care. You could use BX for a stack pointer if you don't use the PUSH or CALL instructions. But here that's not possible.
2. SYSRET with non-canonical address
Intel CPUs have a problem with SYSRET with non-canonical return address. The description in the wiki is a bit vague; can someone expand?
Easiest fix is probably to not allow an executable page to be mapped at the last address before the address boundary. The only way I can think of how the problem can happen is if you have a SYSCALL instruction at 0x00007ffffffffffe, which is an error, anyway. So you could also recognize the error in the syscall handler and deliver a SIGSEGV or similar.
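To make the "recognize the error in the syscall handler" idea concrete, here is a minimal sketch of a canonical-address check that a kernel could run on the saved return address before choosing SYSRET. It assumes 48-bit virtual addresses (4-level paging); the names VADDR_BITS, is_canonical and can_use_sysret are made up for illustration and are not taken from any particular kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses (4-level paging). */
#define VADDR_BITS 48

/* An address is canonical if bits 63:47 are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VADDR_BITS - 1);               /* bits 63:47 */
    return top == 0 || top == (UINT64_MAX >> (VADDR_BITS - 1));
}

/* Hypothetical policy helper: only SYSRET to a canonical address,
 * otherwise take a slower but safe path (IRET, or kill the process). */
static bool can_use_sysret(uint64_t user_rip)
{
    return is_canonical(user_rip);
}
```

With a two-byte SYSCALL at 0x00007ffffffffffe, the saved return address is 0x0000800000000000, which this check rejects.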
3. SS selector
Apparently, AMD CPUs don't update the SS descriptor cache correctly on SYSRET, which is a problem if you SYSRET after a task switch out of an interrupt handler. One possible mitigation would be to IRET out of every syscall that was interrupted. Or alternatively explicitly save and load SS on task switch even in 64-bit mode.
4. PUSH selector
I found that one actually documented on felixcloutier.com. In 32-bit mode on Intel CPUs, if a segment selector is pushed onto the stack for any reason (be it a push instruction or an implicit push following an interrupt), then only a 16-bit move to memory is used. The high 16 bits are garbage in that case. Obvious mitigation is to only consider the low 16 bits of any such slot to be significant, which is good practice, anyway.
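As a small illustration of "only consider the low 16 bits", a sketch of how code reading a saved selector out of a 32-bit stack slot could mask it. The helper names are hypothetical.

```c
#include <stdint.h>

/* Extract the selector from a 32-bit stack slot, ignoring the
 * upper 16 bits, which may hold stale data. */
static inline uint16_t selector_from_slot(uint32_t slot)
{
    return (uint16_t)(slot & 0xFFFFu);
}

/* Requested privilege level: the low two bits of the selector. */
static inline unsigned selector_rpl(uint32_t slot)
{
    return selector_from_slot(slot) & 3u;
}
```

So a "did this come from ring 3?" test on a saved CS should go through something like selector_rpl() rather than compare the whole 32-bit slot.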
5. Nesting of NMI interrupts
And what fun we had with this one. Obvious mitigation is to not put the NMI handler on an IST stack. Which necessitates not using SYSCALL, or else hoping really hard that no NMI happens between the syscall entry point and the moment the kernel stack is set up.
6. F00F bug
The wording is a bit weird on the wiki page, but I think they mean the mitigation is to map the page which contains the IDT entry for #UD as uncacheable or write-through.
7. FDIV bug
If you actually still care about this one (all CPUs at 120 MHz and above are unaffected), on the affected machines you could just emulate the coprocessor. Maybe even emulate it with itself. There is no option to get an interrupt just on FDIV, so you would instead get an interrupt for every coprocessor instruction. Unless that instruction is FDIV, you can just execute it in kernel space. And for FDIV you can calculate everything in software.
Or alternatively disable the FPU entirely on the affected machines.
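Either option first needs a way to tell whether the machine is affected. A sketch of the widely circulated division check, using the well-known operand pair 4195835 / 3145727: a correct FPU gives a residual of essentially zero, while a flawed Pentium is reported to give 256. The function name and the threshold of 1.0 are my own choices.

```c
#include <stdbool.h>

/* Returns true if the FPU appears to have the FDIV flaw.
 * volatile keeps the compiler from folding the division at build time. */
static bool fdiv_looks_broken(void)
{
    volatile double x = 4195835.0;
    volatile double y = 3145727.0;
    double residual = x - (x / y) * y;

    if (residual < 0)
        residual = -residual;
    return residual > 1.0;   /* ordinary rounding error is far below this */
}
```

A kernel would run this once at boot, before any userspace code gets to use the FPU, and then pick the emulation or the disable path.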
8. Meltdown
As I understand it, the workaround on the affected machines (they will patch this in hardware later, right?) is to move all the kernel entry points (syscall, interrupts, maybe call gates), and the "current process" descriptor into special sections each. Every process then contains two different CR3 values (and attendant map tables): One which contains the entire kernel mapping (as usual), and one in which only these special entry sections and the entire userspace are mapped. On entry, then, CR3 has to be loaded with the value for the full kernel mappings, and on exit to userspace it has to be loaded with the value for the partial mapping. This way, out-of-order execution can't access kernel space at all, since those maps are marked as "not present".
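The two-CR3 scheme can be sketched roughly as below. This is a user-space model only: fake_cr3 stands in for the real register, write_cr3() would be a privileged MOV to CR3 in a kernel, and all names here are invented for illustration.

```c
#include <stdint.h>

/* Per-process pair of page-table roots. */
struct address_space {
    uint64_t kernel_cr3; /* full kernel mappings plus userspace */
    uint64_t user_cr3;   /* userspace plus the entry sections only */
};

static uint64_t fake_cr3; /* assumption: stand-in for the CR3 register */

/* In a real kernel this would be: mov %reg, %cr3 */
static void write_cr3(uint64_t value)
{
    fake_cr3 = value;
}

/* Called first thing in every entry stub (syscall, interrupt, ...). */
static void on_kernel_entry(const struct address_space *as)
{
    write_cr3(as->kernel_cr3);
}

/* Called last thing before returning to userspace. */
static void on_return_to_user(const struct address_space *as)
{
    write_cr3(as->user_cr3);
}
```

The entry stubs and whatever they touch before the switch have to be mapped in both tables, which is why they need their own sections.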
All right, that's about all I know about these. For many other bugs, the description is sparse and the mitigation is non-present. What about you guys?
P.S.: Does anyone have an idea how to format this post so it looks more structured?