How to fix Meltdown on my OS?

Haoud · Post by **Haoud** » Tue Jan 09, 2018 4:19 am

Hello everyone,
I have a very simple question: how do I fix the Meltdown vulnerability in my OS? I have read that the kernel code (but above all the data in kernel space) should be separated from the user code, but how?
Thank you for your answer.

oscoder · Post by **oscoder** » Tue Jan 09, 2018 6:06 am

It depends on whether you have any sensitive data in your kernel - that is, data you don't want userspace code to be able to read (like keys, etc). Since Meltdown only allows reading kernel memory, you only need protection if the kernel contains sensitive data.

If you're writing a monolithic kernel then you probably need it. If you're writing a microkernel then maybe not.

It also depends on the platform you're developing for. If your OS only runs on Raspberry Pis or something similar, those processors are not vulnerable, so you don't need any mitigation.

If you do need to protect against Meltdown, then the first thing you need is some way of detecting whether you're on a vulnerable CPU (since you don't want to cause a massive slowdown if you don't have to). Second, you'll want to implement the feature called "kernel page table isolation" (KPTI). I'm not familiar with the specifics of how this is done (maybe someone else here has looked into it?) but I'd suggest looking at the Linux kernel source for an example implementation.
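A minimal sketch of the detection step, under stated assumptions. The `cpu_info` structure and `cpu_needs_kpti` function are hypothetical names; the kernel is assumed to have filled the structure at boot from CPUID and, in ring 0, from RDMSR. The `RDCL_NO` bit in the `IA32_ARCH_CAPABILITIES` MSR (0x10A) is something Intel documented after Meltdown was disclosed, so on older CPUs it is simply absent and the sketch falls back to assuming the CPU is vulnerable.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Bit 0 of IA32_ARCH_CAPABILITIES (MSR 0x10A): RDCL_NO, meaning the CPU is
 * not susceptible to rogue data cache load, i.e. Meltdown. */
#define ARCH_CAP_RDCL_NO (1ULL << 0)

/* Hypothetical summary of what the kernel read from CPUID/RDMSR at boot. */
struct cpu_info {
    char     vendor[13];    /* CPUID leaf 0 vendor string, NUL-terminated */
    bool     has_arch_caps; /* CPUID.(EAX=7,ECX=0):EDX bit 29 */
    uint64_t arch_caps;     /* value of MSR 0x10A, valid if has_arch_caps */
};

/* Conservative policy: skip KPTI only when there is a positive reason to. */
static bool cpu_needs_kpti(const struct cpu_info *cpu)
{
    if (strcmp(cpu->vendor, "AuthenticAMD") == 0)
        return false;   /* AMD states its CPUs are not affected */
    if (cpu->has_arch_caps && (cpu->arch_caps & ARCH_CAP_RDCL_NO))
        return false;   /* the CPU itself reports it is not affected */
    return true;        /* unknown or older CPU: assume vulnerable */
}
```

Defaulting to "vulnerable" costs performance on CPUs the check does not recognise, but it never silently drops the mitigation.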

Finally, I should point out that you can't "fix" Meltdown since it's a bug in Intel processors. Technically you can only "mitigate" it.

davidv1992 · Post by **davidv1992** » Tue Jan 09, 2018 6:12 am

The basic idea of kernel page table isolation is that different page tables are used in kernel mode (ring 0) versus user mode (ring 3) of the processor. This requires some small pieces of kernel code to still be present in the user space page tables, which are then responsible for doing the page table swap.

In essence, this piece of code is (one of) the first things to run after a mode switch, and would be responsible for switching in the kernel page tables when going from ring 3 to ring 0, and switching to the userspace one when going from ring 0 to ring 3.
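As a rough illustration of the page table swap described above, here is a sketch that simulates CR3 with an ordinary variable so it can run in user space. The `address_space` structure, `kernel_entry`, `kernel_exit` and the simulated `write_cr3`/`read_cr3` helpers are all hypothetical; a real kernel would do the swap with `mov` to CR3 inside its assembly entry stubs, which must themselves be mapped in both sets of page tables.

```c
#include <stdint.h>

/* Simulated CR3 so the sketch runs in user space; a real kernel writes the
 * hardware register from its assembly entry stub instead. */
static uintptr_t simulated_cr3;

static void write_cr3(uintptr_t root) { simulated_cr3 = root; }
static uintptr_t read_cr3(void) { return simulated_cr3; }

/* Hypothetical per-process pair of page-table roots. */
struct address_space {
    uintptr_t kernel_root; /* maps the full kernel plus user space */
    uintptr_t user_root;   /* maps user space plus only the entry stubs */
};

/* One of the first things to run after a ring 3 -> ring 0 transition. */
static void kernel_entry(const struct address_space *as)
{
    write_cr3(as->kernel_root);
}

/* One of the last things to run before returning to ring 3. */
static void kernel_exit(const struct address_space *as)
{
    write_cr3(as->user_root);
}
```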

mallard · Post by **mallard** » Tue Jan 09, 2018 7:01 am

One thing you can do is always immediately terminate a program if it causes a protection violation page fault.

Meltdown relies on the OS giving applications a way to handle/ignore such faults so they can examine the state of the CPU cache after such a fault occurs. If the OS doesn't provide any way of doing that, there's no way to exploit the issue.

Also, one small advantage for those of us whose kernels are only 32-bit: hardware task switching can be used to improve the efficiency of KPTI...

Can anyone more familiar with x86 CPUs comment on whether simply performing a "WBINVD" in response to a privilege violation page fault would provide significant mitigation? The only issue I can see there is that in a multi-core processor a thread running on another core may get the opportunity to inspect the CPU cache before the instruction runs (I suppose that's the same with the "immediately terminate" mitigation, but at least then the process's entire exploit has a very short time window, rather than just each read attempt)... Privilege violation page faults shouldn't be common enough for it to cause a significant performance issue.

EDIT: Turns out that improving the handling of protection violation page faults is not enough, see below...

Haoud · Post by **Haoud** » Tue Jan 09, 2018 7:07 am

So the idea is to map the kernel's address space when a system call is made and unmap it again once the system call has finished?
Does marking the kernel pages as non-present protect against the flaw?
My operating system is designed to run on x86 and is a monolithic kernel.
On a page fault, my OS kills the whole process if it accessed memory it is not allowed to touch (outside its address space). Does that mean my OS is not affected, even though the problem is in the hardware?

Solar · Post by **Solar** » Tue Jan 09, 2018 7:16 am

mallard wrote: One thing you can do is always terminate a program if it causes a protection violation page fault.

Meltdown relies on the OS giving applications a way to handle/ignore such faults...

That is not correct.

Meltdown works because the CPU speculatively fetches memory contents into the cache. That fetch may be based on information that would trigger a page fault if that execution branch were actually executed -- but it isn't. The attack is done by checking what's in the cache and what's not.

The following is a visualization of the process, not a demonstration of how it's actually done:

unsigned kindex = 0xff80000; // kernel memory address
char * dummy = 0;
char array[ 0x200 ]; // probe array in our own (user) memory

if ( /* some false condition */ )
{
    int v1 = dummy[ kindex ]; // WOULD trigger page fault
    unsigned uindex = ( v1 & 1 ) * 0x100; // 0x0 or 0x100
    int v2 = array[ uindex ]; // some array we have
}

// check whether array[ 0 ] or array[ 0x100 ] is in cache

The "if" part is never actually executed, so no page fault is ever triggered -- but we just determined the lowest bit of the byte at 0xff80000.

This is a CPU bug, not something an OS "allows" for. The solution is to not have any critical memory mapped in ring 3 page tables (as davidv1992 described).
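To make the "check what's in the cache" step concrete, here is a toy simulation, not an exploit. It touches no kernel memory and measures no timings: the cache is modelled as two flags, and `transient_access` is a hypothetical stand-in for the squashed speculative loads in the example above. It only shows how repeating the one-bit trick eight times recovers a whole byte from cache state alone.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_STRIDE 0x100  /* same spacing as array[0] / array[0x100] above */

/* Toy cache model: one "is this line cached?" flag per probe slot. */
static bool line_cached[2];

static void flush_probe_lines(void)
{
    memset(line_cached, 0, sizeof line_cached);
}

/* Stand-in for the transient load pair: the architectural result is thrown
 * away, but the probe line selected by the secret bit stays "cached". */
static void transient_access(uint8_t secret, unsigned bit)
{
    unsigned uindex = ((secret >> bit) & 1u) * LINE_STRIDE;
    line_cached[uindex / LINE_STRIDE] = true;
}

/* Attacker side: rebuild one byte, bit by bit, using only the cache state. */
static uint8_t recover_byte(uint8_t secret)
{
    uint8_t value = 0;
    for (unsigned bit = 0; bit < 8; bit++) {
        flush_probe_lines();
        transient_access(secret, bit);
        if (line_cached[1])  /* "array[0x100] is cached" means the bit is 1 */
            value |= (uint8_t)(1u << bit);
    }
    return value;
}
```

In a real attack the `line_cached[1]` test would be a timing measurement of a load from the probe array, which is why terminating the process on a fault does not help: no fault needs to occur.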

mallard · Post by **mallard** » Tue Jan 09, 2018 7:24 am

Solar wrote:
mallard wrote: One thing you can do is always terminate a program if it causes a protection violation page fault.

Meltdown relies on the OS giving applications a way to handle/ignore such faults...
That is not correct.

Meltdown works because the CPU speculatively fetches memory contents into the cache. That fetch may be based on information that would trigger a page fault if that execution branch were actually executed.

The following is a visualization of the process, not a showcasing how it's actually done:
unsigned kindex = 0xff80000; // kernel memory address
char * dummy = 0;

if ( /* some false condition */ )
{
    int v1 = dummy[ kindex ]; // WOULD trigger page fault
    unsigned uindex = ( v1 & 1 ) * 0x100; // 0x0 or 0x100
    int v2 = array[ uindex ]; // some array we have
}

// check whether array[ 0 ] or array[ 0x100 ] is in cache
The "if" part is never actually executed, so no page fault is ever triggered -- but we just determined the lowermost bit of 0xff80000.

Thanks, that's a good example of why just improving the handling of privilege violation page faults is not enough. I thought there had to be more to it, since that approach isn't mentioned anywhere. Researching concrete information about this (and Spectre, which is much harder to mitigate) is difficult when 99.99% of articles are horribly dumbed-down summaries for the masses.

Also, by "never actually executed", I think you mean "never logically executed". The whole issue is that code that's not logically executed is actually executed thanks to speculative execution.

Still, implementing KPTI via hardware task switching is (as far as I can tell) a valid mitigation for 32-bit systems, where that's still possible (now if it emerges that AMD were aware of the issue back when they were designing x86_64, we'd have a nice little conspiracy theory there).

Solar wrote: This is a CPU bug, not something an OS "allows" for.

No need to get patronising...

Solar wrote: The solution is to not have any critical memory mapped in ring 3 page tables (as davidv1992 described)

Yes, KPTI...

Korona · Post by **Korona** » Tue Jan 09, 2018 10:31 am

Why is KPTI + hardware switching faster than KPTI + software switching? Are there any benchmarks on that?

mallard · Post by **mallard** » Tue Jan 09, 2018 10:54 am

Korona wrote:Why is KPTI + hardware switching faster than KPTI + software switching? Are there any benchmarks on that?

It's not "KPTI + hardware switching", it's "KPTI implemented using hardware task switching". I'm not talking about using hardware task switching for ordinary switching between processes. That's known to be slow. I'm talking about using it to switch from userspace to kernelspace (and vice versa), keeping their page tables separate.

Currently, conventional OSs have one task (TSS) per core and use interrupt or trap gates in the IDT to handle interrupts. Implementing KPTI this way "simply" means having two tasks per core (one for userspace and another for kernelspace) and using "task gates" to handle interrupts.

The "software" method requires a two-stage context switch (userspace to constantly-mapped mini-kernel, mini-kernel to full kernel); the "hardware task switching" method reduces this to one stage and eliminates the need for the "mini-kernel". While I've not tested it (yet), it's hard to believe that's going to be any slower.

I doubt any "major" OS will choose to do it this way, since it's only possible on a 32-bit OS (we can speculate that had Meltdown been discovered earlier and hardware task switching become the standard way of dealing with it, it may have been preserved in long mode, although CPUs would likely have been fixed instead, rendering it unnecessary), but it's well within the "crazy ideas for a hobby OS" sphere.
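For anyone wanting to experiment with the task-gate idea, here is a sketch of how the two kinds of 32-bit IDT entries are encoded. The helper names `make_task_gate` and `make_interrupt_gate` are hypothetical; the layout and the type values (0x5 for a task gate, 0xE for a 32-bit interrupt gate) follow the x86 protected-mode descriptor format. With a task gate, the CPU loads CR3 from the target TSS as part of the task switch, which is what would swap the page tables.

```c
#include <stdint.h>

/* 32-bit protected-mode IDT entry (8 bytes, no padding needed). */
struct idt_entry {
    uint16_t offset_low;  /* handler offset 15..0 (unused for task gates) */
    uint16_t selector;    /* code segment selector, or TSS selector */
    uint8_t  reserved;    /* must be zero */
    uint8_t  type_attr;   /* P | DPL | gate type */
    uint16_t offset_high; /* handler offset 31..16 (unused for task gates) */
};

_Static_assert(sizeof(struct idt_entry) == 8, "IDT entries are 8 bytes");

#define GATE_PRESENT    0x80
#define GATE_TYPE_TASK  0x05
#define GATE_TYPE_INT32 0x0E

/* Task gate: names the TSS of the kernel-side task; there is no handler
 * offset because EIP comes from that TSS. */
static struct idt_entry make_task_gate(uint16_t tss_selector, uint8_t dpl)
{
    struct idt_entry e = {0};
    e.selector  = tss_selector;
    e.type_attr = (uint8_t)(GATE_PRESENT | ((dpl & 3u) << 5) | GATE_TYPE_TASK);
    return e;
}

/* Ordinary interrupt gate, for comparison. */
static struct idt_entry make_interrupt_gate(uint32_t handler, uint16_t cs,
                                            uint8_t dpl)
{
    struct idt_entry e = {0};
    e.offset_low  = (uint16_t)(handler & 0xFFFFu);
    e.selector    = cs;
    e.type_attr   = (uint8_t)(GATE_PRESENT | ((dpl & 3u) << 5) | GATE_TYPE_INT32);
    e.offset_high = (uint16_t)(handler >> 16);
    return e;
}
```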

Korona · Post by **Korona** » Tue Jan 09, 2018 10:59 am

Well, it's not that there is a separate "mini-kernel"; the IRQ stubs do the same work that they normally do (e.g. push registers, swap segment registers etc.) and just have to do an additional CR3 switch, so there is really no wasted work here. It is hard to believe that a CR3 switch by hardware tasking is so much faster than a normal CR3 switch that the additional costs of hardware tasking are offset by that. On the contrary, I would expect the hardware tasking (including CR3 switch) to be heavily microcoded and to be much slower than the manual CR3 switch.

zaval · Post by **zaval** » Tue Jan 09, 2018 1:22 pm

Haoud wrote:
Hello everyone,
I have a very simple question: how do I fix the meltdown vulnerability under my OS? I have read that the kernel code (but above all the data in kernel space) should be separated from the user code, but how?
Thank you for your answer.

"Should" is a strong word. It might, if you want to turn your hobby OS into a hell of a dumb slowpoke just to reassure your non-existent "enterprise" users that their private keys are OK.

Seriously, I feel this hype will do more harm through "mitigations", in the form of ugly compiler patches and other crutches in the code, than the security "dangers" themselves. Not a problem for hobby OSes, forget it! Or jump into programming in-order (non-OoO) cores like the Cortex-A53 and others. Like me! xD

Roman · Post by **Roman** » Tue Jan 09, 2018 1:27 pm

In addition to what has been said here also see this (it's about improving KPTI performance with PCID): https://groups.google.com/forum/m/#!top ... 9mHTbeQLNU
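For context on the PCID idea, here is a sketch of how a CR3 value is composed when PCIDs are enabled (CR4.PCIDE = 1, which is only available in 64-bit mode). Bits 11:0 of CR3 then hold the PCID, and setting bit 63 of the value written to CR3 tells the CPU not to flush TLB entries tagged with that PCID. The `make_cr3` and `user_pcid` helpers, and the even/odd PCID pairing, are assumptions made for illustration rather than anything taken from the linked discussion.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK 0xFFFULL      /* CR3 bits 11:0 hold the PCID */
#define CR3_NOFLUSH   (1ULL << 63)  /* keep TLB entries tagged with this PCID */

/* Hypothetical convention: each address space gets an even PCID for its
 * kernel page tables and the following odd PCID for its user page tables. */
static uint16_t user_pcid(uint16_t kernel_pcid)
{
    return (uint16_t)(kernel_pcid | 1u);
}

/* Compose the value to write to CR3 for a page-table root and a PCID. */
static uint64_t make_cr3(uint64_t root_phys, uint16_t pcid, bool noflush)
{
    uint64_t cr3 = (root_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
    if (noflush)
        cr3 |= CR3_NOFLUSH;
    return cr3;
}
```

Because the kernel and user page tables carry different tags, switching between them on every entry and exit no longer has to throw away the whole TLB, which is where most of the KPTI cost comes from.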

Brendan · Post by **Brendan** » Tue Jan 09, 2018 1:43 pm

Hi,

mallard wrote:
Korona wrote:Why is KPTI + hardware switching faster than KPTI + software switching? Are there any benchmarks on that?
It's not "KPTI + hardware switching", it's "KPTI implemented using hardware task switching". I'm not talking about using hardware task switching for ordinary switching between processes. That's known to be slow. I'm talking about using it to switch from userspace to kernelspace (and vice versa), keeping their page tables separate.

Hardware task switching is known to be slow regardless of what you use it for - almost everything that can be done with a single micro-code instruction can be done faster with multiple simpler instructions that don't use micro-code; and there are things that hardware task switching does (e.g. managing the "busy" bit for both TSSs) that could be skipped to make doing it in software even faster.

Note that for kernel system calls (but not IRQs) there's a chance the CPU supports SYSENTER or SYSCALL (that avoid GDT accesses and protection checks for CS and SS segment loads) and therefore there's a chance that hardware task switching would be even worse in comparison.

For interrupts (IRQs, exceptions, etc); if the interrupt occurs at CPL=3 you'd have to switch CR3 but if the interrupt occurs at CPL=0 you don't need to change CR3. In this case, you'd probably want to duplicate all of the interrupt handlers to avoid using hardware task switching for the "interrupt occurs at CPL=0" case - e.g. one IDT where all the IDT entries use task gates and a second IDT where all the IDT entries use interrupt/trap gates, with separate "interrupt handling stubs" for each case; where the kernel does an "LIDT" to change which IDT the CPU is using immediately after any switch from CPL=3 to CPL=0 and immediately before any switch from CPL=0 to CPL=3.

Also don't forget that (as far as I know) "meltdown" doesn't affect AMD CPUs and doesn't affect very old CPUs (the original Pentium and earlier, and all Cyrix, Transmeta, NSC, SiS and IBM CPUs, and at least most of VIA's CPUs); and (because they didn't do "out-of-order") I suspect it doesn't affect the earliest Atom CPUs or the earliest Xeon Phi; and won't affect future Intel CPUs. I'd also assume that all good operating systems will eventually end up with logic to disable the meltdown mitigations for "trusted processes" running on CPUs that are affected. What this means is that you will want some sort of "if(CPU is affected and the process isn't trusted) { enable meltdown mitigation } else { don't enable meltdown mitigation }" logic; and you will want to minimise the differences between "mitigations enabled" and "mitigations disabled" throughout the kernel to minimise the impact on code maintenance.
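One possible way to structure that logic is sketched below, purely as an illustration. The `process` structure, its fields, and both functions are hypothetical. The idea is to make the decision once, when the process is created, and to express "mitigation disabled" as "both page-table roots are the same", so the rest of the kernel follows a single code path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-process state; field names are illustrative only. */
struct process {
    bool      trusted;     /* set by some OS-specific policy */
    uintptr_t kernel_root; /* page tables with the whole kernel mapped */
    uintptr_t user_root;   /* page tables used while running at CPL=3 */
};

/* Decide once which root the process uses at CPL=3. With the mitigation
 * off, user_root is simply the same as kernel_root. */
static void setup_user_root(struct process *p, bool cpu_affected,
                            uintptr_t isolated_root)
{
    bool mitigate = cpu_affected && !p->trusted;
    p->user_root = mitigate ? isolated_root : p->kernel_root;
}

/* Entry/exit stubs can test this and skip the CR3 reload when it is false. */
static bool needs_cr3_switch(const struct process *p)
{
    return p->user_root != p->kernel_root;
}
```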

Cheers,

Brendan

DavidCooper · Post by **DavidCooper** » Tue Jan 09, 2018 5:23 pm

How often does an OS actually need to access data that needs to be kept hidden? Wouldn't it be possible for the kernel itself to run under two different sets of page tables, with one of them shutting out any memory that you don't want apps to be able to access through these vulnerabilities? That way, if an interrupt occurs at CPL=3 you wouldn't need a CR3 switch unless the kernel actually needs to access private data, and I suspect that in most cases it doesn't.

bluemoon · Post by **bluemoon** » Tue Jan 09, 2018 10:08 pm

The problem is that you never know which information is sensitive, and given the urgency, the best fix is simply to unmap the whole kernel.

But I see your point: yes, in future the kernel could re-map itself lazily.

OSDev.org
