CPU bug makes virtually all chips vulnerable

Brendan · Post by **Brendan** » Sun Jan 07, 2018 6:04 pm

Hi,

~ wrote:If it's possible to keep clean the cache between processes and flush it at any faults/exceptions, flushing despite bogus out-of-order execution using synchronization in the kernel, as well as keeping selected pages cache-disabled for private data, then it won't be so difficult to fix, but has to be implemented and run to see if it really stops the attack, we can't know without executing these measures.

All operating systems support multi-threaded processes now, which means that one thread/CPU can (e.g.) cause a page fault while other threads/CPUs (in the same process) measure the effects on (shared) caches before the kernel has had a chance to flush anything (where "caches" includes things like TLBs, which are probably just as usable as data caches). Mostly, the only thing flushing caches would do is ruin performance for everything without improving security much.

Cheers,

Brendan

JAAman · Post by **JAAman** » Sun Jan 07, 2018 7:36 pm

you don't need to trigger a fault (and do not need TSX) to use this flaw to deduce kernel memory contents -- it does not rely on any fault happening and can be executed without any fault of any kind

it can be done like this, without producing any faults, or using any fancy modern features (this will work on all CPUs supporting OoO):

Code:

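XOR EAX, EAX
CMP EAX, 0
JZ .done                    ;always taken for real; mis-predicted once the predictor is trained

MOV ECX, [kernal_address]   ;never retired, thus never produces a fault (brackets: load the data, not the address)
AND ECX, 1                  ;keep one bit of the kernel data
SHL ECX, 0x10               ;bit 0 -> offset 0, bit 1 -> offset 0x10000
MOV EAX, [myData+ECX]       ;pulls one of two parts of myData into cache

.done:
;here we will test what part of myData was loaded into cache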

this obviously won't work as is, but the idea is to first train the branch predictor so that it mis-predicts the conditional branch beginning this block, so the block itself is speculatively executed (thus fetching the data at the kernel address and loading the relevant portion of myData into cache)

normally you would get a fault when attempting to access the kernel address, but since the if block is not actually executed, no fault happens
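
to make the last step concrete, here is a toy model in C of what the speculated block leaves behind and what the code at .done then looks for. it is only a sketch under loud assumptions: the cache is a made-up 8-line fully associative structure, `toy_cache`, `transient_block` and `probe_bit` are hypothetical names, and a real probe would time loads instead of calling `toy_is_cached`. nothing here reads kernel memory; the kernel byte is simply passed in as a parameter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 6   /* assumption: 64-byte cache lines */
#define TOY_LINES  8   /* assumption: tiny fully associative cache */

typedef struct {
    uint64_t line[TOY_LINES];   /* line numbers (address >> LINE_SHIFT) */
    bool     valid[TOY_LINES];
    size_t   next;              /* round-robin replacement slot */
} toy_cache;

/* Empty the whole cache (stands in for flushing myData before the attack). */
void toy_flush(toy_cache *c)
{
    for (size_t i = 0; i < TOY_LINES; i++)
        c->valid[i] = false;
    c->next = 0;
}

/* True if the line holding addr is present (stands in for a fast timed load). */
bool toy_is_cached(const toy_cache *c, uint64_t addr)
{
    for (size_t i = 0; i < TOY_LINES; i++)
        if (c->valid[i] && c->line[i] == (addr >> LINE_SHIFT))
            return true;
    return false;
}

/* Bring the line holding addr into the cache. */
void toy_touch(toy_cache *c, uint64_t addr)
{
    if (toy_is_cached(c, addr))
        return;
    c->line[c->next] = addr >> LINE_SHIFT;
    c->valid[c->next] = true;
    c->next = (c->next + 1) % TOY_LINES;
}

/* The block after the mis-predicted JZ: its register results are thrown
 * away when the CPU notices the mis-prediction, but the cache fill stays. */
void transient_block(toy_cache *c, uint64_t my_data, uint8_t kernel_byte)
{
    uint64_t ecx = kernel_byte;      /* the load from kernal_address */
    ecx &= 1;                        /* AND ECX, 1    */
    ecx <<= 0x10;                    /* SHL ECX, 0x10 */
    toy_touch(c, my_data + ecx);     /* MOV EAX, [myData+ECX] */
}

/* The code at .done: returns the leaked bit, or -1 if nothing can be told. */
int probe_bit(const toy_cache *c, uint64_t my_data)
{
    bool hit0 = toy_is_cached(c, my_data);
    bool hit1 = toy_is_cached(c, my_data + (UINT64_C(1) << 0x10));
    if (hit0 == hit1)
        return -1;
    return hit1 ? 1 : 0;
}
```

after `toy_flush(&c)`, calling `transient_block(&c, base, 0x35)` and then `probe_bit(&c, base)` gives 1, and the same with 0x34 gives 0. the point is that the "result" never travels through a register the program is allowed to keep, only through which line ended up cached.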

~ · Post by **~** » Sun Jan 07, 2018 7:46 pm

Or make the kernel load any new memory allocations directly into cache for any cache-enabled pages. That would also make the timing differences that Meltdown/Spectre rely on a lot harder to discern.

The kernel can probably halt all the threads in a process as soon as a page fault occurs, and it can decide what to do from there.

I would try to at least flush the cache only for the addresses used in the offending operation, and restart the program after some multitasking cycles so that the timing effect of the cache can dissipate. That gives the system and the other programs the chance to do other things and get rid of the traces left by cache usage that is too automatic.

Implementing similar solutions could still be useful, looking for the faster options and providing programs with APIs for better securing their data as more tricks about caching/paging are learned with the help of many more programmers than before.

I think that page faults, even legitimate ones, would be scarce, so programs would only see a slight delay when requesting a lot of new memory, or on systems that have too little RAM and too many processes. Flushing the cache would only happen in selected cases: when programs request it, and when page faults or other exceptions occur. Once the RAM has been allocated, the program would run at normal speed because nothing more would trigger relatively expensive operations like a cache flush.

Maybe some things could be done like allocating physical RAM immediately instead of deferring it until the memory is accessed, and making certain buffers/variables, or a whole program, always stay fully allocated in RAM so that page faults almost never happen. Programs would request this via API calls, to optimize memory usage for cache, for speed/immediate allocations, etc. It would be valuable as long as no malware is present and only well-designed programs are running.

The bigger advantage remains that if we control the programs that run on our machine, then no leakage can take place. And if browsers (which run in the same program address space) are implemented such that only non-cached memory is used for critical user data when viewing SSL-enabled login pages, reading cookies or the like, then no deduction of user data can happen. This would apply by default, only while navigating secure pages, unless the user changes the settings.

Solar · Post by **Solar** » Mon Jan 08, 2018 2:24 am

JAAman wrote:you don't need to trigger a fault (and do not need TSX) to use this flaw to deduce kernel memory contents -- it does not rely on any fault happening and can be executed without any fault of any kind

[...]

normally you would get a fault when attempting to access the kernel address, but since the if block is not actually executed, no fault happens

No fault happens.

Page 7 of the discussion and Tilde is still barking up the wrong tree...

DavidCooper · Post by **DavidCooper** » Mon Jan 08, 2018 11:26 am

Would it be possible/practical to change page table entries every time control is handed over to an app to hide memory containing crucial data from it so that it's only addressable when the kernel is running? That way, the "MOV ECX, kernal_address" part of the code would take useless values from somewhere else instead. Would that cause any slowing beyond the short time taken to change values in the table? Most interrupt routines wouldn't need to access such memory so there would be no slowing of those.

Korona · Post by **Korona** » Mon Jan 08, 2018 11:46 am

DavidCooper wrote:Would it be possible/practical to change page table entries every time control is handed over to an app to hide memory containing crucial data from it so that it's only addressable when the kernel is running? That way, the "MOV ECX, kernal_address" part of the code would take useless values from somewhere else instead. Would that cause any slowing beyond the short time taken to change values in the table? Most interrupt routines wouldn't need to access such memory so there would be no slowing of those.

That is exactly what the patches do (well they don't change tables on the fly, they just swap CR3), as I stated some pages earlier. The most performance loss comes from the TLB invalidation and not from the table setup cost though.
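
To illustrate where that cost comes from, here is a toy model in C. It is a sketch, not the actual patch code: `toy_cpu`, `toy_load_cr3` and the other names are hypothetical, and the model assumes the plain case where every CR3 write drops the non-global TLB entries (no PCID). Each process gets two page-table roots, and every trip into the kernel and back costs two CR3 loads.

```c
#include <stdint.h>

typedef struct {
    uint64_t kernel_root;       /* page tables mapping kernel + user space */
    uint64_t user_root;         /* page tables mapping user space + entry stubs only */
    uint64_t cr3;               /* modelled CR3 register */
    unsigned long tlb_flushes;  /* how often the TLB was thrown away */
} toy_cpu;

/* Model of a CR3 write without PCID: non-global TLB entries are dropped. */
void toy_load_cr3(toy_cpu *cpu, uint64_t root)
{
    cpu->cr3 = root;
    cpu->tlb_flushes++;
}

/* Kernel entry: switch to the tables that map kernel data. */
void toy_enter_kernel(toy_cpu *cpu)
{
    toy_load_cr3(cpu, cpu->kernel_root);
}

/* Return to the app: switch to the tables where kernel data is not mapped. */
void toy_return_to_user(toy_cpu *cpu)
{
    toy_load_cr3(cpu, cpu->user_root);
}

/* Run n system calls from user mode and report the TLB flushes they cost. */
unsigned long toy_syscalls(toy_cpu *cpu, unsigned n)
{
    unsigned long before = cpu->tlb_flushes;
    for (unsigned i = 0; i < n; i++) {
        toy_enter_kernel(cpu);      /* kernel data becomes addressable */
        /* ... kernel work would happen here ... */
        toy_return_to_user(cpu);    /* kernel data is unmapped again */
    }
    return cpu->tlb_flushes - before;
}
```

In this model 1000 system calls cost 2000 TLB flushes, and the tables themselves are never edited. The refills after each flush are what hurts, which is why syscall-heavy workloads are the ones that notice.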

~ · Post by **~** » Tue Jan 09, 2018 9:05 am

Shouldn't it be more efficient to always load the memory of a program into cache, at least beginning with the most recent buffer page addresses, every time that an allocation or a switch to that task happens?

Or at least, whenever a page needs to be physically reserved, also load it into cache so that the program is always guaranteed to see its memory in cache before using it?

Wouldn't loading the cache in this way be considerably faster than just constantly flushing it on exceptions?

That would not only neutralize Meltdown/Spectre as a mere side effect (the important optimization being the guaranteed cache speed for each program), but would also ensure that a program is always sped up, because it would be loaded into the fast CPU cache every time it was its turn to run or to allocate RAM. Meltdown/Spectre would then always yield the same leaked result (all 1 or all 0), because its buffers would always be found cached and fast-responding.

At least I remember that MTRR registers of several new and not so new Pentium III-type CPUs (from around 2000) could be used to specify how to use the cache. Maybe those could be set up per running task when it's the turn of the process to run and before actually running, to guarantee that the memory it has allocated will already be in the cache. It sounds like a very useful optimization.

Now that I think about it, the Meltdown/Spectre papers don't specify which cache mode was used from the several ways to configure it, or whether they work in all of those modes (write-combine, write-through, write-back, etc...).

bluemoon · Post by **bluemoon** » Tue Jan 09, 2018 9:52 am

Aggressively loading things to cache will only do cache pollution. Also consider a decent i7 with 12MB L3 cache, filling the cache means a lot of wasted traffic, just to get evicted 5us later on next task switch. Not to mention cache line works with hashing and not linearly.
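
Some back-of-the-envelope arithmetic for that, as a small C sketch. The 64-byte line size and the 20 GB/s sustained memory bandwidth are assumptions picked for illustration, not measurements, and both function names are made up.

```c
#include <stdint.h>

/* Number of cache lines needed to cover cache_bytes (rounded up). */
uint64_t lines_to_fill(uint64_t cache_bytes, uint64_t line_bytes)
{
    return (cache_bytes + line_bytes - 1) / line_bytes;
}

/* Microseconds spent streaming cache_bytes from RAM at bytes_per_us. */
double fill_time_us(uint64_t cache_bytes, double bytes_per_us)
{
    return (double)cache_bytes / bytes_per_us;
}
```

For a 12 MiB L3, `lines_to_fill(12ULL * 1024 * 1024, 64)` is 196608 lines, and at the assumed 20000 bytes per microsecond `fill_time_us` comes out at roughly 629 us per fill. That is memory traffic spent before the task has done any useful work.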

EDIT: removed rant.

mariuszp · Post by **mariuszp** » Tue Jan 09, 2018 9:59 am

bluemoon wrote:I can't tell if you are trolling or being serious.

But aggressively loading things to cache will only do cache pollution. Also consider a decent i7 with 12MB L3 cache, filling the cache means a lot of wasted traffic, just to get evicted 5us later on next task switch. Not to mention cache line works with hashing and not linearly.

To be fair, in this particular instance "~" doesn't seem to be trying to annoy us on purpose; not everyone knows the exact details of how a cache works. I understand that the idea makes no sense, but it does look like he is trying to make sense this time.

EDIT: Is it just me, or does a Tilde render as a minus sign now?

Korona · Post by **Korona** » Tue Jan 09, 2018 10:16 am

~ is refusing to do basic research here. I'm not sure if he is trolling, delusional, affected by Dunning-Kruger or does not have the mental capacity to read and understand the papers, patches and posts in this thread. However, he keeps mentioning cache eviction and exceptions even though it has been stated multiple times that this has nothing to do with the exploit.

His last post in particular absolutely lacks even basic preparation: Obviously, caches are too small to hold entire programs. No fix to Meltdown flushes caches on exceptions. Heck, fixing Meltdown is a problem already solved by the KPTI patches, which are also far superior in performance to any invalidation method proposed by ~. MTRRs affect physical memory and not virtual memory.

By continuing with this low-effort-low-quality attitude, ~ heavily disrupts the genuine discussion that this thread is about. Not only does reiterating stuff dozens of times annoy people reading and responding to this thread (at least it does that for me), it also reflects badly on this forum as a suitable place to discuss non-trivial OS development.

Sik · Post by **Sik** » Tue Jan 09, 2018 2:32 pm

mariuszp wrote:EDIT: Is it just me, or does a Tilde render as a minus sign now?

It's not just you, it's the font used by default by all phpBB 3 forums ._. (and yes, the post editor uses a larger font that renders the tilde as a tilde)

mariuszp · Post by **mariuszp** » Tue Jan 09, 2018 3:31 pm

I thought people were writing tilde as a minus as some sort of insult the whole time, haha

Anyway, here's another possible fix for Spectre: maybe we just get rid of branch prediction, and instead, when a branch is found in speculative execution, just switch to a different task (like a virtual core does). I don't know if that would completely mitigate the performance drop from removing branch prediction, but if the resulting drop is small enough, then it would be worth it for the security; and since it would be implemented in new chips only, an increase in clock speed which is to be expected might remove the performance drop completely.

I have no benchmarks to show or even any idea how this could be benchmarked by me, but it's an idea. What do you think?

Solar · Post by **Solar** » Wed Jan 10, 2018 2:01 am

mariuszp wrote:...an increase in clock speed which is to be expected might remove the performance drop completely.

Well... not really.

You see, from a hardware point of view, the issue is this:

With ever-increasing clock speeds, the number of things you could do in one clock cycle went down. The answer was to design longer pipelines, with each step ("stage") in the pipeline doing less, but having the clock speed go up to the point where the overall processing speed increased.

This in turn brought the problem of pipeline stalls. When the beginning of the pipeline encounters a branch, with the branch condition still being computed in later stages of the pipeline, you'd have to wait until the condition is computed before you knew which part of the branch to feed into the pipeline. Like a buffer flush, this was costly in terms of performance.

The answer was branch prediction, and speculative execution. You feed the "most likely" part of the branch to the pipeline, executing it speculatively, and if it turns out your guess was correct, you avoid the pipeline stall. If your guess is incorrect, you are no worse off than if you had stalled.

If you get rid of branch prediction, and increase the clock speed (which more or less requires lengthening the pipeline, among other things), your pipeline stalls will just get worse.

So it's not as easy as that. And we're pretty much at the point of clock speed saturation. Have you realized that clock speeds have long since stopped increasing the way they used to? Recent performance increases have mostly come from architectural tweaks, not clock speed increases. And many of those tweaks were, you guessed it, better branch prediction...
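
To put rough numbers on the stall argument, here is a deliberately crude model in C. Everything in it is an assumption for illustration: a pipeline that otherwise retires one instruction per cycle, a fixed penalty per stalled branch, and made-up figures for branch frequency and predictor accuracy. The function name is hypothetical.

```c
/* Average cycles per instruction for a pipeline that otherwise retires one
 * instruction per cycle.
 *   branch_fraction: share of instructions that are branches (0..1)
 *   stall_rate:      share of branches that pay the penalty
 *                    (1.0 without prediction, the mis-prediction rate with it)
 *   penalty_cycles:  cycles lost per stalled branch, grows with pipeline depth
 */
double cycles_per_instruction(double branch_fraction,
                              double stall_rate,
                              double penalty_cycles)
{
    return 1.0 + branch_fraction * stall_rate * penalty_cycles;
}
```

With 20% branches and a 15-cycle penalty, stalling on every branch (`stall_rate` of 1.0) gives about 4.0 cycles per instruction, while a predictor that is wrong 5% of the time gives about 1.15. Double the penalty to 30 cycles for a longer pipeline and the no-prediction case climbs to about 7.0, so the higher clock has a lot of ground to make up.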

Sik · Post by **Sik** » Wed Jan 10, 2018 4:38 am

Not to mention that clock speeds are going down instead... High clock speeds waste energy and it has gotten to the point of the waste outright becoming interference (hence rendering the CPU unusable unless underclocked). And that's ignoring things that run on battery.

Actually, given how battery-powered devices are the big deal right now, I'm surprised we aren't encouraging programmers to optimize their code as much as possible for slower CPUs, which would also make performance loss a much lesser concern since we'd be already going with that mentality in mind (even if it's true that the antenna is the biggest battery eater, keeping the CPU running too fast isn't doing any favors and the heat does damage the battery).

Solar · Post by **Solar** » Wed Jan 10, 2018 6:13 am

Sik wrote:...I'm surprised we aren't encouraging programmers to optimize their code as much as possible for slower CPUs...

"We" are doing that. "They" aren't, because what's selling is moar features, moar eye candy, moar FPS, semi-transparent windows, inlined videos, ...

By the way, the Raspberry Pi is not vulnerable. In case someone is looking for a home-banking platform.

OSDev.org
