Some kernel information can leak to user-space through various side channels. Some of these are known (Spectre, Meltdown) and people have created defences against them; however, it'd be foolish to assume that all of them are known, and equally foolish to assume that the defences people (Microsoft, Google, Linux) have implemented adequately defend against any "currently unknown" side channels.
These side channels come from information that the CPU retains across a switch from kernel to user-space. To prevent all known and all unknown side channels, you'd "only" have to nuke any information that the CPU might have retained across that switch. "Nuking" the information means invalidating it, polluting it with garbage, or replacing it with something unimportant, so that an attacker in user-space can't obtain it.
The information a CPU may retain can be broken down into the following (a rough sketch of the full sequence follows the list):
- Caches
- TLBs and higher level paging structure caches
- Store Buffers
- Return Buffers
- Branch Target Buffers
- Pipelines
- Time
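To make the ordering concrete, here's a minimal sketch (in C) of what the whole "ultra-nuke" might look like on the kernel-to-user return path. All of the helper names are hypothetical placeholders for the steps described below:

    /* Hypothetical sketch of an "ultra-nuke" on the kernel-to-user return
       path. Each helper corresponds to one of the steps described below;
       none of these names come from any real kernel. */
    void ultra_nuke(void)
    {
        nuke_tlbs();            /* invalidate TLBs and paging-structure caches */
        nuke_store_buffers();   /* drain the store buffers */
        nuke_return_buffers();  /* overwrite return buffer entries */
        nuke_btb();             /* overwrite branch target buffer entries */
        add_random_delay();     /* obscure how long the kernel actually took */
        /* ...then return to user-space */
    }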
TLBs can be invalidated either by modifying CR4 (e.g. toggling the "global pages enabled/disabled" flag) or by reloading CR3 if global pages aren't being used. Note that PCID leaves information in the TLBs that may still be exploitable (e.g. by checking whether a TLB entry belonging to the attacker's address space was evicted by something the kernel did in the kernel's address space) and is insufficient to guard against "currently unknown" attacks.
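As a sketch (64-bit, ring 0, GNU inline assembly; assumes interrupts are disabled so nothing reloads CR4 mid-sequence), toggling the CR4 "global pages" flag looks something like this:

    #include <stdint.h>

    #define CR4_PGE (1UL << 7)   /* CR4 "global pages enabled" flag */

    /* Flush all TLB entries (including "global" ones) by clearing and
       then restoring CR4.PGE. */
    static inline void nuke_tlbs(void)
    {
        uint64_t cr4;

        __asm__ __volatile__("mov %%cr4, %0" : "=r"(cr4));
        __asm__ __volatile__("mov %0, %%cr4" :: "r"(cr4 & ~CR4_PGE) : "memory");
        __asm__ __volatile__("mov %0, %%cr4" :: "r"(cr4) : "memory");
    }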
Store buffers only really matter if you're using non-temporal stores or write-combining; a store fence ("sfence") followed by a dummy store should be enough to ensure the store buffers are empty.
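A minimal sketch of that (the "dummy" variable is just any harmless kernel-owned location):

    #include <stdint.h>

    static volatile uint64_t dummy;   /* harmless kernel-space location */

    /* Fence, then issue one unimportant store so the store buffers end
       up containing nothing sensitive. */
    static inline void nuke_store_buffers(void)
    {
        __asm__ __volatile__("sfence" ::: "memory");
        dummy = 0;
    }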
Information in return buffers can be overwritten with unimportant data by executing a series of (at least) 16 calls whose target is the instruction immediately after the call itself (e.g. "call $+5" in NASM syntax, so that each call pushes a return address and falls through), followed by an "add rsp,16*8" to remove the return addresses from the stack.
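In GNU inline assembly that might look like the following (the ".rept" directive repeats the call 16 times; each "call 1f" pushes a return address and falls through to the "1:" label on the next line):

    /* Overwrite (at least) 16 return buffer entries with harmless return
       addresses, then discard those addresses from the real stack. */
    static inline void nuke_return_buffers(void)
    {
        __asm__ __volatile__(
            ".rept 16        \n\t"
            "call 1f         \n"
            "1:              \n\t"
            ".endr           \n\t"
            "add $128, %%rsp \n\t"   /* 16*8: remove the 16 return addresses */
            ::: "memory");
    }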
In theory, branch target buffers can be overwritten with a series of (at least) 4096 branches that are spread out over a large enough area. How large that area needs to be depends on the BTB topology (associativity, number of sets). Without knowing the BTB topology (and without knowing the BTB's entry eviction policy), it'd still be "relatively safe" to assume that 65536 branches spread over a 256 KiB area (an average of 4 bytes per branch) is sufficient. Note that (because BTB indexing works on virtual addresses) you could fill a single physical page with branches and map that same physical page 64 times at consecutive virtual addresses in kernel space to create a 256 KiB area full of branches (in other words, it needn't cost 256 KiB of RAM).
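Here's a hedged sketch of that trick in C. Everything in it is hypothetical: map_page() stands in for whatever mapping primitive your kernel has, branch_page_phys is assumed to be one physical page filled with short branches that ends with a "ret", and BRANCH_AREA_BASE is assumed to be a free kernel virtual range:

    #include <stdint.h>

    #define PAGE_SIZE        4096
    #define BRANCH_AREA_BASE 0xFFFFFFFF90000000UL   /* hypothetical free kernel range */

    extern uintptr_t branch_page_phys;   /* one page of branches, ending in "ret" */
    extern void map_page(uintptr_t virt, uintptr_t phys);   /* your kernel's primitive */

    /* Map the same physical page 64 times at consecutive virtual addresses
       (256 KiB of branches for 4 KiB of RAM), then execute each mapping.
       Because the BTB is indexed by virtual address, each of the 64 calls
       exercises a different range of BTB indices. */
    void nuke_btb(void)
    {
        /* (the mapping could be set up once at boot instead) */
        for (uintptr_t i = 0; i < 64; i++)
            map_page(BRANCH_AREA_BASE + i * PAGE_SIZE, branch_page_phys);

        for (uintptr_t i = 0; i < 64; i++) {
            void (*run_page)(void) =
                (void (*)(void))(BRANCH_AREA_BASE + i * PAGE_SIZE);
            run_page();   /* executes the page's branches at this mapping's
                             virtual addresses, then returns via its "ret" */
        }
    }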
After doing all of the above, there's approximately zero chance that the CPU's pipelines will still contain anything from before you started the "ultra-nuke".
For time: a kernel can add a small random delay before returning to user-space (from the kernel API, exception handlers, IRQ handlers, etc.), and can add "random jitter" to the functions used to obtain the time. This alone does not hide the time the kernel spends doing anything - to circumvent "random uncertainty" an attacker can do the same thing a few thousand times and use the average. To guard against that, the kernel can check whether user-space is doing the same thing lots of times (e.g. calling the same kernel API function with the same parameters) and take evasive action (e.g. stall the potential attacker's task for 10 seconds each time, so that calling the same kernel API function with the same parameters a few thousand times ends up taking several days). In addition, when many tasks are running, a kernel can queue pending operations (kernel API calls and exceptions) and perform them in random order, so that no single task can guess how long its own operations took. Batch system calls (tasks giving the kernel a list of things to do, rather than asking the kernel to do one thing at a time) would also help a lot.
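As a small sketch of the "random delay before returning" part (rand64() is a hypothetical kernel PRNG; the 0..1023 cycle range is an arbitrary choice):

    #include <stdint.h>

    extern uint64_t rand64(void);   /* hypothetical kernel PRNG */

    /* Read the CPU's timestamp counter. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Spin for a random 0..1023 cycles before returning to user-space, so
       a single measurement can't recover how long the kernel really took. */
    void add_random_delay(void)
    {
        uint64_t end = rdtsc() + (rand64() & 0x3FF);

        while (rdtsc() < end)
            __asm__ __volatile__("pause");
    }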
Of course "ultra-nuke" would be incredibly expensive. If an OS tracks how trusted/untrusted a process is, it could easily compromise - e.g. use the full "ultra-nuke" when returning to a very untrusted process, (randomly?) disable/skip pieces when returning to a slightly untrusted process, and disable/skip the entire thing when returning to a very trusted process. This "trust based dynamic compromise" can also include the ability to fall back to less expensive approaches (e.g. use KPTI and PCID when full TLB flushing is disabled).
Any thoughts/ideas or suggested improvements appreciated!
Cheers,
Brendan