Kernel Side-Channel "Ultra-Nuke"

simeonz
Member
Posts: 360
Joined: Fri Aug 19, 2016 10:28 pm

Re: Kernel Side-Channel "Ultra-Nuke"

Post by simeonz »

Brendan wrote:For "row select" timing in the RAM chips themselves, it'd be ruined by CPU fetching the first instruction from user-space. Doing WBINVD (invalidating all the caches) between kernel API calls would make it impossible to infer anything from preceding kernel API calls. The timing mitigation/s I mentioned should (I hope) cover the problem of using 2 kernel API functions at the same time (where one is used to infer something about the other).
tRP + tRCD penalties can be triggered from the user code itself by performing accesses over a (virtual address) range allocated with a large page, or allocated "consecutively" if deterministic allocation is possible, using a fixed page stride to determine whether any row triggers a "row conflict". If timed sufficiently accurately, it may reveal the relative (to some unknown offset) location of a physical page used by the kernel during the last service call. It is a long way from a full exploit, but it could contribute to a larger attack plan.
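To illustrate, here is a rough user-space sketch of the probing step I have in mind (assuming x86-64 Linux with 2 MiB huge pages available and RDTSCP timing; the stride and the 300-cycle cutoff are made-up numbers):

Code: Select all

/* Rough sketch: time one access per page stride inside a large-page region,
 * looking for "row conflict" (tRP + tRCD) latencies. Assumes x86-64 Linux
 * with 2 MiB huge pages available; the cutoff value is purely illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define REGION (2u * 1024 * 1024)  /* one 2 MiB large page */
#define STRIDE 4096u               /* fixed page stride    */

static uint64_t time_access(volatile uint8_t *p)
{
    unsigned aux;
    _mm_clflush((const void *)p);  /* force the load to go to DRAM */
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                      /* the timed load */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    uint8_t *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    for (unsigned off = 0; off < REGION; off += STRIDE) {
        buf[off] = 1;              /* touch the page so it is resident */
        uint64_t dt = time_access(&buf[off]);
        if (dt > 300)              /* made-up "slow" threshold */
            printf("offset %#x: %llu cycles (possible row conflict)\n",
                   off, (unsigned long long)dt);
    }
    return 0;
}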
Brendan wrote:For device drivers; a micro-kernel using the IOMMU (to restrict what devices can/can't access) goes a long way to prevent device drivers from using devices to figure out what the kernel did/didn't fetch into cache. Assuming that the kernel handles timers itself (e.g. and doesn't have a "HPET driver" in user-space); I'm not sure if a device driver can use a device to get better time measurements (e.g. to bypass the timing mitigation/s I mentioned).
The idea was to detect a filesystem cache hit (not a CPU cache hit), to determine whether an API reuses a storage device location for two pieces of information, relying on the I/O latency of the storage device to detect an actual read. Similarly, in a different scenario, disk seeking may approximately reveal the distances between file extents, which, I speculate, may become part of an offline attack plan.
Brendan wrote:You'd still need to either make a small number of "accurate enough" measurements (which would be ruined by random delays before returning from kernel to user-space) or need to make a large number of measurements (which the kernel could hopefully detect). If you can't do either of these things then you can't gather the data to send to a remote server to analyse.
The random jitter could be compensated for by the attacker, assuming it has a stationary mean. Also, running a busy loop in parallel provides a coarse measure of time. I do agree that the system can abstract and protect its services better, but it could be that the programming model would have to be scrapped and core changes made at the scheduling and synchronization level (with shared memory possibly having to be discarded altogether). I also feel that a JIT-compiling infrastructure could hypothetically enforce such complicated constraints with a smaller performance penalty.
Brendan wrote:Note that network access would be disabled by default (a process can't send/receive packets unless the admin explicitly enabled network access for that executable).
Depends. On one hand, the executable can be stopped from executing entirely. But if it is launched anyhow, since software uses all sorts of online databases and servers, it may not be feasible to make such a restriction effective (because the user or the security manifest is likely to attest to the network usage). This all depends on the system. Servers will obviously be administered differently, and with a narrower purpose, than, say, home desktops.
Brendan wrote:I'd say that if a process is intentionally malicious it should crash and burn; and if a process is unintentionally exploitable then it should crash and burn.
Ideally, yes. But it may not happen. Spyware, ransomware, and bots do not always need to escalate their privileges to do damage. Of course, escalating control over the system and becoming viral are the bigger threats, generally speaking, but some users may care less about that after the fact. In contrast, unintentionally exploitable code may assist in forcing its own crash during or after the attack, by defining the scope of its current operation and requesting policing by the kernel.
Brendan wrote:I don't think it makes any difference if a piece of software is created by a human or created by a machine - if the software is malicious or exploitable then it dies regardless of what created it.
Security over-relies (with justification) on immutability. Signing, DEP, secure boot, etc., are effective ways to limit the software zoo in software designed around the Harvard architecture. But if the software comes bundled with some sort of interpreter, as most games and a lot of office software do today, the data inside can offer Turing-complete behavior that must be protected as well. For generalized learning AI, despite the usual safeguards, the state is by necessity mutable and, in a narrow sense, Turing-complete. It can certainly be exploited to re-purpose itself contrary to its original intent.
Last edited by simeonz on Wed Mar 21, 2018 12:03 pm, edited 2 times in total.
simeonz
Member
Posts: 360
Joined: Fri Aug 19, 2016 10:28 pm

Re: Kernel Side-Channel "Ultra-Nuke"

Post by simeonz »

Schol-R-LEA wrote:simeonz wrote:
Vaguely reminds me of the sci-fi concept of homomorphic encryption - information to process, but not to inspect.


Uhm, is this sarcasm? It may be experimental, but 'sci fi' seems a bit dismissive, though I suppose you could have meant it as being 'extremely advanced to the point of seeming fantastical', which does seem apt. Unfortunately, nuanced communication is rather difficult over plain text...
Admittedly, that came out wrong. I meant that it would come across as a whimsical reference. On the other hand, I do hope it develops, because if cloud computing becomes prevalent, everyone's information will be completely in the open. Unless a person is willing to rely on remote attestation from big companies like Intel and Microsoft, and to hope that they are not required by local laws to disclose information, or that the value of their brand is bigger than the collective value of the information they host, this remains the only prospective solution I am aware of. Not to mention that it will be a cool achievement. I actually like future-tech more than practical tech, even if it is beyond my level of mathematical maturity. :)
Schol-R-LEA wrote:While it may not be practical yet except in very limited forms, according to that page there have been proof-of-concept implementations of the fully homomorphic encryption for a couple of years (it's even available as FOSS libraries, though I doubt they are much use yet except to researchers), and apparently significant improvements have come in the past few years.
Honestly, I haven't looked into the current progress carefully, because I thought the stuff was still very conceptual. I need time to get a grasp of the terminology to better understand what realistic performance expectations one can have at the moment. But indeed, there appears to be a steady stream of comparatively recent progress and experimentation.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Brendan »

Hi,
simeonz wrote:
Brendan wrote:For "row select" timing in the RAM chips themselves, it'd be ruined by CPU fetching the first instruction from user-space. Doing WBINVD (invalidating all the caches) between kernel API calls would make it impossible to infer anything from preceding kernel API calls. The timing mitigation/s I mentioned should (I hope) cover the problem of using 2 kernel API functions at the same time (where one is used to infer something about the other).
tRP + tRCD penalties can be triggered from the user code itself by performing accesses over a (virtual address) range allocated with a large page, or allocated "consecutively" if deterministic allocation is possible, using a fixed page stride to determine whether any row triggers a "row conflict". If timed sufficiently accurately, it may reveal the relative (to some unknown offset) location of a physical page used by the kernel during the last service call. It is a long way from a full exploit, but it could contribute to a larger attack plan.
If this was possible; physical address space randomisation would mean it'd take a huge amount of time (e.g. billions of tests) for the attacker to become "relatively confident" that a virtual page it can access is in the same row as a sensitive piece of data; and re-randomising physical pages periodically while the OS is running (if done often enough) would prevent the attacker from becoming "relatively confident".
simeonz wrote:
Brendan wrote:For device drivers; a micro-kernel using the IOMMU (to restrict what devices can/can't access) goes a long way to prevent device drivers from using devices to figure out what the kernel did/didn't fetch into cache. Assuming that the kernel handles timers itself (e.g. and doesn't have a "HPET driver" in user-space); I'm not sure if a device driver can use a device to get better time measurements (e.g. to bypass the timing mitigation/s I mentioned).
The idea was to detect a filesystem cache hit (not a CPU cache hit), to determine whether an API reuses a storage device location for two pieces of information, relying on the I/O latency of the storage device to detect an actual read. Similarly, in a different scenario, disk seeking may approximately reveal the distances between file extents, which, I speculate, may become part of an offline attack plan.
Ah - if an attacker has permission to read a file, then the attacker can measure how long a read takes to determine whether something else accessed the file recently. To defend against this you'd just use file system permissions effectively (ensure an attacker can't read from a file if it matters that the file was accessed recently).
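For reference, the measurement itself is trivial for the attacker - something like this rough sketch (POSIX, with a made-up path and a made-up 100 microsecond cutoff), which is exactly why the permissions are the thing that matters:

Code: Select all

/* Sketch of the measurement: time a small read and guess from the latency
 * whether the data was already in the file system cache (i.e. whether the
 * file was accessed recently). POSIX; the path and the 100 microsecond
 * cutoff are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long read_latency_ns(const char *path)
{
    char buf[4096];
    struct timespec t0, t1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void)read(fd, buf, sizeof buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    long ns = read_latency_ns("/shared/some_file");  /* hypothetical path */
    if (ns >= 0)
        printf("read took %ld ns -> %s\n", ns,
               ns < 100000 ? "probably cached (accessed recently)"
                           : "probably came from the device");
    return 0;
}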
simeonz wrote:
Brendan wrote:You'd still need to either make a small number of "accurate enough" measurements (which would be ruined by random delays before returning from kernel to user-space) or need to make a large number of measurements (which the kernel could hopefully detect). If you can't do either of these things then you can't gather the data to send to a remote server to analyse.
The random jitter could be compensated for by the attacker, assuming it has a stationary mean. Also, running a busy loop in parallel provides a coarse measure of time. I do agree that the system can abstract and protect its services better, but it could be that the programming model would have to be scrapped and core changes made at the scheduling and synchronization level (with shared memory possibly having to be discarded altogether). I also feel that a JIT-compiling infrastructure could hypothetically enforce such complicated constraints with a smaller performance penalty.
These things don't work. For example, if you split everything into isolated single-threaded processes that are only able to communicate via. messaging (a pure actor model where no memory is shared at all) and then use managed languages for all user-space code; then one process can just send messages to another as fast as it can and the receiver can measure how much time passed by counting the number of messages it received.
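A crude sketch of that "messages as a clock" trick (POSIX fork/pipe, purely illustrative - the real point is that any communication channel can be abused this way):

Code: Select all

/* Crude sketch: with no timer at all, a process can still measure elapsed
 * time by letting another process flood it with messages and counting how
 * many piled up while an operation ran. POSIX fork/pipe; purely illustrative. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static unsigned long drain(int fd)      /* count queued "tick" bytes */
{
    char buf[256];
    ssize_t n;
    unsigned long ticks = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ticks += (unsigned long)n;
    return ticks;
}

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;

    pid_t sender = fork();
    if (sender == 0) {                  /* "clock" process: send ticks forever */
        char tick = 0;
        close(fds[0]);
        for (;;)
            write(fds[1], &tick, 1);
    }

    close(fds[1]);
    fcntl(fds[0], F_SETFL, O_NONBLOCK);

    drain(fds[0]);                      /* empty the pipe first */
    usleep(1000);                       /* stand-in for the operation being timed */
    unsigned long ticks = drain(fds[0]);

    printf("~%lu ticks elapsed during the operation\n", ticks);
    kill(sender, SIGKILL);
    return 0;
}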

Mostly; there's a benefit in disallowing access to extremely precise time measurements (e.g. TSC's "better than 1 nanosecond precision" timing) because it's easy for the kernel to make it harder for an attacker (without making it hard for legitimate code); but making things harder shouldn't be confused with prevention - making sure there are no timing differences an attacker can measure should be the primary goal.
simeonz wrote:
Brendan wrote:Note that network access would be disabled by default (a process can't send/receive packets unless the admin explicitly enabled network access for that executable).
Depends. On one hand, the executable can be stopped from executing entirely. But if it is launched anyhow, since software uses all sorts of online databases and servers, it may not be feasible to make such a restriction effective (because the user or the security manifest is likely to attest to the network usage). This all depends on the system. Servers will obviously be administered differently, and with a narrower purpose, than, say, home desktops.
If the OS failed to prevent differences in timing, and if the publisher of "malware.exe" hasn't been blacklisted world-wide already, and if the admin was tricked into explicitly allowing "malware.exe" to access the network; then the attacker probably deserves a reward for bypassing 3 entirely different protections. ;)
simeonz wrote:
Brendan wrote:I'd say that if a process is intentionally malicious it should crash and burn; and if a process is unintentionally exploitable then it should crash and burn.
Ideally, yes. But it may not happen. Spyware, ransomware, and bots do not always need to escalate their privileges to do damage. Of course, escalating control over the system and becoming viral are the bigger threats, generally speaking, but some users may care less about that after the fact. In contrast, unintentionally exploitable code may assist in forcing its own crash during or after the attack, by defining the scope of its current operation and requesting policing by the kernel.
Brendan wrote:I don't think it makes any difference if a piece of software is created by a human or created by a machine - if the software is malicious or exploitable then it dies regardless of what created it.
Security over-relies (with justification) on immutability. Signing, DEP, secure boot, etc., are effective ways to limit the software zoo in software designed around the Harvard architecture. But if the software comes bundled with some sort of interpreter, as most games and a lot of office software do today, the data inside can offer Turing-complete behavior that must be protected as well. For generalized learning AI, despite the usual safeguards, the state is by necessity mutable and, in a narrow sense, Turing-complete. It can certainly be exploited to re-purpose itself contrary to its original intent.
I still don't see how it makes any difference.

Let's say "Dave" writes some software and signs it with their key, and you decide to trust "Dave" and execute their software, but their software is malicious and tries to do something that OS doesn't allow so the OS terminates it. Why does it matter if "Dave" is a human or an AI? It was still terminated, and you still don't trust "Dave" anymore.

Now let's assume that "Fred" writes some software and signs it with their key, and you decide to trust "Fred" and execute their software, but their software is exploitable and is tricked into doing something the OS doesn't allow, so the OS terminates it. Why does it matter if the software used an interpreter or not? It was still terminated, and you still don't trust "Fred" anymore.

Now let's say "Bob" writes some software and signs it with their key, and you decide to trust "Bob" and execute their software, but (under specific circumstances) it does something the OS doesn't allow, and the OS terminates it. Why does it matter if the software used fancy ("intelligent") self modifying code or not? It was still terminated, and you still don't trust "Bob" anymore.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
AMenard
Member
Posts: 67
Joined: Mon Aug 25, 2014 1:27 pm

Re: Kernel Side-Channel "Ultra-Nuke"

Post by AMenard »

The thing that bugs me the most about these security flaws is the fact that we just don't know what other crappy code the hardware manufacturers or OEMs put inside the chips themselves... It's not like AMD or Intel open-source their microcode.
simeonz
Member
Posts: 360
Joined: Fri Aug 19, 2016 10:28 pm

Re: Kernel Side-Channel "Ultra-Nuke"

Post by simeonz »

Brendan wrote:If this was possible; physical address space randomisation would mean it'd take a huge amount of time (e.g. billions of tests) for the attacker to become "relatively confident" that a virtual page it can access is in the same row as a sensitive piece of data; and re-randomising physical pages periodically while the OS is running (if done often enough) would prevent the attacker from becoming "relatively confident".
All right. Let's say some comparatively innocuous API "A" accesses statically allocated kernel data "X", and a system event "B", kept undisclosed for security reasons, accesses statically allocated kernel structure "Y". From reverse-engineering, we establish that the compiler has placed "X" and "Y" in a single page. The software finds a row that it owns that collides with "X", by calling API "A" and then repeatedly probing for latency. By accessing that same location in the future, depending on other factors, such as the rest of the content in that kernel page and the frequency of access to it, it may acquire a reasonably informative test for "B" occurring. Similar reasoning could be applied to the file cache test, albeit with different assumptions.

I admit, my examples are still weak. I realize that. But Spectre and Meltdown are, to me, symptoms of a known general illness - that side-channel attacks can involve residual effects we have not anticipated, be it in hardware (CPU or adapters), in the system software, in how multiple entities interact when combined, etc. The point is, programs extract information in ways we cannot conceive of. Even if the information is not directly useful, it could improve the chance of success of a direct exploit. And the latter do unfortunately happen. So virtualizing the execution environment of an application and leaving it to its own devices within the confines of a sandbox may no longer be a secure architecture.
Brendan wrote:These things don't work. For example, if you split everything into isolated single-threaded processes that are only able to communicate via. messaging (a pure actor model where no memory is shared at all) and then use managed languages for all user-space code; then one process can just send messages to another as fast as it can and the receiver can measure how much time passed by counting the number of messages it received.

Mostly; there's a benefit in disallowing access to extremely precise time measurements (e.g. TSC's "better than 1 nanosecond precision" timing) because it's easy for the kernel to make it harder for an attacker (without making it hard for legitimate code); but making things harder shouldn't be confused with prevention - making sure there are no timing differences an attacker can measure should be the primary goal.
I have not seriously pondered on the details involved, but let me draw a picture. Let's say that the program is a graph of events embedded within a larger graph, that of the system execution. Our aim is to order the events in the program graph in linear time, such that they are statistically independent of the order of the rest of the graph, except where the control flow dependencies through APIs, i.e. "directed edges", are explicitly required.

What this means is that any API call made, which may carry an unintended control flow side-effect, requires us to randomly reorder some of the program's own execution events. The reason I think a JIT compiler could do that more efficiently is because it can rearrange the very code and introduce dynamic branches where necessary, or can package multiple randomization events together at compile-time into one. A JIT compiler could, in principle, use optimized user-space micro-scheduling of sorts. It could even reorder loops, in some situations. In any case, it will be cheaper than a typical system scheduler call. Similarly, communication through primitives can be done through shared memory, controlled by JIT. Synchronization intercepted at compile-time will be reordered dynamically in the code when needed, but I imagine could offer smaller throughput reduction than a similar binary interface. So, purely hypothetically, a JIT infrastructure could introduce a very "weak" execution model, and then run it on hardware with stronger execution guarantees, to enforce information barrier on the program. The program could then never reliably extract side-channel data.
Brendan wrote:I still don't see how it makes any difference.

Let's say "Dave" writes some software and signs it with their key, and you decide to trust "Dave" and execute their software, but their software is malicious and tries to do something that OS doesn't allow so the OS terminates it. Why does it matter if "Dave" is a human or an AI? It was still terminated, and you still don't trust "Dave" anymore.

Now let's assume that "Fred" writes some software and signs it with their key, and you decide to trust "Fred" and execute their software, but their software is exploitable and is tricked into doing something the OS doesn't allow, so the OS terminates it. Why does it matter if the software used an interpreter or not? It was still terminated, and you still don't trust "Fred" anymore.

Now let's say "Bob" writes some software and signs it with their key, and you decide to trust "Bob" and execute their software, but (under specific circumstances) it does something the OS doesn't allow, and the OS terminates it. Why does it matter if the software used fancy ("intelligent") self modifying code or not? It was still terminated, and you still don't trust "Bob" anymore.
If the goal is to protect the user from damages, exploits are a case of stolen software identity. Let's say the user has downloaded the software from a website that she trusts, because it has a valid SSL certificate and she believes that the payload is protected during storage and transmission, or from a trusted online software store where the content is signed. Let's say the software in question is a thin client for an online book store, where the user fills in sensitive information. (Not that I understand a lot about online commerce.) If the security subsystem offers self-administered protections to applications, such as DEP, demoting the privileges of child processes, applying firewall rules to child processes, etc., then when the process is exploited, it will most likely crash. Without such mitigations, the exploited process could start running malicious code, contact its own server, and send the information unhindered. From the system's point of view, no problem. From the user's point of view, not so much.

The problem with AI is that it conceptually cannot do DEP. The entire data is "intelligent" and mutable, because changes are made to it continuously. Therefore, it will be impossible to freeze the logic into an immutable state and sign it, because it has a floating nature. If it gets exploited, we cannot discriminate a good AI transition from a malicious one. One of the main counter-exploitation mitigations, authentication and immutability of the controlling logic, is lost. Again, this is not a problem for the system integrity, but for the affected user.
Brendan wrote:If the OS failed to prevent differences in timing, and if the publisher of "malware.exe" hasn't been blacklisted world-wide already, and if the admin was tricked into explicitly allowing "malware.exe" to access the network; then the attacker probably deserves a reward for bypassing 3 entirely different protections. ;)
If the goal is to protect the core services and to maintain process, user and session isolation, there is indeed no difference between malware and exploitable software. This is one layer of security, and the system can also assist the applications in protecting themselves, assist the users in protecting themselves, or even protect the applications from the users, the users from the applications, and itself from "itself". I agree that improperly authorized malware cannot be stopped and the system has no responsibility to stop it. But my argument was that with malicious software, the user experiences losses by default. In contrast, if he interacts with exploitable software and that software is attacked, it is by virtue of special security services available to the application (assuming the developer has consciously used them) that consequent damage will be prevented - most likely with a crash. The exploitable software's identity will not be impersonated by malicious code, even if it suffers from downtime and unavailability. So, from this point of view, software exploitation is a separate concern for security, one which aims to protect the user, not the system.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Brendan »

Hi,
simeonz wrote:
Brendan wrote:If this was possible; physical address space randomisation would mean it'd take a huge amount of time (e.g. billions of tests) for the attacker to become "relatively confident" that a virtual page it can access is in the same row as a sensitive piece of data; and re-randomising physical pages periodically while the OS is running (if done often enough) would prevent the attacker from becoming "relatively confident".
All right. Let's say some comparatively innocuous API "A" accesses statically allocated kernel data "X", and a system event "B", kept undisclosed for security reasons, accesses statically allocated kernel structure "Y". From reverse-engineering, we establish that the compiler has placed "X" and "Y" in a single page. The software finds a row that it owns that collides with "X", by calling API "A" and then repeatedly probing for latency. By accessing that same location in the future, depending on other factors, such as the rest of the content in that kernel page and the frequency of access to it, it may acquire a reasonably informative test for "B" occurring. Similar reasoning could be applied to the file cache test, albeit with different assumptions.
I think you're confusing "statically allocated virtual space" with "statically allocated physical pages". If an OS uses statically allocated physical pages and maps them at random virtual addresses (e.g. KASR without PASR) then this would be a potential problem. If an OS uses randomly allocated physical pages and maps them at statically allocated virtual addresses (PASR without KASR) then this isn't a potential problem.
simeonz wrote:
Brendan wrote:These things don't work. For example, if you split everything into isolated single-threaded processes that are only able to communicate via. messaging (a pure actor model where no memory is shared at all) and then use managed languages for all user-space code; then one process can just send messages to another as fast as it can and the receiver can measure how much time passed by counting the number of messages it received.

Mostly; there's a benefit in disallowing access to extremely precise time measurements (e.g. TSC's "better than 1 nanosecond precision" timing) because it's easy for the kernel to make it harder for an attacker (without making it hard for legitimate code); but making things harder shouldn't be confused with prevention - making sure there are no timing differences an attacker can measure should be the primary goal.
I have not seriously pondered on the details involved, but let me draw a picture. Let's say that the program is a graph of events embedded within a larger graph, that of the system execution. Our aim is to order the events in the program graph in linear time, such that they are statistically independent of the order of the rest of the graph, except where the control flow dependencies through APIs, i.e. "directed edges", are explicitly required.

What this means is that any API call made, which may carry an unintended control flow side-effect, requires us to randomly reorder some of the program's own execution events.
For some background...

A long time ago I decided that (to avoid the cost of "CPL=3 -> CPL=0 -> CPL=3" switches and improve performance); it'd be a good idea to support "batch system calls", where a thread creates a list of kernel functions (and their parameters) and asks the kernel to do the list of things. That way if a thread wanted kernel to do 10 things it could pay the cost of "CPL=3 -> CPL=0 -> CPL=3" switching once and not pay that cost 10 times. Note: this is also why I'm seriously considering "ultra-nuke" (which would add a huge amount of overhead to the "CPL=0 -> CPL=3" costs) - it'd be too expensive without "batch system calls".
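As a rough sketch (the names and structures below are invented for illustration, not my actual ABI, and the dispatcher is just a user-space stand-in for the real kernel entry point), a batch might look like this:

Code: Select all

/* Hypothetical sketch of a "batch system calls" interface: user space fills
 * in a list of requests and enters the kernel once, paying the
 * "CPL=3 -> CPL=0 -> CPL=3" cost once instead of once per call. All names
 * are invented; sys_batch() here is a user-space stand-in for the real
 * kernel entry point. */
#include <stdint.h>
#include <stdio.h>

#define BATCH_MAX 16

struct kcall {
    uint32_t function;        /* which kernel function to run */
    uint64_t arg[4];          /* its parameters               */
    int64_t  result;          /* filled in by the kernel      */
};

struct kcall_batch {
    uint32_t     count;
    struct kcall calls[BATCH_MAX];
};

static int64_t sys_batch(struct kcall_batch *b)   /* stand-in dispatcher */
{
    for (uint32_t i = 0; i < b->count; i++) {
        switch (b->calls[i].function) {
        case 1:  b->calls[i].result = 0;  break;  /* pretend: succeeded  */
        case 2:  b->calls[i].result = (int64_t)b->calls[i].arg[0]; break;
        default: b->calls[i].result = -1; break;  /* unknown function    */
        }
    }
    return 0;
}

int main(void)
{
    struct kcall_batch b = { .count = 2 };
    b.calls[0] = (struct kcall){ .function = 1, .arg = { 0x3F8 } };
    b.calls[1] = (struct kcall){ .function = 2, .arg = { 42 } };

    sys_batch(&b);                                /* one switch for the list */
    for (uint32_t i = 0; i < b.count; i++)
        printf("call %u -> %lld\n", (unsigned)i, (long long)b.calls[i].result);
    return 0;
}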

About 5 years ago I decided to try "kernel stack per CPU" (my older kernels were all "kernel stack per thread"). The consequence of "kernel stack per CPU" is that expensive kernel operations (e.g. process creation) can't be interrupted because there's no other stack for that CPU to use. For example, if an IRQ occurs that unblocks a high priority task, then the kernel can't switch to that high priority task and return to user-space until after it has completed the expensive operation. To fix that you end up wanting to break expensive operations into smaller pieces so that you can have "safe interruption points" between the pieces. For example, instead of having a "create process" operation, you might split it into a "create empty virtual address space" piece, then several "map next part of executable into virtual address space" pieces, then a "create initial thread" piece, etc; so that you can put those "safe interruption points" in between all the pieces (and so that you can allow a switch to a higher priority task to happen between any of those pieces and avoid waiting for ages for the whole lengthy operation to complete). Of course I've always liked "do important stuff first" so naturally I wanted to give all these pieces priorities, so I could postpone unimportant work when there's more important work to do (e.g. postpone "create virtual address space intended for a new low priority process" so that I could do "give high priority user-space thread CPU time" sooner). Then it occurred to me that this can also have scalability advantages for multi-CPU.

Essentially, a thread would tell kernel "do this list of things" and the kernel would split more expensive things into smaller pieces and give all the smaller pieces a priority (that is derived directly from the priority of the thread that asked the kernel to do the work); and all of the CPUs would do "highest priority piece first" in parallel (whenever running a user-space thread isn't more important); so that something like creating a process might use 5 CPUs at the same time (and finish faster). For dependencies; there's no problem - if "small piece B" can't be started until "small piece A" has finished, then initially only "small piece A" would be put onto the kernel's work queue, and when "small piece A" finished the kernel would put "small piece B" onto the kernel's work queue. Of course with many user-space threads running on multiple CPUs (plus work the kernel does on its own) the kernel's work queues end up being a mixed bag of many small unrelated pieces.

With all of this in mind; what if I added a random "boost or nerf" adjustment to the priority of the smaller pieces (instead of the priority of the smaller pieces being derived directly from the priority of the thread that asked the kernel to do the work)? With a very minor change, the kernel would execute the "mixed bag of many small unrelated pieces" in a semi-randomised order.
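In code the change really is minor - something like this sketch (names invented; a real kernel would use a proper entropy source rather than rand(), and the -3..+3 range is just an example):

Code: Select all

/* Sketch of the "boost or nerf" idea: derive each piece's queue priority from
 * the requesting thread's priority, then add a small random offset so the
 * "mixed bag of small pieces" drains in a semi-randomised order. Names are
 * invented; a real kernel would use a proper entropy source, not rand(). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PRIO_MIN 0
#define PRIO_MAX 255

static uint8_t piece_priority(uint8_t thread_priority)
{
    int jitter = (rand() % 7) - 3;          /* -3 .. +3, illustrative range */
    int p = (int)thread_priority + jitter;
    if (p < PRIO_MIN) p = PRIO_MIN;
    if (p > PRIO_MAX) p = PRIO_MAX;
    return (uint8_t)p;
}

int main(void)
{
    for (int i = 0; i < 5; i++)             /* same thread, varying piece priority */
        printf("piece priority: %u\n", piece_priority(128));
    return 0;
}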

Note that I'm currently only really concerned with protecting the kernel from user-space, and currently only want to make the kernel unpredictable.
simeonz wrote:The reason I think a JIT compiler could do that more efficiently is because it can rearrange the very code and introduce dynamic branches where necessary, or can package multiple randomization events together at compile-time into one. A JIT compiler could, in principle, use optimized user-space micro-scheduling of sorts. It could even reorder loops, in some situations. In any case, it will be cheaper than a typical system scheduler call. Similarly, communication through primitives can be done through shared memory, controlled by JIT. Synchronization intercepted at compile-time will be reordered dynamically in the code when needed, but I imagine could offer smaller throughput reduction than a similar binary interface. So, purely hypothetically, a JIT infrastructure could introduce a very "weak" execution model, and then run it on hardware with stronger execution guarantees, to enforce information barrier on the program. The program could then never reliably extract side-channel data.
Last year I was experimenting with graphics; and wanted to convert a buffer of pixel data that was in a standard format (96-bits per pixel in the CIE XYZ colour space, with no gamma) into whatever the video mode and monitor happens to want (e.g. maybe 24-bit per pixel in one of the RGB colour spaces, with gamma), including doing dithering. Normal optimised assembly wasn't really fast enough (there were too many branches and too many memory accesses for local variables) so my solution was dynamic code generation. Specifically, a collection of pre-optimised snippets that would be stitched together and then patched, so that half the branches disappeared and half of the local variables effectively became constants.
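As a toy version of the "stitch and patch" idea (not the actual blitter - just an x86-64 Linux illustration, and it only works if the OS still allows a writable+executable mapping):

Code: Select all

/* Toy version of "stitched and patched" code generation (x86-64 Linux only,
 * and only if the OS allows a writable+executable mapping): copy a canned
 * snippet (mov eax, imm32; ret) into an executable buffer and patch the
 * immediate, so a would-be local variable becomes a constant. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*gen_fn)(void);

int main(void)
{
    /* b8 xx xx xx xx = mov eax, imm32 ; c3 = ret */
    uint8_t snippet[] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    uint32_t constant = 12345;                 /* value "patched in"      */
    memcpy(&snippet[1], &constant, sizeof constant);

    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    memcpy(buf, snippet, sizeof snippet);      /* "stitch" the snippet in */

    gen_fn fn = (gen_fn)buf;                   /* call the generated code */
    printf("generated function returned %d\n", fn());

    munmap(buf, 4096);
    return 0;
}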

Of course this was faster than normal assembly, which was faster than "ahead of time compiled" code, which is faster than "JIT compiled" code (where you can't afford to do expensive optimisations at run time). For a crude estimate; I'd say JIT would've been at least 500% slower.

Of course this is a special case and JIT is fine for a lot of things; but there's lots of different special cases and "everything must be JIT" is completely unacceptable because of those special cases.
simeonz wrote:The problem with AI is that it conceptually cannot do DEP. The entire data is "intelligent" and mutable, because changes are made to it continuously. Therefore, it will be impossible to freeze the logic into an immutable state and sign it, because it has a floating nature. If it gets exploited, we cannot discriminate a good AI transition from a malicious one. One of the main counter-exploitation mitigations, authentication and immutability of the controlling logic, is lost. Again, this is not a problem for the system integrity, but for the affected user.
Let's assume that you have a human who creates plain text files (containing C source code), and you also have an AI that creates plain text files (containing C source code). In this case (after compiling, signing, packaging, ...) both the human's code and the AI's code can use DEP; so what is the difference?

Now let's assume that a human and an AI both create plain text files (containing C source code); where (after compiling, signing, packaging, ...) the resulting code is self modifying (e.g. like the "graphics blitter from stitched together snippets" I mentioned above). In this case, neither the human's code nor the AI's code can use DEP; so what is the difference?

You're mostly saying that self modifying code is less secure (which is arguably correct); and then assuming that AI always means "intelligently self modifying". From my perspective; "intelligently self modifying" is the least plausible scenario. For example; the idea of using an extremely expensive and extremely unintelligent "try all the things until something works" approach every time anyone decides to (e.g.) open a text editor (and then not saving the generated code after you've done the extremely expensive/unintelligent code generation, and not ending up with "AI generated code that is not self modifying at all") is far fetched at best.
simeonz wrote:
Brendan wrote:If the OS failed to prevent differences in timing, and if the publisher of "malware.exe" hasn't been blacklisted world-wide already, and if the admin was tricked into explicitly allowing "malware.exe" to access the network; then the attacker probably deserves a reward for bypassing 3 entirely different protections. ;)
If the goal is to protect the core services and to maintain process, user and session isolation, there is indeed no difference between malware and exploitable software. This is one layer of security, and the system can also assist the applications in protecting themselves, assist the users in protecting themselves, or even protect the applications from the users, the users from the applications, and itself from "itself". I agree that improperly authorized malware cannot be stopped and the system has no responsibility to stop it. But my argument was that with malicious software, the user experiences losses by default. In contrast, if he interacts with exploitable software and that software is attacked, it is by virtue of special security services available to the application (assuming the developer has consciously used them) that consequent damage will be prevented - most likely with a crash. The exploitable software's identity will not be impersonated by malicious code, even if it suffers from downtime and unavailability. So, from this point of view, software exploitation is a separate concern for security, one which aims to protect the user, not the system.
The goal is prevention.

Most of the reason vulnerabilities exist is because there's very little incentive to prevent them from existing. A vulnerability will be discovered, most users won't know anything about the vulnerability and just groan about being coerced into updating their software without knowing why, and the relatively small number of users who do find out about it hear about vulnerabilities so regularly that they've become numb (it's just "Yawn, another vulnerability this week, everything is still as bad as it's always been"). For larger companies the incentive to prevent vulnerabilities adds up to a small loss of sales that lasts for a few weeks or until their competitor (if there is one) makes a similar blunder; which works out to a temporary loss of profit that's significantly less than the typical weekly advertising budget.

So, given that saying "don't worry, you'll do better next time" while patting them on the back in sympathy has failed to make any difference for decades; how can we encourage people to try harder to avoid releasing exploitable software? Threatening to destroy their business (by blacklisting all their products and guaranteeing zero sales forever) is one way to encourage prevention (but perhaps that's a little extreme?).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Korona »

Seems like, as of today, you have to add a long (i.e. 2^13 instructions) list of taken (or non-taken) direct branches that is executed three times¹ to your nuke... Also: Disable performance counters to make the attack less reliable.

I think examples like this show that the "nuke" defense is not really desirable. The trade-offs - security vs. performance, security vs. developer annoyance (they have to have their code signed) and security vs. user annoyance (they will blame the OS if non-signed code is slow) - are just bad. And it still does not defend against unknown zero days!

I think we just have to accept the existence of zero days, make hardware manufacturers take responsibility for the speculation bugs (i.e. by fixing them in future processors) and stress that users cannot rely on the security of their systems if they do not keep them up to date. I acknowledge that this has had only mixed success in the past but it also seems that there is at least a bit more awareness after Meltdown and Spectre that OS manufacturers can capitalize on.

¹ Considering that half of the branches will be mispredicted during the first iteration, the cost of this alone is already on the same order of magnitude as a ping over gigabit LAN. Granted, that is nothing compared to the cost of the WBINVD instruction, which takes more time than an entire process time slice in Linux.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Brendan »

Hi,
Korona wrote:Seems like, as of today, you have to add a long (i.e. 2^13 instructions) list of taken (or non-taken) direct branches that is executed three times¹ to your nuke...
That's annoying - they're finding vulnerabilities faster than I can implement mitigations. Let's hope that the current increased rate of vulnerability discoveries is just a temporary thing (and not the beginning of an exponential growth curve ;) ).
Korona wrote:Also: Disable performance counters to make the attack less reliable.
For debugging and profiling facilities (including breakpoints, single-step, step on branches, reading and writing to the process' virtual address space, seeing contents of messages sent/received, tracing system calls, etc) you'd have a monitoring process and a monitored process, where the kernel does the actual monitoring and sends/receives messages to the monitoring process. For security; I'm intending to have an "allow inspection" flag in the executable file header that needs to be explicitly set by the developer at compile time (where the kernel refuses to monitor a process if the flag isn't set), and I'm intending to have some way for the admin to allow an executable to act as a monitoring process (where the kernel refuses to send/receive "monitoring messages" to the process if the admin hasn't allowed it). These would be mutually exclusive - a process that allows inspection can't act as a monitoring process (partly because I don't want to deal with things like "send me a message when I receive a message" tricking the kernel into creating an unstoppable message flood).
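In other words the kernel-side policy check is tiny - something like this sketch (the names are invented):

Code: Select all

/* Sketch of the policy above (invented names): a process may be monitored
 * only if its executable opted in, a process may act as a monitor only if
 * the admin allowed it, and the two roles are mutually exclusive. */
#include <stdbool.h>

struct process {
    bool allow_inspection;    /* from the flag in the executable header */
    bool admin_monitor_ok;    /* set by the admin for this executable   */
};

static bool may_monitor(const struct process *monitor,
                        const struct process *target)
{
    if (!target->allow_inspection)      /* target never opted in            */
        return false;
    if (!monitor->admin_monitor_ok)     /* admin never allowed this monitor */
        return false;
    if (monitor->allow_inspection)      /* roles are mutually exclusive     */
        return false;
    return true;
}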
Korona wrote:I think examples like this show that the "nuke" defense is not really desirable. The trade-offs - security vs. performance, security vs. developer annoyance (they have to have their code signed) and security vs. user annoyance (they will blame the OS if non-signed code is slow) - are just bad. And it still does not defend against unknown zero days!
I decided that all executable code must be digitally signed ages ago (before any of these vulnerabilities were found). Developer annoyance should be almost nothing (configure a packaging tool and let it sign whatever you package/publish with your key). Excluding performance; end user annoyance should be less (e.g. no annoyances like having to pay crypto-malware ransom).

For performance, I need to do a lot of work before I can really measure it properly.
Korona wrote:I think we just have to accept the existence of zero days, make hardware manufacturers take responsibility for the speculation bugs (i.e. by fixing them in future processors) and stress that users cannot rely on the security of their systems if they do not keep them up to date. I acknowledge that this has had only mixed success in the past but it also seems that there is at least a bit more awareness after Meltdown and Spectre that OS manufacturers can capitalize on.
That's not an option for anyone - you can't just tell people not to use any computers for 10 years while manufacturers fix flaws (and then find out that in 10 years' time CPUs are far more complex and have more vulnerabilities than they do now). More useful would be to go back to 80386 CPUs (which were significantly simpler and therefore had a lot fewer vulnerabilities).

Note that "keep your system up to date" does not work - you get attacked for years, then the vulnerability is reported and you get attacked for a few more weeks while an update is created; then (if you're not too lazy to bother with the hassle) you update, but the damage has already been done and the attacker has moved to the next vulnerability anyway. Even if updates did work it only replaces a small risk of "potential hassle caused by attacks" with the guaranteed hassle caused by updates, where the latter is typically worse than the former (especially for Windows and Linux where updates often cause breakage and can cost hours of downtime). I never want to update (unless it's for new features that make the hassle worthwhile), I want to know that the system is secure and doesn't need updates.
Korona wrote:¹ Considering that half of the branches will be mispredicted during the first iteration, the cost of this alone is already on the same order of magnitude as a ping over gigabit LAN. Granted, that is nothing compared to the cost of the WBINVD instruction, which takes more time than an entire process time slice in Linux.
Let's break the cost of a branch misprediction into 3 parts - pipeline flush to get rid of the wrong (speculatively executed) code, fetching the right code, and decoding the right code. The pipeline flush cost depends on pipeline depth and is around 10 cycles, and decoding is only a few cycles. The majority is fetching the right code from where ever it happens to be - maybe 150 cycles if it's not cached at all, and maybe 4 cycles if it's cached in L1.

Now let's assume that the branch ends up at the same instruction regardless of whether it's taken or not; like this:

Code: Select all

    jne .next
.next:
In this case you can expect that the right code will be in L1 cache because the correct code was fetched regardless of whether the branch was predicted or mispredicted; so the cost of a branch misprediction is approximately 10+4+2 = 16 cycles (and not a worst case of around 162 cycles).

Now assuming that the branch direction predictor has 4 states (strongly predicted taken, weakly predicted taken, weakly predicted not taken, strongly predicted not taken) you can expect that 50% of the first set of branches will be mispredicted and then 25% of the second set of branches will be mispredicted. For 4096 "branch prediction slots" (most current CPUs) and 3*8192 branches (twice as many branches than necessary just in case); that adds up to 2048+1024 = 3072 mispredicted branches at about 16 cycles each, plus 3*8192-3072 = 21504 predicted branches at maybe about 1 cycle each; or "3072*16 + 21504*1 = 70656" cycles. On a relatively slow 1 GHz CPU it'd be about 71 microseconds.

For WBINVD the minimum cost is around 2000 cycles (and is more expensive if data was modified and needs to be written back); but the real cost is all the cache misses you'd expect after WBINVD has finished. The worst case (huge caches full of modified data, followed by lots of data that needs to be fetched that would've been cached if the caches weren't flushed) is horrific - e.g. maybe as bad as 300 milliseconds. It's impossible to estimate the total cost accurately, but (given that most data isn't modified and that recently used kernel data won't be used after returning to user-space anyway) I'd be tempted to expect the cost to be "tens of milliseconds".

Now; let's assume that for each process there's an 8-bit value representing how much the kernel trusts the process. For an "extremely trusted" process (e.g. "trust_value = 0xFF") the kernel does none of this and it costs nothing; and for an "extremely distrusted" process (e.g. "trust_value = 0x00") the kernel might do all of this every time it returns to that process. Maybe the kernel does "if(trust_value < 0x40) { do_WBINVD(); }", and maybe the kernel does "number_of_branches_used_for_BTB_nuking = (3*8192 * (0xFF - trust_value)) / 0xFF". Maybe the OS increases protection for "known dodgy" CPUs and reduces it for "fixed CPUs" (that won't be released for a few years). Maybe the admin knows the computer will only be used for games (or will be used for "high security" purposes like medical records or banking) and configures it to suit.
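Put together, the logic is something like this sketch (the two helpers are hypothetical stand-ins for the actual cache-flush and BTB-scrub routines):

Code: Select all

/* Sketch of scaling the mitigations by trust, as described above. The two
 * helpers are stand-ins for the real routines (cache flush and BTB scrub). */
#include <stdint.h>

static void do_wbinvd(void)                    { /* WBINVD, etc.        */ }
static void scrub_branch_predictor(unsigned n) { (void)n; /* n branches */ }

void pre_return_to_user(uint8_t trust_value)
{
    if (trust_value == 0xFF)                   /* extremely trusted: do nothing */
        return;

    if (trust_value < 0x40)                    /* strongly distrusted: flush caches */
        do_wbinvd();

    /* scale the BTB-scrubbing branch list with distrust */
    unsigned branches = (3u * 8192u * (0xFFu - trust_value)) / 0xFFu;
    scrub_branch_predictor(branches);
}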

In other words; the question is not "is it always too expensive", the question is "how much expense is justified under which conditions".

Note: In January Intel released a whitepaper containing information on a few changes to mitigate branch target injection attacks - specifically, Indirect Branch Restricted Speculation (IBRS), Single Thread Indirect Branch Predictors (STIBP) and Indirect Branch Predictor Barrier (IBPB). These new features were supposed to be retro-actively added to (some?) older/existing CPUs via. micro-code updates. Then some people complained that Intel's planned feature/s were stupid (and/or too expensive) and found alternative ways to mitigate the attacks ("retpoline", etc), and some micro-code updates were released for some CPUs and then revoked because they were buggy. Since then I have no idea what happened - none of it has been documented in Intel's manual (or anywhere else I can find) and I don't know if it was completely cancelled. In theory some of these features might be beneficial for more than just Spectre alone, but at this stage it seems impossible (for me) to guess what their "undocumented vapor" might or might not be useful for, or if/when it might exist on which CPUs.

EDIT: Buried deep in a random guy's twitter feed, I stumbled across a link to an elusive Intel document called "Speculative Execution Side Channel Mitigations" that describes the IBRS, STIBP, IBPB, etc features. I'm currently reading this to figure out if it can be used to reduce the performance costs of "ultra-nuke" (if the CPU actually supports it).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
simeonz
Member
Posts: 360
Joined: Fri Aug 19, 2016 10:28 pm

Re: Kernel Side-Channel "Ultra-Nuke"

Post by simeonz »

Brendan wrote:I think you're confusing "statically allocated virtual space" with "statically allocated physical pages". If an OS uses statically allocated physical pages and maps them at random virtual addresses (e.g. KASR without PASR) then this would be a potential problem. If an OS uses randomly allocated physical pages and maps them at statically allocated virtual addresses (PASR without KASR) then this isn't a potential problem.
I did indeed originally require that the physical allocator be deterministic, or at least offer a large-page allocation interface to the user process. However, in my last proposal, the only requirement is that the kernel pages are resident and stationary, which is not unreasonable in general.

Here is another, even more concise, illustration of the idea. Let's say a process calls the kernel API GetRandomInteger 1000 times, and concludes that after this call some 20 virtual address accesses (in different pages of that process) are consistently slower. From now on, every time 18 or more of those 20 pages generate latencies, the process will assume that someone has called GetRandomInteger. That is the gist of it. In order for this to work, the program must guarantee that the 20 pages it owns (not their kernel counterparts, which are most likely resident anyway) stay resident in memory and are not relocated. A high enough access frequency should avoid paging, so unless the kernel actively relocates pages as a security measure, or there is extreme memory pressure, there won't be a problem. We also need regularly occurring eviction from the cache, to force a direct RAM access. This may be possible by vacuously writing to the process memory in high volume, assuming that the CPU load is favorable, the monitored event frequency (GetRandomInteger in our case) is sufficiently low, and preferably that a write-combining optimization is available. Otherwise system-wide cache flushing must somehow be triggered, and if that is not possible the plan is foiled. The information obtained in this way, I imagine, can narrow down a brute-force cryptographic attack, but don't quote me on that.
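The final "vote" is then trivial - something like this sketch, using the numbers from above (20 probe pages, 18-hit threshold); measure_latency() stands for whatever latency probe is used (e.g. an RDTSCP-timed load) and the 300-cycle cutoff is made up:

Code: Select all

/* Sketch of the detection vote described above: after the training phase has
 * picked 20 probe addresses, report "someone called GetRandomInteger" when at
 * least 18 of them are slow. measure_latency() stands for whatever probe is
 * used (e.g. an RDTSCP-timed load) and the cutoff is a made-up number. */
#include <stdint.h>

#define NPROBES     20
#define THRESHOLD   18
#define SLOW_CYCLES 300

extern uint64_t measure_latency(volatile uint8_t *addr);  /* assumed probe */

int event_probably_occurred(volatile uint8_t *probe[NPROBES])
{
    int slow = 0;
    for (int i = 0; i < NPROBES; i++)
        if (measure_latency(probe[i]) > SLOW_CYCLES)
            slow++;
    return slow >= THRESHOLD;       /* 18 of 20 slow -> assume the event */
}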
Brendan wrote:Essentially, a thread would tell kernel "do this list of things" and the kernel would split more expensive things into smaller pieces and give all the smaller pieces a priority (that is derived directly from the priority of the thread that asked the kernel to do the work); and all of the CPUs would do "highest priority piece first" in parallel (whenever running a user-space thread isn't more important); so that something like creating a process might use 5 CPUs at the same time (and finish faster). For dependencies; there's no problem - if "small piece B" can't be started until "small piece A" has finished, then initially only "small piece A" would be put onto the kernel's work queue, and when "small piece A" finished the kernel would put "small piece B" onto the kernel's work queue. Of course with many user-space threads running on multiple CPUs (plus work the kernel does on its own) the kernel's work queues end up being a mixed bag of many small unrelated pieces.
I remember your mention of that scheme in a different post and it seems suited to your architecture. I am slightly curious why you emphasize the performance of the service calls. I imagine process synchronization and communication will be the most demanding kernel activities. However, if the primitives are user-space objects, only blocking and unblocking during contention for the mutex case, and starvation for the producer-consumer case, would require kernel involvement. And I wonder whether frequent blocking under full workload is not essentially a software issue - insufficient parallelism, excessive parallelism, or excessive synchronization. That is, can't it be argued that in a perfect world, performance-sensitive software should not trigger these events frequently? It does in practice, of course. There is also memory management, process and thread management, and preemption, but I would consider those rare in comparison. And interrupt services are not in the same basket, I believe. (They cannot be interleaved with the other tasks.)
Brendan wrote:With all of this in mind; what if I added a random "boost or nerf" adjustment to the priority of the smaller pieces (instead of the priority of the smaller pieces being derived directly from the priority of the thread that asked the kernel to do the work)? With a very minor change, the kernel would execute the "mixed bag of many small unrelated pieces" in a semi-randomised order.
How would one correctly calibrate the standard deviation of the boost? I mean, it sounds sensible, but can it ever be proven correct and random enough?
Brendan wrote:Last year I was experimenting with graphics; and wanted to convert a buffer of pixel data that was in a standard format (96-bits per pixel in the CIE XYZ colour space, with no gamma) into whatever the video mode and monitor happens to want (e.g. maybe 24-bit per pixel in one of the RGB colour spaces, with gamma), including doing dithering. Normal optimised assembly wasn't really fast enough (there were too many branches and too many memory accesses for local variables) so my solution was dynamic code generation. Specifically, a collection of pre-optimised snippets that would be stitched together and then patched, so that half the branches disappeared and half of the local variables effectively became constants.
The performance improvement from code generation is understandable. On the other hand, prepackaged code similarly has to make unnecessary generalizations about the execution environment.

Indeed, patience-testing build times are not unusual for larger projects. But those involve source parsing and whole-program optimization. Ahead-of-time compilers optimize the entire code, despite the fact that the majority of it is executed in a 1:100 ratio. (Although admittedly performance can sometimes be more evenly distributed.) What JIT can do instead is perform a generic first compilation pass, usually on-demand to reduce load times, and then follow it with an optimizing pass for hot spots detected at run-time. Those hot spots are far fewer and smaller (even if a few calls deep) than the scope of a link-time optimization pass. So, they should compile within the first seconds of execution. Meanwhile the slow generic version of the code would be running. The code can also be cached, so compilation needn't restart every time the software loads.

Also, some of the work is done at the development site: source parsing and most of the global and local optimizations, i.e. inlining, constant folding, common subexpression elimination, loop-invariant code motion, etc. Backend tasks, such as register allocation, vectorization, and instruction scheduling, will be done by the JIT at the user site. It can also try to further inline one component's functions into a different component during hot-spot optimization. This need not be done globally, as the goal is to eliminate the ABI cost, to eliminate dead code branches from argument checks, and so on.

So, a JIT may be slower all things being equal, but 500%? And it can leverage other tactical advantages. We are not talking about managed languages in general here, though. They suffer from a lack of compact memory layout (as every object lives in a separate allocation) and can introduce sparse access patterns. This relates to automatic memory management and the object-oriented facilities. Also, managed languages are heavier on dynamic dispatch, especially when the programmers don't bother with type sealing.
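(A toy illustration of the tiering I have in mind; the record, the counter, and the threshold are invented for the example, and a real JIT would be driven by proper profiling data rather than a bare call counter.)

    #include <stddef.h>

    /* Per-function record kept by the runtime. */
    typedef struct {
        void     *baseline_code;    /* from the quick, generic pass done at load time */
        void     *optimised_code;   /* filled in later, only if the function turns out hot */
        unsigned  call_count;
    } function_info;

    #define HOT_THRESHOLD 10000     /* arbitrary for the example */

    void *optimise_from_ir(function_info *f);   /* hypothetical backend pass over shipped IR */

    void *entry_point(function_info *f)
    {
        if (f->optimised_code == NULL && ++f->call_count == HOT_THRESHOLD)
            f->optimised_code = optimise_from_ir(f);  /* register allocation, scheduling, ... */
        return f->optimised_code ? f->optimised_code : f->baseline_code;
    }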
Brendan wrote:You're mostly saying that self modifying code is less secure (which is arguably correct); and then assuming that AI always means "intelligently self modifying". From my perspective; "intelligently self modifying" is the least plausible scenario. For example; the idea of using an extremely expensive and extremely unintelligent "try all the things until something works" approach every time anyone decides to (e.g.) open a text editor (and then not saving the generated code after you've done the extremely expensive/unintelligent code generation, and not ending up with "AI generated code that is not self modifying at all") is far fetched at best.
Well, I am talking about either self-modifying code or liberal use of code generation in software. AI is just a context that draws my curiosity in regard to the security aspect, but is otherwise irrelevant. More concretely, the question is how freely software should be allowed to dynamically generate control flow logic. If we reach a solution for AGI someday - possibly even after our lifetime - then by the same line of reasoning our security principles will be completely inapplicable to it.
Brendan wrote:So, given that saying "don't worry, you'll do better next time" while patting them on the back in sympathy has failed to make any difference for decades; how can we encourage people to try harder to avoid releasing exploitable software? Threatening to destroy their business (by blacklisting all their products and guaranteeing zero sales forever) is one way to encourage prevention (but perhaps that's a little extreme?).
Software correctness is plain difficult. It is true that companies are complacent and that market expectations have been molded accordingly. But with present-day tools, correctness as an absolute goal is far too costly to be feasible; the skills and processes are insufficient to make it viable. My opinion is that unless we evolve the tools, defect-free software will remain financially unfeasible, and we will continue to take risks with defect-laden products instead. Or alternatively, if we take the stance that only non-exploitable software will be viable in the hostilities of our cyber-future, then software will simply not be viable at all, unless revolutionary changes in the way we develop software take place.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Brendan »

Hi,
simeonz wrote:
Brendan wrote:I think you're confusing "statically allocated virtual space" with "statically allocated physical pages". If an OS uses statically allocated physical pages and maps them at random virtual addresses (e.g. KASR without PASR) then this would be a potential problem. If an OS uses randomly allocated physical pages and maps them at statically allocated virtual addresses (PASR without KASR) then this isn't a potential problem.
I indeed originally required that the physical allocator be deterministic, or at least offer a large-page allocation interface to the user process. However, in my last proposal the requirement is only that the kernel pages are resident and stationary, which is not unreasonable in general.
It's not unreasonable in general (e.g. for other/existing operating systems); but I'm planning to dynamically "re-randomise" all physical pages of RAM at a fixed rate (including kernel's physical pages, pages used for page tables, ..).

More specifically; periodically I'm going to sweep through all virtual address spaces (including kernel space) and:
  • If my "software ECC" feature is enabled; check for (and correct) any RAM faults in each physical page (including shifting pages that have errors into a "suspect page" list).
  • Update "least recently used" data for swap space management, and transfer pages to/from 1st level swap space (where 1st level swap space is "compressed data in RAM")
  • Check if physical pages can be replaced by better physical pages (for large page support, and for NUMA and cache colouring optimisations) and replace them if possible
  • Check if physical pages are in a "waiting for hot-remove" zone and if they are replace them with pages that aren't (so that eventually all pages in that zone aren't being used, and the memory can be taken offline)
  • If the physical page hasn't already been replaced for any of the reasons above (and if the "dynamically re-randomise physical pages" feature is enabled) replace the physical page anyway.
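Roughly, one pass of that sweep over a single mapped page would look something like this (every type and helper here is just a placeholder for the corresponding feature above, not actual code from my kernel):

    #include <stddef.h>

    typedef struct address_space address_space_t;   /* placeholder types */
    typedef struct phys_page     phys_page_t;

    /* Placeholder declarations for the features listed above. */
    extern int software_ecc_enabled, rerandomise_pages_enabled;
    phys_page_t *get_mapped_page(address_space_t *, void *);
    void check_and_correct_ram_faults(phys_page_t *);
    void update_lru_and_swap(address_space_t *, void *, phys_page_t *);
    int  better_page_available(phys_page_t *);
    int  in_hot_remove_zone(phys_page_t *);
    void replace_with_better_page(address_space_t *, void *, phys_page_t *);
    void replace_with_page_outside_zone(address_space_t *, void *, phys_page_t *);
    void replace_with_random_page(address_space_t *, void *, phys_page_t *);

    /* One pass of the periodic sweep over a single mapped page. */
    void sweep_page(address_space_t *space, void *vaddr)
    {
        phys_page_t *page = get_mapped_page(space, vaddr);
        if (page == NULL)
            return;

        if (software_ecc_enabled)
            check_and_correct_ram_faults(page);      /* may shift the page onto the "suspect" list */

        update_lru_and_swap(space, vaddr, page);     /* 1st level swap space = compressed RAM */

        if (better_page_available(page))             /* large pages, NUMA, cache colouring */
            replace_with_better_page(space, vaddr, page);
        else if (in_hot_remove_zone(page))
            replace_with_page_outside_zone(space, vaddr, page);
        else if (rerandomise_pages_enabled)
            replace_with_random_page(space, vaddr, page);
    }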
simeonz wrote:
Brendan wrote:Essentially, a thread would tell kernel "do this list of things" and the kernel would split more expensive things into smaller pieces and give all the smaller pieces a priority (that is derived directly from the priority of the thread that asked the kernel to do the work); and all of the CPUs would do "highest priority piece first" in parallel (whenever running a user-space thread isn't more important); so that something like creating a process might use 5 CPUs at the same time (and finish faster). For dependencies; there's no problem - if "small piece B" can't be started until "small piece A" has finished, then initially only "small piece A" would be put onto the kernel's work queue, and when "small piece A" finished the kernel would put "small piece B" onto the kernel's work queue. Of course with many user-space threads running on multiple CPUs (plus work the kernel does on its own) the kernel's work queues end up being a mixed bag of many small unrelated pieces.
I remember your mention of that scheme in a different post and it seems suitable to your architecture. I am slightly curious why you emphasize the performance of the service calls.
It's not so much the performance, but the latency. For example, imagine an ethernet card generates an IRQ, so the kernel's IRQ handler sends an "IRQ occurred" message to the device driver in user-space, hopefully causing a high priority thread that was waiting for a message to be unblocked, and hopefully causing the scheduler to switch to that high priority thread "immediately" (so that the device driver can handle the device's IRQ as soon as possible and so that the entire distributed OS isn't crippled by poor networking performance). Now consider what happens if the device sends an IRQ just after the kernel started a lengthy operation (e.g. spawning a whole new process) and the kernel can't switch to the device driver's high priority thread until after that lengthy operation is finished (because the kernel is using "one kernel stack per CPU" and has to ensure the CPU's kernel stack isn't being used before returning to user-space). Mostly I need to split lengthy operations into small pieces to minimise delays between a high priority thread being unblocked and that thread actually getting CPU time.
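In other words, whenever a CPU is in the kernel with nothing on its kernel stack, it ends up running a loop along these lines (heavily simplified, and all the names are made up for the sketch):

    #include <stddef.h>

    /* Placeholder declarations - this is a sketch, not the real scheduler. */
    typedef struct work_piece {
        void (*run)(struct work_piece *);
        int  priority;
    } work_piece_t;

    int  runnable_user_thread_priority(void);
    int  highest_queued_piece_priority(void);
    void switch_to_highest_priority_thread(void);  /* returns when this CPU re-enters the kernel */
    work_piece_t *dequeue_highest_priority_piece(void);
    void queue_dependent_pieces(work_piece_t *done);
    void wait_for_work_or_interrupt(void);

    void kernel_worker_loop(void)
    {
        for (;;) {
            /* A freshly unblocked high priority thread (e.g. a driver that just received
               an "IRQ occurred" message) only ever waits for one small piece to finish,
               never for a whole lengthy operation. */
            if (runnable_user_thread_priority() > highest_queued_piece_priority())
                switch_to_highest_priority_thread();

            work_piece_t *piece = dequeue_highest_priority_piece();
            if (piece != NULL) {
                piece->run(piece);               /* small and bounded */
                queue_dependent_pieces(piece);   /* e.g. queue "piece B" once "piece A" is done */
            } else {
                wait_for_work_or_interrupt();
            }
        }
    }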
simeonz wrote:
Brendan wrote:With all of this in mind; what if I added a random "boost or nerf" adjustment to the priority of the smaller pieces (instead of the priority of the smaller pieces being derived directly from the priority of the thread that asked the kernel to do the work)? With a very minor change, the kernel would execute the "mixed bag of many small unrelated pieces" in a semi-randomised order.
How would one correctly calibrate the standard deviation of the boost? I mean, it sounds sensible, but can it ever be proven correct and random enough?
If priorities are from 0 to 255, I'd just subtract 8 and add a random number from 0 to 15, then clamp it to the "0 to 255" range again.
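Something like this (where rand() just stands in for whatever random number source the kernel actually uses):

    #include <stdlib.h>   /* rand() stands in for the kernel's own random number source */

    /* piece_priority is derived from the requesting thread's priority (0 to 255). */
    int randomised_piece_priority(int piece_priority)
    {
        int p = piece_priority - 8 + (rand() & 15);   /* random adjustment in the range -8..+7 */
        if (p < 0)   p = 0;
        if (p > 255) p = 255;
        return p;
    }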

It's simple to prove something is correct - you just define "correct" as whatever the code actually does, and then say that it must be correct because it does whatever it does. ;)
simeonz wrote:
Brendan wrote:Last year I was experimenting with graphics; and wanted to convert a buffer of pixel data that was in a standard format (96-bits per pixel in the CIE XYZ colour space, with no gamma) into whatever the video mode and monitor happens to want (e.g. maybe 24-bit per pixel in one of the RGB colour spaces, with gamma), including doing dithering. Normal optimised assembly wasn't really fast enough (there were too many branches and too many memory accesses for local variables) so my solution was dynamic code generation. Specifically, a collection of pre-optimised snippets that would be stitched together and then patched, so that half the branches disappeared and half of the local variables effectively became constants.
The performance improvement from code generation is understandable. On the other hand, prepackaged code similarly has to make unnecessary generalizations about the execution environment.

Indeed, patience-testing build times are not unusual for larger projects. But those involve source parsing and whole-program optimization. Ahead-of-time compilers optimize the entire code base, despite the fact that the majority of it is executed only rarely - something like a 1:100 ratio. (Although sometimes performance can admittedly be more evenly distributed.) What a JIT can do instead is perform a generic first compilation pass, usually on demand to reduce load times, and then follow it with an optimizing pass for hot spots detected at run time. Those hot spots are far fewer and smaller (even if a few calls deep) than the scope of a link-time optimization pass, so they should compile within the first seconds of execution. Meanwhile the slow generic version of the code would be running. The code can also be cached, so compilation needn't restart every time the software loads.

Also, some of the work is done at the development site: source parsing and most of the global and local optimizations, i.e. inlining, constant folding, common subexpression elimination, loop-invariant code motion, etc. Backend tasks, such as register allocation, vectorization, and instruction scheduling, will be done by the JIT at the user site. It can also try to further inline one component's functions into a different component during hot-spot optimization. This need not be done globally, as the goal is to eliminate the ABI cost, to eliminate dead code branches from argument checks, and so on.

So, a JIT may be slower all things being equal, but 500%? And it can leverage other tactical advantages. We are not talking about managed languages in general here, though. They suffer from a lack of compact memory layout (as every object lives in a separate allocation) and can introduce sparse access patterns. This relates to automatic memory management and the object-oriented facilities. Also, managed languages are heavier on dynamic dispatch, especially when the programmers don't bother with type sealing.
500% slower isn't unusual at all (until you start using massive libraries stuffed full of "AoT compiled" native code to hide the problem like Java does; or cache previously JIT compiled pieces like most web browsers do for Javascript now).

The other problem with JIT is security - "100% secure source code" becoming vulnerable due to flaws in the JIT compiler and/or in its libraries.
simeonz wrote:
Brendan wrote:You're mostly saying that self modifying code is less secure (which is arguably correct); and then assuming that AI always means "intelligently self modifying". From my perspective; "intelligently self modifying" is the least plausible scenario. For example; the idea of using an extremely expensive and extremely unintelligent "try all the things until something works" approach every time anyone decides to (e.g.) open a text editor (and then not saving the generated code after you've done the extremely expensive/unintelligent code generation, and not ending up with "AI generated code that is not self modifying at all") is far fetched at best.
Well, I am talking about either self-modifying code or liberal use of code generation in software. AI is just a context that draws my curiosity in regard to the security aspect, but is otherwise irrelevant. More concretely, the question is how freely software should be allowed to dynamically generate control flow logic. If we reach a solution for AGI someday - possibly even after our lifetime - then by the same line of reasoning our security principles will be completely inapplicable to it.
From my perspective; you define a set of rules (e.g. "processes aren't allowed to read from kernel space") then do whatever you can to ensure that it's impossible to break the rules. If you succeed; it doesn't matter who or what tries to break the rules - a human can't break the rules, 1 million monkeys (on 1 million typewriters) can't break the rules, aliens from outer space can't break the rules, and AGI can't break the rules.

Of course the opposite is also true - if any one thing (e.g. monkeys) can break the rules, then all things (e.g. humans, aliens, AGI) can also break the rules.

Ironically, this implies that a million monkeys might be a useful way to test security ("fuzzing").
simeonz wrote:
Brendan wrote:So, given that saying "don't worry, you'll do better next time" while patting them on the back in sympathy has failed to make any difference for decades; how can we encourage people to try harder to avoid releasing exploitable software? Threatening to destroy their business (by blacklisting all their products and guaranteeing zero sales forever) is one way to encourage prevention (but perhaps that's a little extreme?).
Software correctness is plain difficult. It is true that companies are complacent and that market expectations have been molded accordingly. But with present-day tools, correctness as an absolute goal is far too costly to be feasible; the skills and processes are insufficient to make it viable. My opinion is that unless we evolve the tools, defect-free software will remain financially unfeasible, and we will continue to take risks with defect-laden products instead. Or alternatively, if we take the stance that only non-exploitable software will be viable in the hostilities of our cyber-future, then software will simply not be viable at all, unless revolutionary changes in the way we develop software take place.
I'm also planning revolutionary changes in the way we develop software (splitting large monolithic applications into many small processes communicating via open messaging protocols, and research on languages and tools focusing on detecting bugs/problems at compile time). ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Korona »

Brendan wrote:For WBINVD the minimum cost is around 2000 cycles (and is more expensive if data was modified and needs to be written back); but the real cost is all the cache misses you'd expect after WBINVD has finished.
Your estimates seem to be reasonable; just to add to that: yeah, the real cost is the increased miss rate after WBINVD. However, the data transfer cost after and during WBINVD is also not negligible. DDR4 has a peak transfer rate of about 20000 MiB/s and you can easily have 20 MiB caches. That's 1 ms on data transfer alone!

I have another question about your trust model though: Assume I want to run a game (or a similar realtime application) on your OS. It will certainly contain a scripting engine, and you cannot be sure that the developers are competent enough to secure it against speculative execution exploits. On the other hand, you cannot introduce multi-ms pauses to flush caches and branch prediction buffers without rendering the application unusable. How do you handle this situation?
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Kernel Side-Channel "Ultra-Nuke"

Post by Brendan »

Hi,
Korona wrote:
Brendan wrote:For WBINVD the minimum cost is around 2000 cycles (and is more expensive if data was modified and needs to be written back); but the real cost is all the cache misses you'd expect after WBINVD has finished.
Your estimates seem to be reasonable; just to add to that: yeah, the real cost is the increased miss rate after WBINVD. However, the data transfer cost after and during WBINVD is also not negligible. DDR4 has a peak transfer rate of about 20000 MiB/s and you can easily have 20 MiB caches. That's 1 ms on data transfer alone!
It'd vary too much to say. For example, one case would be 16 CPUs doing "small piece of work then block" message handling, where all the CPUs share the same 20 MiB cache and the cache gets invalidated so often that it rarely contains any data at all. A worst case might be a very trusted multi-threaded process using all of the CPUs except one to process a lot of data, with one CPU running untrusted code that frequently invalidates the cache (and ruins performance for the very trusted process' threads).
Korona wrote:I have another question about your trust model though: Assume I want to run a game (or a similar realtime application) on your OS. It will certainly contain a scripting engine, and you cannot be sure that the developers are competent enough to secure it against speculative execution exploits. On the other hand, you cannot introduce multi-ms pauses to flush caches and branch prediction buffers without rendering the application unusable. How do you handle this situation?
It's a distributed system. You send a request message, maybe 250 ms later you get a reply message. Maybe that 250 ms is because the Internet is congested, maybe it's because the receiver is struggling to handle load from 20 other computers, and maybe a tiny part of it is because your process isn't trusted. Too bad. If you can't handle 250 ms delays you're doing it wrong.

For games (real-time graphics and sound); the key is prediction - "I predict that in 123 ms time Ninja #1 will start raising his sword, in 567 ms time Ninja #1 will start swinging the raised sword forward, and in 678 ms Train #5 will collide with Ninja #1 causing the Ninja to be thrown in that direction". Let the video driver figure out what should be shown on the screen (when it finishes rendering the frame) based on the most recent predictions it was told (before it started rendering the frame).
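Just to illustrate the idea, the sort of information that flows from the game to the video driver might look like this (the structures and functions are invented for the example; this isn't my actual video driver interface):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { float x, y, z; } vec3_t;

    /* A predicted event, sent from the game to the video driver ahead of time. */
    typedef struct {
        uint64_t start_time_ms;   /* when the predicted motion begins */
        uint32_t object_id;       /* e.g. Ninja #1, Train #5 */
        uint32_t animation_id;    /* e.g. "raise sword", "swing sword", "thrown" */
        vec3_t   direction;
    } predicted_event_t;

    typedef struct {
        predicted_event_t *events;
        size_t             event_count;
    } scene_t;

    void apply_animation(scene_t *, predicted_event_t *, uint64_t elapsed_ms);
    void draw(scene_t *);

    /* The video driver renders from whatever predictions it had when the frame started. */
    void render_frame(scene_t *scene, uint64_t frame_time_ms)
    {
        for (size_t i = 0; i < scene->event_count; i++) {
            predicted_event_t *e = &scene->events[i];
            if (e->start_time_ms <= frame_time_ms)
                apply_animation(scene, e, frame_time_ms - e->start_time_ms);
        }
        draw(scene);
    }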

For other things (e.g. high frequency trading) maybe it's just the wrong kind of OS.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.