OSDev.org

Posted: **Mon Mar 03, 2008 4:10 pm**

I once had a discussion with someone who wanted to write a self-hosting Forth interpreter in C, and then run a kernel on top of it. A simplified instruction set, like that of Forth, would make it simpler to scan executables for malicious code.

I've seen other discussions about scanning executables for malicious code, and it sounds tricky.

On x86 (long mode), the latencies for interrupts are 87-109 cycles with no permission level change, and 91-112 cycles with a permission level change. So with the same interrupt-driven architecture, running everything in ring 0 saves you at most 25 cycles (12.5 nanoseconds on a 2GHz processor). Now, if you used the syscall/sysret instructions to get between ring 3 and ring 0, you'd end up with a total latency of 62 cycles, compared to a minimum of 203 cycles using regular interrupts (int imm8 latency + iretq latency). Finally, if you changed your system to use jumps to call and return from kernel code, you'd get a latency of 14 cycles (two 64-bit jumps, one LEA, and two mov rsp instructions).

So, the best options are:
1. ring 0 kernel, ring 3 code. Use syscall/sysret for kernel routine invocation. Latency of 62 cycles round-trip.
2. ring 0 for all. Use jumps and mov rsps for kernel routine invocation. Latency of 14 cycles round-trip.

That gives you an advantage of 48 cycles (24 nanoseconds on a 2GHz processor). A quick timed test on a Linux machine under relatively high load (it's a school computer) showed about 50 context switches per second. At 2GHz, this gives us a total latency of about 1.5 microseconds per second (0.00016%) using ring 3 processes, and 0.35 microseconds per second (0.00004%) using ring 0 processes.
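For anyone who wants to repeat the measurement: one way to get a context-switch rate on Linux is to sample the cumulative `ctxt` counter in `/proc/stat` twice and divide by the interval. This is only a minimal sketch, assuming a Linux `/proc` filesystem; the function names are made up for illustration, and it is not necessarily how the test above was run. Note that `ctxt` is a system-wide count.

```python
import time

def parse_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")

def switches_per_second(before, after, interval_s):
    """Average context switches per second between two counter samples."""
    return (after - before) / interval_s

def sample_rate(interval_s=1.0, path="/proc/stat"):
    """Read the counter twice, interval_s seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return switches_per_second(before, after, interval_s)
```

Calling `sample_rate(5.0)` averages over five seconds, which smooths out short bursts.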

Is it worth the extra work to have 99.99996% of cpu time for execution instead of 99.99984%? Certainly not.
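The arithmetic behind those percentages can be checked with a few lines. This is just a sketch of the calculation, assuming the 2GHz clock and the cycle counts quoted above; the names are illustrative only.

```python
CLOCK_HZ = 2_000_000_000   # 2GHz clock, as assumed in the post
RING3_SYSCALL_CYCLES = 62  # syscall/sysret round trip (figure from the post)
RING0_JUMP_CYCLES = 14     # jump-based round trip (figure from the post)
SWITCHES_PER_SEC = 50      # rate observed in the quick test

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles * 1e9 / clock_hz

def overhead_percent(cycles_per_switch, switches_per_sec, clock_hz=CLOCK_HZ):
    """Percentage of each second spent on kernel entry/exit."""
    return 100.0 * cycles_per_switch * switches_per_sec / clock_hz

ring3 = overhead_percent(RING3_SYSCALL_CYCLES, SWITCHES_PER_SEC)
ring0 = overhead_percent(RING0_JUMP_CYCLES, SWITCHES_PER_SEC)
```

With these figures, `cycles_to_ns(62 - 14)` gives 24.0, and the two overheads come out near 0.000155% and 0.000035%, which round to the percentages quoted above.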

Posted: **Mon Mar 03, 2008 5:48 pm**

Cycle counting is only part of the picture -- don't ignore the cache effects of privilege transitions.

Here is an interesting research paper that delves into this more deeply.

Posted: **Mon Mar 03, 2008 7:38 pm**

Protection level changes certainly don't have any effect on the cache. The associated code may bump some application code out of the cache, but that happens whether the application is running at kernel or user level permission...

The Singularity approach does seem to help with cache performance because of the possibility of running processes in the same address space.

Posted: **Mon Mar 03, 2008 10:38 pm**

@speal, the overall effect is much bigger than your examples suggest. A much better working example is the old Xbox, which is little more than a legacy-free PC by Microsoft: an Intel Celeron 733 MHz CPU, an nVidia GeForce 3MX, 64 MB of RAM, an 8/10 GB hard disk, a DVD drive and 10/100 Ethernet.

Now, the old Xbox could run the same games much faster than a 2GHz P4 machine.
Here's the old xbox OS spec: http://www.extremetech.com/article2/0,1 ... 116,00.asp
Hardware spec:
http://www.xbox-linux.org/wiki/Main_Page

Why do you think a company like M$ used it? This is not a case of lazy OS devs; it's a case of it being noticeably faster.

Posted: **Mon Mar 03, 2008 11:29 pm**

The Xbox, while a good example of older hardware operating at peak performance, is a little off-topic. Xbox code is optimized specifically for the Xbox hardware (cache associativity, size, memory size, bus timings, etc.). Even more important, Xbox hardware is optimized for fast transfers between main memory and graphics controllers. Using DMA (which the Xbox almost certainly does) has very little to do with context switching, and a lot to do with bus contention. To the best of my knowledge, the Xbox uses a proprietary, high-speed bus which is likely responsible for much of the performance improvement. There's a LOT more to performance than the components - the interconnects are as important, if not more so.

I still think the only penalty for running ring 3 code is the additional latency attributed to permission level changes. Any other penalties (cache use, driver latencies, etc...) are byproducts of the kernel design. There is simply no other reason, in the hardware, for there to be any other incurred performance penalties.

Posted: **Tue Mar 04, 2008 1:51 am**

It could be that I'm thinking of penalties associated with segmentation checks that are probably skipped when you use syscall/sysenter etc....

Posted: **Tue Mar 04, 2008 4:25 pm**

syscall/sysret don't check segment limits, iirc. I know they don't in long mode since there are no segment limit checks (except fs and gs, but who's counting?).

I recently read a paper about microkernels and the penalties associated with the additional system calls between the kernel and drivers. The author rewrote portions of either Mach or L4 (can't remember right now...) to use syscall/sysret and eliminated the kernel checking of messages and ended up with a pretty insignificant performance penalty running applications on a POSIX compatibility layer as compared to Linux. The conclusion was that crossing permission levels just wasn't the bottleneck anymore because of advancements in kernel design and hardware optimizations.

I'll track this down and link it here when I get a chance...

Posted: **Tue Mar 04, 2008 8:28 pm**

Between the preemptive multi-tasking, all of the required and consistent privilege transitions, and all of the other random interrupts and any exceptions... it adds up.

If you are driving from one town to another, about 5 miles away, going 5 miles an hour faster than before... yeah... you are not making much difference. Now, take that concept and apply it to cross-country travel.
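To put rough numbers on how this scales, here is a sketch using the cycle figures quoted earlier in the thread (62 cycles for syscall/sysret versus 14 for the jump path). It assumes a 2GHz clock and counts only the transition itself, so it leaves out interrupts, exceptions and any cache effects; the function names are illustrative.

```python
CLOCK_HZ = 2_000_000_000  # assumed 2GHz clock
SAVED_CYCLES = 62 - 14    # per-transition saving quoted earlier in the thread

def saved_percent(transitions_per_sec, saved_cycles=SAVED_CYCLES,
                  clock_hz=CLOCK_HZ):
    """Percentage of CPU time saved at a given transition rate."""
    return 100.0 * saved_cycles * transitions_per_sec / clock_hz

def rate_for_saving(target_percent, saved_cycles=SAVED_CYCLES,
                    clock_hz=CLOCK_HZ):
    """Transitions per second needed for the saving to reach target_percent."""
    return target_percent / 100.0 * clock_hz / saved_cycles
```

Under these assumptions, 50 transitions per second saves about 0.00012% of the CPU, and it takes roughly 417,000 transitions per second for the saving to reach 1%.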

I don't know about the rest of you, but if we can keep the same (or better) security, stability and realistic scalability while increasing CPU response capability... that is most definitely something worth figuring out beyond some short-sighted statistical guesses.

Then, remove the address separation and now you are really flying.

PS: Protection level changes can affect the cache, maybe not directly, but at least indirectly (think microkernels.)

Posted: **Tue Mar 04, 2008 9:24 pm**

Protection levels don't add any cache problems to a system that already uses multiple address spaces. You can be clever about scheduling and message passing whether you switch permission levels or not. The system design (and resulting architecture) is responsible for cache penalties - the protection ring mechanism is not.

My whole point was that while there is a small penalty, it's so small it just doesn't add up like it used to. Processor makers have recognized that this was a bottleneck, and made it a significantly lower cost operation. A nanosecond here or there isn't worth worrying about if your scheduler takes 100ns to pick a thread (that's just 200 cycles on a 2GHz processor... not really a very complex scheduler).

Optimize where you get the most bang for your buck. Trying to optimize away a process that only takes 24ns just isn't worth the effort, especially if it makes other parts of your system run slower. If you look at my earlier posts, you'll see cycle counts and reasonable context-switch rates for a monolithic kernel system. The cache effects are real, but again, they're not caused by permission level changes.
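As a rough illustration of that comparison, the sketch below converts the hypothetical 100ns scheduler cost into cycles and sets it against the 48 cycles saved per transition. It assumes the same 2GHz clock; the names are made up for the example.

```python
CLOCK_HZ = 2_000_000_000  # assumed 2GHz clock

def ns_to_cycles(ns, clock_hz=CLOCK_HZ):
    """Convert a duration in nanoseconds to clock cycles."""
    return ns * clock_hz / 1e9

def cost_ratio(scheduler_ns, saved_cycles, clock_hz=CLOCK_HZ):
    """How many times larger the scheduler cost is than the cycles saved."""
    return ns_to_cycles(scheduler_ns, clock_hz) / saved_cycles
```

Here `ns_to_cycles(100)` is 200 cycles, so `cost_ratio(100, 48)` comes out a little over 4: the example scheduler costs about four times what the ring change does.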

Processes in Ring 0