
Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 2:59 pm
by tigujo
Quote from "Interrupts For Dummies":
"The linux API is 0x80"

Is there a faster method of calling a kernel API?
I'm still in the mood of trying to implement the fastest possible kernel calls in my OS, but still have to get a feeling for overall costs.

BTW, how many clock ticks does a software interrupt need, more or less?

I know, sheer testing will give me an answer, but perhaps someone has an answer ready...

Thanks, from a Dummy, tigujo himself ::)

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 3:26 pm
by Cjmovie
I'm not sure how fast interrupt calling is, but I can guarantee one thing: Having all ring0, and directly calling kernel code (via "call bla:bla") will always (?) be fastest.

I'm rather sure that interrupts are as close as you will get, though, if you want to consider safety (protection via segments/ring).

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 3:34 pm
by gaf
Is there a faster method of calling a kernel API?
The sysenter/sysexit (Intel) and syscall/sysret (AMD) instructions are much faster as they skip most of the checks normally performed when loading the selectors. The CPU doesn't look up the GDT to get the user/kernel descriptor but simply loads a standard flat-mode segment in hardware, which saves several memory accesses and protection checks without breaking the pmode security model.
BTW, how many clock ticks does a software interrupt need, more or less?
Not sure what you mean here as "int 0x80" already is a software interrupt..
I know, sheer testing will give me an answer, but perhaps someone has an answer ready...
I ran a series of micro-benchmarks myself some time ago, so I can tell you that sysenter really has a nice performance advantage over traditional software interrupts. Nevertheless, you shouldn't forget that system calls are always an expensive operation that ought to be avoided altogether where possible. You should therefore design your system call interface very carefully to reduce the number of system calls to a bearable minimum.

regards,
gaf

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 4:00 pm
by tigujo
My Quote:
BTW, how many clock ticks does a software interrupt need, more or less?
-> I just thought someone might have a raw latency estimate of how many ticks it takes to issue a software interrupt when the interrupt handler does nothing and just returns to the interrupted code.
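For a rough feel of the round-trip cost on an existing system, a user-space measurement along these lines can help. This is only a sketch under assumptions: it runs on a POSIX host rather than on your own kernel, it reports nanoseconds rather than clock ticks, and it assumes getppid() enters the kernel on every call. The helper names now_ns and average_syscall_ns are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost, in nanoseconds, of one getppid() round trip.
 * Includes the libc wrapper and loop overhead, so treat it as an upper bound. */
double average_syscall_ns(unsigned iterations)
{
    volatile pid_t sink = 0;   /* keeps the compiler from dropping the calls */
    uint64_t start, end;
    unsigned i;

    if (iterations == 0)
        return 0.0;

    start = now_ns();
    for (i = 0; i < iterations; i++)
        sink = getppid();
    end = now_ns();

    (void)sink;
    return (double)(end - start) / (double)iterations;
}
```

Calling average_syscall_ns(1000000) a few times and taking the lowest result tends to give the most stable figure; multiply by the clock rate in GHz for an approximate cycle count.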

Running everything on ring 0 sounds fine to me, just I'm not sure if I won't end up like ground zero without rudimentary protection :-X

Thanks to both of you.

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 4:19 pm
by Exabyte256
I think the time it takes would vary between the processor models. Things like pipelines and all those new technology things also make a difference.

Rather than sacrificing stability for speed, see if there's anything you can do inside the system calls to speed up the code, or reduce the number of calls you have to make.
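One common way to cut the number of calls is to let a single call carry several pieces of work. As an illustration from an existing interface (POSIX writev, not anything specific to the OS discussed here), the sketch below sends two buffers into the kernel with one call instead of two. The function name write_two_batched is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write two strings to fd with a single system call.
 * Returns the number of bytes written, or -1 on error (as writev does). */
ssize_t write_two_batched(int fd, const char *first, const char *second)
{
    struct iovec parts[2];

    parts[0].iov_base = (void *)first;    /* writev only reads from the buffers */
    parts[0].iov_len  = strlen(first);
    parts[1].iov_base = (void *)second;
    parts[1].iov_len  = strlen(second);

    return writev(fd, parts, 2);          /* one kernel entry for both buffers */
}
```

The same idea applies to a custom kernel API: a call that accepts an array of requests pays the entry and exit cost once, whatever mechanism is used to enter the kernel.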

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 4:32 pm
by AR
Exabyte256 wrote: Rather than sacrificing stability for speed, see if there's anything you can do inside the system calls to speed up the code, or reduce the number of calls you have to make.
You may lose the ability to make use of segmentation (security) with SYSENTER/SYSEXIT but what stability?

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 5:11 pm
by Exabyte256
That's what I meant, taking away protection would make the system insecure and then leave it open to any application error / maliciousness.

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 5:19 pm
by AR
Exabyte256 wrote: That's what I meant, taking away protection would make the system insecure and then leave it open to any application error / maliciousness.
What? SYSENTER enters the kernel at the code segment and address specified in the Model Specific Registers, so the program can only enter at the preset point in the kernel. Ring 3 and Ring 0 are still used; the only restriction is that they must be flat 0-4GB segments.
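To make the MSR arrangement concrete: SYSENTER takes its kernel code selector from IA32_SYSENTER_CS and derives the other three selectors from it by fixed offsets, which is why the GDT has to hold four flat descriptors in a fixed order. The sketch below only computes that layout; the struct and function names are illustrative, and actually programming the MSRs (with WRMSR, at ring 0) is left out.

```c
#include <stdint.h>

/* MSR numbers used by SYSENTER/SYSEXIT. */
#define IA32_SYSENTER_CS  0x174u   /* kernel code selector */
#define IA32_SYSENTER_ESP 0x175u   /* kernel stack pointer loaded on entry */
#define IA32_SYSENTER_EIP 0x176u   /* kernel entry point */

struct sysenter_selectors {
    uint16_t kernel_cs;   /* loaded by SYSENTER */
    uint16_t kernel_ss;   /* loaded by SYSENTER */
    uint16_t user_cs;     /* loaded by SYSEXIT (32-bit return) */
    uint16_t user_ss;     /* loaded by SYSEXIT */
};

/* Derive the four selectors implied by the value written to IA32_SYSENTER_CS. */
struct sysenter_selectors sysenter_layout(uint16_t sysenter_cs)
{
    struct sysenter_selectors s;
    uint16_t base = (uint16_t)(sysenter_cs & 0xFFFCu);   /* RPL bits forced to 0 */

    s.kernel_cs = base;
    s.kernel_ss = (uint16_t)(base + 8);
    s.user_cs   = (uint16_t)((base + 16) | 3);           /* RPL = 3 */
    s.user_ss   = (uint16_t)((base + 24) | 3);
    return s;
}
```

With 0x08 in IA32_SYSENTER_CS, the GDT would need kernel code at 0x08, kernel data at 0x10, user code at 0x18 and user data at 0x20, in that order.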

Re:Fastest possible kernel API call?

Posted: Sun Nov 27, 2005 8:21 pm
by Brendan
Hi,
Cjmovie wrote:I'm rather sure that interrupts are as close as you will get, though, if you want to consider safety (protection via segments/ring).
A while ago I tested the performance of interrupts and call gates. For an Intel Pentium 4 a call gate is faster (not sure about other CPUs, and can't remember how many cycles - it was a while ago).

For SYSENTER and SYSCALL my mind isn't made up yet. They should be faster because they do less, but if the OS needs to do some of the things they skip then it might end up costing more.

For example, SYSCALL disables interrupts and I'd want to re-enable them. It also doesn't change ESP, which I consider dangerous (for some OSs there's no guarantee that the next "PUSH" won't cause a page fault, which would end up being a triple fault).

By the time you've fixed IF, saved ESP somewhere, set ESP to what it should be, and then restored the CPL=3 ESP when you return...


Cheers,

Brendan

Re:Fastest possible kernel API call?

Posted: Mon Nov 28, 2005 2:25 am
by Candy
Brendan wrote: For SYSENTER and SYSCALL my mind isn't made up yet. They should be faster because they do less, but if the OS needs to do some of the things they skip then it might end up costing more.

For example, SYSCALL disables interrupts and I'd want to re-enable them. It also doesn't change ESP, which I consider dangerous (for some OSs there's no guarantee that the next "PUSH" won't cause a page fault, which would end up being a triple fault).

By the time you've fixed IF, saved ESP somewhere, set ESP to what it should be, and then restored the CPL=3 ESP when you return...
The idea is true... however, I've heard about the performance of the interrupt method taking at least 50 cycles and up to some 125 cycles per invocation. The syscall/sysenter should work in around 5-10 cycles.

Restoring the IF is a small 1-cycle task. The ESP problem is the only main difference between the two, and I agree with AMD on the idea. The task itself makes the call so it should be its stack that suffers. If it pagefaults (first fault!) during the switch or somewhere later on, you can handle the page fault in any common way, you either terminate the task or you allocate another page of stack. There's no magic to it and it reduces the overhead of a thread by a full stack. In any case, the logic is simpler.

Re:Fastest possible kernel API call?

Posted: Mon Nov 28, 2005 3:57 am
by Brendan
Hi,
Candy wrote:Restoring the IF is a small 1-cycle task. The ESP problem is the only main difference between the two, and I agree with AMD on the idea. The task itself makes the call so it should be its stack that suffers. If it pagefaults (first fault!) during the switch or somewhere later on, you can handle the page fault in any common way, you either terminate the task or you allocate another page of stack. There's no magic to it and it reduces the overhead of a thread by a full stack. In any case, the logic is simpler.
I think you're missing something here...

For my OS a CPL=3 thread's stack uses allocation on demand so that it grows as needed. When the thread is first spawned there are no pages allocated for the stack, and the very first PUSH or CALL instruction causes a page fault.

Now imagine if the newly spawned thread uses SYSCALL as one of its first instructions (i.e. before a PUSH or CALL), which is very common for my code. In this case the SYSCALL works and the kernel begins running at CPL=0 with ESP pointing to a "not present" page.

As soon as the kernel does a PUSH or CALL the CPU tries to start the page fault handler, but the CPU is already in CPL=0 so it does not switch stacks. Instead it tries to put an error code, EIP, CS and EFLAGS onto the existing stack, which causes a second page fault, which leads very quickly to a triple fault.

Even if the OS maps the first page of a new thread's stack, the new thread could consume 4 KB of stack space then use the SYSCALL instruction. This leaves the OS wide open to erratic failure (caused by swap space) and malicious denial of service attacks - "mov esp,0x80000000; syscall" and say goodbye to the OS.

AFAIK the only possible way this triple fault can be avoided (if allocation on demand is used for CPL=3 stacks) is by changing ESP so that it points to a page that is present. This must be done before the stack is used by the kernel.

For long mode the rules change - in this case you can set it up so that the page fault handler does switch to a special stack (using the IST or "Interrupt Stack Table"), so the problem can be avoided.


Cheers,

Brendan

Re:Fastest possible kernel API call?

Posted: Mon Nov 28, 2005 5:44 am
by AR
Brendan wrote:As soon as the kernel does a PUSH or CALL the CPU tries to start the page fault handler, but the CPU is already in CPL=0 so it does not switch stacks. Instead it tries to put an error code, EIP, CS and EFLAGS onto the existing stack, which causes a second page fault, which leads very quickly to a triple fault.
You can use a task gate for the page fault interrupt to avoid that. It is probably slower, but page faults would most likely occur less often than system calls, although benchmarking would be required to determine if it truly balances itself out for the given OS design.

Re:Fastest possible kernel API call?

Posted: Mon Nov 28, 2005 6:34 am
by Candy
Brendan wrote: For my OS a CPL=3 thread's stack uses allocation on demand so that it grows as needed. When the thread is first spawned there are no pages allocated for the stack, and the very first PUSH or CALL instruction causes a page fault.

Now imagine if the newly spawned thread uses SYSCALL as one of its first instructions (i.e. before a PUSH or CALL), which is very common for my code. In this case the SYSCALL works and the kernel begins running at CPL=0 with ESP pointing to a "not present" page.

As soon as the kernel does a PUSH or CALL the CPU tries to start the page fault handler, but the CPU is already in CPL=0 so it does not switch stacks. Instead it tries to put an error code, EIP, CS and EFLAGS onto the existing stack, which causes a second page fault, which leads very quickly to a triple fault.
This is why I consider 32-bit code deprecated for my OS. It complicates the code very much, whereas given the probable release date of my OS, it's quite unnecessary for the complexity to be added.

Yes, that's pretty arrogant and pessimistic about my OS, but if limiting the OS to new technology makes it a lot cleaner and more usable, and if I consider it very likely that my target user base will have that kind of computer, it's no problem.
Even if the OS maps the first page of a new thread's stack, the new thread could consume 4 KB of stack space then use the SYSCALL instruction. This leaves the OS wide open to erratic failure (caused by swap space) and malicious denial of service attacks - "mov esp,0x80000000; syscall" and say goodbye to the OS.
No fixed allocation / limit is ever a solution, except for things that can't be solved another way (logfiles).
AFAIK the only possible way this triple fault can be avoided (if allocation on demand is used for CPL=3 stacks) is by changing ESP so that it points to a page that is present. This must be done before the stack is used by the kernel.
Try a task gate or such for your page fault handler.
For long mode the rules change - in this case you can set it up so that the page fault handler does switch to a special stack (using the IST or "Interrupt Stack Table"), so the problem can be avoided.
Long mode are my rules.
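For the long-mode route, the stack switch is requested per IDT entry: a 64-bit gate descriptor carries a 3-bit IST index, and a non-zero value makes the CPU load RSP from that slot of the TSS even when the fault is taken at CPL=0. A minimal sketch of building such a gate follows; the struct and function names are illustrative, and the handler address and selector used with it are whatever your kernel defines.

```c
#include <stdint.h>

/* 64-bit IDT gate descriptor (16 bytes), in the order the CPU expects. */
struct idt_gate64 {
    uint16_t offset_low;    /* handler address bits 0..15 */
    uint16_t selector;      /* kernel code segment selector */
    uint8_t  ist;           /* bits 0..2: IST index, 0 = keep legacy stack behaviour */
    uint8_t  type_attr;     /* present bit, DPL, gate type */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;      /* must be zero */
};

/* Build a present, DPL=0 interrupt gate that uses IST slot ist_index (1..7),
 * or no IST switch when ist_index is 0. */
struct idt_gate64 make_interrupt_gate(uint64_t handler, uint16_t selector,
                                      unsigned ist_index)
{
    struct idt_gate64 g;

    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.ist         = (uint8_t)(ist_index & 0x7u);
    g.type_attr   = 0x8E;   /* P=1, DPL=0, type 0xE (64-bit interrupt gate) */
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFFu);
    g.offset_high = (uint32_t)(handler >> 32);
    g.reserved    = 0;
    return g;
}
```

The page fault entry (vector 14) built with ist_index 1 would then always start on the stack stored in IST1 of the TSS, so a bad user RSP left over from SYSCALL cannot break exception delivery.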

Re:Fastest possible kernel API call?

Posted: Mon Nov 28, 2005 6:57 am
by Pype.Clicker
Candy wrote:
AFAIK the only possible way this triple fault can be avoided (if allocation on demand is used for CPL=3 stacks) is by changing ESP so that it points to a page that is present. This must be done before the stack is used by the kernel.
Try a task gate or such for your page fault handler.
>_< I wouldn't do that... first of all, because you'll suffer a severe performance penalty for every missing page, and secondly because you cannot prevent the CPU from reloading CR3 when it goes through a task gate... that's probably a bad idea.

Moreover, where task gates are concerned, you cannot re-enter the gate; that is, nasty things will happen if your page-fault handling code triggers a page fault too, while a common trap gate will handle this gracefully (at least as long as there's room on the interrupt stack).

You might wish a task gate for doublefault, stack fault and task fault, though.
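The escalation discussed in this thread can be written down as a tiny decision model. This is a simplified sketch of 32-bit protected-mode behaviour at CPL=0, not real kernel code: it assumes that a handler reached through a task gate gets a fresh, mapped stack from its TSS, while a handler reached through an interrupt or trap gate has its frame pushed onto the current stack. All names are made up for illustration.

```c
#include <stdbool.h>

enum push_outcome {
    PUSH_OK,                /* stack page was present, no fault at all */
    PAGE_FAULT_HANDLED,     /* #PF handler started on a usable stack */
    DOUBLE_FAULT_HANDLED,   /* #PF delivery failed, #DF handler started */
    TRIPLE_FAULT            /* #DF delivery failed too: shutdown/reset */
};

/* What happens when kernel code running at CPL=0 does a PUSH. */
enum push_outcome simulate_kernel_push(bool stack_present,
                                       bool pf_via_task_gate,
                                       bool df_via_task_gate)
{
    if (stack_present)
        return PUSH_OK;

    /* The PUSH touches a not-present page, so the CPU raises #PF. */
    if (pf_via_task_gate)
        return PAGE_FAULT_HANDLED;    /* task switch loads a new ESP */

    /* Interrupt/trap gate at CPL=0: no stack switch, so pushing the
     * exception frame faults again and the CPU raises #DF. */
    if (df_via_task_gate)
        return DOUBLE_FAULT_HANDLED;  /* task switch loads a new ESP */

    /* The #DF frame cannot be pushed either. */
    return TRIPLE_FAULT;
}
```

The all-false case is the "mov esp,0x80000000; syscall" scenario; either task gate avoids the reset, at the costs described above.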

Re:Fastest possible kernel API call?

Posted: Mon Nov 28, 2005 7:13 am
by Candy
if your page-fault handling code triggers a page fault too, while a common trap gate will handle this softly
IIRC, a page fault during the handling of a page fault is also a double fault. I agree that it could be too much overhead, but that makes me all the more biased towards the amd64 approach.