The cost of a system call

Posted: Fri May 04, 2012 7:42 pm
by gerryg400
I've been doing some testing this morning and thought someone might be interested in the results. My machine is a Core2 Quad at 2.83 GHz. I have a 'null' system call that I used to do the tests in long mode. It does the following.

i) begins in the C library and passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode.

The scheduler is not called.

Average times are as follows

Code:

int $0x20/iretq  - 587 ns
syscall/sysret   - 449 ns

int $0x20/sysret - 506 ns
syscall/iretq    - 530 ns
What this means is that, per call:

Code:

syscall instead of int $0x20 saves 57 ns
sysret instead of iretq saves      81 ns
total saving                      138 ns
The other thing that this shows is that you can mix and match between the 2 mechanisms. You can use sysret to return from hardware interrupts to ring3. I also tested using syscall from ring 1 and that works fine as long as you use iretq to return.

Re: The cost of a system call

Posted: Sat May 05, 2012 4:38 am
by bluemoon
Here are my results, tested with QEMU.
The OS has launched only a single process (kthreads run at idle priority so they will not be switched to, but the PIC timer will still fire, which should have very little effect).
The kernel and application are compiled with -O2.
Note that my syscall only preserves registers according to the AMD64 ABI, not all 16 like yours.

Code:

PID[1]: Hello from user application
Call    Start : 4378D42C
          End : 4381FCF7
      Elapsed : 928CB (600267 cycles)
      Average : 3C (60 cycles)
Syscall Start : 4381FDF4
          End : 43CFDD63
      Elapsed : 4DDF6F (5103471 cycles)
      Average : 1FE (510 cycles)
Related code:

Code:

    // test call
    __asm volatile ("rdtsc; rdtsc\n" : "=a"(cstart_lo), "=d"(cstart_hi));
    for ( int i=0; i<10000; i++ ) {
        call_null();
    }
    __asm volatile ("rdtsc\n" : "=a"(cend_lo), "=d"(cend_hi));

    // test syscall
    __asm volatile ("rdtsc; rdtsc\n" : "=a"(start_lo), "=d"(start_hi));
    for ( int i=0; i<10000; i++ ) {
        syscall_null();
    }
    __asm volatile ("rdtsc\n" : "=a"(end_lo), "=d"(end_hi));
userland interface:

Code:

call_null:
    ret
syscall_null:
    xor     eax, eax
    syscall
    ret
syscall in kernel:

Code:

; Max 5 parameters: rdi rsi rdx r9 r8
_syscall_stub:
    cmp     rsp, APPADDR_PROCESS_STACK
    jae     .fault

    push    rcx                         ; ring3 rip
    push    r11                         ; rflags

    mov     r11, qword syscall_table
    mov     rcx, r9                     ; 4th parameter
    
    cmp     eax, 12
    jbe     .1
    mov     edi, eax
    xor     eax, eax
.1:
   
    call    qword [r11+rax*8]

    pop     r11
    pop     rcx
    db 0x48 ; REX.w
    sysret
The null call is a C function containing just return 0;

Re: The cost of a system call

Posted: Sat May 05, 2012 5:29 am
by rdos
gerryg400 wrote: i) begins in the C library and passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode.

The scheduler is not called.

Average times are as follows

Code:

int $0x20/iretq  - 587 ns
syscall/sysret   - 449 ns

int $0x20/sysret - 506 ns
syscall/iretq    - 530 ns
This is a lot more than what I measured on this processor in 32-bit mode with call gates in RDOS.

Results from the other thread:

near: 51.6 million calls per second
gate: 13.4 million calls per second
sysenter: 10.5 million calls per second

The call gate version takes about 75ns. I haven't measured the sysenter/sysexit version yet (edit: just below 100ns, and thus slower). And this is the full overhead as there is nothing else involved in calling kernel functions (other than loading the appropriate registers for the call in case the function has parameters). At the kernel side, no registers are saved unless they are used.

Re: The cost of a system call

Posted: Sat May 05, 2012 5:35 am
by rdos
bluemoon wrote:Here are my results, tested with QEMU.
It's not reliable to use QEMU for this kind of performance test. Such tests must be done on real hardware.

Additionally, you should not use idealised code, but you should rather compile it and validate syscalls like you would do in a production release of your OS/application.

Re: The cost of a system call

Posted: Sat May 05, 2012 5:40 am
by bluemoon
I just started porting my OS to 64-bit last week and it finally worked yesterday, so all I can do for now is run it in QEMU. I'm sure I'll try it on real hardware someday.
The numbers are for reference only; don't take them too seriously.

Re: The cost of a system call

Posted: Sat May 05, 2012 11:06 am
by bluemoon
Oops, that was a mistake; it should be:

Code:

__asm volatile ("xor eax, eax; cpuid; rdtsc" : "=a"(cstart_lo), "=d"(cstart_hi) :: "ebx", "ecx");
The idea is to make sure rdtsc is not executed out of order.

Then the result becomes:

Code:

  PID[1]: Hello from user application
Call    Start : 45409DEE
          End : 48B61737
      Elapsed : 3757949 (58030409 cycles)
      Average : 3A (58 cycles)
Syscall Start : 48B618EA
          End : 61EB4055
      Elapsed : 1935276B (422913899 cycles)
      Average : 1A6 (422 cycles)
  PID[1]: Hello again! counter=0
  PID[1]: Hello again! counter=1
  PID[1]: Hello again! counter=2
  PID[1]: Hello again! counter=3
  KMAIN : Clean zombie process: FFFFFFFF:8012A500

Re: The cost of a system call

Posted: Sat May 05, 2012 11:43 am
by turdus
gerryg400 wrote:iii) kernel saves all 16 GP regs
Good testing, but I think your results were influenced by this. One of the advantages of using syscall is that there is no need to save all the registers. You only have to save rcx and r11, which gives a considerable performance boost.

Here's how I do it. MEM_userspace points to the current TCB, which happens to be the TSS as well. This is the prologue:

Code:

		cli
if INTSYSCALL
		clsavectx
		sub			qword [MEM_userspace+24h], KERNELSTACKSIZE
else
		mov			qword [MEM_userspace+tcb.userrip], rcx
		pushf
		pop			qword [MEM_userspace+tcb.userflags]
		//restore previous r11 from local variables stack (pushed on caller side)
		mov			r11, qword [r15]
end if
		//bound check
		cmp			qword [MEM_userspace+24h], MEM_userspace+tcb.acl_end
		jb			@f
		sti
@@:
And this is the epilogue:

Code:

if INTSYSCALL
		clloadctx
		iretq
else
		mov			r11, qword [MEM_userspace+tcb.userflags]
		mov			rcx, qword [MEM_userspace+tcb.userrip]
		//force interrupt enable
		bts			r11, 9
		sysretq
end if
Maybe yield is interesting too:

Code:

if INTSYSCALL
		//no clloadctx, we want registers changed
		add			rsp, 16*8
		add			qword [MEM_userspace+24h], KERNELSTACKSIZE
		clsavectx
		//switch page tables and refresh cr3
		call			sys.arch.thread.thswitch
		clloadctx
else
		int			SCHEDTMR_INT
end if
Hope it was useful.

Re: The cost of a system call

Posted: Sat May 05, 2012 2:46 pm
by gerryg400
I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.

As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?

Re: The cost of a system call

Posted: Sun May 06, 2012 9:16 am
by Brendan
Hi,
gerryg400 wrote:I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.
I've always thought that, because syscall/sysret doesn't do some things that are likely to be necessary (like switching ESP to a kernel stack), it isn't directly comparable to software interrupts or call gates (or SYSENTER) because a kernel typically needs to add more instructions to a syscall handler that wouldn't have been necessary for software interrupts or call gates.

For an extreme example; because ESP/RSP isn't switched and the CPU doesn't push anything on the stack while at CPL=3, user space code could do "mov rsp, SOMEWHERE_IN_KERNEL_SPACE" and then "SYSCALL" and trick the kernel into trashing itself or modifying kernel data. To guard against that, the kernel has to save RSP somewhere and load RSP with a "known good" value before anything is pushed on the stack (either by the SYSCALL handler itself or by the CPU if an NMI or machine check exception occurs).

Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fails for nesting).

For worst case, you'd need to deal with malicious user space code that does something like this:

Code:

    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall
gerryg400 wrote:As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?
Yes.

For a fair comparison that isn't overly affected by OS design, you'd want to compare:
  • software interrupt with nothing more than IRET
  • call gate with nothing more than RETF
  • SYSENTER with nothing more than SYSEXIT
  • the minimum safe SYSCALL handler
The result won't be entirely OS neutral as different OSs will take different approaches for the minimum safe SYSCALL handler (e.g. the "mov esp, *something*" part).

I'd also suggest that the caller's code size also be taken into account. SYSCALL and "INT n" both cost 2 bytes. For SYSENTER, the caller needs to store "return EIP" and "return ESP" somewhere (likely EDX and ECX), so even though SYSENTER is only 2 bytes itself it's probably going to cost 6 or more bytes. For 32-bit code, call gates are going to cost a minimum of 6 bytes (using a 16-bit operand size override prefix to avoid the full 32-bit offset that's ignored anyway).

I'd expect that SYSENTER would end up being the winner for performance (for frequently executed pieces of code), and SYSCALL and software interrupts would tie for code size (for infrequently used code).


Cheers,

Brendan

Re: The cost of a system call

Posted: Sun May 06, 2012 10:00 am
by bluemoon
Brendan wrote: For worst case, you'd need to deal with malicious user space code that does something like this:

Code:

    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall
If I understand correctly, in long mode (which the syscall instruction requires) ds and es are practically ignored. I do the above in my code and it affects nothing.
I still need to check cmp rsp, APPADDR_PROCESS_STACK, where APPADDR_PROCESS_STACK is the application legal address range, and have enough room, and return failure for the syscall or abort the process. The syscall handler can reuse the application's user stack just fine, as long as it makes sure not to leave sensitive data there; if that is a concern, you can still switch stacks at that point.

Re: The cost of a system call

Posted: Sun May 06, 2012 11:25 am
by Brendan
Hi,
bluemoon wrote:
Brendan wrote:For worst case, you'd need to deal with malicious user space code that does something like this:

Code:

    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall
If I understand correctly, in long mode (which the syscall instruction requires) ds and es are practically ignored. I do the above in my code and it affects nothing.
The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp,SOMEWHERE_IN_KERNEL_SPACE" ;-) ).
bluemoon wrote:I still need to check cmp rsp, APPADDR_PROCESS_STACK, where APPADDR_PROCESS_STACK is the application legal address range, and have enough room, and return failure for the syscall or abort the process. The syscall handler can reuse the application's user stack just fine, as long as it makes sure not to leave sensitive data there; if that is a concern, you can still switch stacks at that point.
If the syscall handler reuses the application's user stack, be very careful with your page fault handler. If the syscall handler's RSP (inherited from user space) ends up pointing to a "not present" page (either because that's where the caller left it, or because the kernel pushed enough on the stack to cross from a present page into a not present page), then the CPU won't try to switch to a different stack when trying to start the page fault exception handler (no privilege level transition) and will generate a double fault. To avoid that you'd probably need to use IST for the page fault handler (and ensure that page faults never nest), or use IST for the double fault handler.

Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:

Code:

    mov rsp,0x00000008
    syscall
Cheers,

Brendan

Re: The cost of a system call

Posted: Sun May 06, 2012 11:31 am
by bluemoon
Brendan wrote:The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp,SOMEWHERE_IN_KERNEL_SPACE" ;-) ).
According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.
Brendan wrote: Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:

Code:

    mov rsp,0x00000008
    syscall
That's why I said "is the application legal address range, and have enough room".

edit: I did an experiment with this:

Code:

syscall_null:
    xor     eax, eax
    mov     ds, ax
    mov     es, ax
    mov     fs, ax
    mov     gs, ax
    mov     rbx, rsp
    mov     rsp,0x00000008
    syscall
    mov     rsp, rbx
    ret
And this is caught by #PF within the syscall handler, which gives me a chance to terminate the abnormal process.

Code:

  INT0E : #PF Page Fault Exception. RIP:FFFFFFFF:80104AE9 CODE:2 ADDR:00000000:00000000
        : PML4[0] PDPT[0] PD[0] PT[0]
    #PF : Access to unallocated memory. CODE: 2
        : ADDR: 00000000:00000000 PTE[0]: 00000000:00000000
By the way, you are correct on the #PF issue which I overlooked.

Re: The cost of a system call

Posted: Sun May 06, 2012 11:59 am
by rdos
Brendan wrote:Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fails for nesting).
I don't remember the parameters for SYSCALL, but at least for SYSENTER it is possible to make 100% certain that an application cannot modify kernel data or cause a malfunction through an invalid kernel stack.

I do it like this:

1. Kernel ESP MSR is loaded with the current thread stack offset from TSS (by taking base + size of SS0) whenever a new thread is scheduled. This takes care of the nesting issue as ESP is not loaded manually in kernel.

2. When using the stack reference from the application, address it with the ds or es register, and let ds and es for applications only cover application address space. This will make the sysenter entry-point code take a protection fault if a stack reference to kernel space is provided. In long mode, this doesn't work (limits are not used), and so the pointer needs to be checked in software.

Re: The cost of a system call

Posted: Sun May 06, 2012 2:01 pm
by Brynet-Inc
bluemoon wrote:According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.
SYSCALL/SYSRET are from AMD, which does support them in 32-bit mode. Intel only supports them in 64-bit mode.

Re: The cost of a system call

Posted: Sun May 06, 2012 2:25 pm
by Cognition
Generally if you were to use SYSCALL in long mode you simply swapgs and load in a known good pointer.

Code:

user_enter_syscall64:
     swapgs
     mov rax, [gs:KSTACK_OFFSET]
     mov [gs:USTACK_OFFSET], rsp
     mov rsp, rax
     ...
     mov rsp, [gs:USTACK_OFFSET]
     swapgs
     sysret
This assumes you can at least clobber RAX initially, as you'll probably return some value in it later. You could also do similar things for protected mode.

Code:

user_enter_syscall32:
   mov ax, PROC_SPECIFIC_DATA_SEG
   mov gs, ax
   mov eax, [gs:KSTACK_OFFSET]
   mov [ss:eax+4], esp
   mov esp, eax
   ...
   pop gs
   pop esp
   sysret
Here the user space GS value is assumed to be determinable from some other structure (thread info for example), which should work out since it's usually used for thread specific data anyway. To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.