The cost of a system call

gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

The cost of a system call

Post by gerryg400 »

I've been doing some testing this morning and thought someone might be interested in the results. My machine is a Core2 Quad at 2.83 GHz. I have a 'null' system call that I used to do the tests in long mode. It does the following.

i) begins in the C lib and passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode.

The scheduler is not called.

Average times are as follows

Code: Select all

int $0x20/iretq  - 587 ns
syscall/sysret   - 449 ns

int $0x20/sysret - 506 ns
syscall/iretq    - 530 ns
What this means is that per call -

Code: Select all

syscall instead of int $0x20 saves 57 ns
sysret instead of iretq saves      81 ns
total saving                      138 ns
The other thing that this shows is that you can mix and match between the 2 mechanisms. You can use sysret to return from hardware interrupts to ring3. I also tested using syscall from ring 1 and that works fine as long as you use iretq to return.
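The savings quoted above follow directly from the four measured averages. A minimal sketch of that arithmetic, with a conversion to clock cycles at the stated 2.83 GHz, is shown here; the constant and function names are hypothetical and are not taken from the test code.

```c
/* Measured averages from the post, in nanoseconds. */
enum {
    INT_IRETQ_NS      = 587,
    SYSCALL_SYSRET_NS = 449,
    INT_SYSRET_NS     = 506,
    SYSCALL_IRETQ_NS  = 530
};

/* Saving from entering with syscall instead of int $0x20 (both exit with sysret). */
int entry_saving_ns(void) { return INT_SYSRET_NS - SYSCALL_SYSRET_NS; }

/* Saving from returning with sysret instead of iretq (both enter with syscall). */
int exit_saving_ns(void) { return SYSCALL_IRETQ_NS - SYSCALL_SYSRET_NS; }

/* Convert nanoseconds to clock cycles for a clock given in GHz (cycles = ns * GHz). */
double ns_to_cycles(double ns, double ghz) { return ns * ghz; }
```

The two savings add up to the difference between the slowest and fastest pair, and ns_to_cycles(449, 2.83) puts the syscall/sysret round trip at roughly 1270 cycles on this machine.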
If a trainstation is where trains stop, what is a workstation ?
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

Here is my result, tested with QEMU.
The OS has launched only a single process (kthreads run at idle priority so they will not be switched to, but the PIC timer will still fire, which should have very little effect).
The kernel and application are compiled with -O2.
Note that my syscall only preserves registers as required by the AMD64 ABI, not all 16 like yours.

Code: Select all

PID[1]: Hello from user application
Call    Start : 4378D42C
          End : 4381FCF7
      Elapsed : 928CB (600267 cycles)
      Average : 3C (60 cycles)
Syscall Start : 4381FDF4
          End : 43CFDD63
      Elapsed : 4DDF6F (5103471 cycles)
      Average : 1FE (510 cycles)
Related code:

Code: Select all

    // test call
    __asm volatile ("rdtsc; rdtsc\n" : "=a"(cstart_lo), "=d"(cstart_hi));
    for ( int i=0; i<10000; i++ ) {
        call_null();
    }
    __asm volatile ("rdtsc\n" : "=a"(cend_lo), "=d"(cend_hi));

    // test syscall
    __asm volatile ("rdtsc; rdtsc\n" : "=a"(start_lo), "=d"(start_hi));
    for ( int i=0; i<10000; i++ ) {
        syscall_null();
    }
    __asm volatile ("rdtsc\n" : "=a"(end_lo), "=d"(end_hi));
userland interface:

Code: Select all

call_null:
    ret
syscall_null:
    xor     eax, eax
    syscall
    ret
syscall in kernel:

Code: Select all

; Max 5 parameters: rdi rsi rdx r9 r8
_syscall_stub:
    cmp     rsp, APPADDR_PROCESS_STACK
    jae     .fault

    push    rcx                         ; ring3 rip
    push    r11                         ; rflags

    mov     r11, qword syscall_table
    mov     rcx, r9                     ; 4th parameter
    
    cmp     eax, 12
    jbe     .1
    mov     edi, eax
    xor     eax, eax
.1:
   
    call    qword [r11+rax*8]

    pop     r11
    pop     rcx
    db 0x48 ; REX.w
    sysret
The null call is a C function containing just return 0;
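The Elapsed and Average figures in the output can be reproduced from the Start and End values. The following is a minimal sketch of that bookkeeping, assuming the printed Start/End values are the complete counts; tsc_combine and average_cycles are hypothetical helper names, not part of the posted test program.

```c
#include <stdint.h>

/* Combine the EDX:EAX halves returned by rdtsc into one 64-bit count. */
uint64_t tsc_combine(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}

/* Average cycles per loop iteration (integer division); 0 if iterations is 0. */
uint64_t average_cycles(uint64_t start, uint64_t end, uint64_t iterations) {
    if (iterations == 0)
        return 0;
    return (end - start) / iterations;
}
```

With the 10000 iterations used in the loops above, the plain call averages 60 cycles and the syscall 510, matching the printed output. Both figures include the loop overhead itself.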
rdos
Member
Posts: 3306
Joined: Wed Oct 01, 2008 1:55 pm

Re: The cost of a system call

Post by rdos »

gerryg400 wrote: i) begins in the C lib and passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode.

The scheduler is not called.

Average times are as follows

Code: Select all

int $0x20/iretq  - 587 ns
syscall/sysret   - 449 ns

int $0x20/sysret - 506 ns
syscall/iretq    - 530 ns
This is a lot more than what I measured on this processor in 32-bit mode with call gates in RDOS.

Results from the other thread:

near: 51.6 million calls per second
gate: 13.4 million calls per second
sysenter: 10.5 million calls per second

The call gate version takes about 75ns. I haven't measured the sysenter/sysexit version yet (edit: just below 100ns, and thus slower). And this is the full overhead as there is nothing else involved in calling kernel functions (other than loading the appropriate registers for the call in case the function has parameters). At the kernel side, no registers are saved unless they are used.
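The per-call times follow from the call rates listed above. As a small sketch of the conversion (ns_per_call is a hypothetical helper name):

```c
/* Nanoseconds per call for a rate given in millions of calls per second.
 * 1 second = 1e9 ns, so ns = 1e9 / (rate * 1e6) = 1000 / rate. */
double ns_per_call(double million_calls_per_sec) {
    return 1000.0 / million_calls_per_sec;
}
```

For 13.4 million calls per second this gives a little under 75 ns, and for 10.5 million it gives about 95 ns, which agrees with the figures quoted for the call gate and sysenter versions.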
Last edited by rdos on Sat May 05, 2012 6:22 am, edited 3 times in total.
rdos
Member
Posts: 3306
Joined: Wed Oct 01, 2008 1:55 pm

Re: The cost of a system call

Post by rdos »

bluemoon wrote:Here is my result, test with QEMU.
It's not reliable to use QEMU for this kind of performance test. Such tests must be done on real hardware.

Additionally, you should not use idealised code, but you should rather compile it and validate syscalls like you would do in a production release of your OS/application.
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

I just started porting my OS to 64-bit last week and it only started working yesterday, so all I can do for now is run it in QEMU. I'll certainly try it on real hardware someday.
The numbers are for reference only; don't take them too seriously.
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

Oops, that was a mistake; it should be:

Code: Select all

__asm volatile ("xor eax, eax; cpuid; rdtsc" : "=a"(cstart_lo), "=d"(cstart_hi) :: "ebx", "ecx");
The idea is to make sure rdtsc is not executed out of order.
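For comparison, the same cpuid-then-rdtsc idea can be wrapped in a C helper. This is only a sketch, assuming GCC or Clang on x86 or x86-64; rdtsc_serialized is a hypothetical name, and the registers cpuid overwrites are declared as outputs rather than clobbers so the same source builds in both 32-bit and 64-bit mode.

```c
#include <stdint.h>

/* Read the time-stamp counter after a serializing cpuid.
 * Assumes an x86 or x86-64 target and GCC/Clang extended asm. */
uint64_t rdtsc_serialized(void) {
    uint32_t lo = 0;      /* input: cpuid leaf 0; output: low half of the TSC */
    uint32_t hi;          /* output: high half of the TSC */
    uint32_t ebx, ecx;    /* overwritten by cpuid, values unused */

    __asm__ volatile(
        "cpuid\n\t"       /* serializing: earlier instructions complete first */
        "rdtsc"           /* EDX:EAX = time-stamp counter */
        : "+a"(lo), "=d"(hi), "=b"(ebx), "=c"(ecx)
        :
        : "memory");
    (void)ebx;
    (void)ecx;
    return ((uint64_t)hi << 32) | lo;
}
```

Note that cpuid is itself fairly expensive, so its cost becomes part of whatever interval is measured between two such reads.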

Then the result becomes:

Code: Select all

  PID[1]: Hello from user application
Call    Start : 45409DEE
          End : 48B61737
      Elapsed : 3757949 (58030409 cycles)
      Average : 3A (58 cycles)
Syscall Start : 48B618EA
          End : 61EB4055
      Elapsed : 1935276B (422913899 cycles)
      Average : 1A6 (422 cycles)
  PID[1]: Hello again! counter=0
  PID[1]: Hello again! counter=1
  PID[1]: Hello again! counter=2
  PID[1]: Hello again! counter=3
  KMAIN : Clean zombie process: FFFFFFFF:8012A500
turdus
Member
Posts: 496
Joined: Tue Feb 08, 2011 1:58 pm

Re: The cost of a system call

Post by turdus »

gerryg400 wrote:iii) kernel saves all 16 GP regs
Good testing, but I think your results were influenced by this. One of the advantages of using syscall is that there is no need to save all the registers. You only have to save rcx and r11, which gives a considerable performance boost.

Here's how I do it. KMEM_userspace points to the current TCB, which happens to be the TSS as well. This is the prologue:

Code: Select all

		cli
if INTSYSCALL
		clsavectx
		sub			qword [MEM_userspace+24h], KERNELSTACKSIZE
else
		mov			qword [MEM_userspace+tcb.userrip], rcx
		pushf
		pop			qword [MEM_userspace+tcb.userflags]
		//restore previous r11 from local variables stack (pushed on caller side)
		mov			r11, qword [r15]
end if
		//bound check
		cmp			qword [MEM_userspace+24h], MEM_userspace+tcb.acl_end
		jb			@f
		sti
@@:
And this is the epilogue:

Code: Select all

if INTSYSCALL
		clloadctx
		iretq
else
		mov			r11, qword [MEM_userspace+tcb.userflags]
		mov			rcx, qword [MEM_userspace+tcb.userrip]
		//force interrupt enable
		bts			r11, 9
		sysretq
end if
Maybe yield is interesting too:

Code: Select all

if INTSYSCALL
		//no clloadctx, we want registers changed
		add			rsp, 16*8
		add			qword [MEM_userspace+24h], KERNELSTACKSIZE
		clsavectx
		//switch page tables and refresh cr3
		call			sys.arch.thread.thswitch
		clloadctx
else
		int			SCHEDTMR_INT
end if
Hope it was useful.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: The cost of a system call

Post by gerryg400 »

I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.

As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: The cost of a system call

Post by Brendan »

Hi,
gerryg400 wrote:I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.
I've always thought that, because syscall/sysret doesn't do some things that are likely to be necessary (like switching ESP to a kernel stack), it isn't directly comparable to software interrupts or call gates (or SYSENTER) because a kernel typically needs to add more instructions to a syscall handler that wouldn't have been necessary for software interrupts or call gates.

For an extreme example; because ESP/RSP isn't switched and the CPU doesn't push anything on the stack while at CPL=3, user space code could do "mov rsp, SOMEWHERE_IN_KERNEL_SPACE" and then "SYSCALL" and trick the kernel into trashing itself or modifying kernel data. To guard against that, the kernel has to save RSP somewhere and load RSP with a "known good" value before anything is pushed on the stack (either by the SYSCALL handler itself or by the CPU if an NMI or machine check exception occurs).

Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent an NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fail for nesting).

For worst case, you'd need to deal with malicious user space code that does something like this:

Code: Select all

    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall
gerryg400 wrote:As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?
Yes.

For a fair comparison that isn't overly affected by OS design, you'd want to compare:
  • software interrupt with nothing more than IRET
  • call gate with nothing more than RETF
  • SYSENTER with nothing more than SYSEXIT
  • the minimum safe SYSCALL handler
The result won't be entirely OS neutral as different OSs will take different approaches for the minimum safe SYSCALL handler (e.g. the "mov esp, *something*" part).

I'd also suggest that the caller's code size also be taken into account. SYSCALL and "INT n" both cost 2 bytes. For SYSENTER, the caller needs to store "return EIP" and "return ESP" somewhere (likely EDX and ECX), so even though SYSENTER is only 2 bytes itself it's probably going to cost 6 or more bytes. For 32-bit code, call gates are going to cost a minimum of 6 bytes (using an operand size override prefix to select a 16-bit offset, avoiding the full 32-bit offset that's ignored anyway).

I'd expect that SYSENTER would end up being the winner for performance (for frequently executed pieces of code), and SYSCALL and software interrupts would tie for code size (for infrequently used code).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

Brendan wrote: For worst case, you'd need to deal with malicious user space code that does something like this:

Code: Select all

    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall
If I understand correctly, in long mode (which the syscall instruction requires) ds and es are practically ignored. I do the above in my code and it affects nothing.
I still need to check cmp rsp, APPADDR_PROCESS_STACK, where APPADDR_PROCESS_STACK is the application's legal address range, confirm there is enough room, and otherwise fail the syscall or abort the process. The syscall handler can reuse the application's user stack just fine, as long as it makes sure not to leave sensitive data there; at that point you may still switch stacks.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: The cost of a system call

Post by Brendan »

Hi,
bluemoon wrote:
Brendan wrote:For worst case, you'd need to deal with malicious user space code that does something like this:

Code: Select all

    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall
If I understand correctly, in long mode (which the syscall instruction requires) ds and es are practically ignored. I do the above in my code and it affects nothing.
The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp,SOMEWHERE_IN_KERNEL_SPACE" ;-) ).
bluemoon wrote:I still need to check cmp rsp, APPADDR_PROCESS_STACK, where APPADDR_PROCESS_STACK is the application's legal address range, confirm there is enough room, and otherwise fail the syscall or abort the process. The syscall handler can reuse the application's user stack just fine, as long as it makes sure not to leave sensitive data there; at that point you may still switch stacks.
If the syscall handler reuses the application's user stack, be very careful with your page fault handler. If the syscall handler's RSP (inherited from user space) ends up pointing to a "not present" page (either because that's where the caller left it, or because the kernel pushed enough on the stack to cross from a present page into a not present page), then the CPU won't try to switch to a different stack when trying to start the page fault exception handler (no privilege level transition) and will generate a double fault. To avoid that you'd probably need to use IST for the page fault handler (and ensure that page faults never nest), or use IST for the double fault handler.
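To illustrate the second case (the kernel pushing enough to cross into another page), here is a minimal sketch that reports whether a push of a given size below RSP touches more than one page. It assumes 4 KiB pages; KPAGE_SIZE and push_spans_pages are hypothetical names, and a real handler would still have to check that each touched page is actually present.

```c
#include <stdbool.h>
#include <stdint.h>

#define KPAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* True if pushing `bytes` below `rsp` writes to more than one page.
 * The bytes written are rsp - bytes through rsp - 1. */
bool push_spans_pages(uint64_t rsp, uint64_t bytes) {
    if (bytes == 0 || bytes > rsp)
        return false;                 /* nothing pushed, or the range would wrap */
    uint64_t lowest  = rsp - bytes;   /* lowest byte written */
    uint64_t highest = rsp - 1;       /* highest byte written */
    return (lowest / KPAGE_SIZE) != (highest / KPAGE_SIZE);
}
```

For example, with RSP at 0x2010 a 16-byte push stays within the page at 0x2000, while a 24-byte push reaches down into the page at 0x1000, which may not be present.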

Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:

Code: Select all

    mov rsp,0x00000008
    syscall
Cheers,

Brendan
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: The cost of a system call

Post by bluemoon »

Brendan wrote:The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp,SOMEWHERE_IN_KERNEL_SPACE" ;-) ).
According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.
Brendan wrote: Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:

Code: Select all

    mov rsp,0x00000008
    syscall
That's why I said "is the application legal address range, and have enough room".
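A check covering both conditions (inside the legal range, and with enough room below) might look like the following C sketch. The function name, the explicit lower and upper bounds, and the `needed` parameter are all assumptions made for illustration; the actual kernel above does the comparison in assembly against APPADDR_PROCESS_STACK.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if rsp lies inside [stack_low, stack_top] and at least `needed`
 * bytes remain between rsp and the lower bound. */
bool user_stack_ok(uint64_t rsp, uint64_t stack_low,
                   uint64_t stack_top, uint64_t needed) {
    if (rsp > stack_top || rsp < stack_low)
        return false;                   /* outside the application's stack region */
    return rsp - stack_low >= needed;   /* room left before the lower bound */
}
```

With bounds like these, the "mov rsp,0x00000008" case is rejected by the lower-bound test before anything is pushed, and a kernel-space RSP is rejected by the upper-bound test.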

edit: I did an experiment with this:

Code: Select all

syscall_null:
    xor     eax, eax
    mov     ds, ax
    mov     es, ax
    mov     fs, ax
    mov     gs, ax
    mov     rbx, rsp
    mov     rsp,0x00000008
    syscall
    mov     rsp, rbx
    ret
And this is caught by #PF within the syscall handler, which gives me a chance to terminate the abnormal process.

Code: Select all

  INT0E : #PF Page Fault Exception. RIP:FFFFFFFF:80104AE9 CODE:2 ADDR:00000000:00000000
        : PML4[0] PDPT[0] PD[0] PT[0]
    #PF : Access to unallocated memory. CODE: 2
        : ADDR: 00000000:00000000 PTE[0]: 00000000:00000000
By the way, you are correct on the #PF issue which I overlooked.
rdos
Member
Member
Posts: 3306
Joined: Wed Oct 01, 2008 1:55 pm

Re: The cost of a system call

Post by rdos »

Brendan wrote:Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent an NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fail for nesting).
I don't remember the parameters for SYSCALL, but at least for SYSENTER it is possible to make 100% certain that an application cannot modify kernel data or cause a malfunction through an invalid kernel stack.

I do it like this:

1. Kernel ESP MSR is loaded with the current thread stack offset from TSS (by taking base + size of SS0) whenever a new thread is scheduled. This takes care of the nesting issue as ESP is not loaded manually in kernel.

2. When using the stack reference from the application, address it with the ds or es register, and let ds and es for applications only cover application address space. This will make the sysenter entry-point code protection fault if a stack reference to kernel space is provided. In long mode, this doesn't work (limits are not used), and so the pointer needs to be checked with software.
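For the long mode case, where segment limits no longer help, the software check could look roughly like this. It is a sketch under a stated assumption: user space occupies the lower canonical half of a 48-bit virtual address space, so USER_SPACE_END is a hypothetical constant, and user_range_ok is a hypothetical helper rather than anything from RDOS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: user space is the lower canonical half (48-bit addressing). */
#define USER_SPACE_END 0x0000800000000000ULL

/* True if [ptr, ptr + len) lies entirely in user space.
 * Written so that ptr + len is never computed and cannot overflow. */
bool user_range_ok(uint64_t ptr, uint64_t len) {
    if (ptr >= USER_SPACE_END)
        return false;
    return len <= USER_SPACE_END - ptr;
}
```

Comparing len against the remaining space, instead of adding it to ptr, avoids a wrap-around that would let a huge length slip past the check.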
Brynet-Inc
Member
Posts: 2426
Joined: Tue Oct 17, 2006 9:29 pm
Libera.chat IRC: brynet
Location: Canada

Re: The cost of a system call

Post by Brynet-Inc »

bluemoon wrote:According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.
SYSCALL/SYSRET are from AMD, which does support them in 32-bit mode. Intel only supports them in 64-bit mode.
Twitter: @canadianbryan. Award by smcerm, I stole it. Original was larger.
Cognition
Member
Posts: 191
Joined: Tue Apr 15, 2008 6:37 pm
Location: Gotham, Batmanistan

Re: The cost of a system call

Post by Cognition »

Generally if you were to use SYSCALL in long mode you simply swapgs and load in a known good pointer.

Code: Select all

user_enter_syscall64:
     swapgs
     mov rax, [gs:KSTACK_OFFSET]
     mov [gs:USTACK_OFFSET], rsp
     mov rsp, rax
     ...
     mov rsp, [gs:USTACK_OFFSET]
     swapgs
     sysret
This is making the assumption you could at least clobber RAX initially as you'll probably return some value in it later. You could also do similar things for protected mode.

Code: Select all

user_enter_syscall32:
   mov ax, PROC_SPECIFIC_DATA_SEG
   mov gs, ax
   mov eax, [gs:KSTACK_OFFSET]
   mov [ss:eax+4], esp
   mov esp, eax
   ...
   pop gs
   pop esp
   sysret
Here the user space GS value is assumed to be determinable from some other structure (thread info for example), which should work out since it's usually used for thread-specific data anyway. To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.
Reserved for OEM use.