The cost of a system call
Posted: Fri May 04, 2012 7:42 pm
by gerryg400
I've been doing some testing this morning and thought someone might be interested in the results. My machine is a Core2 Quad at 2.83 GHz. I have a 'null' system call that I used to do the tests in long mode. It does the following.
i) begins in the C lib, which passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode.
The scheduler is not called.
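For anyone curious, here is a rough sketch of what steps ii) to v) can look like (not the exact code I timed; USER_RSP and KERNEL_RSP are hypothetical per-CPU slot offsets reachable after swapgs):
Code: Select all
syscall_entry:
swapgs                           ; iv) reach the per-CPU data (kernel stack, core id, ...)
mov [gs:USER_RSP], rsp           ; stash the caller's stack pointer
mov rsp, [gs:KERNEL_RSP]         ; v) switch to the kernel stack
push rcx                         ; return rip (placed there by syscall)
push r11                         ; return rflags (placed there by syscall)
push rax                         ; iii) save the remaining GP regs ...
; ... save the rest, call into the kernel C code (step vi), then restore
; everything, switch back to the user stack, swapgs, and sysret or iretq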
Average times are as follows
Code: Select all
int $0x20/iretq - 587 ns
syscall/sysret - 449 ns
int $0x20/sysret - 506 ns
syscall/iretq - 530 ns
What this means is that per call -
Code: Select all
syscall instead of int $0x20 saves 57 ns
sysret instead of iretq saves 81 ns
total saving 138 ns
The other thing that this shows is that you can mix and match between the 2 mechanisms. You can use sysret to return from hardware interrupts to ring3. I also tested using syscall from ring 1 and that works fine as long as you use iretq to return.
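Mixing the two of course assumes syscall has been enabled in the first place. For completeness, a sketch of the one-time MSR setup (KERNEL_CS_SEL and SYSRET_BASE_SEL are placeholders for whatever your GDT provides, and syscall_entry is the entry stub sketched above):
Code: Select all
enable_syscall:
mov ecx, 0xC0000080              ; IA32_EFER
rdmsr
or eax, 1                        ; set SCE (SysCall Enable)
wrmsr
mov ecx, 0xC0000081              ; IA32_STAR: bits 47:32 = kernel CS, bits 63:48 = sysret base selector
xor eax, eax                     ; low half is unused in long mode
mov edx, (SYSRET_BASE_SEL << 16) | KERNEL_CS_SEL
wrmsr
mov ecx, 0xC0000082              ; IA32_LSTAR: 64-bit syscall entry point
mov rax, syscall_entry
mov rdx, rax
shr rdx, 32
wrmsr
mov ecx, 0xC0000084              ; IA32_FMASK: rflags bits cleared on entry
mov eax, 0x200                   ; e.g. clear IF so the handler starts with interrupts off
xor edx, edx
wrmsr
ret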
Re: The cost of a system call
Posted: Sat May 05, 2012 4:38 am
by bluemoon
Here is my result, test with QEMU.
The OS has only launched a single process (kthreads are at idle priority so they will not be switched in, but the PIC timer will still fire, which should affect things very little).
The kernel and application are compiled with -O2.
Note that my syscall only preserves the registers required by the AMD64 ABI, not all 16 like yours.
Code: Select all
PID[1]: Hello from user application
Call Start : 4378D42C
End : 4381FCF7
Elapsed : 928CB (600267 cycles)
Average : 3C (60 cycles)
Syscall Start : 4381FDF4
End : 43CFDD63
Elapsed : 4DDF6F (5103471 cycles)
Average : 1FE (510 cycles)
Related code:
Code: Select all
// test call
__asm volatile ("rdtsc; rdtsc\n" : "=a"(cstart_lo), "=d"(cstart_hi));
for ( int i=0; i<10000; i++ ) {
call_null();
}
__asm volatile ("rdtsc\n" : "=a"(cend_lo), "=d"(cend_hi));
// test syscall
__asm volatile ("rdtsc; rdtsc\n" : "=a"(start_lo), "=d"(start_hi));
for ( int i=0; i<10000; i++ ) {
syscall_null();
}
__asm volatile ("rdtsc\n" : "=a"(end_lo), "=d"(end_hi));
userland interface:
Code: Select all
call_null:
ret
syscall_null:
xor eax, eax
syscall
ret
syscall in kernel:
Code: Select all
; Max 5 parameters: rdi rsi rdx r9 r8
_syscall_stub:
cmp rsp, APPADDR_PROCESS_STACK
jae .fault
push rcx ; ring3 rip
push r11 ; rflags
mov r11, qword syscall_table
mov rcx, r9 ; 4th parameter
cmp eax, 12
jbe .1
mov edi, eax
xor eax, eax
.1:
call qword [r11+rax*8]
pop r11
pop rcx
db 0x48 ; REX.w
sysret
The null call is a C function with just:
return 0;
Re: The cost of a system call
Posted: Sat May 05, 2012 5:29 am
by rdos
gerryg400 wrote:
i) begins in the C lib, which passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode.
The scheduler is not called.
Average times are as follows
Code: Select all
int $0x20/iretq - 587 ns
syscall/sysret - 449 ns
int $0x20/sysret - 506 ns
syscall/iretq - 530 ns
This is a lot more than what I measured on this processor in 32-bit mode with call gates in RDOS.
Results from the other thread:
near: 51.6 million calls per second
gate: 13.4 million calls per second
sysenter: 10.5 million calls per second
The call gate version takes about 75 ns. I haven't measured the sysenter/sysexit version yet (edit: just below 100 ns, and thus slower). And this is the full overhead, as there is nothing else involved in calling kernel functions (other than loading the appropriate registers for the call in case the function has parameters). On the kernel side, no registers are saved unless they are used.
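For reference, a call gate is nothing more than a descriptor in the GDT/LDT, so the caller's side is a single far call through its selector (the offset is ignored). A sketch of a 32-bit call gate entry, with placeholder values rather than my actual layout:
Code: Select all
; KERNEL_ENTRY = linear address of the kernel entry point, KERNEL_CS_SEL = kernel code selector
call_gate:
dw KERNEL_ENTRY & 0xFFFF         ; offset 15:0
dw KERNEL_CS_SEL                 ; target code segment selector
db 0                             ; parameter count (dwords copied to the new stack)
db 0xEC                          ; P=1, DPL=3, type=0xC (32-bit call gate)
dw KERNEL_ENTRY >> 16            ; offset 31:16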
Re: The cost of a system call
Posted: Sat May 05, 2012 5:35 am
by rdos
bluemoon wrote:Here is my result, test with QEMU.
It's not reliable to use QEMU for these kinds of performance tests. You must do them on real hardware.
Additionally, you should not use idealised code; rather, compile it and validate syscalls the way you would in a production release of your OS/application.
Re: The cost of a system call
Posted: Sat May 05, 2012 5:40 am
by bluemoon
I just started porting my OS to 64-bit last week and it finally worked yesterday, so all I can do for now is run it in QEMU. Sure, someday I'll try it on real hardware.
The numbers are for reference only; don't take them too seriously.
Re: The cost of a system call
Posted: Sat May 05, 2012 11:06 am
by bluemoon
Oops, that was a mistake; it should be:
Code: Select all
__asm volatile ("xor eax, eax; cpuid; rdtsc" : "=a"(cstart_lo), "=d"(cstart_hi) :: "ebx", "ecx");
The idea is to make sure rdtsc is not executed out of order.
Then, the result becomes:
Code: Select all
PID[1]: Hello from user application
Call Start : 45409DEE
End : 48B61737
Elapsed : 3757949 (58030409 cycles)
Average : 3A (58 cycles)
Syscall Start : 48B618EA
End : 61EB4055
Elapsed : 1935276B (422913899 cycles)
Average : 1A6 (422 cycles)
PID[1]: Hello again! counter=0
PID[1]: Hello again! counter=1
PID[1]: Hello again! counter=2
PID[1]: Hello again! counter=3
KMAIN : Clean zombie process: FFFFFFFF:8012A500
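As an aside, a common variant of the same fencing idea (a sketch, not my code above, and rdtscp needs CPU support) is to serialize with cpuid before the first read and to use rdtscp followed by cpuid for the second read:
Code: Select all
xor eax, eax
cpuid                            ; serialize before the first read (clobbers eax, ebx, ecx, edx)
rdtsc                            ; start timestamp in edx:eax
; ... code under test ...
rdtscp                           ; end timestamp; waits for earlier instructions to finish
mov r8d, eax                     ; save the low half before cpuid clobbers it
mov r9d, edx                     ; save the high half
xor eax, eax
cpuid                            ; keep later instructions from starting before the read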
Re: The cost of a system call
Posted: Sat May 05, 2012 11:43 am
by turdus
gerryg400 wrote:iii) kernel saves all 16 GP regs
Good testing, but I think your results were influenced by this. One of the advantages of using syscall is that there is no need to save all the registers; you only have to save rcx and r11, which gives a considerable performance boost.
Here's how I do it. KMEM_userspace points to the current TCB, which happens to be the TSS as well. This is the prologue:
Code: Select all
cli
if INTSYSCALL
clsavectx
sub qword [MEM_userspace+24h], KERNELSTACKSIZE
else
mov qword [MEM_userspace+tcb.userrip], rcx
pushf
pop qword [MEM_userspace+tcb.userflags]
//restore previous r11 from local variables stack (pushed on caller side)
mov r11, qword [r15]
end if
//bound check
cmp qword [MEM_userspace+24h], MEM_userspace+tcb.acl_end
jb @f
sti
@@:
And this is the epilogue:
Code: Select all
if INTSYSCALL
clloadctx
iretq
else
mov r11, qword [MEM_userspace+tcb.userflags]
mov rcx, qword [MEM_userspace+tcb.userrip]
//force interrupt enable
bts r11, 9
sysretq
end if
Maybe yield is interesting too:
Code: Select all
if INTSYSCALL
//no clloadctx, we want registers changed
add rsp, 16*8
add qword [MEM_userspace+24h], KERNELSTACKSIZE
clsavectx
//switch page tables and refresh cr3
call sys.arch.thread.thswitch
clloadctx
else
int SCHEDTMR_INT
end if
Hope it was useful.
Re: The cost of a system call
Posted: Sat May 05, 2012 2:46 pm
by gerryg400
I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.
As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?
Re: The cost of a system call
Posted: Sun May 06, 2012 9:16 am
by Brendan
Hi,
gerryg400 wrote:I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.
I've always thought that, because syscall/sysret doesn't do some things that are likely to be necessary (like switching ESP to a kernel stack), it isn't directly comparable to software interrupts or call gates (or SYSENTER) because a kernel typically needs to add more instructions to a syscall handler that wouldn't have been necessary for software interrupts or call gates.
For an extreme example: because ESP/RSP isn't switched and the CPU doesn't push anything on the stack while at CPL=3, user space code could do "mov rsp, SOMEWHERE_IN_KERNEL_SPACE" and then "SYSCALL" and trick the kernel into trashing itself or modifying kernel data. To guard against that, the kernel has to save RSP somewhere and load RSP with a "known good" value before anything is pushed on the stack (either by the SYSCALL handler itself or by the CPU if an NMI or machine check exception occurs).
Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fails for nesting).
For worst case, you'd need to deal with malicious user space code that does something like this:
Code: Select all
mov eax,0
mov ds,eax
mov es,eax
mov fs,eax
mov gs,eax
mov esp,SOMEWHERE_IN_KERNEL_SPACE
syscall
gerryg400 wrote:As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?
Yes.
For a fair comparison that isn't overly affected by OS design, you'd want to compare:
- software interrupt with nothing more than IRET
- call gate with nothing more than RETF
- SYSENTER with nothing more than SYSEXIT
- the minimum safe SYSCALL handler
The result won't be entirely OS neutral as different OSs will take different approaches for the minimum safe SYSCALL handler (e.g. the "mov esp, *something*" part).
I'd also suggest that the caller's code size also be taken into account. SYSCALL and "INT n" both cost 2 bytes. For SYSENTER, the caller needs to store "return EIP" and "return ESP" somewhere (likely EDX and ECX), so even though SYSENTER is only 2 bytes itself it's probably going to cost 6 or more bytes. For 32-bit code, call gates are going to cost a minimum of 6 bytes (using a 16-bit address size override prefix to avoid the full 32-bit offset that's ignored anyway).
I'd expect that SYSENTER would end up being the winner for performance (for frequently executed pieces of code), and SYSCALL and software interrupts would tie for code size (for infrequently used code).
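To make that last list item concrete, here's a rough sketch (hypothetical per-CPU slot names, not any particular OS's code) of the minimum safe SYSCALL handler next to the trivial software interrupt case, in long mode:
Code: Select all
int_null:
iretq                            ; software interrupt case: nothing more than iretq

syscall_null:
swapgs                           ; reach the per-CPU data
mov [gs:SAVED_USER_RSP], rsp     ; nothing may be pushed before RSP is known good
mov rsp, [gs:KERNEL_RSP]         ; switch to a known good kernel stack
; ... (nothing else, for the purpose of the comparison) ...
mov rsp, [gs:SAVED_USER_RSP]     ; restore the caller's RSP
swapgs
o64 sysret                       ; the 64-bit form, i.e. sysretq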
Cheers,
Brendan
Re: The cost of a system call
Posted: Sun May 06, 2012 10:00 am
by bluemoon
Brendan wrote:
For worst case, you'd need to deal with malicious user space code that does something like this:
Code: Select all
mov eax,0
mov ds,eax
mov es,eax
mov fs,eax
mov gs,eax
mov esp,SOMEWHERE_IN_KERNEL_SPACE
syscall
If I understand correctly, in long mode (which the syscall instruction requires) ds and es are practically ignored. I do the above in my code and it affects nothing.
I still need to check "cmp rsp, APPADDR_PROCESS_STACK", where APPADDR_PROCESS_STACK is the application's legal address range and there must be enough room, and return failure for the syscall or abort the process. The syscall handler can reuse the application's user stack just fine, as long as it makes sure not to leave sensitive data there - at that point you may still switch stacks.
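A sketch of what that check could look like with both bounds spelled out (USER_STACK_BOTTOM, USER_STACK_TOP and SYSCALL_STACK_NEED are placeholders, not my real constants):
Code: Select all
_syscall_stub_checked:
cmp rsp, USER_STACK_TOP
ja .fault                        ; above the legal user stack region
cmp rsp, USER_STACK_BOTTOM + SYSCALL_STACK_NEED
jb .fault                        ; in range, but not enough headroom left
; ... normal dispatch continues here, still on the caller's stack ...
db 0x48                          ; REX.W, as in the stub above
sysret
.fault:
; ... return an error to the caller, or terminate the process ...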
Re: The cost of a system call
Posted: Sun May 06, 2012 11:25 am
by Brendan
Hi,
bluemoon wrote:Brendan wrote:For worst case, you'd need to deal with malicious user space code that does something like this:
Code: Select all
mov eax,0
mov ds,eax
mov es,eax
mov fs,eax
mov gs,eax
mov esp,SOMEWHERE_IN_KERNEL_SPACE
syscall
If I understand correctly, in long mode (which the syscall instruction requires) ds and es are practically ignored. I do the above in my code and it affects nothing.
The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp, SOMEWHERE_IN_KERNEL_SPACE").
bluemoon wrote:I still need to check "cmp rsp, APPADDR_PROCESS_STACK", where APPADDR_PROCESS_STACK is the application's legal address range and there must be enough room, and return failure for the syscall or abort the process. The syscall handler can reuse the application's user stack just fine, as long as it makes sure not to leave sensitive data there - at that point you may still switch stacks.
If the syscall handler reuses the application's user stack, be very careful with your page fault handler. If the syscall handler's RSP (inherited from user space) ends up pointing to a "not present" page (either because that's where the caller left it, or because the kernel pushed enough on the stack to cross from a present page into a not present page), then the CPU won't try to switch to a different stack when trying to start the page fault exception handler (no privilege level transition) and will generate a double fault. To avoid that you'd probably need to use IST for the page fault handler (and ensure that page faults never nest), or use IST for the double fault handler.
Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:
Cheers,
Brendan
Re: The cost of a system call
Posted: Sun May 06, 2012 11:31 am
by bluemoon
Brendan wrote:The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp, SOMEWHERE_IN_KERNEL_SPACE").
According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.
Brendan wrote:
Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:
That's why I said "is the application's legal address range and there must be enough room".
edit: I did an experiment with this:
Code: Select all
syscall_null:
xor eax, eax
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov rbx, rsp
mov rsp,0x00000008
syscall
mov rsp, rbx
ret
And this is caught by #PF within the syscall handler, where I have a chance to terminate the abnormal process.
Code: Select all
INT0E : #PF Page Fault Exception. RIP:FFFFFFFF:80104AE9 CODE:2 ADDR:00000000:00000000
: PML4[0] PDPT[0] PD[0] PT[0]
#PF : Access to unallocated memory. CODE: 2
: ADDR: 00000000:00000000 PTE[0]: 00000000:00000000
By the way, you are correct on the #PF issue, which I had overlooked.
Re: The cost of a system call
Posted: Sun May 06, 2012 11:59 am
by rdos
Brendan wrote:Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fails for nesting).
I don't remember the parameters for SYSCALL, but at least for SYSENTER it is possible to make 100% certain that an application cannot modify kernel data or cause malfunctions because of an invalid kernel stack.
I do it like this:
1. The kernel ESP MSR is loaded with the current thread's stack offset from the TSS (by taking base + size of SS0) whenever a new thread is scheduled. This takes care of the nesting issue, as ESP is not loaded manually in the kernel.
2. When using the stack reference from the application, address it with the ds or es register, and let ds and es for applications cover only the application address space. This will make the sysenter entry-point code protection-fault if a stack reference to kernel space is provided. In long mode this doesn't work (limits are not used), so the pointer needs to be checked in software.
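A sketch of point 1 (the names are placeholders; IA32_SYSENTER_ESP is MSR 0x175, and IA32_SYSENTER_CS/EIP at 0x174/0x176 are programmed once at boot):
Code: Select all
; on every thread switch: point the sysenter kernel stack at the new thread's SS0:ESP0
reload_sysenter_esp:
mov eax, [new_thread_esp0]       ; linear top of the new thread's kernel stack
xor edx, edx
mov ecx, 0x175                   ; IA32_SYSENTER_ESP
wrmsr
ret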
Re: The cost of a system call
Posted: Sun May 06, 2012 2:01 pm
by Brynet-Inc
bluemoon wrote:According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.
SYSCALL/SYSRET are from AMD, which does support them in 32-bit mode. Intel only supports them in 64-bit mode.
Re: The cost of a system call
Posted: Sun May 06, 2012 2:25 pm
by Cognition
Generally, if you were to use SYSCALL in long mode, you simply swapgs and load a known-good pointer.
Code: Select all
user_enter_syscall64:
swapgs
mov rax, [gs:KSTACK_OFFSET]
mov [gs:USTACK_OFFSET], rsp
mov rsp, rax
...
mov rsp, [gs:USTACK_OFFSET]
swapgs
sysret
This is making the assumption that you can at least clobber RAX initially, as you'll probably return some value in it later. You could also do similar things for protected mode.
Code: Select all
user_enter_syscall32:
mov ax, PROC_SPECIFIC_DATA_SEG
mov gs, ax
mov eax, [gs:KSTACK_OFFSET]
mov [ss:eax+4], esp
mov esp, eax
...
pop gs
pop esp
sysret
Here the user space GS value is assumed to be determinable from some other structure (thread info, for example), which should work out since it's usually used for thread-specific data anyway. To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine whether an NMI nested within the NMI handler code itself.
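For what it's worth, in long mode the IST route is only two small bits of setup (tss64, idt and nmi_stack_top are placeholder symbols; the offsets follow the 64-bit TSS and IDT gate layouts):
Code: Select all
setup_nmi_ist:
mov rax, nmi_stack_top
mov [tss64 + 0x24], rax          ; IST1 slot in the 64-bit TSS (offset 0x24)
mov byte [idt + 2*16 + 4], 1     ; vector 2 (NMI): low 3 bits of byte 4 = IST index
ret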