System call implementation
Hello!
I have searched the wiki and the forum for threads on system call implementation, but I did not find what I was looking for.
Context: x86 protected mode with paging, kernel in ring 0, user processes in ring 3; the kernel is not mapped into user space (which might become a problem in the near future).
Currently the only mechanism I know how to use (thanks to you) is interrupts for system calls.
However, I would like to have a clean API. I plan to create a library that encapsulates the system calls and to link all user code against it (currently the interrupts are hardcoded in user code).
From what I have read, there is another method using the "syscall/sysenter/sysexit" instructions (sorry if I mixed those up). But how would I implement it?
From what I understood, I should make a kernel page containing those entry points available to other processes, and then I could use those entry points instead of interrupts.
But how do I access those entry points? Can I use some library trick to link directly against them?
I understand the interrupt implementation of system calls very well, but I don't see how to do it properly with the syscall instructions.
Do you have any pointers or information about it?
PS: I had some difficulty making myself clear in this message; if it is not understandable, I will try to rewrite it properly in a few hours.
- Combuster
Sysenter and sysexit form a pair: userspace executes sysenter, the kernel executes sysexit. The same goes for syscall/sysret.
Sysenter and sysexit are built around a set of MSRs:
- SYSENTER_CS_MSR
- SYSENTER_EIP_MSR
- SYSENTER_ESP_MSR
When sysenter is executed, the CPU loads CS, EIP and ESP from these MSRs and sets SS to CS + 8. The other GPRs are left unchanged and can be used for arguments.
When the kernel is done, it executes sysexit, which loads CS with SYSENTER_CS_MSR + 16 + 3 and SS with SYSENTER_CS_MSR + 24 + 3 (the + 3 is the ring 3 RPL).
EIP and ESP are loaded from EDX and ECX respectively.
To make sure you don't break things, the GDT needs to be ordered like this:
- (other descriptors, maybe just the null descriptor)
- 32-bit kernel code
- 32-bit kernel data
- 32-bit user code
- 32-bit user data
- (other descriptors, if appropriate)
The four code and data descriptors all need to have a base of 0 and the maximum (i.e. no) limit.
Furthermore, sysenter does not record where it came from, so the userspace application needs to set ECX (its stack pointer) and EDX (the return address) before executing sysenter, and the kernel needs to preserve them until sysexit. Of course, an alternative scheme could also be used.
Be aware that you might need to reload the ESP MSR during a task switch, for example when each task has its own kernel stack.
Syscall/sysret is similar, but syscall automatically saves EIP into ECX and sysret restores it from there, while ESP is ignored entirely (so switching stacks is up to the kernel). Be aware that you'll end up with an inconsistent stack after a syscall: you are in ring 0 but still on the user's stack. (Similarly, you'll have to bring the stack back into the same inconsistent state before sysret.)
Sysenter/sysexit is well documented in the Intel manuals (Volume 2). For syscall/sysret, you should get the AMD manual (Volume 3).
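The selector arithmetic described above can be checked with a few lines of C. This is only a sketch: the struct and function names are made up for illustration, and it assumes the value written to SYSENTER_CS_MSR is the kernel code selector, with the two RPL bits masked off as the CPU does. The MSR numbers are the architectural ones.

```c
#include <stdint.h>

/* Architectural MSR numbers of the three SYSENTER registers. */
#define MSR_SYSENTER_CS  0x174u
#define MSR_SYSENTER_ESP 0x175u
#define MSR_SYSENTER_EIP 0x176u

/* Selectors the CPU derives from the value stored in the CS MSR.
 * (Illustrative type, not taken from any real kernel.) */
struct sysenter_selectors {
    uint16_t kernel_cs; /* loaded by sysenter */
    uint16_t kernel_ss; /* loaded by sysenter: CS + 8 */
    uint16_t user_cs;   /* loaded by sysexit: CS + 16, RPL 3 */
    uint16_t user_ss;   /* loaded by sysexit: CS + 24, RPL 3 */
};

struct sysenter_selectors derive_selectors(uint16_t sysenter_cs)
{
    struct sysenter_selectors s;
    uint16_t base = (uint16_t)(sysenter_cs & 0xFFFCu); /* drop RPL bits */

    s.kernel_cs = base;
    s.kernel_ss = (uint16_t)(base + 8);
    s.user_cs   = (uint16_t)((base + 16) | 3);
    s.user_ss   = (uint16_t)((base + 24) | 3);
    return s;
}
```

With the null descriptor in slot 0 and kernel code at selector 0x08, this gives 0x10 for kernel data, 0x1B for user code and 0x23 for user data, which matches the GDT order listed above.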
Perhaps I can shed some light on the who-calls-what. Let's take printf() as an example, and a system using int 80.
User writes an application that includes <stdio.h> and uses printf().
Compiler compiles that application, and links in the clib containing the printf() code.
printf() parses the format string, processes the parameters, and comes up with a string (char *) holding the text to be printed. But the actual output has to be done through the kernel.
printf() calls write(), which is actually a stub that places the address of the string in one register, the file descriptor (usually an int) from the FILE struct in another register, and, in a third register, a number telling the handler for interrupt 80 which kernel function is actually required. Then it triggers int 80, waits for the interrupt handler (i.e., the kernel) to return, and does any register cleanup required to return like just another C function.
printf() does its own cleanup, and returns to the user.
There are no special provisions necessary for either printf() or write(). You could make it a static / dynamic / shared library, or whatever. You could use syscall or sysenter instead of int 80 (as Combuster described).
The point is that you need some way to trigger a kernel function (i.e., jumping from userspace to kernelspace and passing a couple of arguments), and the kernel must be able to "see" whatever any pointer arguments point to.
That's the magic: quite often, the user will pass a pointer to some buffer. The kernel must use the same page mapping as the user, or things will break. There are several ways you could do this, but the "canonical" way is to map the kernel functions into the user's page mapping, so the kernel can use the user's mapping while executing its functionality.
----
Note: This is theory. I didn't implement any such thing myself, so if I triggered some BS alarm, I ask the more experienced coders here to yell loudly.
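To make the who-calls-what concrete, here is a small hosted-C model of the write() stub described above. It is a sketch under loud assumptions: the trap itself is replaced by a function pointer so the example can run as an ordinary program (in a real library this is where `int 80`, sysenter or syscall would go), and the syscall number, the register assignment and every name in it are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical syscall number; a real kernel defines its own table. */
enum { SYS_WRITE = 4 };

/* Register image the stub fills in before trapping (illustrative layout). */
struct syscall_regs {
    uintptr_t eax; /* syscall number in, return value out */
    uintptr_t ebx; /* argument 1: file descriptor */
    uintptr_t ecx; /* argument 2: buffer address */
    uintptr_t edx; /* argument 3: byte count */
};

/* Stand-in for the trap instruction: control "enters the kernel" here. */
typedef void (*trap_fn)(struct syscall_regs *regs);

/* The userspace stub: looks like a normal C function to printf(). */
long write_stub(trap_fn trap, int fd, const void *buf, size_t len)
{
    struct syscall_regs regs;

    regs.eax = SYS_WRITE;
    regs.ebx = (uintptr_t)fd;
    regs.ecx = (uintptr_t)buf;
    regs.edx = (uintptr_t)len;
    trap(&regs);               /* real code would execute the trap here */
    return (long)regs.eax;     /* kernel's result comes back in eax */
}

/* Toy "kernel" side used only to exercise the stub. */
void fake_kernel_trap(struct syscall_regs *regs)
{
    if (regs->eax == SYS_WRITE && regs->ecx != 0)
        regs->eax = regs->edx;        /* pretend every byte was written */
    else
        regs->eax = (uintptr_t)-1;    /* report an error */
}
```

Because only write_stub() knows how the kernel is entered, switching from an interrupt to sysenter later would not touch printf() or any application code.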
Every good solution is obvious once you've found it.
Here are some tips for designing an API on both sides of the fence to help deal with this stuff and ease experimentation:
First, the kernel side:
You are going to pass your arguments either directly or indirectly in registers. Even if you pass them on the user-space stack, that is still pointed to by a register (%esp). So what I would do inside the kernel is write a dispatch system that figures out which system call was requested and calls the correct handler with the user-sent arguments properly moved onto the kernel C stack (assuming a C kernel), so that the handler sees them as normal C arguments. Pass the operation code in some register (I'll assume %eax) to have this single dispatch point.
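A minimal sketch of such a dispatch point might look like the following. Everything here is illustrative: the register struct, the two toy handlers, the error value and the convention of returning the result in eax are assumptions, not something prescribed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* User registers as saved by whatever entry stub was used (hypothetical). */
struct user_regs {
    uint32_t eax; /* operation code in, result out */
    uint32_t ebx; /* argument 1 */
    uint32_t ecx; /* argument 2 */
    uint32_t edx; /* argument 3 */
};

#define ERR_NOSYS (-1) /* made-up "no such system call" error */

/* Handlers receive the arguments as ordinary C parameters. */
typedef int32_t (*syscall_handler)(uint32_t a, uint32_t b, uint32_t c);

int32_t sys_answer(uint32_t a, uint32_t b, uint32_t c)
{
    (void)a; (void)b; (void)c;
    return 42;
}

int32_t sys_sum(uint32_t a, uint32_t b, uint32_t c)
{
    return (int32_t)(a + b + c);
}

static const syscall_handler syscall_table[] = { sys_answer, sys_sum };

/* Single dispatch point: the entry mechanism only has to fill user_regs. */
void syscall_dispatch(struct user_regs *regs)
{
    size_t count = sizeof syscall_table / sizeof syscall_table[0];
    uint32_t nr = regs->eax;

    if (nr >= count || syscall_table[nr] == NULL) {
        regs->eax = (uint32_t)ERR_NOSYS;
        return;
    }
    regs->eax = (uint32_t)syscall_table[nr](regs->ebx, regs->ecx, regs->edx);
}
```

An interrupt handler, a sysenter entry stub or a call gate target can all call syscall_dispatch() with the saved registers, so the handlers never need to know which mechanism was used.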
In the handler prototypes you can give pointers to user data a type of their own, something like

```c
typedef struct unsafe_data * unsafe_pointer;
```

If said struct doesn't really exist, it is impossible to dereference such a pointer unintentionally without a cast, and it becomes simple to find all the places in the source where you are actually dealing with unsafe pointers. You can then write a set of kernel helper functions that know how to check that those pointers are valid. Say:

```c
/* Copy from kernel to userspace memory */
int copy_unsafe_k2u(const void *src, unsafe_pointer dst, int len);
/* Copy from userspace to kernel memory */
int copy_unsafe_u2k(unsafe_pointer src, void *dst, int len);
```

You can then use those to copy data from userspace to kernel space. If you want to reference it directly, you could add something like unsafe_peek/poke or whatever. Using separate functions means that if you change something like your memory management such that your validation code has to change, you'll only need to modify one place.
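As a rough illustration of what such a helper could do, the sketch below validates a user range before copying. It is deliberately simplified and rests on assumptions: the `user_window` struct and the extra first parameter are hypothetical (added so the example can run as a normal program), and a real kernel would also have to consult the page tables and cope with faults rather than only compare addresses.

```c
#include <stdint.h>
#include <string.h>

typedef struct unsafe_data *unsafe_pointer;

/* Hypothetical description of the memory the current process may touch. */
struct user_window {
    uintptr_t start; /* first valid user address */
    uintptr_t end;   /* one past the last valid user address */
};

/* Returns 1 if [p, p + len) lies entirely inside the window. */
int user_range_ok(const struct user_window *w, unsafe_pointer p, int len)
{
    uintptr_t addr = (uintptr_t)p;

    if (len < 0)
        return 0;
    if (addr < w->start || addr > w->end)
        return 0;
    /* addr <= end here, so the subtraction cannot wrap around. */
    return (uintptr_t)len <= w->end - addr;
}

/* Copy from userspace to kernel memory; 0 on success, -1 on a bad range. */
int copy_unsafe_u2k(const struct user_window *w, unsafe_pointer src,
                    void *dst, int len)
{
    if (!user_range_ok(w, src, len))
        return -1;
    memcpy(dst, (const void *)src, (size_t)len);
    return 0;
}
```

The cast from unsafe_pointer happens in exactly one place, which is the point of the opaque type: every other use of user memory has to go through the checked helper.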
You can extend the dispatcher function to deal with dynamic registration of new system calls or whatever you want, and you only have to modify the logic in the dispatcher function. But there's an added benefit: it doesn't matter how you call into the dispatch function, as long as you can give it the user registers in some sort of a struct (or whatever). So you can implement syscalls by far calls, sysenter, interrupts, whatever... even all of them at once, and if you want to change that later, you only need to modify the code that calls the dispatcher (and returns to userspace with the result).
As for the userspace side of the fence, you can build one function (or inline assembler macro, or whatever you happen to prefer) which knows how to do a system call with a given set of parameters. To the rest of your userspace code, this should look like a normal function. You then include the macro and/or link the function into any code that uses system calls.
Now, if you decide to change the system call mechanism, all you need to do on the user side is modify this function to use the new method.
So... since you say you know how to do all this with an interrupt, I suggest you implement a setup similar to the above description (if you don't have it yet), check that it works, and either play with alternatives until you get them working, or just put it in the bin called "my kernel doesn't rely on a specific system call mechanism, and I have one mechanism that works, so I can go on and do something more productive first and come back and switch the mechanism if it seems later I would gain something".
Oh, and with some conditional checks on both sides of the fence, the same binary can use sysenter/sysexit (well, both the Intel and AMD solutions for it; I don't remember how much difference there was) on systems that can deal with that, and fall back to interrupts on systems that cannot.
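One possible form of that conditional check, sketched as a pure function over the values returned by CPUID leaf 1 so it can be tried outside a kernel. The enum and function names are made up; the SEP feature bit (EDX bit 11) and the early Pentium Pro exception (family 6, model below 3, stepping below 3) follow Intel's description of SYSENTER.

```c
#include <stdint.h>

/* Hypothetical names for the two entry mechanisms a binary might support. */
enum syscall_mechanism {
    SYSCALL_VIA_INT,      /* always-available software interrupt */
    SYSCALL_VIA_SYSENTER  /* fast path where the CPU supports it */
};

#define CPUID_EDX_SEP (1u << 11) /* SYSENTER/SYSEXIT present */

/* cpuid1_eax / cpuid1_edx: EAX and EDX as returned by CPUID leaf 1. */
enum syscall_mechanism pick_mechanism(uint32_t cpuid1_eax, uint32_t cpuid1_edx)
{
    uint32_t stepping = cpuid1_eax & 0xFu;
    uint32_t model    = (cpuid1_eax >> 4) & 0xFu;
    uint32_t family   = (cpuid1_eax >> 8) & 0xFu;

    /* Families 6 and 15 extend the model number with bits 16..19. */
    if (family == 6 || family == 15)
        model |= ((cpuid1_eax >> 16) & 0xFu) << 4;

    if (!(cpuid1_edx & CPUID_EDX_SEP))
        return SYSCALL_VIA_INT;

    /* Early Pentium Pro parts set SEP without implementing the instructions. */
    if (family == 6 && model < 3 && stepping < 3)
        return SYSCALL_VIA_INT;

    return SYSCALL_VIA_SYSENTER;
}
```

The kernel would run this once at boot, and the userspace wrapper would need some way to learn the result, for instance a flag or entry address handed over at process start.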
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.
Thanks for this implementation tip.
My worry was precisely how to implement it properly. I already have an interrupt syscall handler, like int 0x80 on Linux, but it is ugly and not really usable from userspace without hacks... but at least it works.
Having a common interface for any syscall method is a good idea, I think. Tell me if I am wrong, but I think something similar is done on Linux: int 0x80 is always available, but if you wish you may use sysenter/sysexit as well.
I will follow your suggestion and try it with the interrupt method first.
I was mostly interested in sysenter/sysexit in order to benchmark it: I have a lot of IPC syscalls, so even a small gain is definitely worth it. Maybe I have written the first OS that does (almost) nothing, but does it slowly.
However, I discovered some bugs in my paging code (hmm, at least I think that is where they come from)... I am going back to work; one day it will work.
Yeah, Linux has something a bit like that, as does more or less any serious OS except some microkernel systems. If there are very few operations (I have just send and receive), then using separate interrupt vectors for them frees one more register for other stuff, which can outweigh the benefit of using a slightly faster system call method.
Then again, somebody said that a true microkernel is basically a kernel with everything portable and general removed. (Say, to port a microkernel system to a new architecture, the best bet is to write a new microkernel that's functionally compatible with the old one but runs on the new architecture.)
Anyway, if you have a microkernel system where IPC is the bottleneck, then I'd first try to remove anything unnecessary from the IPC path, and only then optimize the actual system call mechanism once it seems your IPC path otherwise is near optimal and your IPC design is simple and sound. It's also worth trying to design an IPC method that can avoid unnecessary round trips in the first place (which is surprisingly complicated, btw).
Simply touching too many kernel structures tends to poison caches and hence slow stuff down, so your best bet is to try to identify the most common operations (like sending a short message) and try to remove everything unnecessary from that path, then conditionally branch into the more general case if it's necessary.
- Kevin McGuire
RE: System Call Implementation
-- copied from http://en.wikipedia.org/wiki/System_call --
Implementing system calls requires a control transfer which involves some sort of architecture specific feature. A typical way to implement this is to use a software interrupt or trap. Interrupts transfer control to the kernel so software simply needs to set up some register with the system call number they want and execute the software interrupt. Linux uses this implementation on x86 where the system call number is placed in the EAX register before interrupt 0x80 is executed.
For many RISC processors this is the only feasible implementation, but CISC architectures such as x86 support additional techniques. One example is SYSCALL/SYSRET which is very similar to SYSENTER/SYSEXIT (the two mechanisms were created independently by AMD and Intel respectively, but do basically the same thing). These are "fast" control transfer instructions that are designed to quickly transfer control to the kernel for a system call without the overhead of an interrupt.
An older x86 mechanism is called a call gate and is a way for a program to literally call a kernel function directly using a safe control transfer mechanism the kernel sets up in advance. This approach has been unpopular, presumably due to the requirement of far calls, which use x86 segmentation, and the resulting lack of portability, as well as the existence of the faster instructions mentioned above.
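For the software-interrupt approach quoted above, the kernel has to install an IDT entry that ring 3 code is allowed to invoke. The sketch below encodes a 32-bit interrupt gate with DPL 3; the struct layout follows the x86 protected-mode gate format, while the function name and the choice of an interrupt gate over a trap gate are assumptions made for illustration.

```c
#include <stdint.h>

/* 32-bit protected-mode IDT gate descriptor (8 bytes). */
struct idt_gate {
    uint16_t offset_low;  /* handler address, bits 0..15 */
    uint16_t selector;    /* kernel code segment selector */
    uint8_t  zero;        /* reserved, must be 0 */
    uint8_t  type_attr;   /* present bit, DPL and gate type */
    uint16_t offset_high; /* handler address, bits 16..31 */
};

#define GATE_PRESENT 0x80u /* descriptor is valid */
#define GATE_DPL3    0x60u /* callable from ring 3 via int n */
#define GATE_INT32   0x0Eu /* 32-bit interrupt gate */

/* Build the gate for the system call vector (e.g. 0x80). */
struct idt_gate make_syscall_gate(uint32_t handler, uint16_t kernel_cs)
{
    struct idt_gate g;

    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = kernel_cs;
    g.zero        = 0;
    g.type_attr   = (uint8_t)(GATE_PRESENT | GATE_DPL3 | GATE_INT32); /* 0xEE */
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

Without the DPL 3 bits, an `int` instruction executed in ring 3 raises a general protection fault instead of reaching the handler, which is why the syscall vector needs different attributes from the hardware interrupt vectors.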
>> Context: x86 protected mode + paging, kernel ring 0, user process ring
>> 3, kernel is not mapped in user space (might be a problem in the near
>> future).
(problem)
The problem here is that the interrupt service routine needs to be mapped into user space, or it will not work (at least not easily).
Some code has to switch the address space before you can call or jump into a kernel function. I like this code to be privileged, so I generally map my kernel into each user-space process. SYSENTER and SYSEXIT do not change the CPU's CR3 register, so if the current virtual address space does not map the kernel entry point, the system will crash.
From reading the latter link above, I noticed that Linux creates one page of memory that is shared with all processes. This page contains some code to perform the actual SYSENTER and SYSEXIT.
You _should_ be able to include some code there to switch address spaces, as long as this shared page remains at the exact same virtual address in kernel mode and in all user-mode processes. This is most likely also where things can get complicated, as you can see from the article:
>> Kernel also setups system call entry/exit points for user processes. Kernel creates a single page in the memory and attaches it to all processes' address space when they are loaded into memory. This page contains the actual implementation of the system call entry/exit mechanism.
Linux puts a function called __kernel_vsyscall into this 4096-byte page, which holds the SYSENTER/SYSEXIT code. The article also notes that its address is not fixed, which means this page is not at a fixed location in memory:
>> Initiation: Userland processes (or C library on their behalf) call __kernel_vsyscall to execute system calls. Address of __kernel_vsyscall is not fixed.
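Since the entry point is not at a fixed address, userspace has to be told where it is. On 32-bit Linux this is done through the auxiliary vector passed to each new process, whose AT_SYSINFO entry (type 32) carries the address of __kernel_vsyscall. The sketch below shows the lookup in a simplified form: the struct and function names are made up, and a hobby kernel is of course free to pass the address some other way.

```c
#include <stdint.h>

#define AUX_NULL    0u  /* end-of-vector marker (AT_NULL) */
#define AUX_SYSINFO 32u /* entry point of the syscall page (AT_SYSINFO on Linux/i386) */

/* One auxiliary vector entry as seen by a 32-bit process (simplified). */
struct aux_entry {
    uint32_t type;
    uint32_t value;
};

/* Walk the vector and return the syscall entry address, or 0 if absent. */
uint32_t find_vsyscall_entry(const struct aux_entry *auxv)
{
    for (; auxv->type != AUX_NULL; ++auxv) {
        if (auxv->type == AUX_SYSINFO)
            return auxv->value;
    }
    return 0; /* caller should fall back to the interrupt mechanism */
}
```

A C library can do this lookup once at startup and route every system call stub through the returned address, falling back to the interrupt when the entry is missing.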
SYSENTER and SYSEXIT seem to be faster on certain Pentium machines. AMD machines apparently implement SYSCALL and SYSRET instead. I am not sure whether there is any dramatic difference between the two.
I think you are about to step into an unforgiving world of troubleshooting and endless hours spent trying to find a way to avoid mapping the kernel and the user process into the same address space. It is also very _simple_ to map them together, and it saves a lot of clock cycles!
During my search for more details on sysenter/sysexit and syscall/sysret, I found this link quite interesting: http://www.sandpile.org/post/msgs/20003633.htm
More details on syscall implementation in Solaris can be found here:
http://blogs.sun.com/rab/entry/x86_syscall_primer
L4 strategy?
http://www.pagetable.com/?p=9
More details on Linux syscalls:
http://manugarg.googlepages.com/systemc ... ux2_6.html
I think I will put everything in the wiki...
I have made a first draft of a wiki page, feel free to modify it.
http://www.osdev.org/wiki/System_Calls
Regards,
Ineo