SYSENTER/SYSEXIT
Posted: Tue Jul 24, 2007 3:10 pm
by JamesM
I'm about to implement syscalls for the latest incarnation of my OS (a complete rewrite in C++), and thought about using the x86's fast syscall instructions. They get you into ring0 and out again as fast as possible, without saving any register state. That's not a problem, because generally the only time you need to save register state is when you're switching processes (pre-empting). Otherwise, the compiler knows it's making a call, so it expects most registers to be clobbered (the exceptions being ebx, for some reason, and ebp).
They seem ideal, just wondering if anyone else has used them? Because I know Linux doesn't and I haven't heard much chat about them...
Also, as a quick aside, does anyone know how to make a pure-asm member function? E.g.
Code: Select all
class blah {
    int func()
    { __asm("blah"); }
};
The problem with that code is the compiler will make something like this:
Code: Select all
_mY3MANGLED_func_333:
    // compiler initialise stack frame
    mov ebp, esp
    push ...
    blah
    leave
    ret
I don't want it to make the prologue and epilogue, I want to control them myself. Anyone know any way to do that? (Never actually come across a need for it before!)
cheers,
JamesM
Re: SYSENTER/SYSEXIT
Posted: Tue Jul 24, 2007 4:41 pm
by pcmattman
JamesM wrote:They seem ideal, just wondering if anyone else has used them? Because I know linux doesn't and haven't heard much chat about them...
According to this article there is a 266% increase in speed (for an example system call) from the standard 'int 0x80' to the SYSENTER/SYSEXIT method.
To use these, you'll need to access MSRs. Google it, you'll find hundreds of ways to do this.
The MSRs you need are numbers 0x174-0x176 (the numbers are hexadecimal). 0x174 holds the code segment of the system call handler, 0x175 holds the ESP of the system call, and 0x176 holds the EIP of the system call.
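As a rough illustration of the MSR setup, here is a minimal sketch assuming GCC-style inline assembly and code running in ring0. The names `write_msr`, `split_msr_value` and `setup_sysenter`, and the parameters `kernel_cs`, `kernel_stack_top` and `entry`, are made up for this example and are not taken from the article or from any real kernel.

```cpp
#include <cstdint>

// Architectural MSR addresses (hexadecimal) used by SYSENTER.
constexpr std::uint32_t IA32_SYSENTER_CS  = 0x174;
constexpr std::uint32_t IA32_SYSENTER_ESP = 0x175;
constexpr std::uint32_t IA32_SYSENTER_EIP = 0x176;

// WRMSR takes the MSR number in ECX and the 64-bit value split across EDX:EAX.
struct MsrHalves {
    std::uint32_t lo;  // goes in EAX
    std::uint32_t hi;  // goes in EDX
};

constexpr MsrHalves split_msr_value(std::uint64_t value) {
    return { static_cast<std::uint32_t>(value),
             static_cast<std::uint32_t>(value >> 32) };
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
// Ring0 only: executing WRMSR from user mode raises a general protection fault.
inline void write_msr(std::uint32_t msr, std::uint64_t value) {
    const MsrHalves h = split_msr_value(value);
    asm volatile("wrmsr" : : "c"(msr), "a"(h.lo), "d"(h.hi));
}

// Hypothetical setup routine; all three arguments are placeholders that the
// kernel would supply (its code selector, a kernel stack top, the entry stub).
inline void setup_sysenter(std::uint16_t kernel_cs,
                           std::uintptr_t kernel_stack_top,
                           std::uintptr_t entry) {
    write_msr(IA32_SYSENTER_CS, kernel_cs);
    write_msr(IA32_SYSENTER_ESP, kernel_stack_top);
    write_msr(IA32_SYSENTER_EIP, entry);
}
#endif
```

One thing to keep in mind when choosing the value for 0x174: SYSENTER loads SS with that selector plus 8, and SYSEXIT returns to user CS at plus 16 and user SS at plus 24, so the four GDT entries have to sit next to each other in that order.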
On my WinXP machine these three MSRs are set to:
Code: Select all
lkd> rdmsr 174
msr[174] = 00000000`00000008
lkd> rdmsr 175
msr[175] = 00000000`f8951000
lkd> rdmsr 176
msr[176] = 00000000`804de6f0
One note - below is an example set of functions and their methods (taken from the above article, based on Windows)
Code: Select all
Kernel Function Name   Call style   Exit instruction
KiSystemCallExit       'int 2e'     iretd
KiSystemCallExit2      SYSENTER     SYSEXIT
KiSystemCallExit3      SYSCALL      SYSRET
Note that SYSENTER/SYSEXIT is the Intel way, and SYSCALL/SYSRET is the AMD way. You'll need to use the CPUID instruction to find out which one to use.
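The CPUID check can be sketched roughly like this. The functions below only decode register values that you have already read with CPUID (with GCC, for instance, through `__get_cpuid` from `<cpuid.h>`); the names `has_sysenter`, `has_syscall` and `decode_signature` are invented for this example. SYSENTER support is bit 11 (SEP) of EDX from leaf 1, and SYSCALL support is bit 11 of EDX from leaf 0x80000001. Intel also documents that early Pentium Pro parts set SEP without supporting the instructions, hence the extra signature check.

```cpp
#include <cstdint>

constexpr std::uint32_t kBit11 = 1u << 11;

// Family, model and stepping as packed into EAX of CPUID leaf 1
// (base fields only; the extended fields are ignored in this sketch).
struct CpuSignature {
    std::uint32_t family;
    std::uint32_t model;
    std::uint32_t stepping;
};

constexpr CpuSignature decode_signature(std::uint32_t leaf1_eax) {
    return { (leaf1_eax >> 8) & 0xFu, (leaf1_eax >> 4) & 0xFu, leaf1_eax & 0xFu };
}

// leaf1_eax / leaf1_edx: EAX and EDX returned by CPUID with EAX = 1.
inline bool has_sysenter(std::uint32_t leaf1_eax, std::uint32_t leaf1_edx) {
    if ((leaf1_edx & kBit11) == 0) {
        return false;  // SEP flag clear
    }
    const CpuSignature s = decode_signature(leaf1_eax);
    // Early Pentium Pro: SEP is reported but SYSENTER/SYSEXIT are not usable.
    if (s.family == 6 && s.model < 3 && s.stepping < 3) {
        return false;
    }
    return true;
}

// ext_leaf_edx: EDX returned by CPUID with EAX = 0x80000001.
inline bool has_syscall(std::uint32_t ext_leaf_edx) {
    return (ext_leaf_edx & kBit11) != 0;
}
```

Keeping the decoding separate from the CPUID instruction itself means the logic can be checked on any machine with canned register values.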
Basically, when you want to run the system call, you set up the arguments and then execute 'sysenter' (or 'syscall').
I've grossly oversimplified this but I hope it gets the idea across.
Posted: Tue Jul 24, 2007 5:18 pm
by Kevin McGuire
I am pretty sure you already know this, but I am going to go over it anyway. One of the problems is that changing GCC's flags can cause enormous differences in the emitted machine instructions. Right there you have the problem of writing the prologue and epilogue that GCC expects: you have no _easy_ way to generate the code it wants when it can change at the flick of a switch, such as going from -O0 to -O3. Using -O3 allows the compiler to start reordering instructions, and this includes moving something from the prologue or epilogue of the function into the guts of it if need be. With -fomit-frame-pointer (first listing) and without it (second listing) you can see:
Code: Select all
08048430 <_ZN2cA1fEv>:
 8048430:   8b 44 24 04             mov    0x4(%esp),%eax
 8048434:   c7 00 01 00 00 00       movl   $0x1,(%eax)
 804843a:   c3                      ret
 804843b:   90                      nop
 804843c:   8d 74 26 00             lea    0x0(%esi),%esi
Code: Select all
0804841c <_ZN2cA1fEv>:
 804841c:   55                      push   %ebp
 804841d:   89 e5                   mov    %esp,%ebp
 804841f:   8b 45 08                mov    0x8(%ebp),%eax
 8048422:   c7 00 01 00 00 00       movl   $0x1,(%eax)
 8048428:   5d                      pop    %ebp
 8048429:   c3                      ret
Now I am about to suggest a sludgy workaround, since I am assuming that the code GCC generates is not the actual problem; the problem is which code gets executed first. I am deducing this from your mention of system calls, which can arrive through interrupts or an interrupt-like mechanism provided by the special system call instructions.
This sludgy workaround is really just a method to wrap any or all virtual methods in a class with your own prolog and epilog code, while preserving the code GCC emits. The only pitfall is if the C++ ABI changes for GCC and it starts passing the this pointer with another mechanism besides the first argument on the stack, or something similar, so this is at your own risk.
This code creates a new virtual table for the class instance, and modifies the specified virtual functions to call the wrapper first.
example.cc
Code: Select all
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdarg.h>
#include <malloc.h>

class cA
{
public:
    uint32_t val;
    cA()
    {
        val = 10;
    }
    ~cA()
    {
        val = 20;
    }
    virtual void fa(){val = 1; printf("called me\n");}
    virtual void fb(){val = 1;}
    virtual void fc(){val = 1;}
    virtual void fd(){val = 1;}
};

// labels defined in example.s; only their addresses are used.
extern uint8_t class_vfunc_wrapper_start;
extern uint8_t class_vfunc_wrapper_end;
extern uint8_t class_vfunc_wrapper_loadreg_prologcall;
extern uint8_t class_vfunc_wrapper_loadreg_membercall;
extern uint8_t class_vfunc_wrapper_loadreg_epilogcall;

// count is a full-width integer so that va_start is well defined.
void wrap_class_vfuncs(uintptr_t instance, uintptr_t prolog, uintptr_t epilog, uint32_t vfc, uint32_t count, ...)
{
    uintptr_t *vte = (uintptr_t*)((uintptr_t*)instance)[0];
    va_list ap;
    va_start(ap, count);
    // create a new virtual table.
    uintptr_t *nvt = new uintptr_t[vfc];
    for(uint32_t x = 0; x < vfc; ++x)
    {
        nvt[x] = vte[x];
    }
    // set new virtual table in instance.
    ((uintptr_t*)instance)[0] = (uintptr_t)nvt;
    // size of the wrapper template in example.s.
    uintptr_t wrapperSize = (uintptr_t)&class_vfunc_wrapper_end - (uintptr_t)&class_vfunc_wrapper_start;
    for(; count > 0; --count)
    {
        // yield thirty-two bit index from byte index.
        uint32_t index = (va_arg(ap, uintptr_t) >> 2);
        // we found function's address in the vtable, create a wrapper instance.
        // note: the heap must be executable for the copied code to run.
        uint8_t *wrapperCode = (uint8_t*)memalign(32, wrapperSize);
        // copy wrapper code into wrapper instance.
        memcpy(wrapperCode, &class_vfunc_wrapper_start, wrapperSize);
        printf("wrapper code size:%x\n", (unsigned)wrapperSize);
        // set wrapper instance call values (very hacky): +1 skips the opcode
        // byte of each "movl $0, %ebx" so the 32-bit immediate is patched.
        *(uint32_t*)(((uintptr_t)&class_vfunc_wrapper_loadreg_prologcall - (uintptr_t)&class_vfunc_wrapper_start) + (uintptr_t)wrapperCode + 1) = prolog;
        *(uint32_t*)(((uintptr_t)&class_vfunc_wrapper_loadreg_epilogcall - (uintptr_t)&class_vfunc_wrapper_start) + (uintptr_t)wrapperCode + 1) = epilog;
        *(uint32_t*)(((uintptr_t)&class_vfunc_wrapper_loadreg_membercall - (uintptr_t)&class_vfunc_wrapper_start) + (uintptr_t)wrapperCode + 1) = vte[index];
        printf("vte:%x\n", (unsigned)vte[index]);
        nvt[index] = (uintptr_t)wrapperCode;
        printf("wrapped class member function, wrapper function address %p\n", (void*)wrapperCode);
    }
    va_end(ap);
    return;
}

typedef void (*pmf)(void);
extern uint8_t example_prolog;
extern uint8_t example_epilog;

int main()
{
    cA *a = new cA();
    wrap_class_vfuncs((uintptr_t)a, (uintptr_t)&example_prolog, (uintptr_t)&example_epilog, 4, 1, &cA::fa);
    a->fa();
    a->fa();
    return 0;
}
This is the wrapper code, and it includes a separate function for the prolog and epilog. You do not have to use a separate function, but I included them just for show.
example.s
Code: Select all
.global class_vfunc_wrapper_start
.global class_vfunc_wrapper_end
.global class_vfunc_wrapper_loadreg_prologcall
.global class_vfunc_wrapper_loadreg_membercall
.global class_vfunc_wrapper_loadreg_epilogcall

class_vfunc_wrapper_start:
class_vfunc_wrapper_loadreg_prologcall:
    movl $0, %ebx
    call *%ebx
class_vfunc_wrapper_loadreg_membercall:
    movl $0, %ebx
    movl 4(%esp), %eax
    push %eax
    call *%ebx
    pop %edx
    push %eax
class_vfunc_wrapper_loadreg_epilogcall:
    movl $0, %ebx
    call *%ebx
    pop %eax
    ret
class_vfunc_wrapper_end:

.global example_prolog
.global example_epilog

example_prolog:
    ret
example_epilog:
    ret
g++ example.cc example.s -o example
i dont think
Posted: Tue Jul 24, 2007 8:07 pm
by com1
i don't think i would use them. i mean, that's just me. when you write a process table you have to save the registers and stack data, and you also have threads running in registers. if you made your own syscalls you wouldn't have to worry about the CPU directly handling them, am i right?
Naked
Posted: Wed Jul 25, 2007 12:11 am
by Mark139
I can't remember the exact details, but I'm sure there is a "naked" keyword for implementing pure ASM function bodies.
Re: i dont think
Posted: Wed Jul 25, 2007 12:34 am
by pcmattman
com1 wrote:i don't think i would use them. i mean, that's just me. when you write a process table you have to save the registers and stack data, and you also have threads running in registers. if you made your own syscalls you wouldn't have to worry about the CPU directly handling them, am i right?
The idea of using these commands is that you avoid the time-consuming process of saving all the registers just to run what is most likely to be a very small amount of code.
SYSENTER and SYSCALL basically allow a ring3 process to call ring0 code and still have all the speed advantages of a typical 'call'.
This is the main reason why you can get such a speed increase using SYSENTER (or SYSCALL).
As an endnote, I'll give another reason why not to use an interrupt...
I am writing code at the moment for userspace processes to talk to each other (IPC). I have in the mattiseRecvMessage function a loop that queries the kernel to find out if there is a message ready for reading. This is an int 0x80 call, and means that I never get to reschedule again. Why? ISRs disable interrupts on entry - can you see why?
If I use SYSENTER/SYSCALL I can treat it as though I'm calling a function in kernel space (actually, I pretty much am doing just that).
Posted: Wed Jul 25, 2007 12:58 am
by JamesM
Thanks for the replies guys.
pcmattman wrote:Google it, you'll find hundreds of ways to do this.
I'm less worried about the implementation of it (I worked out / googled how to do it anyway), more the reasons why/why not people have/have not used them.
@kevin:
Wow, that's some seriously kludgy-looking code. I'll pore over it later, and google vtable modification (I hadn't thought about doing it that way) and yes, your assumption was 100% correct.
@com1:
Sorry, one of the things I forgot to mention is that I would still be using interrupts for task switches and fork()/clone(). Reasons for which are of course you need to save the register state, and the way my fork() works is it modifies the user's stack so that on IRET it jumps somewhere else. So I need IRET to do it! These would be for 'light' syscalls, like write/read etc.
@mark139:
Yes, you are right, there is a __attribute__((naked)), however I had already read the manual about this and it doesn't work for x86 architectures. (I tested it and the manual wasn't wrong, sadly.)
Thanks for the replies guys.
JamesM
Posted: Wed Jul 25, 2007 1:27 am
by os64dev
I develop in 64-bit long mode and do use syscall/sysret, and they work like a charm. Because 64-bit requires a flat memory model, I don't see any use for interrupt-based system calls.
as for the pure asm method i would use:
Code: Select all
class test {
public:
    int MyAsmMethod(void);
    int MyCppMethod(void);
};

int test::MyCppMethod(void) {
    //- this is the c++ method
    return(1);
}

asm (
    ".global __ZN4test11MyAsmMethodEv;"
    "__ZN4test11MyAsmMethodEv:"
    "    movl $2, %eax;"
    "    ret;"
);

#include <iostream>
using namespace std;

int main(void) {
    test x;
    cout << x.MyAsmMethod() << endl;
    cout << x.MyCppMethod() << endl;
}
This is rather dodgy code as it depends heavily on the function name generation, but it works
Posted: Wed Jul 25, 2007 7:08 am
by JamesM
os64dev:
Yes, I thought about doing it that way, but I haven't as yet with my constant googling found a way to programmatically mangle identifier names the way g++ does. If there was an inbuilt GXX_MANGLE macro or something it would be dead easy...
JamesM
Posted: Wed Jul 25, 2007 3:14 pm
by com1
to pcmattman: are you using message passing? semaphores are so annoying due to the critical region stuff...
Posted: Thu Jul 26, 2007 12:07 am
by os64dev
JamesM wrote:os64dev:
Yes, I thought about doing it that way, but I haven't as yet with my constant googling found a way to programmatically mangle identifier names the way g++ does. If there was an inbuilt GXX_MANGLE macro or something it would be dead easy...
JamesM
Well... I also tried it with virtual members, and then everything stops working due to the vtable issues. Maybe it is possible to declare a friend function for an object which takes a pointer to that object as a parameter and thus has full access to the object. That function can be inline assembly then.
Code: Select all
class object;
extern "C" void doStuff(object *); // C linkage, so the symbol is not mangled
class object {
    friend void doStuff(object *);
};
asm (".global doStuff; doStuff: ret;");
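Building on the friend-function idea, here is a fuller sketch that sidesteps name mangling entirely: the assembly body is a free function with C linkage that receives the object pointer explicitly, and a thin inline member forwards to it. Everything here (`object_get_value`, the `value` member, `get()`) is invented for illustration, the assembly variants assume Linux with the System V x86-64 or i386 cdecl calling convention, and a plain C++ fallback is included so it still builds elsewhere.

```cpp
#include <cstdint>

class object;
// C linkage keeps the symbol unmangled, so hand-written assembly can define it.
extern "C" std::int32_t object_get_value(const object *self);

class object {
public:
    explicit object(std::int32_t v) : value(v) {}
    // Callers still write o.get(); the real body lives in the asm below.
    std::int32_t get() const { return object_get_value(this); }
private:
    friend std::int32_t object_get_value(const object *self);
    std::int32_t value;  // only data member, so it sits at offset 0
};

#if defined(__linux__) && defined(__x86_64__)
// System V x86-64: self arrives in %rdi, the result is returned in %eax.
asm(
    ".pushsection .text\n"
    ".globl object_get_value\n"
    ".type object_get_value, @function\n"
    "object_get_value:\n"
    "    movl (%rdi), %eax\n"
    "    ret\n"
    ".popsection\n"
);
#elif defined(__linux__) && defined(__i386__)
// cdecl: self is the first stack argument, at 4(%esp) on entry.
asm(
    ".pushsection .text\n"
    ".globl object_get_value\n"
    ".type object_get_value, @function\n"
    "object_get_value:\n"
    "    movl 4(%esp), %eax\n"
    "    movl (%eax), %eax\n"
    "    ret\n"
    ".popsection\n"
);
#else
// Portable fallback so the sketch builds on other targets.
extern "C" std::int32_t object_get_value(const object *self) {
    return self->value;
}
#endif
```

The cost is one extra (inlinable) forwarding call per method. This does not help with virtual members by itself, although a virtual member can forward to such a function in the same way.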
Posted: Thu Jul 26, 2007 12:52 am
by JamesM
The way I've elected to do it is to use 2 macros in each member function that should be in kernel mode.
Code: Select all
#define START_KERNEL \
    u32int was_ring_3 = 0; \
    if (isring3()) { \
        was_ring_3 = 1; asmStartKernel(); \
    }

#define END_KERNEL \
    if (was_ring_3) { \
        asmEndKernel(curProcess->esp0); \
    }

void Class::MemberFunc()
{
    START_KERNEL
    code...
    END_KERNEL
}
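For comparison, the START_KERNEL/END_KERNEL pair above could also be written as a C++ scope guard, so that the exit path runs on every return from the member function. This is only a sketch of the idiom: `KernelScope`, `RingState` and `ring_state()` are hypothetical, the three stub functions are mock stand-ins that merely record what happened, and `asmEndKernel` drops its `esp0` argument for brevity.

```cpp
// Mock state so the sketch can run in user space; a real kernel would
// inspect CS and execute sysenter/sysexit instead.
struct RingState {
    bool in_ring3 = true;
    int enters = 0;
    int exits = 0;
};

inline RingState &ring_state() {
    static RingState s;
    return s;
}

// Mock versions of the stubs named in the macros.
inline bool isring3() { return ring_state().in_ring3; }
inline void asmStartKernel() { ring_state().in_ring3 = false; ++ring_state().enters; }
inline void asmEndKernel() { ring_state().in_ring3 = true; ++ring_state().exits; }

// Enters the kernel on construction if needed, leaves it on destruction.
class KernelScope {
public:
    KernelScope() : was_ring3_(isring3()) {
        if (was_ring3_) asmStartKernel();
    }
    ~KernelScope() {
        if (was_ring3_) asmEndKernel();
    }
    KernelScope(const KernelScope &) = delete;
    KernelScope &operator=(const KernelScope &) = delete;
private:
    bool was_ring3_;  // same role as was_ring_3 in the macros
};

// Stand-in for a member function body with more than one return path.
inline int member_func_body(bool early) {
    KernelScope guard;
    if (early) return 1;  // guard's destructor still runs here
    return 2;
}
```

Whether the real stubs would tolerate being called from a constructor and destructor frame, rather than directly from the member function, is untested here; the stack-copying trick in the stubs may well depend on it.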
Pseudocode:
Code: Select all
asmStartKernel:
    ebx <- ebp              ; put the current stackframe base pointer in ebx
    ecx <- esp              ; and the current stack pointer
    eax <- .my_tmp_label    ; EIP to jump to in eax.
    sysenter
.my_tmp_label:
    ret

; the location of this symbol is in the SYSENTER_EIP_MSR.
asmStartKernelHandler:
    push ebx                ; push a pointer to the user stack base.
    ; copy the stack frame from the user stack using ecx, ebx.
    jmp eax

asmEndKernel:
    eax <- [esp+4]          ; eax = first argument = esp0 = kernel stack base.
    esp <- [eax]            ; esp = dereference of esp0 = the pushed stackpointer.
    pop ebp                 ; restore the base pointer
    pop edx                 ; put the return addr in edx
    mov ecx, esp            ; put the return esp in ecx for sysexit
    sysexit
That code is probably wrong, I'm at work now and just wrote it out from memory! And the naming convention etc is completely different (worse) to what I use in my actual code.
You think it'll work?
JamesM
Posted: Thu Jul 26, 2007 1:33 am
by Kevin McGuire
I do not see why you do not just bring all code paths into the kernel from an interrupt or syscall instruction through one shared point, as you might want to save the current thread state, and it makes no sense to make your code more complex by writing redundant code over and over.
That way you should not even have to think about hacking about the class prolog and epilog?
Posted: Thu Jul 26, 2007 1:59 am
by JamesM
Hi Kevin:
If I want to pre-empt the process or thread, that will be done through a shared interrupt. Also the yield() syscall will go through the shared interrupt. The point here is that I'm making a microkernel, so IPC and fast syscalls are very important. There is a mechanism in the CPU (sysenter/exit) for 'lite' syscalls, why not use them?
And, I don't think the code I posted looks *so* bad. It's a little hacky, but no more stack fiddlage than in an interrupt handler. Plus the code looks more compact in the version I wrote last night at home.
This is the second incarnation of my kernel, I wanted to try something new - it looks quite promising tbh. The only 'kludge' seems to be finding a way to prettily macro-ify the ring switch.
JamesM
Posted: Fri Jul 27, 2007 2:25 am
by Kevin McGuire
I just looked at the SYSENTER instruction, and it appears that you would indeed be working with a shared entry point.
So I have to ask then why do you need to have assembly stubs for your class? Why not have everything written in one function before the call to a class method?
call_table:
    .long class_member0
    .long class_member1
system_enter:
    .... setup kernel stacks, save state, ...
    push CLASS_THIS_POINTER
    mov $call_table, %eax       # address of the table, not its first entry
    mov (%eax,%ebx,4), %ebx     # ebx indexes the table
    call *%ebx
    .... resume saved state ...
    sysexit
I should not have to rewrite or fiddle with any of the prolog and epilog for class members by doing it this way.
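The same shared-entry idea can be sketched on the C++ side: the assembly stub sets up stacks and state, then calls one dispatch function that indexes a table of member functions. The `Kernel` class, its two methods, `dispatch_syscall` and `kBadSyscall` are all hypothetical names made up for this example. The bounds check matters because the call number comes from user space.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel object whose members implement the system calls.
class Kernel {
public:
    std::int32_t get_ticks(std::uint32_t) { return ticks_; }
    std::int32_t add_ticks(std::uint32_t n) {
        ticks_ += static_cast<std::int32_t>(n);
        return ticks_;
    }
private:
    std::int32_t ticks_ = 0;
};

using Handler = std::int32_t (Kernel::*)(std::uint32_t);

// Equivalent of call_table in the assembly, holding member function pointers.
constexpr Handler call_table[] = { &Kernel::get_ticks, &Kernel::add_ticks };
constexpr std::size_t call_count = sizeof(call_table) / sizeof(call_table[0]);
constexpr std::int32_t kBadSyscall = -1;

// The single function the entry stub would call once the kernel stack and
// saved state are in place; number and arg come from user registers.
inline std::int32_t dispatch_syscall(Kernel &k, std::uint32_t number, std::uint32_t arg) {
    if (number >= call_count) {
        return kBadSyscall;  // never index the table with an unchecked value
    }
    return (k.*call_table[number])(arg);
}
```

Because the compiler generates the member calls, the prolog and epilog of each class member stay exactly as GCC emits them, whatever optimisation flags are used.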