OSDev.org

Posted: **Tue Jul 24, 2007 3:10 pm**

I'm about to implement syscalls for the latest incarnation of my OS (a complete rewrite in C++), and thought about using the x86's fast syscall instructions. They get you into ring 0 and out again as fast as possible, without saving any register state. That's not a problem, because generally the only time you need to save register state is when you're switching processes (pre-empting). Otherwise, the compiler knows it's making a call, so it expects most registers to be clobbered (the exceptions are the callee-saved ones: ebx, esi, edi and ebp).

They seem ideal, just wondering if anyone else has used them? Because I know Linux doesn't, and I haven't heard much chat about them...

Also, as a quick aside, does anyone know how to make a pure-asm member function? E.g.:

Code: Select all

class blah {
  int func()
  { __asm("blah");}
};

The problem with that code is the compiler will make something like this:

Code: Select all

_mY3MANGLED_func_333:
  // compiler initialise stack frame
  mov ebp, esp
  push ...

  blah

  leave
  ret

I don't want it to generate the prologue and epilogue; I want to control them myself. Does anyone know a way to do that? (I've never actually come across a need for it before!)

cheers,

JamesM

Posted: **Tue Jul 24, 2007 4:41 pm**

JamesM wrote:They seem ideal, just wondering if anyone else has used them? Because I know Linux doesn't, and I haven't heard much chat about them...

According to this article there is a 266% increase in speed (for an example system call) from the standard 'int 0x80' to the SYSENTER/SYSEXIT method.

To use these, you'll need to access MSRs. Google it, you'll find hundreds of ways to do this.

The MSRs you need are numbers 0x174-0x176. 0x174 (IA32_SYSENTER_CS) holds the code segment selector of the system call handler, 0x175 (IA32_SYSENTER_ESP) holds the ESP loaded on entry, and 0x176 (IA32_SYSENTER_EIP) holds the EIP of the handler.

On my WinXP machine these three MSRs are set to:

Code: Select all

lkd> rdmsr 174
msr[174] = 00000000`00000008
lkd> rdmsr 175
msr[175] = 00000000`f8951000
lkd> rdmsr 176
msr[176] = 00000000`804de6f0
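
For reference, the three MSRs in the listing above could be named and written from ring 0 roughly as follows. This is only a minimal sketch, assuming GCC-style inline assembly on x86. The constant names follow Intel's manuals, while `MsrHalves`, `split_msr` and `write_msr` are hypothetical helpers made up for illustration. `wrmsr` faults outside ring 0, so `write_msr` is shown but is not meant to be called from user space.

```cpp
#include <cstdint>

// SYSENTER MSR indices (hexadecimal), named as in Intel's manuals.
constexpr std::uint32_t IA32_SYSENTER_CS  = 0x174; // kernel code segment selector
constexpr std::uint32_t IA32_SYSENTER_ESP = 0x175; // kernel stack pointer loaded on entry
constexpr std::uint32_t IA32_SYSENTER_EIP = 0x176; // kernel entry point

// WRMSR takes the 64-bit value split across EDX:EAX.
struct MsrHalves {
    std::uint32_t low;   // goes in EAX
    std::uint32_t high;  // goes in EDX
};

// Hypothetical helper: split a 64-bit MSR value into its two register halves.
constexpr MsrHalves split_msr(std::uint64_t value) {
    return MsrHalves{static_cast<std::uint32_t>(value),
                     static_cast<std::uint32_t>(value >> 32)};
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
// Hypothetical helper: must run in ring 0, otherwise the CPU raises #GP.
inline void write_msr(std::uint32_t index, std::uint64_t value) {
    const MsrHalves h = split_msr(value);
    asm volatile("wrmsr" : : "c"(index), "a"(h.low), "d"(h.high));
}
#endif
```

In a kernel's init code this would be called once per MSR, e.g. `write_msr(IA32_SYSENTER_EIP, address_of_entry_stub)`, where the entry stub is whatever your kernel provides.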

One note - below is an example set of functions and their methods (taken from the above article, based on Windows)

Code: Select all

Kernel Function Name    Call style    Exit instruction
KiSystemCallExit        'int 2e'      iretd
KiSystemCallExit2       SYSENTER      SYSEXIT
KiSystemCallExit3       SYSCALL       SYSRET

Note that SYSENTER/SYSEXIT is the Intel way, and SYSCALL/SYSRET is the AMD way. You'll need to use the CPUID instruction to find out which one to use.
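
The CPUID check could look something like the sketch below. It assumes GCC or Clang on x86 (for `<cpuid.h>` and `__get_cpuid`). The feature bits are the documented ones: leaf 1, EDX bit 11 for SYSENTER/SYSEXIT, and leaf 0x80000001, EDX bit 11 for SYSCALL/SYSRET. The names `edx_has_sysenter`, `edx_has_syscall`, `FastCall`, `pick_fast_call` and `detect_fast_call` are made up for this example, and the "prefer SYSENTER" policy is just one possible choice.

```cpp
#include <cstdint>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

// CPUID leaf 1, EDX bit 11 (SEP): SYSENTER/SYSEXIT present.
constexpr bool edx_has_sysenter(std::uint32_t leaf1_edx) {
    return ((leaf1_edx >> 11) & 1u) != 0;
}

// CPUID leaf 0x80000001, EDX bit 11: SYSCALL/SYSRET present.
constexpr bool edx_has_syscall(std::uint32_t ext_leaf1_edx) {
    return ((ext_leaf1_edx >> 11) & 1u) != 0;
}

enum class FastCall { None, Sysenter, Syscall };

// Hypothetical policy: prefer SYSENTER, fall back to SYSCALL, else use an interrupt.
constexpr FastCall pick_fast_call(std::uint32_t leaf1_edx,
                                  std::uint32_t ext_leaf1_edx) {
    return edx_has_sysenter(leaf1_edx)    ? FastCall::Sysenter
         : edx_has_syscall(ext_leaf1_edx) ? FastCall::Syscall
                                          : FastCall::None;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
// Queries the running CPU; __get_cpuid returns 0 if the leaf is unsupported.
inline FastCall detect_fast_call() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    std::uint32_t std_edx = 0, ext_edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) std_edx = edx;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) ext_edx = edx;
    return pick_fast_call(std_edx, ext_edx);
}
#endif
```

The bit tests are kept separate from the CPUID query so the decision logic can be exercised without depending on the machine it runs on.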

Basically, when you want to run the system call, you set up the arguments and then execute 'sysenter' (or 'syscall').

I've grossly oversimplified this but I hope it gets the idea across.

Posted: **Tue Jul 24, 2007 5:18 pm**

I am pretty sure you already know this, but I am going to go over it anyway. One of the problems is that changing GCC's flags can cause enormous differences in the emitted machine instructions. Right there you have the problem of producing the exact prolog and epilog that GCC expects. Mainly, you have no _easy_ way to generate the code it wants, because it can change at the drop of a hat, for instance when going from -O0 to -O3. Using -O3 allows the compiler to start reordering instructions, and this includes moving something from the prolog or epilog of the function into the guts of it if need be. Compare the same method compiled with -fomit-frame-pointer (first listing) and without it (second listing):

Code: Select all

08048430 <_ZN2cA1fEv>:
 8048430:       8b 44 24 04             mov    0x4(%esp),%eax
 8048434:       c7 00 01 00 00 00       movl   $0x1,(%eax)
 804843a:       c3                      ret
 804843b:       90                      nop
 804843c:       8d 74 26 00             lea    0x0(%esi),%esi

Code: Select all

0804841c <_ZN2cA1fEv>:
 804841c:       55                      push   %ebp
 804841d:       89 e5                   mov    %esp,%ebp
 804841f:       8b 45 08                mov    0x8(%ebp),%eax
 8048422:       c7 00 01 00 00 00       movl   $0x1,(%eax)
 8048428:       5d                      pop    %ebp
 8048429:       c3                      ret

Now I am about to suggest a sludgy workaround, since I am assuming that the code it generates is not the actual problem, but rather that the code executed first is the problem. I am deducing this from your mention of system calls, which can be entered through interrupts or an interrupt-like mechanism provided by the special system call instructions.

This sludgy workaround is really just a method to wrap any virtual methods in a class with your own prolog and epilog code, while preserving the code GCC emits. The only pitfall is if the C++ ABI changes for GCC and it starts passing the this pointer through some mechanism other than the first argument on the stack, or something similar, so this is at your own risk.

This code creates a new virtual table for the class instance, and modifies the specified virtual functions to call the wrapper first.

example.cc

Code: Select all

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdarg.h>
#include <malloc.h>

class cA
{
	public:
	uint32_t val;
	cA()
	{
		val = 10;
	}
	~cA()
	{
		val = 20;
	}
	virtual void fa(){val = 1; printf("called me\n");}
	virtual void fb(){val = 1;}
	virtual void fc(){val = 1;}
	virtual void fd(){val = 1;}
};

extern uint8_t class_vfunc_wrapper_start;
extern uint8_t class_vfunc_wrapper_end;
extern uint8_t class_vfunc_wrapper_loadreg_prologcall;
extern uint8_t class_vfunc_wrapper_loadreg_membercall;
extern uint8_t class_vfunc_wrapper_loadreg_epilogcall;
void wrap_class_vfuncs(uintptr_t instance, uintptr_t prolog, uintptr_t epilog, uint32_t vfc, uint8_t count, ...)
{
	uintptr_t *vte = (uintptr_t*)((uintptr_t*)instance)[0];
	va_list ap;
	va_start(ap, count);

	// create a new virtual table.
	uint32_t *nvt = new uint32_t[vfc];
	for(uint32_t x = 0; x < vfc; ++x)
	{
		nvt[x] = vte[x];
	}
	// set new virtual table in instance.
	((uintptr_t*)instance)[0] = (uintptr_t)nvt;

	for(; count > 0; --count)
	{
		// yield a thirty-two bit index from the byte index.
		uint32_t index = (va_arg(ap, uintptr_t) >> 2);
		// we found function's address in the vtable, create a wrapper instance.
		uint8_t *wrapperCode = (uint8_t*)memalign(32, (uintptr_t)&class_vfunc_wrapper_end - (uintptr_t)&class_vfunc_wrapper_start);
		// copy wrapper code into wrapper instance.
		memcpy(wrapperCode, &class_vfunc_wrapper_start, ((uintptr_t)&class_vfunc_wrapper_end) - ((uintptr_t)&class_vfunc_wrapper_start));
		printf("wrapper code size:%x\n", ((uintptr_t)&class_vfunc_wrapper_end) - ((uintptr_t)&class_vfunc_wrapper_start));
		// set wrapper instance call values (very hacky)
		*(uint32_t*)(((uintptr_t)&class_vfunc_wrapper_loadreg_prologcall - (uintptr_t)&class_vfunc_wrapper_start) + (uintptr_t)wrapperCode + 1) = prolog;
		*(uint32_t*)(((uintptr_t)&class_vfunc_wrapper_loadreg_epilogcall - (uintptr_t)&class_vfunc_wrapper_start) + (uintptr_t)wrapperCode + 1) = epilog;
		*(uint32_t*)(((uintptr_t)&class_vfunc_wrapper_loadreg_membercall - (uintptr_t)&class_vfunc_wrapper_start) + (uintptr_t)wrapperCode + 1) = vte[index];
		printf("vte:%x\n", vte[index]);
		nvt[index] = (uintptr_t)wrapperCode;
		printf("wrapped class member function, wrapper function address %x\n", wrapperCode);
	}
	return;
}

typedef void (*pmf)(void);

extern uint8_t example_prolog;
extern uint8_t example_epilog;

int main()
{
	cA *a = new cA();
	wrap_class_vfuncs((uintptr_t)a, (uintptr_t)&example_prolog, (uintptr_t)&example_epilog, 4, 1, &cA::fa);
	a->fa();
	a->fa();
	return 1;
}

This is the wrapper code, and it includes a separate function for the prolog and epilog. You do not have to use separate functions, but I included them just for show.
example.s

Code: Select all

.global class_vfunc_wrapper_start
.global class_vfunc_wrapper_end
.global class_vfunc_wrapper_loadreg_prologcall
.global class_vfunc_wrapper_loadreg_membercall
.global class_vfunc_wrapper_loadreg_epilogcall
class_vfunc_wrapper_start:
class_vfunc_wrapper_loadreg_prologcall:
movl $0, %ebx
call *%ebx
class_vfunc_wrapper_loadreg_membercall:
movl $0, %ebx
movl 4(%esp), %eax
push %eax
call *%ebx
pop %edx
push %eax
class_vfunc_wrapper_loadreg_epilogcall:
movl $0, %ebx
call *%ebx
pop %eax
ret
class_vfunc_wrapper_end:

.global example_prolog
.global example_epilog
example_prolog:
ret
example_epilog:
ret

g++ example.cc example.s -o example

Posted: **Tue Jul 24, 2007 8:07 pm**

I don't think I would use them. I mean, that's just me. When you write a process table you have to save the registers and stack data, and you also have threads running in registers. If you made your own syscalls you wouldn't have to worry about the CPU directly handling them, am I right?

Posted: **Wed Jul 25, 2007 12:11 am**

I can't remember the exact details, but I'm sure there is a "naked" keyword for implementing pure ASM function bodies.

Posted: **Wed Jul 25, 2007 12:34 am**

com1 wrote:I don't think I would use them. I mean, that's just me. When you write a process table you have to save the registers and stack data, and you also have threads running in registers. If you made your own syscalls you wouldn't have to worry about the CPU directly handling them, am I right?

The idea of using these instructions is that you avoid the time-consuming process of saving all the registers just to run what is most likely to be a very small amount of code.

SYSENTER and SYSCALL basically allow a ring3 process to call ring0 code and still have all the speed advantages of a typical 'call'.

This is the main reason why you can get such a speed increase using SYSENTER (or SYSCALL).

As an endnote, I'll give another reason why not to use an interrupt...

I am writing code at the moment for userspace processes to talk to each other (IPC). I have in the mattiseRecvMessage function a loop that queries the kernel to find out if there is a message ready for reading. This is an int 0x80 call, and means that I never get to reschedule again. Why? ISRs disable interrupts on entry - can you see why?

If I use SYSENTER/SYSCALL I can treat it as though I'm calling a function in kernel space (actually, I pretty much am doing just that).

Posted: **Wed Jul 25, 2007 12:58 am**

Thanks for the replies guys.

pcmattman wrote:Google it, you'll find hundreds of ways to do this.

I'm less worried about the implementation of it (I worked out / googled how to do it anyway), more the reasons why/why not people have/have not used them.

@kevin:

Wow, that's some seriously kludgy-looking code. I'll pore over it later, and google vtable modification (I hadn't thought about doing it that way) and yes, your assumption was 100% correct.

@com1:

Sorry, one of the things I forgot to mention is that I would still be using interrupts for task switches and fork()/clone(). The reasons are, of course, that you need to save the register state, and that my fork() works by modifying the user's stack so that on IRET it jumps somewhere else. So I need IRET to do it! The fast instructions would be for 'light' syscalls, like write/read etc.

@mark139:
Yes, you are right, there is a __attribute__((naked)), however I had already read the manual about this and it doesn't work for x86 architectures. (I tested it and the manual wasn't wrong, sadly.)

Thanks for the replies guys.

JamesM

Posted: **Wed Jul 25, 2007 1:27 am**

I develop in 64-bit long mode and do use syscall/sysret, and they work like a charm. Because 64-bit requires a flat memory model, I don't see any use for interrupt-based system calls.

As for the pure asm method, I would use:

Code: Select all

class test {
  public:
    int MyAsmMethod(void);
    int MyCppMethod(void);
};

int test::MyCppMethod(void) {
    //- this is the c++ method
	return(1);
}

asm (
".global __ZN4test11MyAsmMethodEv;"
"__ZN4test11MyAsmMethodEv:"
"    movl       $2, %eax;"
"    ret;"
);

#include <iostream>
using namespace std;

int main(void) {
    test        x;

    cout << x.MyAsmMethod() << endl;
    cout << x.MyCppMethod() << endl;
}

This is rather dodgy code, as it depends heavily on the compiler's name mangling (and on whether the target prefixes symbols with an underscore, as the label above assumes), but it works.

Posted: **Wed Jul 25, 2007 7:08 am**

os64dev:

Yes, I thought about doing it that way, but I haven't as yet with my constant googling found a way to programmatically mangle identifier names the way g++ does. If there was an inbuilt GXX_MANGLE macro or something it would be dead easy...
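
For what it's worth, one way around the mangling seems to be GCC's asm label extension: writing `asm("name")` after a declaration sets the symbol name directly, and g++ accepts it on member function declarations too. A minimal sketch, assuming g++ or clang++ targeting x86 ELF (so no leading underscore on symbols). The symbol name `test_MyAsmMethod` is simply chosen here for illustration.

```cpp
class test {
  public:
    // The asm label replaces the mangled name with a symbol we choose.
    int MyAsmMethod() asm("test_MyAsmMethod");
};

// Pure-asm body: no compiler-generated prologue or epilogue.
// The `this` pointer is ignored here; the method just returns 2 in EAX.
asm(".text\n"
    ".globl test_MyAsmMethod\n"
    "test_MyAsmMethod:\n"
    "    movl $2, %eax\n"
    "    ret\n");
```

Calling `x.MyAsmMethod()` on a `test x;` then jumps straight to the hand-written label. The asm body still has to follow whatever calling convention the compiler uses for member functions on the target.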

JamesM

Posted: **Wed Jul 25, 2007 3:14 pm**

To pcmattman: are you using message passing? Semaphores are so annoying due to the critical region stuff...

Posted: **Thu Jul 26, 2007 12:07 am**

JamesM wrote:os64dev:

Yes, I thought about doing it that way, but I haven't as yet with my constant googling found a way to programmatically mangle identifier names the way g++ does. If there was an inbuilt GXX_MANGLE macro or something it would be dead easy...

JamesM

Well... I also tried it with virtual members, and then everything stops working due to the vtable issues. Maybe it is possible to declare a friend function for an object which takes a pointer to that object as a parameter and thus has full access to the object. That function can then be written in inline assembly.

Code: Select all

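class object;
// extern "C" keeps the symbol name unmangled, so the asm label matches.
extern "C" void doStuff(object *);
class object {
    friend void doStuff(object *);
};
asm(".global doStuff\n"
    "doStuff:\n"
    "    ret\n");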

Posted: **Thu Jul 26, 2007 12:52 am**

The way I've elected to do it is to use 2 macros in each member function that should be in kernel mode.

Code: Select all

#define START_KERNEL \
  u32int was_ring_3 = 0;\
  if (isring3()) { \
    was_ring_3=1; asmStartKernel(); \
  }

#define END_KERNEL \
  if (was_ring_3) { \
    asmEndKernel(curProcess->esp0); \
  }

void Class::MemberFunc()
{
  START_KERNEL

  code...

  END_KERNEL
}

Pseudocode:

Code: Select all

asmStartKernel:
  ebx <- ebp  ; put the current stackframe base pointer in ebx
  ecx <- esp  ; and the current stack pointer
  eax <- .my_tmp_label ; EIP to jump to in eax.
  sysenter
.my_tmp_label:
  ret

; the location of this symbol is in the SYSENTER_EIP_MSR.
asmStartKernelHandler:
  push ebx ; push a pointer to the user stack base.
  ; copy the stack frame from the user stack using ecx, ebx.
  jmp eax

asmEndKernel:
  eax <- [esp+4]  ; eax  = first argument = esp0 = kernel stack base.
  esp <- [eax]     ; esp = dereference of esp0 = the pushed stackpointer.
  pop ebp ; restore the base pointer
  pop edx ; put the return addr in edx
  mov ecx, esp ; put the return esp in ecx for sysexit
  sysexit

That code is probably wrong, I'm at work now and just wrote it out from memory! And the naming convention etc is completely different (worse) to what I use in my actual code.

You think it'll work?

JamesM

Posted: **Thu Jul 26, 2007 1:33 am**

I do not see why you do not just bring all code paths into the kernel, from an interrupt or a syscall instruction, through one shared entry point. You might want to save the current thread state there, and it makes no sense to make your code more complex by writing the same redundant code over and over.

That way you should not even have to think about hacking about the class prolog and epilog?

Posted: **Thu Jul 26, 2007 1:59 am**

Hi Kevin:

If I want to pre-empt the process or thread, that will be done through a shared interrupt. Also the yield() syscall will go through the shared interrupt. The point here is that I'm making a microkernel, so IPC and fast syscalls are very important. There is a mechanism in the CPU (sysenter/exit) for 'lite' syscalls, why not use them?

And, I don't think the code I posted looks *so* bad. It's a little hacky, but no more stack fiddlage than in an interrupt handler. Plus the code looks more compact in the version I wrote last night at home.

This is the second incarnation of my kernel, I wanted to try something new - it looks quite promising tbh. The only 'kludge' seems to be a way to prettily macro-ify the ring switch.
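
One possible way to tidy the START_KERNEL / END_KERNEL pair is a scoped guard whose constructor and destructor do the two halves. This is only a sketch: `KernelScope` and `member_func` are hypothetical names, and `isring3`, `asmStartKernel` and `asmEndKernel` are replaced by user-space stand-ins that just record a trace so the example can run (the real `asmEndKernel` takes the kernel stack pointer, which is left out here).

```cpp
#include <string>

// Stand-ins so the sketch runs in user space; a real kernel would supply these.
static std::string trace;
static bool in_ring3 = true;
bool isring3()        { return in_ring3; }
void asmStartKernel() { in_ring3 = false; trace += "enter;"; }
void asmEndKernel()   { in_ring3 = true;  trace += "exit;"; }

// Scoped replacement for the START_KERNEL / END_KERNEL macro pair.
class KernelScope {
  public:
    KernelScope() : was_ring_3_(isring3()) {
        if (was_ring_3_) asmStartKernel();
    }
    ~KernelScope() {
        if (was_ring_3_) asmEndKernel();
    }
    KernelScope(const KernelScope &) = delete;
    KernelScope &operator=(const KernelScope &) = delete;

  private:
    bool was_ring_3_;  // remembers whether this scope performed the switch
};

// Example function body using the guard.
int member_func() {
    KernelScope guard;   // switches to ring 0 if we came from ring 3
    trace += "work;";
    return 42;           // the destructor switches back on every return path
}
```

The main gain over the macros is that an early return cannot skip the switch back. Whether the real stack-copying ring switch is safe to perform inside a constructor is something that would have to be checked against the actual assembly.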

JamesM

Posted: **Fri Jul 27, 2007 2:25 am**

I just looked at the SYSENTER instruction, and it appears that you would indeed be working with a shared entry point.

So I have to ask then why do you need to have assembly stubs for your class? Why not have everything written in one function before the call to a class method?

call_table:
.long class_member0
.long class_member1

system_enter:
# ... set up kernel stacks, save state ...
push CLASS_THIS_POINTER
mov $call_table, %eax
mov (%eax,%ebx,4), %ebx    # %ebx holds the call number on entry
call *%ebx
# ... resume saved state ...
sysexit

I should not have to rewrite or fiddle with any of the prolog and epilog for class members by doing it this way.
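
The same call-table idea can be sketched on the C++ side with an array of pointers to member functions, so the shared entry stub only has to pass the call number along. The names `Kernel`, `sys_read`, `sys_write`, `Handler`, `call_count` and `dispatch` are hypothetical, and the return values are placeholders.

```cpp
#include <cstddef>
#include <cstdint>

class Kernel {
  public:
    std::int32_t sys_read()  { return 10; }  // placeholder bodies
    std::int32_t sys_write() { return 20; }
};

using Handler = std::int32_t (Kernel::*)();

// Equivalent of call_table: one entry per system call number.
constexpr Handler call_table[] = {
    &Kernel::sys_read,
    &Kernel::sys_write,
};
constexpr std::size_t call_count = sizeof(call_table) / sizeof(call_table[0]);

// Called from the shared entry stub once the state has been saved.
std::int32_t dispatch(Kernel &k, std::uint32_t number) {
    if (number >= call_count) return -1;   // reject out-of-range call numbers
    return (k.*call_table[number])();
}
```

The compiler generates the member calls itself here, so nothing in the class prolog or epilog needs touching. The bounds check matters because the call number comes from user space.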


SYSENTER/SYSEXIT
