Kernel requests via page faults

rdos · Post by **rdos** » Sun Sep 26, 2010 4:04 am

Owen wrote:
rdos wrote:I think the fastest way to do syscalls on x86 is to allocate a callgate with every entrypoint. This will leave all CPU-registers available (no need to use & copy the stack in most (all) cases). It doesn't need to setup function numbers on entry, and it doesn't need decoding functions in the kernel, and eventually to do a call / jmp to the real entrypoint. The only drawback is that GDT selectors are a limited resource.
By the time you have even passed through a call gate on a modern processor, you could pretty much have been through syscall/sysret twice.

Perhaps, but speed is more important on older processors, and even if you can execute syscall/sysret twice in that time, that is typically not even close to the whole procedure of setting up the call, executing syscall, decoding it and jumping to the final destination. I've seen the coding/decoding code in both DOS and Windows, and it's terribly slow. In the past, people have minimized syscalls just because they are so terribly slow. I never do that because my syscalls are fast, and I can set up my compiler (Open Watcom) to use the registers the call is defined to use, and thus eliminate the intermediate step of loading registers from the stack, which I needed with Borland's compiler. In essence, the call goes directly from C/C++ to the kernel with no coding/decoding overhead.

rdos · Post by **rdos** » Sun Sep 26, 2010 4:13 am

lemonyii wrote:however, nice idea.
It's good for non-assembly programming, and easy to implement, but I didn't consider the speed.
Anyway, it's just an entrance, and it varies between platforms.
My opinion is: keep the central part of the code unchanged, and choose the most practical (fastest, easiest decoding, fewest exceptions...) entrance on each platform.
And of course we may have many entrances, but I don't think we need them.

Yes, there is a need to look at the whole sequence, not just the switch from user to kernel. Speed should be measured from the last useful instruction in C/C++ until the first useful instruction in the device-driver. One possible advantage of using pagefaults is similar to that of callgates: the handler destination could be coded somewhere directly, which could eliminate the usual coding/decoding logic of syscall/sysret.

Owen · Post by **Owen** » Sun Sep 26, 2010 9:05 am

rdos wrote:
Owen wrote:
rdos wrote:I think the fastest way to do syscalls on x86 is to allocate a callgate with every entrypoint. This will leave all CPU-registers available (no need to use & copy the stack in most (all) cases). It doesn't need to setup function numbers on entry, and it doesn't need decoding functions in the kernel, and eventually to do a call / jmp to the real entrypoint. The only drawback is that GDT selectors are a limited resource.
By the time you have even passed through a call gate on a modern processor, you could pretty much have been through syscall/sysret twice.
Perhaps, but speed is more important on older processors, and even if you can execute syscall/sysret twice in that time, that is typically not even close to the whole procedure of setting up the call, executing syscall, decoding it and jumping to the final destination. I've seen the coding/decoding code in both DOS and Windows, and it's terribly slow. In the past, people have minimized syscalls just because they are so terribly slow. I never do that because my syscalls are fast, and I can set up my compiler (Open Watcom) to use the registers the call is defined to use, and thus eliminate the intermediate step of loading registers from the stack, which I needed with Borland's compiler. In essence, the call goes directly from C/C++ to the kernel with no coding/decoding overhead.

GCC knows how to inline my system calls too. More importantly, my syscall dispatch code comes down to little more than

Check that the system call number doesn't exceed the maximum legal value
Jump to the correct syscall
Return

OK; there's a little more complexity than this: I have to SWAPGS to get the kernel per-CPU information (but this is a given either way) and re-enable interrupts, but the overhead is far less than that of a call gate.

From memory, the syscall code looks like

Code: Select all

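syscallEntry64:
    swapgs                              # switch to the kernel GS base (per-CPU data)
    mov %gs:TCB_RSP0_OFFSET, %rsp       # load the kernel stack (user RSP is not saved in this sketch)
    sti
    cmpq $SYSCALL_COUNT, %rax
    jae _badSyscallNumber               # unsigned compare: reject rax >= SYSCALL_COUNT
    mov syscallVec(,%rax,8), %rax       # fetch the handler; 8-byte table entries assumed
    call *%rax
    swapgs
    sysretq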

I could probably streamline things by defining an inline function which does something along the lines of

Code: Select all

static inline void syscallReturn1(_FMK_ksysparam retV)
{
    /* Return to user mode with the result in RAX; control never comes back here. */
    asm volatile("swapgs; sysretq" :: "a"(retV.as_int));
    __builtin_unreachable();
}

and jumping directly to the syscall functions. In fact, I'll probably move to doing this; it makes returning multiple values trivial too.

Register pressure is a complete non-issue; one should be re-evaluating things if they need that many parameters (particularly when developing an asynchronous microkernel!)
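
For readers who prefer C to assembly, the three dispatch steps listed earlier in this post (range check, table jump, return) could be sketched roughly as below. This is only an illustration: the handler names, the four-argument signature and the error value are invented here and are not FMK's real interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler signature: four register-sized arguments, one result. */
typedef intptr_t (*syscall_fn)(intptr_t, intptr_t, intptr_t, intptr_t);

#define SYSCALL_BAD_NUMBER ((intptr_t)-1)   /* assumed error code */

/* Two made-up system calls, just to populate the table. */
static intptr_t sys_nop(intptr_t a, intptr_t b, intptr_t c, intptr_t d)
{
    (void)a; (void)b; (void)c; (void)d;
    return 0;
}

static intptr_t sys_add(intptr_t a, intptr_t b, intptr_t c, intptr_t d)
{
    (void)c; (void)d;
    return a + b;
}

static const syscall_fn syscall_vec[] = { sys_nop, sys_add };
#define SYSCALL_COUNT (sizeof syscall_vec / sizeof syscall_vec[0])

static intptr_t syscall_dispatch(uintptr_t nr, intptr_t a0, intptr_t a1,
                                 intptr_t a2, intptr_t a3)
{
    /* Unsigned compare, so "negative" numbers are rejected as well. */
    if (nr >= SYSCALL_COUNT)
        return SYSCALL_BAD_NUMBER;
    /* Indirect call through the table, then return the handler's result. */
    return syscall_vec[nr](a0, a1, a2, a3);
}
```

The whole decode step is one compare, one table load and one indirect call, which is the point being made about dispatch overhead.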

rdos · Post by **rdos** » Mon Sep 27, 2010 2:06 am

Owen wrote:GCC knows how to inline my system calls too. More importantly, my syscall dispatch code comes down to little more than

Check that the system call number doesn't exceed the maximum legal value

Jump to the correct syscall

Return
OK; there's a little more complexity than this: I have to SWAPGS to get the kernel per-CPU information (but this is a given either way) and re-enable interrupts, but the overhead is far less than that of a call gate.

From memory, the syscall code looks like
Code: Select all
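syscallEntry64:
    swapgs                              # switch to the kernel GS base (per-CPU data)
    mov %gs:TCB_RSP0_OFFSET, %rsp       # load the kernel stack (user RSP is not saved in this sketch)
    sti
    cmpq $SYSCALL_COUNT, %rax
    jae _badSyscallNumber               # unsigned compare: reject rax >= SYSCALL_COUNT
    mov syscallVec(,%rax,8), %rax       # fetch the handler; 8-byte table entries assumed
    call *%rax
    swapgs
    sysretq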
I could probably streamline things by defining an inline function which does something along the lines of
Code: Select all
static inline void syscallReturn1(_FMK_ksysparam retV)
{
    /* Return to user mode with the result in RAX; control never comes back here. */
    asm volatile("swapgs; sysretq" :: "a"(retV.as_int));
    __builtin_unreachable();
}
and jumping directly to the syscall functions. In fact, I'll probably move to doing this; it makes returning multiple values trivial too.

Register pressure is a complete non-issue; one should be re-evaluating things if they need that many parameters (particularly when developing an asynchronous microkernel!)

OK, that looks pretty good, at least compared to DOS/Windows. However, I would want to see the complete code, including how the compiler codes the call in your application and whether something special is required in the device-driver.

Let's take the read-file call as an example. It is defined like this for C/C++ code:

int RDOSAPI RdosReadFile(int Handle, void *Buf, int Size);

The call-side macro looks like this:

#pragma aux RdosReadFile = \
CallGate_read_file \
ValidateEax \
parm [ebx] [edi] [ecx] \
value [eax];

This tells the compiler to load the handle into ebx, the buffer pointer into edi and the size into ecx, and to return the number of bytes read in eax.

It will typically expand to something like this:

mov ebx,filehandle
mov edi,buffer
mov ecx,size
call far 0xE800:0x00000000 ; selector will be dynamically allocated on first call, it is just an example.
jnc read_ok
;
xor eax,eax

read_ok:

The device-driver registers its entry-point with the "usergate-manager", and the entry portion contains no extra code. It doesn't need to validate parameters: it uses the user-mode es selector to access the buffer, and if that access fails, the code faults. User level cannot pass pointers to kernel space because the user-mode es selector does not map the kernel. The handle is "dereferenced" by the "handle manager", but this is typically a fast procedure.
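
How the RDOS handle manager works internally is not shown in this thread, so the following is only a generic sketch of why dereferencing a handle tends to be cheap: a bounds check plus an array lookup. The table size, the `file_obj` structure and both function names are assumptions made for the example.

```c
#include <stddef.h>

#define MAX_HANDLES 16              /* assumed table size */

struct file_obj {                   /* hypothetical kernel-side object */
    int  in_use;
    long position;
};

static struct file_obj handle_table[MAX_HANDLES];

/* Allocate the first free slot; returns the handle or -1 if the table is full. */
static int open_handle(void)
{
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (!handle_table[i].in_use) {
            handle_table[i].in_use = 1;
            handle_table[i].position = 0;
            return i;
        }
    }
    return -1;
}

/* Map a user-supplied handle to its object, or NULL if it is out of range or not open. */
static struct file_obj *deref_handle(int handle)
{
    if (handle < 0 || handle >= MAX_HANDLES)
        return NULL;
    if (!handle_table[handle].in_use)
        return NULL;
    return &handle_table[handle];
}
```

Because user code only ever holds a small integer, a bad handle cannot name arbitrary kernel memory; the worst case is a failed lookup.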

Owen · Post by **Owen** » Mon Sep 27, 2010 2:14 am

The standard system call wrapper is:

Code: Select all

_FMK_AMD64_SCDEF _FMK_sysparam _FMK_SyscallR(_FMK_syscall sc,
                        _FMK_u8 argc, const _FMK_sysparam argv[],
                        _FMK_u8 retc, const _FMK_sysretn  retv[]
)
{
    _FMK_u8 i;
    _FMK_sysparam rv;
    register _FMK_sysparam arg0 _FMK_asmreg(rdi);
    register _FMK_sysparam arg1 _FMK_asmreg(rsi);
    register _FMK_sysparam arg2 _FMK_asmreg(rdx);
    register _FMK_sysparam arg3 _FMK_asmreg(rcx);

    switch(__builtin_constant_p(argc) ? argc : _FMK_SC_MAX_ARGS) {
        case 4: arg3 = argv[3];
        case 3: arg2 = argv[2];
        case 2: arg1 = argv[1];
        case 1: arg0 = argv[0];
    }

    switch(__builtin_constant_p(retc) ? retc : _FMK_SC_MAX_ARGS) {
        case 1:
            __asm__ __volatile__("int $0xFF"
                :  "=a"(rv),
                   "=D"(*retv[0])
                :  "0" (sc),
                   "1" (arg0),
                   "S" (arg1),
                   "d" (arg2),
                   "c" (arg3)
                : "memory"
            ); break;

        case 2:
            __asm__ __volatile__("int $0xFF"
                :  "=a"(rv),
                   "=D"(*retv[0]),
                   "=S"(*retv[1])
                :  "0" (sc),
                   "1" (arg0),
                   "2" (arg1),
                   "d" (arg2),
                   "c" (arg3)
                : "memory"
            ); break;

        case 3:
            __asm__ __volatile__("int $0xFF"
                :  "=a"(rv),
                   "=D"(*retv[0]),
                   "=S"(*retv[1]),
                   "=d"(*retv[2])
                :  "0" (sc),
                   "1" (arg0),
                   "2" (arg1),
                   "3" (arg2),
                   "c" (arg3)
                : "memory"
            ); break;

        case 4:
            __asm__ __volatile__("int $0xFF"
                :  "=a"(rv),
                   "=D"(*retv[0]),
                   "=S"(*retv[1]),
                   "=d"(*retv[2]),
                   "=c"(*retv[3])
                :  "0" (sc),
                   "1" (arg0),
                   "2" (arg1),
                   "3" (arg2),
                   "4" (arg3)
                : "memory"
            ); break;
    }

    return rv;
}

_FMK_AMD64_SCDEF is normally "static inline"

Normally, this is called from code like

Code: Select all

FMK_result FMK_KDebugOutChar(char c)
{
    return (FMK_result) 
        _FMK_Syscall(_FMK_SC_DebugOut, 
                     _FMK_SC_ARGC(1, 0, 0, 0, 0, 0), 
                     _FMK_SC_ARGV(_FMK_SCA_U8(c))
                    );
}

(_FMK_Syscall is very similar, except that it only returns a value in RAX rather than in additional registers.)

The general generated code is

Code: Select all

FMK_KDebugOutChar:
    mov $_FMK_SC_DebugOut, %eax
    syscall
    ret

rdos · Post by **rdos** » Mon Sep 27, 2010 2:39 am

You would need some serious revision of this code once you come closer to "production stage", because user-level can pass any kind of garbage to the device-driver. It can even trash your kernel by deliberately or accidentally putting addresses to kernel-space data structures in the parameters.

Part of my strategy is to minimize pointers and data-structures passed from userlevel to kernel, as these either need some hardware protection (the use of segreg:offset where segreg can never access kernel data), or some kind of parameter validation in software in the entry portion of the device-driver. For 64-bit code, the only option is software validation since segmentation is not supported.

gerryg400 · Post by **gerryg400** » Mon Sep 27, 2010 3:19 am

rdos wrote:You would need some serious revision of this code once you come closer to "production stage", because user-level can pass any kind of garbage to the device-driver. It can even trash your kernel by deliberately or accidentally putting addresses to kernel-space data structures in the parameters.

Part of my strategy is to minimize pointers and data-structures passed from userlevel to kernel, as these either need some hardware protection (the use of segreg:offset where segreg can never access kernel data), or some kind of parameter validation in software in the entry portion of the device-driver. For 64-bit code, the only option is software validation since segmentation is not supported.

Assuming an upper-half kernel, pointer validation only requires a simple comparison. Surely that is not an issue.
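
A minimal sketch of that comparison, assuming user space occupies everything below a fixed boundary. The `USER_TOP` value (the start of the non-canonical hole on x86-64) and the function name are assumptions for illustration, not taken from any kernel in this thread. The one subtlety is that the length has to be folded in without letting `addr + len` overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: user space is [0, USER_TOP); the kernel lives in the upper half. */
#define USER_TOP UINT64_C(0x0000800000000000)

/* True if the whole range [addr, addr + len) lies in user space. */
static bool user_range_ok(uint64_t addr, uint64_t len)
{
    if (addr >= USER_TOP)
        return false;
    /* Compare against the space remaining instead of computing addr + len,
       so a huge len cannot wrap around and slip past the check. */
    return len <= USER_TOP - addr;
}
```

This only proves the range is not kernel memory; whether the pages are actually mapped is still left to the page-fault handler.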

rdos · Post by **rdos** » Mon Sep 27, 2010 3:46 am

gerryg400 wrote:Assuming an upper-half kernel, pointer validation only requires a simple comparison. Surely that is not an issue.

In the example above there are many pointers that need validation. It is also easy to forget some validation since these pointers are "hidden".

gerryg400 · Post by **gerryg400** » Mon Sep 27, 2010 4:04 am

rdos wrote:In the example above there are many pointers that need validation. It is also easy to forget some validation since these pointers are "hidden".

Remember that we are looking at the user side of the system call here. No validation is needed there at all.

Owen · Post by **Owen** » Tue Sep 28, 2010 7:11 am

gerryg400 wrote:
rdos wrote:In the example above there are many pointers that need validation. It is also easy to forget some validation since these pointers are "hidden".
Remember that we are looking at the user side of the system call here. No validation is needed there at all.

Indeed. It also occurs to me that I need to swivel some registers to avoid syscall trampling them. There is a version of this code for each supported platform (at present, AMD64 and ARM).

gerryg400 · Post by **gerryg400** » Tue Sep 28, 2010 7:15 am

I need to swivel some registers

Swivel? What do you mean?

Owen · Post by **Owen** » Tue Sep 28, 2010 7:28 am

gerryg400 wrote:
I need to swivel some registers
Swivel? What do you mean?

RCX, for example, is used by syscall to hold the return address, so it can't be used for parameter passing. The RCX slot will be swapped for a different call-clobbered register, and RCX added to the clobber list. The same would apply to R11 (which syscall overwrites with the saved flags) if it were involved in parameter passing.

The stub for a many-parameter system call, when not inlined, would then look like:

Code: Select all

FMK_ABigSyscallStub:
    mov $_FMK_SC_ABigSyscall, %eax
    mov %rcx, %r8
    syscall
    ret

R8 would normally be the 5th parameter slot; this makes accessing the parameter from the kernel side of the system call trivial. FMK system calls should probably never go above 4 arguments, and if they do anyway, then the interface will be changed. I do not guarantee system call ABI stability. (Touching either libFMK or the raw system call ABI is verboten; nothing says either will maintain long-term stability, and nothing says FusionOS will always run on top of FMK either.)
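
For comparison, here is how the same RCX problem looks in an inline wrapper on a kernel that can actually be run today. This sketch targets Linux on x86-64 purely as a stand-in; Linux happens to use R10 for the fourth argument where the stub above uses R8. The function name `raw_syscall4` is made up, and the fallback branch exists only so the example still builds on other targets.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#if defined(__x86_64__) && defined(__linux__)
/* SYSCALL overwrites RCX with the return address and R11 with the saved
   RFLAGS, so neither can carry an argument and both go on the clobber list. */
static long raw_syscall4(long nr, long a0, long a1, long a2, long a3)
{
    long ret;
    register long r10 __asm__("r10") = a3;   /* 4th argument moved out of RCX */
    __asm__ __volatile__("syscall"
        : "=a"(ret)
        : "0"(nr), "D"(a0), "S"(a1), "d"(a2), "r"(r10)
        : "rcx", "r11", "memory");
    return ret;
}
#else
/* Other targets: let libc choose the registers. */
static long raw_syscall4(long nr, long a0, long a1, long a2, long a3)
{
    return syscall(nr, a0, a1, a2, a3);
}
#endif
```

Listing RCX and R11 as clobbers is what lets the compiler inline the wrapper safely, since it then knows not to keep live values in those registers across the instruction.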

gerryg400 · Post by **gerryg400** » Tue Sep 28, 2010 7:41 am

Okay. That reminds me. I'm still using

Code: Select all

int

to enter my kernel. Must fix that!

JamesM · Post by **JamesM** » Tue Sep 28, 2010 7:59 am

rdos wrote:I think the fastest way to do syscalls on x86 is to allocate a callgate with every entrypoint. This will leave all CPU-registers available (no need to use & copy the stack in most (all) cases). It doesn't need to setup function numbers on entry, and it doesn't need decoding functions in the kernel, and eventually to do a call / jmp to the real entrypoint. The only drawback is that GDT selectors are a limited resource.

IIRC the fastest way to do a syscall on 64-bit x86 is SYSCALL/SYSRET, as they've been specially optimised for this case.

rdos · Post by **rdos** » Wed Sep 29, 2010 3:08 am

JamesM wrote:
rdos wrote:I think the fastest way to do syscalls on x86 is to allocate a callgate with every entrypoint. This will leave all CPU-registers available (no need to use & copy the stack in most (all) cases). It doesn't need to setup function numbers on entry, and it doesn't need decoding functions in the kernel, and eventually to do a call / jmp to the real entrypoint. The only drawback is that GDT selectors are a limited resource.
IIRC the fastest way to do a syscall on 64-bit x86 is SYSCALL/SYSRET, as they've been specially optimised for this case.

Except for exceptions, it is more or less the only way on 64-bit. However, I don't target 64-bit, rather 386+ processors (IA32) only, and I have no plans to switch to 64-bit. There is no need for 64-bit code on embedded platforms.
