Page 2 of 2

Re: System Calls and User Mode

Posted: Tue Nov 22, 2016 11:17 am
by Schol-R-LEA
Actually, even drivers rarely need a rebuild, as most are available as loadable modules today. Windows (especially prior to 7) has to reload the kernel for driver updates more often than Linux generally does, which is one of the reasons why LTS releases of Linux are often preferred for server use over Windows (though if Microsoft were to have an LTS branch which didn't perform non-critical updates as often, that might change).

Also, the majority of Linux users are running a binary distro such as Debian (which includes Ubuntu and derivatives of that such as Mint and Elementary) or Fedora, which means that the process isn't significantly different from that of Windows or MacOS - and if the distro uses kernel hot-swapping, even a kernel update won't necessarily require a reboot (though not all distros support that yet). I don't know offhand if Windows or MacOS can hot-swap the kernel, though if Windows can, it's probably only since 8 (which would mean they are neck-and-neck on that).

Oh, and Linux distros had an update system covering general applications as well as the system itself well before Windows did (though I think MacOS had them both beat, I'm not sure) - prior to Windows 8, the built-in update system only applied to Windows itself, Visual Studio, and Microsoft Office. To be fair, setting up a system for downloadable updates is a lot easier when you aren't juggling a lot of conflicting corporate interests, but even so, Microsoft could easily have set up a store system for other software vendors much sooner than they did.

Sorry, that was a digression.

Even users of source distros such as Gentoo and Arch, or the handful who run without a distro and build everything manually, generally don't need to recompile every time there is an update, unless they like living on the bleeding, frequently crashing edge. Generally speaking, both for desktops and servers, most users will go with a binary distro unless they have such high performance demands that they have to squeeze every cycle and byte out of the system that they can (and in those cases they would generally not update frequently, as updates mean re-tuning and often add bloat).

I personally prefer source distribution as a concept (though slim binaries or some other sort of AOT/JIT split would be even better IMO), but when I was using Gentoo, the implementation of the updates was appallingly poor; I would get conflicts on an almost weekly basis that had to be manually resolved, and the fact that it recompiles everything each and every time, rather than checking the modules with diff or touch, made even minor updates drag on for hours. I switched to Mint earlier this year out of frustration over that, and while the efficiency of the code has suffered, the time I save not having to wrestle with updates more than compensates. I would probably use Gentoo on a second system if I had one, but right now I need the one laptop I own in working order too much to rely on it.

Again, digressing.

The point is, kernel recompiles are something a certain class of Linux users talk about, often in glowing terms, but in reality, the majority of Linux installations never do any.

Re: System Calls and User Mode

Posted: Tue Nov 22, 2016 3:34 pm
by simeonz
Context switching costs will be amortized into the call processing. The kernel stacks contain an equivalent number of registers saved in call frames anyway. Still, message passing can be used to avoid entering the kernel for asynchronous requests, some objects can be relegated to user mode (e.g. synchronization primitives), and thread scheduling can operate from user mode within the bounds of the process (user-level threads). Interrupts are similar in terms of overhead, so if the number of interrupts dominates the number of system calls, the latter are not a priority for optimization.

But the real overhead is in the preliminary processing. For example:
  1. Generic argument checks for the operation.
  2. Mandatory access control.
  3. Buffer management. Buffers are locked and physical memory pages are recorded, or buffers are copied to kernel-space memory.
  4. Driver specific argument checks.
  5. Discretionary access control.
  6. Quota checks.
  7. Quality of service.
Those steps are separate calls, each with its own prologue and epilogue. And this is before virtualization, where the request is passed again from one guest to another and further checks are performed. Basically, performing a system call is like waving goodbye, in a sense.

Re: System Calls and User Mode

Posted: Tue Nov 22, 2016 4:58 pm
by rdos
Octocontrabass wrote:
rdos wrote:No, sysenter isn't more efficient on older processors (if you count the intermediate code to load registers in user space and decode functions in kernel space).
Have you done benchmarks? If that turns out to be true, you can detect those processors and use INT instead of SYSENTER.
Yes. Generally, on older AMD processors (and mostly newer ones too), call gates outperformed sysenter. I think the only clear case where sysenter outperformed call gates was the Intel Atom.
Octocontrabass wrote:
rdos wrote:it requires assigning syscall numbers (otherwise you cannot decode them from a single entry-point)
How are you identifying system calls in your OS?
They are identified in the invalid instruction. When they are patched to call gates, the gate goes directly to the entry-point, so there is no need for any decoding. When they are patched to sysenter, the number in the instruction is pushed onto the user stack.
Octocontrabass wrote:
rdos wrote:Then, of course, every new syscall needs to be assigned a unique number, it must be added to the decoder in the kernel, and the decoder has to know where the handler procedure is located, all of which is highly unwanted.
How do you set up system calls without knowing where they are?
The provider of the syscall handler registers the entry-point in a table at boot time.

Re: System Calls and User Mode

Posted: Tue Nov 22, 2016 10:19 pm
by Octocontrabass
rdos wrote:
Octocontrabass wrote:How are you identifying system calls in your OS?
They are identified in the invalid instruction. When they are patched to call gates, the gate goes directly to the entry-point, so there is no need for any decoding. When they are patched to sysenter, the number in the instruction is pushed onto the user stack.
Which invalid instruction are you using?
rdos wrote:
Octocontrabass wrote:How do you set up system calls without knowing where they are?
The provider of the syscall handler registers the entry-point in a table at boot time.
What provides system calls aside from your kernel? What kind of system calls are typically provided that way?

Re: System Calls and User Mode

Posted: Wed Nov 23, 2016 4:25 am
by rdos
Octocontrabass wrote:
rdos wrote:
Octocontrabass wrote:How are you identifying system calls in your OS?
They are identified in the invalid instruction. When they are patched to call gates, the gate goes directly to the entry-point, so there is no need for any decoding. When they are patched to sysenter, the number in the instruction is pushed onto the user stack.
Which invalid instruction are you using?
In the beginning, I used 0F 0B xx. I then changed it to use a far-call to the null selector (call 0003:gate-nr), with the offset being the syscall id. The latter is more debugger-friendly as the debugger knows the length of the instruction and that it is a call. Also, when patched to a call gate, the instruction is almost the same.
Octocontrabass wrote:
rdos wrote:
Octocontrabass wrote:How do you set up system calls without knowing where they are?
The provider of the syscall handler registers the entry-point in a table at boot time.
What provides system calls aside from your kernel? What kind of system calls are typically provided that way?
Drivers provide syscalls directly to applications; only a minority of the syscalls are provided by the kernel module. I also have a similar registration model for registering entry points to kernel-level services. These also use far calls to the null selector (call far 0002:gate-nr), but they are patched to direct calls instead, since there is no privilege level change involved and so call gates are not necessary. This model means I never link a huge kernel image; rather, there is a small kernel module about 60k in size, a dozen more-or-less required drivers, plus optional drivers for specific hardware.

Re: System Calls and User Mode

Posted: Wed Nov 23, 2016 9:03 am
by Octocontrabass
rdos wrote:Drivers will provide syscalls directly to applications.
How do you ensure each system call is assigned a unique number? Does this make applications dependent on specific drivers?

Re: System Calls and User Mode

Posted: Wed Nov 23, 2016 9:41 am
by rdos
Octocontrabass wrote:
rdos wrote:Drivers will provide syscalls directly to applications.
How do you ensure each system call is assigned a unique number? Does this make applications dependent on specific drivers?
I have an include file that contains all the numbers, and then I have one assembler file that defines macros for all the syscalls, including documentation for which registers are used for input and output. In addition to that, there is also a C/C++ include file that declares the syscalls as externals, and one file for OpenWatcom and one for GCC that define the syscalls as macros.

I started with number 1, and I'm currently at 610. A few have been dropped, but I don't reuse the numbers. A nice feature of using registers, and not stack frames, is that if a syscall is not registered, a default handler that just returns with the carry flag set runs instead, and the function silently fails.

For most devices, there is a main driver that virtualizes the function and exports the API (often using handles) to applications. Then the drivers that implement the functions register with the provider driver. However, a few syscalls are polymorphic and can be implemented in different ways by different drivers; of course, only one of those drivers can be loaded at a time.

Re: System Calls and User Mode

Posted: Sat Nov 26, 2016 4:18 am
by Octacone
Schol-R-LEA wrote:
octacone wrote:Now lets talk about system calls:
Did I understand them correctly:
1. A function of mine gets called for some reason
2. That function utilizes some forbidden code
3. That function goes through a system call handler
4. That handler has the power to run the code it received
Not necessarily "forbidden", just "provided by the kernel". System calls are used to communicate with the kernel for a number of reasons, and while many of those reasons are because it requires privileged instructions, or because the kernel is isolating a service from the application for reasons of security and/or stability, many system calls are just requests for something that happens to reside in the kernel (such as the IPC primitives).

System calls are the interface to the kernel, and several functions in the C standard library are primarily portable wrappers around system calls, though they usually also clean up the result in some way for the client-programmer; this is even more true of the system-specific libraries. For example, in Unix/Linux, the sbrk() function usually just invokes the system call of the same name and returns the result directly. The standard malloc(), in turn, uses a system call such as sbrk() to get a block of memory from the system, usually significantly larger than the one actually needed, so it can do some process-local memory management rather than repeat the system call each time; while it is more than just a wrapper, the heart of its behavior is calling the kernel memory manager as needed. Similar statements apply to all of the I/O functions, at least in a monolithic system (microkernels run most if not all drivers in user mode, so the kernel's involvement is limited to managing IPC and queuing, and hybrids usually do the same for a subset of drivers).

And, of course, anything that is a service of the kernel itself, such as an application voluntarily surrendering the CPU to wait on something (via something like sleep() or wait()), requires a system call too, even if no privileged instructions are actually used in the scheduler.
octacone wrote: 5. That handler messes up something called ESP (optional: and my imaginary multitasking system stops working)
ESP is the Extended Stack Pointer, the 32-bit version of the stack pointer, SP (the long mode equivalent is RSP). It keeps track of the top of the current stack.

Now, I am assuming you know this, but just to be clear: the hardware stack is a region of memory used for storing temporary values in last-in, first-out order, meaning that the stack pointer holds the address of the most recent element added to the stack (the "top" of the stack, though because x86 stacks grow downward, it is actually the lowest address in the stack that is currently in use). When an item is 'pushed' onto the stack, the stack pointer is decremented by one system word (since the stack grows downward) and the new value is stored at that location (the exact order in which this happens is not particularly relevant for most purposes, and may even differ from model to model of the CPU). To 'pop' a value off of the stack, you copy the value and then roll the stack pointer back (increment it, in this case).

In the x86, this is used mainly for three purposes: to store the return address of a function call (the address of the instruction just after the CALL), to store a 'frame' or 'activation record' holding the local variables of each function, and to hold some or all of the arguments to a called function. The reason that the CALL instruction stores the return address is so that RET doesn't have to be hard-coded with a return address, making it possible to call the same function from several places over the course of the process; RET implicitly pops the top of the stack and uses that value as the return address.

The reason that the stack frame is used to hold arguments and local variables is similar: to give the function a temporary location for its values, one which can be automatically cleared when the function returns just by resetting the stack pointer. If a function requires any arguments, most calling conventions require that the caller push the arguments onto the stack in a specified order before the CALL instruction, which puts them in a place where the callee can find them. If the function requires any local variables, they are pushed onto the stack at the start of the function.

Each function has its own frame or base pointer, which indicates where its arguments and locals are to be found. When a function starts, the base pointer (EBP in this case, the Extended Base Pointer) of the caller is pushed onto the stack first, putting it just below the return address. Then the stack pointer is copied into the base pointer, which sets the base of the stack frame for the current function. The frame pointer then serves as a reference point, with the arguments above it and the local variables below it. By using offsets up into the stack for the arguments, and down into the frame for the locals, the function gets a temporary region for its values which can be automatically cleared when the function returns, just by resetting the stack pointer to the frame pointer's value and popping the old frame pointer back into EBP.

I said all of this just to make sure we are on the same page in this regard. As I said, I expect you knew this, but I wanted to make sure we were in agreement on the terminology and so forth.

Now, one of the things that happens in a call (regardless of whether it is a system call or an ordinary function call) is that the context (the state of the registers at the time of the call) has to be saved before the called operation runs, and restored after it is finished (modulo whatever the operation returns) so that the caller can use the registers without trashing them. In an ordinary function call, only the registers actually used need to be saved, and if a return value is passed through a register (as is the case in most x86 calling conventions), then that register doesn't need to be saved at all.

In either a system call or an interrupt trap, the state of the running process - usually including all of its registers, regardless of how they are used - needs to be saved, not to the stack, but to a process record, and the process's scheduler state needs to be updated to something like "paused", "waiting", or "sleeping" (in many cases it will have to wait on some long operation such as a disk read, but if it is a request the kernel can service immediately, it might just be marked "paused" to show that it was the last running process). This includes the stack pointer and stack frame, so if the system call doesn't save ESP, the process will go off the rails once the system call returns, hence the problem you mention.
Thanks for explaining this deeply! +1 :)
Does Microsoft use the same technique? What if the number of arguments is larger than the number of registers available in x86?

Re: System Calls and User Mode

Posted: Sat Nov 26, 2016 10:09 am
by Roman
octacone wrote:What if the number of arguments is larger than the number of registers available in x86?
Then you can use a pointer to a struct.