Brendan wrote:1) If device drivers have some strange dependency on the kernel stack's location, then device drivers are broken.
Not so. Hardware takes care of stack checking: the stack pointer rolls around when the stack is full, which generates a stack fault exception. The device drivers normally don't care about the stack. Some parts of the kernel related to exception handling and to creating kernel threads do care about the stack, but otherwise it doesn't matter where it is located. OTOH, many things depend on ESP being less than 0x10000, so it is not possible to have a kernel stack with a 32-bit offset. In order to use a 32-bit stack pointer, quite a few things would need to be changed, especially in the kernel. Additionally, I don't want to use software validation of the stack. I like hardware validation better.
Brendan wrote:2) There's many differences between CPUs that cause "if(feature_supported) then". There's 2 ways to avoid the branching in critical places - conditional code that enables/disables things at compile time, and duplicating code (e.g. 8 different pieces of task switch code; where the kernel uses "call [address_of_task_switch_code]" or a function pointer, so all the branching only happens once when you decide which version of the code to use and not each time the code is run). If your kernel doesn't already do this (e.g. testing if FXSAVE should be used, if AVX is present, if there's multiple CPUs, etc), then your kernel is broken.
I do this in several places. For instance, I use this method when selecting lock types. When I run on a single core I don't need spinlocks, so those become nops when there is only one core. Some of the other issues are solved by picking a specific driver (for instance PIC vs APIC). However, even adding these things requires at least a call / ret sequence, which slows things down. I certainly wouldn't want different compilations of the kernel.
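To make the "decide once, then call through a pointer" idea concrete, here is a minimal C sketch of selecting lock routines at boot. It is not RDOS code: the names klock_t, klock_select, klock_acquire and klock_release are made up for illustration, and RDOS may well patch the lock sites to real nops rather than go through a pointer as this sketch does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type; the post does not show RDOS's real lock structures. */
typedef struct {
    atomic_flag flag;
} klock_t;

/* Single-core versions: nothing to do. */
static void lock_nop(klock_t *l)   { (void)l; }
static void unlock_nop(klock_t *l) { (void)l; }

/* Multi-core versions: a plain test-and-set spinlock. */
static void lock_spin(klock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) {
        /* busy-wait until the holder clears the flag */
    }
}

static void unlock_spin(klock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

/* Chosen once at boot; later uses are one indirect call with no feature test. */
static void (*klock_acquire)(klock_t *) = lock_spin;
static void (*klock_release)(klock_t *) = unlock_spin;

void klock_select(unsigned core_count)
{
    bool smp = core_count > 1;
    klock_acquire = smp ? lock_spin : lock_nop;
    klock_release = smp ? unlock_spin : unlock_nop;
}
```

The indirect call is exactly the call / ret overhead mentioned above; patching the call sites themselves would avoid it at the cost of self-modifying code.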
Brendan wrote:4) If your kernel is so broken that it actually needs stack limits checks in release versions (e.g. rather than just in debug builds) then your kernel is broken.
I see no reason to disable stack checking (done by hardware) in a production release. Stack faults in the kernel should not be allowed to pass silently in production. They are causes of panics and reboots in the production release.
Brendan wrote:5) I'd assume you aren't running applications that use SYSENTER during boot; and that you only run applications that use SYSENTER after paging is enabled. The part of your kernel that's used after paging is enabled probably uses something like "CS.base = 0x12345678, offset in segment >= 0". If you can't easily change this to "CS.base = 0, offset in segment >= 0x12345678" then your kernel is broken. Notice that even though the segment base is zero the kernel won't be using any area of the virtual address space reserved for virtual8086.
For the moment, the kernel is a 16-bit module, and that's the primary reason why it cannot have CS.base = 0. An additional reason is that it has CS.operandsize = 16, which also means that CS must be reloaded. However, it would be possible to recompile it as a 32-bit module, and possibly also give it a non-zero offset. I'm sure there would be some issues in such a move, but it would be possible. However, the primary reason why I might change it to a 32-bit module is because of size constraints, not because of possible SYSENTER support.
Brendan wrote:6) If you can't change a far jump in the kernel (so it jumps to a different address when switching from 16-bit to 32-bit) then your kernel is broken.
I suppose I could define a new level where the old 16-bit kernel just chains to a new 32-bit kernel.
Brendan wrote:If new applications are compiled to use SYSENTER directly (without the hideous "applications attempt to call the kernel and generate an exception and the exception handler patches the caller" mess) then your hideous "applications attempt to call the kernel and generate an exception and the exception handler patches the caller" mess is irrelevant for SYSENTER.
It's not a mess. It is called "binary compatibility".
But I suspect you've never heard about such a concept in the *nix world?
Brendan wrote:If new applications are compiled to use SYSENTER directly, then even if the CPU doesn't support SYSENTER and you have to emulate it, it'd probably still be faster than your hideous "applications attempt to call the kernel and generate an exception and the exception handler patches the caller" mess (especially for the first few times a call is made - modern CPUs don't handle self-modifying code well).
The patching procedure works on all processors I know about, and it is SMP safe. It also has no negative performance aspects, as it is only done once for each occurrence of a syscall.
Another way to stay within the current size constraints of the syscall code might be this:
Code:
; patched code
push table_index ; 5 bytes (table index << 4)
call SYSENTER_PROC ; 5 bytes (total of 10 bytes)
; at some global position (not in the executable)
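SYSENTER_PROC Proc near
push eax
push ecx
push edx
mov ecx,esp ; kernel reads the saved registers and table index through ecx
mov edx,OFFSET sysenter_pos ; kernel checks edx against sysenter_pos
sysenter
sysenter_pos:
pop edx
pop ecx
pop eax
ret 4
SYSENTER_PROC Endp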
; in kernel
app_index = 16
app_eax = 8
app_ecx = 4
app_edx = 0
sysenter_entry:
load_task_ss_esp ; load task ss:esp in some way
push ecx ; put application stack on kernel stack
cmp edx,OFFSET sysenter_pos ; check that sysenter was used in a proper manner
jnz sysenter_fail ; go if not
mov eax,ds:[ecx].app_index ; get table index from application stack (flat ss no longer present here, but ds is the flat selector
; of the application, and thus has the correct mapping)
cmp eax,cs:sys_tab_size ; check that index is reasonable
jae sysenter_fail ; go if not
push dword ptr cs:[eax].sys_tab
push dword ptr cs:[eax+4].sys_tab ; put destination on stack
mov eax,ds:[ecx].app_eax
mov edx,ds:[ecx].app_edx
mov ecx,ds:[ecx].app_ecx
retf32 ; jump to syscall handler
sysenter_fail:
int 3
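For readers who find the offsets easier to follow in C, the sketch below restates the application stack layout that the app_* constants describe, plus the two checks the kernel entry makes. It assumes ECX holds the application's ESP and EDX holds the address of sysenter_pos on entry, as the kernel code's use of ds:[ecx] and "cmp edx" implies. The struct and function names are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Application stack contents at SYSENTER time, lowest address first. */
struct sysenter_frame {
    uint32_t edx;         /* app_edx   = 0  (pushed last)                   */
    uint32_t ecx;         /* app_ecx   = 4                                  */
    uint32_t eax;         /* app_eax   = 8  (pushed first)                  */
    uint32_t ret_addr;    /* 12: return address of "call SYSENTER_PROC"     */
    uint32_t table_index; /* app_index = 16 ("push table_index")            */
};

_Static_assert(offsetof(struct sysenter_frame, edx) == 0, "app_edx");
_Static_assert(offsetof(struct sysenter_frame, ecx) == 4, "app_ecx");
_Static_assert(offsetof(struct sysenter_frame, eax) == 8, "app_eax");
_Static_assert(offsetof(struct sysenter_frame, table_index) == 16, "app_index");

/*
 * Mirrors the two checks in sysenter_entry: EDX must equal sysenter_pos
 * (jnz sysenter_fail) and the index must be below sys_tab_size
 * (jae sysenter_fail). On success the table offset is stored in *offset.
 */
bool sysenter_validate(const struct sysenter_frame *f,
                       uint32_t edx_at_entry,
                       uint32_t sysenter_pos,
                       uint32_t sys_tab_size,
                       uint32_t *offset)
{
    if (edx_at_entry != sysenter_pos)
        return false;
    if (f->table_index >= sys_tab_size)
        return false;
    *offset = f->table_index;
    return true;
}
```

The "ret 4" in SYSENTER_PROC is what removes the table_index slot once the three registers and the return address have been popped.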
Brendan wrote:You also don't need to add a new syscall interface for close to 500 syscalls all at the same time. You could start with one kernel API function, then add another one next week, then add all the rest eventually. Also, based on everything I've heard about RDOS so far, I'd also assume that 99% of the existing kernel API functions are badly designed mistakes; and creating a new/alternative syscall interface would give you a chance to fix all the problems with the existing syscall interface without breaking compatibility (old software can still use the old syscall interface, while new software moves to the new syscall interface).
I'm quite content with the current syscall interface.
Brendan wrote:Also, the new syscall interface wouldn't (and shouldn't) be limited to SYSENTER only. The normal way that sane people do it is to have "eax = function number" and a call table; where the kernel's SYSENTER handler does something like "call [functionTable+eax*4]" then SYSEXIT, the kernel's SYSCALL handler does something like "call [functionTable+eax*4]" then SYSRET, the kernel's software interrupt handler does something like "call [functionTable+eax*4]" then IRET, etc. This would also make it easier to support 64-bit applications one day.
This is the really ancient way of doing syscalls that goes back to DOS and other terrible OSes. I left this way of doing it (along with IOCTL) 20 years ago in favor of my current interface, and I'll never go back to the ancient mess again. My interface guarantees binary compatibility, as existing syscalls may only be changed in ways that don't break backward compatibility. At the server end, device-drivers (or the kernel) register their entry-points with the kernel, and then the patcher creates the call-gates "on the fly" and patches them into user-space. It could also patch in the above sysenter code "on the fly" when sysenter is supported and the device-driver can handle it. Additionally, my syscall interface can gracefully handle unimplemented syscalls by patching in a default-handler that just returns with CY. All syscalls use CY to indicate success / failure.
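As a rough illustration of the registration and default-handler behaviour described here, consider the C sketch below. It is only a model: the real mechanism works with call gates and code patching, the names (register_gate, resolve_gate, MAX_GATES) are hypothetical, and it assumes that CY set means failure, which is how the default handler is described.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_GATES 512 /* assumption; the thread mentions close to 500 syscalls */

/* A handler returns true when it would return with CY set (failure). */
typedef bool (*gate_handler)(void);

static gate_handler registered[MAX_GATES];

/* Used for any gate that no driver has registered: just report failure. */
static bool default_handler(void) { return true; }

/* Hypothetical example of a driver-provided entry point that succeeds. */
static bool sample_handler(void) { return false; }

/* Called by a device driver (or the kernel) to publish an entry point. */
bool register_gate(unsigned gate, gate_handler h)
{
    if (gate >= MAX_GATES || h == NULL)
        return false;
    registered[gate] = h;
    return true;
}

/* What the patcher would resolve a gate number to at patch time. */
gate_handler resolve_gate(unsigned gate)
{
    if (gate < MAX_GATES && registered[gate] != NULL)
        return registered[gate];
    return default_handler;
}
```

An application calling a gate that no driver has registered therefore gets an ordinary failure return instead of a fault.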
In fact, your above table is easy to implement by using the already present gate number cache. However, I would more likely implement it in a similar way as the gate descriptors. When the patcher is invoked, it would check if the current syscall is already in the table, and if it is, it would generate the code to push the existing index on the stack. If not, it would add a new entry to the table, and push the index to that entry. That way, the table would be compact and only contain references to used syscalls.
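The lookup-or-append step can be sketched in C as follows. This is only an outline of the idea, with a hypothetical fixed-size table and a linear search; the names and the capacity are assumptions, and the real patcher would presumably reuse the existing gate number cache.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_CAPACITY 64 /* assumption: size chosen only for the example */

static uint32_t used_gates[TABLE_CAPACITY]; /* gate numbers seen so far */
static size_t used_count;

/*
 * Called when the patcher handles a syscall site. Returns the table index
 * to push for this gate, adding a new entry on first use, or -1 if the
 * table is full.
 */
int patch_table_index(uint32_t gate)
{
    for (size_t i = 0; i < used_count; i++) {
        if (used_gates[i] == gate)
            return (int)i; /* already present: reuse the index */
    }
    if (used_count == TABLE_CAPACITY)
        return -1;
    used_gates[used_count] = gate;
    return (int)used_count++;
}
```

Because an entry is only created the first time a gate is patched, the table holds one slot per syscall actually used, in order of first use.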