fast (short?) switch between compat mode and 64 bit mode

Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Post by Octocontrabass »

rdos wrote:When the processor runs in long mode, the loader will instead patch to a syscall instruction with the number placed on the user mode stack, since long mode doesn't support call gates.
Long mode does support call gates. What it doesn't support is direct far calls - you would have to use indirect ones instead. (But SYSCALL and SYSENTER are much faster than call gates, so I would actually get rid of call gates in 32-bit code too.)
rdos wrote:The problem in relation to GCC is that parameters must be passed in registers, and so I need to define function prototypes that load the correct registers and then do the syscall.
I'm not sure I understand the problem here. GCC allows you to pass parameters in registers to inline assembly.
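To make the point about register parameters concrete, here is a minimal sketch of a GCC inline-assembly wrapper that pins its arguments to specific registers. It uses the Linux x86-64 convention purely as a runnable stand-in; the registers RDOS actually uses are not given in this thread, and the name raw_syscall1 is hypothetical.

```c
#if defined(__x86_64__) && defined(__linux__)
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid(), used only for comparison */

/* Hypothetical helper: issue a system call with one argument.
 * The "a" constraint pins nr to RAX and "D" pins a0 to RDI;
 * the result is read back from RAX. SYSCALL overwrites RCX and
 * R11, so both are declared as clobbers. */
static inline long raw_syscall1(long nr, long a0)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a0)
                      : "rcx", "r11", "memory");
    return ret;
}
#endif
```

On Linux x86-64, raw_syscall1(SYS_getpid, 0) should return the same value as getpid(). The constraint letters are GCC-specific, which is the portability concern raised in the reply below.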
rdos
Member
Posts: 3296
Joined: Wed Oct 01, 2008 1:55 pm

Post by rdos »

Octocontrabass wrote:Long mode does support call gates.
That's not my impression. You cannot use a protected mode call gate to get from 32-bit user mode to the kernel entry point that is defined in the gate when the processor is in long mode. This only works if the processor is in protected mode.
Octocontrabass wrote:What it doesn't support is direct far calls - you would have to use indirect ones instead.
Yes, because far calls don't switch ring.
Octocontrabass wrote: (But SYSCALL and SYSENTER are much faster than call gates, so I would actually get rid of call gates in 32-bit code too.)
I don't think so. With SYSCALL or SYSENTER I need another far call in the kernel to get to the correct server routine. I need to push and pop a segment register (ds) and load ds with the call table segment. I need to inspect the user mode call stack to get the syscall number, and the caller needs to push it on the stack. The server routine returns with a far return, adding yet another segment register load. This method requires two segment register loads on entry, two for saving and restoring a segment register, one for reading the user mode stack, one for getting the server address, one for calling the server routine, one for exiting the server routine, and two for returning to user mode. That's a total of 10 segment register loads.

A call gate has two segment register loads on entry (ss and cs) and two segment register loads on exit (ss and cs). There is no other logic needed, rather the server routine is called directly.

In summary, I'm convinced that call gates are faster on older processors, and probably on more modern ones too. Needing twice as many segment register loads and a dozen or so additional operations cannot be faster, even if syscall / sysenter has some optimizations.

Edit: There is actually more to it. SYSCALL is long mode only, and so cannot be used in protected mode. SYSENTER sets up a flat CS & SS in the kernel and saves neither the user mode CS nor the user mode SS. This means that the caller must save CS & SS on the user mode stack, and on return these must be restored. The kernel stack must also be loaded from the thread control block, which requires an additional three or so segment register loads. Since SYSEXIT uses ECX & EDX to return, these must also be saved on the user mode stack and loaded prior to calling the server routine. After the call, ECX & EDX must be saved on the user mode stack, adding another three segment register loads. So, the final result is 16 vs 4 segment register loads. :-)
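The arithmetic behind the 10, 16 and 4 figures can be tallied as below. The line items and counts are rdos's own estimates as stated in this post, not measurements, and the struct and function names are hypothetical.

```c
#include <stddef.h>

/* One line item of the estimate: what happens and how many
 * segment register loads the post attributes to it. */
struct seg_cost { const char *what; unsigned loads; };

/* SYSCALL/SYSENTER path in a segmented kernel, as first listed. */
static const struct seg_cost sysenter_base[] = {
    { "entry (cs, ss)",                          2 },
    { "save and restore ds",                     2 },
    { "read syscall number from the user stack", 1 },
    { "fetch the server address",                1 },
    { "far call to the server routine",          1 },
    { "far return from the server routine",      1 },
    { "return to user mode (cs, ss)",            2 },
};

/* Extra items added in the edit ("three or so" each). */
static const struct seg_cost sysenter_extra[] = {
    { "load kernel stack from thread control block", 3 },
    { "save ECX/EDX on the user stack after call",   3 },
};

/* Call gate path: the server routine is entered directly. */
static const struct seg_cost call_gate[] = {
    { "entry (cs, ss)", 2 },
    { "exit (cs, ss)",  2 },
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Sum the load counts of a list of line items. */
static unsigned total_loads(const struct seg_cost *items, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += items[i].loads;
    return sum;
}
```

Summing sysenter_base gives 10, adding sysenter_extra gives 16, and call_gate gives 4. Whether each item really costs a segment register load is the point disputed in the next reply.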
Octocontrabass wrote:
rdos wrote:The problem in relation to GCC is that parameters must be passed in registers, and so I need to define function prototypes that load the correct registers and then do the syscall.
I'm not sure I understand the problem here. GCC allows you to pass parameters in registers to inline assembly.
The problem is that this is not standardized, so GCC and OW (Open Watcom) need different code. That means two versions to keep up to date rather than only one.
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Post by Octocontrabass »

rdos wrote:You cannot use a protected mode call gate [...] when the processor is in long mode.
Correct. You can only use long mode call gates in long mode.
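The two gate formats differ in size, which is why one cannot stand in for the other. A rough sketch of both encodings follows; the function and struct names are hypothetical, and in long mode the selector is assumed to reference a 64-bit code segment.

```c
#include <stdint.h>

enum { GATE_TYPE_CALL = 0xC };  /* system descriptor type for a call gate */

/* 8-byte protected mode call gate (32-bit target offset). */
static uint64_t call_gate32(uint16_t selector, uint32_t offset,
                            unsigned dpl, unsigned params)
{
    uint64_t d = 0;
    d |= (uint64_t)(offset & 0xFFFFu);       /* offset 15:0             */
    d |= (uint64_t)selector << 16;           /* target code selector    */
    d |= (uint64_t)(params & 0x1Fu) << 32;   /* stack parameter count   */
    d |= (uint64_t)GATE_TYPE_CALL << 40;     /* type, S bit stays 0     */
    d |= (uint64_t)(dpl & 3u) << 45;         /* privilege needed to use */
    d |= (uint64_t)1 << 47;                  /* present                 */
    d |= (uint64_t)(offset >> 16) << 48;     /* offset 31:16            */
    return d;
}

/* 16-byte long mode call gate: same low half without a parameter
 * count, plus a second quadword holding offset bits 63:32. */
struct call_gate64 { uint64_t lo, hi; };

static struct call_gate64 make_call_gate64(uint16_t selector,
                                           uint64_t offset, unsigned dpl)
{
    struct call_gate64 g;
    g.lo = call_gate32(selector, (uint32_t)offset, dpl, 0);
    g.hi = offset >> 32;  /* upper 32 bits, including the type field, stay 0 */
    return g;
}
```

For example, call_gate32(0x08, 0x12345678, 3, 0) yields 0x1234EC0000085678. A long mode gate occupies two consecutive GDT slots.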
rdos wrote:This method requires two segment register loads on entry, [...] and two for returning to user mode.
SYSCALL, SYSENTER, SYSRET, and SYSEXIT do not perform segment register loads. That's part of the reason why they're so fast!
rdos wrote:Needing twice as many segment register loads and a dozen or so additional operations cannot be faster, even if syscall / sysenter has some optimizations.
Perhaps you should prove it with some benchmarks.
rdos wrote:SYSCALL is long mode only, and so cannot be used in protected mode.
SYSCALL works in protected mode on AMD CPUs. (SYSENTER doesn't work in long mode on AMD CPUs.)
rdos wrote:SYSENTER sets up a flat CS & SS in the kernel and saves neither the user mode CS nor the user mode SS. This means that the caller must save CS & SS on the user mode stack, and on return these must be restored.
SYSCALL/SYSENTER/SYSRET/SYSEXIT are designed for a flat address space.
rdos
Member
Posts: 3296
Joined: Wed Oct 01, 2008 1:55 pm

Post by rdos »

Octocontrabass wrote:
rdos wrote:You cannot use a protected mode call gate [...] when the processor is in long mode.
Correct. You can only use long mode call gates in long mode.
Which is quite useless when the kernel is running in compatibility mode.

Still, when I tried to implement long mode applications, I used SYSCALL. The kernel side of SYSCALL needs to load a kernel stack, and to make this at least somewhat efficient I had to use a flat kernel stack with guard pages. I could load the linear address of the kernel stack through the thread control block. I also had to optimize the dispatch of the server routine by mapping the syscall table to a fixed location and using that location from long mode. Then I needed to alias pointers using paging. All of these things to some degree compromised the integrity of my kernel. The flat kernel stack is only enabled if the long mode loader is present to avoid the integrity issue of a flat stack that can reference anything.
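The dispatch step mentioned here, a syscall number indexing a table at a known location, might look roughly like this. It is only a sketch: the table contents, the two-argument signature and the error value are assumptions for illustration, not RDOS's actual layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical server routine signature. */
typedef long (*server_fn)(long a0, long a1);

/* Placeholder server routines, for illustration only. */
static long srv_add(long a, long b) { return a + b; }
static long srv_sub(long a, long b) { return a - b; }

#define BAD_SYSCALL (-1L)   /* assumed error result */

/* In the design described above this table would be mapped at a
 * fixed linear address; here it is an ordinary static array. */
static const server_fn syscall_table[] = { srv_add, srv_sub };

/* Bounds-check the number taken from user mode, then call through. */
static long dispatch(uint32_t nr, long a0, long a1)
{
    size_t count = sizeof syscall_table / sizeof syscall_table[0];
    if (nr >= count || syscall_table[nr] == NULL)
        return BAD_SYSCALL;
    return syscall_table[nr](a0, a1);
}
```

The bounds check matters because the number comes from the user mode stack and cannot be trusted.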
Octocontrabass wrote:
rdos wrote:This method requires two segment register loads on entry, [...] and two for returning to user mode.
SYSCALL, SYSENTER, SYSRET, and SYSEXIT do not perform segment register loads. That's part of the reason why they're so fast!
That's also the reason why they suck when used with segmentation.
Octocontrabass wrote:
rdos wrote:Needing twice as many segment register loads and a dozen or so additional operations cannot be faster, even if syscall / sysenter has some optimizations.
Perhaps you should prove it with some benchmarks.
I think this should be obvious. On the caller side, a far call must be done to save cs. You must save ss too on the user mode stack. With call gates, this context is saved on the kernel stack. On the kernel side, another far call must be done to reach the server routine. Additionally, SYSENTER uses a fixed kernel stack and disables interrupts, so a thread stack must be loaded before interrupts can be enabled again. Each CPU core needs its own SYSENTER kernel stack. Call gates automatically load the kernel stack from the TSS.
Octocontrabass wrote:
rdos wrote:SYSENTER sets up a flat CS & SS in the kernel and saves neither the user mode CS nor the user mode SS. This means that the caller must save CS & SS on the user mode stack, and on return these must be restored.
SYSCALL/SYSENTER/SYSRET/SYSEXIT are designed for a flat address space.
Sure, but I don't use a flat kernel and I have two sets of user mode flat selectors. One for normal applications and one for servers. The application flat selectors have a 3GB limit and the server flat selectors have a 2GB limit. SYSEXIT will return to user mode with a 4GB limit, which will compromise kernel integrity (as well as the kernel part of servers). In fact, there is no selector with a 4GB limit that user mode can load, and so it can't reference kernel memory, neither directly nor indirectly by passing bad pointers.

Additionally, the user mode flat selectors can have a non-zero base. This will complicate things even more.

I also support segmented applications, and those work just as well with call gates, but not at all with SYSENTER.
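The protection argument rests on the segment limit check the CPU applies to every access. A simplified model for expand-up segments with a zero base is sketched below; it assumes the kernel occupies the range above the 3GB limit, which the post implies but does not state, and the names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define APP_LIMIT    0xBFFFFFFFu  /* 3GB selector: last valid offset */
#define SERVER_LIMIT 0x7FFFFFFFu  /* 2GB selector: last valid offset */
#define FLAT_LIMIT   0xFFFFFFFFu  /* 4GB selector, as loaded by SYSEXIT */

/* Model of the limit check for an expand-up segment: every byte of
 * the access [offset, offset + size - 1] must lie at or below the
 * limit. A zero-sized access touches nothing and is allowed here. */
static bool access_ok(uint32_t limit, uint32_t offset, uint32_t size)
{
    if (size == 0)
        return true;
    return offset <= limit && size - 1 <= limit - offset;
}
```

With APP_LIMIT, a 4-byte access at 0xBFFFFFFC passes and one at 0xBFFFFFFD faults. With FLAT_LIMIT the same pointer into the upper gigabyte passes the segment check, so only paging is left to stop it.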
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Post by xeyes »

rdos wrote:
xeyes wrote: Aren't syscall index just small integers that would work with any toolchain? Or do you also have some special designs there?
I code my syscalls as invalid instructions that have a 32-bit integer to indicate the syscall number. The instruction is then typically patched by the executable loader to a call gate.

When the processor runs in long mode, the loader will instead patch to a syscall instruction with the number placed on the user mode stack, since long mode doesn't support call gates. In the kernel, the syscall entry point needs to find the number and dispatch the correct server procedure with a far call.

When syscalls are performed in kernel or a driver, the syscall (or driver call, which is similar) will instead be patched to a direct far call.

Because GDT selectors are a scarce resource, call gates are allocated & setup on reference from user mode only.

The problem in relation to GCC is that parameters must be passed in registers, and so I need to define function prototypes that load the correct registers and then do the syscall. These are quite different between GCC and OW.

This design is why I also can support 16-bit applications. 16-bit applications will pass 16-bit registers, and these are extended to 32-bit by the server. This is done by the server registering both a 16-bit and a 32-bit entry point (or a bimodal, in case this extension is not needed). DOS applications are supported by aliasing a selector, and long mode by the use of paging. Thus, I don't need any additional translation layers and I will not be tied-up to a specific stack content.
Wow, that's a sophisticated system that can handle so many situations.

Why the dynamic patching though, are the call gates also not fixed and can change dynamically?
rdos wrote: The application flat selectors have a 3GB limit
I need to give this a try. My kernel sometimes runs with the user DS because it is not set on all paths into the kernel. Most of the time a user DS doesn't cause any exceptions, which makes it difficult to track down all cases of this.
rdos
Member
Posts: 3296
Joined: Wed Oct 01, 2008 1:55 pm

Post by rdos »

xeyes wrote: Why the dynamic patching though, are the call gates also not fixed and can change dynamically?
There are currently 790 defined syscalls, but only a minority of them are used by typical applications. I could have assigned a fixed GDT selector to each syscall, but I find it more future-safe to allocate GDT selectors only for the syscalls that are actually used. After all, there are only 8191 GDT selectors. I do assign fixed GDT code & data selectors to drivers, but the number of drivers is not so large.
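On-demand allocation of this kind can be sketched as a lazily filled lookup table. Everything except the 790 and the GDT size is an assumption here: the names, the first free index, and the choice to hand out indices sequentially are for illustration only, and writing the actual gate descriptor into the GDT is left out.

```c
#include <stdint.h>

#define GDT_ENTRIES  8192u  /* 64KB maximum GDT / 8 bytes; entry 0 is the null
                               descriptor, leaving 8191 usable selectors */
#define MAX_SYSCALLS 790u   /* number of defined syscalls, from the post */
#define FIRST_GATE   64u    /* hypothetical first index reserved for gates */

/* Selector assigned to each syscall; 0 means "not allocated yet". */
static uint16_t gate_for_syscall[MAX_SYSCALLS];
static uint32_t next_free = FIRST_GATE;

/* Return the call gate selector for syscall nr, allocating a GDT
 * index on first reference. Returns 0 if nr is out of range or the
 * GDT is full. A selector is index << 3, with TI = 0 (GDT) and
 * RPL = 3 for user mode callers. */
static uint16_t get_call_gate(uint32_t nr)
{
    if (nr >= MAX_SYSCALLS)
        return 0;
    if (gate_for_syscall[nr] == 0) {
        if (next_free >= GDT_ENTRIES)
            return 0;
        gate_for_syscall[nr] = (uint16_t)((next_free++ << 3) | 3u);
    }
    return gate_for_syscall[nr];
}
```

The first reference to a syscall consumes a GDT slot and later references reuse it, so an application that uses 20 syscalls costs 20 selectors rather than 790.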
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Post by xeyes »

rdos wrote:
xeyes wrote: Why the dynamic patching though, are the call gates also not fixed and can change dynamically?
There are currently 790 defined syscalls, but only a minority of them are used by typical applications. I could have assigned a fixed GDT selector to each syscall, but I find it more future-safe to allocate GDT selectors only for the syscalls that are actually used. After all, there are only 8191 GDT selectors. I do assign fixed GDT code & data selectors to drivers, but the number of drivers is not so large.
So it is actually on demand rather than just dynamic per boot like ASLR? You really have many interesting designs!

Did these features evolve iteratively over time, or were they part of a 'grand blueprint' that stayed more or less the same?