fast (short?) switch between compat mode and 64 bit mode
Posted: Sun Aug 07, 2022 9:25 pm
by xeyes
Suppose that the GDT is already set up with 2 CSs, one compat (0x20) and one 64bit (0x10), and LMA is on. Also that all code lives in the 1st 4GB.
What would be a good way to switch between the two modes efficiently?
In the compat->64bit direction I'm now using a 32bit long jump:
Code: Select all
switch:
ljmp $0x10, $continue
continue:
ret
The caller has to deal with the unbalanced stack from this, but otherwise it seems to be fast and short.
In the other direction (64bit->compat), though, I have 2 working versions and am not very happy with either.
V1 uses IRET, with inefficient stack twiddling to prepare the 5 element frame for iretq:
Code: Select all
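switch:
push $0x30 # SS
push %esp # executes as push %rsp in 64-bit mode
.byte 0x48 # REX.W hack
addl $8, (%esp) # saved RSP += 8, so it points above the SS slot
pushf
push $0x20 # compat CS
.byte 0x48 # REX.W hack
sub $8, %esp # reserve the 8-byte RIP slot
movl $continue, (%esp) # RIP, low half
movl $0x0, 4(%esp) # RIP, high half
.byte 0x48 # REX.W hack: iret becomes iretq
iret
continue:
ret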
V2 uses RETF, still quite a bit of stack twiddling and uses a temp register since I can't set SS using an immediate value:
Code: Select all
switch_helper:
.byte 0x48
retf
switch:
push %eax
mov $0x30, %eax
mov %ax, %ss
pop %eax
push $0x20
call switch_helper
ret
There are surely many other factors in performance, like code/stack alignment and the return target buffer, but I'm wondering: is there a simple solution to this leg of the switch, similar to the single 32bit long jump?
I also tried to hand-assemble a 64bit long jump (0xFF2C25 + 4B indirect address), and this almost works. But the address is sign-extended, which makes it useless for the code/data mixture in the 0xC... range, and I'm not ready to map the top of the 48bit space down either.
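To illustrate the sign-extension problem described in the last paragraph: in 64-bit mode, a 32-bit absolute displacement in a memory operand is sign-extended to 64 bits before use, so a far pointer stored in the 0xC... range ends up being looked up near the top of the address space. A minimal C sketch of that arithmetic follows; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int32_t) == 4, "disp32 is four bytes wide");

/* Effective address formed from a 32-bit absolute displacement in
 * 64-bit mode: the displacement is sign-extended, not zero-extended.
 * Hypothetical helper, for illustration only. */
static uint64_t effective_address_from_disp32(uint32_t disp32)
{
    return (uint64_t)(int64_t)(int32_t)disp32;
}
```

An address below 2GB comes back unchanged, while anything with bit 31 set lands in the upper canonical half.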
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Sun Aug 07, 2022 11:25 pm
by nullplan
Shouldn't you have 32-bit and 64-bit code in different places? In different tasks with different stacks? Because then you can use LRETQ or IRETQ in both directions. But if you positively only want to switch CS, why not perform a long jump? Why am I always the only person who seems to know about indirect long jumps?
Code: Select all
/* 64->32 bit transition */
pushq $COMPAT_CS
leaq 1f(%rip), %rax
pushq %rax
ljmpq *(%rsp)
1:
.code32
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Mon Aug 08, 2022 6:28 am
by Gigasoft
Why not just:
Code: Select all
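switch:
movl kss(%rip), %ss
movl $0x20, 4(%esp) # overwrite the upper half of the 8-byte return address with the compat CS
retf # 32-bit far return: pops a 4-byte EIP, then a 4-byte CS
kss:
.long 0x30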
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Mon Aug 08, 2022 11:32 am
by thewrongchristian
xeyes wrote:Suppose that the GDT is already set up with 2 CSs, one compat (0x20) and one 64bit (0x10), and LMA is on. Also that all code lives in the 1st 4GB.
What would be a good way to switch between the two modes efficiently?
What are you actually trying to achieve? Are you mixing 32-bit and 64-bit kernel code?
Or is this to handle 32-bit user mode processes, with a 32-bit kernel side thunk?
If this is to handle 32-bit processes, I don't understand why you'd need 32-bit kernel code, other than some glue code to translate 32-bit syscalls to native 64-bit handlers (the glue itself can be 64-bit native code.)
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Mon Aug 08, 2022 11:35 am
by Octocontrabass
xeyes wrote:What would be a good way to switch between the two modes efficiently?
Why do you want to switch between the two modes?
The architecture is designed around the idea of a 64-bit kernel that can support 32-bit (and 16-bit!) applications, so the typical mode switch is performed as part of a privilege level switch. You go from compatibility to 64-bit mode as part of the usual kernel entry points: software interrupts, call gates, SYSCALL, and SYSENTER. Returning to compatibility mode can be done using the corresponding return instruction: IRET, far RET, SYSRET, and SYSEXIT.
Mode switches that don't involve a privilege level switch tend to have inconvenient limitations (as I'm sure you've already noticed), including one limitation that's specific to AMD CPUs.
You're doing something wrong if you need hacks like this to get your assembler to generate the opcodes you want. Perhaps you forgot a .code64 directive or "q" suffix?
xeyes wrote:V2 uses RETF, still quite a bit of stack twiddling and uses a temp register since I can't set SS using an immediate value:
Why is setting SS part of the mode switch? A 32-bit (or 16-bit!) data segment works perfectly fine for SS in 64-bit mode.
nullplan wrote:Why am I always the only person who seems to know about indirect long jumps?
Most people looking for an instruction like that want to change CS while remaining in 64-bit mode, and far jumps and calls don't support 64-bit operands on AMD CPUs, which means a typical higher-half kernel can't use them. This is one of the rare cases where that's not a problem - although you still need to adjust the code to use "ljmpl" instead of "ljmpq".
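As a rough illustration of what the "ljmpl" form reads from memory: with a 32-bit operand size, the indirect far jump takes a 6-byte operand, a 4-byte offset followed by a 2-byte selector, which is why it only suits targets below 4GB. The C sketch below models that layout. The struct and function names are hypothetical, and `__attribute__((packed))` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory operand of an indirect far jump with 32-bit operand size
 * (m16:32): the 4-byte offset comes first, then the 2-byte selector. */
struct __attribute__((packed)) far_ptr32 {
    uint32_t offset;    /* new EIP */
    uint16_t selector;  /* new CS */
};

static_assert(sizeof(struct far_ptr32) == 6, "m16:32 is 6 bytes");
static_assert(offsetof(struct far_ptr32, selector) == 4,
              "selector follows the offset");

/* Build the operand. The target must lie below 4GB, because only
 * 32 bits of offset can be stored. */
static struct far_ptr32 make_far_ptr32(uint64_t target, uint16_t cs)
{
    assert(target <= UINT32_MAX);
    struct far_ptr32 p = { (uint32_t)target, cs };
    return p;
}
```

Note that two 8-byte pushes, as in the quoted snippet, produce the layout of the 64-bit form instead, with the selector 8 bytes above the offset.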
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Tue Aug 09, 2022 1:37 am
by xeyes
nullplan wrote:Why am I always the only person who seems to know about indirect long jumps?
Code: Select all
/* 64->32 bit transition */
pushq $COMPAT_CS
leaq 1f(%rip), %rax
pushq %rax
ljmpq *(%rsp)
1:
.code32
Nice idea of using ip relative offset! I didn't think of it at all. What does the 1f do though?
Based on this idea I have a shorter version now. The 32b assembler doesn't understand ip relative addressing, but it can at least compute the offsets:
Code: Select all
switch:
.word 0x158e # ip relative mov to SS
.long ss_label - label1 # disp32, relative to the next instruction
label1:
.word 0x2dff # ip relative indirect ljmp
.long ptr_label - continue # disp32, relative to the next instruction
continue:
ret
ptr_label:
.long continue
.word 0x20
ss_label:
.word 0x30
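The two hand-computed `.long` values above follow the usual rule for IP-relative operands: the displacement is the target address minus the address of the next instruction. A small C sketch of that rule, using a hypothetical helper name, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* disp32 for an IP-relative operand: target minus the address of the
 * NEXT instruction, i.e. the one after the instruction being encoded.
 * Hypothetical helper, for illustration only. */
static int32_t rip_rel_disp32(uint64_t next_insn_addr, uint64_t target)
{
    int64_t delta = (int64_t)(target - next_insn_addr);

    /* The encoding only has room for a signed 32-bit displacement. */
    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return (int32_t)delta;
}
```

For example, if `switch` happened to sit at 0x1000, `label1` would be at 0x1006 and `ss_label` at 0x1013, so the first displacement would be 13.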
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Tue Aug 09, 2022 1:39 am
by xeyes
Gigasoft wrote:Why not just:
Code: Select all
switch:
movl kss(%rip), %ss
movl $0x20, 4(%esp)
retf
kss:
.long 0x30
Wow, this is brilliant
It even has a balanced stack on return and saves caller from having to adjust it.
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Tue Aug 09, 2022 2:03 am
by xeyes
nullplan wrote:Shouldn't you have 32-bit and 64-bit code in different places? In different tasks with different stacks?
thewrongchristian wrote:What are you actually trying to achieve? Are you mixing 32-bit and 64-bit kernel code?
Octocontrabass wrote:Why do you want to switch between the two modes?
Long story short, my kernel is 32bit and builds under a 32bit toolchain. For experimental support of long mode, I added a thin shim that mostly enables the kernel to handle events vectored through the IDT.
I also plan to support a few more instructions, such as wider integer mul/div, using the same shim. Should be a nice speed up from software emulation of these operations.
That's why there's a need to switch modes, and hopefully fast. This is not for changing PL or task switching.
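For context on the software emulation mentioned above: without a native 64-bit divide, a 32-bit build typically falls back on a helper routine along the lines of classic shift-and-subtract division. The C sketch below shows that algorithm under a made-up name. It is not taken from this kernel or from any particular runtime library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Restoring (shift-and-subtract) division: one quotient bit per
 * iteration, 64 iterations in total. Hypothetical helper. */
static uint64_t udiv64_soft(uint64_t n, uint64_t d, uint64_t *rem)
{
    uint64_t q = 0, r = 0;

    assert(d != 0);
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
        if (r >= d) {                   /* divisor fits: subtract, set quotient bit */
            r -= d;
            q |= 1ull << i;
        }
    }
    if (rem != NULL)
        *rem = r;
    return q;
}
```

A single hardware divide replaces the whole loop, which is where the expected speed-up would come from.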
Octocontrabass wrote:You're doing something wrong if you need hacks like this to get your assembler to generate the opcodes you want. Perhaps you forgot a .code64 directive or "q" suffix?
Yes, this is not how it's supposed to be done. I had to bend the toolchain slightly backwards in order to generate the needed code. I know it doesn't support the q suffix, but I need to try and see whether it accepts .code64 or not. Not holding my breath, though.
Octocontrabass wrote: inconvenient limitations (as I'm sure you've already noticed), including one limitation that's specific to AMD CPUs.
Could you share the limitations? I'm very new to 64bit and aside from having to switch mode (and restore SS for IDT events), the only obvious limitation I noticed is that VMX instructions cause UD in compat mode, yet they do seem to work if I use the shim around them so it seems doubtful that there's any good reason for the limitation.
thewrongchristian wrote:If this is to handle 32-bit processes, I don't understand why you'd need 32-bit kernel code, other than some glue code to translate 32-bit syscalls to native 64-bit handlers (the glue itself can be 64-bit native code.)
I might attempt the reverse later, it seems feasible if I set up a 64bit user CS and iret to it. That way, the user space apps can access the extra GPRs and XMMs, even though they'd still be limited to the 4GB address space.
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Tue Aug 09, 2022 8:48 am
by rdos
While the CPU manuals are not explicit about it, segment register loads work in 64-bit mode, and as a consequence, so do far jumps to 32-bit code. You can also do far calls to 32-bit code if that is more convenient.
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Tue Aug 09, 2022 8:56 am
by rdos
thewrongchristian wrote:xeyes wrote:Suppose that the GDT is already set up with 2 CSs, one compat (0x20) and one 64bit (0x10), and LMA is on. Also that all code lives in the 1st 4GB.
What would be a good way to switch between the two modes efficiently?
What are you actually trying to achieve? Are you mixing 32-bit and 64-bit kernel code?
Or is this to handle 32-bit user mode processes, with a 32-bit kernel side thunk?
If this is to handle 32-bit processes, I don't understand why you'd need 32-bit kernel code, other than some glue code to translate 32-bit syscalls to native 64-bit handlers (the glue itself can be 64-bit native code.)
I've worked on this too, and my main argument is that I don't want 64-bit kernel code since it cannot be protected with segmentation, but I did want 64-bit user space code to run. I think it is a workable concept, although I've decided not to continue work on it since I can use all physical memory with PAE paging, and a 32-bit app can access all physical memory by mapping it to 2M pages and then can remap it through syscalls. So, there is really no reason to support long mode applications.
The other argument was that FS drivers would benefit from using long mode, but I decided to use non-mapped physical addresses in the disc API instead, which avoids the problem of large disc caches consuming a lot of kernel memory.
So, I see absolutely no reason why I would want to move to long mode at the moment...
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Tue Aug 09, 2022 12:34 pm
by Octocontrabass
xeyes wrote:Long story short, my kernel is 32bit and builds under a 32bit toolchain. For experimental support of long mode, I added a thin shim that mostly enables the kernel to handle events vectored through the IDT.
Why not make your kernel entirely 64-bit?
xeyes wrote:That's why there's a need to switch modes, and hopefully fast. This is not for changing PL or task switching.
Not switching modes is faster than switching modes, and switching modes as part of a privilege level change is faster than switching modes separately.
xeyes wrote:Could you share the limitations? I'm very new to 64bit and aside from having to switch mode (and restore SS for IDT events), the only obvious limitation I noticed is that VMX instructions cause UD in compat mode, yet they do seem to work if I use the shim around them so it seems doubtful that there's any good reason for the limitation.
I'm referring to things like the odd choice to make far CALL/JMP only accept memory operands in 64-bit mode. But now that you mention it, AVX512 doesn't work in 32-bit mode at all, and SSE/AVX/AVX2 are limited to only 8 registers.
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Wed Aug 10, 2022 8:43 pm
by xeyes
rdos wrote:thewrongchristian wrote:xeyes wrote:Suppose that the GDT is already set up with 2 CSs, one compat (0x20) and one 64bit (0x10), and LMA is on. Also that all code lives in the 1st 4GB.
What would be a good way to switch between the two modes efficiently?
What are you actually trying to achieve? Are you mixing 32-bit and 64-bit kernel code?
Or is this to handle 32-bit user mode processes, with a 32-bit kernel side thunk?
If this is to handle 32-bit processes, I don't understand why you'd need 32-bit kernel code, other than some glue code to translate 32-bit syscalls to native 64-bit handlers (the glue itself can be 64-bit native code.)
I've worked on this too, and my main argument is that I don't want 64-bit kernel code since it cannot be protected with segmentation, but I did want 64-bit user space code to run. I think it is a workable concept, although I've decided not to continue work on it since I can use all physical memory with PAE paging, and a 32-bit app can access all physical memory by mapping it to 2M pages and then can remap it through syscalls. So, there is really no reason to support long mode applications.
Segmentation to protect the kernel against (other parts of) itself? Since even user space 64bit code bypasses segmentation?
rdos wrote:
The other argument was that FS drivers would benefit from using long mode, but I decided to use non-mapped physical addresses in the disc API instead, which avoids the problem of large disc caches consuming a lot of kernel memory.
My FS stack uses 32b byte offset all the way down to disk level. So it is now limited not only to 4GB partitions, but also to partitions that are fully contained within the first 4GB of the disk. Enabling 'fast 64b mul/div' seems like a good way to support bigger disks.
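To make the 4GB limit concrete: converting a sector number to a byte offset is a multiplication that overflows 32 bits as soon as the offset reaches 4GB, and converting back needs a 64-bit division. The C sketch below assumes 512-byte sectors and uses made-up function names. It is not the FS stack discussed here.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* assumed sector size */

/* Byte offset of a sector, computed in 64 bits so it cannot wrap at 4GB. */
static uint64_t sector_to_byte_offset(uint32_t lba)
{
    return (uint64_t)lba * SECTOR_SIZE;
}

/* The same value as a 32-bit offset type would store it: the upper
 * half is silently lost. */
static uint32_t sector_to_byte_offset32(uint32_t lba)
{
    return (uint32_t)((uint64_t)lba * SECTOR_SIZE);
}

/* Inverse direction: a 64-bit dividend divided by the sector size. */
static uint32_t byte_offset_to_sector(uint64_t offset)
{
    uint64_t lba = offset / SECTOR_SIZE;

    assert(lba <= UINT32_MAX);
    return (uint32_t)lba;
}
```

With these assumptions, sector 8388608 is the first one whose byte offset no longer fits in 32 bits.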
rdos wrote:So, I see absolutely no reason why I would want to move to long mode at the moment...
Agreed, strongly
btw. I've seen the rdos name many times in various configure.host files. Congratulations on getting into all of them! How did you make this happen? Does it have anything to do with the commercial background of the OS?
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Wed Aug 10, 2022 8:49 pm
by xeyes
Octocontrabass wrote:xeyes wrote:Long story short, my kernel is 32bit and builds under a 32bit toolchain. For experimental support of long mode, I added a thin shim that mostly enables the kernel to handle events vectored through the IDT.
Why not make your kernel entirely 64-bit?
I feel that the current state of things represents a happy middle ground. The kernel can, and indeed sometimes does, run on 32bit CPUs. It has also gained access to selected 64bit only instructions thanks to this shim, with the further possibility of supporting 64bit user space apps.
Going deeper into the 64b land doesn't seem to meet the benefit/cost threshold.
Octocontrabass wrote:xeyes wrote:That's why there's a need to switch modes, and hopefully fast. This is not for changing PL or task switching.
Not switching modes is faster than switching modes, and switching modes as part of a privilege level change is faster than switching modes separately.
Of course not switching is faster, but is there a way to enable IDT events to be dispatched to a compat mode CS?
Octocontrabass wrote:xeyes wrote:Could you share the limitations? I'm very new to 64bit and aside from having to switch mode (and restore SS for IDT events), the only obvious limitation I noticed is that VMX instructions cause UD in compat mode, yet they do seem to work if I use the shim around them so it seems doubtful that there's any good reason for the limitation.
I'm referring to things like the odd choice to make far CALL/JMP only accept memory operands in 64-bit mode.
Was also surprised. Current understanding: this is a limitation of x86 encoding. The MOD field can't represent an 8B offset, and it also didn't get any extension bit in REX, so there's no way to encode the offset needed for 64b long jumps as an immediate value. Intel manual has made reference to co and ct (8B and 10B offset) in section 3.1.1.1 though.
But, disabling VMX instructions in compat mode doesn't seem to have any ISA related reasons. Maybe they just wanted to cut some corners in places where we can't easily see.
Octocontrabass wrote:But now that you mention it, AVX512 doesn't work in 32-bit mode at all, and SSE/AVX/AVX2 are limited to only 8 registers.
My take on AVX512: don't pretend to be a GPU when you are a CPU.
Regardless of that, my kernel doesn't use the FPU once fully booted. What's the point of making managing FPU states harder so that the occasional memcpys and memsets can be a bit faster?
For user space though, there can be more data crunching apps that can make good use of the extra XMMs and GPRs.
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Wed Aug 10, 2022 10:53 pm
by nullplan
xeyes wrote:Of course not switching is faster, but is there a way to enable IDT events to be dispatched to a compat mode CS?
There is not. Intel SDM, Volume 3A, page 6-17 ("interrupt handling in 64-bit mode"):
The target code segment referenced by the interrupt gate must be a 64-bit code segment (CS.L = 1, CS.D = 0). If the target is not a 64-bit code segment, a general-protection exception (#GP) is generated with the IDT vector number as the error code.
In long mode, all IDT events must be vectored into 64-bit mode. Now, you can transition to 32-bit mode very quickly, but the first instructions must be in 64-bit mode, and the interrupt frame will be a 64-bit one (so it will include SS and RSP even if no CPL change happened, and all fields will be eight bytes).
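The two points above can be sketched in C: the frame the CPU pushes in long mode is always five 8-byte slots, and the gate's target descriptor must have L=1 and D=0. The struct, macro, and function names below are made up for illustration. The bit positions are those of an ordinary 8-byte GDT code-segment descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Frame pushed on interrupt delivery in long mode, lowest address first.
 * An error code, when the vector has one, sits below rip. */
struct interrupt_frame64 {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;  /* pushed even when CPL does not change */
    uint64_t ss;
};

static_assert(sizeof(struct interrupt_frame64) == 40, "five 8-byte slots");

#define DESC_EXEC (1ull << 43)  /* type bit 3: executable (code) */
#define DESC_S    (1ull << 44)  /* code/data, not a system descriptor */
#define DESC_L    (1ull << 53)  /* 64-bit code segment */
#define DESC_D    (1ull << 54)  /* default operand size, must be 0 when L=1 */

/* True if the descriptor is a code segment with L=1 and D=0, which is
 * what an interrupt gate has to point at in long mode. */
static bool is_64bit_code_segment(uint64_t descriptor)
{
    bool is_code = (descriptor & (DESC_S | DESC_EXEC)) == (DESC_S | DESC_EXEC);

    return is_code && (descriptor & DESC_L) != 0 && (descriptor & DESC_D) == 0;
}
```

A common flat 64-bit code descriptor such as 0x00AF9A000000FFFF passes this check, while a flat 32-bit one such as 0x00CF9A000000FFFF does not.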
Re: fast (short?) switch between compat mode and 64 bit mode
Posted: Wed Aug 10, 2022 11:14 pm
by Octocontrabass
xeyes wrote:The kernel can, and indeed sometimes does, run on 32bit CPUs.
You can have your bootloader select the appropriate kernel binary according to the CPU capabilities instead of trying to stuff everything into a single binary.
xeyes wrote:Of course not switching is faster, but is there a way to enable IDT events to be dispatched to a compat mode CS?
No. That's why the fastest option will always be staying in 64-bit mode.
Perhaps you should try the x32 ABI. You get all the 64-bit instructions you want while staying in the 32-bit address space you've been using.
xeyes wrote:Was also surprised. Current understanding: this is a limitation of x86 encoding. The MOD field can't represent an 8B offset, and it also didn't get any extension bit in REX, so there's no way to encode the offset needed for 64b long jumps as an immediate value.
Opcodes 0x9A and 0xEA - far CALL and far JMP with the destination encoded as an immediate value - don't have a MOD field, so that can't be it. AMD could have used REX.W to extend them in 64-bit mode, since they already use the 0x66 prefix to select between 16:16 and 16:32 in other modes. Instead, AMD decided those two opcodes would be invalid in 64-bit mode.
xeyes wrote:Intel manual has made reference to co and ct (8B and 10B offset) in section 3.1.1.1 though.
But the Intel manual also doesn't define any opcodes that use those encodings. That's probably some kind of editing mistake.
xeyes wrote:But, disabling VMX instructions in compat mode doesn't seem to have any ISA related reasons. Maybe they just wanted to cut some corners in places where we can't easily see.
Intel has to test every feature they add to the CPU. Why add a feature that will increase the amount of testing if hardly anyone is going to use it?