Hi,
rdos wrote:Brendan wrote:How many UEFI systems did you test it on?
Just to make this clear. I don't boot into UEFI mode, but usually use GRUB legacy to boot. In one case I used GRUB 2 in Linux Fedora to boot RDOS. In all these cases VBE is working, and I don't need any PCI snooping.
Ah - I understand now. VBE worked on every UEFI system you tested, because you tested a total of zero UEFI systems.
rdos wrote:Brendan wrote:You may be right; if the OS is crap and doesn't bother using the "global pages" feature to avoid unnecessary TLB flushes when CR3 is loaded, then switching CPU modes like this won't make the OS's "unnecessary TLB flushing" worse because the OS is already as bad as it possibly can be.
I do have global page support, but it is currently disabled because it doesn't work properly. OTOH, there is no noticeable difference if the OS runs with global pages or not. You seem to greatly overestimate the evil of flushing TLBs. It doesn't cost "thousands" of cycles to flush the TLB. It's more like a syscall.
There are very few reasons why global pages won't work properly, especially if you're using paging properly (e.g. using INVLPG where possible instead of reloading CR3). On the other hand, if you're reloading CR3 for every flush anyway, then global page support won't make much difference in the first place.
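For what it's worth, getting global pages working is mostly two pieces (a minimal sketch, assuming a 32-bit kernel and GCC-style inline assembly; none of this is your code):

/* Set CR4.PGE (bit 7) once during boot, so the CPU honours the
 * "global" bit in page table entries. */
static inline void enable_global_pages(void)
{
    unsigned long cr4;
    __asm__ __volatile__("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= 1ul << 7;                                /* CR4.PGE */
    __asm__ __volatile__("mov %0, %%cr4" : : "r"(cr4));
}

/* Mark kernel mappings global (bit 8 of the PTE) so they survive CR3
 * reloads... */
#define PTE_GLOBAL (1u << 8)

/* ...and invalidate individual pages with INVLPG instead of reloading
 * CR3, so the rest of the TLB survives too. Note that INVLPG does
 * remove global entries, while a CR3 reload does not. */
static inline void invlpg(const void *virt)
{
    __asm__ __volatile__("invlpg (%0)" : : "r"(virt) : "memory");
}

The usual way global pages "don't work properly" is this combination being half-done - kernel PTEs marked global, but stale kernel mappings still being "flushed" with a CR3 reload that global entries ignore.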
The initial cost of flushing TLBs is small, and easy to measure. The total cost of flushing TLBs (including the cost of every TLB miss that occurs afterwards that could've been avoided if TLBs weren't flushed) is much larger, and much harder to measure. I'm talking about the total cost of flushing TLBs, not just the initial cost that is "more like a syscall".
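If you want to see the total cost rather than the initial cost, something along these lines shows it (a rough sketch, assuming CPL=0, interrupts disabled, and a mapped buffer with more pages than the TLB can hold; buf and pages are made up for illustration):

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

static inline void reload_cr3(void)
{
    unsigned long cr3;
    __asm__ __volatile__("mov %%cr3, %0" : "=r"(cr3));
    __asm__ __volatile__("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

unsigned long long tlb_flush_total_cost(volatile char *buf, unsigned long pages)
{
    unsigned long i;
    unsigned long long start;

    for (i = 0; i < pages; i++)      /* warm the TLB first */
        buf[i * 4096] = 0;

    start = rdtsc();
    reload_cr3();                    /* the cheap part */
    for (i = 0; i < pages; i++)      /* the expensive part: every access
                                        that would have hit the TLB now
                                        pays for a page table walk */
        buf[i * 4096] = 0;

    return rdtsc() - start;
}

With working global pages the kernel's own mappings are excluded from that penalty; without them, every kernel TLB entry pays it too.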
rdos wrote:Brendan wrote:
rdos wrote:2. TR register is always reloaded with every thread-switch (per thread SS0 and IO-bitmaps)
OK - that's not well optimised either; so reloading the TSS during task switches doesn't make it worse.
Wrong. This is well optimized when each task has its own kernel SS selector. You won't save an SS reload since SS will be reloaded anyway (unless the SYSENTER method is used, but as presented in an older thread, this is only faster on some CPUs).
Most tasks (e.g. applications) don't have any access to any IO ports; therefore most tasks can use the same TSS (and same IO permission bitmap), and during most task switches you only need to change the "SS0" field in that single/shared TSS. This is faster than loading a different TSS, and saves some RAM (one shared TSS consumes less memory than many TSSs).
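A sketch of that first case (the field layout follows the standard i386 TSS; the function and variable names are invented for illustration):

/* One TSS, loaded into TR exactly once at boot, shared by every task
 * that has no IO port access. */
struct tss32 {
    unsigned int prev_task;
    unsigned int esp0;          /* kernel stack pointer for CPL3 -> CPL0 */
    unsigned int ss0;           /* kernel stack segment for CPL3 -> CPL0 */
    unsigned int esp1, ss1, esp2, ss2;
    unsigned int cr3, eip, eflags;
    unsigned int eax, ecx, edx, ebx, esp, ebp, esi, edi;
    unsigned int es, cs, ss, ds, fs, gs, ldt;
    unsigned short trap;
    unsigned short iomap_base;  /* offset of the IO permission bitmap */
};

static struct tss32 shared_tss;

/* The entire per-task-switch TSS work: no LTR, no descriptor changes. */
static void set_kernel_stack(unsigned int ss0, unsigned int esp0)
{
    shared_tss.ss0  = ss0;
    shared_tss.esp0 = esp0;
}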
For tasks that do have access to some IO ports, the fastest way is to have one TSS that is split across pages, such that the first part of the TSS is on one page (shared by all tasks) and the IO permission bitmap is on the next page. This allows each virtual address space (process) to have a different page mapped for the same TSS's IO permission bitmap, so that changing the virtual address space automatically changes the IO permission bitmap "for free". In this case, during a task switch you'd change CR3 (if necessary) then set the "SS0" field in the TSS.
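And a sketch of the second case, reusing struct tss32 from above (the addresses are illustrative, and the TSS descriptor's limit must be large enough to cover the bitmap page too):

/* Place the TSS so it ends exactly at a page boundary; the IO
 * permission bitmap then starts on the next page. The TSS page is
 * mapped in every address space, the bitmap page is mapped
 * per-process - so reloading CR3 swaps bitmaps "for free". */
#define TSS_PAGE    0xFFBFE000u                /* illustrative */
#define BITMAP_PAGE (TSS_PAGE + 0x1000u)       /* mapped per-process */
#define TSS ((struct tss32 *)(TSS_PAGE + 0x1000u - sizeof(struct tss32)))

static void init_split_tss(void)
{
    /* iomap_base is an offset from the TSS base, so this lands the
     * bitmap exactly at BITMAP_PAGE. */
    TSS->iomap_base = (unsigned short)sizeof(struct tss32);
}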
If the TR register is always reloaded during task switches, then you aren't doing either of these optimisations; and because your code isn't well optimised and always loads TR during task switches anyway, there'd be no additional overhead involved with changing the TSS due to switching between protected mode tasks and long mode tasks.
rdos wrote:Brendan wrote:OK, so the OS is already very bad at doing task switches (e.g. reloading segment registers during the task switch and not just reloading segment registers when you return to CPL=3); and because the OS is already bad it's hard to make it worse.
The kernel is not flat, and thus needs to reload segment registers. As simple as that.
OK. The scheduler is one module, and thus there's no need to change from the scheduler's segment registers to the scheduler's segment registers during a task switch. You only need to change segment registers after the task switch is done, when you return from the "scheduler module" (to another part of the kernel, to user space, or to wherever else).
The OS is already bad at doing task switches (e.g. reloading segment registers during the task switch and not just reloading segment registers when you return from the scheduler); and because the OS is already bad it's hard to make it worse.
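In other words, a task switch only has to touch what actually differs between tasks (a sketch; switch_stacks is a hypothetical assembly helper, and the types are invented):

struct task {
    unsigned long cr3;          /* address space (process) */
    unsigned long kernel_esp;   /* saved kernel stack pointer */
};

static struct task *current_task;

/* Assembly helper (assumed): saves ESP into *save, then loads new_esp. */
extern void switch_stacks(unsigned long *save, unsigned long new_esp);

void task_switch(struct task *next)
{
    struct task *prev = current_task;

    if (next->cr3 != prev->cr3)  /* only if the process changes */
        __asm__ __volatile__("mov %0, %%cr3" : : "r"(next->cr3) : "memory");

    current_task = next;
    switch_stacks(&prev->kernel_esp, next->kernel_esp);

    /* Note what is absent: no segment register loads. The scheduler's
     * own selectors stay loaded throughout; other selectors are
     * restored once, by the code that leaves the scheduler module. */
}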
rdos wrote:Brendan wrote:
It's impossible to save or restore a 64-bit process' state in 32-bit code, as 32-bit code can only access the low 32 bits of *half* the general purpose registers (and half of the 16 XMM registers, etc). To get around that you would have to do the state saving and state loading in 64-bit code. I thought you'd do it in stubs (e.g. saving the 64-bit state before passing control to the 32-bit kernel, and restoring it before returning to the 64-bit process), but now you're saying you won't need stubs.
Few things are impossible. Saving 64-bit state in the scheduler, which normally runs in legacy (compatibility) mode, is as simple as jumping to a 64-bit code chunk that does the save. The restore would be done by the switch code that re-enters 64-bit mode.
So you're saying that it's not impossible for 32-bit code to save/restore 64-bit state (if the 32-bit code is not 32-bit code)?
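For the record, this is the state in question, and the save routine's one hard requirement (a sketch; the struct is illustrative and save_regs64 is an assumed assembly stub reached via a far jump through a 64-bit CS descriptor):

/* Registers a 32-bit code segment simply cannot name: the upper
 * halves of all 16 GPRs, and r8-r15 (plus xmm8-xmm15) entirely. */
struct regs64 {
    unsigned long long rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp;
    unsigned long long r8,  r9,  r10, r11, r12, r13, r14, r15;
    /* xmm8-xmm15 and so on omitted */
};

/* Must be assembled into a 64-bit code segment; a 32-bit kernel can
 * only reach it with a far jump/call through a 64-bit CS. */
extern void save_regs64(struct regs64 *out);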
rdos wrote:Brendan wrote:If you're making modifications to the memory management specifically to support 64-bit applications, and also making modifications to the scheduler's state saving/loading to support 64-bit applications; do you have a sane reason to bother with the 32-bit kernel at all, and would it be much better in the end to write a 64-bit kernel for 64-bit applications (and then provide a "compatibility layer" in that 64-bit kernel for your crusty old 32-bit processes)?
Several sane reasons:
1. I don't want to start from scratch
2. I don't want a flat kernel
3. By the time the kernel is finished, x86-64 mode would be obsolete.
1. Given the state of your existing OS, "I don't want to start from scratch" is not a sane reason.
2. Anything that sounds like "I want to continue to use the pointlessly stupid and slow and inferior segmented model that every sane person in the world abandoned many decades ago" is not a sane reason.
3. Writing a nice clean "64-bit only" kernel (with no support for 32-bit tasks) is likely to take about the same time as hacking 64-bit support on top of the existing kernel and fixing all the teething problems. The extra time that you're worried about is the time needed to add support for legacy 32-bit tasks to the new clean "64-bit only" kernel. For most sane OSs this extra time would be minimal, because there's no segmentation involved and the 32-bit API is almost the same as the 64-bit API anyway (just different register usage and address sizes). The point is, it's your segmented model that causes the extra time you're worrying about.
Now, I thought that your 32-bit kernel was able to support "flat 32-bit" tasks. If this is the case then you have a clear upgrade path - the 32-bit kernel would continue to support segmented and flat 32-bit tasks; and the new 64-bit kernel would support flat 64-bit tasks and flat 32-bit tasks; and any software that needs to run under both kernels would need to become flat 32-bit code.
What I think you're doing is allowing sunk costs (the time you've spent on the "segmented model" in the past) to influence your decisions, and this is causing you to make bad decisions for the future.
Cheers,
Brendan