Hi,
rdos wrote:Brendan wrote:How many UEFI systems did you test it on?
Just to make this clear. I don't boot into UEFI mode, but usually use GRUB legacy to boot. In one case I used GRUB 2 in Linux Fedora to boot RDOS. In all these cases VBE is working, and I don't need any PCI snooping.
Ah - I understand now. VBE worked on every UEFI system you tested, because you tested a total of zero UEFI systems.
rdos wrote:Brendan wrote:You may be right; if the OS is crap and doesn't bother using the "global pages" feature to avoid unnecessary TLB flushes when CR3 is loaded, then switching CPU modes like this won't make the OS's "unnecessary TLB flushing" worse because the OS is already as bad as it possibly can be.
I do have global page support, but it is currently disabled because it doesn't work properly. OTOH, there is no noticeable difference if the OS runs with global pages or not. You seem to greatly overestimate the evil of flushing TLBs. It doesn't cost "thousands" of cycles to flush the TLB. It's more like a syscall.
There are very few reasons why global pages won't work properly, especially if you're using paging properly (e.g. using INVLPG where possible instead of reloading CR3). On the other hand, if you're reloading CR3 for every flush anyway, then global page support won't make much difference in the first place.
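For what it's worth, getting global pages working is mostly two pieces (a minimal sketch, assuming a 32-bit kernel and GCC-style inline assembly; none of this is your code):

/* Set CR4.PGE (bit 7) once during boot, so the CPU honours the
 * "global" bit in page table entries. */
static inline void enable_global_pages(void)
{
    unsigned long cr4;
    __asm__ __volatile__("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= 1ul << 7;                                /* CR4.PGE */
    __asm__ __volatile__("mov %0, %%cr4" : : "r"(cr4));
}

/* Mark kernel mappings global (bit 8 of the PTE) so they survive CR3
 * reloads... */
#define PTE_GLOBAL (1u << 8)

/* ...and invalidate individual pages with INVLPG instead of reloading
 * CR3, so the rest of the TLB survives too. Note that INVLPG does
 * remove global entries, while a CR3 reload does not. */
static inline void invlpg(const void *virt)
{
    __asm__ __volatile__("invlpg (%0)" : : "r"(virt) : "memory");
}

The usual way global pages "don't work properly" is this combination being half-done - kernel PTEs marked global, but stale kernel mappings still being "flushed" with a CR3 reload that global entries ignore.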
The initial cost of flushing TLBs is small, and easy to measure. The total cost of flushing TLBs (including the cost of every TLB miss that occurs afterwards that could've been avoided if TLBs weren't flushed) is much larger, and much harder to measure. I'm talking about the total cost of flushing TLBs, not just the initial cost that is "more like a syscall".
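If you want to see the total cost rather than the initial cost, something along these lines shows it (a rough sketch, assuming CPL=0, interrupts disabled, and a mapped buffer with more pages than the TLB can hold; buf and pages are made up for illustration):

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

static inline void reload_cr3(void)
{
    unsigned long cr3;
    __asm__ __volatile__("mov %%cr3, %0" : "=r"(cr3));
    __asm__ __volatile__("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

unsigned long long tlb_flush_total_cost(volatile char *buf, unsigned long pages)
{
    unsigned long i;
    unsigned long long start;

    for (i = 0; i < pages; i++)      /* warm the TLB first */
        buf[i * 4096] = 0;

    start = rdtsc();
    reload_cr3();                    /* the cheap part */
    for (i = 0; i < pages; i++)      /* the expensive part: every access
                                        that would have hit the TLB now
                                        pays for a page table walk */
        buf[i * 4096] = 0;

    return rdtsc() - start;
}

With working global pages the kernel's own mappings are excluded from that penalty; without them, every kernel TLB entry pays it too.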
rdos wrote:Brendan wrote:
rdos wrote:2. TR register is always reloaded with every thread-switch (per thread SS0 and IO-bitmaps)
OK - that's not well optimised either; so reloading the TSS during task switches doesn't make it worse.
Wrong. This is well optimized when each task has its own kernel SS selector. You won't save an SS reload since SS will be reloaded anyway (unless the SYSENTER method is used, but as presented in an older thread, this is only faster on some CPUs).
Most tasks (e.g. applications) don't have any access to any IO ports; therefore most tasks can use the same TSS (and same IO permission bitmap), and during most task switches you only need to change the "SS0" field in that single/shared TSS. This is faster than loading a different TSS, and saves some RAM (one shared TSS consumes less memory than many TSSs).
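A sketch of that first case (the field layout follows the standard i386 TSS; the function and variable names are invented for illustration):

/* One TSS, loaded into TR exactly once at boot, shared by every task
 * that has no IO port access. */
struct tss32 {
    unsigned int prev_task;
    unsigned int esp0;          /* kernel stack pointer for CPL3 -> CPL0 */
    unsigned int ss0;           /* kernel stack segment for CPL3 -> CPL0 */
    unsigned int esp1, ss1, esp2, ss2;
    unsigned int cr3, eip, eflags;
    unsigned int eax, ecx, edx, ebx, esp, ebp, esi, edi;
    unsigned int es, cs, ss, ds, fs, gs, ldt;
    unsigned short trap;
    unsigned short iomap_base;  /* offset of the IO permission bitmap */
};

static struct tss32 shared_tss;

/* The entire per-task-switch TSS work: no LTR, no descriptor changes. */
static void set_kernel_stack(unsigned int ss0, unsigned int esp0)
{
    shared_tss.ss0  = ss0;
    shared_tss.esp0 = esp0;
}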
For tasks that do have access to some IO ports, the fastest way is to have one TSS that is split across pages, such that the first part of the TSS is on one page (shared by all tasks) and the IO permission bitmap is on the next page. This allows each virtual address space (process) to have a different page mapped for the same TSS's IO permission bitmap, so that changing the virtual address space automatically changes the IO permission bitmap "for free". In this case, during a task switch you'd change CR3 (if necessary) then set the "SS0" field in the TSS.
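And a sketch of the second case, reusing struct tss32 from above (the addresses are illustrative, and the TSS descriptor's limit must be large enough to cover the bitmap page too):

/* Place the TSS so it ends exactly at a page boundary; the IO
 * permission bitmap then starts on the next page. The TSS page is
 * mapped in every address space, the bitmap page is mapped
 * per-process - so reloading CR3 swaps bitmaps "for free". */
#define TSS_PAGE    0xFFBFE000u                /* illustrative */
#define BITMAP_PAGE (TSS_PAGE + 0x1000u)       /* mapped per-process */
#define TSS ((struct tss32 *)(TSS_PAGE + 0x1000u - sizeof(struct tss32)))

static void init_split_tss(void)
{
    /* iomap_base is an offset from the TSS base, so this lands the
     * bitmap exactly at BITMAP_PAGE. */
    TSS->iomap_base = (unsigned short)sizeof(struct tss32);
}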
If the TR register is always reloaded during task switches, then you aren't doing either of these optimisations; and because your code isn't well optimised and always loads TR during task switches anyway, there'd be no additional overhead involved with changing the TSS due to switching between protected mode tasks and long mode tasks.
rdos wrote:Brendan wrote:OK, so the OS is already very bad at doing task switches (e.g. reloading segment registers during the task switch and not just reloading segment registers when you return to CPL=3); and because the OS is already bad it's hard to make it worse.
The kernel is not flat, and thus needs to reload segment registers. As simple as that.
OK. The scheduler is one module, and thus there's no need to change from the scheduler's segment registers to the scheduler's segment registers during a task switch. You only need to change segment registers after the task switch is done, when you return from the "scheduler module" (to another part of the kernel, to user space, or to wherever else).
The OS is already bad at doing task switches (e.g. reloading segment registers during the task switch and not just reloading segment registers when you return from the scheduler); and because the OS is already bad it's hard to make it worse.
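In other words, a task switch only has to touch what actually differs between tasks (a sketch; switch_stacks is a hypothetical assembly helper, and the types are invented):

struct task {
    unsigned long cr3;          /* address space (process) */
    unsigned long kernel_esp;   /* saved kernel stack pointer */
};

static struct task *current_task;

/* Assembly helper (assumed): saves ESP into *save, then loads new_esp. */
extern void switch_stacks(unsigned long *save, unsigned long new_esp);

void task_switch(struct task *next)
{
    struct task *prev = current_task;

    if (next->cr3 != prev->cr3)  /* only if the process changes */
        __asm__ __volatile__("mov %0, %%cr3" : : "r"(next->cr3) : "memory");

    current_task = next;
    switch_stacks(&prev->kernel_esp, next->kernel_esp);

    /* Note what is absent: no segment register loads. The scheduler's
     * own selectors stay loaded throughout; other selectors are
     * restored once, by the code that leaves the scheduler module. */
}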
rdos wrote:Brendan wrote:
It's impossible to save or restore a 64-bit process' state in 32-bit code, as 32-bit code can only access the low 32 bits of *half* the general purpose registers (and half of the 16 XMM registers, etc). To get around that you would have to do the state saving and state loading in 64-bit code. I thought you'd do it in stubs (e.g. saving the 64-bit state before passing control to the 32-bit kernel, and restoring it before returning to the 64-bit process), but now you're saying you won't need stubs.
Few things are impossible. Saving 64-bit state in the scheduler, which normally runs in legacy (compatibility) mode, is as simple as jumping to a 64-bit code chunk that does the save. The restore would be done by the switch code that re-enters 64-bit mode.
So you're saying that it's not impossible for 32-bit code to save/restore 64-bit state (if the 32-bit code is not 32-bit code)?
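For the record, this is the state in question, and the save routine's one hard requirement (a sketch; the struct is illustrative and save_regs64 is an assumed assembly stub reached via a far jump through a 64-bit CS descriptor):

/* Registers a 32-bit code segment simply cannot name: the upper
 * halves of all 16 GPRs, and r8-r15 (plus xmm8-xmm15) entirely. */
struct regs64 {
    unsigned long long rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp;
    unsigned long long r8,  r9,  r10, r11, r12, r13, r14, r15;
    /* xmm8-xmm15 and so on omitted */
};

/* Must be assembled into a 64-bit code segment; a 32-bit kernel can
 * only reach it with a far jump/call through a 64-bit CS. */
extern void save_regs64(struct regs64 *out);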
rdos wrote:Brendan wrote:If you're making modifications to the memory management specifically to support 64-bit applications, and also making modifications to the scheduler's state saving/loading to support 64-bit applications; do you have a sane reason to bother with the 32-bit kernel at all, and would it be much better in the end to write a 64-bit kernel for 64-bit applications (and then provide a "compatibility layer" in that 64-bit kernel for your crusty old 32-bit processes)?
Several sane reasons:
1. I don't want to start from scratch
2. I don't want a flat kernel
3. By the time the kernel is finished, x86-64 mode would be obsolete.
1. Given the state of your existing OS, "I don't want to start from scratch" is not a sane reason.
2. Anything that sounds like "I want to continue to use the pointlessly stupid and slow and inferior segmented model that every sane person in the world abandoned many decades ago" is not a sane reason.
3. Writing a nice clean "64-bit only" kernel (with no support for 32-bit tasks) is likely to take about the same time as hacking 64-bit support on top of the existing kernel and fixing all the teething problems. The extra time that you're worried about is the time needed to add support for legacy 32-bit tasks to the new clean "64-bit only" kernel. For most sane OSs this extra time would be minimal, because there's no segmentation involved and the 32-bit API is almost the same as the 64-bit API anyway (just different register usage and address sizes). The point is, it's your segmented model that causes the extra time you're worrying about.
Now, I thought that your 32-bit kernel was able to support "flat 32-bit" tasks. If this is the case then you have a clear upgrade path - the 32-bit kernel would continue to support segmented and flat 32-bit tasks; and the new 64-bit kernel would support flat 64-bit tasks and flat 32-bit tasks; and any software that needs to run under both kernels would need to become flat 32-bit code.
What I think you're doing is allowing sunk costs (the time you've spent on the "segmented model" in the past) to influence your decisions, and this is causing you to make bad decisions for the future.
Cheers,
Brendan