Page table setup in protected mode in virtualized guest VM

cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

Hi, I'd like to dig into some details of setting up page tables on an x86_64 processor in a virtualized environment (VT-x) when the host doesn't support EPT or EPT is disabled at host level (i.e. shadow page tables are used).

As far as I can tell (see for instance https://intermezzos.github.io/book/firs ... aging.html), in modern OSes like Linux it is the bootloader (e.g. GRUB) that loads the kernel, leaving the processor in protected mode with paging disabled.

While in protected mode, initial page tables are created to map a few MiB of the guest's virtual memory. In a virtualized system it is the guest OS kernel that does this, while the logical processor (i.e. physical core or hardware thread) runs in VMX non-root mode under the control of the active VMCS. From the VMM/hypervisor's viewpoint, these are just "plain writes" that the guest code performs on its guest physical address (GPA) locations (note that with paging disabled, guest linear addresses = guest physical addresses). In other words, in this first stage, with paging disabled at guest level, the VMM can't know that what the guest is actually doing is writing its page table entries. Only later, when the guest loads the CR3 register and enables paging by setting CR0.PG=1, do these instructions, executed in VMX non-root mode, actually trigger VM exits into the VMM.
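To make the "plain writes" concrete, here is a minimal sketch of what the guest might do at this stage. It assumes classic 32-bit non-PAE paging (two levels, 1024 four-byte entries each) and identity-maps the first 4 MiB; the array names, the `build_identity_map` helper and the software `walk` function are hypothetical, and the walk ignores the address stored in the PDE because only one page table exists here.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT  0x1u    /* bit 0: P   */
#define PTE_WRITABLE 0x2u    /* bit 1: R/W */
#define ENTRIES      1024u
#define PAGE_SIZE    4096u

/* Hypothetical early-boot tables; a real kernel would place these as
 * page-aligned objects via its linker script. */
static uint32_t page_directory[ENTRIES];
static uint32_t first_page_table[ENTRIES];

/* Identity-map the first 4 MiB. With paging off these are ordinary
 * stores to guest physical addresses; nothing marks them as special. */
static void build_identity_map(uint32_t page_table_phys)
{
    assert((page_table_phys & (PAGE_SIZE - 1)) == 0); /* must be 4 KiB aligned */
    for (uint32_t i = 0; i < ENTRIES; i++)
        first_page_table[i] = (i * PAGE_SIZE) | PTE_PRESENT | PTE_WRITABLE;
    page_directory[0] = page_table_phys | PTE_PRESENT | PTE_WRITABLE;
}

/* Software walk of the two-level structure (simplified: only one page
 * table exists, so the PDE's address field is not dereferenced). */
static uint32_t walk(uint32_t linear)
{
    uint32_t pde = page_directory[linear >> 22];
    if (!(pde & PTE_PRESENT))
        return UINT32_MAX;                  /* would be a page fault */
    uint32_t pte = first_page_table[(linear >> 12) & 0x3FFu];
    if (!(pte & PTE_PRESENT))
        return UINT32_MAX;
    return (pte & ~0xFFFu) | (linear & 0xFFFu);
}
```

Only after these stores does the guest load CR3 with the physical address of `page_directory` and set CR0.PG, which is the first point where a VMM can tell that the memory holds paging structures.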

At this stage, the VMM recognizes that those guest writes were intended to set up the guest's page table entries and marks the relevant host 4 KiB pages as read-only (basically the VMM clears the R/W bit in the relevant PTEs within the shadow page tables it builds).
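As a rough illustration of the write-protection step, the sketch below clears the R/W bit (bit 1) of a shadow PTE so that later guest stores to that frame fault into the VMM. The helper names are hypothetical, and it assumes the VMM keeps the real CR0.WP set so that supervisor-mode writes to read-only pages fault too.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P  (1ull << 0)   /* present  */
#define PTE_RW (1ull << 1)   /* writable */

/* Write-protect the shadow mapping of a frame that holds a guest page
 * table. The frame stays present and readable; only writes are caught. */
static uint64_t shadow_write_protect(uint64_t shadow_pte)
{
    assert(shadow_pte & PTE_P);      /* only meaningful for present entries */
    return shadow_pte & ~PTE_RW;
}

/* Would a store through this shadow PTE fault (assuming CR0.WP=1)? */
static int write_would_fault(uint64_t shadow_pte)
{
    return !(shadow_pte & PTE_P) || !(shadow_pte & PTE_RW);
}
```

On the resulting page fault the VMM can emulate the store, update its shadow entries accordingly, and resume the guest.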

By the way, the guest's page tables as written by the guest are held only in memory that the VMM manages on the guest's behalf (they are never walked by the processor's hardware MMU). The shadow page tables that the VMM builds, on the other hand, do not store the guest's page table entries verbatim.
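One way to picture the difference between a guest PTE and a shadow PTE: the guest entry holds a guest frame number, while the shadow entry holds the host frame that backs it. The sketch below assumes a hypothetical flat `gfn_to_hfn` table kept by the VMM and copies only the P and R/W flags; a real VMM would handle many more bits.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P     0x1ull
#define PTE_RW    0x2ull
#define ADDR_MASK 0x000FFFFFFFFFF000ull   /* bits 51:12 */
#define NFRAMES   16

/* Hypothetical VMM bookkeeping: guest frame number -> host frame number. */
static uint64_t gfn_to_hfn[NFRAMES];

/* Derive the shadow PTE (GVA -> host frame) from a guest PTE (GVA -> GPA). */
static uint64_t make_shadow_pte(uint64_t guest_pte)
{
    if (!(guest_pte & PTE_P))
        return 0;                          /* not present in the shadow either */
    uint64_t gfn = (guest_pte & ADDR_MASK) >> 12;
    assert(gfn < NFRAMES);                 /* sketch only covers a tiny guest */
    uint64_t flags = guest_pte & (PTE_P | PTE_RW);
    return (gfn_to_hfn[gfn] << 12) | flags;
}
```

The hardware MMU only ever sees the output of `make_shadow_pte`, never the guest entry itself.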

Does the above make sense ?
Last edited by cianfa72 on Thu May 22, 2025 5:55 am, edited 4 times in total.
Octocontrabass
Member
Member
Posts: 5805
Joined: Mon Mar 25, 2013 7:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by Octocontrabass »

cianfa72 wrote: Mon May 19, 2025 6:27 am in modern OSes like Linux it is the bootloader (e.g. GRUB) that loads the kernel putting the processor in protected mode with paging disabled.
Newer bootloaders like Limine will enable paging on the kernel's behalf. This simplifies linking the kernel, since it eliminates the need for separate startup code at a separate address. It also avoids the UEFI-specific problem where no single physical address is guaranteed to be available for startup code on every PC, so either your OS fails to boot on some PCs or the kernel startup code needs to be position-independent.
cianfa72 wrote: Mon May 19, 2025 6:27 am At this stage, the VMM recognizes that those guest writes were intended to set up the guest's page table entries and marks the relevant host 4 KiB pages as read-only (basically the VMM clears the R/W bit in the relevant PTEs within the shadow page tables it builds).
It's also possible for the VMM to trap TLB flushes instead of guest page table writes. I don't know if this is a good idea (maybe if the guest supports PCID?) but it's an option.
cianfa72 wrote: Mon May 19, 2025 6:27 am Does the above make sense ?
That sounds right to me.
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

Octocontrabass wrote: Mon May 19, 2025 10:54 am Newer bootloaders like Limine will enable paging on the kernel's behalf. This simplifies linking the kernel, since it eliminates the need for separate startup code at a separate address.
Do you mean the kernel actually hands off the creation of a minimal set of page tables to map a few MiB of virtual memory, including the enabling of paging (CR0.PG=1), to such a bootloader (e.g. Limine) ?
Octocontrabass
Member
Member
Posts: 5805
Joined: Mon Mar 25, 2013 7:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by Octocontrabass »

cianfa72 wrote: Tue May 20, 2025 2:20 am Do you mean the kernel actually hands off the creation of a minimal set of page tables to map a few MiB of virtual memory, including the enabling of paging (CR0.PG=1), to such a bootloader (e.g. Limine) ?
Yes. (Limine usually maps more than just a few MiB, but otherwise it's minimal.)
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

Quoting from this link
The biggest difference between VT-x and AMD-V is that AMD-V provides a more complete virtualization environment. VT-x requires the VMX non-root code to run with paging enabled, which precludes hardware virtualization of real-mode code and non-paged protected-mode software. This typically only includes firmware and OS loaders, but nevertheless complicates VT-x hypervisor implementation. AMD-V does not have this restriction.
since VT-x in VMX non-root mode can't run with paging disabled (i.e. in protected mode w/o paging enabled), how on earth does the guest OS set up the initial page tables to map the few MiB of virtual memory ?
bellezzasolo
Member
Member
Posts: 118
Joined: Sun Feb 20, 2011 2:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by bellezzasolo »

cianfa72 wrote: Wed May 21, 2025 5:56 am Quoting from this link
The biggest difference between VT-x and AMD-V is that AMD-V provides a more complete virtualization environment. VT-x requires the VMX non-root code to run with paging enabled, which precludes hardware virtualization of real-mode code and non-paged protected-mode software. This typically only includes firmware and OS loaders, but nevertheless complicates VT-x hypervisor implementation. AMD-V does not have this restriction.
since VT-x in VMX non-root mode can't run with paging disabled (i.e. in protected mode w/o paging enabled), how on earth does the guest OS set up the initial page tables to map the few MiB of virtual memory ?
The hypervisor can trap that state and use software emulation. The guest OS will prepare page tables, the hypervisor will do its own handling in lieu of mov cr3, eax.
Whoever said you can't do OS development on Windows?
https://github.com/ChaiSoft/ChaiOS
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

bellezzasolo wrote: Wed May 21, 2025 7:09 am
cianfa72 wrote: Wed May 21, 2025 5:56 am since VT-x in VMX non-root mode can't run with paging disabled (i.e. in protected mode w/o paging enabled), how on earth does the guest OS set up the initial page tables to map the few MiB of virtual memory ?
The hypervisor can trap that state and use software emulation. The guest OS will prepare page tables, the hypervisor will do its own handling in lieu of mov cr3, eax.
Yes however, as far as I understood, the guest OS code can create/build/write its initial page tables' entries only when running with paging disabled, i.e. guest linear/virtual addresses = guest physical addresses (otherwise let me say we would run into a chicken & egg problem :shock:)

Isn't that true ?
Octocontrabass
Member
Member
Posts: 5805
Joined: Mon Mar 25, 2013 7:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by Octocontrabass »

cianfa72 wrote: Wed May 21, 2025 5:56 am Quoting from this link
That link is outdated. In 2010, Intel added a feature to VMX called Unrestricted Guest. Unrestricted Guest allows VMX non-root mode to run with paging disabled.
cianfa72 wrote: Wed May 21, 2025 5:56 am since VT-x in VMX non-root mode can't run with paging disabled (i.e. in protected mode w/o paging enabled), how on earth does the guest OS set up the initial page tables to map the few MiB of virtual memory ?
If the hypervisor can't use Unrestricted Guest, it can install its own page tables and hide them from the guest so the guest thinks paging is disabled. Or, as suggested by bellezzasolo, the hypervisor can use software emulation instead of VMX non-root mode when the guest wants to run with paging disabled.
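A hedged sketch of how a hypervisor might choose between these options. The bit positions are those of the secondary processor-based VM-execution controls ("enable EPT" is bit 1, "unrestricted guest" is bit 7), and as far as I know the SDM requires "enable EPT" to be 1 whenever "unrestricted guest" is 1, so with EPT off only the hidden-page-table or emulation routes remain. The `allowed1` parameter stands for the allowed-1 half of IA32_VMX_PROCBASED_CTLS2; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECONDARY_ENABLE_EPT         (1u << 1)
#define SECONDARY_UNRESTRICTED_GUEST (1u << 7)

enum unpaged_strategy {
    RUN_UNRESTRICTED,        /* hardware runs the unpaged guest directly      */
    HIDDEN_IDENTITY_TABLES   /* VMM page tables make paging look disabled     */
};

/* Pick how to run a guest that believes paging is off. Full software
 * emulation is a further option that this sketch does not model. */
static enum unpaged_strategy pick_strategy(uint32_t allowed1, bool ept_enabled)
{
    /* Caller must not claim EPT is on if the CPU cannot enable it. */
    assert(!ept_enabled || (allowed1 & SECONDARY_ENABLE_EPT));

    bool ug_supported = (allowed1 & SECONDARY_UNRESTRICTED_GUEST) != 0;
    if (ug_supported && ept_enabled)
        return RUN_UNRESTRICTED;
    return HIDDEN_IDENTITY_TABLES;
}
```

In the scenario from the first post (EPT unavailable or disabled), this always falls through to the second branch, even on a CPU that supports Unrestricted Guest.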
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

Octocontrabass wrote: Wed May 21, 2025 9:47 pm That link is outdated. In 2010, Intel added a feature to VMX called Unrestricted Guest. Unrestricted Guest allows VMX non-root mode to run with paging disabled.
Ok, got it.
Octocontrabass wrote: Wed May 21, 2025 9:47 pm If the hypervisor can't use Unrestricted Guest, it can install its own page tables and hide them from the guest so the guest thinks paging is disabled. Or, as suggested by bellezzasolo, the hypervisor can use software emulation instead of VMX non-root mode when the guest wants to run with paging disabled.
Not sure I fully grasp it (sorry, I'm not that skilled.. :roll: .). In the software emulation case, the hypervisor/VMM (e.g. qemu/kvm) implements, let me say, a sort of "virtual processor" that interprets every guest instruction (i.e. no guest instruction is directly executed by the processor in VMX non-root mode), doesn't it ?

P.s. I'm aware there are some enhancements available for emulation like static and dynamic binary translation techniques.
bellezzasolo
Member
Member
Posts: 118
Joined: Sun Feb 20, 2011 2:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by bellezzasolo »

cianfa72 wrote: Wed May 21, 2025 11:54 pm
Octocontrabass wrote: Wed May 21, 2025 9:47 pm That link is outdated. In 2010, Intel added a feature to VMX called Unrestricted Guest. Unrestricted Guest allows VMX non-root mode to run with paging disabled.
Ok, got it.
Octocontrabass wrote: Wed May 21, 2025 9:47 pm If the hypervisor can't use Unrestricted Guest, it can install its own page tables and hide them from the guest so the guest thinks paging is disabled. Or, as suggested by bellezzasolo, the hypervisor can use software emulation instead of VMX non-root mode when the guest wants to run with paging disabled.
Not sure I fully grasp it (sorry, I'm not that skilled.. :roll: .). In the software emulation case, the hypervisor/VMM (e.g. qemu/kvm) implements, let me say, a sort of "virtual processor" that interprets every guest instruction (i.e. no guest instruction is directly executed by the processor in VMX non-root mode), doesn't it ?

P.s. I'm aware there are some enhancements available for emulation like static and dynamic binary translation techniques.
That's not actually necessary for protected mode.

Assume a hypervisor running at ring 0 (for simplicity), and a guest OS at ring 3. The hypervisor can allocate a paged virtual address space at 0-MEM, which is the guest's "physical" address space. The guest OS will run just fine in this virtual space, thinking it is physical.

Now, the guest OS tries to access CR0. This is a privileged instruction, so triggers a GPF, trapping into the hypervisor. The hypervisor can return the virtual CR0 in the specified register (PE=1).

Now say the guest OS writes CR3 and CR0 to enable paging. This traps again, and the hypervisor adjusts the real paging to reflect this.
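The trap-and-emulate steps above can be sketched roughly as follows. This is an illustration only: `struct vcpu` and the `emulate_*` handlers are hypothetical names, the fault decoding that would call them is left out, and the invalid PG=1/PE=0 combination is caught with an assert where a real hypervisor would inject #GP into the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE (1u << 0)
#define CR0_PG (1u << 31)

/* Hypothetical per-guest state: what the guest believes its control
 * registers contain, independent of the real hardware registers. */
struct vcpu {
    uint32_t virt_cr0;
    uint32_t virt_cr3;
    bool     shadow_active;  /* real paging now follows the guest's tables */
};

static void vcpu_reset(struct vcpu *v)
{
    v->virt_cr0 = CR0_PE;    /* guest sees protected mode, paging off */
    v->virt_cr3 = 0;
    v->shadow_active = false;
}

/* Called from the #GP handler when the guest executes mov reg, cr0. */
static uint32_t emulate_read_cr0(const struct vcpu *v)
{
    return v->virt_cr0;
}

/* Called when the guest executes mov cr3, reg. */
static void emulate_write_cr3(struct vcpu *v, uint32_t value)
{
    v->virt_cr3 = value;
}

/* Called when the guest executes mov cr0, reg. */
static void emulate_write_cr0(struct vcpu *v, uint32_t value)
{
    /* PG=1 with PE=0 is architecturally invalid; a real hypervisor
     * would inject #GP here, the sketch just asserts. */
    assert(!(value & CR0_PG) || (value & CR0_PE));
    v->virt_cr0 = value;
    v->shadow_active = (value & CR0_PG) != 0;
}
```

The real CR0 keeps PG=1 the whole time; only the virtual copy changes, and `shadow_active` marks the point where the hypervisor has to start honouring the guest's own page tables.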
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

bellezzasolo wrote: Thu May 22, 2025 9:01 am Assume a hypervisor running at ring 0 (for simplicity), and a guest OS at ring 3. The hypervisor can allocate a paged virtual address space at 0-MEM, which is the guest's "physical" address space. The guest OS will run just fine in this virtual space, thinking it is physical.

Now, the guest OS tries to access CR0. This is a privileged instruction, so triggers a GPF, trapping into the hypervisor. The hypervisor can return the virtual CR0 in the specified register (PE=1).

Now say the guest OS writes CR3 and CR0 to enable paging. This traps again, and the hypervisor adjusts the real paging to reflect this.
Ah ok, cool. I'll try to recap my understanding.

Consider a guest OS (e.g. Linux) designed to run in protected mode with paging disabled in the early stage of its boot process. Suppose we run it deprivileged (ring 3) on a host without hardware-assisted virtualization (no VMX non-root mode, and hence no EPT).

The hypervisor running at ring 0 allocates the virtual address range 0-MEM within a process acting as guest "physical" address space (MEM is the size of VM "physical" address memory). When the processor executes the guest OS code (ring 3) it actually/really runs in protected mode with paging enabled. CR3 register actually points to the machine address of the 1st paging structure used by host to map the process's virtual address space (remember EPT isn't available).

Guest OS thinks it is running in protected mode with paging disabled accessing what it thinks to be physical addresses in the range 0-MEM. Then it builds the initial (guest) page tables to map the few MiB of its virtual memory.

Eventually it enables paging by loading CR3 (mov cr3, eax) and writing the CR0 register (CR0.PG=1). This traps into the hypervisor which, since EPT isn't available, builds a set of "shadow page tables" mapping guest virtual addresses to machine addresses and loads the real CR3 with the machine address of the shadow root (not of the guest's own first paging structure, whose entries hold guest physical addresses).

As you pointed out, when guest OS code tries to read CR3 (e.g. mov eax, cr3), this results in a trap into the hypervisor, which returns the "virtual CR3" in the specified register (EAX) with the value the guest OS expects (the guest physical address it loaded earlier).
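To spell out the two CR3 values involved, here is a small sketch assuming a hypothetical `struct vcpu_paging` that records both the value the guest wrote and the machine address of the shadow root the hypervisor built for it. How the shadow root is allocated and filled is outside the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one virtual CPU. */
struct vcpu_paging {
    uint64_t guest_cr3;       /* what the guest wrote: a guest physical address */
    uint64_t shadow_root_ma;  /* machine address of the shadow root table       */
};

/* Guest executed mov cr3, reg: remember its value, switch shadow root. */
static void on_guest_write_cr3(struct vcpu_paging *p, uint64_t guest_value,
                               uint64_t shadow_root_ma)
{
    assert((shadow_root_ma & 0xFFFull) == 0);  /* root must be 4 KiB aligned */
    p->guest_cr3 = guest_value;
    p->shadow_root_ma = shadow_root_ma;
}

/* Guest executed mov reg, cr3: hand back exactly what it wrote. */
static uint64_t on_guest_read_cr3(const struct vcpu_paging *p)
{
    return p->guest_cr3;
}

/* Value the hypervisor loads into the real CR3 before resuming the guest. */
static uint64_t real_cr3_for_guest(const struct vcpu_paging *p)
{
    return p->shadow_root_ma;
}
```

The guest never observes `shadow_root_ma`, and the hardware never walks the tables at `guest_cr3` directly.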

Does it make sense ? Thank you.
bellezzasolo
Member
Member
Posts: 118
Joined: Sun Feb 20, 2011 2:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by bellezzasolo »

cianfa72 wrote: Thu May 22, 2025 1:50 pm
bellezzasolo wrote: Thu May 22, 2025 9:01 am Assume a hypervisor running at ring 0 (for simplicity), and a guest OS at ring 3. The hypervisor can allocate a paged virtual address space at 0-MEM, which is the guest's "physical" address space. The guest OS will run just fine in this virtual space, thinking it is physical.

Now, the guest OS tries to access CR0. This is a privileged instruction, so triggers a GPF, trapping into the hypervisor. The hypervisor can return the virtual CR0 in the specified register (PE=1).

Now say the guest OS writes CR3 and CR0 to enable paging. This traps again, and the hypervisor adjusts the real paging to reflect this.
Ah ok, cool. I'll try to recap my understanding.

Consider a guest OS (e.g. Linux) designed to run in protected mode with paging disabled in the early stage of its boot process. Suppose we run it deprivileged (ring 3) on a host without hardware-assisted virtualization (no VMX non-root mode, and hence no EPT).

The hypervisor running at ring 0 allocates the virtual address range 0-MEM within a process acting as guest "physical" address space (MEM is the size of VM "physical" address memory). When the processor executes the guest OS code (ring 3) it actually/really runs in protected mode with paging enabled. CR3 register actually points to the machine address of the 1st paging structure used by host to map the process's virtual address space (remember EPT isn't available).

Guest OS thinks it is running in protected mode with paging disabled accessing what it thinks to be physical addresses in the range 0-MEM. Then it builds the initial (guest) page tables to map the few MiB of its virtual memory.

Eventually it enables paging by loading CR3 (mov cr3, eax) and writing the CR0 register (CR0.PG=1). This traps into the hypervisor which, since EPT isn't available, builds a set of "shadow page tables" mapping guest virtual addresses to machine addresses and loads the real CR3 with the machine address of the shadow root (not of the guest's own first paging structure, whose entries hold guest physical addresses).

As you pointed out, when guest OS code tries to read CR3 (e.g. mov eax, cr3), this results in a trap into the hypervisor, which returns the "virtual CR3" in the specified register (EAX) with the value the guest OS expects (the guest physical address it loaded earlier).

Does it make sense ? Thank you.
Yep, that sounds like how I envisage it working!

The reason for the unrestricted guest stuff is that you can't run in real mode with paging enabled. Unrestricted guest allows for this case.

I don't think it's the easiest thing to emulate real mode by using 16 bit protected mode and updating the GDT on the fly. You could use V8086 in a 32 bit OS, but not 64 bit ofc.
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

bellezzasolo wrote: Thu May 22, 2025 2:28 pm The reason for the unrestricted guest stuff is that you can't run in real mode with paging enabled. Unrestricted guest allows for this case.
Ok, from Intel 64 SDM vol 3c section 27.6: the Unrestricted guest feature allows VMX non-root operation in both real mode and protected mode with paging disabled (unpaged).

I'm not sure whether a modern Linux kernel, in its early boot stages, actually runs in real mode or in protected mode with paging disabled.
Octocontrabass
Member
Member
Posts: 5805
Joined: Mon Mar 25, 2013 7:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by Octocontrabass »

bellezzasolo wrote: Thu May 22, 2025 9:01 am Now, the guest OS tries to access CR0. This is a privileged instruction, so triggers a GPF, trapping into the hypervisor.
Running a guest OS in ring 3 may not be as easy as you think. There are a handful of instructions a hypervisor might need to intercept that aren't privileged, so you need to come up with some other way to trap into the hypervisor when the guest tries to execute them.
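For illustration, the sketch below lists some instructions commonly cited as problematic on classic x86: they read or affect privileged state but do not fault at CPL 3. The list is not exhaustive, it assumes CR4.UMIP is clear (where UMIP is present and enabled, SGDT, SIDT, SLDT, SMSW and STR do fault in user mode), and the table and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Instructions that expose or silently mishandle privileged state at
 * CPL 3 instead of raising #GP (classic x86, CR4.UMIP clear). */
static const char *const silent_at_ring3[] = {
    "sgdt", "sidt", "sldt", "smsw", "str",  /* leak real system registers   */
    "pushf", "popf",                        /* IF handling differs silently */
};

/* True if plain trap-and-emulate will NOT catch this instruction, so the
 * hypervisor needs another mechanism (e.g. scanning or patching code). */
static bool needs_extra_interception(const char *mnemonic)
{
    assert(mnemonic != NULL);
    size_t n = sizeof silent_at_ring3 / sizeof silent_at_ring3[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(mnemonic, silent_at_ring3[i]) == 0)
            return true;
    return false;
}
```

An instruction such as hlt is privileged and faults on its own at ring 3, so it returns false here; smsw, by contrast, would happily return the real machine status word to the guest.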
cianfa72
Member
Member
Posts: 95
Joined: Sat Dec 22, 2012 12:01 pm

Re: Page table setup in protected mode in virtualized guest VM

Post by cianfa72 »

My main takeaway from this thread about the "trap and emulate" technique is that the guest OS is being fooled by the hypervisor. For instance, in the early stage of its boot, the guest OS thinks it is running in protected mode with paging disabled (in order to set up the page tables needed to map the first few MiB of virtual memory). Indeed, when it tries to check this (e.g. by mov eax, cr0), the hypervisor kicks in and returns in the relevant register (EAX) exactly what the guest OS expects.

Very cool, thanks all !