OSDev.org

Posted: **Mon Mar 14, 2016 6:24 am**

Hi,

I’m interested in better understanding how a nested VMM is implemented. Let’s consider a two-level nested scenario based on Intel VT-x: root VMM (L0), guest VMM (L1) and guest VM (L2). See for instance "Nested virtualization", starting from slide 18.

As far as I understand, it is the root VMM's job (regardless of the VMCS shadowing feature, even if available) to create in memory a VMCS instance for the guest VMM (VMCS0-1), a VMCS instance for the guest VM on behalf of the guest VMM (VMCS1-2), and a merged VMCS structure used to run the guest VM directly from the root VMM (VMCS0-2).

All these VMCS structures are referenced by physical address (i.e. VMPTRLD takes a memory operand that holds a physical address), but the first question is:
Is VMCS1-2 stored in guest VMM physical memory? In other words, is the VMCS1-2 that the root VMM creates on behalf of the guest VMM actually mapped into guest VMM physical memory, at the address given by the VMPTRLD operand?

By the way, I have a basic doubt: when VMPTRLD is executed, AFAIU, the address stored in the operand is loaded internally into the processor. But how is that address interpreted? Is it a real "host" physical memory address (no memory translation involved), or could it be a guest VMM physical address that gets translated, for instance through the EPT page tables?

Thanks

Posted: **Mon Mar 14, 2016 8:11 am**

For the guest everything must look as if it were running directly on real hardware. Almost everything else, including the answers to your questions, follows directly from that. So what is a physical address on real hardware becomes a guest physical address in a virtualised environment. Host addresses are never ever visible to the guest.

Posted: **Mon Mar 14, 2016 8:29 am**

Kevin wrote:For the guest everything must look as if it were running directly on real hardware. Almost everything else, including the answers to your questions, follows directly from that. So what is a physical address on real hardware becomes a guest physical address in a virtualised environment. Host addresses are never ever visible to the guest.

Sure, so the VMPTRLD operand referenced by guest VMM code is actually a “guest VMM physical address”.

Now, if I understand correctly, when the root VMM (L0) directly launches/resumes the guest VM (L2), it loads the (host) physical address of VMCS0-2 into the processor using the VMPTRLD <mem operand> VMX instruction. VMCS0-2 is the result of merging VMCS1-2 and VMCS0-1 and, by the way, it is not mapped anywhere in the guest VMM physical address space.

Does that make sense?

Posted: **Mon Mar 14, 2016 9:07 am**

cianfa72 wrote:Sure, so the VMPTRLD operand referenced by guest VMM code is actually a “guest VMM physical address”.

"(L1) guest physical address", yes.

cianfa72 wrote:Now, if I understand correctly, when the root VMM (L0) directly launches/resumes the guest VM (L2), it loads the (host) physical address of VMCS0-2 into the processor using the VMPTRLD <mem operand> VMX instruction. VMCS0-2 is the result of merging VMCS1-2 and VMCS0-1 and, by the way, it is not mapped anywhere in the guest VMM physical address space.

Yes. This whole process is an implementation detail of the L0 hypervisor and invisible to both L1 and L2 guests. So in theory the L0 hypervisor could implement the emulation of VMX instructions completely differently if it wanted to be silly. For example emulate everything in software, just because. Or when it sees that the L2 guest is in Real Mode, it could run it in VM86 instead (this could actually make some sense because in the first processors with VMX, it sucked so hard that it couldn't virtualise RM guests and you had to resort to things like this).
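
To make the merge a bit more concrete, here is a minimal sketch in C. Everything in it is an assumption made for illustration: the `soft_vmcs` struct, its field names and the single `intercepts` bitmask are hypothetical stand-ins, not the real VMCS layout (which is opaque and much larger). It only shows the idea that VMCS0-2 takes its guest state from VMCS1-2, its host state from VMCS0-1, and intercepts whatever either L0 or L1 wants intercepted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified software view of a VMCS.
 * The real VMCS layout is implementation-specific and far larger. */
struct soft_vmcs {
    uint64_t guest_rip, guest_rsp, guest_cr3;  /* guest-state area */
    uint64_t host_rip, host_rsp, host_cr3;     /* host-state area  */
    uint32_t intercepts;                       /* made-up "exit on event" bits */
};

/* Build VMCS0-2 from VMCS0-1 (L0 running L1) and VMCS1-2 (how L1 wants L2 to run). */
static struct soft_vmcs merge_vmcs02(const struct soft_vmcs *vmcs01,
                                     const struct soft_vmcs *vmcs12)
{
    struct soft_vmcs vmcs02;

    assert(vmcs01 != NULL && vmcs12 != NULL);

    /* L2's register state is whatever L1 asked for. */
    vmcs02.guest_rip = vmcs12->guest_rip;
    vmcs02.guest_rsp = vmcs12->guest_rsp;
    vmcs02.guest_cr3 = vmcs12->guest_cr3;

    /* VM exits from L2 must land in L0, never directly in L1,
     * so the host state comes from L0's own VMCS0-1. */
    vmcs02.host_rip = vmcs01->host_rip;
    vmcs02.host_rsp = vmcs01->host_rsp;
    vmcs02.host_cr3 = vmcs01->host_cr3;

    /* Exit if either L0 or L1 wants the event intercepted. */
    vmcs02.intercepts = vmcs01->intercepts | vmcs12->intercepts;

    return vmcs02;
}
```

Real control fields are not all plain "exit on X" bits, and addresses stored in VMCS1-2 are L1 guest physical addresses that L0 has to translate, so an actual merge needs per-field logic.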

Posted: **Mon Mar 14, 2016 10:45 am**

This whole process is an implementation detail of the L0 hypervisor and invisible to both L1 and L2 guests. So in theory the L0 hypervisor could implement the emulation of VMX instructions completely differently if it wanted to be silly. For example emulate everything in software, just because

Not sure I get the point... Just to be clear, I'm assuming that the L0 hypervisor's accesses (made in order to emulate the L1 guest VMM's VMREAD/VMWRITE to the VMCS1-2 region) go to the same (host) physical memory region that backs VMCS1-2 in the L1 physical address space.

So I guess the L0 hypervisor can emulate the guest VMM's (L1) VMX instructions (e.g. VMREAD/VMWRITE) just by reading/writing fields of the VMCS1-2 region (the L0 hypervisor can access VMCS1-2 fields through the L0 virtual addresses at which they are mapped by the L0 page tables).

Posted: **Mon Mar 14, 2016 3:53 pm**

cianfa72 wrote:So I guess the L0 hypervisor can emulate the guest VMM's (L1) VMX instructions (e.g. VMREAD/VMWRITE) just by reading/writing fields of the VMCS1-2 region (the L0 hypervisor can access VMCS1-2 fields through the L0 virtual addresses at which they are mapped by the L0 page tables).

Logically yes, but the VMCS1-2 region does not have to exist as an actual/legitimate VMCS region, because it will never actually be loaded/launched on the hardware by the guest. Its only purpose is to let L0 keep track of how the L1 guest wants to run the L2 guest.
As a result, this VMCS1-2 region does not need to live at the physical address used by the L1 hypervisor; it can be implemented in the L0 hypervisor as a normal memory area used for emulating VMREAD and VMWRITE. If the L1 hypervisor uses plain memory reads/writes to access the VMCS region, then that is strictly its own fault, because the Intel manual explicitly says not to.

Posted: **Mon Mar 14, 2016 4:35 pm**

alexg wrote:
So I guess L0 hypervisor emulation of guest VMM (L1) VMX instructions (e.g. VMREAD/VMWRITE) can be done just by reading/writing fields of VMCS1-2 region (L0 hypervisor can access VMCS1-2 fields referencing L0 virtual addresses in which they are mapped by L0 page tables).
Logically yes, but the VMCS1-2 region does not have to exist as an actual/legitimate VMCS region, because it will never actually be loaded/launched on the hardware by the guest. Its only purpose is to let L0 keep track of how the L1 guest wants to run the L2 guest.
As a result, this VMCS1-2 region does not need to live at the physical address used by the L1 hypervisor; it can be implemented in the L0 hypervisor as a normal memory area used for emulating VMREAD and VMWRITE. If the L1 hypervisor uses plain memory reads/writes to access the VMCS region, then that is strictly its own fault, because the Intel manual explicitly says not to.

This is not really a problem.
The L0 hypervisor can simply deny direct reads from and writes to the VMCS allocated by L1, and then either handle them as if they were proper vmread / vmwrite instructions or crash the L1 because it is not playing fair. It depends on how restrictive you want to be, but the general rule when it comes to emulating instructions is that you want to be as restrictive as the real CPU. One can even imagine an L1 hypervisor that deliberately does naughty things in order to exploit the emulation mechanism in L0.

Posted: **Tue Mar 15, 2016 3:50 am**

cianfa72 wrote:
This whole process is an implementation detail of the L0 hypervisor and invisible to both L1 and L2 guests. So in theory the L0 hypervisor could implement the emulation of VMX instructions completely differently if it wanted to be silly. For example emulate everything in software, just because
Not sure I get the point...

My point is mostly that there's nothing special about VMX. It's just another few instructions to be emulated, and you do it the same way as you emulate other instructions. The specification of the instructions, i.e. their behaviour, is what matters. How you implement that behaviour is completely up to you.

The only thing that does make VMX a bit special is that the operations performed by these instructions are rather complex and therefore not trivial to implement. But that's not a fundamental difference.

cianfa72 wrote:Just to be clear, I'm assuming that the L0 hypervisor's accesses (made in order to emulate the L1 guest VMM's VMREAD/VMWRITE to the VMCS1-2 region) go to the same (host) physical memory region that backs VMCS1-2 in the L1 physical address space.

You can and probably should do that. You just need to be sure that you don't trust the contents of the VMCS 1-2 and check its validity each time before you use it (and be sure to avoid races with a malicious second guest CPU!)
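
One common way to avoid such races is to copy the guest's data into L0-private memory once and run every check, and every later use, on that copy only. The sketch below assumes a hypothetical L0-defined layout (`vmcs12_image`), a made-up revision value and a made-up validity rule for CR3; none of these come from the Intel manuals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout L0 chose for VMCS1-2 inside guest memory. */
struct vmcs12_image {
    uint32_t revision_id;
    uint32_t intercepts;
    uint64_t guest_cr3;
};

#define VMCS12_REVISION   0x12u                  /* made-up revision identifier */
#define CR3_RESERVED_MASK 0xFFF0000000000000ull  /* made-up validity rule */

/* Returns 0 and fills *priv on success, -1 if the snapshot is invalid. */
static int vmcs12_snapshot(const void *guest_mem, struct vmcs12_image *priv)
{
    assert(guest_mem != NULL && priv != NULL);

    /* Copy once into L0-private memory. Another guest CPU may keep
     * scribbling on the original, but it cannot touch the copy. */
    memcpy(priv, guest_mem, sizeof *priv);

    /* All checks read the private copy, never guest memory again. */
    if (priv->revision_id != VMCS12_REVISION)
        return -1;
    if (priv->guest_cr3 & CR3_RESERVED_MASK)
        return -1;
    return 0;
}
```

The caller then builds VMCS0-2 from `*priv` only, so a value that passed validation cannot change between the check and its use.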

Another option would be to keep the data outside guest memory and treat the VMCS pointer just as an ID, like alexg said. The problem there is that it allows the guest to allocate arbitrary amounts of memory in the host, which might result in a host DoS.
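
A simple way to bound that memory use is a fixed-size table keyed by the guest's VMCS pointer. The following is only a sketch under assumptions: `MAX_SHADOWS`, the struct names and the round-robin eviction are hypothetical choices, not something prescribed by VT-x.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SHADOWS  4      /* hypothetical per-guest cap */
#define SHADOW_BYTES 4096   /* a VMCS region is at most 4 KiB */

struct shadow_slot {
    uint64_t guest_pa;      /* L1 guest physical address, used purely as an ID */
    int      in_use;
    uint8_t  data[SHADOW_BYTES];
};

struct shadow_table {
    struct shadow_slot slot[MAX_SHADOWS];
    unsigned next_victim;   /* round-robin eviction cursor */
};

/* Return the L0-private storage for the VMCS identified by guest_pa. */
static struct shadow_slot *shadow_get(struct shadow_table *t, uint64_t guest_pa)
{
    struct shadow_slot *target = NULL;

    assert(t != NULL);

    for (unsigned i = 0; i < MAX_SHADOWS; i++) {
        if (t->slot[i].in_use && t->slot[i].guest_pa == guest_pa)
            return &t->slot[i];            /* already tracked */
        if (!t->slot[i].in_use && target == NULL)
            target = &t->slot[i];          /* remember first free slot */
    }

    if (target == NULL) {
        /* Table full: recycle a slot instead of allocating more host memory.
         * A real L0 would first write the evicted contents back to the guest. */
        target = &t->slot[t->next_victim];
        t->next_victim = (t->next_victim + 1) % MAX_SHADOWS;
    }

    memset(target->data, 0, sizeof target->data);
    target->guest_pa = guest_pa;
    target->in_use = 1;
    return target;
}
```

With a cap like this the guest can still thrash the table, but it can no longer make the host allocate without limit.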

Posted: **Tue Mar 15, 2016 8:16 am**

cianfa72

VMPTRLD always operates on machine memory (that is, memory as it is in RAM).
When the parent hypervisor (L0) executes VMPTRLD, its operand is a memory location, addressed through a virtual address, that holds the 64-bit machine address at which the VMCS is located.
e.g.
; erase memory holding VMCS, set revision ID, execute vmxon, vmclear
mov rax,guest_PA_of_VMCS    ; rax = 64-bit physical address of the VMCS region
mov [rsp+0x20],rax          ; VMPTRLD only accepts a memory operand, so store it
vmptrld [rsp+0x20]          ; make that VMCS the current one

When the child hypervisor (L1) executes VMPTRLD, it causes a VM exit from L1 to L0. The parent hypervisor's VM exit handler (L0, VM exit reason 21) must obtain the L1 virtual address of the vmptrld operand (that is guest_RSP + 0x20 + SS_base in the sample above) and the L1 guest_CR3 from the VMCS fields, then walk the L1 paging tables with both inputs to find where the operand lives in guest physical memory. Reading the operand gives L0 the guest PA of the VMCS (the value of guest RAX which was written into memory in the sample above).

What that guest PA means in machine terms depends on how L0 runs the guest:
If L0 runs the guest without EPT, then guest PA = machine PA and it points to the PA of the VMCS.
If L0 runs the guest with EPT and an identity map, then machine memory = guest PA and you again have the real memory of the VMCS.
If L0 runs the guest with EPT where machine memory differs from guest PA, then L0 has to walk the EPT paging tables using EPTP and the guest PA to obtain the real machine memory.

So now you again have the real machine memory of the VMCS as it is in RAM. L0 has to check whether the guest PA is valid and whether it is not the same as the guest PA of VMXON, then map the machine memory of the VMCS somewhere into the VA space of L0 and check whether the revision ID matches. If a check fails, L0 reflects the corresponding VMX instruction error number back to L1.
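
The final validation steps could look roughly like the sketch below. It assumes the operand has already been fetched (so `vmcs_gpa` is the guest PA of the VMCS) and models guest RAM as a flat array where offset = guest PA, a stand-in for the EPT walk. The `l1_state` struct and function name are hypothetical; the error numbers 9, 10 and 11 are the VM-instruction error numbers the Intel SDM lists for VMPTRLD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    VMX_OK                         = 0,
    VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
    VMXERR_VMPTRLD_VMXON_POINTER   = 10,
    VMXERR_VMPTRLD_BAD_REVISION    = 11
};

#define VMCS_REGION_SIZE 4096u

/* Hypothetical description of what L0 knows about the L1 guest. */
struct l1_state {
    uint8_t  *ram;          /* L0 mapping of guest RAM; offset == guest PA */
    uint64_t  ram_size;     /* bytes of guest physical memory */
    uint64_t  vmxon_gpa;    /* guest PA that L1 passed to VMXON */
    uint32_t  revision_id;  /* revision L0 advertises to L1 (bit 31 clear) */
};

/* On success stores L0's pointer to the VMCS1-2 region in *vmcs12. */
static int emulate_vmptrld(const struct l1_state *l1, uint64_t vmcs_gpa,
                           uint8_t **vmcs12)
{
    uint32_t rev;

    assert(l1 != NULL && vmcs12 != NULL);

    /* Must be 4 KiB aligned and lie completely inside guest memory. */
    if ((vmcs_gpa & (VMCS_REGION_SIZE - 1)) != 0 ||
        l1->ram_size < VMCS_REGION_SIZE ||
        vmcs_gpa > l1->ram_size - VMCS_REGION_SIZE)
        return VMXERR_VMPTRLD_INVALID_ADDRESS;

    /* Must not be the VMXON region. */
    if (vmcs_gpa == l1->vmxon_gpa)
        return VMXERR_VMPTRLD_VMXON_POINTER;

    /* First 32 bits of the region hold the revision identifier. Bit 31 (the
     * shadow-VMCS indicator) must be clear here, because this sketch does not
     * offer VMCS shadowing to L1. */
    memcpy(&rev, l1->ram + vmcs_gpa, sizeof rev);
    if (rev != l1->revision_id)
        return VMXERR_VMPTRLD_BAD_REVISION;

    *vmcs12 = l1->ram + vmcs_gpa;
    return VMX_OK;
}
```

A non-zero return value is what L0 would report to L1 as the VM-instruction error, together with the failure flags in RFLAGS.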

VMCS shadowing:
I suggest developing nesting without this feature as well, e.g. to support Ivy Bridge and older CPUs. On Haswell and newer, when you enable VMCS shadowing, VMREAD/VMWRITE executed by L1 cause no VM exits (they access the shadow VMCS instead). VMCS shadowing gives better performance, but not as much as you might expect. See the picture in the attachment: there is an array of pairs of 64-bit counters. The first one is incremented on a VMX-instruction VM exit from L1 to L0, the second one on successful completion of the VMX instruction (I wanted to count how many of them fail; I was afraid of some bug in L0's handling of VMREAD and VMWRITE, but later I found it was something else). The child was the VMware Workstation 12.0.1 hypervisor (L1) and the guest was Windows 10 x64 build 10240 (L2). The CPU was a Xeon E3-1230 V2 (Ivy Bridge).

L0 counted these numbers:
vmexit 19 vmclear 0x790063 times
vmexit 20 vmlaunch 0x325150 times
vmexit 21 vmptrld 0x3C8207 times
vmexit 22 vmptrst 0 times
vmexit 23 vmread 0x6812BA3 times (few of them failed to complete, the second counter is less than the first one, later I found it was caused by VMREAD to nonexisting VMCS field)
vmexit 24 vmresume 0x45F700 times
vmexit 25 vmwrite 0x2C627A6 times (few of them failed to complete, the second counter is less than the first one, later I found it was caused by VMWRITE to nonexisting VMCS field)
vmexit 26 vmxoff 0x3C8295 times
vmexit 27 vmxon 0x3C8295 times
vmexit 50 invept 0x3CA309 times
vmexit 53 invvpid 0x3D6859 times

On Haswell with VMCS shadowing enabled, the counters for vmread and vmwrite would both be 0, so the number of VM exits from L1 to L0 caused by VMX instructions would drop by a factor of about 5. Of course there are still VM exits from L1 to L0 for other reasons, like MSR access, cpuid etc.
If L0 consumed 5% of CPU time and L1+L2 95% without VMCS shadowing (Ivy Bridge), then on a CPU with VMCS shadowing, and with L0 enabling it, L0 would consume about 1% of CPU time and L1+L2 99%. That is measurable, but hardly visible to the human eye. Still, on servers running a lot of VMs even such a small improvement is desirable. If L0 is poorly designed and consumes 50% of CPU time, then of course the performance would be dramatically better with VMCS shadowing than without it (10% of CPU time for L0, 90% for L1+L2).
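
The "about 5 times" figure can be checked from the counters listed above. This small C sketch sums the first counter of each pair, with and without VMREAD (23) and VMWRITE (25); the struct and function names are made up for the example, and dropping those two counters entirely is the best case assumed in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct exit_count {
    unsigned    reason;  /* VM exit reason number */
    const char *name;
    uint64_t    count;   /* first counter of each pair from the post */
};

static const struct exit_count vmx_exits[] = {
    {19, "vmclear",  0x790063},
    {20, "vmlaunch", 0x325150},
    {21, "vmptrld",  0x3C8207},
    {22, "vmptrst",  0},
    {23, "vmread",   0x6812BA3},
    {24, "vmresume", 0x45F700},
    {25, "vmwrite",  0x2C627A6},
    {26, "vmxoff",   0x3C8295},
    {27, "vmxon",    0x3C8295},
    {50, "invept",   0x3CA309},
    {53, "invvpid",  0x3D6859},
};
#define VMX_EXITS_LEN (sizeof vmx_exits / sizeof vmx_exits[0])

/* Sum the counters; optionally leave out VMREAD and VMWRITE, which no
 * longer exit to L0 when VMCS shadowing handles them. */
static uint64_t sum_exits(const struct exit_count *v, size_t n,
                          int skip_vmread_vmwrite)
{
    uint64_t total = 0;

    assert(v != NULL);

    for (size_t i = 0; i < n; i++) {
        if (skip_vmread_vmwrite && (v[i].reason == 23 || v[i].reason == 25))
            continue;
        total += v[i].count;
    }
    return total;
}
```

Dividing `sum_exits(vmx_exits, VMX_EXITS_LEN, 0)` by `sum_exits(vmx_exits, VMX_EXITS_LEN, 1)` gives a ratio a bit above 5, consistent with the estimate.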

Posted: **Tue Mar 15, 2016 11:23 am**

feryno wrote:VMPTRLD always operates on machine memory (that is, memory as it is in RAM).
When the parent hypervisor (L0) executes VMPTRLD, its operand is a memory location, addressed through a virtual address, that holds the 64-bit machine address at which the VMCS is located.
e.g.
; erase memory holding VMCS, set revision ID, execute vmxon, vmclear
mov rax,guest_PA_of_VMCS
mov [rsp+0x20],rax
vmptrld [rsp+0x20]

Excuse me for the very basic questions (I'm a beginner...). As far as I understand, the assembly snippet above is what the L1 hypervisor executes in order to "try" to load the VMCS for its guest (VMCS1-2) located at guest_PA_of_VMCS (from the point of view of the L1 hypervisor, guest_PA_of_VMCS is a "real" physical address).
Then, as you said, it is up to the L0 hypervisor, basically its VM exit handler (L0, VM exit reason 21), to find out the machine (host) physical address where VMCS1-2 is located. It can then access VMCS1-2 fields through the L0 hypervisor virtual addresses (VA) at which the region is mapped.
Now my doubt is: how does the L0 hypervisor emulate L1's VMREAD/VMWRITE? Can it use the VMREAD/VMWRITE instructions itself to do the job?

feryno wrote:L0 has to check whether the guest PA is valid and whether it is not the same as the guest PA of VMXON, then map the machine memory of the VMCS somewhere into the VA space of L0 and check whether the revision ID matches.

Are you referring here to comparing the revision ID of the "guest VMXON region" against the revision ID of VMCS1-2?

Thanks all for your help!

Posted: **Tue Mar 15, 2016 2:05 pm**

cianfa72 wrote:Now my doubt is: how does L0 hypervisor emulate L1 VMREAD/VMWRITE ? Can it use VMREAD/VMWRITE instructions itself to do the job ?

Depends on how you implement it. You need to make it read from or write to where the VMCS that the guest wants to access is stored. Anything that works is allowed. If you use the VMCS area the L1 hypervisor set up in the physical guest memory, it can be as simple as a memory read or write to this VMCS area.
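
As a rough illustration of the "simple memory read or write" variant, here is a sketch in C. The `vmcs12_fields` struct is a hypothetical layout that L0 is free to choose; the three field encodings and error number 12 (VMREAD/VMWRITE from/to unsupported VMCS component) are the values given in the Intel SDM. A real handler would cover far more fields and also decode where the instruction's operands come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field encodings as listed in the Intel SDM. */
#define VMCS_GUEST_CR3 0x6802u
#define VMCS_GUEST_RSP 0x681Cu
#define VMCS_GUEST_RIP 0x681Eu

/* VM-instruction error for an unknown field encoding. */
#define VMXERR_UNSUPPORTED_COMPONENT 12

/* Hypothetical L0-defined layout of the software VMCS1-2. */
struct vmcs12_fields {
    uint64_t guest_cr3;
    uint64_t guest_rsp;
    uint64_t guest_rip;
};

/* Map a field encoding to its storage, or NULL if L0 does not know it. */
static uint64_t *vmcs12_slot(struct vmcs12_fields *v, uint32_t encoding)
{
    switch (encoding) {
    case VMCS_GUEST_CR3: return &v->guest_cr3;
    case VMCS_GUEST_RSP: return &v->guest_rsp;
    case VMCS_GUEST_RIP: return &v->guest_rip;
    default:             return NULL;
    }
}

/* Both helpers return 0 on success or the VM-instruction error number. */
static int emulate_vmwrite(struct vmcs12_fields *v, uint32_t encoding,
                           uint64_t value)
{
    uint64_t *slot;

    assert(v != NULL);
    slot = vmcs12_slot(v, encoding);
    if (slot == NULL)
        return VMXERR_UNSUPPORTED_COMPONENT;
    *slot = value;          /* a plain store, no VMX instruction involved */
    return 0;
}

static int emulate_vmread(struct vmcs12_fields *v, uint32_t encoding,
                          uint64_t *value)
{
    uint64_t *slot;

    assert(v != NULL && value != NULL);
    slot = vmcs12_slot(v, encoding);
    if (slot == NULL)
        return VMXERR_UNSUPPORTED_COMPONENT;
    *value = *slot;         /* a plain load */
    return 0;
}
```

The unknown-encoding path is the kind of failure that shows up as a VMREAD or VMWRITE to a nonexistent VMCS field.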

Posted: **Wed Mar 16, 2016 3:20 am**

feryno wrote:VMPTRLD always operates on machine memory (thats memory as is in RAM)

You actually mean the VMPTRLD instruction as executed by the processor, I guess. In other words: when the CPU executes VMPTRLD (in our scenario only L0 hypervisor code can really execute it), the value stored in the referenced memory operand is interpreted by the processor as the machine physical address of the VMCS.

Kevin wrote:
cianfa72 wrote:Now my doubt is: how does L0 hypervisor emulate L1 VMREAD/VMWRITE ? Can it use VMREAD/VMWRITE instructions itself to do the job ?
Depends on how you implement it. You need to make it read from or write to where the VMCS that the guest wants to access is stored. Anything that works is allowed. If you use the VMCS area the L1 hypervisor set up in the physical guest memory, it can be as simple as a memory read or write to this VMCS area.

I was thinking about another way to implement it (maybe the same one previously described by feryno), based on the L0 hypervisor loading VMCS1-2 with VMPTRLD. The processor's current-VMCS pointer then points to the VMCS1-2 region in machine physical memory, which allows the L0 hypervisor to execute VMREAD/VMWRITE directly on VMCS1-2 fields in order to emulate the L1 hypervisor's VMREAD/VMWRITE to VMCS1-2.

Does that make sense?

Posted: **Wed Mar 16, 2016 3:45 am**

The host has only a single current VMCS pointer. It will have to set it to the VMCS belonging to the L1 guest (VMCS0-1), otherwise it can't run it. Of course your vmwrite handler could first load VMCS1-2, then vmwrite on the host, then switch back to VMCS0-1, but this doesn't feel very efficient. Also, as VMCS1-2 is stored in guest physical memory and the guest could modify its contents as it likes, you never want to trust it and you never want to run a VM from it. I guess it is possible to use vmwrite/vmread just for storing data, but why would you when mov is as good?

Directly going to vmread/vmwrite in the vmexit handlers only makes sense if you access a shadow VMCS outside the guest (i.e. VMCS0-2). Your handlers will then need additional logic because what the guest sees and what the real VMCS contains probably isn't exactly the same.

Posted: **Wed Mar 16, 2016 8:44 am**

Kevin wrote:I guess it is possible to use vmwrite/vmread just for storing data, but why would you when mov is as good?

Sorry for the trivial question: what do you mean by "when mov is as good"?

Thanks

Posted: **Wed Mar 16, 2016 10:26 am**

If you only write data with vmwrite so that you can later read it with vmread, but you never actually start a VM from that VMCS, then there is no point in using those instructions. You can then simply write the data to some memory location with normal non-VMX instructions.

nested VM virtualization
