vt-d2 interrupt remapping in virtualized scenario
Hi folks,
consider the virtualization scenario where a PCIe device is assigned to a specific VM (e.g. a PCI passthrough device assigned to a VM on ESXi). I'm trying to better understand how VT-d2 (interrupt remapping) is involved in delivering such an interrupt to the specific VM.
Reading the Intel docs, interrupt remapping (implemented by the northbridge, integrated on-die on modern CPUs) basically maps a "logical interrupt" - programmed by the VM into the assigned device (e.g. the MSI capability register entries in PCIe configuration space, or an MMIO range in the case of MSI-X) - to a physical interrupt message delivered directly to the CPU's local APIC (IRTE entries actually have fields for the destination CPU id, the interrupt vector, plus other interrupt-related attributes).
Now, if that is correct, my doubt is: how does the interrupt remapping engine deliver an interrupt to a specific vCPU assigned to a VM? Are there dedicated software structures (managed by the hypervisor/VMM) involved in this delivery process (e.g. the VMCS or others)?
thanks for your help !
Re: vt-d2 interrupt remapping in virtualized scenario
Are you sure that an interrupt can be delivered to a vCPU?
At the time I studied these things (~1.5 years ago, although I didn't have access to hardware with VT-d), the interrupt came to the physical CPU and then there were only 2 variants: either the interrupt causes a VM_EXIT and is then delivered to the host OS via its IDT, so you can handle it and possibly inject it into the guest, or it doesn't trigger a VM_EXIT (if the interrupt came while the VM was running), in which case all interrupts are delivered via the guest's IDT (if the host needs some of the interrupt vectors, you have to patch the guest's IDT to trigger a VM_EXIT).
Re: vt-d2 interrupt remapping in virtualized scenario
Nable wrote: Are you sure that interrupt can be delivered to vCPU?
No, it was just a guess to understand how things work.
Nable wrote: or it doesn't trigger VM_EXIT (if interrupt came when VM was running), then all interrupts are delivered via guest's IDT (if host needs some of interrupt vectors, you have to patch guest's IDT to trigger VM_EXIT).
In this second case, when the VM is running on a physical CPU (e.g. on a core of a multi-core processor), how does the CPU (core) recognize which device the interrupt is coming from and, in turn, not trigger a VM_EXIT (leaving the guest OS to manage the interrupt using the guest's IDT)?
Is the VMCS structure involved in this delivery process?
Re: vt-d2 interrupt remapping in virtualized scenario
I'm not answering the main question of the thread (so perhaps I should keep quiet), just making some clarifications - I saw too many confusing things not to respond.
Patching the guest IDT is a bad solution in my opinion; for the hypervisor it is easier to intercept interrupts so that every interrupt arriving in guest mode causes a VM exit.
Yes, there are 3 possible situations:
[0] the interrupt occurs in root mode (IF=0 or a high CR8 may block its delivery via the host IDT right now, so if you don't enable interrupts or lower CR8 enough, its delivery is postponed until after switching into guest mode = VM entry) - see the following [1], [2]
[1] the interrupt occurs in guest mode and causes a VM exit (because the hypervisor set the interrupt intercept in the VMCS)
[2] the interrupt occurs in guest mode and doesn't cause a VM exit (because the hypervisor didn't set the interrupt intercept in the VMCS); it is delivered via the guest IDT if the guest's IF=1 and CR8 is low enough
P.S. remember also the interrupt window and some rare instructions which may block interrupt delivery for 1 following instruction (mov ss, ... etc.)
Nable wrote: interrupt causes VM_EXIT and then it's delivered to host OS via its IDT, so you can handle it and possibly inject into guest
If the interrupt caused a VM_EXIT, it hasn't gone through any IDT yet (neither the guest IDT nor the host IDT).
Nable wrote: or it doesn't trigger VM_EXIT (if interrupt came when VM was running)
This may occur; just remember that at every VM exit on an Intel CPU, rflags.IF is cleared to 0 (so only INIT, NMI etc. may come). But if your hypervisor enables interrupts (by executing the STI instruction), then an interrupt may come via the host IDT (also depending on the value in CR8, whether it is not blocked).
Nable wrote: if host needs some of interrupt vectors, you have to patch guest's IDT to trigger VM_EXIT
The host usually doesn't need the guest's interrupt vectors; the host may just need to intercept interrupts.
hypervisor-based solutions developer (Intel, AMD)
Re: vt-d2 interrupt remapping in virtualized scenario
Thanks a lot for the clarification, feryno! Your post is much more correct than my attempt to describe the situation briefly.
feryno wrote: patching guest IDT is bad solution by my opinion, for hypervisor it is easier to intercept interrupts so every interrupt coming in guest mode causes VM exit
Yes, that's a bad and hard solution, but interception of all interrupts can cause significant overhead when passing real devices through to a VM. That's why people came up with the idea of "ELI - exitless interrupt handling" (with these keywords you can find the article about this idea, if you haven't read about it before). The idea is rather interesting, although it's too difficult to implement in a secure and reliable way, so I think it'll never go into production unless Intel^W some CPU vendor adds a feature for specifying the set of interrupt vectors to intercept.
cianfa72 wrote: how does the CPU (core) recognize the device the interrupt is coming from and in turn do not trigger a VM_EXIT (leaving guest OS to manage the interrupt using guest's IDT)?
As far as I understand, the CPU core doesn't know anything about the exact source of an external interrupt.
cianfa72 wrote: Is VMCS structure involved in this delivery process ?
There's a 1-bit setting (see IASDM Vol. 3B part 2, Table 21-5: "Definitions of Pin-Based VM-Execution Controls", bit 0): either all external interrupts are intercepted and cause a VM exit (if they arrive while the VM is running), or they are not intercepted and don't cause a VM exit. The moment when the interrupt will be delivered depends on the factors that feryno mentioned.
Re: vt-d2 interrupt remapping in virtualized scenario
Thanks all for your replies
reading the slides (related to ELI) at http://www.iolanes.eu/_docs/eli_asplos12_slides.pdf
and coming back to the guest-IDT-patching option (physical interrupts do not trigger a VM_EXIT to the hypervisor/VMM when the CPU core is running in VMX guest mode), I guess the only way to implement it is to rewrite the guest's specific IDT entries to trigger an exception (Segment Not Present - #NP). Otherwise the hypervisor/VMM (running, obviously, in VMX root mode) has no way to reprogram a guest IDT entry to point directly to its own handler routine.
Does it make sense ?
Re: vt-d2 interrupt remapping in virtualized scenario
cianfa72 wrote: I guess the only way to implement it is to rewrite guest's specific IDT entries to trigger an exception (Segment Non Present - #NP). Otherwise Hypervisor/VMM (running obviously in VMX root mode) has no way to reprogram a guest's IDT entry to point directly to its handler routine
Why do it in such a complicated way (patching guest interrupt vectors)?
VMX is there to do everything transparently (the guest knows nothing; the hypervisor may manipulate interrupts so that they disappear for the guest, or even create fake interrupts for the guest - e.g. when you need the OS to map some page into virtual address space you may create a fake #PF - that worked fine for me in one MS Win x64 project).
If you want to intercept guest exceptions like #NP, just enable the corresponding bit in the exception bitmap (I did this some time ago quite often with #DB, #BP, #PF, and rarely I even intercepted all guest exceptions - it was a debugger project).
If you want to intercept external interrupts (generated by devices), then set bit 0 (External-interrupt exiting) in the Pin-Based VM-Execution Controls.
As for the link pointing to the slides - maybe they just wanted to improve performance; every VM exit costs some CPU cycles (at least 2048 CPU cycles on my machine - but that seems negligible for a CPU running at a few GHz; maybe some huge, complicated hypervisor adds a lot of overhead and consumes more than 10000 cycles).
hypervisor-based solutions developer (Intel, AMD)
Re: vt-d2 interrupt remapping in virtualized scenario
feryno wrote: which is at least 2048 cpu cycles at me - but that seems to be negligible for CPU running at few GHz
Just in case you are interested:
some devices (especially low-latency network adapters) can generate hundreds or even thousands of IRQs per second. And interrupt coalescing is not a solution (it increases latency). So there are possible use-cases where VM-exit costs are not negligible. Of course, these cases are very rare, and the described idea is more interesting from the theoretical/experimental point of view than from the practical/enterprise one. IMHO, an example of the practical way is the use of para-virtualised drivers (virtio).
Here's one of the links to the full article: http://researcher.watson.ibm.com/resear ... plos12.pdf
feryno wrote: when you need OS to map some page into virtual space you may create fake #PF - that worked fine at me under one ms win x64 project
Sounds amazing; it would be interesting to see a working implementation. Oh, I should try googling for it instead of asking.
feryno wrote: I did it some time ago quite often with #DB, #BP, #PF and rarely I intercepted even all guest exceptions - it was an debugger project
Is this project proprietary (I mean "for internal use only, inside the company") or is it somehow accessible?
cianfa72 wrote: Does it make sense ?
Are you asking about patching the guest? No, in most cases such modifications don't make sense. The straight ways of using hardware virtualization are simpler, more flexible, more reliable, and in almost all real cases their performance is enough.
Re: vt-d2 interrupt remapping in virtualized scenario
Nable wrote: Are you asking about patching guest? No, in most cases such modifications don't make sence. Straight ways of using hardware virtualization are simpler, more flexible, more reliable and in almost all real cases their performance is enough.
My question was simple (I'm a beginner...): consider guest-IDT patching (yes, we know it's not a good idea, but consider it just for a moment). I was just thinking about a possible implementation...
The first idea in my mind was to just rewrite the specific guest IDT entry with a code segment selector + a hypervisor/VMM-provided handler routine offset... but then I realized that when the CPU is running in protected mode with paging enabled, all addresses are treated as virtual addresses valid only in the current (virtual) address space (the hypervisor's handler routine lives, instead, in the hypervisor/VMM address space).
So the only option to implement it is to rewrite the specific guest IDT entry setting the P bit to 0, forcing the CPU (when running in VMX guest mode) to trigger a VM_EXIT upon delivery of a VM-assigned (device) interrupt (provided that the external-interrupt interception bit in the VMCS Pin-Based VM-Execution Controls is enabled).
So my question "Does it make sense?" just meant: is this reasoning correct?
Thanks.
Thanks.
Re: vt-d2 interrupt remapping in virtualized scenario
cianfa72 wrote: but then i realized that when CPU is running in Protected Mode with Paging enabled all addresses are seen as virtual addresses valid just in the current (virtual) address space (hypervisor's handler routine lives, instead, into hypervisor/VMM address space)
So the only option to implement it is to rewrite the specific guest's IDT entry setting P bit to 0 forcing the CPU (when running in vmx guest mode) to trigger a VM_EXIT
Yes, you would have to use such a trick - generate some exception.
Personally I would set External-interrupt exiting in the Pin-Based VM-Execution Controls and also Acknowledge interrupt on exit (so I know the interrupt vector).
Nable wrote: Suddenly amazing, it would be interesting to see working implementation. Oh, I should try googling for it instead of asking.
I'll be quite surprised if you are able to google anything useful.
The target OS for that project was Win x64 (but the same could be done for other OSes).
I had to access the virtual memory of a given process from the hypervisor. Virtual memory is full of holes. Some holes exist because there is simply nothing there.
Some holes exist for performance reasons - e.g. when Win x64 loads an executable, you can find the PE32+ header somewhere in virtual memory (the first page of the executable mapped into virtual memory). If it is some common DLL, you can assume it is already somewhere in physical memory (because a lot of other processes use it), but not yet fully mapped into the virtual memory of the process which has just attempted to load the DLL (the OS maps it into virtual memory on demand as the process accesses it; mapping it whole at once would decrease startup performance, especially when only a few parts of it will ever be used).
So when I wanted to dump the whole executable image, I had to generate a fake #PF for every missing virtual memory page. CR3 had to match the given process at the time of generating the fake #PF. The OS then mapped the missing page into virtual memory, and I was able to dump it as a contiguous range without empty holes. I was told not to use any OS system call; I had to do the dump using only virtualization technology (so the OS didn't know about it and malware couldn't detect it). It had to be done very carefully, because if you generate a fake #PF for a page where the OS doesn't expect to map anything, the OS kills the application - and if it was a ring0 virtual memory range I wanted to dump, then BSOD. The fake #PF was generated as a read access on the page to be mapped in.
I can make a video of how it looks - the holes, the performance (if I remember correctly, it took about 1-2 seconds to force the OS to completely map about 5 MB of missing pages on a Core 2 Duo CPU - the project finished more than 3 years ago). The info I provided is enough to implement it in a similar project of yours.
Intercept CR3 writes. If CR3 matches the given process and interrupts aren't disabled and CR8 is low enough, you can start to generate fake #PFs as read attempts. You can get the necessary memory range from the PE32+ header so you won't generate a #PF outside of the executable image. Don't generate a fake #PF for pages which are already mapped.
Nable wrote: Is this project proprietary ( I mean "for internal (inside the company) use only" ) or it's smth accessible?
Yes, I developed it for somebody else. A few people got binaries after signing an NDA with the owners; a few did not. I can send you email / Yahoo Messenger contacts. The work finished more than 3 years ago. I don't know whether the company still exists, but you may try - it doesn't cost anything.
hypervisor-based solutions developer (Intel, AMD)
Re: vt-d2 interrupt remapping in virtualized scenario
feryno wrote: Personally I would set External-interrupt exiting in Pin-Based VM-Execution Controls and also Acknowledge interrupt on exit (so I know interrupt vector).
Do you mean re-enabling interrupts on VM_EXIT by executing the STI instruction in hypervisor code (RFLAGS.IF = 0 -> 1, in order to acknowledge the interrupt signal and allow the CPU core's LAPIC to deliver the interrupt vector - provided that the CR8 register is low enough)?
feryno wrote: I was tell not to use any OS system call, I had to do the dump using only virtualization technology (so OS didn't know about it and malware couldn't detect it). It had to be done very carefully because if generating fake #PF for a page where OS doesn't expect to map anything then OS kills the application and if it was ring0 virt. memory range I wanted to dump then BSOD. Fake #PF was generated as read access on the page to be mapped in.
Having said I'm a beginner... AFAIK, to enable the guest OS to handle an exception (a fake #PF in this case), we (the hypervisor/VMM) have to save, on the current stack position or in specific CPU registers (MSRs??), the needed information about the fault itself (the virtual memory address that faulted, etc.). When generating the fake #PF, did you provide that data accordingly?
feryno wrote: I can make some video how does it look - the holes, the performance (If I remember correctly it took about 1-2 seconds to force OS to completely map about 5 MB of missing pages on Core 2 Duo CPU - the project finished more than 3 years ago).
To me it would be very interesting...
Re: vt-d2 interrupt remapping in virtualized scenario
cianfa72 wrote: Do you mean re-enable interrupts on VM_EXIT executing STI instruction in hypervisor code
No, I have never enabled interrupts in root mode in my tiny Intel hypervisors.
What I meant was this:
When external-interrupt exiting and acknowledge-interrupt-on-exit are enabled, an external interrupt causes a VM exit and the guest doesn't know anything about it. The CPU transfers control to the hypervisor's VM-exit handler and you get the interrupt vector from the VM-exit interruption-information field. Then your hypervisor may handle the interrupt itself (so for the guest the interrupt never occurred; the guest doesn't know anything about it), or the hypervisor may inject it back into the guest by copying the value from the VM-exit interruption-information field into the VM-entry interruption-information field (+ clearing bit 12 is always a good choice) and executing vmresume; the guest then suddenly learns that the interrupt occurred because it hits the guest IDT.
The host IDT is not involved here; the interrupt never goes through the host IDT, only via the guest IDT in case you injected it into the guest using event injection.
In short, my host IDT is hit only when there is a bug in my tiny hypervisor (my hypervisor code is usually somewhere between 4-16 kB - I work alone, not in a team; the rest of the hypervisor-occupied memory, up to 8 MB, is paging tables, EPT tables, VMCSes for up to 64 CPUs, a stack for every CPU, etc.). But a more complex and advanced hypervisor (done as team work) may need to enable interrupts in root mode, and then external interrupts are delivered via its host IDT.
My tiny hypervisors operate in root mode only very rarely (only a few times per second) and for a very short time (about 2000 CPU cycles).
You wrote you are a beginner - so do not think in any complicated way = do not enable interrupts in the hypervisor (enabling interrupts by executing STI makes the situation much more difficult: then some external interrupts arrive in root mode via the host IDT and you have to inject them into the guest).
Just remember that in root mode interrupts are disabled at VM exit, so no external interrupt arrives (only INIT, NMI etc. may come).
But I remember that I had to enable interrupts in root mode in one hypervisor; that was for AMD (not Intel) and for something difficult and off-topic (under AMD you should execute STGI; under Intel STI is enough).
about the fake #PF:
I have described that technique enough. You are thinking in a complicated way (no need to play with MSRs). Just remember that for the guest it was completely transparent: the OS running as guest thought it was its own program accessing some virtual pages, so the OS created the mappings (the DLL was in RAM, and as the program accessed some of its parts it generated #PFs, at which the OS created entries in the paging tables and returned execution back to the program; for pages which were not accessed by the program, the fake page faults were generated by the hypervisor). For the OS it was not important whether the #PF was generated by the program accessing a not-yet-present virtual memory page (real #PF) or generated by the hypervisor (fake #PF).
cianfa72 wrote: To me it would be very interesting...
http://fdbg.x86asm.net/h00.00A2.dump.zip
Here you have a dump made by one of my old hypervisor projects on Windows Server 2003 R2 x64. It is a dump of the running app_stop.exe program. The program runs in 5 virtual memory pages, 200000000h-200004FFFh, and all its 5 pages are present. Then there are some pages for OS structures and for the stack. Then there are some DLLs, of which only some parts were used, so the whole virtual memory is only 1.3 MB although the sum of all the DLLs is a few MB (not-yet-accessed pages of the DLLs are not present in virtual memory). You need an x64 OS from MS to run the included show_dump.exe to view the dump in a comfortable way - load the one of the 3 files which doesn't have any extension. If you don't have such an OS, I can describe the structure of the dump.
hypervisor-based solutions developer (Intel, AMD)
Re: vt-d2 interrupt remapping in virtualized scenario
feryno wrote: about fake #PF: I described that technique enough.
Maybe the piece of information missing for me was the following (e.g. IASDM Vol 3C 33.2):
Event injection. VMX operation allows injecting interruptions to a guest virtual machine through the use of the VM-entry interruption-information field in the VMCS. Injectable interruptions include external interrupts, NMI, processor exceptions, software generated interrupts, and software traps. If the interrupt-information field indicates a valid interrupt, exception or trap event upon the next VM entry, the processor will use the information in the field to vector a virtual interruption through the guest IDT after all guest state and MSRs are loaded. Delivery through the guest IDT emulates vectoring in non-VMX operation by doing the normal privilege checks and pushing appropriate entries to the guest stack (entries may include RFLAGS, EIP and exception error code). A VMM with host control of NMI and external interrupts
So, IIUC, the fake #PF read access generated by the hypervisor to force the guest OS to map pages into the (guest) process virtual address space requires setting up specific VMCS components (the VM-entry interruption-information field, the VM-entry exception error-code field, etc.) just before the hypervisor resumes the VM itself (VM entry/resume)...?
feryno wrote: You need x64 OS from MS to run the included show_dump.exe to see the dump in a comfortable way - load the one of the 3 files which doesn't have any extension. If you don't have such OS I can describe the structure of the dump.
Thanks, I'll just do it...
your help is very appreciated!
Re: vt-d2 interrupt remapping in virtualized scenario
cianfa72 wrote: Maybe the piece of information missing to me was the following (e.g. IASDM Vol 3C 33.2)
Nothing is missing on your side - this is described nowhere in the CPU manuals (some info only spreads when you talk with hypervisor developers).
cianfa72 wrote: So, IIUC, the fake #PF read access generated by hypervisor to force guest OS to map pages into (guest) virt. process address space, require the setup of specific VMCS components (VM-entry interruption-information field, VM-entry exception error-code field etc.) just before the hypervisor resumes the VM itself (VM-entry/resume)....?
Yes, it is done by setting only these 2 fields in the VMCS.
The hypervisor may create a fake #PF; the OS running in the guest then thinks the #PF was regularly triggered when an executable attempted to access a not-yet-present page in virtual memory (if you make a mistake, the OS usually kills the executable if it was ring3, and BSODs if it was ring0).
cianfa72 wrote: Thanks just do it.... your help is very appreciated !
I just wanted you to see the holes in virtual memory. As for performance - the MS Windows OS maps pages on demand (as the executable accesses them). If the OS mapped everything at program startup, it would take too long; it is better to delay most of the mapping so the user can start interacting with the just-launched program as fast as possible. Big parts of DLLs are never accessed. Whole DLLs are loaded in physical memory, but only parts of them are mapped into the virtual memory of running processes. You can see the power of the hypervisor here - it can force the guest to map missing pages. That is useful when you e.g. want to watch the guest and scan for possible infections. The whole fake #PF is a task of only a few minutes of programming (setting the 2 fields in the VMCS). The pain is identifying the running processes, so that you inject it into the correct process and use the correct virtual memory address.
If you start to develop some simple hypervisors, my suggestions concerning interrupts are:
[0] do it as easily as possible (you can make it more advanced later) and let root mode run as little as possible (intercept only what you need; for e.g. a VM exit caused by CPUID you really don't need to push/pop all 16 GPR64s, etc.)
[1] don't enable interrupts with the STI instruction in the VM-exit handler (so it appears as if all interrupts occur in guest mode)
[2] your hypervisor may intercept interrupts from the guest and may make them a) disappear for the guest, b) be injected back into the guest, or c) create a nonexistent interrupt
hypervisor-based solutions developer (Intel, AMD)