IOMMU translation for containers running inside a VM

cianfa72
Member
Posts: 97
Joined: Sat Dec 22, 2012 12:01 pm

Post by cianfa72 »

Consider a Linux guest VM running on a qemu/kvm host with an Intel multicore processor supporting both VT-x and VT-d. The Linux guest runs a DPDK-based container (e.g. Docker) implementing a network device.

The NICs that qemu/kvm exposes to the VM are emulated e1000 cards. The DPDK container requires direct I/O access from the VM's NICs to the containerized process's memory acting as network buffers (the vNIC's memory-mapped tx/rx rings plus data buffers). To do that, a "virtual" IOMMU translates the guest DMA (PCI) bus-relative target addresses (GIOVAs) into the relevant guest-physical addresses (GPAs); note that the qemu/kvm VM configuration exposes IOMMU support to the guest. qemu/kvm configures the hardware (PGTT=011b) to perform both first-stage and second-stage (nested) translation when the logical processor runs in VMX non-root mode under the control of any of the VM's vCPU VMCSs.

My concern now is about the emulation of the e1000 NICs implemented within the qemu process. Basically, I believe the qemu emulation code actually accesses (reads/writes) virtual addresses within its own process, to which the relevant vNIC memory-mapped control structures and target data buffers are mapped.

Do you think this actually makes sense? Thanks.
Last edited by cianfa72 on Thu May 29, 2025 3:21 am, edited 1 time in total.
Octocontrabass
Member
Posts: 5815
Joined: Mon Mar 25, 2013 7:01 pm

Post by Octocontrabass »

The IOMMU is used to let the VM access physical hardware. When the VM accesses emulated hardware, like the emulated e1000, the IOMMU is not involved.

Post by cianfa72 »

Octocontrabass wrote: Mon May 26, 2025 3:01 pm The IOMMU is used to let the VM access physical hardware. When the VM accesses emulated hardware, like the emulated e1000, the IOMMU is not involved.
As I understand it, you mean that an OS, regardless of whether it runs on bare metal or inside a VM, can let devices (physical or emulated, respectively) perform DMA to target memory locations without any address remapping by an IOMMU (e.g. the DMA target address programmed by the NIC's device driver = the physical address of the target memory location).

However, a VMM/hypervisor leverages the IOMMU to support PCI passthrough/SR-IOV for VMs.

That means the IOMMU feature possibly exposed to VMs by the hypervisor (e.g. qemu/kvm) is useful from the guest's viewpoint only if the guest itself implements a hypervisor (as in nested virtualization).

Nevertheless, there could be cases where the guest OS is designed in such a way that the DMA target I/O virtual addresses (IOVAs) differ from the physical addresses of the target memory locations. In those cases, I/O mapping/translation of IOVAs to PAs is actually needed.
Last edited by cianfa72 on Tue May 27, 2025 12:04 pm, edited 2 times in total.

Post by Octocontrabass »

Emulated devices do not use the physical IOMMU, regardless of whether or not the guest uses its emulated IOMMU.

Post by cianfa72 »

Octocontrabass wrote: Tue May 27, 2025 10:01 am Emulated devices do not use the physical IOMMU, regardless of whether or not the guest uses its emulated IOMMU.
Ok, you mean... suppose the guest OS uses the emulated IOMMU to remap DMA target addresses to its "physical" addresses. It is fooled by the hypervisor into thinking it is running on real HW, hence it configures what it believes is a real IOMMU (of course the hypervisor must expose IOMMU support to the guest). The guest creates in its (guest) memory the relevant structures needed by the IOMMU, i.e. legacy- or scalable-mode tables and first-stage translation paging structures. However, the point is that emulated devices will never use them?

Basically, the guest IOMMU structures are there in the guest's memory, but none of the VM's emulated devices (e.g. the emulated e1000) will use them.

Post by Octocontrabass »

Emulated devices can use the guest IOMMU. My point is the hypervisor can't use the hardware IOMMU when the guest tries to use its IOMMU with an emulated device. For emulated devices, the hypervisor has to emulate the guest IOMMU entirely in software.
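
To make "entirely in software" concrete, here is a minimal Python sketch of what an emulated NIC's DMA write could look like when the guest has programmed its emulated IOMMU. It is a toy model, not qemu's actual code: the names (`translate_iova`, `emulated_dma_write`, `TranslationFault`) are hypothetical, the guest's multi-level IOMMU tables are flattened into a dict of page numbers, and guest RAM is a single bytearray indexed by GPA.

```python
PAGE_SIZE = 4096


class TranslationFault(Exception):
    """Raised when the guest has not mapped the IOVA page (hypothetical name)."""


def translate_iova(iova_to_gpa_pages, iova):
    """Software stand-in for the IOMMU table walk: IOVA -> GPA."""
    page, offset = divmod(iova, PAGE_SIZE)
    if page not in iova_to_gpa_pages:
        raise TranslationFault(hex(iova))
    return iova_to_gpa_pages[page] * PAGE_SIZE + offset


def emulated_dma_write(guest_ram, iova_to_gpa_pages, iova, data):
    """Copy `data` into guest RAM at `iova`, translating page by page.

    No bus transaction and no hardware IOMMU are involved: the CPU running
    the device model performs both the lookup and the copy.
    """
    written = 0
    while written < len(data):
        gpa = translate_iova(iova_to_gpa_pages, iova + written)
        # Never cross a page boundary within one copy: the next IOVA page
        # may map to an unrelated guest-physical page.
        chunk = min(len(data) - written, PAGE_SIZE - gpa % PAGE_SIZE)
        guest_ram[gpa:gpa + chunk] = data[written:written + chunk]
        written += chunk


# Toy setup: 4 pages of guest RAM; the guest maps IOVA pages 0x10 and 0x11
# to guest-physical pages 2 and 0 respectively.
guest_ram = bytearray(4 * PAGE_SIZE)
iova_to_gpa_pages = {0x10: 2, 0x11: 0}

# A 4-byte write that straddles the boundary between the two IOVA pages.
emulated_dma_write(guest_ram, iova_to_gpa_pages, 0x10 * PAGE_SIZE + 4094, b"ABCD")
```

The write is contiguous in IOVA space but lands in two non-adjacent guest-physical pages, which is exactly the remapping the guest asked its (emulated) IOMMU to perform.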

Post by cianfa72 »

Octocontrabass wrote: Fri May 30, 2025 11:37 am Emulated devices can use the guest IOMMU. My point is the hypervisor can't use the hardware IOMMU when the guest tries to use its IOMMU with an emulated device.
Sorry, I'm not sure I fully grasp your point (I'm a newbie...). As far as I know, the initiators of IOMMU transactions are always DMA-capable devices other than the processor (physical/logical CPUs). Therefore the OS/driver code only creates/sets up in memory the IOMMU-specific structures, which are actually used by DMA-capable devices when they try to access main memory (basically, accesses to the target I/O virtual/bus-relative addresses programmed by device drivers are translated by the IOMMU to the physical addresses of the target locations in main memory).

In the case of a VM with emulated devices, of course the hypervisor (e.g. qemu/kvm) can't use the hardware IOMMU to manage what appear to the guest as DMA accesses from (emulated) devices to (guest) main memory.
Octocontrabass wrote: Fri May 30, 2025 11:37 am For emulated devices, the hypervisor has to emulate the guest IOMMU entirely in software.
Ok, your point is that, for emulated devices, the hypervisor has to emulate entirely in software the presence/function/behavior of the guest IOMMU. For instance, for a device it emulates, the hypervisor must write to the locations in host/machine main memory that back the GPAs the guest device driver expects that (emulated) device to access.
Last edited by cianfa72 on Mon Jun 02, 2025 10:45 am, edited 5 times in total.

Post by Octocontrabass »

cianfa72 wrote: Mon Jun 02, 2025 9:51 am As far as I know, IOMMU transactions' initiators are always DMA-capable devices other than the processor (physical/logical CPUs).
Correct.
cianfa72 wrote: Mon Jun 02, 2025 9:51 am Therefore guest code can only create/setup in memory IOMMU's specific structures actually used by DMA-capable devices when they try to access main memory locations (basically target I/O virtual/bus-relative addresses are translated by IOMMU to physical addresses of the target locations in main memory).
Correct. For PCI passthrough, the hypervisor translates the guest's IOMMU structures from guest-physical addresses to host-physical addresses to use with the hardware IOMMU. For emulated devices, the hypervisor still reads the guest's IOMMU structures to find the correct guest-physical addresses, but it doesn't use them with the hardware IOMMU because the device is emulated on the CPU.
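
The passthrough case can be modeled as composing two mappings. The Python sketch below is a simplified model, not any hypervisor's actual implementation: `build_shadow_mappings` and both page maps are hypothetical names, page tables are flattened into dicts of page numbers, and hardware that supports nested translation can apply the second stage itself instead of relying on a precomputed table like this one.

```python
def build_shadow_mappings(guest_iova_to_gpa, gpa_to_hpa):
    """Compose guest IOVA->GPA with host GPA->HPA into IOVA->HPA.

    The result is what the hypervisor would install in the *hardware*
    IOMMU on behalf of a passed-through device.
    """
    shadow = {}
    for iova_page, gpa_page in guest_iova_to_gpa.items():
        if gpa_page not in gpa_to_hpa:
            # The guest pointed the device at a GPA with no host backing.
            # Leave it unmapped so a DMA to it faults instead of landing
            # in some unrelated host page.
            continue
        shadow[iova_page] = gpa_to_hpa[gpa_page]
    return shadow


# The guest maps IOVA pages 1 and 2; the host backs only guest page 5.
guest_iova_to_gpa = {1: 5, 2: 9}
gpa_to_hpa = {5: 100}
shadow = build_shadow_mappings(guest_iova_to_gpa, gpa_to_hpa)
```

For an emulated device the same guest table is still read, but only the first lookup (IOVA to GPA) matters, and nothing is installed in hardware.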

It seems like you edited your post while I was writing mine.

Post by cianfa72 »

Octocontrabass wrote: Mon Jun 02, 2025 10:37 am Correct. For PCI passthrough, the hypervisor translates the guest's IOMMU structures from guest-physical addresses to host-physical addresses to use with the hardware IOMMU.
Ok, so in this case the hypervisor does program the hardware IOMMU (including creating/configuring the relevant mapping structures) to map the DMA target addresses of PCI passthrough devices to the relevant addresses in the host/machine address space (HPAs).
Octocontrabass wrote: Mon Jun 02, 2025 10:37 am For emulated devices, the hypervisor still reads the guest's IOMMU structures to find the correct guest-physical addresses, but it doesn't use them with the hardware IOMMU because the device is emulated on the CPU.
Ok, so in this case (emulated devices) the hypervisor code that emulates the device (e.g. qemu), in order to emulate the device's DMA transactions into guest memory, actually accesses the HPAs that back the guest DMA target locations, without using any hardware IOMMU, just the MMU that maps the hypervisor process's VAs.
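
The last step described above (GPA to an address inside the hypervisor process) can be sketched as follows. This is a minimal Python model under stated assumptions: guest RAM is assumed to be mapped into the hypervisor process's address space as a few regions, each `bytearray` stands in for one such host virtual range, and `RamRegion`, `find_region` and `dma_write_gpa` are hypothetical names, not qemu APIs.

```python
from dataclasses import dataclass


@dataclass
class RamRegion:
    gpa_start: int      # first guest-physical address covered by the region
    backing: bytearray  # stands in for the host virtual range backing it


def find_region(regions, gpa, length):
    """Return the region that fully contains [gpa, gpa + length)."""
    for region in regions:
        end = region.gpa_start + len(region.backing)
        if region.gpa_start <= gpa and gpa + length <= end:
            return region
    raise ValueError(f"GPA range {gpa:#x}+{length} is not backed by guest RAM")


def dma_write_gpa(regions, gpa, data):
    """Emulated DMA write to a GPA: an ordinary store by the device model.

    The host CPU's MMU resolves the host virtual address to a host-physical
    one, as for any other memory access made by the process.
    """
    region = find_region(regions, gpa, len(data))
    offset = gpa - region.gpa_start
    region.backing[offset:offset + len(data)] = data


# Two small regions of guest RAM at GPA 0x0 and GPA 0x1000.
regions = [RamRegion(0x0, bytearray(16)), RamRegion(0x1000, bytearray(16))]
dma_write_gpa(regions, 0x1004, b"hi")
```

If the guest has enabled its emulated IOMMU for the device, the GPA passed in here would first be obtained by translating the IOVA through the guest's IOMMU structures; otherwise the address the driver programmed is used as the GPA directly.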
Last edited by cianfa72 on Mon Jun 02, 2025 11:20 am, edited 1 time in total.

Post by Octocontrabass »

That's correct.

Post by cianfa72 »

Ok, just to reiterate: let's assume the hypervisor/VMM exposes IOMMU support to the VM. With emulated devices alone (no SR-IOV/PCI passthrough or paravirtualization), the guest can still create and configure IOMMU structures in guest memory (e.g. Intel scalable-mode root tables, context tables and PASID tables, plus the IOMMU paging structures), since it "thinks" it is running on hardware that supports them (in reality it is being fooled by the hypervisor). However, the hypervisor/VMM doesn't use the hardware IOMMU at all, nor does the guest run in VMX non-root mode with IOMMU hardware translation enabled.

Post by Octocontrabass »

That all sounds correct to me.