KVM page fault exceptions

cianfa72 · Post by **cianfa72** » Fri Dec 16, 2016 2:16 am

Hi,

a basic question about the KVM implementation (I'm using KVM-based VMs to implement a virtual network lab running on top of an Intel-based server).

The Intel server supports EPT (nested paging), so I'd like to better understand how page fault exceptions raised by the CPU while the guest OS is running are handled in this scenario.

As far as I understand, by leveraging EPT the kvm module may be able to disable the #PF VM exit (by configuring the related bit in the VMCS exception bitmap for the guest VM). In that case, page fault exceptions (#PF) caused by the guest OS's own paging translations are handled by the guest OS itself, without a VM exit handing control back to KVM.

Furthermore, the allocation of host physical pages to back "guest physical pages", including the related EPT mapping (basically the process of filling in the EPT entries for them), will be handled by a KVM handler upon an EPT VIOLATION or MISCONFIGURATION VM exit (an EPT-induced VM exit).

Did I get it right? Thanks

cianfa72 · Post by **cianfa72** » Wed Dec 21, 2016 6:39 am

Help please

Nable · Post by **Nable** » Wed Dec 21, 2016 5:26 pm

What is the exact problem? Even if you find the manual too large, you can experiment with page tables.
I should read old sources to be sure but AFAIR it works like this:
1. EPT entries have a "present" bit that allows you to allocate physical memory lazily, i.e. when the guest accesses a GVA (guest virtual address) that has a mapped GPA (guest physical address) according to the guest's page tables but doesn't have a present HPA (host physical address).
2. When a page fault occurs during the GVA -> GPA translation (before doing GPA -> HPA), this page fault can be handled by the guest without any VM exits if you configure VMX properly.
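The two cases above can be illustrated with a toy model. This is only a sketch: the dictionaries `guest_page_table` and `ept`, the function `access`, and all page numbers are made up for illustration and do not mirror KVM's data structures. A "not present" EPT entry is modelled simply as a missing key.

```python
# Toy model of two-stage address translation (GVA -> GPA -> HPA).
# All names and numbers are illustrative assumptions, not KVM internals.

PAGE_SIZE = 4096


class GuestPageFault(Exception):
    """#PF raised during GVA -> GPA; delivered to the guest, no VM exit."""


def access(gva, guest_page_table, ept, free_host_pages, log):
    """Translate one guest virtual address, filling the EPT lazily."""
    gvpn, offset = divmod(gva, PAGE_SIZE)

    # Stage 1: guest page tables (owned and maintained by the guest OS).
    if gvpn not in guest_page_table:
        log.append("guest #PF")
        raise GuestPageFault(hex(gva))
    gpn = guest_page_table[gvpn]

    # Stage 2: EPT (owned by the hypervisor). A missing key stands in for
    # a not-present EPT entry and models an EPT-violation VM exit.
    if gpn not in ept:
        log.append("EPT violation")
        ept[gpn] = free_host_pages.pop()  # hypervisor backs the page lazily

    return ept[gpn] * PAGE_SIZE + offset


log = []
guest_pt = {0x10: 0x2}      # guest maps virtual page 0x10 to guest-physical page 0x2
ept = {}                    # nothing is backed by host memory yet
free_pages = [0x7, 0x8]     # hypothetical free host page frames

first = access(0x10010, guest_pt, ept, free_pages, log)   # EPT violation, then mapped
second = access(0x10020, guest_pt, ept, free_pages, log)  # EPT entry already present

try:
    access(0x20000, guest_pt, ept, free_pages, log)       # not mapped by the guest
except GuestPageFault:
    pass

print(log)
```

In this model only the first touch of a guest-physical page reaches the hypervisor; the access to an address the guest never mapped stays entirely inside the guest.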

cianfa72 · Post by **cianfa72** » Thu Dec 22, 2016 2:08 am

Thanks for the reply

Nable wrote:What is the exact problem? Even if you find the manual too large, you can experiment with page tables.
I should read old sources...

Which manual are you referring to? Furthermore, could you tell me which kvm source file (.c) implements it?

Nable wrote:1. EPT entries have a "present" bit that allows you to allocate physical memory lazily, i.e. when the guest accesses a GVA (guest virtual address) that has a mapped GPA (guest physical address) according to the guest's page tables but doesn't have a present HPA (host physical address).

If I understand correctly, this condition (present bit missing in the EPT entry for the HPA page) would trigger an EPT-induced VM exit (EPT VIOLATION or MISCONFIGURATION VM exit) handled by a specific kvm handling routine, right?

Nable · Post by **Nable** » Fri Dec 23, 2016 5:48 am

> Which manual are you referring to?
I'm talking about the IASDM (Intel 64 and IA-32 Architectures Software Developer's Manual), of course. I don't remember the exact volume number, though; I'll look it up in the evening.

> Furthermore, could you tell me which kvm source file (.c) implements it?
http://lxr.free-electrons.com/source/arch/x86/kvm/vmx.c

> If I understand correctly, this condition (present bit missing in the EPT entry for the HPA page) would trigger an EPT-induced VM exit (EPT VIOLATION or MISCONFIGURATION VM exit) handled by a specific kvm handling routine, right?
Yes, you are right. In KVM this exit reason is defined as EXIT_REASON_EPT_VIOLATION and it is handled by the "handle_ept_violation" function. Here it is: http://lxr.free-electrons.com/source/ar ... mx.c#L6111

Btw, I should note that Intel manuals are quite difficult to read, not just because of the complex subject but because they have a convoluted style of explanation in which you have to constantly jump from chapter to chapter to get the whole picture (AMD manuals as a whole, and the SVM description specifically, are much easier to understand). The KVM description is even worse; it's extremely convoluted. The Linux source code has a similar problem, although an IDE can help you navigate through those thousands of lines. While I was learning about hardware-assisted virtualization, the Bochs and Palacios source code helped me very much (yes, Palacios is mostly dead and its code style isn't very good, but at least it's much easier to read).

cianfa72 · Post by **cianfa72** » Sat Dec 24, 2016 10:06 am

Nable wrote:What is the exact problem?

I was wondering about the following (you can find the same question asked on other forums without a clear answer). Just to recap the scenario: I'm working on a virtual networking lab where each network node is implemented as a qemu/kvm-based VM running on top of a bare-metal Ubuntu Linux system equipped with a huge amount of RAM (128 GB).

Memory is not an issue, as the following shows:

Code:

root@unl01:~# free -g
             total       used       free     shared    buffers     cached
Mem:           125         19        106          0          0          1
-/+ buffers/cache:         17        107
Swap:          127          0        127
root@unl01:~#

Nevertheless, for the Linux process running the qemu VM instance, the number of minor page faults keeps increasing over time (major page faults, by contrast, are few and remain stable once the VM has finished booting). See, for instance, the following output (two samples taken a couple of seconds apart) for the process with PID 47716 running an instance of the qemu-system-x86_64 executable:

Code:

root@unl01:~# ps -p 47716 -o min_flt,maj_flt,cmd
 MINFL  MAJFL CMD
222142654  69 /opt/qemu/bin/qemu-system-x86_64 -device e1000,netdev=net0,mac=50:01:00:1a:00:00 -netdev tap,id=ne
root@unl01:~# 
root@unl01:~# 
root@unl01:~# ps -p 47716 -o min_flt,maj_flt,cmd
 MINFL  MAJFL CMD
222148030  69 /opt/qemu/bin/qemu-system-x86_64 -device e1000,netdev=net0,mac=50:01:00:1a:00:00 -netdev tap,id=ne
root@unl01:~#

As you can see, the number of major page faults is stable (69) whereas the number of minor page faults keeps incrementing.
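For a rough rate, the two samples above can simply be subtracted. The sketch below also shows how the same counters could be read directly from `/proc/<pid>/stat`, where `minflt` and `majflt` are fields 10 and 12 according to proc(5). The helper names are my own and are not part of any tool.

```python
# Sketch: minor/major fault counters as reported by `ps -o min_flt,maj_flt`.
# Helper names are illustrative; field positions follow proc(5).


def parse_faults(stat_line):
    """Return (minflt, majflt) from the contents of /proc/<pid>/stat."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split only what follows the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N is rest[N - 3].
    return int(rest[10 - 3]), int(rest[12 - 3])


def read_faults(pid):
    """Read the counters of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_faults(f.read())


# Delta between the two ps samples shown above.
minflt_delta = 222148030 - 222142654
majflt_delta = 69 - 69
print(f"minor faults: +{minflt_delta}, major faults: +{majflt_delta}")
```

Calling `read_faults(47716)` twice with a known sleep in between would turn this delta into a faults-per-second figure.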

A large amount of RAM is free, so why would the kernel memory manager continuously try to shrink the working set of the process running the qemu/kvm VM instance (resulting in minor page faults when guest code accesses those memory pages again)?

Googling for it, I found this page: http://kneuro.net/linux-mm/index.php?fi ... cache.html
Reading it, as far as I understand, the Linux memory manager always tries to move pages from the active_list to the inactive_list, unmapping them from the process's virtual memory (basically clearing the process's PTEs for them). When the process then tries to access one of those pages, the page fault handler finds it still in memory and simply restores the PTE pointing to it (this is what the Linux kernel counts as a minor page fault).

What do you think? Could this be a valid explanation?

Nable · Post by **Nable** » Sat Dec 24, 2016 12:30 pm

1. Why do you have swap? And why do you have so much swap? This sounds like a bad idea at first. If you don't need hibernation, you most likely don't need swap at all.
2. How much memory do you allocate for the VMs? I don't see the "-m" switch in the command line. Note that the kernel allocates memory lazily for all processes, including QEmu instances.
3. Did you enable KSM? This is an in-kernel service that periodically looks for pages with the same contents and merges them into one page with CoW (copy-on-write) logic. KSM helps a lot with VMs, but it would of course contribute to the page fault counters, although you don't lose much performance because of it.
4. Why did you choose VMs when containers (LXC) are more than enough for this task?

cianfa72 · Post by **cianfa72** » Mon Dec 26, 2016 12:56 pm

Nable wrote:1. Why do you have swap? And why do you have so much swap? This sounds like a bad idea at first. If you don't need hibernation, you most likely don't need swap at all.

I disabled swap (swapoff -a); anyway, I have never seen any swap activity while the VMs were running...

Code:

root@unl01:~# free -k
             total       used       free     shared    buffers     cached
Mem:     131762936   25273540  106489396      10988     194212    1818944
-/+ buffers/cache:   23260384  108502552
Swap:            0          0          0
root@unl01:~#

Nable wrote:2. How much memory do you allocate for the VMs? I don't see the "-m" switch in the command line. Note that the kernel allocates memory lazily for all processes, including QEmu instances.

Each VM is assigned 2 GB of RAM.

Nable wrote:3. Did you enable KSM? This is an in-kernel service that periodically looks for pages with the same contents and merges them into one page with CoW (copy-on-write) logic. KSM helps a lot with VMs, but it would of course contribute to the page fault counters, although you don't lose much performance because of it.

KSM is disabled as follows:

Code:

root@unl01:~# cat  /sys/kernel/mm/ksm/run
0
root@unl01:~#

Nable wrote:4. Why did you choose VMs when containers (LXC) are more than enough for this task?

My virtual networking lab is based on this project where each network node (router) is implemented via a dedicated VM

However, even with swap turned off, the qemu processes' minor page fault counters keep increasing. Considering, furthermore, that the lab VMs have been running for at least a week, I'm not sure whether this behaviour should be considered expected (basically, I haven't found a clear explanation for it)...

Nable · Post by **Nable** » Tue Dec 27, 2016 3:53 pm

If you don't see a significant performance loss, why should you care about the counters? And it makes sense to enable KSM, after all.
Btw, I've thought about another possible source of minor page faults: on each context switch the kernel has to remove/replace the mappings of the current process so that the next one won't be able to read/write anything from there. And then the kernel has to enable the mappings for the new process. Maybe it's also done in some lazy way. I don't know enough details about Linux virtual memory management to confirm or refute this idea.

cianfa72 · Post by **cianfa72** » Wed Dec 28, 2016 11:55 am

Nable wrote:another possible source of minor page faults: on each context switch the kernel has to remove/replace the mappings of the current process so that the next one won't be able to read/write anything from there. And then the kernel has to enable the mappings for the new process.

Sure, on x86, upon a context switch, the Linux kernel has to switch the CR3 register to point to the address space of the context being loaded. Nevertheless, on a system with plenty of RAM, I see no reason to unmap the previous context's memory pages (basically invalidating the associated page table entries), which would then incur minor page faults when that process's address space is loaded again.

From the Linux kernel's point of view, the host memory pages mapped to guest physical memory pages via EPT belong to the qemu process's (user) address space. Regarding the EPT hierarchy (PML4T, PDPT, PDT and PT): is the memory for it actually allocated by kvm in the context of the qemu process?

Nable · Post by **Nable** » Wed Dec 28, 2016 3:50 pm

cianfa72 wrote:Sure, on x86, upon a context switch, the Linux kernel has to switch the CR3 register to point to the address space of the context being loaded.

There are also different forms of TLB invalidation for this (e.g. INVLPG and INVPCID).

cianfa72 wrote:Nevertheless, on a system with plenty of RAM, I see no reason to unmap the previous context's memory pages (basically invalidating the associated page table entries), which would then incur minor page faults when that process's address space is loaded again.

Processes shouldn't be able to access each other's data, so the kernel has to change the address space when it switches to a different process. Btw, it's also a good idea to pin QEmu processes to specific CPU cores, because the scheduler likes to constantly move unpinned processes from core to core and (consequently) from one NUMA node to another.
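A minimal sketch of such pinning from Python, assuming a Linux host: `os.sched_setaffinity` acts on one thread ID at a time, so the helper walks `/proc/<pid>/task` to cover every thread of the process. The function name `pin_process` is made up, and on a real system the `taskset` tool or the VM manager's own pinning options are the more usual route.

```python
# Sketch: pin every thread of a process to a fixed set of CPUs (Linux only).
# pin_process is an illustrative helper, not part of QEMU or KVM.
import os


def pin_process(pid, cpus):
    """Apply the CPU set to each thread listed under /proc/<pid>/task."""
    for tid in os.listdir(f"/proc/{pid}/task"):
        # On Linux, sched_setaffinity() operates on a single thread ID.
        os.sched_setaffinity(int(tid), cpus)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)        # CPUs this process may run on
    print(pin_process(os.getpid(), {min(allowed)}))
    pin_process(os.getpid(), allowed)        # restore the original set
```

Pinning a process you do not own needs the appropriate privileges, and threads created after the call are not covered by this loop.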

cianfa72 wrote:From the Linux kernel's point of view, the host memory pages mapped to guest physical memory pages via EPT belong to the qemu process's (user) address space. Regarding the EPT hierarchy (PML4T, PDPT, PDT and PT): is the memory for it actually allocated by kvm in the context of the qemu process?

As far as I remember, that's true: QEmu needs access to guest memory, so the whole(?) guest RAM is mapped into the QEmu process.

Btw, I've remembered one more source of periodic EPT faults: accesses to memory-mapped virtual devices are intercepted by leaving the corresponding guest "physical" pages unmapped. Switching to VirtIO paravirtual devices may improve the situation (they use buffers in memory areas shared between host and guest).

cianfa72 · Post by **cianfa72** » Fri Dec 30, 2016 7:25 am

Nable wrote:Processes shouldn't be able to access each other's data, so the kernel has to change the address space when it switches to a different process.

Of course, but this should not involve minor page faults when the process address space (i.e. the process context) is loaded again on the CPU core, I guess...

Code:

root@unl01:~# free -k
             total       used       free     shared    buffers     cached
Mem:     131762936   70593604   61169332      10996     177000    1868560
-/+ buffers/cache:   68548044   63214892
Swap:    133943292          0  133943292
root@unl01:~# vmstat -a
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 61169680 347324 68430712    0    0     0     0    4    7  3  1 96  0  0
root@unl01:~#

Here, as you can see, all 'used' memory is actually 'active' (AFAIU active memory basically means that it has been recently accessed and there exist valid page entries mapping it into process address space)

Btw, tracing kvm I've noticed that the QEmu process's minor page fault count increases when a kvm_page_fault event occurs, e.g.

Code:

 root@unl02:~# trace-cmd record -e kvm:kvm_exit -f 'exit_reason == 48' -e kvm:kvm_page_fault
<snip....>
<...>-42086 [002] 194509.194110: kvm_exit:             reason EPT_VIOLATION rip 0x8284597 info 181 0
<...>-42086 [002] 194509.194110: kvm_page_fault:       address a2896f28 error_code 181

Therefore, considering EPT is enabled, in which cases is the kvm_page_fault handler actually executed?
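As a reading aid for the trace above, the value 181 can be decoded bit by bit. This sketch assumes that the value is printed in hexadecimal and is the raw EPT-violation exit qualification. The bit meanings follow my reading of the exit-qualification table in the Intel SDM and should be checked against the manual; the function name is made up.

```python
# Sketch: decode an EPT-violation exit qualification.
# Assumption: the trace prints the raw exit qualification in hex.

EPT_QUAL_BITS = {
    0: "data read",
    1: "data write",
    2: "instruction fetch",
    3: "EPT entry readable",
    4: "EPT entry writable",
    5: "EPT entry executable",
    7: "guest linear address valid",
    8: "access to the translated address (not a guest page-table walk)",
}


def decode_ept_qualification(value):
    """Return the descriptions of all known bits set in value."""
    return [name for bit, name in EPT_QUAL_BITS.items() if value & (1 << bit)]


for line in decode_ept_qualification(0x181):
    print(line)
```

For 0x181 this reports a data read with a valid linear address and none of the three EPT permission bits set, which would be consistent with a read hitting an EPT entry that is not present yet, i.e. the lazy allocation case.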

Thanks

Nable · Post by **Nable** » Fri Dec 30, 2016 4:52 pm

cianfa72 wrote:Therefore, considering EPT is enabled, in which cases is the kvm_page_fault handler actually executed?

I've posted the link to the KVM source code above; that link takes you to the place where this function is executed. You can study the code further if you want to find deeper details. Oh, and did you think about my suggestion of using VirtIO devices instead of the default emulated RTL8139/Intel ones?

OSDev.org
