Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

papst
Posts: 5
Joined: Fri Oct 21, 2022 3:41 pm
Location: Berlin

Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by papst »

I'm writing a UEFI Type-1 hypervisor that can currently boot Windows 10/11 and Debian 12; however, when it comes to SLAT/EPT, things get interesting.

I decided to go with 2MiB pages instead of 4KiB pages in the EPT paging structure in order to save memory. However, some pages, like the first few kilobytes and megabytes of physical memory, have different memory types that I have to respect, so I split those large pages into 512 4KiB pages each in order to type the memory accurately.

This works like a charm on older CPUs. By older I mean CPUs before Alder Lake (12th Gen) that don't have E-cores. When doing the same on my 14700KF, which has 12 E-cores, I'm running into a very strange problem.

Code that executes on any E-core and translates through a 4KiB entry (a 2MiB page split into 512 4KiB pages) in the EPT paging structure runs very slowly. It is most noticeable when playing games.

The same code runs perfectly fast (300+ FPS in Counter-Strike 2) on an old CPU (8700K); however, on the 14700KF performance drops very hard (90 FPS to a maximum of 200 FPS).

It works OK when not splitting the pages and keeping 2MiB entries, but this again degrades performance because the memory is not accurately typed.

I started to investigate this issue and came to the conclusion that disabling all E-cores in the BIOS fixes the problem completely. Checking the specifications of both the 8700K and the 14700KF, they both use Intel Smart Cache Technology but differ in one point:
for the E-cores, the 2nd-level cache is shared between 4 physical cores.
This leads me to believe that the MMU encounters a lot of cache misses, having to re-translate the executing address because the other 3 cores fill up the cache and the pages are split into 4KiB entries in the paging structure.
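
One way to check the cache-sharing part of this is CPUID leaf 4 (Deterministic Cache Parameters), which reports how many logical processors share each cache level. A minimal sketch, assuming GCC/Clang's __cpuid_count and that the thread is pinned first to a P-core and then to an E-core to compare:

Code: Select all

#include <cpuid.h>
#include <stdio.h>

// enumerate the cache hierarchy of the core this runs on
static void dump_cache_sharing(void)
{
	for (unsigned int idx = 0; ; idx++)
	{
		unsigned int eax, ebx, ecx, edx;
		__cpuid_count(4, idx, eax, ebx, ecx, edx);

		// cache type 0 means no more cache levels
		unsigned int type = eax & 0x1f;

		if (type == 0)
		{
			break;
		}

		unsigned int level   = (eax >> 5) & 0x7;
		unsigned int sharing = ((eax >> 14) & 0xfff) + 1;	// logical CPUs sharing this cache
		unsigned int ways    = ((ebx >> 22) & 0x3ff) + 1;
		unsigned int parts   = ((ebx >> 12) & 0x3ff) + 1;
		unsigned int line    = (ebx & 0xfff) + 1;
		unsigned int sets    = ecx + 1;

		printf("L%u %s: %u KiB, shared by up to %u logical CPUs\n",
		       level,
		       type == 1 ? "data" : type == 2 ? "instruction" : "unified",
		       (ways * parts * line * sets) / 1024,
		       sharing);
	}
}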

I could also be completely wrong about this as it's very hard to debug (no vm-exits whatsoever), which is why I'm making this post.

Any help, pointers in the right direction, or explanations of why this could happen would be very much appreciated!
Octocontrabass
Member
Posts: 5418
Joined: Mon Mar 25, 2013 7:01 pm

Re: Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by Octocontrabass »

papst wrote: Thu Sep 05, 2024 1:12 pm However, some pages, like the first few kilobytes and megabytes of physical memory, have different memory types that I have to respect, so I split those large pages into 512 4KiB pages each in order to type the memory accurately.
How many pages are you splitting? More than one would be pretty unusual, unless the guest is doing something funny with the virtualized MTRRs.
papst wrote: Thu Sep 05, 2024 1:12 pm my 14700KF CPU
I hope you've updated your BIOS.
papst wrote: Thu Sep 05, 2024 1:12 pm This leads me to believe that the MMU encounters a lot of cache misses,
Have you checked the performance counters to see which cache it might be?
papst
Posts: 5
Joined: Fri Oct 21, 2022 3:41 pm
Location: Berlin

Re: Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by papst »

Octocontrabass wrote: Fri Sep 06, 2024 7:51 pm How many pages are you splitting? More than one would be pretty unusual, unless the guest is doing something funny with the virtualized MTRRs.
The first page is always split because the BIOS sets up different memory types there. After some time of debugging I tracked the page splits down.

The guest is able to kind of "inspect" memory, meaning that accesses to pages specified by the guest can be tracked and analyzed (very useful for malware analysis). Only then is the respective 2MiB page split into 512 4KiB pages (to provide fine-grained inspection), which then makes execution very slow.

To verify this I disabled all the other handling of such pages (restricting access, etc.) and just split one page in a process that executes that page very frequently, which triggered the performance drop on the 14700KF, but again not on the 8700K.
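
For context, the tracking relies on the usual EPT-violation path: access bits on the split 4KiB PTEs are restricted, and the resulting VM-exit (exit reason 48) reports what was accessed and where. A rough sketch of such a handler, assuming a vmx_vmread wrapper and hypothetical ept_log_access/ept_restore_access helpers:

Code: Select all

#include <stdint.h>

// hypothetical VMREAD wrapper and bookkeeping helpers
extern uint64_t vmx_vmread(uint64_t field);
extern void ept_log_access(uint64_t guest_pa, int read, int write, int execute);
extern void ept_restore_access(uint64_t guest_pa);

#define VMCS_EXIT_QUALIFICATION		0x6400ull
#define VMCS_GUEST_PHYSICAL_ADDRESS	0x2400ull

// called from the VM-exit dispatcher for exit reason 48 (EPT violation)
static void handle_ept_violation(void)
{
	uint64_t qualification = vmx_vmread(VMCS_EXIT_QUALIFICATION);
	uint64_t guest_pa = vmx_vmread(VMCS_GUEST_PHYSICAL_ADDRESS);

	int read = (qualification >> 0) & 1;	// data read
	int write = (qualification >> 1) & 1;	// data write
	int execute = (qualification >> 2) & 1;	// instruction fetch

	// record the access for the inspection feature...
	ept_log_access(guest_pa, read, write, execute);

	// ...then re-allow it in the 4KiB PTE so the guest can make progress
	ept_restore_access(guest_pa);
}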
Octocontrabass wrote: Fri Sep 06, 2024 7:51 pm I hope you've updated your BIOS.
I updated the BIOS some weeks ago when that update dropped for Gigabyte boards. To be really sure I updated it once again, but couldn't notice any difference.
Octocontrabass wrote: Fri Sep 06, 2024 7:51 pm Have you checked the performance counters to see which cache it might be?
Yes, I checked Microsoft's perfmon and could see a lot more cache faults when the pages are split, but I haven't gone into very detailed monitoring yet. Thank you for this link, I'll dig into it in the SDM!
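
For anything more specific than perfmon (e.g. counting completed page walks from inside the hypervisor), it comes down to programming IA32_PERFEVTSEL0/IA32_PMC0. A rough sketch, assuming hypothetical arch_rdmsr/arch_wrmsr wrappers; the event select and umask below are placeholders that have to be taken from the SDM's performance-monitoring event tables for the exact core type, since P-cores and E-cores use different encodings:

Code: Select all

#include <stdint.h>

// hypothetical MSR accessors provided by the hypervisor
extern uint64_t arch_rdmsr(uint32_t msr);
extern void arch_wrmsr(uint32_t msr, uint64_t value);

#define IA32_PMC0		0x0C1
#define IA32_PERFEVTSEL0	0x186
#define IA32_PERF_GLOBAL_CTRL	0x38F

// placeholders: take the "page walk completed" (or STLB miss) event code and
// umask for the specific core type from the SDM event tables
#define WALK_EVENT_SELECT	0x00
#define WALK_EVENT_UMASK	0x00

static void pmc0_start(void)
{
	uint64_t evtsel = (uint64_t)WALK_EVENT_SELECT
	                | ((uint64_t)WALK_EVENT_UMASK << 8)
	                | (1ull << 16)	// USR: count in ring 3
	                | (1ull << 17)	// OS: count in ring 0
	                | (1ull << 22);	// EN: enable the counter

	arch_wrmsr(IA32_PMC0, 0);
	arch_wrmsr(IA32_PERFEVTSEL0, evtsel);

	// enable general-purpose counter 0 globally
	arch_wrmsr(IA32_PERF_GLOBAL_CTRL, arch_rdmsr(IA32_PERF_GLOBAL_CTRL) | 1ull);
}

static uint64_t pmc0_read(void)
{
	return arch_rdmsr(IA32_PMC0);
}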
feryno
Member
Posts: 72
Joined: Thu Feb 09, 2012 6:53 am
Location: Czechoslovakia

Re: Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by feryno »

Just to be sure - could you please dump MSR IA32_VMX_EPT_VPID_CAP = 48CH on your 14th gen CPU and show us its content for P as well as E cores (whether they are the same or different, although I expect they are equal)? Also, for all VMX-related MSRs (480H... 493H): is there anything suspiciously different, or are the MSRs all identical among P and E cores?
I'm using the same design as you:
4 levels (4 KiB pages) for machine memory below 2 MiB
3 levels (2 MiB pages) for machine memory from 2 MiB to 512 GiB
2 levels (1 GiB pages) for machine memory above 512 GiB
Could you please dump and show us all entries of your 4th-level 4 KiB EPT page table (and maybe the first few entries of your 3rd-level EPT entries for 2 MiB paging) at the time you create your EPT, then later from the running OS before the performance penalty, and again after observing the performance drop?
Intel CPUs are very sensitive to correct caching attributes in the EPT page tables.
Has your UEFI vendor correctly synchronized the MTRRs and the control registers CR0 and CR4 among all P as well as E cores on your system, so that there is nothing different between P and E cores? At least compare the BSP, one AP P-core and one AP E-core.
Almost all hypervisors use a globally shared EPT where all cores/threads update the EPT entries concurrently; a few cores may attempt to change the same EPT entry at the same time, but once they set the accessed and dirty bits there is nothing to change anymore.
Your observation is very interesting and strange. If there is nothing wrong in your EPT, then it could also be a bug in the CPU.
You could also add a boot parameter so the OS does not use some memory range and then check the performance again, but from my experience MS Windows almost never uses the low memory for common tasks (it uses memory below 1 MiB for initializing all AP CPUs during the boot process and, later from the already running OS, for ACPI resume from sleep).
hypervisor-based solutions developer (Intel, AMD)
papst
Posts: 5
Joined: Fri Oct 21, 2022 3:41 pm
Location: Berlin

Re: Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by papst »

feryno wrote: Sat Sep 07, 2024 8:06 am Just to be sure - could you please dump MSR IA32_VMX_EPT_VPID_CAP = 48CH on your 14th gen CPU and show us its content for P as well as E cores (whether they are the same or different, although I expect they are equal)?
Yes, they are equal on each core; here is the dump for the first P-core and E-core:

Code: Select all

P-Core 0 IA32_VMX_EPT_VPID_CAP value: 0xf0106f34141
execute_only_pages: 1
page_walk_length_4: 1
memory_type_uncacheable: 1
memory_type_write_back: 1
pde_2mb_pages: 1
pdpte_1gb_pages: 1
invept: 1
ept_accessed_and_dirty_flags: 1
advanced_vmexit_ept_violations_information: 1
supervisor_shadow_stack: 1
invept_single_context: 1
invept_all_contexts: 1
invvpid: 1
invvpid_individual_address: 1
invvpid_single_context: 1
invvpid_all_contexts: 1
invvpid_single_context_retain_globals: 1
max_hlat_prefix_size: 0

E-Core 0 IA32_VMX_EPT_VPID_CAP value: 0xf0106f34141
execute_only_pages: 1
page_walk_length_4: 1
memory_type_uncacheable: 1
memory_type_write_back: 1
pde_2mb_pages: 1
pdpte_1gb_pages: 1
invept: 1
ept_accessed_and_dirty_flags: 1
advanced_vmexit_ept_violations_information: 1
supervisor_shadow_stack: 1
invept_single_context: 1
invept_all_contexts: 1
invvpid: 1
invvpid_individual_address: 1
invvpid_single_context: 1
invvpid_all_contexts: 1
invvpid_single_context_retain_globals: 1
max_hlat_prefix_size: 0
feryno wrote: Sat Sep 07, 2024 8:06 am Also, for all VMX-related MSRs (480H... 493H): is there anything suspiciously different, or are the MSRs all identical among P and E cores?
There is also no difference in these MSRs; again, a dump for the first P-core and E-core:

Code: Select all

P-Core 0 msr: 0x480 value: 0x3da050000000013
P-Core 0 msr: 0x481 value: 0xff00000016
P-Core 0 msr: 0x482 value: 0xfffbfffe0401e172
P-Core 0 msr: 0x483 value: 0xf77fffff00036dff
P-Core 0 msr: 0x484 value: 0x76ffff000011ff
P-Core 0 msr: 0x485 value: 0x7004c1e7
P-Core 0 msr: 0x486 value: 0x80000021
P-Core 0 msr: 0x487 value: 0xffffffff
P-Core 0 msr: 0x488 value: 0x2000
P-Core 0 msr: 0x489 value: 0x1ff2fff
P-Core 0 msr: 0x48A value: 0x2e
P-Core 0 msr: 0x48B value: 0x75d7fff00000000
P-Core 0 msr: 0x48C value: 0xf0106f34141
P-Core 0 msr: 0x48D value: 0xff00000016
P-Core 0 msr: 0x48E value: 0xfffbfffe04006172
P-Core 0 msr: 0x48F value: 0xf77fffff00036dfb
P-Core 0 msr: 0x490 value: 0x76ffff000011fb
P-Core 0 msr: 0x491 value: 0x1
P-Core 0 msr: 0x492 value: 0x1
P-Core 0 msr: 0x493 value: 0x8

E-Core 0 msr: 0x480 value: 0x3da050000000013
E-Core 0 msr: 0x481 value: 0xff00000016
E-Core 0 msr: 0x482 value: 0xfffbfffe0401e172
E-Core 0 msr: 0x483 value: 0xf77fffff00036dff
E-Core 0 msr: 0x484 value: 0x76ffff000011ff
E-Core 0 msr: 0x485 value: 0x7004c1e7
E-Core 0 msr: 0x486 value: 0x80000021
E-Core 0 msr: 0x487 value: 0xffffffff
E-Core 0 msr: 0x488 value: 0x2000
E-Core 0 msr: 0x489 value: 0x1ff2fff
E-Core 0 msr: 0x48A value: 0x2e
E-Core 0 msr: 0x48B value: 0x75d7fff00000000
E-Core 0 msr: 0x48C value: 0xf0106f34141
E-Core 0 msr: 0x48D value: 0xff00000016
E-Core 0 msr: 0x48E value: 0xfffbfffe04006172
E-Core 0 msr: 0x48F value: 0xf77fffff00036dfb
E-Core 0 msr: 0x490 value: 0x76ffff000011fb
E-Core 0 msr: 0x491 value: 0x1
E-Core 0 msr: 0x492 value: 0x1
E-Core 0 msr: 0x493 value: 0x8
feryno wrote: Sat Sep 07, 2024 8:06 am Could you please dump and show us all entries of your 4th-level 4 KiB EPT page table (and maybe the first few entries of your 3rd-level EPT entries for 2 MiB paging) at the time you create your EPT, then later from the running OS before the performance penalty, and again after observing the performance drop?
I'm currently just setting the memory type of the first large page to UC, so I don't have any 4KiB page tables as long as I don't allow the guest to split any large pages. That makes some features unavailable, but the performance does not drop.
When I now exclusively split one large page of frequently accessed process memory (only one 4th-level page table in the whole EPT structure, all others stay large pages), the performance drops. It seems that as soon as a 4th level is introduced and E-cores are enabled, the problems arise.

All of the executed memory is WB, with the only exceptions being MMIO regions and the first 2MiB, which I set to UC. So far this works fine; only when splitting into 4 KiB pages does it start to show performance drops.
I also checked the memory type of the 512 4KiB entries when the guest requests a page split, and all of them are WB. This is how I'm splitting a large page:

Code: Select all

static bool ept_smash_large_page(ept_t* ept, ept_pde_2mb* pde_2mb)
{
	if (!pde_2mb->large_page)
	{
		return false;
	}

	// allocate 512 4KiB entries: 512 * 8 = 0x1000 bytes
	ept_pte* pt = (ept_pte*)mm_alloc(0x1000);

	if (!pt)
	{
		return false;
	}

	// get the pfn of the newly allocated page table
	uint64_t pt_pfn = mm_find_pa(pt) >> 12;

	if (!pt_pfn)
	{
		mm_free(pt);

		return false;
	}

	for (size_t i = 0; i < 512; i++)
	{
		// take over all the settings from the 2 MiB page for all the 4 KiB pages 
		ept_pte* pte = &pt[i];
		pte->flags = 0;
		pte->read_access = pde_2mb->read_access;
		pte->write_access = pde_2mb->write_access;
		pte->execute_access = pde_2mb->execute_access;
		pte->memory_type = pde_2mb->memory_type;
		pte->ignore_pat = pde_2mb->ignore_pat;
		pte->accessed = pde_2mb->accessed;
		pte->dirty = pde_2mb->dirty;
		pte->user_mode_execute = pde_2mb->user_mode_execute;
		pte->verify_guest_paging = pde_2mb->verify_guest_paging;
		pte->paging_write_access = pde_2mb->paging_write_access;
		pte->supervisor_shadow_stack = pde_2mb->supervisor_shadow_stack;
		pte->suppress_ve = pde_2mb->suppress_ve;

		// offset into the 2 MiB page
		pte->page_frame_number = (pde_2mb->page_frame_number << 9) + i;
	}

	// save the large page info, in order to recover when removing the 4th level
	ept_save_large_page(pde_2mb);

	// reset the pde and insert the 4th level
	ept_pde* pde = (ept_pde*)pde_2mb;
	pde->flags = 0;
	pde->read_access = 1;
	pde->write_access = 1;
	pde->execute_access = 1;
	pde->user_mode_execute = 1;
	pde->page_frame_number = pt_pfn;

	return true;
}
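One thing worth noting: after the PDE is rewritten, cached EPT translations have to be invalidated; the SDM's guidance is to execute INVEPT after modifying EPT paging-structure entries. A minimal sketch of such a helper, assuming GCC-style inline assembly and a hypothetical ept_get_eptp() accessor:

Code: Select all

#include <stdint.h>

// INVEPT descriptor: the EPTP to invalidate followed by a reserved quadword
typedef struct
{
	uint64_t eptp;
	uint64_t reserved;
} invept_descriptor;

#define INVEPT_SINGLE_CONTEXT	1
#define INVEPT_ALL_CONTEXTS	2

static void invept(uint64_t type, uint64_t eptp)
{
	invept_descriptor desc;
	desc.eptp = eptp;
	desc.reserved = 0;

	__asm__ volatile("invept %[desc], %[type]"
	                 :
	                 : [desc] "m"(desc), [type] "r"(type)
	                 : "cc", "memory");
}

// example after a successful split, using a hypothetical accessor for the
// current core's EPT pointer:
//
//     invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept));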
And this is how I'm setting up the initial EPT paging structure for each core (currently identity mapping for simplicity):

Code: Select all

static void ept_setup_identity_map(ept_t* ept)
{
	ept_pml4e* pml4e = &ept->pml4;
	pml4e->flags = 0;
	pml4e->read_access = 1;
	pml4e->write_access = 1;
	pml4e->execute_access = 1;
	pml4e->accessed = 0;
	pml4e->user_mode_execute = 1;
	pml4e->page_frame_number = mm_find_pa(&ept->pdpt) >> 12;

	for (size_t i = 0; i < 512; i++)
	{
		ept_pdpte* pdpte = &ept->pdpt[i];
		pdpte->flags = 0;
		pdpte->read_access = 1;
		pdpte->write_access = 1;
		pdpte->execute_access = 1;
		pdpte->accessed = 0;
		pdpte->user_mode_execute = 1;
		pdpte->page_frame_number = mm_find_pa(&ept->pd[i]) >> 12;

		for (size_t j = 0; j < 512; j++)
		{
			ept_pde_2mb* pde = &ept->pd[i][j];
			pde->flags = 0;
			pde->read_access = 1;
			pde->write_access = 1;
			pde->execute_access = 1;
			pde->ignore_pat = 0;
			pde->large_page = 1;
			pde->accessed = 0;
			pde->dirty = 0;
			pde->user_mode_execute = 1;
			pde->suppress_ve = 0;
			pde->page_frame_number = (i << 9) + j;
			pde->memory_type = mtrr_find_memory_type(pde->page_frame_number << 21, 0x1000ull << 9);
		}
	}
}
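mtrr_find_memory_type() is not shown above; a simplified sketch of what such a helper typically does (variable-range MTRRs only, ignoring the fixed-range MTRRs that cover the first 1 MiB, and assuming a hypothetical arch_rdmsr() wrapper) looks like this:

Code: Select all

#include <stdint.h>

// hypothetical MSR read wrapper
extern uint64_t arch_rdmsr(uint32_t msr);

#define IA32_MTRRCAP		0x0FE
#define IA32_MTRR_PHYSBASE0	0x200
#define IA32_MTRR_PHYSMASK0	0x201
#define IA32_MTRR_DEF_TYPE	0x2FF

#define MEMORY_TYPE_UC		0
#define MEMORY_TYPE_WT		4
#define MEMORY_TYPE_INVALID	0xFF

// simplified: variable-range MTRRs only, no fixed-range or SMRR handling
static uint8_t mtrr_find_memory_type(uint64_t base, uint64_t size)
{
	uint64_t def_type = arch_rdmsr(IA32_MTRR_DEF_TYPE);

	// E bit clear: MTRRs disabled, everything is UC
	if (!(def_type & (1ull << 11)))
	{
		return MEMORY_TYPE_UC;
	}

	uint8_t type = MEMORY_TYPE_INVALID;
	uint8_t count = (uint8_t)(arch_rdmsr(IA32_MTRRCAP) & 0xFF);	// VCNT

	for (uint8_t i = 0; i < count; i++)
	{
		uint64_t physbase = arch_rdmsr(IA32_MTRR_PHYSBASE0 + i * 2);
		uint64_t physmask = arch_rdmsr(IA32_MTRR_PHYSMASK0 + i * 2);

		// V bit clear: this variable range is not in use
		if (!(physmask & (1ull << 11)))
		{
			continue;
		}

		uint64_t mtrr_base = physbase & ~0xFFFull;
		uint64_t mask = physmask & ~0xFFFull;
		uint64_t len = mask & (0ull - mask);	// lowest set mask bit = range length

		// skip ranges that don't overlap [base, base + size)
		if (base + size <= mtrr_base || mtrr_base + len <= base)
		{
			continue;
		}

		uint8_t range_type = (uint8_t)(physbase & 0xFF);

		// UC always wins, WT wins over WB; other conflicts are undefined
		if (range_type == MEMORY_TYPE_UC)
		{
			return MEMORY_TYPE_UC;
		}

		if (type == MEMORY_TYPE_INVALID || range_type == MEMORY_TYPE_WT)
		{
			type = range_type;
		}
	}

	// no variable range matched: fall back to the default type
	return type == MEMORY_TYPE_INVALID ? (uint8_t)(def_type & 0xFF) : type;
}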
feryno wrote: Sat Sep 07, 2024 8:06 am Almost all hypervisors use a globally shared EPT where all cores/threads update the EPT entries concurrently; a few cores may attempt to change the same EPT entry at the same time, but once they set the accessed and dirty bits there is nothing to change anymore.
I have an EPT paging structure for each core to provide support for memory analysis. A user can for example monitor accesses to specified memory regions for each core.
feryno wrote: Sat Sep 07, 2024 8:06 am Your observation is very interesting and strange. If there is nothing wrong in your EPT, then it could also be a bug in the CPU.
It's actually very hard to tell, since the implementation works perfectly without E-cores enabled. The only difference is that the L2 cache is shared across 4 E-cores.

Edit: two hours after writing this I fell back to what you said, that hypervisors typically have one EPT paging structure, gave it a shot, implemented and tested it, and it actually works. No performance drops whatsoever with a global EPT paging structure. I guess having one for each core is just not very cache-friendly when a cache is shared across cores.

However, some features won't work with this. I guess implementing an option where the user can choose between a global structure and a per-core structure fixes that too.
feryno
Member
Posts: 72
Joined: Thu Feb 09, 2012 6:53 am
Location: Czechoslovakia

Re: Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by feryno »

Great that you solved it! Maybe current CPUs have enough cache only for shared paging and not enough for completely private tables for every cpu/core/thread.
Perhaps you could share most of the EPT among all CPUs/cores/threads and add per-CPU/core/thread private EPT entries only for the memory used by the process being watched - I implemented that on AMD CPUs for watching the APIC, as that was the only working way for older models prior to the introduction of the AVIC feature. Most of the entries for physical memory virtualization (AMD = Nested Paging, Intel = EPT) are shared; the entry for the APIC is private per CPU/core/thread (which of course also requires a private pointer to the base of the tables and one private entry in every level of paging, while the rest of the entries are identical among all CPUs/cores/threads). Such a structure is constructed permanently at AMD hypervisor startup and never modified later. That is very easy to implement, but doing it dynamically for every newly created process would be a nightmare for me, as would the interaction with the OS needed to catch which physical memory every newly created process is going to use.
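
To illustrate that layout: only the table pages along the branch that maps the watched region are cloned, everything else reuses the shared tables. A rough sketch in the style of the EPT code above, assuming the mm_alloc/mm_find_pa helpers from before, that the hypervisor identity-maps host physical memory, and with error-path cleanup omitted:

Code: Select all

#include <stdint.h>
#include <string.h>

// helpers in the style of the earlier snippets; the hypervisor is assumed to
// identity-map host physical memory, so a table's PA can be used as a VA
extern void* mm_alloc(uint64_t size);
extern uint64_t mm_find_pa(void* va);

#define EPT_RWX			0x7ull				// read + write + execute
#define EPT_TABLE_ADDR(e)	((e) & 0x000FFFFFFFFFF000ull)	// next-level table PA

// clone one 4KiB table page and return a non-leaf entry pointing at the copy
static uint64_t ept_clone_table(uint64_t entry)
{
	uint64_t* copy = (uint64_t*)mm_alloc(0x1000);

	if (!copy)
	{
		return 0;
	}

	memcpy(copy, (void*)EPT_TABLE_ADDR(entry), 0x1000);

	return (mm_find_pa(copy) & ~0xFFFull) | EPT_RWX;
}

// build a per-core view identical to the shared EPT except along the branch
// mapping watched_gpa; splitting or restricting the watched 2MiB page in the
// private PD then only affects this core, while every other translation keeps
// using the shared tables
static uint64_t* ept_build_private_view(uint64_t* shared_pml4, uint64_t watched_gpa)
{
	uint64_t pml4_idx = (watched_gpa >> 39) & 0x1FF;
	uint64_t pdpt_idx = (watched_gpa >> 30) & 0x1FF;

	// private PML4: all entries identical to the shared one...
	uint64_t* pml4 = (uint64_t*)mm_alloc(0x1000);

	if (!pml4)
	{
		return NULL;
	}

	memcpy(pml4, shared_pml4, 0x1000);

	// ...except the one on the path to the watched region, which is
	// redirected to a private copy of the PDPT
	pml4[pml4_idx] = ept_clone_table(shared_pml4[pml4_idx]);

	if (!pml4[pml4_idx])
	{
		return NULL;
	}

	// same for the PD covering the watched 1GiB slot; its 2MiB entry for
	// watched_gpa is the one a split would rewrite on this core only
	uint64_t* pdpt = (uint64_t*)EPT_TABLE_ADDR(pml4[pml4_idx]);
	pdpt[pdpt_idx] = ept_clone_table(pdpt[pdpt_idx]);

	if (!pdpt[pdpt_idx])
	{
		return NULL;
	}

	return pml4;
}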
hypervisor-based solutions developer (Intel, AMD)
papst
Posts: 5
Joined: Fri Oct 21, 2022 3:41 pm
Location: Berlin

Re: Intel Hypervisor SLAT: Performance degradation when pages are split into 4KiB

Post by papst »

feryno wrote: Tue Sep 10, 2024 12:15 pm Great that you solved it! Maybe current CPUs have enough cache only for shared paging and not enough for completely private tables for every cpu/core/thread.
Perhaps you could share most of the EPT among all CPUs/cores/threads and add per-CPU/core/thread private EPT entries only for the memory used by the process being watched - I implemented that on AMD CPUs for watching the APIC, as that was the only working way for older models prior to the introduction of the AVIC feature. Most of the entries for physical memory virtualization (AMD = Nested Paging, Intel = EPT) are shared; the entry for the APIC is private per CPU/core/thread (which of course also requires a private pointer to the base of the tables and one private entry in every level of paging, while the rest of the entries are identical among all CPUs/cores/threads). Such a structure is constructed permanently at AMD hypervisor startup and never modified later. That is very easy to implement, but doing it dynamically for every newly created process would be a nightmare for me, as would the interaction with the OS needed to catch which physical memory every newly created process is going to use.
I ended up with a shared EPT structure and now only switch to per-core structures on guest request when needed, accepting the performance drop on E-cores in that case; otherwise it works great!
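
Switching a core between the shared structure and a private copy is then just a VMWRITE of the EPT pointer plus an INVEPT. A small sketch, assuming vmx_vmwrite/invept_single_context wrappers, a WB paging-structure memory type, a 4-level walk, and accessed/dirty flags enabled:

Code: Select all

#include <stdint.h>

// hypothetical wrappers around VMWRITE and single-context INVEPT
extern void vmx_vmwrite(uint64_t field, uint64_t value);
extern void invept_single_context(uint64_t eptp);

#define VMCS_CTRL_EPT_POINTER	0x201Aull

// EPTP = PML4 PA | accessed/dirty enable | (walk length - 1) | WB memory type
static uint64_t make_eptp(uint64_t pml4_pa)
{
	return (pml4_pa & ~0xFFFull)
	     | (1ull << 6)	// enable accessed/dirty flags (supported per the 48CH dump)
	     | (3ull << 3)	// page-walk length - 1 (4-level)
	     | (6ull << 0);	// EPT paging-structure memory type: WB
}

// point the current core at either the shared PML4 or its private copy
static void ept_switch(uint64_t pml4_pa)
{
	uint64_t eptp = make_eptp(pml4_pa);

	vmx_vmwrite(VMCS_CTRL_EPT_POINTER, eptp);
	invept_single_context(eptp);
}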

And thank you very much for your help, I wouldn't have fixed it without it!