feryno wrote: ↑Sat Sep 07, 2024 8:06 am
Only just to be sure - could you please dump MSR IA32_VMX_EPT_VPID_CAP = 48CH on your 14th gen CPU and show us their content - for P as well E cores (whether they are the same or different, but I expect they are equal).
Yes, they are equal on every core. Here is the dump for the first P-core and the first E-core:
Code:
P-Core 0 IA32_VMX_EPT_VPID_CAP value: 0xf0106f34141
execute_only_pages: 1
page_walk_length_4: 1
memory_type_uncacheable: 1
memory_type_write_back: 1
pde_2mb_pages: 1
pdpte_1gb_pages: 1
invept: 1
ept_accessed_and_dirty_flags: 1
advanced_vmexit_ept_violations_information: 1
supervisor_shadow_stack: 1
invept_single_context: 1
invept_all_contexts: 1
invvpid: 1
invvpid_individual_address: 1
invvpid_single_context: 1
invvpid_all_contexts: 1
invvpid_single_context_retain_globals: 1
max_hlat_prefix_size: 0
E-Core 0 IA32_VMX_EPT_VPID_CAP value: 0xf0106f34141
execute_only_pages: 1
page_walk_length_4: 1
memory_type_uncacheable: 1
memory_type_write_back: 1
pde_2mb_pages: 1
pdpte_1gb_pages: 1
invept: 1
ept_accessed_and_dirty_flags: 1
advanced_vmexit_ept_violations_information: 1
supervisor_shadow_stack: 1
invept_single_context: 1
invept_all_contexts: 1
invvpid: 1
invvpid_individual_address: 1
invvpid_single_context: 1
invvpid_all_contexts: 1
invvpid_single_context_retain_globals: 1
max_hlat_prefix_size: 0
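For reference, this is roughly how those bits are decoded (a minimal sketch matching the field names above; the bit positions follow the SDM layout of IA32_VMX_EPT_VPID_CAP, and reading the MSR assumes a kernel-mode context):
Code:
#include <stdint.h>

// bit layout of IA32_VMX_EPT_VPID_CAP (MSR 48CH), matching the dump above
typedef union
{
    uint64_t flags;
    struct
    {
        uint64_t execute_only_pages : 1;                         // bit 0
        uint64_t reserved_1 : 5;
        uint64_t page_walk_length_4 : 1;                         // bit 6
        uint64_t reserved_2 : 1;
        uint64_t memory_type_uncacheable : 1;                    // bit 8
        uint64_t reserved_3 : 5;
        uint64_t memory_type_write_back : 1;                     // bit 14
        uint64_t reserved_4 : 1;
        uint64_t pde_2mb_pages : 1;                              // bit 16
        uint64_t pdpte_1gb_pages : 1;                            // bit 17
        uint64_t reserved_5 : 2;
        uint64_t invept : 1;                                     // bit 20
        uint64_t ept_accessed_and_dirty_flags : 1;               // bit 21
        uint64_t advanced_vmexit_ept_violations_information : 1; // bit 22
        uint64_t supervisor_shadow_stack : 1;                    // bit 23
        uint64_t reserved_6 : 1;
        uint64_t invept_single_context : 1;                      // bit 25
        uint64_t invept_all_contexts : 1;                        // bit 26
        uint64_t reserved_7 : 5;
        uint64_t invvpid : 1;                                    // bit 32
        uint64_t reserved_8 : 7;
        uint64_t invvpid_individual_address : 1;                 // bit 40
        uint64_t invvpid_single_context : 1;                     // bit 41
        uint64_t invvpid_all_contexts : 1;                       // bit 42
        uint64_t invvpid_single_context_retain_globals : 1;      // bit 43
        uint64_t reserved_9 : 4;
        uint64_t max_hlat_prefix_size : 6;                       // bits 48-53
        uint64_t reserved_10 : 10;
    };
} ia32_vmx_ept_vpid_cap_t;
// usage (kernel mode): ia32_vmx_ept_vpid_cap_t cap; cap.flags = __readmsr(0x48C);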
feryno wrote: ↑Sat Sep 07, 2024 8:06 am
Also for all VMX related MSRs (480H... 493H) whether there is something suspiciously different or whether MSRs are all identical among P as well E cores?
There is also no difference in these MSRs. Again, a dump for the first P-core and the first E-core:
Code:
P-Core 0 msr: 0x480 value: 0x3da050000000013
P-Core 0 msr: 0x481 value: 0xff00000016
P-Core 0 msr: 0x482 value: 0xfffbfffe0401e172
P-Core 0 msr: 0x483 value: 0xf77fffff00036dff
P-Core 0 msr: 0x484 value: 0x76ffff000011ff
P-Core 0 msr: 0x485 value: 0x7004c1e7
P-Core 0 msr: 0x486 value: 0x80000021
P-Core 0 msr: 0x487 value: 0xffffffff
P-Core 0 msr: 0x488 value: 0x2000
P-Core 0 msr: 0x489 value: 0x1ff2fff
P-Core 0 msr: 0x48A value: 0x2e
P-Core 0 msr: 0x48B value: 0x75d7fff00000000
P-Core 0 msr: 0x48C value: 0xf0106f34141
P-Core 0 msr: 0x48D value: 0xff00000016
P-Core 0 msr: 0x48E value: 0xfffbfffe04006172
P-Core 0 msr: 0x48F value: 0xf77fffff00036dfb
P-Core 0 msr: 0x490 value: 0x76ffff000011fb
P-Core 0 msr: 0x491 value: 0x1
P-Core 0 msr: 0x492 value: 0x1
P-Core 0 msr: 0x493 value: 0x8
E-Core 0 msr: 0x480 value: 0x3da050000000013
E-Core 0 msr: 0x481 value: 0xff00000016
E-Core 0 msr: 0x482 value: 0xfffbfffe0401e172
E-Core 0 msr: 0x483 value: 0xf77fffff00036dff
E-Core 0 msr: 0x484 value: 0x76ffff000011ff
E-Core 0 msr: 0x485 value: 0x7004c1e7
E-Core 0 msr: 0x486 value: 0x80000021
E-Core 0 msr: 0x487 value: 0xffffffff
E-Core 0 msr: 0x488 value: 0x2000
E-Core 0 msr: 0x489 value: 0x1ff2fff
E-Core 0 msr: 0x48A value: 0x2e
E-Core 0 msr: 0x48B value: 0x75d7fff00000000
E-Core 0 msr: 0x48C value: 0xf0106f34141
E-Core 0 msr: 0x48D value: 0xff00000016
E-Core 0 msr: 0x48E value: 0xfffbfffe04006172
E-Core 0 msr: 0x48F value: 0xf77fffff00036dfb
E-Core 0 msr: 0x490 value: 0x76ffff000011fb
E-Core 0 msr: 0x491 value: 0x1
E-Core 0 msr: 0x492 value: 0x1
E-Core 0 msr: 0x493 value: 0x8
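The dump itself is just a per-core loop over the VMX capability MSR range; a minimal sketch, with run_on_core() as a hypothetical affinity helper and printf standing in for whatever logging is available:
Code:
#include <stdint.h>
#include <stdio.h>
#include <intrin.h>

// hypothetical helper: pins the current thread to the given core
extern void run_on_core(uint32_t core);

static void dump_vmx_msrs(uint32_t core)
{
    run_on_core(core);
    // VMX capability MSRs 480H..493H, read on the pinned core
    for (uint32_t msr = 0x480; msr <= 0x493; msr++)
    {
        printf("Core %u msr: 0x%X value: 0x%llX\n", core, msr, __readmsr(msr));
    }
}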
feryno wrote: ↑Sat Sep 07, 2024 8:06 am
Could you please dump and show us all entries of your 4th level 4 KiB EPT paging (and maybe first few entries of your 3rd level of EPT for 2 MiB paging) at the time of creating your EPT, then later from running OS before performance penalty and after observing performance drop?
I'm currently just setting the memory type of the first large page to UC, so as long as I don't allow the guest to split any large pages there are no 4 KiB page tables at all. That leaves some features unavailable, but the performance does not drop.
As soon as I split a single large page of frequently accessed process memory (so there is only one 4th-level page table in the whole EPT structure, everything else still mapped as large pages), the performance drops. It seems that as soon as a 4th level is introduced while E-cores are enabled, the problems arise.
All of the executed memory is WB; the only exceptions are the MMIO regions and the first 2 MiB, which I set to UC. That works fine so far; only when splitting to 4 KiB pages do the performance drops appear.
I also checked the memory type of the 512 4 KiB entries when the guest requests a page split, and all of them are WB. This is how I'm splitting a large page:
Code:
static bool ept_smash_large_page(ept_t* ept, ept_pde_2mb* pde_2mb)
{
    if (!pde_2mb->large_page)
    {
        return false;
    }
    // allocate 512 4 KiB entries: 512 * 8 = 0x1000 bytes
    ept_pte* pt = (ept_pte*)mm_alloc(0x1000);
    if (!pt)
    {
        return false;
    }
    // get the pfn of the newly allocated page table
    uint64_t pt_pfn = mm_find_pa(pt) >> 12;
    if (!pt_pfn)
    {
        mm_free(pt);
        return false;
    }
    for (size_t i = 0; i < 512; i++)
    {
        // take over all the settings from the 2 MiB page for all the 4 KiB pages
        ept_pte* pte = &pt[i];
        pte->flags = 0;
        pte->read_access = pde_2mb->read_access;
        pte->write_access = pde_2mb->write_access;
        pte->execute_access = pde_2mb->execute_access;
        pte->memory_type = pde_2mb->memory_type;
        pte->ignore_pat = pde_2mb->ignore_pat;
        pte->accessed = pde_2mb->accessed;
        pte->dirty = pde_2mb->dirty;
        pte->user_mode_execute = pde_2mb->user_mode_execute;
        pte->verify_guest_paging = pde_2mb->verify_guest_paging;
        pte->paging_write_access = pde_2mb->paging_write_access;
        pte->supervisor_shadow_stack = pde_2mb->supervisor_shadow_stack;
        pte->suppress_ve = pde_2mb->suppress_ve;
        // offset into the 2 MiB page
        pte->page_frame_number = (pde_2mb->page_frame_number << 9) + i;
    }
    // save the large page info, in order to recover when removing the 4th level
    ept_save_large_page(pde_2mb);
    // reset the pde and insert the 4th level
    ept_pde* pde = (ept_pde*)pde_2mb;
    pde->flags = 0;
    pde->read_access = 1;
    pde->write_access = 1;
    pde->execute_access = 1;
    pde->user_mode_execute = 1;
    pde->page_frame_number = pt_pfn;
    return true;
}
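For completeness, a change like this to an active EPT entry also needs an INVEPT afterwards, since the old 2 MiB translation can still sit in the paging-structure caches. A minimal sketch of a single-context invalidation (supported per the invept_single_context bit in the dump above; the descriptor layout is from the SDM, the GCC-style asm wrapper is just an illustration, MSVC needs a separate asm stub):
Code:
#include <stdint.h>

// 128-bit INVEPT descriptor from the SDM: the EPTP to invalidate plus a reserved qword
typedef struct
{
    uint64_t eptp;
    uint64_t reserved;
} invept_descriptor_t;

static void invept_single_context(uint64_t eptp)
{
    invept_descriptor_t descriptor = { eptp, 0 };
    uint64_t type = 1; // 1 = single-context, 2 = all-context
    __asm__ volatile("invept %0, %1" : : "m"(descriptor), "r"(type) : "cc", "memory");
}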
And this is how I'm setting up the initial EPT paging structure for each core (currently identity mapping for simplicity):
Code:
static void ept_setup_identity_map(ept_t* ept)
{
    ept_pml4e* pml4e = &ept->pml4;
    pml4e->flags = 0;
    pml4e->read_access = 1;
    pml4e->write_access = 1;
    pml4e->execute_access = 1;
    pml4e->accessed = 0;
    pml4e->user_mode_execute = 1;
    pml4e->page_frame_number = mm_find_pa(&ept->pdpt) >> 12;
    for (size_t i = 0; i < 512; i++)
    {
        ept_pdpte* pdpte = &ept->pdpt[i];
        pdpte->flags = 0;
        pdpte->read_access = 1;
        pdpte->write_access = 1;
        pdpte->execute_access = 1;
        pdpte->accessed = 0;
        pdpte->user_mode_execute = 1;
        pdpte->page_frame_number = mm_find_pa(&ept->pd[i]) >> 12;
        for (size_t j = 0; j < 512; j++)
        {
            ept_pde_2mb* pde = &ept->pd[i][j];
            pde->flags = 0;
            pde->read_access = 1;
            pde->write_access = 1;
            pde->execute_access = 1;
            pde->ignore_pat = 0;
            pde->large_page = 1;
            pde->accessed = 0;
            pde->dirty = 0;
            pde->user_mode_execute = 1;
            pde->suppress_ve = 0;
            pde->page_frame_number = (i << 9) + j;
            // look up the MTRR memory type for the whole 2 MiB range (0x1000 << 9 = 2 MiB)
            pde->memory_type = mtrr_find_memory_type(pde->page_frame_number << 21, 0x1000ull << 9);
        }
    }
}
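For completeness, mtrr_find_memory_type() over a 2 MiB range has to consider every variable-range MTRR that overlaps the range and fall back to the default type otherwise; when the MTRRs disagree within the range, degrading to UC is the safe choice. A simplified sketch (variable-range MTRRs only, ignoring fixed-range MTRRs and the enable bits in IA32_MTRR_DEF_TYPE; arch_rdmsr() is a hypothetical wrapper, and in practice the MTRRs would be read once and cached):
Code:
#include <stdint.h>

#define IA32_MTRRCAP        0xFE
#define IA32_MTRR_DEF_TYPE  0x2FF
#define IA32_MTRR_PHYSBASE0 0x200

#define MEMORY_TYPE_UC      0

extern uint64_t arch_rdmsr(uint32_t msr); // hypothetical rdmsr wrapper

// memory type of a single 4 KiB page; variable-range MTRRs only
static uint8_t mtrr_page_type(uint64_t pa)
{
    uint8_t type = 0xFF; // sentinel: no MTRR matched yet
    uint32_t count = (uint32_t)(arch_rdmsr(IA32_MTRRCAP) & 0xFF); // VCNT
    for (uint32_t i = 0; i < count; i++)
    {
        uint64_t base = arch_rdmsr(IA32_MTRR_PHYSBASE0 + i * 2);
        uint64_t mask = arch_rdmsr(IA32_MTRR_PHYSBASE0 + i * 2 + 1);
        if (!(mask & (1ull << 11)))
        {
            continue; // this MTRR pair is not valid
        }
        uint64_t phys_mask = mask & ~0xFFFull;
        if ((pa & phys_mask) == (base & phys_mask))
        {
            uint8_t t = (uint8_t)(base & 0xFF);
            if (t == MEMORY_TYPE_UC)
            {
                return MEMORY_TYPE_UC; // UC wins over any overlap
            }
            type = t; // simplification: ignores the WT-over-WB overlap rule
        }
    }
    // fall back to the default type when nothing matched
    return (type != 0xFF) ? type : (uint8_t)(arch_rdmsr(IA32_MTRR_DEF_TYPE) & 0xFF);
}

static uint8_t mtrr_find_memory_type(uint64_t pa, uint64_t size)
{
    // a large page needs one consistent type for the whole range:
    // sample every 4 KiB page and degrade to UC on any mismatch
    uint8_t type = mtrr_page_type(pa);
    for (uint64_t offset = 0x1000; offset < size; offset += 0x1000)
    {
        if (mtrr_page_type(pa + offset) != type)
        {
            return MEMORY_TYPE_UC;
        }
    }
    return type;
}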
feryno wrote: ↑Sat Sep 07, 2024 8:06 am
Almost all hypervisors use globally shared EPT where all cores/threads update the EPT entries concurrently, it may happen that few cores may attempt to change the same EPT entry at the same time but once they set accessed and dirty bits there is nothing to change anymore.
I have a separate EPT paging structure for each core to support memory analysis: a user can, for example, monitor accesses to specified memory regions on a per-core basis.
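The per-core structures are what make that possible: revoking access to a region in one core's tables means only that core takes the EPT violations. A rough sketch of the idea, reusing the types from the code above (ept_find_pde_2mb() is a hypothetical lookup helper):
Code:
// revoke read/write access on one 2 MiB region in a single core's EPT, so
// only that core faults into the EPT-violation handler when touching it
static bool ept_monitor_region(ept_t* core_ept, uint64_t guest_pa)
{
    ept_pde_2mb* pde = ept_find_pde_2mb(core_ept, guest_pa); // hypothetical
    if (!pde || !pde->large_page)
    {
        return false;
    }
    pde->read_access = 0;
    pde->write_access = 0;
    // a common approach in the violation handler: log the access, restore
    // the permissions, single-step with MTF, then revoke them again
    return true;
}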
feryno wrote: ↑Sat Sep 07, 2024 8:06 am
Your observation is very interesting and strange. If there is nothing wrong in your EPT then it could also be a bug in CPU.
It's actually very hard to tell, since the implementation works perfectly with E-cores disabled. The only obvious architectural difference is that the L2 cache is shared across a cluster of 4 E-cores.
Edit: two hours after writing this I came back to what you said, that hypervisors typically use one EPT paging structure, gave it a shot, implemented and tested it, and it actually works: no performance drops whatsoever with a global EPT paging structure. I guess having one structure per core is just not very cache-friendly when a cache is shared across cores.
However, some features won't work with a global structure. Implementing it so that the user can choose between a global structure and a per-core structure should cover both cases.
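The switch itself is cheap, since it only changes which root the EPT pointer in each core's VMCS references. A sketch of the EPTP construction, assuming a vmx_vmwrite() wrapper (the EPT-pointer VMCS field encoding is 201AH and the EPTP format bits are from the SDM):
Code:
#include <stdint.h>
#include <stdbool.h>

#define VMCS_CTRL_EPT_POINTER 0x201A

extern void vmx_vmwrite(uint64_t field, uint64_t value); // hypothetical wrapper

static void ept_load(uint64_t pml4_pa, bool enable_accessed_dirty)
{
    uint64_t eptp = pml4_pa;  // bits 12+: physical address of the PML4
    eptp |= 6;                // bits 2:0: paging-structure memory type (6 = WB)
    eptp |= 3ull << 3;        // bits 5:3: page-walk length minus 1 (4 levels)
    if (enable_accessed_dirty)
    {
        eptp |= 1ull << 6;    // bit 6: enable EPT accessed/dirty flags
    }
    vmx_vmwrite(VMCS_CTRL_EPT_POINTER, eptp);
}
// per-core mode: each core loads the PA of its own PML4;
// global mode: every core loads the PA of the shared PML4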