elfenix wrote:
I was reading a blog entry on huge page performance:
https://easyperf.net/blog/2022/09/01/Ut ... s-For-Code

And I'm pondering whether it'd make sense to ditch 4k pages in favor of allocating larger chunks. If you're going the route of 'f- the MMU', then it's a no-brainer to just identity map with the largest page size. It seems the 'big cost' of larger pages comes down to more time spent initializing memory before it's mapped, and then more fragmentation...
My personal opinion is 'f- the page table structure': export the interface of the MMU to the rest of the kernel purely as an API.
For example, the Mach/BSD pmap interface provides API-based mapping, with functions that say "Map this VA to this PA".
On the other hand, Linux exposes the MMU as an abstract multi-level page table, whether the underlying hardware uses a hardware-walked page table or not. Mapping a page then becomes an update to the page table structure, followed by informing the MMU of the change.
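For a concrete feel of the pmap-style approach, here's a minimal sketch of the kind of interface I mean. The names and types (mmu_map_page, asid_t, etc.) are hypothetical, not Mach's actual pmap signatures:
Code:
#include <stdint.h>

typedef uintptr_t vaddr_t;             /* virtual address             */
typedef uintptr_t paddr_t;             /* physical address            */
typedef int       asid_t;              /* address space / process id  */

/* "Map this VA to this PA" in address space 'as'.  The rest of the
 * kernel never sees a page table entry; the MMU code is free to store
 * the mapping however the hardware likes. */
void mmu_map_page(asid_t as, vaddr_t va, paddr_t pa, unsigned prot);

/* Remove a mapping previously established with mmu_map_page(). */
void mmu_unmap_page(asid_t as, vaddr_t va);

/* Change the protection bits on an existing mapping. */
void mmu_protect(asid_t as, vaddr_t va, unsigned prot);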
The big benefits of the former are:
- You're not tied to a particular data structure for mapping. x86 can translate those API calls into its 2-, 3-, 4- or 5-level page table structure, while hiding how many levels are involved from the rest of the kernel.
- If the abstract page table structure doesn't map cleanly to your MMU hardware, then you have extra overhead doing that mapping yourself anyway in the platform specific code. For example, any platform with an inverted page table will have to copy entries from the abstract page table structure to the inverted page table on demand.
- Because the page table details are hidden from the rest of the kernel, including the VMM, page tables become completely transient and can be built on demand. You can have some fixed, small number of page tables that processes take turns using. A sleeping process, for example, has zero need for in-memory page tables.
Another benefit of an API-based MMU interface is that it can be easily extended, which would be of benefit here. Say your API handles a single page mapping per call; you can extend it to take an address range to map. In the normal per-page case, your range will be your 4K page size. But in the case of something like a framebuffer, you can specify an address range that encompasses the entire framebuffer in a single call. Then, depending on alignment, the backend of the API can transparently map that with large pages, with no intervention from the caller.
For example, say you have a 64MB framebuffer at physical address PA; then a single call suffices:
Code:
pageno pfb = PA >> log2pagesize;       /* physical frame number of the framebuffer */
void  *vfb = va_allocate(64 << 20);    /* reserve 64MB of virtual address space    */
mmu_map(vfb, pfb, 64 << 20, MAP_RW);   /* map the whole range in one call          */
With x86 huge pages, the above could be mapped using 16 4MB mappings. Or, with PAE, 32 2MB mappings.
On ARM, the code can satisfy this with 4 * 16MB short descriptors.
MIPS can use 64KB TLB entries.
But all of them are completely abstracted away from the actual page size by the API call, so we get the best of both worlds, big and small pages, from a single API.
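To make the "backend picks the page size" point concrete, here's a rough sketch of how the platform-specific side of mmu_map() might greedily choose the largest mapping that alignment and remaining length allow. The page-size table is illustrative (x86-64-ish), and install_entry() is a hypothetical per-entry helper, not a real function:
Code:
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t pageno;              /* matches the earlier snippet        */
#define log2pagesize 12                /* assuming a 4KB base page           */

/* Hypothetical helper: write whatever entry (PTE, section descriptor,
 * TLB slot, ...) the hardware needs for one mapping of 'sz' bytes. */
extern void install_entry(uintptr_t va, uintptr_t pa, size_t sz, int prot);

/* Illustrative page sizes, largest first: 1GB, 2MB, 4KB. */
static const size_t page_sizes[] = { 1UL << 30, 1UL << 21, 1UL << 12 };
#define NSIZES (sizeof page_sizes / sizeof page_sizes[0])

void mmu_map(void *va, pageno pfb, size_t len, int prot)
{
    uintptr_t v = (uintptr_t)va;
    uintptr_t p = (uintptr_t)pfb << log2pagesize;

    /* Assumes va, the physical address and len are all multiples of
     * the base page size. */
    while (len > 0) {
        size_t sz = page_sizes[NSIZES - 1];      /* fall back to the base page */
        for (size_t i = 0; i < NSIZES; i++) {
            if (((v | p) & (page_sizes[i] - 1)) == 0 && len >= page_sizes[i]) {
                sz = page_sizes[i];              /* largest size that fits     */
                break;
            }
        }
        install_entry(v, p, sz, prot);
        v += sz;
        p += sz;
        len -= sz;
    }
}
The caller never knows or cares whether the 64MB range ended up as 2MB pages, 16MB supersection-style entries, or something else; that decision lives entirely in this backend.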
elfenix wrote:
That said, there's just _SO_ much overhead with 4k pages when you start thinking about managing a data structure per page - roughly 1% of memory in Linux is gobbled up by page management structures. That's before we get into the cost of the rest of the page table and friends. Most apps are also SIGNIFICANTLY more memory hungry today, and a larger virtual address space means our malloc routines can better bin allocations for less fragmentation. Yay.
Is anyone experimenting with just saying 'forget the 4k page'? I'm sketching out ideas for 'my next kernel' and I think this is probably #1 on my list right now.
As I said above, once hidden behind the API and filled in on demand, page tables become transient and can be forgotten or reused at will. They then occupy space roughly proportional to the working set of the subset of processes that are actually running at any one time.
It scales up by reserving a large number of page tables to be shared.
It scales down by forcing all processes to share a small number of page tables (even one).
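A minimal sketch of what that shared, transient page-table pool could look like (the names, the pool size, and the helpers find_free_or_lru() and pagetable_clear() are all hypothetical, not any existing kernel's code):
Code:
#include <stddef.h>

#define NPAGETABLES 8                  /* fixed pool size; could even be 1      */

struct process;                        /* whatever your kernel's process type is */

struct pagetable {
    struct process *owner;             /* NULL if free                           */
    /* hardware-specific root pointer and backing pages live here */
};

static struct pagetable pool[NPAGETABLES];

/* Hypothetical helpers. */
extern struct pagetable *find_free_or_lru(struct pagetable *pool, size_t n);
extern void pagetable_clear(struct pagetable *pt);

/* Called on context switch: give process 'p' a hardware page table to
 * run with.  If none is free, recycle the least recently used one; its
 * previous owner simply rebuilds its entries on demand, through the
 * mmu_map() API, the next time it runs and faults. */
struct pagetable *pagetable_acquire(struct process *p)
{
    struct pagetable *pt = find_free_or_lru(pool, NPAGETABLES);
    if (pt->owner != p) {
        pagetable_clear(pt);           /* drop the previous owner's mappings */
        pt->owner = p;                 /* new entries get filled in lazily   */
    }
    return pt;
}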