elfenix wrote:
I was reading a blog entry on huge page performance:
https://easyperf.net/blog/2022/09/01/Ut ... s-For-Code

And I'm pondering whether it'd make sense to ditch 4k pages in favor of allocating larger chunks. If you're going the route of 'f- the MMU', then it's a no-brainer to just identity map with the largest page size. It seems the 'big cost' of larger pages comes down to more time spent initializing memory before it's mapped, and then more fragmentation...
My personal opinion is 'f- the page table structure': export the interface of the MMU to the rest of the kernel purely as an API.
For example, the Mach/BSD pmap interface provides API-based mapping, with functions that say "Map this VA to this PA".
On the other hand, Linux exposes the MMU as an abstract multi-level page table, whether the underlying hardware uses a hardware-walked page table or not. Mapping a page then becomes an update to the page table structure, followed by informing the MMU of the change.
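For a concrete feel of the pmap-style approach, here's a minimal sketch of the kind of interface I mean. The names and types (mmu_map_page, asid_t, etc.) are hypothetical, not Mach's actual pmap signatures:
Code:
#include <stdint.h>

typedef uintptr_t vaddr_t;             /* virtual address             */
typedef uintptr_t paddr_t;             /* physical address            */
typedef int       asid_t;              /* address space / process id  */

/* "Map this VA to this PA" in address space 'as'.  The rest of the
 * kernel never sees a page table entry; the MMU code is free to store
 * the mapping however the hardware likes. */
void mmu_map_page(asid_t as, vaddr_t va, paddr_t pa, unsigned prot);

/* Remove a mapping previously established with mmu_map_page(). */
void mmu_unmap_page(asid_t as, vaddr_t va);

/* Change the protection bits on an existing mapping. */
void mmu_protect(asid_t as, vaddr_t va, unsigned prot);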
The big benefits of the former are:
- You're not tied to a particular data structure for mapping. x86 can translate those API calls into its 2-, 3-, 4- or 5-level page table structure, while hiding how many levels are involved from the rest of the kernel.
- If the abstract page table structure doesn't map cleanly to your MMU hardware, then you have extra overhead doing that mapping yourself anyway in the platform specific code. For example, any platform with an inverted page table will have to copy entries from the abstract page table structure to the inverted page table on demand.
- Because the page table details are hidden from the rest of the kernel, including the VMM, page tables become completely transient and can be built on demand. You can have some fixed, small number of page tables that processes take turns using. A sleeping process, for example, has zero need for in-memory page tables.
Another benefit of an API-based MMU interface is that it can be easily extended, which would be of benefit here. Say your API handles a single page mapping per call; you can extend it to take an address range to map. In the normal per-page case, your range will be your 4K page size. But in the case of something like a framebuffer, you can specify an address range that encompasses the entire framebuffer in a single call. Then, depending on alignment, the backend of the API can transparently map that with large pages, with no intervention from the caller.
For example, say you have a 64MB framebuffer at physical address PA; then a single call suffices:
Code:
pageno pfb = PA >> log2pagesize;       /* physical frame number of the framebuffer */
void  *vfb = va_allocate(64 << 20);    /* reserve 64MB of virtual address space    */
mmu_map(vfb, pfb, 64 << 20, MAP_RW);   /* map the whole range in one call          */
With x86 huge pages, the above could be mapped using 16 4MB mappings. Or, with PAE, 32 2MB mappings.
On ARM, the code can satisfy this with 4 * 16MB short descriptors.
MIPS can use 64KB TLB entries.
But all of them are completely abstracted away from the actual page size by the API call, so we get the best of both worlds, big and small pages, from a single API.
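To make the "backend picks the page size" point concrete, here's a rough sketch of how the platform-specific side of mmu_map() might greedily choose the largest mapping that alignment and remaining length allow. The page-size table is illustrative (x86-64-ish), and install_entry() is a hypothetical per-entry helper, not a real function:
Code:
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t pageno;              /* matches the earlier snippet        */
#define log2pagesize 12                /* assuming a 4KB base page           */

/* Hypothetical helper: write whatever entry (PTE, section descriptor,
 * TLB slot, ...) the hardware needs for one mapping of 'sz' bytes. */
extern void install_entry(uintptr_t va, uintptr_t pa, size_t sz, int prot);

/* Illustrative page sizes, largest first: 1GB, 2MB, 4KB. */
static const size_t page_sizes[] = { 1UL << 30, 1UL << 21, 1UL << 12 };
#define NSIZES (sizeof page_sizes / sizeof page_sizes[0])

void mmu_map(void *va, pageno pfb, size_t len, int prot)
{
    uintptr_t v = (uintptr_t)va;
    uintptr_t p = (uintptr_t)pfb << log2pagesize;

    /* Assumes va, the physical address and len are all multiples of
     * the base page size. */
    while (len > 0) {
        size_t sz = page_sizes[NSIZES - 1];      /* fall back to the base page */
        for (size_t i = 0; i < NSIZES; i++) {
            if (((v | p) & (page_sizes[i] - 1)) == 0 && len >= page_sizes[i]) {
                sz = page_sizes[i];              /* largest size that fits     */
                break;
            }
        }
        install_entry(v, p, sz, prot);
        v += sz;
        p += sz;
        len -= sz;
    }
}
The caller never knows or cares whether the 64MB range ended up as 2MB pages, 16MB supersection-style entries, or something else; that decision lives entirely in this backend.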
elfenix wrote:
That said, there's just _SO_ much overhead with 4k pages when you start thinking about managing a data structure per page - roughly 1% of memory in Linux is gobbled up by page management structures. That's before we get into the cost of the rest of the page table and friends. Most apps are also SIGNIFICANTLY more memory hungry today, and a larger virtual address space means our malloc routines can better bin allocations for less fragmentation. Yay.
Is anyone experimenting with just saying 'forget the 4k page'? I'm sketching out ideas for 'my next kernel' and I think this is probably #1 on my list right now.
As I said above, once hidden behind the API and filled in on demand, page tables become transient and can be forgotten or reused at will. They then occupy space roughly proportional to the working set of the subset of processes that are actually running at any one time.
It scales up by reserving a large number of page tables to be shared.
It scales down by forcing all processes to share a small number of page tables (even one).
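A minimal sketch of what that shared, transient page-table pool could look like (the names, the pool size, and the helpers find_free_or_lru() and pagetable_clear() are all hypothetical, not any existing kernel's code):
Code:
#include <stddef.h>

#define NPAGETABLES 8                  /* fixed pool size; could even be 1      */

struct process;                        /* whatever your kernel's process type is */

struct pagetable {
    struct process *owner;             /* NULL if free                           */
    /* hardware-specific root pointer and backing pages live here */
};

static struct pagetable pool[NPAGETABLES];

/* Hypothetical helpers. */
extern struct pagetable *find_free_or_lru(struct pagetable *pool, size_t n);
extern void pagetable_clear(struct pagetable *pt);

/* Called on context switch: give process 'p' a hardware page table to
 * run with.  If none is free, recycle the least recently used one; its
 * previous owner simply rebuilds its entries on demand, through the
 * mmu_map() API, the next time it runs and faults. */
struct pagetable *pagetable_acquire(struct process *p)
{
    struct pagetable *pt = find_free_or_lru(pool, NPAGETABLES);
    if (pt->owner != p) {
        pagetable_clear(pt);           /* drop the previous owner's mappings */
        pt->owner = p;                 /* new entries get filled in lazily   */
    }
    return pt;
}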