Hi!
I'm trying to 'merge' my i386 kernel and my ARM Raspberry Pi kernel, and it's been really fun so far to support multiple architectures/platforms.
On both architectures, I have paging enabled and the kernel mapped in the higher half.
I was wondering: on i386, you're allowed to have 4 MiB pages directly in the page directory (if PSE is activated), and there's a similar feature on ARM: you're allowed to have 1 MiB "sections" directly in your translation table.
From a performance point of view, is it worth using only those instead of page tables?
Mapping is faster, and resolving addresses should be too, because there's one (or more) less level of indirection.
(Unless caching makes all of that useless? I don't really know; I think caching one big entry should still be faster, and on a context switch we invalidate anyway...)
I can see why for processes you might want to allocate less memory (a daemon or tiny process using 4 MiB of RAM is big, because there might be an enormous number of them),
but at least for the kernel: is it worth mapping it using those "big" sections? Memory is cheap these days, and wasting 3 MiB seems worth it to me...
If someone has a kernel with multiple ports, I would love to hear from you if you've done memory allocation performance benchmarks.
Anyway, I'm waiting for your comments on this matter (as my kernel is not too "evolved", mapping with "big" sections is completely OK for now, but maybe with multiple processes I'll get in trouble wasting too much RAM or something...)
Thanks for reading!
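For reference, here is a minimal sketch of how the two large-page entries mentioned above are encoded. The helper names are hypothetical; the bit layouts follow the i386 PSE page-directory-entry format and the ARMv6/v7 short-descriptor "section" format, but double-check them against the manuals before use.

```c
#include <stdint.h>

/* i386 PDE flags; PDE_PS takes effect only when CR4.PSE is set */
#define PDE_PRESENT (1u << 0)
#define PDE_WRITE   (1u << 1)
#define PDE_PS      (1u << 7)   /* this PDE maps a 4 MiB page directly */

/* Encode a PDE mapping one 4 MiB page; phys should be 4 MiB aligned
 * (the low 22 bits are masked off here). */
static inline uint32_t pde_4mib(uint32_t phys, uint32_t flags)
{
    return (phys & 0xFFC00000u) | PDE_PS | flags;
}

/* ARMv6/v7 short-descriptor format: a 1 MiB "section" entry has
 * descriptor type 0b10 in bits [1:0]; AP[1:0] lives in bits [11:10]. */
static inline uint32_t arm_section(uint32_t phys, uint32_t ap)
{
    return (phys & 0xFFF00000u) | ((ap & 0x3u) << 10) | 0x2u;
}
```

For a higher-half kernel at 0xC0000000, the i386 side would be something like `pd[768] = pde_4mib(0, PDE_PRESENT | PDE_WRITE);` to map the first physical 4 MiB.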
Paging performances when mapping 4MiB pages vs 4 KiB
Re: Paging performances when mapping 4MiB pages vs 4 KiB
The answer is rather obvious: there's no reason not to use this feature, because it is 100% an optimization. It decreases TLB pressure and shortens the page walk. Of course, you'll lose page-attribute differentiation if you map the whole kernel image and drivers in big blocks with the same attributes: no "read-only", no "no-execute", everything will be writable and executable. But it's a trade-off, and frankly, is that differentiation all that helpful in kernel mode? When it comes to big kernel structures that by nature share the same attributes, it's just a sin not to use this feature. For example, the PFN array, the SYSTEM hive, or the non-paged pool are all 100% candidates to be allocated exactly this way. For user mode, on the other hand, you wouldn't want to waste memory like this; it's not that "cheap", and we all have to fight the bloat corroding today's software. And of course it would be a sin to neglect the mentioned attributes there, because for user mode in particular they exist and stop many bad actors.
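One practical wrinkle when mixing attributes and big blocks: a large entry can only cover a region whose virtual and physical addresses are both large-page aligned and that is at least one large page long. A hypothetical helper for that decision might look like this:

```c
#include <stdint.h>

#define LARGE_PAGE (4u << 20)   /* 4 MiB on i386/PSE; would be 1 MiB for ARM sections */

/* A large entry is usable only when both the virtual and physical
 * addresses are large-page aligned and at least one full large page
 * of the region remains; otherwise fall back to 4 KiB pages. */
static int can_use_large(uint32_t va, uint32_t pa, uint32_t len)
{
    return ((va | pa) & (LARGE_PAGE - 1)) == 0 && len >= LARGE_PAGE;
}
```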
Re: Paging performances when mapping 4MiB pages vs 4 KiB
valou3433 wrote:From a performance point of view, is it worth using only those instead of page tables?
It will be very much worth it, especially for static mappings such as the kernel text and data. Other places that benefit might be MMIO, such as PCI BARs or framebuffer mappings.
I would also say that it is probably not worth it for user paging, but I certainly haven't benchmarked it. The thinking is that while memory is cheap, you might be wasting multiple MiB per process (each mapping that isn't a multiple of 4 MiB will waste 2 MiB on average). A bash shell might have four mappings in different modes, such as the following from a shell on my machine:
$ cat /proc/$$/maps
559da3882000-559da38b1000 r--p 00000000 00:1b 378746 /usr/bin/bash
559da38b1000-559da3990000 r-xp 0002f000 00:1b 378746 /usr/bin/bash
559da3990000-559da39ca000 r--p 0010e000 00:1b 378746 /usr/bin/bash
559da39ca000-559da39ce000 r--p 00147000 00:1b 378746 /usr/bin/bash
559da39ce000-559da39d7000 rw-p 0014b000 00:1b 378746 /usr/bin/bash
559da39d7000-559da39e2000 rw-p 00000000 00:00 0
559da4118000-559da42c0000 rw-p 00000000 00:00 0 [heap]
7f0fb2627000-7f0fb2b98000 r--p 00000000 00:1b 371248 /usr/lib/locale/locale-archive
7f0fb2b98000-7f0fb2b9b000 rw-p 00000000 00:00 0
7f0fb2b9b000-7f0fb2bc7000 r--p 00000000 00:1b 367052 /usr/lib/x86_64-linux-gnu/libc.so.6
7f0fb2bc7000-7f0fb2d5b000 r-xp 0002c000 00:1b 367052 /usr/lib/x86_64-linux-gnu/libc.so.6
7f0fb2d5b000-7f0fb2daf000 r--p 001c0000 00:1b 367052 /usr/lib/x86_64-linux-gnu/libc.so.6
7f0fb2daf000-7f0fb2db0000 ---p 00214000 00:1b 367052 /usr/lib/x86_64-linux-gnu/libc.so.6
7f0fb2db0000-7f0fb2db3000 r--p 00214000 00:1b 367052 /usr/lib/x86_64-linux-gnu/libc.so.6
7f0fb2db3000-7f0fb2db6000 rw-p 00217000 00:1b 367052 /usr/lib/x86_64-linux-gnu/libc.so.6
7f0fb2db6000-7f0fb2dc3000 rw-p 00000000 00:00 0
7f0fb2dc3000-7f0fb2dd1000 r--p 00000000 00:1b 59814 /usr/lib/x86_64-linux-gnu/libtinfo.so.6.2
7f0fb2dd1000-7f0fb2ddf000 r-xp 0000e000 00:1b 59814 /usr/lib/x86_64-linux-gnu/libtinfo.so.6.2
7f0fb2ddf000-7f0fb2ded000 r--p 0001c000 00:1b 59814 /usr/lib/x86_64-linux-gnu/libtinfo.so.6.2
7f0fb2ded000-7f0fb2df1000 r--p 00029000 00:1b 59814 /usr/lib/x86_64-linux-gnu/libtinfo.so.6.2
7f0fb2df1000-7f0fb2df2000 rw-p 0002d000 00:1b 59814 /usr/lib/x86_64-linux-gnu/libtinfo.so.6.2
7f0fb2df2000-7f0fb2df4000 rw-p 00000000 00:00 0
7f0fb2e03000-7f0fb2e0a000 r--s 00000000 00:1b 373594 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7f0fb2e0a000-7f0fb2e0b000 r--p 00000000 00:1b 367038 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f0fb2e0b000-7f0fb2e33000 r-xp 00001000 00:1b 367038 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f0fb2e33000-7f0fb2e3d000 r--p 00029000 00:1b 367038 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f0fb2e3d000-7f0fb2e3f000 r--p 00032000 00:1b 367038 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f0fb2e3f000-7f0fb2e41000 rw-p 00034000 00:1b 367038 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffd0df57000-7ffd0df78000 rw-p 00000000 00:00 0 [stack]
7ffd0df78000-7ffd0df7c000 r--p 00000000 00:00 0 [vvar]
7ffd0df7c000-7ffd0df7e000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
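The rounding overhead described above can be estimated with a tiny (hypothetical) helper:

```c
#include <stdint.h>

/* Bytes wasted if a region of len bytes were rounded up to 4 MiB pages. */
static uint32_t large_page_waste(uint32_t len)
{
    const uint32_t LARGE = 4u << 20;
    uint32_t rem = len % LARGE;
    return rem ? LARGE - rem : 0;
}
```

For instance, the r-xp segment of /usr/bin/bash in the listing is 0xDF000 bytes, so backing it with one 4 MiB page would waste about 3.1 MiB; summed over every mapping in the listing, the waste would dwarf the shell's actual footprint.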
Re: Paging performances when mapping 4MiB pages vs 4 KiB
valou3433 wrote:on i386, you're allowed to have 4 MiB pages directly in the page directory (if PSE is activated)
Or 2 MiB pages if PAE is active.
valou3433 wrote:is it worth using only those instead of page tables?
Recent tests with Linux have concluded that the ideal page size depends on your workload. I'm not sure whether those tests really cover what you're asking about, though.
valou3433 wrote:on a context switch we invalidate anyway...
You can use global pages to skip invalidating the kernel's TLB entries. In long mode, there is also PCID.
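A minimal sketch of the global-pages idea (the function name is hypothetical): set the global bit on the kernel's page-directory entries, and then, once CR4.PGE is enabled, a CR3 reload on context switch will no longer flush those TLB entries.

```c
#include <stdint.h>
#include <stddef.h>

#define PDE_PRESENT (1u << 0)
#define PDE_GLOBAL  (1u << 8)   /* honoured only while CR4.PGE is enabled */

/* Mark every present kernel PDE as global so a CR3 reload (context
 * switch) does not evict the kernel's TLB entries. */
static void mark_kernel_global(uint32_t pd[1024], size_t first, size_t count)
{
    for (size_t i = first; i < first + count; i++)
        if (pd[i] & PDE_PRESENT)
            pd[i] |= PDE_GLOBAL;
}
```

For a higher-half kernel, `first` would typically be 768 (the entry covering 0xC0000000) and `count` the number of kernel PDEs.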
Re: Paging performances when mapping 4MiB pages vs 4 KiB
valou3433 wrote:From a performance point of view, is it worth using only those instead of page tables? Mapping is faster, and resolving addresses should be too, because there's one (or more) less level of indirection. (Unless caching makes all of that useless? I think caching one big entry should still be faster, and on a context switch we invalidate anyway...)
From a TLB caching perspective, we consider 4 MB paging deep legacy that doesn't deserve any dedicated hardware; the mode is kept only for compatibility reasons. TLB caching capacity is poor compared with 2 MB paging; all the optimizations are done for the 2 MB paging mode only. On a good day, your 4 MB page will be internally auto-fractured into two 2 MB pages.
Re: Paging performances when mapping 4MiB pages vs 4 KiB
stlw wrote:From a TLB caching perspective, we consider 4 MB paging deep legacy that doesn't deserve any dedicated hardware; the mode is kept only for compatibility reasons. TLB caching capacity is poor compared with 2 MB paging; all the optimizations are done for the 2 MB paging mode only. On a good day, your 4 MB page will be internally auto-fractured into two 2 MB pages.
Really? Is there a source for that information?
So you're saying that I should first detect whether the processor is capable of PAE, and if so do 2 MiB mappings, else 4 MiB ones (for the kernel)?
But then using 4 MiB mappings is just a potential waste of 2 MiB that works all the time...
I don't think mapping one 4 MiB page would be less efficient than mapping two 2 MiB ones... as you said, they will be "auto-fractured into two pages", so using 4 MiB mappings is fine anyway.
Re: Paging performances when mapping 4MiB pages vs 4 KiB
valou3433 wrote:Really? Is there a source for that information?
I said "we":
https://www.linkedin.com/in/stanislav-s ... bdomain=il

valou3433 wrote:So you're saying that I should first detect whether the processor is capable of PAE, and if so do 2 MiB mappings, else 4 MiB ones (for the kernel)? But then using 4 MiB mappings is just a potential waste of 2 MiB that works all the time... I don't think mapping one 4 MiB page would be less efficient than mapping two 2 MiB ones...
Yes, it is better to always use PAE. Putting very legacy hardware aside, PAE will always be available, because it is the industry standard now and also the required baseline for 64-bit mode. PSE may disappear in any future core, or it may lose its caching in the second-level TLB, or show some other behavior you won't expect, just because it is not treated as a first-class citizen.
Re: Paging performances when mapping 4MiB pages vs 4 KiB
Oh wow, thanks, I did not know that!
I'll check for PAE first then, and use PSE only if PAE is not supported.
Thank you everyone for your answers!
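That detection order can be sketched from the CPUID leaf-1 feature bits (EDX bit 3 is PSE, bit 6 is PAE). The decision helper below is hypothetical and takes a raw EDX value so the policy itself can be shown in isolation:

```c
#include <stdint.h>

#define CPUID_EDX_PSE (1u << 3)
#define CPUID_EDX_PAE (1u << 6)

enum map_mode { MAP_4K_ONLY, MAP_PSE_4M, MAP_PAE_2M };

/* Prefer PAE 2 MiB pages, fall back to PSE 4 MiB pages, else plain
 * 4 KiB paging. edx is the feature word from CPUID leaf 1. */
static enum map_mode choose_mode(uint32_t edx)
{
    if (edx & CPUID_EDX_PAE) return MAP_PAE_2M;
    if (edx & CPUID_EDX_PSE) return MAP_PSE_4M;
    return MAP_4K_ONLY;
}
```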