1GB Pages

Posted: Fri Jan 19, 2018 4:59 am
by z0rr0
Hello everyone, I am thinking about using 1GB pages in my kernel to get some performance improvement. However, I am not sure how that works for memory-mapped devices. More precisely, with 1GB pages I have less granularity for defining memory regions that should not be cached. Am I right? Or is there another way to tell the MMU that some region must not be cached?

Regards, Matias.

Re: 1GB Pages

Posted: Fri Jan 19, 2018 7:54 am
by stlw
z0rr0 wrote:Hello everyone, I am thinking about using 1GB pages in my kernel to get some performance improvement. However, I am not sure how that works for memory-mapped devices. More precisely, with 1GB pages I have less granularity for defining memory regions that should not be cached. Am I right? Or is there another way to tell the MMU that some region must not be cached?

Regards, Matias.
First of all, not all the memory must be mapped through 1GB pages. You can keep a few 2M and even 4K pages for better granularity.
You could also override the PAT memory type using MTRRs.
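
A minimal sketch of the first suggestion, assuming x86-64 long mode: RAM is mapped with 1 GiB entries in the PDPT, while a device region gets its own page directory with 2 MiB uncached entries. The macro and function names here are hypothetical; the bit positions are the ones documented in the Intel SDM. It also assumes the PAT MSR is left at its power-on default, where PCD=1 and PWT=1 select the uncacheable type.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 paging-entry flag bits (positions as documented in the Intel SDM). */
#define PTE_PRESENT (1ULL << 0)
#define PTE_WRITE   (1ULL << 1)
#define PTE_PWT     (1ULL << 3)  /* page-level write-through */
#define PTE_PCD     (1ULL << 4)  /* page-level cache disable */
#define PTE_PS      (1ULL << 7)  /* entry maps a large page instead of a table */

#define SIZE_2M (1ULL << 21)
#define SIZE_1G (1ULL << 30)

/* PDPT entry that maps a 1 GiB page of ordinary cacheable RAM. */
static inline uint64_t make_1g_ram_entry(uint64_t phys)
{
    assert((phys & (SIZE_1G - 1)) == 0);   /* must be 1 GiB aligned */
    return phys | PTE_PRESENT | PTE_WRITE | PTE_PS;
}

/* Page directory entry that maps a 2 MiB uncached page, e.g. for MMIO.
 * Assumption: default PAT, so PCD=1 and PWT=1 mean "uncacheable". */
static inline uint64_t make_2m_uncached_entry(uint64_t phys)
{
    assert((phys & (SIZE_2M - 1)) == 0);   /* must be 2 MiB aligned */
    return phys | PTE_PRESENT | PTE_WRITE | PTE_PS | PTE_PCD | PTE_PWT;
}

/* PDPT entry that points at a page directory (PS clear), used for the
 * one gigabyte that needs finer granularity. */
static inline uint64_t make_table_entry(uint64_t table_phys)
{
    assert((table_phys & 0xFFF) == 0);     /* tables are 4 KiB aligned */
    return table_phys | PTE_PRESENT | PTE_WRITE;
}
```

In such a layout, most PDPT slots would hold the result of `make_1g_ram_entry()`, and only the gigabyte containing devices would use `make_table_entry()` with a page directory filled by a mix of cached and uncached 2 MiB entries.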

Re: 1GB Pages

Posted: Fri Jan 19, 2018 8:16 am
by z0rr0
Hello,
stlw wrote: First of all, not all the memory must be mapped through 1GB pages. You can keep a few 2M and even 4K pages for better granularity.
I should say first that in my case I have only one page directory and the page size is fixed throughout system execution. Can I have different page sizes in the same page directory?
stlw wrote:You could also override the PAT memory type using MTRRs.
Thanks, I will have to check that.

Re: 1GB Pages

Posted: Fri Jan 19, 2018 5:02 pm
by Brendan
Hi,
z0rr0 wrote:
stlw wrote:First of all, not all the memory must be mapped through 1GB pages. You can keep a few 2M and even 4K pages for better granularity.
I should say first that in my case I have only one page directory and the page size is fixed throughout system execution. Can I have different page sizes in the same page directory?
For 80x86, you can freely mix 4 KiB, 2 MiB and 1 GiB pages within the same virtual address space.

Note that CPUs have a limited number of TLB entries for various sizes. Usually there are thousands of TLB entries for 4 KiB pages and tens of TLB entries for larger sizes. This means that if you use a small number of 1 GiB pages it helps performance (fewer TLB misses), and if you over-use 1 GiB pages you get worse performance (lots of TLB misses because most of the TLB entries can't be used).

Don't forget that (with the recent "meltdown" vulnerability) you're going to want to be using PCID where possible, which means that the TLB has to contain entries from multiple different virtual address spaces, making the "number of TLB entries that the OS can use" even more important.

Also, "cacheability" is stored in TLB entries too (so that the CPU doesn't have to consult MTRRs when there's a TLB hit). If you use a 1 GiB page for an area that contains a mixture of "cacheability" types (e.g. according to MTRRs, some is "write-back" RAM and some is an "uncached" memory mapped device) then the CPU may end up using many 4 KiB TLB entries instead of using a 1 GiB TLB entry so that it can store the "cacheability" in the TLB entries properly. CPU designers have special hacks to make this work for the first 2 MiB of the physical address space (with 2 MiB pages) where it's known that there's often an ugly mixture (some RAM, some memory mapped devices, some ROM) and this has probably been tested to make sure it works; but for all other areas there's a relatively high risk of subtle and not so subtle bugs in the CPU where using 1 GiB pages for "mixed cacheability" areas can lead to problems (e.g. INVLPG only invalidating 4 KiB of the 1 GiB area and causing an OS to crash in unpredictable ways).

Finally; usually an executable has several sections with different attributes (".text", ".rodata", ".data", ".bss", ...) and the permission bits (read/write, no-execute) in page tables are configured to reflect the section's attributes to improve security and/or detect bugs; and usually an executable also uses shared libraries. This means that if you only use 1 GiB pages and nothing else, a tiny little "hello world" executable will consume about 8 GiB of RAM (1 huge 1 GiB page for each section in the executable and each section in the libraries it uses); and (based on "half a page wasted per section") on average (for "1 GiB pages only") every process will waste 4 GiB of RAM.
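
The arithmetic behind that estimate can be sketched as follows. The section sizes are made-up numbers for a small executable plus one shared library (they are not measurements of any real program), and the function names are hypothetical; the only point is that each section is rounded up to a whole number of pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round size up to a multiple of page_size (which must be a power of two). */
static inline uint64_t round_up(uint64_t size, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size + page_size - 1) & ~(page_size - 1);
}

/* RAM consumed when every section is backed by its own whole pages. */
static inline uint64_t footprint(const uint64_t *sections, size_t n,
                                 uint64_t page_size)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += round_up(sections[i], page_size);
    return total;
}

/* Hypothetical section sizes in bytes: .text, .rodata, .data, .bss of the
 * executable, followed by the same four sections of one shared library. */
static const uint64_t demo_sections[8] = {
    12288, 2048, 1024, 512,
    1572864, 307200, 20480, 16384,
};
```

With these assumed sizes, `footprint(demo_sections, 8, 4096)` comes to 474 pages (a little under 2 MiB), `footprint(demo_sections, 8, 1ULL << 21)` comes to 16 MiB, and `footprint(demo_sections, 8, 1ULL << 30)` comes to the full 8 GiB.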

Mostly, 1 GiB pages (and 2 MiB pages) should be used sparingly (e.g. only if a process is using the whole 1 GiB with the same permissions anyway); and failing to do that will hurt performance and/or waste a lot of RAM (which will also hurt performance - less RAM for file data caches, etc). If you only support a single page size, then it'd be much better to only use 4 KiB pages (although in this case it's usually not that hard to use larger page sizes where appropriate for memory mapped devices, because the physical memory manager doesn't have to deal with larger page sizes in this case).


Cheers,

Brendan

Re: 1GB Pages

Posted: Sat Jan 20, 2018 3:23 am
by Korona
Let me add something to what Brendan said: There are likely many areas where you can get a much larger performance boost for your OS than by supporting huge pages. In my experiments (on Linux using MAP_HUGETLB for mmap() or hugetlbfs), even CPU-intensive applications that max out memory bandwidth improve less than 5% from the reduced number of TLB misses, while it is quite easy to get a negative speedup if you do not design the application code properly and with huge pages in mind. For the 5% speedup the application was already allocating almost all of its memory from an arena allocator on 1GiB aligned chunks that performed a copying garbage collection to reclaim memory. Speedups are going to be much smaller for non-optimized workloads.

Re: 1GB Pages

Posted: Sat Jan 20, 2018 8:50 am
by Schol-R-LEA
Perhaps my view of this is rather different from most, because of the unusually tight integration I mean to have between the operating system's memory management and the userland software (made possible through the use of JIT for everything), but I would be hesitant to use even 2MB pages unless I knew that there were individual data structures which would fill most of the page on their own. While that is growing more frequent over time (especially with audio and video), it is a pretty uncommon use case.

However, as I say, my model isn't typical; I mean to have a pretty complex memory management system, which for both ARM and x86 long mode would use pages as the quanta of the garbage collection arenas. Most of the pages would be group tagged and populated with elements of the same type (for example, a 4KiB page may hold a packed set of 500 64-bit system integers plus a 48 byte header) and when swapping out a dirty page the system would run a GC cycle on it.

Different page uses might use different methods of memory management, and the system would have a selection of options to apply on a case-by-case basis. For example, the header of the previously mentioned page of integers might have a reference count for the page as a whole, and a bitmap of markers of the values currently in use; this means that the individual integers would not need to be GCed, per se, and when all of the values are freed the page itself can be discarded rather than swapped.

Whereas a page holding, for example, a set of small strings, might perform a compaction before the page is evicted, so that when it is swapped in, it has already been garbage collected.

Or yet again, for a function which makes no calls which could lead to a recursion and does nothing that might throw an exception, its entire activation record might simply be handled on the system stack in the same way as those in more conventional languages, and no additional management would be needed.

It also lets the system amortize the cost of automatic memory management over time, rather than doing long, stop-the-world GC operations.

But all of that is predicated on the system knowing, or at least being able to determine, the sizes and types of the elements in question, and the expected behavior of the functions operating on them. This isn't going to be an option for a more conventional OS design.

Re: 1GB Pages

Posted: Sat Jan 20, 2018 9:12 am
by z0rr0
Hello and thanks for the answers,
Brendan wrote:Hi,

For 80x86, you can freely mix 4 KiB, 2 MiB and 1 GiB pages within the same virtual address space.
From the Intel manual, it was not clear to me that I could mix different page sizes in the same page directory; thanks for the clarification. That was my main question, because I did not know how to mix physical memory and memory-mapped devices.
Brendan wrote: Note that CPUs have a limited number of TLB entries for various sizes. Usually there are thousands of TLB entries for 4 KiB pages and tens of TLB entries for larger sizes. This means that if you use a small number of 1 GiB pages it helps performance (fewer TLB misses), and if you over-use 1 GiB pages you get worse performance (lots of TLB misses because most of the TLB entries can't be used).

Don't forget that (with the recent "meltdown" vulnerability) you're going to want to be using PCID where possible, which means that the TLB has to contain entries from multiple different virtual address spaces, making the "number of TLB entries that the OS can use" even more important.

Also, "cacheability" is stored in TLB entries too (so that the CPU doesn't have to consult MTRRs when there's a TLB hit). If you use a 1 GiB page for an area that contains a mixture of "cacheability" types (e.g. according to MTRRs, some is "write-back" RAM and some is an "uncached" memory mapped device) then the CPU may end up using many 4 KiB TLB entries instead of using a 1 GiB TLB entry so that it can store the "cacheability" in the TLB entries properly. CPU designers have special hacks to make this work for the first 2 MiB of the physical address space (with 2 MiB pages) where it's known that there's often an ugly mixture (some RAM, some memory mapped devices, some ROM) and this has probably been tested to make sure it works; but for all other areas there's a relatively high risk of subtle and not so subtle bugs in the CPU where using 1 GiB pages for "mixed cacheability" areas can lead to problems (e.g. INVLPG only invalidating 4 KiB of the 1 GiB area and causing an OS to crash in unpredictable ways).
Got it
Brendan wrote:Finally; usually an executable has several sections with different attributes (".text", ".rodata", ".data", ".bss", ...) and the permission bits (read/write, no-execute) in page tables are configured to reflect the section's attributes to improve security and/or detect bugs; and usually an executable also uses shared libraries. This means that if you only use 1 GiB pages and nothing else, a tiny little "hello world" executable will consume about 8 GiB of RAM (1 huge 1 GiB page for each section in the executable and each section in the libraries it uses); and (based on "half a page wasted per section") on average (for "1 GiB pages only") every process will waste 4 GiB of RAM.
I understand your point, but in my case I am not using the permission bits to enforce access rights for those sections. The only bits I am using are the cached/uncached bits (and that was actually my main issue).
Brendan wrote:Mostly, 1 GiB pages (and 2 MiB pages) should be used sparingly (e.g. only if a process is using the whole 1 GiB with the same permissions anyway); and failing to do that will hurt performance and/or waste a lot of RAM (which will also hurt performance - less RAM for file data caches, etc). If you only support a single page size, then it'd be much better to only use 4 KiB pages (although in this case it's usually not that hard to use larger page sizes where appropriate for memory mapped devices, because the physical memory manager doesn't have to deal with larger page sizes in this case).
I think my case is very particular. I am not heavily using paging. I have a flat memory space which is mapped one to one to the physical memory. I have only one page directory, and the main idea was to reduce the time that the MMU takes to walk through the page directory. By using 1GB pages, I will be able to reduce the size of the page directory (currently I am using 2MB pages). However, I am not sure if there is really any improvement. I need to experiment a bit first.

Thanks again for these great answers, Matias.
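
For the flat, one-to-one mapping described above, the difference in table size and walk depth can be estimated with a short sketch. It assumes standard 4-level long-mode paging with everything under a single PML4 entry; the function names are hypothetical. Real hardware also has paging-structure caches, so the walk depth is only an upper bound on memory reads per TLB miss, and measuring remains necessary.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BYTES 4096ULL  /* every paging structure is one 4 KiB table */

/* Bytes of paging structures needed to identity-map `gib` GiB with 2 MiB
 * pages: one PML4, one PDPT, and one page directory per GiB
 * (512 entries of 2 MiB each cover exactly 1 GiB). */
static inline uint64_t tables_bytes_2m(uint64_t gib)
{
    assert(gib >= 1 && gib <= 512);  /* stay under a single PML4 entry */
    return (2 + gib) * TABLE_BYTES;
}

/* Same mapping with 1 GiB pages: the PDPT entries map memory directly,
 * so only the PML4 and the PDPT remain. */
static inline uint64_t tables_bytes_1g(uint64_t gib)
{
    assert(gib >= 1 && gib <= 512);
    return 2 * TABLE_BYTES;
}

/* Table entries read in a full walk for a given page size, expressed as
 * the page shift: 12 (4 KiB) -> 4, 21 (2 MiB) -> 3, 30 (1 GiB) -> 2. */
static inline unsigned walk_reads(unsigned page_shift)
{
    assert(page_shift == 12 || page_shift == 21 || page_shift == 30);
    return 4 - (page_shift - 12) / 9;
}
```

For example, identity-mapping 4 GiB takes 24 KiB of tables with 2 MiB pages and 8 KiB with 1 GiB pages, and a full walk drops from three entry reads to two.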

Re: 1GB Pages

Posted: Wed Jan 24, 2018 4:35 am
by linguofreak
z0rr0 wrote:Hello and thanks for the answers,
Brendan wrote:Hi,

For 80x86, you can freely mix 4 KiB, 2 MiB and 1 GiB pages within the same virtual address space.
From the Intel manual, it was not clear to me that I could mix different page sizes in the same page directory; thanks for the clarification. That was my main question, because I did not know how to mix physical memory and memory-mapped devices.
The PS bit in a PDPT/Page Directory entry determines whether the processor uses the address contained in the entry to find the next table down in the paging hierarchy, or uses it as a large page.

So for 4k pages, the processor checks the PML4 entry for the address to be translated and uses that to find a PDPT. It checks the PDPT entry for the address to be translated, finds that the PS bit is 0, and uses the PDPT entry to find a page directory. It checks the relevant entry in that page directory, finds PS is 0, and uses the PDE to find a page table, then uses the relevant page table entry to find the 4k page containing the address that needed to be translated.

For 2M pages, the processor checks the relevant PML4 entry, finds the appropriate PDPT, checks the relevant PDPTE, finds PS is 0, and uses the PDPTE to find the page directory it needs. When it checks the relevant page directory entry, it finds that PS = 1, so rather than interpreting that PDE as pointing to a page table, it interprets it as pointing directly to a 2M page.

For 1G pages, the processor checks the relevant PML4 entry to find the PDPT, checks the relevant PDPTE, finds PS is 1, and interprets that PDPTE as pointing directly to a 1G page.
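
The three walks above can be sketched as a small software model. This is only an illustration: `demo_mem` is a hypothetical array standing in for physical memory (indexed in 8-byte entries), the table layout built by `demo_setup()` is invented for the example, and permission, reserved-bit and canonical-address checks are left out.

```c
#include <assert.h>
#include <stdint.h>

#define P_PRESENT (1ULL << 0)
#define P_PS      (1ULL << 7)
#define ADDR_MASK 0x000FFFFFFFFFF000ULL   /* bits 12..51 of an entry */

/* Hypothetical "physical memory": PML4 at 0x0000, PDPT at 0x1000,
 * page directory at 0x2000, page table at 0x3000. */
static uint64_t demo_mem[4 * 512];

static void demo_setup(void)
{
    demo_mem[0 * 512 + 0] = 0x1000 | P_PRESENT;                 /* PML4[0] -> PDPT  */
    demo_mem[1 * 512 + 0] = 0x2000 | P_PRESENT;                 /* PDPT[0] -> PD    */
    demo_mem[1 * 512 + 1] = 0x40000000ULL | P_PRESENT | P_PS;   /* PDPT[1]: 1G page */
    demo_mem[2 * 512 + 0] = 0x3000 | P_PRESENT;                 /* PD[0] -> PT      */
    demo_mem[2 * 512 + 1] = 0x200000 | P_PRESENT | P_PS;        /* PD[1]: 2M page   */
    demo_mem[3 * 512 + 5] = 0x5000 | P_PRESENT;                 /* PT[5]: 4k page   */
}

/* Read the entry selected by vaddr in the table that `base` points to. */
static uint64_t entry_at(const uint64_t *mem, uint64_t base,
                         uint64_t vaddr, unsigned shift)
{
    return mem[(base & ADDR_MASK) / 8 + ((vaddr >> shift) & 0x1FF)];
}

/* Combine a final entry with the page offset for a page of `size` bytes. */
static uint64_t finish(uint64_t e, uint64_t vaddr, uint64_t size,
                       uint64_t *page_size)
{
    *page_size = size;
    return (e & ADDR_MASK & ~(size - 1)) | (vaddr & (size - 1));
}

/* Translate vaddr; returns UINT64_MAX if any level is not present. */
static uint64_t walk(const uint64_t *mem, uint64_t cr3, uint64_t vaddr,
                     uint64_t *page_size)
{
    assert(page_size != 0);
    uint64_t e = entry_at(mem, cr3, vaddr, 39);                  /* PML4E */
    if (!(e & P_PRESENT)) return UINT64_MAX;

    e = entry_at(mem, e, vaddr, 30);                             /* PDPTE */
    if (!(e & P_PRESENT)) return UINT64_MAX;
    if (e & P_PS) return finish(e, vaddr, 1ULL << 30, page_size);

    e = entry_at(mem, e, vaddr, 21);                             /* PDE */
    if (!(e & P_PRESENT)) return UINT64_MAX;
    if (e & P_PS) return finish(e, vaddr, 1ULL << 21, page_size);

    e = entry_at(mem, e, vaddr, 12);                             /* PTE */
    if (!(e & P_PRESENT)) return UINT64_MAX;
    return finish(e, vaddr, 1ULL << 12, page_size);
}
```

After `demo_setup()`, calling `walk(demo_mem, 0, addr, &size)` stops at the PDPT for addresses in the second gigabyte, at the page directory for addresses in the second 2 MiB, and goes all the way to the page table for the one mapped 4k page.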

Re: 1GB Pages

Posted: Fri Jan 26, 2018 5:27 am
by z0rr0
Hello and thanks for the answer; it is much clearer now.
linguofreak wrote:
z0rr0 wrote:Hello and thanks for the answers,
Brendan wrote:Hi,

For 80x86, you can freely mix 4 KiB, 2 MiB and 1 GiB pages within the same virtual address space.
From the Intel manual, it was not clear to me that I could mix different page sizes in the same page directory; thanks for the clarification. That was my main question, because I did not know how to mix physical memory and memory-mapped devices.
The PS bit in a PDPT/Page Directory entry determines whether the processor uses the address contained in the entry to find the next table down in the paging hierarchy, or uses it as a large page.

So for 4k pages, the processor checks the PML4 entry for the address to be translated and uses that to find a PDPT. It checks the PDPT entry for the address to be translated, finds that the PS bit is 0, and uses the PDPT entry to find a page directory. It checks the relevant entry in that page directory, finds PS is 0, and uses the PDE to find a page table, then uses the relevant page table entry to find the 4k page containing the address that needed to be translated.

For 2M pages, the processor checks the relevant PML4 entry, finds the appropriate PDPT, checks the relevant PDPTE, finds PS is 0, and uses the PDPTE to find the page directory it needs. When it checks the relevant page directory entry, it finds that PS = 1, so rather than interpreting that PDE as pointing to a page table, it interprets it as pointing directly to a 2M page.

For 1G pages, the processor checks the relevant PML4 entry to find the PDPT, checks the relevant PDPTE, finds PS is 1, and interprets that PDPTE as pointing directly to a 1G page.