Page 1 of 2

x86-64 Paging - Just to be sure

Posted: Fri May 19, 2017 4:11 pm
by DevNoteHQ
Hi!

I am trying to figure out how x86-64 Paging works. I think i understood the principle, but could you tell me if my code is somewhat correct?

Code: Select all

#define PL2E 512 //512 --> 512 * 2MB Pages per PL3 Entry --> 512 * 16 * 2MB Pages per Process
#define PL3E 16  //16 --> 16 GB max Pages per Process. 512 is just to big.
#define PL4E 512 //512 --> PID of the running process --> 512 processes possible

#define PL2INITFLAGS 0b00001000000
#define PL3INITFLAGS 0b00000000001
#define PL4INITFLAGS 0b00000000001

namespace Paging
{
	uint64_t PL2[PL4E][PL3E][PL2E];
	uint64_t PL3[PL4E][PL3E];
	uint64_t PL4[PL4E];

	void init(void)
	{
		for (int i4 = 0; i4 < PL4E; i4++)
		{
			PL4[i4] = ((((uint64_t) &PL3 + i4 * PL3E) << 12) | PL4INITFLAGS);
			for (int i3 = 0; i3 < PL3E; i3++)
			{
				PL3[i4][i3] = ((((uint64_t) &PL2 + i4 * PL3E * PL2E + i3 * PL2E) << 12) | PL3INITFLAGS);
				for (int i2 = 0; i2 < PL2E; i2++)
				{
					PL2[i4][i3][i2] = 0 | PL2INITFLAGS;
				}
			}
		}
	}
}
I might have to set PL4 and PL3 to not present either... The plan is to use PL4 as an indicator which process currently runs.

Now to the theory behind: I just have to load a new PML4T for the current process. If i load it, it will the go to the 0 address in the lowest table. The address-counter will increment and if it gets out of the area of the first page, the MMU will try to load the next page if the next page is present, right? I don't have to manualy do something to make a "jump" to the next page (except for setting up the page before), right?

And sorry for my bad english.

EDIT: You didn't see that, okay? XD

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 7:34 am
by LtG
MathiLpHD wrote:Now to the theory behind: I just have to load a new PML4T for the current process. If i load it, it will the go to the 0 address in the lowest table. The address-counter will increment and if it gets out of the area of the first page, the MMU will try to load the next page if the next page is present, right? I don't have to manualy do something to make a "jump" to the next page (except for setting up the page before), right?
Not sure if I understand what you mean, but are you saying that instead of changing CR3 when switching a process you only want to change the topmost page directory entry? Is there a supposed benefit to this?

Note, you will be rewriting the existing entry, which means you need to flush the TLB, instead you could just change the CR3 (which automatically flushes TLB as needed).

Also note that if I did my math right your PL2 alone is 32MiB.. I would do all of that dynamically and without any hard coded limitations on number of processes (PID).
EDIT: You didn't see that, okay? XD
Probably not, see what?

PS. I guess it's nit-picking, but on your first code line it should be "16 * 512 * 2 MB", just irritates my eyes =)

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 9:54 am
by DevNoteHQ
LtG wrote: Probably not, see what?

PS. I guess it's nit-picking, but on your first code line it should be "16 * 512 * 2 MB", just irritates my eyes =)
Well it's actually correct in my order. 512 PML4T Entries * 16 PDPT Entries * 2MB Pagesize.
And all for's had <= conditions instead of <. #StackOverflow XD
LtG wrote:Not sure if I understand what you mean, but are you saying that instead of changing CR3 when switching a process you only want to change the topmost page directory entry? Is there a supposed benefit to this?

Note, you will be rewriting the existing entry, which means you need to flush the TLB, instead you could just change the CR3 (which automatically flushes TLB as needed).

Also note that if I did my math right your PL2 alone is 32MiB.. I would do all of that dynamically and without any hard coded limitations on number of processes (PID).
No i want to change CR3 for each process as you ment. But because CR3 points to the topmost page directory (=PML4T), i think both statements are true, at least that's how i understood it.

Another problem i am currently facing is that i have to insert 4K aligned entries for PML4T and PDPT (not for PD, because it only holds the base address for the Pages, which have to be 2MiB aligned). Which means my RAM usage for Paging increases to 66MiB (32MiB PD, 32MiB PDPT (with a lot of empty space) and 2MiB PML4T). But i still want to do it statically because then i can directly map the whole PM4T to itself. And i am also not getting much benefit of doing it dynamically because i would either need to place the addresses somewhere in memory (=> Bugs are likely), or reserving some space for it, which uses up the same amount of RAM.

Here the updated code:

Code: Select all


#include <mm/vmm.hpp>
#include <common.hpp>

#define PL2E 512 //512 --> 512 * 2MB Pages per PL3 Entry --> 512 * 16 * 2MB Pages per Process
#define PL3E 16  //16 --> 16 GB max Pages per Process. 512 is just to big.
#define PL4E 512 //512 --> PID of the running process --> 512 processes possible

#define SIZE2M 0x000200000
#define SIZE1G 0x040000000

#define PG_PRESENT   0x1
#define PG_WRITABLE  0x2
#define PG_USER      0x4
#define PG_BIG       0x80
#define PG_NO_EXEC   0x8000000000000000
#define PG_ADDR_MASK 0xFFFFFFFFFF000

#define Proc_Kernel_VMA	0x000003E0000000 //In each Process
#define Kernel_Kernel_VMA	0xFFFFFF800000000 //The last Process = Kernel

#define PL2P 0xFFFFFFFFF8000000
#define PL3P 0xFFFFFFFFFA000000
#define PL4P 0xFFFFFFFFFC000000

namespace Paging
{
	uint64_t *PL2 = PL2P;
	uint64_t *PL3 = PL3P;
	uint64_t *PL4 = PL4P;

	void init(void)
	{
		for (uint16_t i4 = 0; i4 < PL4E - 1; i4++)
		{
			PL4[i4 * 4096] = (((uint64_t) PL3[i4 * 4096]) | PG_PRESENT | PG_USER | PG_WRITABLE);
			for (uint16_t i3 = 0; i3 < PL3E; i3++)
			{
				PL3[i4 * PL3E * 4096 + i3 * 4096] = (((uint64_t) PL2[i4 * PL3E * 4096 + i3 * 4096]) | PG_PRESENT | PG_USER | PG_WRITABLE);
				for (uint16_t i2 = 0; i2 < PL2E; i2++)
				{
					PL2[i4 * PL3E * 4096 + i3 * PL2E * 4096 + i2] = (0 | PG_BIG | PG_USER | PG_WRITABLE);
				}
			}
			for (uint16_t i2 = 0; i2 * SIZE2M < (uint64_t)&_rend; i2++)
			{
				PL2[i4 * PL3E * 4096 + (PL3E - 1) * PL2E * 4096 + i2] = (((uint64_t) i2 * SIZE2M) | PG_PRESENT | PG_BIG | PG_WRITABLE);
			}
		}
		PL4[(PL4E - 1) * 4096] = (((uint64_t) PL3[(PL4E - 1) * 4096]) | PG_PRESENT | PG_WRITABLE);
		PL3[(PL4E - 1) * 4096] = ((((uint64_t) PL2[(PL4E - 1) * 4096])) | PG_PRESENT | PG_WRITABLE);
		for (uint16_t i2 = 0; (i2 * SIZE2M) < ((uint64_t) &_rend); i2++)
		{
			PL2[(PL4E - 1) * 4096 + i2] = (((uint64_t) i2 * SIZE2M) | PG_PRESENT | PG_BIG | PG_WRITABLE);
		}
		SetCR3((uint64_t) PL4);
	}
}
I hope I didn't make a mistake... TODO: Map it to itself

EDIT: The PL2 Entries shouldn't start at &_start because &_start = 1MiB which means it's not 2MiB aligned.

EDIT2: Am i even allowed to set the PL2P and so on to a virtual address? I mean as long as i don't disable Paging it should work... It's just gonna get critical when i set CR3, right?

Oh and the code for SetCR3 (don't kill me, I am a noob i assembly XD):

Code: Select all

[global SetCR3]
SetCR3:
	mov cr3, rdi
	ret

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 10:35 am
by LtG
All of the paging tables, at all levels, have to be 4KiB aligned. If you use large entries (>4KiB) then those entries have to be aligned to their size. So for example:
PD = 4KiB
PDE = 4KiB (if it points to PT) or larger if it points to a large page
EDIT2: Am i even allowed to set the PL2P and so on to a virtual address? I mean as long as i don't disable Paging it should work... It's just gonna get critical when i set CR3, right?
As for the entries in all levels of the paging structure, they all have to be physical memory addresses, no virtual addresses allowed. How would the CPU know how to translate a virtual address at any level without knowing physical address at some point? Or maybe I misunderstood again =)
But i still want to do it statically because then i can directly map the whole PM4T to itself
Not sure what you mean here. If you do all the paging statically then you will be severely restricted, for example in number of processes (address spaces really). However if you do do it statically and are not going to modify the tables at runtime then you don't even have to have the page tables themselves mapped to virtual address space as you won't be touching them.

Note, statically allocating them means that the binary size will increase by your 66MiB, of course they use rudimentary compression where the binary will contain a 66MiB BSS section, however the loader (whether normal app loader in an OS or a boot loader in this instance) will need to allocate that memory. Suppose the system doesn't have a contiguous 66MiB physical memory area? Also some boot loaders might have some restrictions on the size as well, not sure.

Dynamic allocation is much more flexible and it's actually very easy. Split physical and virtual memory into two, PMM and VMM. The PMM only keeps track of 4KiB pages and doesn't care about anything else. The VMM only does virtual memory management and when requested to "add memory" to some process it modifies the requested virtual address (page) and requests a 4KiB frame from the PMM and maps the VMM address to that page. When de-allocating memory it does the reverse and the PMM adds that memory frame (4KiB) to it's storage (stack and bitmap are most common I think).

If you haven't already, lookup recursive mapping, basically you map the _current_ paging structures at the end of the _current_ virtual address space. That way the last virtual pages always point to the physical memory of the current paging structures. That way you always know how to find them for the current process. It may take a moment to wrap your head around the concept but it's actually quite intuitive once you get used to the idea and quite easy and efficient.

I didn't fully go over your code, but remember that when your code is loaded by the boot loader all the static allocations (globals) will be physical addresses. Once you enable paging all you ever access is virtual addresses, with the obvious exception that what you provide to CR3 is physical and all references in all the paging structures refer to physical addresses as far as the CPU is concerned.

Note. In some parts I've been referring to the 32-bit paging because it's easier to talk about, but everything works the same with long mode paging with respect to the things I've mentioned.

If you haven't read these already:
http://wiki.osdev.org/Paging
http://wiki.osdev.org/Page_Tables#Recursive_mapping

Check especially the second one as it talks about recursive mapping.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 10:41 am
by LtG
I'd also recommend starting with 4KiB paging, it's simpler for everyone to explain to you, once you have that working it's trivial to do larger pages.

Note: Using larger pages instead of 4KiB is trivial, coming up with good algos to find free large pages not so much, and neither is swapping pages between processes to defrag memory to create more large pages. Unless, of course, you only support large pages which can be somewhat wasteful, but depends on your goals.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 12:49 pm
by DevNoteHQ
LtG wrote:I'd also recommend starting with 4KiB paging, it's simpler for everyone to explain to you, once you have that working it's trivial to do larger pages.

Note: Using larger pages instead of 4KiB is trivial, coming up with good algos to find free large pages not so much, and neither is swapping pages between processes to defrag memory to create more large pages. Unless, of course, you only support large pages which can be somewhat wasteful, but depends on your goals.
Well i plan to completely dump 4K pages. That's just an extra page table i would have to provide and it has no real benefit on x86-64 where memory is usually bigger then 4GiB. And 4GiB with 4K pages under x86-64 would use up more than 8MiB of RAM. 4GiB with 2MiB Pages use more than 16KiB of RAM.

Ou i think i got what you ment. I was under the impression that i have to provide a PML4T with 512 entries but i could also just provide a PML4T[0] entry for each process. That way the number of processes is not limited.
So i think it's time to include malloc in my system. Otherwise i won't be able to find new locations in my memory to put the tables...
LtG wrote:If you haven't already, lookup recursive mapping, basically you map the _current_ paging structures at the end of the _current_ virtual address space. That way the last virtual pages always point to the physical memory of the current paging structures. That way you always know how to find them for the current process. It may take a moment to wrap your head around the concept but it's actually quite intuitive once you get used to the idea and quite easy and efficient.
Well but then i have to create an extra page for each process running, don't i? That's to much wasted RAM with 2MiB pages. And adding 4K pages, especially when i only use them for page structures, will only make things complicated. And with 4K pages i would have to make a page for each table because they have to be 4K aligned. Which means 8 4K pages for 1K of usable space? I could potentialy set each entry on the physical begining of a page, but the question is if a process is loaded exactly at address 0? Or if i can set GCC to always start binaries above address 8?

Couldn't i only map the paging structures into my OS? My processes won't be able to modify the paging structures anyway, so it doesn't really make sence? I could then use a whole 2MiB page for a PML4T[0] "table". I'll then have to set the current PML4T[0] as present and all the others as not present. And do i even have to map the OS in my processes? Maybe for SYSCALL/SYSEXIT? Or because i always have to change CR3 in syscalls/irq's/exceptions when my OS is not mapped in my processes? Is that the same reason why you would map the paging structure in each process? So I won't have to change CR3 in order to add a page?

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 1:01 pm
by Octacone
MathiLpHD wrote: Well i plan to completely dump 4K pages. That's just an extra page table i would have to provide and it has no real benefit on x86-64 where memory is usually bigger then 4GiB. And 4GiB with 4K pages under x86-64 would use up more than 8MiB of RAM. 4GiB with 2MiB Pages use more than 16KiB of RAM.
So what? Memory is meant to be used not spared. What is 8 MiB compared to lets say 16384 MiB? :|
More free ram != more performance.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 1:16 pm
by DevNoteHQ
Octacone wrote: So what? Memory is meant to be used not spared. What is 8 MiB compared to lets say 16384 MiB? :|
More free ram != more performance.
Well i don't really plan to move pages around. So what should eat my performance when doing 2MiB pages instead of 4KiB pages?

Current plan: One 2MiB page for PML4T[0] entries fixed at, lets say, 510MiB physical RAM, one 2MiB page for PDPT entries per process and one 2MiB page for PD entries per process. If the PD entries are full, the PDPT entry is going to be converted into a 1GiB page and the next PDPT entry is used to point to new PD entries in the old 2MiB page for the PD.

EDIT: I'll probably map the kernel to the last PDPT entry using it's own 2MiB PD page in each process and the paging structures to the last two PD entries.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 1:44 pm
by Brendan
Hi,
MathiLpHD wrote:Well i don't really plan to move pages around. So what should eat my performance when doing 2MiB pages instead of 4KiB pages?

Current plan: One 2MiB page for PML4T[0] entries fixed at, lets say, 510MiB physical RAM, one 2MiB page for PDPT entries per process and one 2MiB page for PD entries per process. If the PD entries are full, the PDPT entry is going to be converted into a 1GiB page and the next PDPT entry is used to point to new PD entries in the old 2MiB page for the PD.
Assume you have 123 processes, and each process has (on average) 3 areas with different page permissions (.text (executable), .rodata (read-only) and .data/.bss/heap (read-write) ) plus maybe 2 more areas because of shared libraries, etc. That adds up to 123*5 = 615 areas. With 2 MiB pages only, on average each area will waste half of a page, so that's 300 MiB of RAM wasted.

Then there's other things, like shared memory areas that are mapped into 2 or more processes as "copy on write", where one process modifies a single byte and you have to allocate/copy the entire 2 MiB. Including this sort of stuff, you're probably looking at more like 512 MiB of RAM wasted for those 123 processes.

However, (because disks are slower than RAM) you want to have some kind of VFS cache, where (e.g.) if someone wants to memory map a file that's already in cache you can just map those page/s into the process as "copy on write". This means that for efficient memory mapped files with 2 MiB pages, you're going to be wasting (on average) 1 MiB per file in your VFS cache. With 1000 files cached, that's another GiB of RAM wasted by the VFS cache.

Now assume you have a computer with 8 GiB of RAM - you can expect about 20% of RAM will be wasted. If the processes don't use all the RAM it means less files cached in the VFS cache, which means slower file IO, which means worse performance. If the processes do use more than 6.5 GiB of RAM it means you start swapping pages to disk, and (depending on often pages that were sent to swap are needed again) performance degrades because you need to spend more time swapping.

Of course there's other less significant problems too. For memory mapped files and swap space you'll be reading/writing 2 MiB chunks from disk (which are likely to be much larger than needed and much slower than necessary for various cases); most CPUs have separate "4 KiB page TLBs" and "2 MiB page TLBs", so you'd also be wasting a lot of the TLBs (all TLBs for 4 KiB pages) and getting more TLB misses; various CPUs have "cache aliasing" problems (where you get worse cache efficiency when everything is aligned on a "too big" boundary because everything fights for the same slots in the cache); etc.


Cheers,

Brendan

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 2:10 pm
by FallenAvatar
Edit: I was wrong, nvm

- Monk

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 2:18 pm
by iansjack
Large pages may seem attractive at first glance, but they really aren't. There are often occasions when you want to allocate just 4K or so of memory (for example as a shared area for interprocess communication). The fact that there is another level of table doesn't make things significantly complicated and, as Brendan points out, the RAM savings you envisage are illusory.

Large pages have their uses but, in general, 4K is a pretty good page size. Don't throw away increased flexibility in pursuit of an illusion.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 2:38 pm
by zaval
yes, 4 KB are good. unless you have a TLB with only 32 entries. in addition, having of all the variety of page protection goodies, only a dirty flag, :lol: you start thinking: screw you, high page granularity bounty, I'm gonna map all of my kernel in a huge 16MB page. :lol:
but of course, this is irrelevant for x86. it's even here, in mips, might be much better, but the vendor chose to cut of as much functionality as possible.
I'd not avoid to use 4K pages on a resource and feature rich x86 platform.

PS. Of course, I overstated it a little, because here, you have an unmapped segment 256MB of weight, for the kernel to not bother with self-mapping at all at least at early stages; but for a user mode, still, having 4KB pages will be a constant TLB refilling.)

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 3:23 pm
by DevNoteHQ
zaval wrote:yes, 4 KB are good. unless you have a TLB with only 32 entries. in addition, having of all the variety of page protection goodies, only a dirty flag, :lol: you start thinking: screw you, high page granularity bounty, I'm gonna map all of my kernel in a huge 16MB page. :lol:
but of course, this is irrelevant for x86. it's even here, in mips, might be much better, but the vendor chose to cut of as much functionality as possible.
I'd not avoid to use 4K pages on a resource and feature rich x86 platform.
Well i am a beginner. I started 2 weeks ago, so please don't kill me if i say something wrong:
PML4T -> points to -> PDPT -> points to -> PD -> points to -> PT -> points to page
Each address has to be 4K aligned. That means a PT entry can only map one page. Now if i would want to map the PT i'll have to add another page for a PT entry and soon i'll get into a circle of mapping pages. If i map the PT to itself then i have the problem that the page has a entry that shouldn't be touched by the program loaded and i have no idea how i should prevent that.
The point is using 4KiB pages as "system pages" is kind of inefficient because of the 4K alignment needed. If i use 2MiB pages as "system pages" than one page can be used to map 512 other pages.

Conclusion: I'll add 4KiB support. But the standard size will be 2MiB and i'll let the PMM search for 2MiB/1GiB (if needed) free space and then cut it in 512 pieces if i need 4K pages.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 8:13 pm
by eryjus
MathiLpHD wrote:Each address has to be 4K aligned. That means a PT entry can only map one page. Now if i would want to map the PT i'll have to add another page for a PT entry and soon i'll get into a circle of mapping pages. If i map the PT to itself then i have the problem that the page has a entry that shouldn't be touched by the program loaded and i have no idea how i should prevent that.
If I am reading and understanding your concern correctly, you are having trouble understanding recursive mapping. Check out this wiki section: http://wiki.osdev.org/Page_Tables#Recursive_mapping. You can also search the forum (I know I had a post myself on the topic from several years ago). The key here is that you only need a single recursive map at the top level (where the PML4 has an entry that points back to itself) and the rest of the math works out. You do not have to map each table.

Taking the last entry in the PML4 is the easiest in my opinion to understand. Sit down with a piece of paper and prove to yourself that you can reach all of your table structures when you set PML4[511] = PML4. Use that as a fact as you traverse your tables. It's elegant, but that also means it can take some time to get your head around what is really happening.

Re: x86-64 Paging - Just to be sure

Posted: Sat May 20, 2017 8:54 pm
by LtG
I'll mention once more: I'd recommend dropping large pages for the time being. Dealing with just 4KiB is easier and you can ignore the biggest problem which is how to find contiguous 2MiB frames which are 2MiB aligned as well.

Once you have your 4KiB only paging actually working and you understand it fully, then implement large pages. But remember what everyone has said in this thread, getting large pages to provide meaningful performance improvement is tricky. Naive large page support might even make the performance worse than not having it.

x86 optimization is always tricky, once you start dealing with very small details it becomes very tricky to get it right and unless you can test it and benchmark it, you won't know. There's many things in x86 optimization that are simply counter-intuitive.

PS. In case you haven't already, remember that you need to get the systems memory layout, check the Wiki for "memory map" and "detecting memory".