Confirmative question about 4-level and 5-level paging

Ethin · Post by **Ethin** » Thu Aug 26, 2021 7:31 pm

Just reading through chapter 4 of the Intel SDMs, vol. 3A, so I can better understand paging (I find it a bit confusing myself, or at least how the structures work), and just want to confirm the layout of the paging structures and what CR3 points to.
This is my understanding of paging in long mode when CR4.LA57 is set and clear:

If CR4.LA57 is set, then:
1. CR3 contains the physical address of a PML5 table in bits 51:12, with bits 11:0 representing flags/control bits.
2. The PML5 table contains 512 PML5 entries, each of which points to a PML4 table (not a PML4 entry). Each PML5 entry contains the physical address of that entry in bits 51:12 (which are taken from CR3) and 11:3 (which are bits 56:48 of the physical linear address), with bits 2:0 clear.
3. Each PML4 table referenced by each PML5 entry in the PML5 table contains 512 PML4 entries. Each PML4 entry contains the physical address of the PDPT it points to in bits 51:12 (taken from the PML5E) and bits 11:3 (physical linear address bits 47:39) and bits 2:0 clear.
4. Each PML4 entry points to a PDPT. Every PDPT contains 512 PDTs. Each PDT points to a page table with the physical address of that PT in bits 51:12 (taken from the PML4E) and bits 11:3 (which are bits 38:30 of the physical linear address of the PT).
5. Each PT contains 512 PTEs. Each PTE contains the physical address of a page in bits 51:12 (from the PDE) and bits 11:3 (which are bits 20:12 of the physical linear address).
6. Finally, each page is mapped by using bits 51:12 (from the PTE) and setting bits 11:0 to bits 11:0 of the page.
For 4-level paging, the process is identical but excludes the PML5 table.

Thus, for setting up paging, I would:

Create a PML5 or PML4 table and add one PML5 or PML4 entry to it, which points to my PDPT.
Create a PDPT and add as many PDTs as I need to it, pointing to page tables.
Create page tables at the specified addresses and write addresses for pages that I want to map into them, clearing the present bit.

Then to map a page, I would walk the entire page table (PML5 entry if 5-level paging is enabled, PML4 entry, PDPT, PT, page) and then, once I found the page, I'd flip the present bit. Then I'd run the INVLPG instruction to flush the TLB.

davmac314 · Post by **davmac314** » Thu Aug 26, 2021 8:57 pm

Ethin wrote:Thus, for setting up paging, I would:

Create a PML5 or PML4 table and add one PML5 or PML4 entry to it, which points to my PDPT.

For 5 level paging you'd need both a PML5 and PML4 table, with one entry in the PML5 referring to the PML4. (edit: but I'm guessing you understand that, you just missed it in the description here).

Ethin wrote:
Create a PDPT and add as many PDTs as I need to it, pointing to page tables.

Create page tables at the specified addresses and write addresses for pages that I want to map into them, clearing the present bit.
Then to map a page, I would walk the entire page table (PML5 entry if 5-level paging is enabled, PML4 entry, PDPT, PT, page) and then, once I found the page, I'd flip the present bit. Then I'd run the INVLPG instruction to flush the TLB.

Typically, at the point that you know which page you want to map where, you'd write its address *and* set the present bit immediately. There's not much point writing a page address without setting the present bit, unless you need to be able to trap accesses to the page for some reason.

Octocontrabass · Post by **Octocontrabass** » Thu Aug 26, 2021 11:01 pm

You're mixing up what goes into the entry with how the CPU calculates the physical address of the entry. You've also skipped one level.

Each entry contains bits 51:12 of the physical address of the next level's table in bits 51:12, and various other data in the remaining bits.

CR3 contains the physical address of a PML5 or a PML4 depending on CR4.

Each PML5 contains 512 PML5Es, and each PML5E contains the physical address of a PML4.
Each PML4 contains 512 PML4Es, and each PML4E contains the physical address of a PDPT.
Each PDPT contains 512 PDPTEs, and each PDPTE contains the physical address of a PD.
Each PD contains 512 PDEs, and each PDE contains the physical address of a PT.
Each PT contains 512 PTEs, and each PTE contains the physical address of a 4 KiB page.
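To make the "address in bits 51:12, other data in the remaining bits" point concrete, here is a rough sketch in Python (a model of the arithmetic only, not kernel code). The names `make_entry` and `entry_address` are made up for illustration, only the present and writable flags are modelled, and the 52-bit limit is the architectural maximum; a real CPU may support fewer physical address bits.

```python
# Bit layout assumed here: bit 0 = present, bit 1 = writable,
# bits 51:12 = physical address of the next table (or of the 4 KiB page).
PRESENT = 1 << 0
WRITABLE = 1 << 1
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF  # selects bits 51:12


def make_entry(next_phys, flags):
    """Build an entry that refers to a 4 KiB-aligned physical address."""
    if next_phys & ~ADDR_MASK:
        raise ValueError("address must be 4 KiB aligned and below 2**52")
    return next_phys | flags


def entry_address(entry):
    """Recover the physical address stored in bits 51:12 of an entry."""
    return entry & ADDR_MASK


# Hypothetical example: a PML4E referring to a PDPT at physical 0x1234567000.
pml4e = make_entry(0x0000_0012_3456_7000, PRESENT | WRITABLE)
print(hex(pml4e))                 # table address with the two low flag bits set
print(hex(entry_address(pml4e)))  # the table address again, flags stripped
```

Nothing from the linear address is stored in the entry; the linear address only selects which of the 512 entries the CPU reads.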

It's easier to follow if you throw out Intel's dumb names and call them PML3, PML2, and PML1 instead of PDPT, PD, and PT.

Why do you seem to assume you would only need one PML5E and one PML4E? How will you walk the page tables when you can't directly use physical addresses?

The CPU's page translation caches will never cache entries that are not present. Typically you would use INVLPG immediately after you clear the present bit, so it wouldn't be necessary to use it again when you set the present bit.

Rukog · Post by **Rukog** » Fri Aug 27, 2021 2:04 am

Ethin wrote:Just reading through chapter 4 of the Intel SDMs, vol. 3A, so I can better understand paging (I find it a bit confusing myself, or at least how the structures work), and just want to confirm the layout of the paging structures and what CR3 points to.
This is my understanding of paging in long mode when CR4.LA57 is set and clear:

If CR4.LA57 is set, then:

CR3 contains the physical address of a PML5 table in bits 51:12, with bits 11:0 representing flags/control bits.

The PML5 table contains 512 PML5 entries, each of which points to a PML4 table (not a PML4 entry). Each PML5 entry contains the physical address of that entry in bits 51:12 (which are taken from CR3) and 11:3 (which are bits 56:48 of the physical linear address), with bits 2:0 clear.

Each PML4 table referenced by each PML5 entry in the PML5 table contains 512 PML4 entries. Each PML4 entry contains the physical address of the PDPT it points to in bits 51:12 (taken from the PML5E) and bits 11:3 (physical linear address bits 47:39) and bits 2:0 clear.

Each PML4 entry points to a PDPT. Every PDPT contains 512 PDTs. Each PDT points to a page table with the physical address of that PT in bits 51:12 (taken from the PML4E) and bits 11:3 (which are bits 38:30 of the physical linear address of the PT).

Each PT contains 512 PTEs. Each PTE contains the physical address of a page in bits 51:12 (from the PDE) and bits 11:3 (which are bits 20:12 of the physical linear address).

Finally, each page is mapped by using bits 51:12 (from the PTE) and setting bits 11:0 to bits 11:0 of the page.

For 4-level paging, the process is identical but excludes the PML5 table.
Thus, for setting up paging, I would:

Create a PML5 or PML4 table and add one PML5 or PML4 entry to it, which points to my PDPT.

Create a PDPT and add as many PDTs as I need to it, pointing to page tables.

Create page tables at the specified addresses and write addresses for pages that I want to map into them, clearing the present bit.
Then to map a page, I would walk the entire page table (PML5 entry if 5-level paging is enabled, PML4 entry, PDPT, PT, page) and then, once I found the page, I'd flip the present bit. Then I'd run the INVLPG instruction to flush the TLB.

If you want to understand Paging, do it with the 1-GByte page mapping instead, way easier to setup than other paging mode structure.

iansjack · Post by **iansjack** » Fri Aug 27, 2021 4:02 am

Rukog wrote:If you want to understand Paging, do it with the 1-GByte page mapping instead, way easier to setup than other paging mode structure.

But far less versatile and only useful in very specific cases.

nexos · Post by **nexos** » Fri Aug 27, 2021 8:14 am

Rukog wrote:If you want to understand Paging, do it with the 1-GByte page mapping instead, way easier to setup than other paging mode structure.

1G paging is only supported on newer CPUs. I wouldn't depend on it. Also, 1G granularity is unacceptable for basically every situation. The only time 1G paging would be useful is when mapping all of physical memory, but I'm not going to do that to begin with.

Octocontrabass · Post by **Octocontrabass** » Fri Aug 27, 2021 8:24 am

Rukog wrote:If you want to understand Paging, do it with the 1-GByte page mapping instead, way easier to setup than other paging mode structure.

I wouldn't say it's easier to set up. Large pages aren't allowed to cross effective cache type boundaries, so you need to check the MTRRs to choose appropriate cache settings for each page. For 1GiB pages, you'll have to map a lot of memory as uncacheable, which is a significant performance hit.

Ethin · Post by **Ethin** » Fri Aug 27, 2021 10:22 am

Octocontrabass wrote:You're mixing up what goes into the entry with how the CPU calculates the physical address of the entry. You've also skipped one level.

Each entry contains bits 51:12 of the physical address of the next level's table in bits 51:12, and various other data in the remaining bits.

CR3 contains the physical address of a PML5 or a PML4 depending on CR4.

Each PML5 contains 512 PML5Es, and each PML5E contains the physical address of a PML4.
Each PML4 contains 512 PML4Es, and each PML4E contains the physical address of a PDPT.
Each PDPT contains 512 PDPTEs, and each PDPTE contains the physical address of a PD.
Each PD contains 512 PDEs, and each PDE contains the physical address of a PT.
Each PT contains 512 PTEs, and each PTE contains the physical address of a 4 KiB page.

It's easier to follow if you throw out Intel's dumb names and call them PML3, PML2, and PML1 instead of PDPT, PD, and PT.

Why do you seem to assume you would only need one PML5E and one PML4E? How will you walk the page tables when you can't directly use physical addresses?

The CPU's page translation caches will never cache entries that are not present. Typically you would use INVLPG immediately after you clear the present bit, so it wouldn't be necessary to use it again when you set the present bit.

The reason I'd only need one PML5E and one PML4E is because a PML5 entry controls a 256-TByte region of memory and a PML4E covers 512-GBytes of memory. A PDE covers either a 1-GByte or 2-MByte memory region and a PTE covers a 4-KByte memory region. Is this not correct?

Octocontrabass · Post by **Octocontrabass** » Fri Aug 27, 2021 11:10 am

Ethin wrote:The reason I'd only need one PML5E and one PML4E is because a PML5 entry controls a 256-TByte region of memory and a PML4E covers 512-GBytes of memory. A PDE covers either a 1-GByte or 2-MByte memory region and a PTE covers a 4-KByte memory region. Is this not correct?

You're still missing a level. A PDPTE covers a 1 GiB region and a PDE covers a 2 MiB region.

These are regions of linear addresses, not physical addresses. If you have only one PML4E, then you can only use linear addresses within that 512 GiB region. Why do you want to limit which linear addresses you can use?
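The region sizes follow directly from 12 offset bits plus 9 index bits per level. A small Python sketch of that arithmetic (the function name `region_size` and the level numbering, with level 1 being the PTE, are just conventions chosen for this example):

```python
PAGE_SHIFT = 12      # a 4 KiB page covers 2**12 bytes
BITS_PER_LEVEL = 9   # 512 entries per table = 9 index bits

# Level 1 is the lowest level (PTE), level 5 is the PML5E.
LEVEL_NAMES = ["PTE", "PDE", "PDPTE", "PML4E", "PML5E"]


def region_size(level):
    """Bytes of linear address space covered by one entry at this level."""
    return 1 << (PAGE_SHIFT + BITS_PER_LEVEL * (level - 1))


for level, name in enumerate(LEVEL_NAMES, start=1):
    shift = PAGE_SHIFT + BITS_PER_LEVEL * (level - 1)
    print(f"{name:6} covers 2**{shift} bytes = {region_size(level)}")
```

So a PTE covers 4 KiB, a PDE 2 MiB, a PDPTE 1 GiB, a PML4E 512 GiB and a PML5E 256 TiB of linear address space.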

Ethin · Post by **Ethin** » Fri Aug 27, 2021 5:14 pm

Octocontrabass wrote:
Ethin wrote:The reason I'd only need one PML5E and one PML4E is because a PML5 entry controls a 256-TByte region of memory and a PML4E covers 512-GBytes of memory. A PDE covers either a 1-GByte or 2-MByte memory region and a PTE covers a 4-KByte memory region. Is this not correct?
You're still missing a level. A PDPTE covers a 1 GiB region and a PDE covers a 2 MiB region.

These are regions of linear addresses, not physical addresses. If you have only one PML4E, then you can only use linear addresses within that 512 GiB region. Why do you want to limit which linear addresses you can use?

Oh okay, now I understand. You're right, I'd want to have as many regions available as possible. For filling the tables, could I just generate random addresses in bits 51:12 since these are (linear) addresses? I ask only because I get the feeling I couldn't just use address 1, 2, 3, 4, ..., for each region.

Octocontrabass · Post by **Octocontrabass** » Fri Aug 27, 2021 5:43 pm

Ethin wrote:For filling the tables, could I just generate random addresses in bits 51:12 since these are (linear) addresses?

Linear addresses determine which entries you fill, physical addresses are what you fill them with. You probably don't want either the linear or physical address to be completely random, although there are security techniques that involve some amount of randomization for both.

Ethin wrote:I ask only because I get the feeling I couldn't just use address 1, 2, 3, 4, ..., for each region.

Sure you can.

Ethin · Post by **Ethin** » Fri Aug 27, 2021 6:21 pm

Octocontrabass wrote:
Ethin wrote:For filling the tables, could I just generate random addresses in bits 51:12 since these are (linear) addresses?
Linear addresses determine which entries you fill, physical addresses are what you fill them with. You probably don't want either the linear or physical address to be completely random, although there are security techniques that involve some amount of randomization for both.

Ethin wrote:I ask only because I get the feeling I couldn't just use address 1, 2, 3, 4, ..., for each region.
Sure you can.

I don't understand. How would that work? Wouldn't I want the regions to (not) overlap?

Octocontrabass · Post by **Octocontrabass** » Fri Aug 27, 2021 6:45 pm

The linear addresses mapped by the tables never overlap. You can't have two different things at the same linear address unless there are two separate sets of tables and you switch between the two.

The physical addresses can overlap, in the sense that you can map the same physical address more than once. The only reason I can think of to do this is to access the page tables: you must map the same physical memory as both a page and a table in order to read or write the table.

davmac314 · Post by **davmac314** » Fri Aug 27, 2021 8:48 pm

Octocontrabass wrote:The only reason I can think of to do this is to access the page tables: you must map the same physical memory as both a page and a table in order to read or write the table.

I don't think referring to a page table via a PDE or other higher-level directory is normally referred to as "mapping". The term comes from the notion of mapping a linear address to a physical address. Merely having a directory present at some level and on some physical page doesn't (at least in my view) count as a mapping. That said, you typically do want such pages mapped in to some linear address, so you can manipulate them.

Ethin, since it sounds at this point like you have a misunderstanding at some conceptual level, I'll try to explain the paging mechanism the way I think of it.

First thing to note is that the mechanism is primarily about mapping linear addresses to physical addresses.

2nd thing to note is that, regardless of how many levels of "paging" you have (5, 4, or fewer), conceptually it works in the same way: each level is a directory (regardless of what it is called) which divides up the linear address range into a number of entries, where each entry specifies the physical address of the next-level directory. The exception is the final level, i.e. the page table level, which provides the physical address of the actual page the linear address will map to. For mapping a linear address to a physical address, you essentially peel off a certain number of bits of the linear address to find the index within the current-level directory (starting at the top), and take the physical address of the next-level directory from that entry.
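The "peel off bits" step can be sketched like this in Python (purely illustrative; `split_linear` is a made-up helper, and it assumes 4 KiB pages with 9 index bits per level, so it ignores large pages and canonical-address checks):

```python
def split_linear(addr, levels=4):
    """Return ([top-level index, ..., PT index], offset within the page)."""
    offset = addr & 0xFFF
    indices = []
    for level in range(levels, 0, -1):
        shift = 12 + 9 * (level - 1)   # 39, 30, 21, 12 for 4-level paging
        indices.append((addr >> shift) & 0x1FF)
    return indices, offset


# Build an address from known indices so the result is easy to check by eye:
# PML4 index 1, PDPT index 2, PD index 3, PT index 4, page offset 5.
addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_linear(addr, levels=4))

# With 5-level paging, one more 9-bit index is peeled off bits 56:48.
print(split_linear(addr | (7 << 48), levels=5))
```

The only difference between 4-level and 5-level paging in this model is whether the loop starts at bit 48 or bit 39.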

Ethin wrote:Oh okay, now I understand. You're right, I'd want to have as many regions available as possible. For filling the tables, could I just generate random addresses in bits 51:12 since these are (linear) addresses?

They are not linear addresses. They are physical addresses.

The linear address determines how (by what linear address) the memory will be accessed once it is mapped. The physical address determines what physical memory is actually accessed. Usually when you want to map some memory you either want to map a specific address (eg a framebuffer might be accessible via a specific physical address) or you want to choose an address that is available (not mapped elsewhere). I don't think there's much point randomising this. For the linear address, you may randomise it but must make sure that it's not already in use, so it can't be just completely random.

The complete mapping process is something like:

1. find a suitable linear address
2. find a suitable physical address
3. figure out the page table, and the entry within, for the chosen linear address
4. write the chosen physical address in that entry

Step (3) is complicated by the fact that to write to the page table, you'll need to have the page table itself mapped into memory. Also "walking the hierarchy" which you referred to earlier is theoretically possible, but you need to be aware that because the entries store physical addresses, you need at each level for the directory page itself to be mapped in to the linear address space at some address so you can read it, and you need some way to determine what that address is.
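Here is a toy model of the whole process in Python, as a sketch only. Physical memory is faked with a dict keyed by physical frame address, which conveniently sidesteps the exact complication described above: the model can read any table by its physical address, whereas a real kernel needs each table mapped at some linear address first. All names (`phys_mem`, `alloc_table`, `map_page`, `translate`) are invented for this example, and only the present bit is modelled.

```python
PRESENT = 1
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF  # bits 51:12 of an entry

phys_mem = {}         # hypothetical: physical frame address -> 512 entries
next_free = [0x1000]  # hypothetical bump allocator for table frames


def alloc_table():
    """Hand out a fresh, zeroed table and return its 'physical' address."""
    addr = next_free[0]
    next_free[0] += 0x1000
    phys_mem[addr] = [0] * 512
    return addr


def indices(linear, levels):
    """Table indices from the top level down to the PT."""
    return [(linear >> (12 + 9 * (lvl - 1))) & 0x1FF
            for lvl in range(levels, 0, -1)]


def map_page(root, linear, phys, levels=4):
    """Walk from the root, creating missing tables, then write the PTE."""
    idx = indices(linear, levels)
    table = root
    for i in idx[:-1]:
        entry = phys_mem[table][i]
        if not entry & PRESENT:
            entry = alloc_table() | PRESENT
            phys_mem[table][i] = entry
        table = entry & ADDR_MASK
    # Address and present bit are written together in one step.
    phys_mem[table][idx[-1]] = (phys & ADDR_MASK) | PRESENT


def translate(root, linear, levels=4):
    """Mimic the CPU's walk; None stands in for a page fault."""
    table = root
    for i in indices(linear, levels):
        entry = phys_mem[table][i]
        if not entry & PRESENT:
            return None
        table = entry & ADDR_MASK
    return table | (linear & 0xFFF)


root = alloc_table()
map_page(root, 0x0000_7000_0040_3000, 0x0020_0000)
print(hex(translate(root, 0x0000_7000_0040_3ABC)))  # 0x200abc
print(translate(root, 0x1000))                      # None (not mapped)
```

The linear address only decides which slots get filled; the values written into those slots are always physical addresses.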

Probably the easiest way to start out is:

go with a single entry in PML4 (and PML5) as you were first thinking; this lets you map 512GB which is easily enough to start with.
this means you need a single PDPT, but this should have the full 512 entries
meaning you need 512 PD pages, each with 512 entries
and therefore you need 512 * 512 PT pages - that's a full gigabyte worth, mind!
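The counts in this list can be checked with a few lines of Python. This is just arithmetic under the assumption of 4 KiB pages and a range that starts at linear address 0; `table_pages` is a hypothetical helper name.

```python
PAGE = 4096
MIB = 1 << 20
GIB = 1 << 30


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def table_pages(span_bytes):
    """(PDPTs, PDs, PTs) needed to map span_bytes with 4 KiB pages."""
    pts = ceil_div(span_bytes, 2 * MIB)      # one PT covers 2 MiB
    pds = ceil_div(span_bytes, GIB)          # one PD covers 1 GiB
    pdpts = ceil_div(span_bytes, 512 * GIB)  # one PDPT covers 512 GiB
    return pdpts, pds, pts


pdpts, pds, pts = table_pages(512 * GIB)
print(pdpts, pds, pts)                # 1 512 262144
print(pts * PAGE // GIB, "GiB of PTs")  # 1 GiB of PTs

# A smaller range needs far fewer tables, e.g. the first 4 GiB:
print(table_pages(4 * GIB))           # (1, 4, 2048)
```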

It should be obvious that this scheme potentially wastes a heap of memory, so you might choose not to have the full 512 * 512 PTs, but that means you won't be able to map the full 512GB range (at least, not without allocating more PTs later on, or using larger-sized pages). However, you can arrange all the required directory pages in a pretty straightforward layout somewhere in memory, which makes it easy to find them when you want to manipulate the mappings later on. Eg:

At address P: 1 PML5
At P+4k: 1 PML4
At P+8k: 1 PDPT
At P+12k: a series of 512 PDs
At P+12k+(512*4k): a series of 512*512 PTs (or fewer, if you don't need a full 512GB worth)

Also, in the beginning it's probably easiest to set up a one-to-one linear-to-physical mapping, i.e. map each linear address to the same physical address. This also means you can easily walk the paging directory hierarchy without worrying about the physical/linear difference, but you don't actually need to, because it's trivial to find the PTE for a particular linear address: divide the address by 4096, and that's an index into the series of PTs that begins at P+12k+(512*4k), treated as one long array of entries (divide by 2MB instead to get the index of the PT itself). To get this set up, note that you have 512 PDs each with 512 entries, therefore a total of 512*512 PDEs, laid out so that you can consider them as a single array; the first should refer to the first PT, the 2nd to the 2nd PT, and so on. Likewise, the 512 PDPT entries should refer to the 512 PDs in order.
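A sketch of that lookup in Python, for illustration only. The start addresses of the PD series and the PT series are passed in as parameters (`pd_base`, `pt_base`, both hypothetical names), so the arithmetic does not depend on exactly how the pages before them are laid out. It assumes 8-byte entries, contiguous tables, and linear addresses below 512GB.

```python
PAGE = 4096
ENTRY_SIZE = 8  # each paging-structure entry is 8 bytes
PRESENT = 1


def pt_address(pt_base, linear):
    """Physical address of the PT covering `linear` (one PT per 2 MiB)."""
    return pt_base + (linear >> 21) * PAGE


def pte_address(pt_base, linear):
    """Physical address of the PTE for `linear`, viewing all PTs as one array."""
    return pt_base + (linear >> 12) * ENTRY_SIZE


def pde_address(pd_base, linear):
    """Physical address of the PDE for `linear`, viewing all PDs as one array."""
    return pd_base + (linear >> 21) * ENTRY_SIZE


def identity_pte(linear):
    """PTE value that maps a page onto the same physical address."""
    return (linear & ~0xFFF) | PRESENT


# Assumed for this example: P is 16 MiB, and three single pages
# (PML5, PML4, PDPT) come before the PDs.
P = 0x100_0000
PD_BASE = P + 3 * PAGE
PT_BASE = PD_BASE + 512 * PAGE

linear = 0x20_3000  # 2 MiB + 12 KiB: second PT, fourth entry in it
print(hex(pt_address(PT_BASE, linear)))
print(hex(pte_address(PT_BASE, linear)))
print(hex(pde_address(PD_BASE, linear)))
print(hex(identity_pte(linear)))
```

With identity mapping in place, the addresses these functions return can be used directly as pointers, which is what makes this layout convenient to start with.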

Once you've got that set up and working, then you can think about improving it and extending it.

OSDev.org
