Page 1 of 1

Understanding the IA-32e mode paging structures/translation

Posted: Wed Jan 12, 2011 12:26 pm
by lallous
Hello,

1)

I want to make sure that I understood %topic% properly.

The 64bit linear address has its 48bits used only. Split into:

Code: Select all

PML4, Directory Ptr, Directory, Table, Offset
The CR3 points to the PML4 physical address.

So, given a linear address L, to convert it to physical:

1. CR3 -> PML4 physical base, then multiply by 4k
2. Now select a PML4 entry: PML4[L.PML4] -> Takes us to a 4k page structure (the PDPT)
3. Given the PDPT (which has 512 entries) select a page directory:
PDPT[L.DirectoryPtr] = PageDir
4. Now PageDir is another 4k page structure that has 512 entries. Each entry points to a page table
5. Select the page table: PageDir[L.Directory] = PageTable
6. Now again the page table has 4k of size and 512 PTE entries, we select a PTE
PTE = PageTable[L.Table]

Physical = PTE.page_frame * 4kb + L.Offset

Is my understanding correct?

2)

Assume we use 4kb pages. To pre-create a PML4, we need:
4k (for the PML4) + (4k (PDPT) * 4k (PD) * 4k (PT) ) = 0x1000001000L of memory space, right?

So in the case of 4k pages, it is better to allocate each 4kb structure on demand.

3)
If we want to pre-create a 2mb page / PML4 structure:

4k (for the PML4) + (4k (PDPT)* 4k (PD) ) = 0x1001000, right?

Thanks.

Re: Understanding the IA-32e mode paging structures/translat

Posted: Wed Jan 12, 2011 1:38 pm
by Combuster
lallous wrote:Assume we use 4kb pages. To pre-create a PML4, we need:
4k (for the PML4) + (4k (PDPT) * 4k (PD) * 4k (PT) ) = 0x1000001000L of memory space, right?

So in the case of 4k pages, it is better to allocate each 4kb structure on demand.
Every entry in the PML4 points to the location of a PDPT
Every entry in a PDPT points to the location of a PD (or a 1G page)
Every entry in a PD points to the location of a PT (or a 2M page)
Every entry in a PT points to a 4k page.

To map 4k of the address space, you need one PT, one PD to refer to that PT, one PDPT and one PML4, which requires 16k of space, if you want to map everything using 4k pages, each table refers to 512 subtables: (1 + 512*(1 + 512*(1 + 512 * (1)) * 4096 = 0x0000_0080_4020_1000 bytes, just a little over 513GB

In any case, precreating address spaces cost more RAM than virtually any application that will be using that space.

Re: Understanding the IA-32e mode paging structures/translat

Posted: Wed Jan 12, 2011 8:10 pm
by gravaera

Re: Understanding the IA-32e mode paging structures/translat

Posted: Wed Jan 12, 2011 11:03 pm
by Darwish
(Sorry if this was more than you asked for, but some time ago I found that I need to research all of this to understand the topic; ‘**’ is the power operator)
lallous wrote:I want to make sure that I understood %topic% properly.
To really grok this whole paging thing, you need to understand why it was created. The very-first time-sharing computer, MIT's Compatible Time-Sharing System (CTSS), faced the problem of protecting the supervisor (kernel) from the rest of system user processes. They did it by statically reserving the low 5-Kbytes of system memory to the kernel, and the rest 27-Kbytes to user applications. Afterwards, they wanted to avoid memory allocation clashes between users, but due to hardware limitations of the time, they solved that problem by allowing only one program in core memory at any point in time! [1]

As a solution to the “protect users from each other” problem, the ATLAS computer pioneered abstracting the concept of an “address” from the low-level detail of a physical memory location. Thus, from a memory perspective, each process will have its own world view (and no need for historical tricks like memory overlays: demand paging automatically does a similar effect by only loading parts of the executable that are currently needed; a process image can thus be bigger than the amount of contiguous free physical memory available). Now, If we are going to map every byte of the process address space, we'll need a mapping table with the size of physical memory or larger! To solve this, we group physical memory addresses into “blocks”, and map these “blocks” instead of singular bytes in the mapping table. [2]

(NOTE-1: the whole difference between segmentation and paging is that the former divides the address space into blocks of arbitrary size, while the latter organizes such space into uniform block sizes.)

Now, imagine having a 32-bit address space while mapping “blocks” of 4-Kbytes in size. Each mapping table entry will equal 4-bytes to be able to point to the whole 32-bit space. Since each entry maps 4-Kbytes (2**12) of address space, we'll need (2**32/2**12 = 2**20) entries to map the entire 4-Gbyte/32-bit space. So, the mapping table size will equal ((number of needed entries * entry size) = 2**20 * 2**2 = 2**22 bytes = 4-Mbytes) to function. i.e. using that simple scheme, we'll need a 4-MByte mapping table for each process in the system! I'm sure you won't like to use any kernel like that :)

(NOTE-2: Hardware designers intentionally choose power-of-2 page sizes to quickly find a virtual address page number and offset using right and left shifts. Otherwise, they'll have to use the too-slow division operator; check the cited Denning paper for further details.)

If we are going to analyze why each process needed a 4-Mbyte mapping table above, we'll find that the key problem was mapping unused parts of the address space. And to further complicate matters, most processes use their memory space in a very sparse manner; e.g. using the very top for the stack, and the very-bottom for the executable code and data. Thus, the paging scheme has to support virtual addresses moving up and down.

Hardware architecture books describe several solutions to the above problem. For the contemporary x86 CPUs, the solution chosen was to page the page-table itself. Intel and AMD64 documents cover the details.

[1] F.J. Corbato et al., “An Experimental Time-Sharing System”, AIEE-IRE Proceedings of the 1962 Spring Joint Computer Conference, MIT
[2] P.J. Denning, “Virtual Memory”, 1970, ACM Computing Surveys
lallous wrote: If we want to pre-create a 2mb page / PML4 structure:
4k (for the PML4) + (4k (PDPT)* 4k (PD) ) = 0x1001000, right?
(I'll just call the PDP and PD as ‘PML3’ and ‘PML2’ respectively, it makes more sense that way)

Because the AMD designers wanted to limit requiring contiguous physical memory as far as possible, each page table was designed not to exceed 4-Kbytes of contiguous RAM. Since each entry here points to a full 64-bit address, such page table entry sizes will equal 8-bytes. Thus, max number of possible entries in a table = (page table size / entry size = 2**12 / 2**3 = 2**9) entries. That's in fact why the PML4, PML3, and PML2 offsets are 9-bits each and why all of these tables also have 512 entries each. This leads us to:
  • A PML2 entry covers a 2-Mbyte region since it directly points to a 2MB page
  • A PML3 entry covers a 1-Gbyte region since it points to 512 PML2 entries
  • A PML4 entry covers a 512-Gbyte region since it points to 512 PML3 entries
Another explanation using page offsets:
  • A PML2 entry covers a 2-Mbyte region since 2**21 = 2**20 * 2 = 2 Mbytes
  • A PML3 entry covers a 1-Gbyte region since 2**30 = 2**20 * 2**10 = 1024-Mbytes = 1 Gbyte
  • A PML4 entry covers a 512-Gbyte region since 2**39 = 2**30 * 2**9 = 512 Gbytes
Good luck.

(Edit: Add a small note on memory overlays and demand paging.)