Hi,
iseyler wrote:What I am trying to get is full and complete access to all of the available RAM in the system without losing any to memory-mapped PCI devices. Also I am using 2MB pages so I don't use PTs.
Most CPUs have separate TLBs for 4 KiB pages and large pages, and if you don't use 4 KiB pages you won't be able to use most of the TLB and you'll get worse performance because of that. For example, here are the TLB details for an Intel Core 2 CPU (originally from Sandpile):
- 4 KiB Code TLBs - 128 Entries, 4-Way, LRU
- Large Code TLBs - 8 Entries (2 MiB pages only) or 4 Entries (4 MiB pages only), 4-Way, LRU
- 4 KiB Data TLBs - L0: 16 Entries, 4-Way, LRU, Loads Only; L1: 256 Entries, 4-Way, LRU
- Large Data TLBs - L0: 16 Entries, 4-Way, LRU, Loads Only; L1: 32 Entries, 4-Way, LRU
- PDC - 32 Entries, 4-Way, LRU (note: I have no idea what "PDC" is - I'm guessing it's "Page Directory Cache" that caches page directory entries instead of page table entries to reduce lookup times for TLB misses)
And here are the details for AMD's K8:
- 4 KiB Code L1 - 32 Entries, Fully, LRU, 2 Cycle Latency
- 4/2 MiB Code L1 - 8 Entries, Fully, LRU, 2 Cycle Latency
- 4 KiB Data L1 - 32 Entries, Fully, LRU -- Dual, 2 Cycle Latency
- 4/2 MiB Data L1 - 8 Entries, Fully, LRU -- Dual, 2 Cycle Latency
- 4 KiB Code L2 - 512 Entries, 4-Way, LRU, Exclusive, 10 Cycle Latency
- 4 KiB Data L2 - 512 Entries, 4-Way, LRU, Exclusive, 10 Cycle Latency
- Miscellaneous - 24 Entry PDE Cache; 32 Entry CAM Flush Filter
As you can see, in almost all cases there are between 4 and 16 times as many 4 KiB TLB entries as there are 2 MiB TLB entries. The exceptions are the L0 data TLBs in Intel Core 2 (where 4 KiB pages have the same number of entries as 2 MiB pages) and AMD K8's L2 TLBs (where there simply aren't any entries for 2 MiB pages).
Also note that performance differences will depend on data usage. For a simple example, if you reserve 2 MiB for each process's stack, then using a 2 MiB page for it will probably waste that TLB entry, because only a small part of the area is likely to be used often and you'd probably get the same number of TLB misses from a single 4 KiB TLB entry.
Basically, for maximum TLB efficiency (minimum number of TLB misses) you'd want to use large page TLB entries for large data structures (where all data in the 2 MiB area is likely to be accessed at the same time), and use 4 KiB TLB entries for everything else (e.g. when you've run out of large page TLB entries).
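To illustrate, here's a rough sketch of that kind of policy when setting up mappings. The map_4kib_page()/map_2mib_page() helpers are made up for this example (here they just print what they'd map) and would be whatever your kernel's real page table code does:

/* Sketch: map [virt, virt + size) to [phys, phys + size), using 2 MiB pages
   for the aligned middle and 4 KiB pages for the unaligned head and tail.
   Assumes size is a multiple of 4 KiB and that virt and phys have the same
   offset within a 2 MiB boundary. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_4K 0x1000ULL
#define PAGE_2M 0x200000ULL

static void map_4kib_page(uint64_t virt, uint64_t phys)
{
    printf("4 KiB page: %#llx -> %#llx\n",
           (unsigned long long)virt, (unsigned long long)phys);
}

static void map_2mib_page(uint64_t virt, uint64_t phys)
{
    printf("2 MiB page: %#llx -> %#llx\n",
           (unsigned long long)virt, (unsigned long long)phys);
}

static void map_range(uint64_t virt, uint64_t phys, uint64_t size)
{
    uint64_t end = virt + size;

    /* 4 KiB pages for the unaligned head */
    while (virt < end && (virt & (PAGE_2M - 1)) != 0) {
        map_4kib_page(virt, phys);
        virt += PAGE_4K; phys += PAGE_4K;
    }
    /* 2 MiB pages for the aligned middle */
    while (virt + PAGE_2M <= end) {
        map_2mib_page(virt, phys);
        virt += PAGE_2M; phys += PAGE_2M;
    }
    /* 4 KiB pages for the remaining tail */
    while (virt < end) {
        map_4kib_page(virt, phys);
        virt += PAGE_4K; phys += PAGE_4K;
    }
}

int main(void)
{
    /* e.g. a 4 MiB + 32 KiB region starting 16 KiB before a 2 MiB boundary */
    map_range(0x1FC000, 0x1FC000, 0x408000);
    return 0;
}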
Then there's RAM usage. For 4 KiB pages you'd need to use RAM for page tables (and this RAM could be considered wasted). For "2 MiB pages only", if you need to allocate 10 bytes for something then you'd be forced to allocate an entire 2 MiB page and you'd waste almost all of it. Regardless of what you do, RAM will be wasted for one reason or another. It's hard (impossible) to predict which method would waste more RAM unless you know in advance how many processes will be running and how much space each process will use.
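To put some rough (invented) numbers on that trade-off, here's a back-of-the-envelope comparison between the page table cost of mapping 4 GiB with 4 KiB pages and the rounding waste of a few thousand small allocations under "2 MiB pages only":

/* Invented example numbers, just to show the trade-off:
   - 4 KiB pages: one 4 KiB page table maps 2 MiB, so mapping N bytes costs
     roughly N / 512 bytes of page tables (plus a little for page directories).
   - "2 MiB pages only": each allocation gets rounded up to 2 MiB, wasting
     about 1 MiB per allocation on average. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t mapped = 4ULL << 30;            /* 4 GiB of mappings */
    uint64_t pt_overhead = mapped / 512;     /* ~8 MiB of page tables */

    uint64_t allocations = 2000;             /* e.g. 2000 small allocations */
    uint64_t avg_waste = 1ULL << 20;         /* ~1 MiB wasted per allocation */
    uint64_t rounding_waste = allocations * avg_waste;

    printf("4 KiB pages:       ~%llu MiB of page tables\n",
           (unsigned long long)(pt_overhead >> 20));
    printf("2 MiB pages only:  ~%llu MiB lost to rounding\n",
           (unsigned long long)(rounding_waste >> 20));
    return 0;
}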
iseyler wrote:What I would like is to map all of the physical RAM to start at 0xFFFF800000000000 and run the OS and apps in there
Because there's no segmentation for 64-bit code, that would mean that 64-bit processes would need to use relocatable code, and that you can't have any protection between separate processes, or between processes and the kernel (unless you use "software isolation", like Singularity).
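For what it's worth, the appeal of that scheme is that physical/virtual translation becomes a simple offset. Here's a minimal sketch (the base address comes from your question, the helper names are made up); the comment notes what you give up:

#include <stdio.h>
#include <stdint.h>

#define PHYS_MAP_BASE 0xFFFF800000000000ULL  /* base from the original question */

/* With every byte of RAM visible at PHYS_MAP_BASE + phys, translating between
   physical and virtual addresses is just an offset. The downside is that every
   process sees the same addresses, so paging can no longer isolate processes
   from each other or from the kernel. */
static inline uint64_t phys_to_virt(uint64_t phys) { return PHYS_MAP_BASE + phys; }
static inline uint64_t virt_to_phys(uint64_t virt) { return virt - PHYS_MAP_BASE; }

int main(void)
{
    uint64_t phys = 0x12345000;
    printf("phys %#llx -> virt %#llx\n",
           (unsigned long long)phys, (unsigned long long)phys_to_virt(phys));
    return 0;
}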
iseyler wrote:So if a system had 4GB of RAM and a video card with 1GB of RAM you could theoretically access 5GB of memory space.
At 32 bits per pixel, 5 GiB is enough space for a 36000 * 36000 high resolution picture of my Grandmother. 5 GiB is small when you're talking about total address space size (even though it's large when you're talking about RAM). There are lots of techniques used to make the address space seem a lot larger than available RAM (shared memory, swap space, allocation on demand, the zeroed page optimization, memory mapped files, etc).
For most 64-bit OSs (where there's a separate virtual address space for each process), you'd end up with 131072 GiB of virtual address space for each process plus 131072 GiB of virtual address space for the kernel. If the OS supports a maximum of 4294967296 processes (32-bit process IDs), then in theory you end up with a maximum of 562949953552384 GiB of total virtual address space. Of course you'd probably need to use some swap space for that...
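Here's where those numbers come from (assuming the usual 48-bit canonical address split, with a 2^47 byte lower half per process and a 2^47 byte higher half for the kernel):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t half_space_gib = (1ULL << 47) >> 30;   /* 2^47 bytes = 131072 GiB */
    uint64_t processes      = 1ULL << 32;           /* 32-bit process IDs      */

    /* one user half per process, plus one kernel half shared by all of them */
    uint64_t total_gib = half_space_gib * processes + half_space_gib;

    printf("per process:        %llu GiB\n", (unsigned long long)half_space_gib);
    printf("theoretical total:  %llu GiB\n", (unsigned long long)total_gib);
    return 0;
}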
Lastly, in the past I've always tried to encourage people to avoid identity mapping RAM in kernel space for a variety of reasons (e.g. the ability to send parts of the kernel to swap space, fault tolerance, NUMA optimizations, etc). I won't repeat it here, because in reality most people (including me) need to write several kernels before they've learnt enough to start thinking about implementing more advanced features, and by the time they're ready for this they don't really need to ask.
iseyler wrote:Is this even possible? Or am I completely wrong in how x86 memory works?
It is possible. IMHO it's also not a very good design, but also IMHO there's probably no reason to worry about whether it's a good design or not (as long as you learn from it, it probably won't matter)...
Cheers,
Brendan