Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 12:02 am
by kzinti
I've wanted to map all physical memory for a while as this is an obvious optimization that can be used for many things. It allows you to get a virtual address for any physical address easily and do the reverse if the virtual address is in the direct mapping range.
So I've read that you can only have 46 bits of physical addresses and you have 48 bits of virtual address to work with. So I did the obvious and mapped all physical memory in my system at offset 0xFFFF800000000000 using large or huge pages when supported. I optimized my vmm_get_physical_address() and my vmm_map_pages(physicalAddress, flags) to use this direct mapping. Pretty straightforward and I get a huge performance boost for any kernel data structures that are dynamically allocated, which includes all the TCBs and IPC buffers.
Now this memory is mapped using these flags: PAGE_PRESENT | PAGE_WRITE | PAGE_NX.
It just dawned on me that this doesn't work for memory-mapped hardware devices where I want to disable caching using something like PAGE_WRITE_THROUGH | PAGE_CACHE_DISABLE.
I suppose I could map all the physical memory once more with these flags at 0x0000400000000000, and I might just do that... But I was curious what others have done. I don't recall anyone mentioning mapping the physical memory multiple times, but maybe I am recalling wrong or missed it. Also Linux doesn't do that... But I don't know that this is an argument either way.
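With a direct map at a fixed offset like that, the conversion helpers end up being trivial. A simplified sketch (illustrative names, not the exact vmm code):

Code:
#include <stdint.h>

/* Base of the direct mapping of all physical memory (higher half). */
#define DIRECT_MAP_BASE 0xFFFF800000000000ull

/* Physical -> virtual: valid for anything covered by the direct map. */
static inline void *phys_to_virt(uint64_t phys)
{
    return (void *)(DIRECT_MAP_BASE + phys);
}

/* Virtual -> physical: only valid for addresses inside the direct map;
 * the caller is responsible for checking the range. */
static inline uint64_t virt_to_phys(const void *virt)
{
    return (uint64_t)virt - DIRECT_MAP_BASE;
}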
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 12:15 am
by nullplan
I'm relying on firmware setting up the MTRRs correctly. Then I don't have to deal with caching on a page level. It's worked out so far.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 12:35 am
by kzinti
So then I take it you aren't configuring or touching the PAT at all? Nor the MTRRs?
Just map things normally without any caching flags and things just work?
I didn't expect that... There have been discussions here in the past about mapping the framebuffer in WC mode... But if the MTRRs are doing it, is that still required?
I suppose I'll have to dump my MTRRs to see what the firmware put there. But I thought I'd ask anyway.
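For the record, dumping the variable-range MTRRs is just a handful of RDMSRs. A rough sketch (rdmsr() and kprintf() are assumed helpers; MSR numbers are the architectural ones from the Intel SDM):

Code:
#include <stdint.h>

extern uint64_t rdmsr(uint32_t msr);
extern void kprintf(const char *fmt, ...);

#define IA32_MTRRCAP        0x0FE
#define IA32_MTRR_DEF_TYPE  0x2FF
#define IA32_MTRR_PHYSBASE0 0x200   /* PHYSBASEn = 0x200 + 2n */
#define IA32_MTRR_PHYSMASK0 0x201   /* PHYSMASKn = 0x201 + 2n */

static void dump_mtrrs(void)
{
    uint64_t cap = rdmsr(IA32_MTRRCAP);
    unsigned count = cap & 0xFF;             /* number of variable-range MTRRs */

    kprintf("MTRR default type: %lx\n", rdmsr(IA32_MTRR_DEF_TYPE) & 0xFF);

    for (unsigned i = 0; i < count; i++) {
        uint64_t base = rdmsr(IA32_MTRR_PHYSBASE0 + 2 * i);
        uint64_t mask = rdmsr(IA32_MTRR_PHYSMASK0 + 2 * i);
        if (!(mask & (1ull << 11)))          /* valid bit */
            continue;
        kprintf("MTRR%u: base %lx type %lx mask %lx\n",
                i, base & ~0xFFFull, base & 0xFF, mask & ~0xFFFull);
    }
}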
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 1:24 am
by nullplan
So I looked it up in the manual. If I don't touch the PAT bits, I am essentially selecting WB mode with the PAT every time. According to Intel SDM v3A, Table 11-7, this means the effective cache type of a given region is always the same as the one set up in the MTRRs. And unless I ever get around to dynamically moving my frame buffer around, I have little reason to reconfigure those. The firmware already needs to set up the MTRRs while initializing PCI, and it must initialize PCI to even find the hardware to boot the OS with. Therefore, I will simply not touch those things and let other people worry about it.
kzinti wrote:I didn't expect that... There have been discussions here in the past about mapping the framebuffer in WC mode... But if the MTRRs are doing it, is that still required?
It is entirely possible that firmware sets up the frame buffer to be in the wrong mode. WT, WP, and UC would all have the effect of propagating writes directly to main memory, and would therefore be suitable for a frame buffer since writes would have an immediate effect, but they would lead to worse performance. But honestly, at this time I don't care enough about performance to even worry about this. If I ever see a system that does this, I may have to reconsider this stance, but so far it has worked out. It's almost as if the eggheads at Intel put some thought into this system.
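To spell out the mechanism behind that table: the PWT, PCD and PAT bits of a page-table entry just form a 3-bit index into the IA32_PAT MSR, and the reset default leaves entry 0 as WB, which is why untouched page-table bits defer to the MTRRs. A small sketch of the lookup (rdmsr() is an assumed helper):

Code:
#include <stdint.h>

extern uint64_t rdmsr(uint32_t msr);

#define IA32_PAT 0x277

/* PAT index = PAT (bit 7 in a 4 KiB PTE, bit 12 in large pages), PCD
 * (bit 4), PWT (bit 3). Default PAT after reset: 0=WB 1=WT 2=UC- 3=UC
 * 4=WB 5=WT 6=UC- 7=UC, so a PTE with none of these bits set selects
 * entry 0 (WB) and the effective type falls back to the MTRRs. */
static uint8_t pat_type_for_pte(int pat_bit, int pcd, int pwt)
{
    unsigned index = (pat_bit << 2) | (pcd << 1) | pwt;
    return (rdmsr(IA32_PAT) >> (index * 8)) & 0xFF;
}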
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 4:05 am
by Octacone
kzinti wrote:
I didn't expect that... There have been discussions here in the past about mapping the framebuffer in WC mode... But if the MTRRs are doing it, is that still required?
That is definitely one of the things on my TODO list. Once I get there.
I remember seeing a thread about mapping the framebuffer in different modes, and IIRC the performance difference was significant.
I might be wrong, but I think klange does something similar.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 4:09 am
by vvaltchev
kzinti wrote:There have been discussions here in the past about mapping the framebuffer in WC mode...
I remember that one of them was mine. I got a framebuffer console working fine, but on real hardware it was too slow: about 250 cycles/pixel. That number is not random: it's roughly what a completely uncached memory access costs. That's completely unacceptable. To improve things I implemented fpu_memcpy() using the largest registers available (SSE, AVX, AVX2, etc.), up to 256 bits. And it improved linearly, up to 8x with 256-bit registers. Better, but still not fast enough.
In the end, I realized that the right solution was to use the PAT (which can override the MTRRs) and make the framebuffer WC. The improvement was HUGE: from ~250 cycles/pixel to 1.125 cycles/pixel. In other words, I got an improvement of about 220x! Then I tried to replace the fpu_memcpy() with a regular memcpy() and the result was a performance degradation of something like 8% (if I remember correctly). In other words, using the PAT to set the framebuffer pages to WC is mandatory, and it matters more than any other optimization.
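For anyone wanting to try the same thing, one common approach (a simplified sketch, not necessarily the exact code I used; rdmsr()/wrmsr() are assumed helpers) is to repurpose PAT entry 1 as WC and then select it from the framebuffer PTEs via the PWT bit:

Code:
#include <stdint.h>

extern uint64_t rdmsr(uint32_t msr);
extern void wrmsr(uint32_t msr, uint64_t value);

#define IA32_PAT     0x277
#define PAT_TYPE_WC  0x01

/* PTE flag selecting PAT entry 1 (PWT=1, PCD=0, PAT=0). */
#define PTE_PWT      (1ull << 3)

static void pat_set_entry1_wc(void)
{
    uint64_t pat = rdmsr(IA32_PAT);
    pat &= ~(0xFFull << 8);                 /* clear entry 1 (default WT)   */
    pat |= ((uint64_t)PAT_TYPE_WC << 8);    /* make entry 1 write-combining */
    wrmsr(IA32_PAT, pat);
    /* This must be done on every core, and the SDM recommends a cache and
     * TLB flush sequence after changing IA32_PAT. */
}

/* Framebuffer PTEs then get PRESENT | WRITE | NX | PTE_PWT, so their
 * 3-bit PAT index is 001b, i.e. the WC entry programmed above. */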
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 8:28 am
by bzt
kzinti wrote:Now this memory is mapped using these flags: PAGE_PRESENT | PAGE_WRITE | PAGE_NX.
It just dawned on me that this doesn't work for memory-mapped hardware devices where I want to disable caching using something like PAGE_WRITE_THROUGH | PAGE_CACHE_DISABLE.
Why don't you map the entire memory once with pageflags like (conventional_memory(phys_addr) ? PAGE_PRESENT | PAGE_WRITE | PAGE_NX : PAGE_WRITE_THROUGH | PAGE_CACHE_DISABLE)? I see no point in having two mappings when you can solve this with one easily. You can still use 2M and 1G pages if the entire region is of the same type (conventional vs. device memory).
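Something along these lines (just a rough sketch; conventional_memory() and map_page() stand in for whatever your kernel already provides, and 4K pages are used for brevity):

Code:
#include <stdint.h>
#include <stdbool.h>

/* Standard x86_64 page-table entry bits. */
#define PAGE_PRESENT        (1ull << 0)
#define PAGE_WRITE          (1ull << 1)
#define PAGE_WRITE_THROUGH  (1ull << 3)   /* PWT */
#define PAGE_CACHE_DISABLE  (1ull << 4)   /* PCD */
#define PAGE_NX             (1ull << 63)

#define DIRECT_MAP_BASE     0xFFFF800000000000ull
#define PAGE_SIZE           0x1000ull

/* Placeholders: conventional_memory() consults the firmware memory map,
 * map_page() is whatever the VMM uses to install a single mapping. */
extern bool conventional_memory(uint64_t phys);
extern void map_page(uint64_t virt, uint64_t phys, uint64_t flags);

static void map_all_physical(uint64_t phys_end)
{
    for (uint64_t phys = 0; phys < phys_end; phys += PAGE_SIZE) {
        uint64_t flags = PAGE_PRESENT | PAGE_WRITE | PAGE_NX;
        if (!conventional_memory(phys))
            flags |= PAGE_WRITE_THROUGH | PAGE_CACHE_DISABLE;
        map_page(DIRECT_MAP_BASE + phys, phys, flags);
    }
}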
Cheers,
bzt
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 12:19 pm
by kzinti
vvaltchev wrote:In the end, I realized that the right solution was to use the PAT (which can override the MTRRs) and make the framebuffer WC. The improvement was HUGE:
I did implement WC using the PAT for my framebuffer. I didn't measure performance with and without it, as I didn't have a timer at the time. But it did seem to make a huge difference for me as well. I'll have to dump the MTRRs to see how the framebuffer is set up there, because if it is UC, setting the PAT to WC shouldn't help at all (according to the AMD manual). Might also be time for some profiling.
bzt wrote:Why don't you map the entire memory once with pageflags like (conventional_memory(phys_addr) ? PAGE_PRESENT | PAGE_WRITE | PAGE_NX : PAGE_WRITE_THROUGH | PAGE_CACHE_DISABLE)?
That's a good idea; I didn't think of using the bootloader memory map to figure out what should be mapped cacheable or not. I'll see if I can get away with just the MTRRs + PAT, but this seems like a good solution as well.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 1:00 pm
by Octocontrabass
kzinti wrote:So I've read that you can only have 46 bits of physical addresses
Future CPUs can have up to 52 bits of physical addresses. Some current CPUs already claim support for at least 48 bits of physical addresses through CPUID, although I'm not sure if there's any existing hardware that can use all of those bits.
kzinti wrote:and you have 48 bits of virtual address to work with.
Or 57, if Intel ever releases the CPUs that are supposed to support it.
kzinti wrote:using large or huge pages when supported.
Large/huge pages are not allowed to map any region with more than one effective memory type. For example, if the MTRRs are configured to map part of a large page WB and the other part UC, then you must either split the large page into smaller pages or configure the large page to be UC.
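A rough sketch of that check for a 2 MiB page (mtrr_effective_type() is an assumed helper that applies the SDM's type-combining rules to a single 4 KiB frame):

Code:
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE_4K  0x1000ull
#define PAGE_SIZE_2M  0x200000ull

/* Assumed helper: returns the effective MTRR memory type (UC, WC, WT,
 * WP, WB) for the 4 KiB frame containing phys. */
extern uint8_t mtrr_effective_type(uint64_t phys);

/* A 2 MiB mapping is only safe if every 4 KiB frame it covers has the
 * same effective memory type; otherwise split it into 4 KiB pages (or
 * map the whole region UC). */
static bool can_use_2m_page(uint64_t phys)
{
    uint8_t type = mtrr_effective_type(phys);
    for (uint64_t off = PAGE_SIZE_4K; off < PAGE_SIZE_2M; off += PAGE_SIZE_4K) {
        if (mtrr_effective_type(phys + off) != type)
            return false;
    }
    return true;
}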
nullplan wrote:I'm relying on firmware setting up the MTRRs correctly. Then I don't have to deal with caching on a page level. It's worked out so far.
You still need to pay at least enough attention to the MTRRs to split large pages that cross memory type boundaries, otherwise you'll get undefined behavior.
kzinti wrote:I'll have to dump the MTRRs to see how the framebuffer is set up there, because if it is UC, setting the PAT to WC shouldn't help at all (according to the AMD manual).
I'm not sure where you're seeing that; Intel and AMD both agree selecting WC in the PAT overrides whatever is set in the MTRRs.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 1:27 pm
by kzinti
Octocontrabass wrote:I'm not sure where you're seeing that; Intel and AMD both agree selecting WC in the PAT overrides whatever is set in the MTRRs.
AMD64 Architecture Programmer's Manual, Volume 2: System Programming.
Table 7-7: Combined MTRR and Page-Level Memory Type with Unmodified PAT MSR.
The first row says that if the MTRR memory type is UC, the page flags are ignored and you end up with UC.
Now I just noticed the "Unmodified PAT MSR" part in the table's name, which I missed before. So you are probably right. Do you know where to find that info in Intel's (or AMD's) manual?
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 1:53 pm
by Octocontrabass
kzinti wrote:Do you know where to find that info in Intel's (or AMD's) manual?
Intel: volume 3A, table 11-7.
AMD: volume 2, table 7-10.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 1:58 pm
by kzinti
Awesome, thanks!
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 5:52 pm
by xeyes
kzinti wrote: I don't recall anyone mentioning mapping the physical memory multiple times, but maybe I am recalling wrong or missed it. Also Linux doesn't do that... But I don't know that this is an argument either way.
IIRC it is doable, and the Intel manual mentions it.
But IMO, this setup (multiple logical apertures with different cache attributes mapped to the same physical memory) is quite risky in terms of correctness. For example, the CPU might prefetch when your code accesses something near the cacheable aperture and cause side effects on any MMIO registers within it.
It's probably a worse situation to deal with than an SMP system with no hardware cache coherency.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 5:58 pm
by kzinti
xeyes wrote:But IMO, this setup (multiple logical apertures with different cache attributes mapped to the same physical memory) is quite risky in terms of correctness. For example, the CPU might prefetch when your code accesses something near the cacheable aperture and thus cause side effects on the MMIO registers if they have read side effects.
It is not the intention to access the same physical memory through different mappings. The intentions are:
1) Save time / complexity by not having to map conventional memory for kernel use (since it's all mapped already).
2) No need to map MMIO hardware.
3) Reduce load on TLB by using large pages. Taking into account MTRRs is going to make this tricky.
4) Having this for x86_64 and not for other archs (ia32, ...) forces me to properly abstract my memory manager interfaces.
xeyes wrote:It's probably a worse situation to deal with than an SMP system with no hardware cache coherency.
I am not sure I would want to (or even could) use something like this on non-x86_64. It could be tricky indeed.
Re: Mapping all physical memory on x86_64
Posted: Sun Feb 28, 2021 6:13 pm
by xeyes
kzinti wrote:It is not the intention to access the same physical memory through different mappings. [...]
That makes sense. I'm just saying: try your best to avoid a setup like that even if you don't intend to use both apertures.
The coherency thing is just a random example; I personally find it easier to reason about HW that doesn't provide any coherency support than about HW that can treat the same region in different ways.