Paging, PAT and MTRRs
Hey,
I just wanted to clarify and make sure that I understand this process correctly (as it seems to have gotten even more arcane than last time I was working on a kernel and setting up paging in 32bit pmode).
In the paging structures for long mode, we have the PML4, PDPT, PD, PT etc.
For my system I'm not using any 4kb pages at all, and instead sticking to 2Mb pages, or 1Gb pages if available and where possible.
The paging structures now contain two or three relevant bits: PCD (bit 4), PWT (bit 3) and PAT (only on entries which actually map memory).
Structure entries which act as pointers to further tables do not include a PAT bit.
These bits combine to form an index into the PAT table:
index = 2*PCD + PWT (for entries without a PAT bit)
index = 4*PAT + 2*PCD + PWT (for entries that map pages)
Now each PAT entry in the IA32_PAT MSR specifies a memory type.
MTRRs also specify memory ranges with types.
The initial memory map you get back from E820 doesn't give you enough detail to know which areas should be marked as write-combining, write-back, etc.
What is the best approach to combining the info from the MTRRs and E820 to set up these paging flags? I assume also that during PCI device enumeration I should look at the memory ranges in BARs and, depending on each device's use of that memory range, update the relevant paging bits?
Also, given that the pages are 2Mb (or 1Gb) they will obviously span much larger areas than some of the more restrictive ranges, so I guess, at the cost of wasting some memory and for the sake of simplicity, one could/should just mark any 2Mb page that overlaps a restricted range with the restrictive type?
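To make the question concrete, here's roughly the kind of logic I have in mind, as a sketch only (all the helper names are hypothetical): walk physical memory in 2Mb steps, treat anything that isn't 100% usable RAM according to E820 as uncacheable, and otherwise ask the MTRRs; PCI BARs would later override specific ranges.
Code: Select all
/* Sketch: derive an initial cache type for each 2Mb page by combining
 * E820 and MTRR information. All helper names here are hypothetical. */
#include <stdint.h>
#include <stdbool.h>

enum ctype { CT_WB, CT_WT, CT_WC, CT_UC };

extern bool e820_range_is_usable_ram(uint64_t base, uint64_t len);  /* hypothetical */
extern enum ctype mtrr_type_for_range(uint64_t base, uint64_t len); /* hypothetical */

enum ctype initial_type_for_2mb(uint64_t base)
{
    /* Anything that isn't 100% usable RAM (holes, reserved, ACPI, MMIO):
     * play it safe and leave it uncacheable until a driver says otherwise. */
    if (!e820_range_is_usable_ram(base, 0x200000))
        return CT_UC;

    /* Usable RAM: the firmware's MTRRs normally already make this WB,
     * but ask anyway in case a variable range says otherwise. */
    return mtrr_type_for_range(base, 0x200000);
}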
- Combuster
Re: Paging, PAT and MTRRs
As far as MTRRs go: Don't change them unless you want to set a specific setting for a specific device. The BIOS gets pretty much everything right on startup.
Re: Paging, PAT and MTRRs
In terms of SMP, would that apply to all cores or only the BSP? (i.e. should one transfer the BSP's MTRR settings to the APs?)
Re: Paging, PAT and MTRRs
Hi,
johnsa wrote: In terms of SMP, would that apply to all cores or only the BSP? (i.e. should one transfer the BSP's MTRR settings to the APs?)
For SMP, you must make sure that all MTRRs are the same on all CPUs. When that isn't the case, the contents of RAM end up corrupted because different CPUs make different caching assumptions.
For changing the MTRRs there's a relatively strict sequence of steps that CPUs need to follow in "locked step" (where all CPUs finish "step n" before any CPU starts "step n+1").
For PAT, almost all normal pages of RAM should be "write-back" (which is the default, so you don't need to care about PAT for that) and almost all memory mapped IO should be "uncached" (also the default, so again you don't need to care about PAT for that). There are only a few special cases where you might bother using PAT to improve performance (e.g. for video display memory, maybe, if you want it to be "write combining" and there wasn't a free variable-range MTRR you could use to do it).
These special cases aren't necessarily limited to memory mapped IO either - there are a few (rare) cases where "write-back" isn't the best option for specific pages of RAM, either because of pathological access patterns (where caching does more harm than good) or strange requirements. An example would be a massive array that's accessed with a very random access pattern, where nothing is gained from caching (because you won't be using anything in that cache line again soon) and filling the cache with data you won't use soon harms performance elsewhere. Another example might be a utility designed to measure/benchmark RAM bandwidth, where leaving the pages as "write-back" makes things harder (and increases the chance that the results will be distorted by instruction fetch or something).
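The rough shape of that "locked step" sequence, as a sketch only (the barrier and the CR/MSR helpers here are hypothetical wrappers; the steps follow the cache/TLB flush sequence the Intel SDM describes for MTRR updates):
Code: Select all
/* Sketch of the rendezvous structure only - the step boundaries are the
 * important part. Every CPU runs this at the same time. */
#include <stdint.h>

extern void barrier_wait(void);         /* hypothetical: blocks until every CPU arrives */
extern uint64_t read_cr0(void), read_cr4(void);
extern void write_cr0(uint64_t), write_cr4(uint64_t);
extern void wbinvd(void), disable_interrupts(void), enable_interrupts(void);
extern void mtrr_disable(void), mtrr_write_all(const void *), mtrr_enable(void);

#define CR0_CD  (1ULL << 30)
#define CR0_NW  (1ULL << 29)
#define CR4_PGE (1ULL << 7)

void mtrr_update_all_cpus(const void *new_mtrr_state)
{
    disable_interrupts();
    barrier_wait();                              /* step 1: all CPUs ready     */

    uint64_t cr4 = read_cr4();
    write_cr0((read_cr0() | CR0_CD) & ~CR0_NW);  /* enter no-fill cache mode   */
    wbinvd();                                    /* flush caches               */
    write_cr4(cr4 & ~CR4_PGE);                   /* flush TLB (global pages)   */
    barrier_wait();                              /* step 2: all caches flushed */

    mtrr_disable();                              /* clear IA32_MTRR_DEF_TYPE.E */
    mtrr_write_all(new_mtrr_state);              /* identical on every CPU     */
    mtrr_enable();
    barrier_wait();                              /* step 3: all MTRRs updated  */

    wbinvd();
    write_cr4(cr4 & ~CR4_PGE);                   /* flush TLB again            */
    write_cr0(read_cr0() & ~(CR0_CD | CR0_NW));  /* caching back on            */
    write_cr4(cr4);                              /* restore CR4 (and PGE)      */
    barrier_wait();                              /* step 4: everyone done      */
    enable_interrupts();
}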
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Paging, PAT and MTRRs
That makes sense,
I've correlated your wise words against chapter 11 in Volume 3 of the Intel manuals and have followed your suggestions on how to structure things. I have noted however that the PAT entries at boot/startup don't have a logical setup covering all of the possible memory types, so I am resetting the IA32_PAT MSR to include one entry for each type, with entry 0 still set to the default of write-back. All my page table entries are now PCD=0, PWT=0 and PAT=0 (so they should all be write-back by default).
I have added the ability for the page allocator to request a number of pages with a specified type (WC, UC, etc.).
Where this still gets quite tricky is that I'm sticking to 2Mb pages, and any 2Mb page which overlaps two or more MTRR ranges with different memory types can lead to undefined behaviour.
Especially given that the MTRRs have 4kb granularity, this is quite likely. For example:
V_MTRR0 (1Mb -> 29Mb): write-back
V_MTRR1 (29Mb -> 30Mb): strong uncacheable
V_MTRR2 (30Mb -> 32Mb): write-combining
PAGE_14 (28Mb -> 30Mb) = write-back
PAGE_15 (30Mb -> 32Mb) = write-back
Refer to section 11.11.9 in Volume 3.
I can either adjust the MTRRs to fit page boundaries, or I can set the whole 2Mb page to the most restrictive type.
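The "most restrictive type" option might look something like this sketch. The ranking is my own choice, not from the SDM - when WC and WT mix it's debatable which should win, and collapsing anything mixed all the way to UC is the truly conservative answer; the lookup helper is hypothetical:
Code: Select all
/* Sketch: resolve the type for one 2Mb page that may overlap several MTRR
 * ranges, by taking the most "restrictive" type seen across its 4kb chunks. */
#include <stdint.h>

enum mtype { MT_WB = 0, MT_WP, MT_WT, MT_WC, MT_UC };  /* least -> most restrictive */

extern enum mtype mtrr_type_for_4kb(uint64_t addr);    /* hypothetical lookup */

enum mtype safe_type_for_2mb(uint64_t page_base)
{
    enum mtype worst = MT_WB;
    for (uint64_t a = page_base; a < page_base + 0x200000; a += 0x1000) {
        enum mtype t = mtrr_type_for_4kb(a);
        if (t > worst)
            worst = t;                             /* keep the strictest type */
    }
    return worst;
}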
Re: Paging, PAT and MTRRs
FYI
Code: Select all
;------------------------------
; Default PAT on boot/reset
;------------------------------
; | PAT Entry | Memory Type
; | PAT0 | WB
; | PAT1 | WT
; | PAT2 | UC
; | PAT3 | UC
; | PAT4 | WB
; | PAT5 | WT
; | PAT6 | UC
; | PAT7 | UC
;------------------------------
;PAT Selection with paging bit flags
;------------------------------
;PAT PCD PWT PAT Entry
;0 0 0 PAT0
;0 0 1 PAT1
;0 1 0 PAT2
;0 1 1 PAT3
;1 0 0 PAT4
;1 0 1 PAT5
;1 1 0 PAT6
;1 1 1 PAT7
;2*PCD+PWT
;4*PAT+2*PCD+PWT
;bit 12 = PAT (in 2Mb/1Gb page entries; it's bit 7 in 4kb PTEs)
;bit 3 = PWT (page write-through)
;bit 4 = PCD (page cache disable)
; Architectural memory-type encodings (Intel SDM Vol. 3)
PAT_UNCACHEABLE = 0x00 ; UC (strong)
PAT_WRITECOMBINING = 0x01 ; WC
PAT_WRITETHROUGH = 0x04 ; WT
PAT_WRITEPROTECTED = 0x05 ; WP
PAT_WRITEBACK = 0x06 ; WB
PAT_UNCACHED = 0x07 ; UC- (can be overridden to WC by the MTRRs)
MSR_IA32_PAT = 0x277 ; MSR address of IA32_PAT
;------------------------------
; MSR_IA32_PAT layout
;------------------------------
;bits entry
;0-2 PA0
;8-10 PA1
;16-18 PA2
;24-26 PA3
;32-34 PA4
;40-42 PA5
;48-50 PA6
;56-58 PA7
;------------------------------
; Our PAT Structure
;------------------------------
;PA0=PAT_WRITEBACK
;PA1=PAT_WRITEBACK
;PA2=PAT_WRITETHROUGH
;PA3=PAT_WRITEPROTECTED
;PA4=PAT_WRITECOMBINING
;PA5=PAT_UNCACHEABLE
;PA6=PAT_UNCACHED
;PA7=PAT_UNCACHED
proc k_init_pat_uniprocessor
; CPUID.01h:EDX bit 16 indicates PAT support.
mov eax,1
cpuid
shr edx,16
and edx,1
.if edx = 0
ret ; No PAT support.
.endif
; Disable Interrupts
cli
; Enable no caching mode.
mov rax,cr0
or rax,(1 shl 30) ; Set CD
and rax, not (1 shl 29) ; Clear NW
mov cr0,rax
; Flush caches
wbinvd
; Clear CR4.PGE (bit 7) to flush the TLB, including global entries
mov rax,cr4
and rax, not (1 shl 7)
mov cr4,rax
; Update PAT Entries
mov rcx,MSR_IA32_PAT
mov rax,(PAT_UNCACHED shl 56) or (PAT_UNCACHED shl 48) or (PAT_UNCACHEABLE shl 40) or (PAT_WRITECOMBINING shl 32) or (PAT_WRITEPROTECTED shl 24) or (PAT_WRITETHROUGH shl 16) or (PAT_WRITEBACK shl 8) or (PAT_WRITEBACK shl 0)
mov rdx,rax
shr rdx,32
and eax,0xffffffff
wrmsr
; Flush caches
wbinvd
; Clear CR4.PGE (bit 7) to flush the TLB, including global entries
mov rax,cr4
and rax, not (1 shl 7)
mov cr4,rax
; Enable normal caching
mov rax,cr0
and rax,not (1 shl 30) ; Clear CD
and rax, not (1 shl 29) ; Clear NW
mov cr0,rax
; Enable PGE
mov rax,cr4
or rax,(1 shl 7) ; set CR4.PGE
mov cr4,rax
; Enable interrupts
sti
ret
endp
proc k_init_pat
; CPUID.01h:EDX bit 16 indicates PAT support.
mov eax,1
cpuid
shr edx,16
and edx,1
.if edx = 0
ret ; No PAT support.
.endif
; fastcall k_signal_create_countdown, [k_coreCount]
; mov [countdownHandle0],rax
; fastcall k_signal_create_countdown, [k_coreCount]
; mov [countdownHandle1],rax
; fastcall k_signal_all_ap_cores, addr k_init_pat_smp
; Disable Interrupts
cli
; Signal the countdown latch that this core has completed.
; fastcall k_signal_countdown, [countdownHandle0]
; Ensure all Cores reach here
; fastcall k_wait_all, [countdownHandle0]
; Enable no caching mode.
mov rax,cr0
or rax,(1 shl 30) ; Set CD
and rax, not (1 shl 29) ; Clear NW
mov cr0,rax
; Flush caches
wbinvd
; Clear CR4.PGE (bit 7) to flush the TLB, including global entries
mov rax,cr4
and rax, not (1 shl 7)
mov cr4,rax
; Update PAT Entries
mov rcx,MSR_IA32_PAT
mov rax,(PAT_UNCACHED shl 56) or (PAT_UNCACHED shl 48) or (PAT_UNCACHEABLE shl 40) or (PAT_WRITECOMBINING shl 32) or (PAT_WRITEPROTECTED shl 24) or (PAT_WRITETHROUGH shl 16) or (PAT_WRITEBACK shl 8) or (PAT_WRITEBACK shl 0)
mov rdx,rax
shr rdx,32
and eax,0xffffffff
wrmsr
; Flush caches
wbinvd
; Clear CR4.PGE (bit 7) to flush the TLB, including global entries
mov rax,cr4
and rax, not (1 shl 7)
mov cr4,rax
; Enable normal caching
mov rax,cr0
and rax,not (1 shl 30) ; Clear CD
and rax, not (1 shl 29) ; Clear NW
mov cr0,rax
; Enable PGE
mov rax,cr4
or rax,(1 shl 7) ; set CR4.PGE
mov cr4,rax
; Signal the countdown latch that this core has completed.
; fastcall k_signal_countdown, [countdownHandle1]
; Ensure all Cores reach here
; fastcall k_wait_all, [countdownHandle1]
; Enable interrupts
sti
ret
endp
proc k_init_pat_smp
; Disable Interrupts
cli
; Signal the countdown latch that this core has completed.
; fastcall k_signal_countdown, [countdownHandle0]
; Ensure all Cores reach here
; fastcall k_wait_all, [countdownHandle0]
; Enable no caching mode.
mov rax,cr0
or rax,(1 shl 30) ; Set CD
and rax, not (1 shl 29) ; Clear NW
mov cr0,rax
; Flush caches
wbinvd
; Clear CR4.PGE (bit 7) to flush the TLB, including global entries
mov rax,cr4
and rax, not (1 shl 7)
mov cr4,rax
; Update PAT Entries
mov rcx,MSR_IA32_PAT
mov rax,(PAT_UNCACHED shl 56) or (PAT_UNCACHED shl 48) or (PAT_UNCACHEABLE shl 40) or (PAT_WRITECOMBINING shl 32) or (PAT_WRITEPROTECTED shl 24) or (PAT_WRITETHROUGH shl 16) or (PAT_WRITEBACK shl 8) or (PAT_WRITEBACK shl 0)
mov rdx,rax
shr rdx,32
and eax,0xffffffff
wrmsr
; Flush caches
wbinvd
; Clear CR4.PGE (bit 7) to flush the TLB, including global entries
mov rax,cr4
and rax, not (1 shl 7)
mov cr4,rax
; Enable normal caching
mov rax,cr0
and rax,not (1 shl 30) ; Clear CD
and rax, not (1 shl 29) ; Clear NW
mov cr0,rax
; Enable PGE
mov rax,cr4
or rax,(1 shl 7) ; set CR4.PGE
mov cr4,rax
; Signal the countdown latch that this core has completed.
; fastcall k_signal_countdown, [countdownHandle1]
; Ensure all Cores reach here
; fastcall k_wait_all, [countdownHandle1]
; Enable interrupts
sti
ret
endp
Re: Paging, PAT and MTRRs
Hi,
johnsa wrote: Where this still gets quite tricky is that I'm sticking to 2Mb pages, and any 2Mb page which overlaps two or more MTRR ranges with different memory types can lead to undefined behaviour. Especially given that the MTRRs have 4kb granularity, this is quite likely.
For normal RAM, you could just avoid using any 2 MiB page that isn't 100% RAM. In that case you can expect to waste ~640 KiB at 0x00000000, and possibly waste several other areas. On modern systems (e.g. with 4 GiB of RAM or more) this might not be much of a problem.
For executable files, typically you want to use page level protection. For example, an executable's ".text" section might be "read only, executable", its ".rodata" might be "read only, no execute", its ".data" might be "read/write, no execute", etc. On average you can expect to waste half a page per section. If you have 200 processes with 4 sections each, then that's an average of 800 MiB of RAM wasted.
Then there's shared libraries. If all the processes combined use a total of 50 different shared libraries and each shared library has 3 sections, then you'd have to waste another 150 MiB of RAM for that.
Next comes file system caching and its interaction with memory mapped files. For example, maybe you've got a process that wants to memory map some small files but you also want some of those files to be present in the VFS's cache (so other processes can access them quickly). To allow efficient memory mapping the VFS file cache would need to use 2 MiB blocks for file data; so for every file in the VFS cache you can expect to waste another 1 MiB. If you've got 1234 small files in the VFS cache, then you waste 1 GiB of RAM.
Then there's things like shared memory. A pair of processes want to communicate with a small 20 KiB shared memory buffer? Waste some RAM. A process wants to "fork()"? Every 2 MiB page that's modified needs to be "copied on write", so if only 3 bytes are modified you need to allocate a whole 2 MiB (and not just 4 KiB), so waste a lot more RAM for that.
Now let's add this up. Maybe 3 MiB of RAM wasted due to "partial pages" in the memory map; then 800 MiB of RAM wasted to make page level protection work for processes and another 150 MiB wasted for shared libraries, 1 GiB of RAM wasted for disk caches and maybe another 10 MiB wasted for things like shared memory areas. On a computer with 4 GiB of RAM you might be wasting over half the RAM. On a computer with 32 GiB of RAM; you'd be able to run more processes, have more files in the VFS cache, etc; and maybe you'd still be wasting a quarter of the RAM.
It's not just RAM usage though. For a lot of things there's a compromise between RAM and performance. Rather than having an expensive calculation you might have a lookup table and use more RAM to get better performance. Rather than doing more disk IO you might cache more file data. Rather than doing all the "server side includes" again a web server might do it once and cache the resulting web page. By wasting a significant amount of RAM you can cause significant performance problems (up to and including swapping pages to/from disk and making everything several orders of magnitude slower, because the RAM you could've/should've been using to avoid swapping was wasted).
Mostly; you need to support 4 KiB pages. You can support both 2 MiB and 4 KiB pages (and 1 GiB pages too for extreme/unlikely cases); and that can be a little tricky (your physical memory manager needs a way to recombine 4 KiB pages back into a 2 MiB page, otherwise all your 2 MiB pages end up split up and gone after the OS has been running a little while); but you can't just have 2 MiB pages and nothing else.
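A minimal sketch of the recombination idea, assuming a counter plus a bitmap per 2 MiB region (all names here are invented): move a region back to the 2 MiB free list once all 512 of its 4 KiB frames are free again.
Code: Select all
/* Sketch: per-2MiB-region bookkeeping so the physical memory manager can
 * recombine freed 4 KiB frames back into 2 MiB pages. */
#include <stdint.h>

#define FRAMES_PER_2M 512

struct region2m {
    uint16_t free_frames;                  /* how many of the 512 frames are free */
    uint64_t bitmap[FRAMES_PER_2M / 64];   /* bit set = frame free                */
};

extern struct region2m regions[];          /* one entry per 2 MiB of physical RAM */
extern void reclaim_as_2mb_page(uint64_t phys_base);   /* hypothetical */

void free_4kb_frame(uint64_t phys)
{
    struct region2m *r = &regions[phys >> 21];
    uint64_t frame = (phys >> 12) & (FRAMES_PER_2M - 1);

    r->bitmap[frame / 64] |= 1ULL << (frame % 64);
    if (++r->free_frames == FRAMES_PER_2M) {
        /* Whole region free again: pull its frames off the 4 KiB free list
         * and hand the region back to the 2 MiB allocator. */
        reclaim_as_2mb_page(phys & ~0x1FFFFFULL);
    }
}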
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Paging, PAT and MTRRs
Agreed 100% - under normal circumstances it would be very wasteful and would yield awful performance.
However, under these conditions perhaps it's not so bad:
1) Assume a minimum of 4-8Gb of RAM.
2) There is only one process running (single tasking -> multicore capable).. a bit like the Xbox.
3) Page level protection does apply, but there's only one process (so it will only occupy a minimum of 8Mb).
4) A process can be switched in/out.. but in a "fast" single tasking way.
5) Shared libraries (the same applies.. this is a bit wasteful if there are 50-odd loaded).
6) Shared memory isn't necessary as there is no IPC.
7) File system caching would probably use multiples of 2Mb pages but sub-allocate cache entries from them (see the sketch at the end of this post)?
I think the 2Mb page model can work exclusively without 4kb pages as long as you're not designing a generic OS (ie: a single tasking real-time setup).
Or am I still mental?
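For (7), the sort of thing I mean - a sketch only, with entirely hypothetical names: the VFS cache takes whole 2Mb pages and carves file-sized buffers out of them itself.
Code: Select all
/* Sketch: a trivial bump allocator that carves VFS cache buffers out of
 * 2Mb pages, so small cached files don't each burn a whole page.
 * Assumes size <= 2Mb; freeing/eviction is deliberately ignored. */
#include <stddef.h>
#include <stdint.h>

extern uint8_t *alloc_2mb_page(void);      /* hypothetical 2Mb page allocator */

struct vfs_cache_arena {
    uint8_t *base;      /* current 2Mb page owned by the cache */
    size_t   used;      /* bytes handed out so far             */
};

void *vfs_cache_alloc(struct vfs_cache_arena *a, size_t size)
{
    size = (size + 63) & ~(size_t)63;       /* keep buffers cache-line aligned */
    if (a->base == NULL || a->used + size > 0x200000) {
        a->base = alloc_2mb_page();         /* start a fresh 2Mb page */
        a->used = 0;
    }
    void *p = a->base + a->used;
    a->used += size;
    return p;
}
Freeing is the part this glosses over - a page can only go back to the allocator once everything in it has been evicted, which is the usual price of sub-allocation.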
Re: Paging, PAT and MTRRs
Hi,
johnsa wrote: I think the 2Mb page model can work exclusively without 4kb pages as long as you're not designing a generic OS (ie: a single tasking real-time setup). Or am I still mental?
For single-tasking, using 2 MiB pages would be "less bad"; but single-tasking itself will make up for that.
johnsa wrote: 3) Page level protection does apply, but there's only one process (so it will only occupy a minimum of 8Mb). 4) A process can be switched in/out.. but in a "fast" single tasking way.
I don't know what you mean here. If there's only one process, why would you switch it out; and what does "in a fast single tasking way" mean?
johnsa wrote: 5) Shared libraries (the same applies.. this is a bit wasteful if there are 50-odd loaded).
There are 2 benefits of shared libraries - they allow you to have portability (e.g. the same code running on different OSs, with a portable API like POSIX implemented as a shared library that hides the differences between the OSs), and the same pages can be shared by multiple processes (e.g. so if a library costs 5 MiB and is used by 10 processes, it costs 5 MiB in total and not 50 MiB).
There are also 2 main disadvantages. They make (compile-time or link-time) optimisation impossible. For a simple example, if you've got code that does "myBool = isPrimeNumber(1234567);" but that function is in a shared library, then the compiler can't inline it, can't reduce it all the way down to "myBool = false;", and can't avoid a massive amount of run-time overhead. The other disadvantage is that you can upgrade the shared library without upgrading the processes that use it; which may lead to processes failing because they relied on quirks/bugs in the old library, may lead to "dependency hell", and may also let malicious code (e.g. a virus) inject itself into every process that uses the library at the same time (or arrive as a trojan disguised as a slightly newer version of the shared library).
For your case, the first advantage probably doesn't matter and the second advantage can't matter (as there's no multi-tasking); so all you'd get is the disadvantages with no advantages. It might make more sense to forget about shared libraries and statically link everything instead (which would also make the "~1 MiB per section wasted for shared libraries" problem disappear completely).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Paging, PAT and MTRRs
My idea is basically this:
Single-tasking, in that you launch a process and that is the only process running, so there is no scheduler/time-slicing going on. That process can make use of multiple cores, using a library to initiate tasks on specified cores:
execute(CORE0, myFunction(0));
execute(CORE1, myFunction(1));
wait_for_cores(0..1);
The process remains in memory; if you switch to another process, code executing on AP cores continues to run and will update its latches/semaphores etc., so if the main process resumes it can continue where it left off in a wait_for_cores, OR,
if the AP core code was meant to be fully async, it will have a callback assigned so that on completion it does what it needs to and frees up the core.
It's a bit like a DOS TSR, but using multiple cores with some async/callback capability.
The "single tasking" switch is basically a user-initiated task switch instead of a scheduler.
(One catch obviously is that you cannot have a "window" manager type model with applications running event loops if it's single tasking).
core0: ftp client app starts
user initiates a file download (which will take 5min)
core1: (file download code runs on core1) -> callback assigned = complete_file_download();
core0: user switches to textpad application (ftp client stays in mem, textpad app loaded in).
core1: complete_file_download() is called
core1: halt.
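Under the hood, the dispatch API could be as simple as a per-core mailbox. This is just a sketch with invented names - e.g. the wakeup could be an IPI that pulls a halted AP out of hlt:
Code: Select all
/* Sketch: per-core mailboxes behind execute()/wait_for_cores().
 * Everything here is hypothetical. */
#include <stdatomic.h>

#define MAX_CORES 64                        /* made-up limit */

typedef void (*task_fn)(int arg);

struct core_mailbox {
    _Atomic(task_fn) fn;                    /* NULL = core is idle     */
    int              arg;
    void            (*on_done)(void);       /* optional async callback */
};

static struct core_mailbox mailbox[MAX_CORES];

extern void send_wakeup_ipi(int core);      /* hypothetical             */
extern void cpu_pause(void);                /* hypothetical spin hint   */
extern void cpu_halt_until_interrupt(void); /* hypothetical hlt wrapper */

void execute(int core, task_fn fn, int arg)
{
    mailbox[core].arg = arg;
    atomic_store(&mailbox[core].fn, fn);
    send_wakeup_ipi(core);
}

void wait_for_cores(int first, int last)
{
    for (int c = first; c <= last; c++)
        while (atomic_load(&mailbox[c].fn) != NULL)
            cpu_pause();
}

void ap_idle_loop(int me)                   /* runs on each AP */
{
    for (;;) {
        task_fn fn = atomic_load(&mailbox[me].fn);
        if (fn) {
            fn(mailbox[me].arg);
            if (mailbox[me].on_done)
                mailbox[me].on_done();      /* async completion hook */
            atomic_store(&mailbox[me].fn, NULL);   /* mark idle again */
        } else {
            cpu_halt_until_interrupt();
        }
    }
}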
I was planning on having libraries work the way they did on the Amiga 68k..
so the OS loads the library (anywhere in memory); each library includes a version, so you can have 2 or more versions of a library in memory at the same time.
The OS returns a base pointer to the library, from which you have a set of vectors, relative to that base, to all the library functions:
invoke Load_Library, 'exec.library',100 ;Load exec library v1.00
LIB_INVOKE [rax+EXECLIB.OpenFile], 'myfile.dat'
So while the library can be "shared", in that the OS would return the same pointer to any app requesting it if it's already loaded, it wouldn't be paged into the app's address space.
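In C terms the scheme might look like this - a sketch only; the struct layout and every name here are invented for illustration:
Code: Select all
/* Sketch: an Amiga-style library base - one shared base pointer per
 * (name, version) pair, with function vectors reached through it. */
#include <stdint.h>

struct library_base {
    const char *name;           /* e.g. "exec.library" */
    uint32_t    version;        /* e.g. 100 for v1.00  */
    /* vector table: one entry per exported function */
    int (*OpenFile)(const char *path);
    int (*CloseFile)(int handle);
};

/* Returns the (single, shared) base if already loaded, else loads it. */
extern struct library_base *load_library(const char *name, uint32_t min_version);

void example(void)
{
    struct library_base *exec = load_library("exec.library", 100);
    int fh = exec->OpenFile("myfile.dat");   /* like LIB_INVOKE above */
    exec->CloseFile(fh);
}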