TLB entries, page faults and cached paging structures
Hi,
I have a few questions on which I was not able to find consistent answers in the Intel specs and/or on the web.
It would be great if someone could help me out with them:
- In general a TLB entry is created after a successful address translation. However, does this also happen if the resulting memory is not accessible due to access protections (=> page fault)? I mean: the address translation itself worked properly ...
- Is an existing TLB entry invalidated if a page fault occurs for that address? Is there different behavior for a) no memory mapped or b) an access protection violation?
- Besides the TLB entries: are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC caches? Or is there an additional cache for the paging structures?
- If so: are there any circumstances under which the MMU flushes already cached PDE/PTE entries from the L1/L2/LLC caches?
Thanks a lot for your help in advance ...
cw
Re: TLB entries, page faults and cached paging structures
Yo:
1. If the "present" bit is set in the page table entry, then yes, it is loaded into the CPU's TLB.
2. Yes, an entry is replaced based on an LRU algorithm. For (a), if by that you mean "the present bit is unset", then the answer is that the entry is not loaded into the TLB on Intel/AMD CPUs. There are x86 clones manufactured by less commonly known vendors which would still load a TLB entry when the table walk found a "not present" page. For (b), no, there is no variation.
3. To my knowledge, yes they are read through the cache.
4. They should face the same eviction policy as other cache lines.
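To illustrate (1) and (2), here is a toy software model of the rule; this is not how the hardware is implemented, it only shows when a translation would be remembered (a walk that hits a "not present" entry caches nothing on Intel/AMD, while a walk that succeeds is remembered even if the access itself still faults on protection):
Code:
/* Toy model of a 32-bit, 2-level walk. A translation is only inserted into
 * the (simplified) TLB when the walk finds P=1 in both the PDE and the PTE;
 * a protection violation still faults, but the translation is remembered.
 * Illustration only, not a description of the hardware implementation. */
#include <stdbool.h>
#include <stdint.h>

#define PG_P   (1u << 0)   /* present  */
#define PG_RW  (1u << 1)   /* writable */

typedef struct { uint32_t va_page, pa_page, flags; bool valid; } tlb_entry_t;

enum walk_result { WALK_OK, FAULT_NOT_PRESENT, FAULT_PROTECTION };

static enum walk_result
translate(const uint32_t *pd, uint32_t va, bool is_write, tlb_entry_t *tlb)
{
    uint32_t pde = pd[va >> 22];
    if (!(pde & PG_P))
        return FAULT_NOT_PRESENT;            /* nothing cached (Intel/AMD) */

    /* Toy model only: treat the "physical" address as directly addressable. */
    const uint32_t *pt = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
    uint32_t pte = pt[(va >> 12) & 0x3FFu];
    if (!(pte & PG_P))
        return FAULT_NOT_PRESENT;            /* nothing cached (Intel/AMD) */

    /* The walk succeeded, so the translation is remembered ... */
    tlb->va_page = va >> 12;
    tlb->pa_page = pte >> 12;
    tlb->flags   = pte & 0xFFFu;
    tlb->valid   = true;

    /* ... even if the access itself still violates the protection bits. */
    if (is_write && !(pte & PG_RW))
        return FAULT_PROTECTION;

    return WALK_OK;
}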
--Peace out,
gravaera
1. If the "present" bit is set in the page table entry, then yes, it is loaded into the CPU's TLB.
2. Yes, an entry is replaced based on an LRU algorithm. For (a), if by that you mean, "the present bit is unset" then the answer is that the entry is not loaded into the TLB on Intel/AMD CPUs. There are x86 clones manufactured by less commonly known vendors which would still load the TLB entry when the table-walk found a "not present" page. For (b), no, there is no variation.
3. To my knowledge, yes they are read through the cache.
4. They should face the same eviction policy as other cache lines.
--Peace out,
gravaera
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
Re: TLB entries, page faults and cached paging structures
Cross-posting. It would be helpful if you had provided links to any of the significantly inconsistent information you found.
I haven't looked at the Intel manuals in ages (since P-IV times, anyway), so take this with a grain of salt.
The TLB caches successful lookups of an address in the page directory (so the structures don't have to be "walked" the next time around). A "page fault" happens if the lookup is not successful. Invalid access rights are a different thing, are they not?
That being said, I am not aware of an "additional" cache for page directories / tables.
Every good solution is obvious once you've found it.
Re: TLB entries, page faults and cached paging structures
cwillems wrote:Besides the TLB entries: are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC caches? Or is there an additional cache for the paging structures?
That depends. They are mixed with ordinary data, but separated according to the PS bit: 4k pages use a different cache than 2MB pages, so they will be separated if big pages are used (for 64-bit x86 at least).
Re: TLB entries, page faults and cached paging structures
turdus wrote:That depends. They are mixed with ordinary data, but separated according to the PS bit: 4k pages use a different cache than 2MB pages, so they will be separated if big pages are used (for 64-bit x86 at least).
You are mixing up the TLB (which has separate entries for 4K, 2M/4M and 1G pages) and the storing of PDE/PTE entries in the processor's L1/L2/LLC data caches.
The answer is:
- PML4E/PDPTE/PDE/PTE entries are also stored in the processor caches (L1/L2/LLC) as regular data and can be evicted through regular cache line eviction policy.
- In addition, the PMH (Page Miss Handler, the hardware page walker) also has internal caches for PML4E/PDPTE/PDE entries. They are called the PML4E cache, PDPTE cache and PDE cache (no surprise).
The Intel manuals have a very detailed description of these PMH caches and when they are invalidated.
Stanislav
P.S. In general, gravaera's first answer is spot on and fully explains what actually happens.
Re: TLB entries, page faults and cached paging structures
Hi,
thanks a lot for your answers so far. The observed behavior now looks much more plausible to me.
I still have two questions left regarding the PML4E cache/PDPTE cache/PDE caches.
Though I now have found the corresponding Intel spec, I still wonder ...
- Is a paging structure entry cached under the same circumstances as a TLB entry is created, i.e. only if the present bit is set, and is it also cached on a valid address translation that results in an access violation?
- Is there any public information regarding the specific implementation details of current Intel CPUs (Sandy Bridge/Nehalem) about the size, associativity or replacement strategy of those caches?
Thanks again for your support
cw
Re: TLB entries, page faults and cached paging structures
"Large page structures" have their own entries inside of the TLB. PML4E's, PDPTE's (that arent 1GB pages), and PDPE's (that aren't 2MB pages) are stored in a different cache. Just like pages, page structures will only be cached if the present bit is set. If you change a paging structure entry, you have to invalidate any pages that are present inside of that paging structure, or flush the entire TLB. There is no way to explicitly flush page structure caches.cwillems wrote:Hi,
thanks a lot for your answers so far. The observed behavior now looks much more plausible to me.
I still have two questions left regarding the PML4E cache/PDPTE cache/PDE caches.
Though I know have found the corresponding Intel spec. I still wonder ..
- is a paging structure cached under the same circumstances as TLB entries created, i.e. only if "present bit set" and also cached "on valid address translation but resulting access violation"?
- are there any public information regarding the specific implementation details of current Intel cpus (SandyBridge/Nehalem) about the size, associativity or replacement strartegy of those caches?
Thanks again for your support
cw
Details about sizes/associativity of caches can be found in Chapter 11.1 of Volume 3A of the Intel manuals. I'm not sure what you plan to gain from reading that, but if you're looking for a way to avoid flushing pages that have their present bit set, forget about it - there is no way to do it safely.
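To make the "invalidate or flush" part concrete, here is a minimal sketch (GCC-style inline assembly; the helper names are just illustrative): INVLPG drops the TLB entry for one linear address (and related paging-structure cache entries), while reloading CR3 flushes all non-global TLB entries together with the paging-structure caches.
Code:
/* Minimal sketch, GCC-style inline asm; function names are illustrative. */
#include <stdint.h>

static inline void invlpg(void *linear_addr)
{
    /* Invalidates the TLB entry for this page and associated
     * paging-structure cache entries. */
    __asm__ volatile ("invlpg (%0)" : : "r"(linear_addr) : "memory");
}

static inline void flush_tlb_full(void)
{
    /* Reloading CR3 flushes all non-global TLB entries and the
     * paging-structure caches. */
    uintptr_t cr3;
    __asm__ volatile ("mov %%cr3, %0" : "=r"(cr3));
    __asm__ volatile ("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

/* Example: after changing one PTE, invalidate just that page.
 * (Entry width here simply matches the machine word; sketch only.) */
static inline void set_pte(volatile uintptr_t *pte, uintptr_t value, void *va)
{
    *pte = value;
    invlpg(va);
}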
Re: TLB entries, page faults and cached paging structures
Yo:
cwillems wrote:- Is a paging structure entry cached under the same circumstances as a TLB entry is created, i.e. only if the present bit is set, and is it also cached on a valid address translation that results in an access violation?
- Is there any public information regarding the specific implementation details of current Intel CPUs (Sandy Bridge/Nehalem) about the size, associativity or replacement strategy of those caches?
It seems like you're trying to consider a performance use-case for page table walks and decide how to best ensure that the internal caches aren't thrashed often. I'd say that's not worth the time. The best way to avoid unnecessary d-cache cooling due to page table walks is to avoid having the CPU need to do table walks altogether.
For the kernel, this would mean optimizing and tightening loops; or, when processing large amounts of data, trying not to iterate over a multi-page range repeatedly. Attempt, for example, to finish all processing on the first page of data, then move on to doing all processing on the second page of data, etc. until you've finished the data processing.
You can also use strategies such as mapping the entire kernel image in RAM using a single 4MiB page, and making it global so that kernel static code and data fetches never require a TLB load.
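For reference, a minimal 32-bit sketch of such a 4 MiB kernel mapping (names and addresses are illustrative; it assumes CR4.PSE and CR4.PGE are already enabled and the kernel's physical load address is 4 MiB aligned):
Code:
/* Sketch: map the kernel with one 4 MiB global page (32-bit paging).
 * Assumes CR4.PSE (page size extension) and CR4.PGE (global pages) are
 * already enabled; names and addresses are illustrative only. */
#include <stdint.h>

#define PDE_P   (1u << 0)   /* present    */
#define PDE_RW  (1u << 1)   /* writable   */
#define PDE_PS  (1u << 7)   /* 4 MiB page */
#define PDE_G   (1u << 8)   /* global     */

void map_kernel_4mib(uint32_t *page_directory,
                     uint32_t kernel_virt, uint32_t kernel_phys)
{
    /* One PDE covers 4 MiB of the virtual address space. */
    uint32_t index = kernel_virt >> 22;

    /* The physical address must be 4 MiB aligned for a PS=1 entry. */
    page_directory[index] = (kernel_phys & 0xFFC00000u)
                          | PDE_P | PDE_RW | PDE_PS | PDE_G;
}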
For userspace, this would mean relying on the userspace process to have optimized its own use of memory so that it doesn't jump around the address space excessively; you can also use "non-migrating" scheduling techniques to increase the likelihood of a process running on a CPU that has a warm TLB for its address space. Basically, don't try to consider the fine details of how the CPU's i/d-cache or TLB work. The manuals even say that you should not assume the presence or behavioural details of a CPU's TLB. To answer your questions:
1. Tbh, I have no idea, sorry.
2. Again, no idea. I never paid attention to that kind of detail -- generalized optimization strategies will most likely outperform any kind of model-specific strategies in the common case, unless you're writing software which only needs to run on a known, small group of CPUs.
--Peace out
gravaera
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
Re: TLB entries, page faults and cached paging structures
stlw wrote:You are mixing up the TLB (which has separate entries for 4K, 2M/4M and 1G pages) and the storing of PDE/PTE entries in the processor's L1/L2/LLC data caches.
You misunderstand. As far as I know, paging structure pages are cached like any other pages (as opposed to the 2nd part of your sentence, which suggests they are not - but I'm not sure, possibly you're correct). What I wanted to say is: if you have big data pages (2M/4M, 1G), then the paging structures (as they are all 4k) will definitely be placed in different caches (which lines up with the 1st part of your sentence).
Re: TLB entries, page faults and cached paging structures
cwillems wrote:Hi,
- Is a paging structure entry cached under the same circumstances as a TLB entry is created, i.e. only if the present bit is set, and is it also cached on a valid address translation that results in an access violation?
- Is there any public information regarding the specific implementation details of current Intel CPUs (Sandy Bridge/Nehalem) about the size, associativity or replacement strategy of those caches?
Thanks again for your support
cw
1. Yes.
2. I believe the Intel optimization guide has the exact details.
But in fact you don't need to read the optimization guide, because the information is provided through CPUID as well.
Detailed information about CPUID can be found in the same Intel Software Developer's Manual (Vol. 2A).
Stanislav
Re: TLB entries, page faults and cached paging structures
Hi,
Ok, I was too lazy/busy to read all the previous replies and figure out what has been answered so far and what hasn't, so I covered everything asked. I apologise for any/all redundant answers.
cwillems wrote:In general a TLB entry is created after a successful address translation. However, does this also happen if the resulting memory is not accessible due to access protections (=> page fault)? I mean: the address translation itself worked properly ...
In general the CPU may cache any address translation. For all 80x86 CPUs this is likely to include translations that failed due to permission problems (e.g. if you write to a read-only page, then the CPU will remember the translation for the read-only page).
Intel CPUs (and I assume AMD CPUs too) will not remember address translations that result in "page not present"; however modern CPUs (with higher level TLB caches) may remember where to find the page directory or page directory pointer table, even if they don't remember that the page table itself refers to a "not present" page. In addition, other CPUs from other manufacturers (specifically old Cyrix CPUs) will remember that a page wasn't present.
cwillems wrote:Is an existing TLB entry invalidated if a page fault occurs for that address? Is there different behavior for a) no memory mapped or b) an access protection violation?
A page fault won't cause any TLB invalidation. As an optimisation, some OS's do something called "lazy TLB invalidation", where they don't invalidate the TLB in some cases and then (if a page fault occurs due to stale data in a TLB) will invalidate the TLB in the page fault handler instead.
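A rough sketch of that "lazy" approach (all of these helper names are hypothetical, not from any particular kernel): the page fault handler re-walks the tables in software, and if they already permit the access, the fault must have been caused by a stale TLB entry, so a single INVLPG is enough.
Code:
/* Rough sketch of "lazy TLB invalidation"; all names are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

/* Provided elsewhere by the (hypothetical) kernel: */
extern uintptr_t read_cr2(void);                         /* faulting address   */
extern bool tables_permit(uintptr_t va, uint32_t error); /* software re-walk   */
extern void invlpg(void *va);
extern void handle_real_page_fault(uintptr_t va, uint32_t error);

void page_fault_handler(uint32_t error_code)
{
    uintptr_t fault_va = read_cr2();

    if (tables_permit(fault_va, error_code)) {
        /* The mapping is fine; only the TLB was stale. */
        invlpg((void *)fault_va);
        return;
    }

    /* Otherwise it's a genuine fault (demand paging, CoW, protection...). */
    handle_real_page_fault(fault_va, error_code);
}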
cwillems wrote:Besides the TLB entries: are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC caches? Or is there an additional cache for the paging structures?
Yes and yes. The paging structures are also cached in the L1/L2/L3 data caches; and for modern CPUs there may be higher level TLB caches (e.g. the normal TLB entries that cache normal translations, plus a higher level TLB cache that remembers which page directory or page directory pointer table to use for different areas of the virtual address space). This means that (for example), if you access the virtual address 0x00000000 then the CPU might remember which physical page corresponds to virtual addresses from 0x00000000 to 0x00000FFF; and also remember which page directory corresponds to virtual addresses from 0x00000000 to 0x3FFFFFFF, so that if you then access the virtual address 0x22334455 it doesn't need to check the PML4 and PDPT a second time (and only needs to look at the page directory and page table).
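To make the example concrete, here is a small sketch that prints the 4-level index breakdown of those two addresses: they share the same PML4 and PDPT entries (and therefore the same page directory), and only the page directory and page table indices differ.
Code:
/* Index breakdown of a 4-level (long mode) virtual address. Sketch only. */
#include <stdint.h>
#include <stdio.h>

static void breakdown(uint64_t va)
{
    printf("va 0x%08llx -> PML4 %3u, PDPT %3u, PD %3u, PT %3u\n",
           (unsigned long long)va,
           (unsigned)((va >> 39) & 0x1FF),
           (unsigned)((va >> 30) & 0x1FF),
           (unsigned)((va >> 21) & 0x1FF),
           (unsigned)((va >> 12) & 0x1FF));
}

int main(void)
{
    breakdown(0x00000000);   /* PML4 0, PDPT 0, PD   0, PT   0 */
    breakdown(0x22334455);   /* PML4 0, PDPT 0, PD 273, PT 308 */
    return 0;
}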
cwillems wrote:If so: are there any circumstances under which the MMU flushes already cached PDE/PTE entries from the L1/L2/LLC caches?
The CPU may flush anything from L1/L2/L3 (or from any TLB) whenever it feels like it. Most CPUs use (a variation of) a "least recently used" eviction strategy, where things that haven't been used recently are removed from the cache to make room for recently used things. This means that (for example) a simple loop that reads every 64th byte of a large enough area (as large as the cache size) can completely fill the L1/L2/L3 data cache/s and all previous data will be evicted. This problem is called cache pollution, and it is the reason why modern CPUs have things like the CLFLUSH instruction (so that software can explicitly flush a cache line, to try to minimise cache pollution). Note: due to the way caches are designed, even with CLFLUSH the effects of cache pollution can't be entirely avoided - e.g. with an "8-way associative" cache, a loop like the one I described would wipe out 12.5% of the cache.
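As an illustration of that kind of loop and of CLFLUSH (via the SSE2 _mm_clflush intrinsic), here is a user-space sketch; the buffer size and the 64-byte stride are illustrative assumptions, not tuned values.
Code:
/* Illustration of the cache-pollution loop described above, and of using
 * CLFLUSH (via the SSE2 _mm_clflush intrinsic) to evict lines explicitly. */
#include <emmintrin.h>   /* _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64
#define BUF_SIZE   (8u * 1024u * 1024u)   /* illustrative; ideally > LLC size */

static uint8_t buffer[BUF_SIZE];

uint64_t touch_every_line(void)
{
    uint64_t sum = 0;

    /* Reading one byte per 64-byte line pulls the whole line into the
     * caches; a large enough buffer evicts most previously cached data. */
    for (size_t i = 0; i < BUF_SIZE; i += CACHE_LINE)
        sum += buffer[i];

    /* Flushing each line afterwards limits (but cannot fully avoid)
     * the pollution caused by the pass above. */
    for (size_t i = 0; i < BUF_SIZE; i += CACHE_LINE)
        _mm_clflush(&buffer[i]);

    return sum;
}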
cwillems wrote:Is a paging structure entry cached under the same circumstances as a TLB entry is created, i.e. only if the present bit is set, and is it also cached on a valid address translation that results in an access violation?
No. L1/L2/L3 can cache anything the CPU touches, regardless of what it is (and regardless of whether it's a page table entry for a "not present" page or not). More specifically, what a CPU will cache in the L1/L2/L3 caches is only determined by access patterns, physical addresses and the MTRR/PAT settings.
cwillems wrote:Is there any public information regarding the specific implementation details of current Intel CPUs (Sandy Bridge/Nehalem) about the size, associativity or replacement strategy of those caches?
CPUID (for modern CPUs) will tell you the size, associativity and how many (logical) CPUs share each level of cache. For older CPUs you may only be able to determine size and associativity. For even older CPUs you may not be able to determine anything (without using vendor ID/model as an index into your own lookup table/s derived from hours of searching through datasheets). I don't think there's any way to determine the replacement strategy of any cache (but I'd assume "least recently used"). Also note that some CPUs use inclusive caches (mostly Intel) and some use exclusive caches (mostly AMD). For an inclusive cache, a specific piece of data may be in multiple caches at the same time (e.g. in the L1, L2 and L3); while for an exclusive cache a specific piece of data can only be in one cache (e.g. if something is in the L1 cache then it can't also be in the L2 or L3 cache).
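For example, on Intel CPUs a minimal user-space sketch using GCC/Clang's <cpuid.h> can enumerate CPUID leaf 04H (the deterministic cache parameters leaf) and print the size and associativity of each cache level; the field decoding below follows the leaf 04H description in the SDM.
Code:
/* Minimal user-space sketch (GCC/Clang <cpuid.h>, Intel CPUs): enumerate
 * CPUID leaf 04H and print size and associativity of each cache level. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    for (unsigned index = 0; ; index++) {
        if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
            break;                       /* leaf 4 not supported at all */

        unsigned type = eax & 0x1F;      /* 0 = no more caches */
        if (type == 0)
            break;

        unsigned level      = (eax >> 5) & 0x7;
        unsigned line_size  = (ebx & 0xFFF) + 1;
        unsigned partitions = ((ebx >> 12) & 0x3FF) + 1;
        unsigned ways       = ((ebx >> 22) & 0x3FF) + 1;
        unsigned sets       = ecx + 1;
        unsigned long size  = (unsigned long)ways * partitions * line_size * sets;

        printf("L%u %s cache: %lu KiB, %u-way, %u-byte lines\n",
               level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               size / 1024, ways, line_size);
    }
    return 0;
}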
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.