TLB entries, page faults and cached paging structures

cwillems
Posts: 2
Joined: Mon Apr 16, 2012 5:22 am

TLB entries, page faults and cached paging structures

Post by cwillems »

Hi,

I have a few questions for which I could not find consistent answers in the Intel specs or on the web.
It would be great if someone could help me out with them:

- in general a TLB entry is created after a successful address translation. however, does this also happen if the resulting memory
is not accessible due to access protections (=> page fault)? I mean: the address translation itself worked properly ...

- is an existing TLB entry invalidated if a page fault occurs for that address? Is there different behavior for a) no memory mapped
or b) an access protection occurs ?

- besides the TLB entries: are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC Caches? Or is there an additional cache
for the paging structures?

- if so: are there any circumstances under which the MMU flushes already cached PDE/PTE entries from the L1/L2/LLC caches?

Thanks a lot for your help in advance ...
cw
gravaera
Member
Posts: 737
Joined: Tue Jun 02, 2009 4:35 pm
Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.

Re: TLB entries, page faults and cached paging structures

Post by gravaera »

Yo:

1. If the "present" bit is set in the page table entry, then yes, it is loaded into the CPU's TLB.

2. Yes, an entry is replaced based on an LRU algorithm. For (a), if by that you mean, "the present bit is unset" then the answer is that the entry is not loaded into the TLB on Intel/AMD CPUs. There are x86 clones manufactured by less commonly known vendors which would still load the TLB entry when the table-walk found a "not present" page. For (b), no, there is no variation.

3. To my knowledge, yes they are read through the cache.

4. They should face the same eviction policy as other cache lines.

--Peace out,
gravaera
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: TLB entries, page faults and cached paging structures

Post by Solar »

Cross-posting. It would be helpful if you had provided links to any significant inconsistent information you might have found.

I haven't looked at the Intel manuals in ages (since P-IV times, anyway), so take this with a grain of salt.

The TLB caches successful lookups of an address in the page directory (so the structures don't have to be "walked" the next time around). A "page fault" happens if the lookup is not successful. Invalid access rights is a different thing, is it not?

That being said, I am not aware of an "additional" cache for page directories / tables.
Every good solution is obvious once you've found it.
turdus
Member
Posts: 496
Joined: Tue Feb 08, 2011 1:58 pm

Re: TLB entries, page faults and cached paging structures

Post by turdus »

cwillems wrote:are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC Caches? Or is there an additional cache
for the paging structures?
That depends. They are mixed with ordinary data, but separated according to the PS bit. 4k pages use a different cache than 2MB pages, so they will be separated if big pages are used (for 64-bit x86 at least).
stlw
Member
Posts: 357
Joined: Fri Apr 04, 2008 6:43 am

Re: TLB entries, page faults and cached paging structures

Post by stlw »

turdus wrote:
cwillems wrote:are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC Caches? Or is there an additional cache
for the paging structures?
That depends. They are mixed with ordinary data, but separated according to the PS bit. 4k pages use a different cache than 2MB pages, so they will be separated if big pages are used (for 64-bit x86 at least).
You are mixing up the TLB (which has separate entries for 4K, 2M/4M and 1G pages) with the storage of PDEs/PTEs in the processor's L1/L2/LLC data caches.

The answer is:

- PML4E/PDPTE/PDE/PTE entries are also stored in the processor caches (L1/L2/LLC) as regular data and can be evicted by the regular cache line eviction policy.

- In addition, the PMH (Page Miss Handler, the hardware page walker) has internal caches for PML4E/PDPTE/PDE entries. They are called the PML4E cache, PDPTE cache and PDE cache (no surprise).
The Intel manuals describe these PMH caches, and when they are invalidated, in detail.

Stanislav

P.S. In general, gravaera's first answer is just perfect and fully explains what actually happens, without any misunderstanding.
cwillems
Posts: 2
Joined: Mon Apr 16, 2012 5:22 am

Re: TLB entries, page faults and cached paging structures

Post by cwillems »

Hi,

thanks a lot for your answers so far. The observed behavior now looks much more plausible to me.
I still have two questions left regarding the PML4E cache/PDPTE cache/PDE caches.
Though I have now found the corresponding Intel spec, I still wonder ...

- is a paging structure cached under the same circumstances as TLB entries are created, i.e. only if "present bit set" and also cached "on valid address translation but resulting access violation"?
- is there any public information regarding the specific implementation details of current Intel CPUs (SandyBridge/Nehalem), such as the size, associativity or replacement strategy of those caches?

Thanks again for your support
cw
Rudster816
Member
Posts: 141
Joined: Thu Jun 17, 2010 2:36 am

Re: TLB entries, page faults and cached paging structures

Post by Rudster816 »

cwillems wrote:Hi,

thanks a lot for your answers so far. The observed behavior now looks much more plausible to me.
I still have two questions left regarding the PML4E cache/PDPTE cache/PDE caches.
Though I have now found the corresponding Intel spec, I still wonder ...

- is a paging structure cached under the same circumstances as TLB entries are created, i.e. only if "present bit set" and also cached "on valid address translation but resulting access violation"?
- is there any public information regarding the specific implementation details of current Intel CPUs (SandyBridge/Nehalem), such as the size, associativity or replacement strategy of those caches?

Thanks again for your support
cw
"Large page structures" have their own entries inside the TLB. PML4Es, PDPTEs (that aren't 1GB pages), and PDEs (that aren't 2MB pages) are stored in a different cache. Just like pages, paging structures will only be cached if the present bit is set. If you change a paging-structure entry, you have to invalidate any pages that are present inside that paging structure, or flush the entire TLB. There is no dedicated instruction that flushes only the paging-structure caches; they are invalidated as a side effect of INVLPG or of reloading CR3.

Details about the sizes/associativity of caches can be found in Chapter 11.1, Volume 3A of the Intel manuals. I'm not sure what you plan to gain from reading that, but if you're looking for a way to avoid flushing pages that have their present bit set, forget about it; there is no way to do it safely.
gravaera
Member
Posts: 737
Joined: Tue Jun 02, 2009 4:35 pm
Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.

Re: TLB entries, page faults and cached paging structures

Post by gravaera »

Yo:
cwillems wrote: - is a paging structure cached under the same circumstances as TLB entries created, i.e. only if "present bit set" and also cached "on valid address translation but resulting access violation"?
- is there any public information regarding the specific implementation details of current Intel CPUs (SandyBridge/Nehalem), such as the size, associativity or replacement strategy of those caches?
It seems like you're trying to consider a performance use-case for page table walks and decide how to best ensure that the internal caches aren't thrashed often. I'd say that's not worth the time. The best way to avoid unnecessary d-cache cooling due to page table walks is to avoid having the CPU need to do table walks altogether.

For the kernel, this would mean optimizing and tightening loops; or, when processing large amounts of data, trying not to iterate over a multiple-page range repeatedly. For example, attempt to finish all processing on the first page of data, then move on to doing all processing on the second page of data, and so on until you've finished.

You can also use strategies such as mapping the entire kernel image in RAM using a single 4 MiB page, and making it global so that kernel static code and data fetches rarely require a TLB load.

For userspace, this would mean depending on the userspace process to have optimized its own use of memory expanses so that it doesn't jump around the address space excessively; you can also use "non-migrating" scheduling techniques to increase the likelihood of a process running on a CPU that has a warm TLB for its address space. Basically, don't try to consider the fine details of how the CPU's i/d-cache or TLB work. The manuals even say that you should not assume the presence or behavioural details of a CPU's TLB. To answer your questions:

1. Tbh, I have no idea, sorry.
2. Again, no idea. I never paid attention to that kind of detail -- generalized optimization strategies will most likely outperform any kind of model-specific strategies in the common case, unless you're writing software which only needs to run on a known, small group of CPUs.

--Peace out
gravaera
turdus
Member
Posts: 496
Joined: Tue Feb 08, 2011 1:58 pm

Re: TLB entries, page faults and cached paging structures

Post by turdus »

stlw wrote:You are mixing up the TLB (which has separate entries for 4K, 2M/4M and 1G pages) with the storage of PDEs/PTEs in the processor's L1/L2/LLC data caches.
You misunderstand. As far as I know, paging-structure pages are cached like any other pages (as opposed to the second part of your sentence, which suggests they are not; but I'm not sure, possibly you're correct).
What I wanted to say is that if you have big data pages (2M/4M, 1G), then the paging structures (all being 4k) will definitely be placed in different caches (which lines up with the first part of your sentence).
stlw
Member
Posts: 357
Joined: Fri Apr 04, 2008 6:43 am

Re: TLB entries, page faults and cached paging structures

Post by stlw »

cwillems wrote:Hi,
- is a paging structure cached under the same circumstances as TLB entries created, i.e. only if "present bit set" and also cached "on valid address translation but resulting access violation"?
- is there any public information regarding the specific implementation details of current Intel CPUs (SandyBridge/Nehalem), such as the size, associativity or replacement strategy of those caches?

Thanks again for your support
cw

1. yes
2. I believe the Intel optimization guide has the exact details.
But in fact you don't need to read the optimization guide, because the information is provided through CPUID as well.
Detailed information about CPUID can be found in the same Intel Software Developer's Manual (Volume 2A).

Stanislav
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: TLB entries, page faults and cached paging structures

Post by Brendan »

Hi,

Ok, I was too lazy/busy to read all the previous replies and figure out what has been answered so far and what hasn't, so I covered everything asked. I apologise for any/all redundant answers.
cwillems wrote:- in general a TLB entry is created after a successful address translation. however, does this also happen if the resulting memory
is not accessible due to access protections (=> page fault)? I mean: the address translation itself worked properly ...
In general the CPU may cache any address translation. For all 80x86 CPUs this is likely to include translations that failed due to permission problems (e.g. if you write to a read-only page, then the CPU will remember the translation for the read-only page).

Intel CPUs (and I assume AMD CPUs too) will not remember address translations that result in "page not present"; however modern CPUs (with higher level TLB caches) may remember where to find the page directory or page directory pointer table, even if they don't remember that the page table itself refers to a "not present" page. In addition, other CPUs from other manufacturers (specifically old Cyrix CPUs) will remember that a page wasn't present.
cwillems wrote:- is an existing TLB entry invalidated if a page fault occurs for that address? Is there different behavior for a) no memory mapped
or b) an access protection occurs ?
A page fault won't cause any TLB invalidation. As an optimisation, some OS's do something called "lazy TLB invalidation" where they don't invalidate the TLB in some cases and then (if a page fault occurs due to stale data in a TLB) will invalidate the TLB in the page fault handler instead.
cwillems wrote:- besides the TLB entries: are the PDE/PTEs also stored (as regular data) in the L1/L2/LLC Caches? Or is there an additional cache
for the paging structures?
Yes and yes. The paging structures are also cached in the L1/L2/L3 data caches; and for modern CPUs there may be higher level TLB caches (e.g. the normal TLB entries that cache normal translations, plus a higher level TLB cache that remembers which page directory or page directory pointer table to use for different areas of the virtual address space). This means that (for example), if you access the virtual address 0x00000000 then the CPU might remember which physical page corresponds to virtual addresses from 0x00000000 to 0x00000FFF; and also remember which page directory corresponds to virtual addresses from 0x00000000 to 0x3FFFFFFF, so that if you then access the virtual address 0x22334455 it doesn't need to check the PML4 and PDPT a second time (and only needs to look at the page directory and page table).
cwillems wrote:- if so: are there any circumstances under which the MMU flushes already cached PDE/PTE entries from the L1/L2/LLC caches?
The CPU may flush anything from L1/L2/L3 (or from any TLB) whenever it feels like it. Most CPUs use (a variation of) a "least recently used" eviction strategy, where things that haven't been used recently are removed from the cache to make room for recently used things. This means that (for example) a simple loop that reads every 64th byte of a large enough area (as large as the cache size) can completely fill the L1/L2/L3 data caches, and all previous data will be evicted. This problem is called cache pollution and it is the reason why modern CPUs have things like the CLFLUSH instruction (so that software can explicitly flush a cache line, to try to minimise cache pollution). Note: Due to the way caches are designed, even with CLFLUSH the effects of cache pollution can't be entirely avoided - e.g. with an "8-way associative" cache, a loop like the one I described would wipe out 12.5% of the cache.
cwillems wrote: - is a paging structure cached under the same circumstances as TLB entries created, i.e. only if "present bit set" and also cached "on valid address translation but resulting access violation"?
No. L1/L2/L3 can cache anything the CPU touches, regardless of what it is (and regardless of whether it's a page table entry for a "not present" page or not). More specifically, what a CPU will cache in the L1/L2/L3 caches is only determined by access patterns, physical addresses and the MTRR/PAT settings.
cwillems wrote: - is there any public information regarding the specific implementation details of current Intel CPUs (SandyBridge/Nehalem), such as the size, associativity or replacement strategy of those caches?
CPUID (for modern CPUs) will tell you the size, associativity and how many (logical) CPUs share each level of cache. For older CPUs you may only be able to determine size, and associativity. For even older CPUs you may not be able to determine anything (without using "vendorID:family:model" as an index into your own lookup table/s derived from hours of searching through datasheets). I don't think there's any way to determine the replacement strategy of any cache (but I'd assume "least recently used"). Also note that some CPUs use inclusive caches (mostly Intel) and some use exclusive caches (mostly AMD). For an inclusive cache, a specific piece of data may be in multiple caches at the same time (e.g. in the L1, L2 and L3); while for an exclusive cache a specific piece of data can only be in one cache (e.g. if something is in the L1 cache then it can't also be in the L2 or L3 cache).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.