OSDev.org

Posted: **Sun Nov 27, 2005 1:41 pm**

Does anyone have a feeling for the (task switch) cost of loading new page tables via a CR3 update?

Or is it cheaper to (manually) change only the existing page table entries and invalidate the affected TLB entries accordingly?

I'm interested in AMD64, but could not find any latency information for it, except for INVLPG, which is said to take 80 ticks...

Any help appreciated - I'm curious why detailed information about timings is so rare - or at least, rare to me

Posted: **Sun Nov 27, 2005 2:56 pm**

I'm not really sure of the exact numbers. But considering that 99% of the operating systems in use today (as far as I know) simply reload CR3, I'd say Intel/AMD (/Cyrix?) have made it faster than anything you could do on the software side.

Your best bet is to just help the processor out, if you ask me: if something isn't going to change (e.g., the location of the kernel), make sure you mark it as global. Considering kernels usually occupy 0xC0000000 -> 0xFFFFFFFF, a quarter of the 4 GB address space, you could save about 1/4 of the processing time for a CR3 reload simply by marking those pages global.

Also, what would you gain from all that complexity? 1 tick? I mean, really. Not only that, but you'd need extra memory for an entire set of page tables and a page directory (just over 4 MB) so that you could swap in the needed data from the page tables/directories that the tasks themselves have. Then you'd have to check which ones need to be changed... and since you work with physical addresses at that level, you can't simply map the new page table; you must copy memory in some cases, meaning you'd waste even more time.
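The "just over 4 MB" figure can be checked with a little arithmetic. The sketch below assumes classic 32-bit x86 paging without PAE (4 KiB pages, 4-byte entries, 1024 entries per table); the function names are made up for illustration and are not from any real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit x86 paging without PAE (4 KiB pages, 4-byte entries). */
enum {
    PAGE_SIZE         = 4096,
    ENTRIES_PER_TABLE = 1024   /* 4096 bytes / 4 bytes per entry */
};

/* Bytes needed for one page directory plus `tables` page tables. */
static size_t paging_structure_bytes(size_t tables)
{
    assert(tables <= ENTRIES_PER_TABLE);   /* a directory has 1024 slots */
    return (size_t)PAGE_SIZE * (1 + tables);
}

/* Bytes of virtual address space that `tables` page tables can map. */
static uint64_t mapped_bytes(size_t tables)
{
    assert(tables <= ENTRIES_PER_TABLE);
    return (uint64_t)tables * ENTRIES_PER_TABLE * PAGE_SIZE;
}
```

With all 1024 page tables present, `paging_structure_bytes(1024)` comes to 4 MiB plus one 4 KiB page for the directory, and `mapped_bytes(1024)` covers the full 4 GiB address space.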

Posted: **Sun Nov 27, 2005 3:12 pm**

Well, I think my mind is *really* messed up after reading more specs than I ought to...

I just got suspicious when I read the AMD64 latency for the cache-invalidating instruction: WBINVD, a VectorPath instruction, takes 9474 (!) ticks as I understood it. I thought, well, changing CR3 flushes the TLB except for the global entries, and should (here I may be wrong) clear the cache as well.

If 99% of all OSes just use CR3 without any "manual" fiddling, I think I'm happy with the mainstream solution.

Thanks, Cjmovie

Posted: **Sun Nov 27, 2005 4:21 pm**

Hello tigujo,
I don't think that reloading the cr3 register with a new page-directory address is especially expensive in itself. What does take several thousand cycles, however, is refilling the TLB with new entries after it has been flushed. These are indirect costs that are hard to estimate, as they depend on the working set of the process in question.

Quote from tigujo:
"I just got suspicious when I read the AMD64 latency for the cache-invalidating instruction: WBINVD, a VectorPath instruction, takes 9474 (!) ticks as I understood it. I thought, well, changing CR3 flushes the TLB except for the global entries, and should (here I may be wrong) clear the cache as well."

To my knowledge, wbinvd doesn't flush the TLB but rather the L1 and L2 caches. What makes it so expensive is that all modified data in these caches is written back before they are invalidated, to keep memory consistent (use invd if you don't want this). The 9474 ticks can therefore only be a rough estimate, as the real number depends on the cache size, which varies with the CPU model. By the way, clearing the TLB by reloading cr3 shouldn't touch the L1 and L2 caches, as they work on physical addresses and are therefore independent of the MMU's state.

regards,
gaf

Posted: **Sun Nov 27, 2005 6:05 pm**

hello Gaf,

thanks for clarifying. Sure, the dcache works on physical addresses and won't add to the latency of a CR3 reload.

The 'indirect costs' of a CR3 change, meaning the new TLB fills and the 'table walking', are something I'd like to get a feeling for. I think I will simply have to test it...
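Before measuring, a first-order estimate of those indirect costs can be sketched as "pages touched after the switch, times the cost of one table walk". This is only a toy model under that assumption: it ignores TLB capacity, page-walk caches and prefetching, and the function names and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * First-order estimate of TLB refill cost after a full flush:
 * every distinct page the task touches misses once, and each
 * miss costs one page-table walk.
 */
static uint64_t tlb_refill_cycles(uint64_t pages_touched,
                                  uint64_t cycles_per_walk)
{
    return pages_touched * cycles_per_walk;
}

/* Same estimate when `global_pages` of the touched pages survive the flush. */
static uint64_t tlb_refill_cycles_with_globals(uint64_t pages_touched,
                                               uint64_t global_pages,
                                               uint64_t cycles_per_walk)
{
    assert(global_pages <= pages_touched);
    return (pages_touched - global_pages) * cycles_per_walk;
}
```

For example, a task touching 32 pages at an assumed 100 cycles per walk would pay 3200 cycles, which is in the "several thousand cycles" range mentioned above. The per-walk cost is a placeholder to replace with a measured value.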

Posted: **Sun Nov 27, 2005 6:27 pm**

Gee, more questions...

What happens if I change a PTE 'manually'? I'd guess I have to tell the TLB somehow that the 'old' mapping is invalid (INVLPG flushes all).
Is there a way of flushing just a single TLB entry?

Quote from somewhere:
"For the Opteron a local TLB invalidation costs around 95 CPU cycles when the PTE exists in the data cache, and 320 cycles when it does not... This so-called Table-Walk is a very lengthy procedure indeed"

Posted: **Sun Nov 27, 2005 6:51 pm**

You mean just flush one page? INVLPG _does_ do that. You use it as such:

    INVLPG [AddressToInvalidate] ; Invalidates the TLB entry for that virtual address; the CPU reloads it from the page tables on the next access
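To make the difference between the two flush styles concrete, here is a toy software model of a TLB in C. It is a sketch for illustration only: the structure, the slot count and all function names are invented, and real TLBs are set-associative hardware structures. It assumes CR4.PGE is enabled, so a CR3 reload keeps global entries while INVLPG drops the entry for its page whether it is global or not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_SLOTS  8           /* arbitrary size for the toy model */
#define PAGE_SHIFT 12          /* 4 KiB pages */

struct tlb_entry {
    bool     valid;
    bool     global;           /* mirrors the G bit of the cached PTE */
    uint64_t vpn;              /* virtual page number */
};

struct toy_tlb {
    struct tlb_entry slot[TLB_SLOTS];
};

/* Cache a translation; returns false if the toy TLB is full. */
static bool toy_tlb_fill(struct toy_tlb *t, uint64_t vaddr, bool global)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!t->slot[i].valid) {
            t->slot[i].valid  = true;
            t->slot[i].global = global;
            t->slot[i].vpn    = vaddr >> PAGE_SHIFT;
            return true;
        }
    }
    return false;
}

/* Models INVLPG: drop the entry for one page, global or not. */
static void toy_invlpg(struct toy_tlb *t, uint64_t vaddr)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (t->slot[i].valid && t->slot[i].vpn == (vaddr >> PAGE_SHIFT))
            t->slot[i].valid = false;
    }
}

/* Models a CR3 reload with CR4.PGE set: drop all non-global entries. */
static void toy_reload_cr3(struct toy_tlb *t)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!t->slot[i].global)
            t->slot[i].valid = false;
    }
}

/* Number of translations currently cached. */
static size_t toy_tlb_count(const struct toy_tlb *t)
{
    size_t n = 0;
    for (size_t i = 0; i < TLB_SLOTS; i++)
        n += t->slot[i].valid ? 1 : 0;
    return n;
}
```

Filling two user pages and one global kernel page, then calling `toy_invlpg` on an address inside one user page, leaves two entries; `toy_reload_cr3` afterwards leaves only the global one.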

Posted: **Sun Nov 27, 2005 8:08 pm**

Oops,
thanks Cjmovie

Just found something to convince me of a pure CR3 update:
"The AMD Athlon 64 and AMD Opteron processors utilize a two-level TLB structure. A flush filter (new on the AMD Athlon 64 and AMD Opteron processors) eliminates unnecessary TLB flushes when loading the CR3 register."

tigujo

Posted: **Sun Nov 27, 2005 8:47 pm**

That doesn't mean it's a bad idea to give the processor a hint. In my OS, I have an area of 2 MB that will never move. It contains the page tables for the upper half of memory.
Then if I ever need to override something for a task, I simply have that task not include that section and make sure it isn't marked as global anymore. But at the start, the entire upper half _is_ globalized. And there isn't even a chance that the top 1 GB of virtual memory will differ between tasks, so that still saves a MINIMUM of 1/4 of the processing time.

I'd say the flush filter is a fallback and no excuse for passing up easy optimization techniques. Of course, I myself have an AMD64.
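For anyone wanting to try the "mark it global" hint, the bit fiddling might look roughly like the sketch below. It assumes 32-bit page-table entries and the 0xC0000000 kernel base mentioned earlier in the thread; the helper names are hypothetical. The G bit is bit 8 of a page-table entry, and it only takes effect once CR4.PGE (bit 7 of CR4) has been enabled.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT  (1u << 0)
#define PTE_GLOBAL   (1u << 8)   /* G bit in a 4 KiB page-table entry */
#define CR4_PGE      (1u << 7)   /* CR4.PGE must be set for G to take effect */

/* Assumed kernel window from the thread: top 1 GiB of a 32-bit space. */
#define KERNEL_BASE  0xC0000000u

/* Mark a mapping as global so it survives a CR3 reload. */
static uint32_t pte_set_global(uint32_t pte)
{
    return pte | PTE_GLOBAL;
}

/* Undo the hint, e.g. before overriding the mapping for one task. */
static uint32_t pte_clear_global(uint32_t pte)
{
    return pte & ~PTE_GLOBAL;
}

/* True if the address lies in the kernel window and may be marked global. */
static int is_kernel_address(uint32_t vaddr)
{
    return vaddr >= KERNEL_BASE;
}

/* Share of the 4 GiB address space covered by the kernel window, in percent. */
static unsigned kernel_share_percent(void)
{
    uint64_t window = (1ull << 32) - KERNEL_BASE;
    return (unsigned)(window * 100 / (1ull << 32));
}
```

`kernel_share_percent()` gives 25 for this layout, which is where the "1/4" in the posts above comes from. How much time that actually saves depends on how many of those pages a task really touches.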

CR3 change latency
