refreshing CR3 and CALL procedure costs.
Posted: Sun Oct 08, 2006 4:05 am
by mrkaktus
Hi, I have a question: where can I find out (or maybe you can tell me) how many clock cycles the processor needs to refresh all the TLBs when I change CR3? And how many cycles does the processor need to execute an assembler CALL instruction when the call goes from one part of the kernel code to another, in PL0 of course?
I'm mostly interested in the 386, where I can't refresh only selected TLB entries but have to refresh them all (if I understand this properly; the Intel manuals are poor on these things).
Thanks.
Posted: Sun Oct 08, 2006 7:58 am
by gaf
mrkaktus wrote:I have a question: where can I find out (or maybe you can tell me) how many clock cycles the processor needs to refresh all the TLBs when I change CR3?
Clearing the TLB itself shouldn't take much more than a couple of clock cycles. The expensive part is rebuilding it from the page tables with a new working set: since all mappings are gone, the process will cause a lot of TLB misses until all its pages are loaded back in. This causes considerable overhead, which is why you should really try to avoid TLB flushes wherever possible.
mrkaktus wrote:And how many cycles does the processor need to execute an assembler CALL instruction when the call goes from one part of the kernel code to another, in PL0 of course?
The exact timing always depends on your processor model as well as numerous other criteria. In general, however, a regular call instruction shouldn't be that expensive at all.
mrkaktus wrote:I'm mostly interested in the 386, where I can't refresh only selected TLB entries but have to refresh them all (if I understand this properly; the Intel manuals are poor on these things).
As far as I know, the invlpg instruction was first introduced on the 486. Machines that don't support it are so old and so rarely used that you should really ask yourself whether it's actually worth optimizing your code for them.
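If you do want to support both, the choice between the two invalidation methods can be made once at boot. The following is only a minimal sketch of that decision: the names tlb_strategy and pick_tlb_strategy are made up for illustration, and it assumes you already have the CPU family number from somewhere (3 for a 386, 4 for a 486, and so on). How you detect the family is left out.

```c
#include <assert.h>

/* Hypothetical names for illustration; not taken from any real kernel. */
enum tlb_strategy {
    TLB_RELOAD_CR3,  /* 386: rewrite CR3, which drops every TLB entry      */
    TLB_INVLPG       /* 486 and later: invlpg drops a single page's entry  */
};

/* cpu_family is assumed to be 3 for an 80386, 4 for an 80486, 5 for a
 * Pentium, and so on. Detecting it is outside the scope of this sketch. */
enum tlb_strategy pick_tlb_strategy(int cpu_family)
{
    assert(cpu_family >= 3);  /* paging needs at least a 386 */
    return (cpu_family >= 4) ? TLB_INVLPG : TLB_RELOAD_CR3;
}
```

The kernel would then call the matching routine whenever a single mapping changes, so only the 386 pays for a full flush in that case.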
regards,
gaf
Re: refreshing CR3 and CALL procedure costs.
Posted: Mon Oct 09, 2006 8:27 am
by MNemo
mrkaktus wrote:Hi, I have a question: where can I find out (or maybe you can tell me) how many clock cycles the processor needs to refresh all the TLBs when I change CR3? And how many cycles does the processor need to execute an assembler CALL instruction when the call goes from one part of the kernel code to another, in PL0 of course?
I'm mostly interested in the 386, where I can't refresh only selected TLB entries but have to refresh them all (if I understand this properly; the Intel manuals are poor on these things).
Thanks.
Since the Pentium there is a 'time stamp counter' register that counts clock cycles.
The assembler instruction 'RDTSC' reads this register into EDX:EAX (high half in EDX, low half in EAX). With this instruction it should be easy to find out how many cycles an operation takes.
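For what it's worth, here is a rough sketch of how that could look in C with GCC-style inline assembly. The function names (combine_tsc, read_tsc, cycles_between) are my own, and the read_tsc part only compiles on x86 targets; on a 386 or 486 the instruction does not exist at all.

```c
#include <assert.h>
#include <stdint.h>

/* RDTSC returns the high 32 bits in EDX and the low 32 bits in EAX. */
uint64_t combine_tsc(uint32_t edx_high, uint32_t eax_low)
{
    return ((uint64_t)edx_high << 32) | eax_low;
}

/* Cycles elapsed between two counter readings taken in that order. */
uint64_t cycles_between(uint64_t start, uint64_t end)
{
    assert(end >= start);  /* a 64-bit counter will not wrap in practice */
    return end - start;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read the time stamp counter (Pentium and later only). */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_tsc(hi, lo);
}
#endif
```

You would take one reading before and one after the code under test and pass both to cycles_between. Repeating the measurement many times and keeping the smallest value helps filter out interrupts and cache effects.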
Posted: Mon Oct 09, 2006 1:07 pm
by mrkaktus
Yes, I know there is such an instruction, but I'm searching for timings on the 80386, and on the 80486 too if there are any.
Posted: Mon Oct 09, 2006 1:22 pm
by MNemo
The 386 is very old; I don't think there are any functions like this.
(Why are you looking at the 386? Do you still have one?)
Posted: Tue Oct 10, 2006 10:34 am
by mrkaktus
There are no such functions on the 386, I know that, but I'm searching for TIMINGS: how many processor cycles are needed to execute a CALL or a full TLB refresh.
And I'm searching for this because I decided that my microkernel will work on all 32-bit processors. And yes, I have a 386 (even two, plus a PC for every other processor generation that has appeared since). I don't understand why people are always telling me the 386 is old, etc. I know that, and it is my choice to develop for this chip. So if you don't know anything about this subject and can't help me, please don't post at all rather than writing posts without useful knowledge.
Posted: Tue Oct 10, 2006 10:42 am
by gaf
The "mov cr3, eax" instruction itself will be relatively fast; it's the TLB misses that cause most of the overhead. After your kernel has returned, the next few hundred instructions of the user application will most likely access pages whose mappings are no longer present in the TLB. Each time a TLB miss occurs, the processor has to refer to the paging structures in memory to reload the mapping, which of course causes some overhead. This means that the reduced performance is not caused by a single extremely expensive instruction, but by a huge number of small instructions that now run slightly slower than they did with a loaded TLB.
There's thus no way to estimate the overhead with a µ-benchmark. To get reliable numbers, you'd have to compile two versions of your kernel (invlpg | flush) and compare their performance. Also keep in mind that the difference in performance depends entirely on the user application: programs with a high degree of locality won't be affected by a flush nearly as much as applications with a huge working set.
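To make the working-set argument a bit more concrete, here is a toy model in C. It is only a sketch under simplifying assumptions of my own: a direct-mapped TLB with 32 entries, and a program that touches its pages in a fixed round-robin order. Real hardware and real programs behave differently, so the counts only illustrate the trend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 32u          /* assumed size for this toy model */
#define NO_PAGE     UINT32_MAX   /* marks an empty TLB slot         */

/* Count TLB misses for `accesses` page touches that cycle through a
 * working set of `pages` pages. flush_every == 0 means never flush. */
unsigned long count_tlb_misses(uint32_t pages, unsigned long accesses,
                               unsigned long flush_every)
{
    uint32_t tlb[TLB_ENTRIES];
    unsigned long misses = 0;

    assert(pages > 0);
    for (size_t i = 0; i < TLB_ENTRIES; i++)
        tlb[i] = NO_PAGE;

    for (unsigned long n = 0; n < accesses; n++) {
        if (flush_every != 0 && n != 0 && n % flush_every == 0) {
            for (size_t i = 0; i < TLB_ENTRIES; i++)
                tlb[i] = NO_PAGE;            /* models a CR3 reload */
        }

        uint32_t page = (uint32_t)(n % pages);
        uint32_t slot = page % TLB_ENTRIES;  /* direct-mapped lookup */
        if (tlb[slot] != page) {
            tlb[slot] = page;                /* page-table walk and refill */
            misses++;
        }
    }
    return misses;
}
```

In this model, 1000 accesses over 4 pages cost 4 misses without flushing and 40 with a flush every 100 accesses, while 16 pages cost 16 and 160. The extra misses grow with the working set, which is the effect described above.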
regards,
gaf
Posted: Tue Oct 10, 2006 11:00 am
by gaf
mrkaktus wrote:I'm searching for TIMINGS: how many processor cycles are needed to execute a CALL or a full TLB refresh.
Instruction timings are different for each CPU model. At least for your 386 there should still be some tables on the internet. More modern architectures don't have anything like fixed cycle costs, as the speed of an instruction depends on too many other factors. Run some real-life benchmarks if you really need hard numbers; otherwise just rely on your coder's instincts.
regards,
gaf
Posted: Tue Oct 10, 2006 11:20 am
by mrkaktus
Thanks, gaf. I know that from the Pentium on you can't get fixed timings because of the U and V pipelines, the cache, etc. I'm thinking about such timings in relation to switching page tables and to calls because I'm deciding which way I should go in developing my internal kernel architecture. Now that I know the overhead is spread out over time, my point of view is changing.
At the moment my kernel switches to its own address space, where physical RAM is mapped one-to-one, so every operation is very simple to implement. There is a little more overhead than if it kept all its code in the process's address space (without the two CR3 reloads), but I think I can optimize it so that it will be as fast as one CR3 switch.
On the other hand, I could, like a normal kernel, have all its memory mapped into the process's address space, but that would complicate several procedures, such as creating a process by extracting its image from the parent's address space into a new one.
So, getting to the point, I'm now wondering whether anyone else has developed a kernel that gives a process almost all 4 GB of virtual space and runs in its own virtual space (using the mechanism of switching page tables)?
Posted: Tue Oct 10, 2006 12:33 pm
by gaf
If you use a separate context for your kernel, the TLB will get flushed on every single system call: when the user calls the kernel, the user-space mappings are lost; when the call returns, the kernel mappings are flushed. Neither kernel nor application can thus build up its working set in the TLB. Especially on modern computer architectures, the resulting poor TLB utilization will be a major performance problem. It's thus safe to say that removing the context switch would make your system calls several, if not many, times faster.
According to this L4-related paper (transparency #3), kernel entry/exit costs roughly 20-200 cycles, while TLB flushes range between 100 and 4000 cycles (even more on more modern architectures).
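As a back-of-the-envelope check, those figures can be plugged into a small C helper. This is only a sketch: the function name is made up, and pairing the low ends and the high ends of the two ranges with each other is my own assumption.

```c
#include <assert.h>

/* Rough per-system-call overhead in cycles: the kernel entry/exit cost
 * plus the cost of each TLB flush caused by the call. */
unsigned long syscall_overhead(unsigned long enter_exit_cycles,
                               unsigned long flush_cycles,
                               unsigned flushes_per_call)
{
    assert(flushes_per_call <= 2);  /* at most one on entry, one on exit */
    return enter_exit_cycles + flush_cycles * flushes_per_call;
}
```

With two flushes per call, the low end comes to 20 + 2 * 100 = 220 cycles against 20 without the context switch, and the high end to 200 + 2 * 4000 = 8200 cycles against 200. That is roughly a factor of 11 to 41 under these assumptions.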
regards,
gaf
Posted: Tue Oct 10, 2006 2:11 pm
by mrkaktus
Yes, that gives me something to think about.