I was looking for documents containing the timings of complex kernel-specific actions (cycle counts, uop breakdowns, average execution time). I have been searching Google but I couldn't find anything useful, and the optimisation manuals hardly contain anything on these topics either.
A list of things I'd like to have timings of:
- software interrupt (real mode, protected mode/trap gate, long mode)
- inter-privilege-level jumps, calls, iret (and far branches in general)
- entry/exit of v8086 mode
- hardware task switches
- f(f)(x)save/f(f)(x)rstor
- wbinvd, cr3 (re)loads, page table walks, invlpg
- syscall, sysenter, sysexit, sysret (lmode & pmode)
- read/write msr, control registers, debug registers
- operation mode changes: realmode<>pmode, pmode<>lmode
- disabling/enabling paging
- lidt/lldt/lgdt/ltr
In case you wonder what this is for: I'm investigating the option of attaching an operation mode to a virtual address space (i.e. the program can choose whether to run in long mode, PAE or non-PAE protected mode, or v8086 mode - exokernel principles), with corresponding optimisation priorities in the scheduler.
All links, pointers, pdfs or related material are appreciated.
Thanks in advance
Timings of complex CPU operations
http://www.ousob.com/ng/iapx86/
That may help a bit - look under "instruction set", then pick an instruction; some timing info is included at the bottom.
I intended to use RDTSC, but it seems you can't get it to work accurately, since modern CPUs are highly optimised with several pipelines.
Also, RDTSC is executed out of order with respect to its surroundings (since it doesn't depend on any other operations, I thought Intel would make it an exception, but after reading their documentation I found that they treat RDTSC the same as any other non-dependent operation)...
Or should I code many repetitions of the same instruction and accumulate a total time with the PIC? Maybe that's inaccurate or totally meaningless because of the optimisation...
Do you currently have any workable methods?
- Combuster
Well, timestamp counting is an option, but I'd rather not build an entire kernel to test these things when the info seems to be available (Brendan, for one, seemed to know the hardware task switch timings).
Besides, I would only be able to collect information on the CPUs I own (which do not include later Pentiums).
Combuster wrote: Besides, I will only be able to collect information on the CPUs I own (which do not include later Pentiums).
You will find it very difficult to find timing info on later Pentium architectures, in no small part because you'd have to attach a "depends..." to each and every figure. Execution is not only out of order, the pipelines also have varying capabilities. I remember reading that some stage of the P4 architecture that runs on three pipelines can execute a certain kind of micro-op in only one pipeline at a time, and another kind in two, but not three.
That means not only might your instructions be sliced and diced every which way; your exact execution time also depends on what instructions come before and after the instruction in question.
And that's not even speaking of cache effects.
In the beginning, there was a table - instruction A takes X cycles, instruction B takes Y cycles. This is no more.
Every good solution is obvious once you've found it.
Combuster wrote: (like, how much faster is syscall over int xx)
I thought I read that somewhere recently, but it turned out they compared to "call far/ret far" instead of "int xx/iret":
AMD64 Architecture Programmer’s Manual Volume 2: System Programming wrote: As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions
Note: I assumed they meant the "far" variants, since SYSCALL is meant for privilege level elevation, and because that's what they compare it to in vol 3 in the description of SYSCALL (they don't mention specific numbers there though, just "considerably fewer clock cycles").
HTH.