Timings of complex cpu operations

Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Post by Combuster »

I was looking for documents containing the timings of complex kernel-specific actions (cycle counts, uop breakdowns, average execution time). I have been browsing Google, but I couldn't find anything useful. The optimisation manuals hardly cover these topics either.

A list of things I'd like to have timings of:
- software interrupt (real mode, protected mode/trap gate, long mode)
- inter-privilege-level jumps, calls, iret (and far branches in general)
- entry/exit of v8086 mode
- hardware task switches
- f(f)(x)save/f(f)(x)rstor
- wbinvd, cr3 (re)loads, page table walks, invlpg
- syscall, sysenter, sysexit, sysret (lmode & pmode)
- read/write msr, control registers, debug registers
- operation mode changes: realmode<>pmode, pmode<>lmode
- disabling/enabling paging
- lidt/lldt/lgdt/ltr

In case you wonder what this is for: I'm investigating the option of attaching an operation mode to each virtual address space (i.e. a program can choose whether to run in long mode, PAE or non-PAE protected mode, or v8086 mode, following exokernel principles), with corresponding optimisation priorities in the scheduler.

All links, pointers, pdfs or related material are appreciated.

Thanks in advance
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
earlz
Member
Posts: 1546
Joined: Thu Jul 07, 2005 11:00 pm

Post by earlz »

http://www.ousob.com/ng/iapx86/
That may help a bit. Look under the instruction set section and pick an instruction; some timing info is included at the bottom of each entry.
m
Member
Posts: 67
Joined: Sat Nov 25, 2006 6:33 am
Location: PRC

Post by m »

I intended to use RDTSC, but it seems you can't get it to work accurately, since modern CPUs are heavily optimised and have several pipelines.

Also, RDTSC is not ordered with respect to the instructions around it. Since it doesn't depend on any other operation, I thought Intel would make it an exception, but after reading their documentation I found that RDTSC is treated the same as any other non-dependent operation...

Or code a long run of identical instructions and measure the accumulated total time with the PIC? Maybe that's inaccurate or totally meaningless because of the optimisation...

Do you currently have any workable methods? :)
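
For reference, the usual workaround for the ordering problem is to fence the counter read and to take the minimum over many runs. Below is a minimal sketch of that idea, not a definitive harness. The names `ticks_now`, `min_ticks`, `net_ticks` and `empty_op` are made up for illustration. It assumes GCC or Clang on x86-64, and it assumes LFENCE keeps RDTSC from being reordered with neighbouring instructions (documented for Intel; on AMD this depends on LFENCE being configured as dispatch-serialising). On other targets it falls back to `clock_gettime`, so the "ticks" are nanoseconds there.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>              /* __rdtsc, _mm_lfence (GCC/Clang) */
/* Read the TSC with a fence on either side so the read is not
   reordered with the code being measured (assumption: see lead-in). */
static inline uint64_t ticks_now(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* Fallback for non-x86 targets: monotonic clock in nanoseconds. */
static inline uint64_t ticks_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*op_fn)(void);        /* hypothetical: operation under test */

/* Baseline used to estimate the cost of the measurement itself. */
void empty_op(void) {}

/* Smallest observed tick count over `runs` attempts. The minimum filters
   out runs disturbed by interrupts, cache misses and similar noise. */
uint64_t min_ticks(op_fn op, int runs) {
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = ticks_now();
        op();
        uint64_t t1 = ticks_now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Minimum ticks for `op` with the measurement overhead subtracted. */
uint64_t net_ticks(op_fn op, int runs) {
    uint64_t overhead = min_ticks(empty_op, runs);
    uint64_t raw = min_ticks(op, runs);
    return raw > overhead ? raw - overhead : 0;
}
```

Calling `net_ticks(some_op, 1000)` gives a rough lower bound for `some_op`. Most of the operations in the list above are privileged, so the same pattern would have to run in ring 0 to measure them; the result is still only an indication, not an exact cycle count.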

Post by Combuster »

Well, timestamp counting is an option, but I'd rather not build an entire kernel to test these things when the info seems to be available (Brendan, for one, seemed to know the hardware switch timings).

Besides, I would only be able to collect information on the CPUs I own (which do not include the later Pentiums).
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Post by Solar »

You will find it very difficult to find timing info on later Pentium architectures, in no small part because you'd have to add a "depends..." to each and every figure. Execution is not only out-of-order, but also comes with varying side-effects among the pipelines. I remember reading that some stage in the PIV architecture has three pipelines, yet can execute one kind of micro-op in only one pipeline at a time, and another kind in two pipelines, but not in all three.

This means that not only might your instructions be sliced and diced every which way; your exact execution time also depends on which instructions came before and after the instruction in question.

Not to mention cache effects.

In the beginning, there was a table - instruction A takes X cycles, instruction B takes Y cycles. This is no more.
Every good solution is obvious once you've found it.

Post by Combuster »

I'm very well aware of that. However, given the large number of uops involved in some of these operations, the relative variance should be small. So I don't expect exact execution times, just an indication of how certain operations relate to each other. (like, how much faster is syscall over int xx)
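
To illustrate what such a relative comparison could look like, here is a rough user-mode sketch, assuming a POSIX system. It averages over a long loop and reports the ratio of two costs rather than absolute numbers. All names (`now_ns`, `avg_ns`, `relative_cost`, `op_kernel_entry`, `op_baseline`) are hypothetical, and `getppid()` is only a stand-in for "something that enters the kernel". It does not isolate syscall from int xx; that would need hand-written entry stubs.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

typedef void (*op_fn)(void);        /* hypothetical: operation under test */

static volatile long sink;          /* keeps the compiler from dropping work */

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Stand-in for an expensive operation: one round trip into the kernel. */
void op_kernel_entry(void) { sink = (long)getppid(); }

/* Cheap reference operation that stays in user mode. */
void op_baseline(void) { sink = sink + 1; }

/* Average wall-clock nanoseconds per call over `reps` back-to-back calls.
   The long loop amortises timer overhead and per-call jitter. */
double avg_ns(op_fn op, long reps) {
    assert(reps > 0);
    uint64_t t0 = now_ns();
    for (long i = 0; i < reps; i++)
        op();
    uint64_t t1 = now_ns();
    return (double)(t1 - t0) / (double)reps;
}

/* How many times more expensive `a` is than `b`; returns 0.0 if `b`
   was too fast to measure at this clock resolution. */
double relative_cost(op_fn a, op_fn b, long reps) {
    double ta = avg_ns(a, reps);
    double tb = avg_ns(b, reps);
    return tb > 0.0 ? ta / tb : 0.0;
}
```

A call such as `relative_cost(op_kernel_entry, op_baseline, 1000000)` yields a single ratio. The figure includes loop and call overhead on both sides, so it is only useful as a coarse ordering of operations on one particular machine.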
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
urxae
Member
Posts: 149
Joined: Sun Jul 30, 2006 8:16 am
Location: The Netherlands

Post by urxae »

Combuster wrote: (like, how much faster is syscall over int xx)
I thought I read that somewhere recently, but it turned out they compared to "call far/ret far" instead of "int xx/iret":
AMD64 Architecture Programmer’s Manual Volume 2: System Programming wrote: As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions
Note: I assumed they meant the "far" variants, since the comparison is about privilege level elevation, and because that's what they compare against in vol. 3 in the description of SYSCALL (they don't mention specific numbers there though, just "considerably fewer clock cycles").

HTH.