I was looking for documents containing the timings of complex kernel-specific actions (cycle counts, uop breakdowns, average execution time). I have been searching Google but I couldn't find anything useful, and the optimisation manuals hardly contain anything on these topics either.
A list of things I'd like to have timings of:
- software interrupt (real mode, protected mode/trap gate, long mode)
- inter-privilege-level jumps, calls, iret (and far branches in general)
- entry/exit of v8086 mode
- hardware task switches
- f(f)(x)save/f(f)(x)rstor
- wbinvd, cr3 (re)loads, page table walks, invlpg
- syscall, sysenter, sysexit, sysret (lmode & pmode)
- read/write msr, control registers, debug registers
- operation mode changes: realmode<>pmode, pmode<>lmode
- disabling/enabling paging
- lidt/lldt/lgdt/ltr
In case you wonder what this is for: I'm investigating the option of attaching an operation mode to a virtual address space (i.e. the program can choose whether to run in long mode, PAE or non-PAE protected mode, or v8086 mode - exokernel principles), with corresponding optimisation priorities in the scheduler.
All links, pointers, pdfs or related material are appreciated.
Thanks in advance
Timings of complex CPU operations
http://www.ousob.com/ng/iapx86/
That may help a bit - look under "instruction set", then pick an instruction; some timing info is included at the bottom.
I intended to use RDTSC, but it seems you can't get it to work accurately, since modern CPUs are highly optimised with several pipelines.
Also, RDTSC is executed out of order with respect to its surroundings (since it doesn't depend on any other operations, I thought Intel would make it an exception, but after reading their documentation I found that they treat RDTSC the same as any other non-dependent operation)...
Or should I code many repetitions of the same instruction and accumulate a total time with the PIC? Maybe that's inaccurate or totally meaningless because of the optimisation...
Do you currently have any workable methods?
- Combuster
Well, timestamp counting is an option, but I'd rather not build an entire kernel to test these things when the info seems to be available (Brendan, for one, seemed to know the hardware task switch timings).
Besides, I would only be able to collect information on the CPUs I own (which do not include later Pentiums).
Combuster wrote: Besides, I will only be able to collect information on the CPUs I own (which do not include later Pentiums).
You will find it very difficult to find timing info on later Pentium architectures, in no small part because you'd have to attach a "depends..." to each and every figure. Execution is not only out of order, the pipelines also have varying capabilities. I remember reading that some stage of the P4 architecture that runs on three pipelines can execute a certain kind of micro-op in only one pipeline at a time, and another kind in two, but not three.
That means not only might your instructions be sliced and diced every which way; your exact execution time also depends on what instructions come before and after the instruction in question.
And that's not even speaking of cache effects.
In the beginning, there was a table - instruction A takes X cycles, instruction B takes Y cycles. This is no more.
Every good solution is obvious once you've found it.
Combuster wrote: (like, how much faster is syscall over int xx)
I thought I read that somewhere recently, but it turned out they compared to "call far/ret far" instead of "int xx/iret":
AMD64 Architecture Programmer’s Manual Volume 2: System Programming wrote: As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions
Note: I assumed they meant the "far" variants, since SYSCALL is meant for privilege level elevation, and because that's what they compare it to in vol 3 in the description of SYSCALL (they don't mention specific numbers there though, just "considerably fewer clock cycles").
HTH.