In my own OS, I have added support for the SSE extensions. I would like to get the maximum possible CPU throughput from the SSE instructions, so I tried reordering the instructions based on the latency and reciprocal throughput figures in Agner Fog's excellent instruction tables: https://www.agner.org/optimize/instruction_tables.pdf
Here is a minimal example showing what I want to achieve and what I tried.
listing 1:
; CPU: Intel Core i7, MICROARCHITECTURE: NEHALEM
movaps xmm0, [some_mem_0] ;latency = 2
addps xmm0, xmm1 ;latency = 3
movss [m32], xmm0 ;latency = 3
movaps xmm2, [some_mem_1] ;latency = 2
addps xmm2, xmm3 ;latency = 3
movss [m32], xmm2 ;latency = 3
; total latency = 16
listing 2:
; CPU: Intel Core i7, MICROARCHITECTURE: NEHALEM
movaps xmm0, [some_mem_2] ;latency = 2
movaps xmm2, [some_mem_3] ;latency = 2
addps xmm0, xmm1 ;latency = 3
addps xmm2, xmm3 ;latency = 3
movss [m32], xmm0 ;latency = 3
movss [m32], xmm2 ;latency = 3
; total latency = 10
If I put each of these snippets in a loop that runs 300,000 times, I expect the total latency to drop from 4,800,000 cycles (300,000 × 16) to 3,000,000 cycles (300,000 × 10). I benchmarked both versions, but they ended up with exactly the same execution time.
Did I miscalculate the latency counts, or might there be other factors limiting the maximum throughput of the CPU in this case?
I would really appreciate it if somebody could explain how to push the CPU to its limit, i.e. how to get the maximum throughput by lowering the total latency.
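A loop of the kind described above could be timed along these lines. This is only a sketch under stated assumptions, not the exact benchmark: it assumes 32-bit NASM syntax and code that is allowed to execute RDTSC (ring 0, or CR4.TSD clear). The names time_listing, tsc_start, tsc_delta and ITERATIONS are hypothetical, and the buffers are defined here only to make the fragment self-contained. The loop body is the first listing; swapping in the reordered version gives the second measurement.

```nasm
; Sketch only: 32-bit NASM. All labels below are hypothetical names.
ITERATIONS equ 300000

section .bss
alignb 16                       ; movaps requires 16-byte aligned operands
some_mem_0: resd 4              ; four packed floats
some_mem_1: resd 4
m32:        resd 1              ; scalar result slot
tsc_start:  resd 2              ; 64-bit start timestamp (low, high)
tsc_delta:  resd 2              ; 64-bit elapsed TSC ticks (low, high)

section .text
time_listing:
    push ebx                    ; cpuid clobbers ebx
    xorps xmm1, xmm1            ; give the addends a defined value
    xorps xmm3, xmm3

    xor eax, eax
    cpuid                       ; serialize before reading the counter
    rdtsc                       ; edx:eax = time-stamp counter
    mov [tsc_start], eax
    mov [tsc_start + 4], edx

    mov ecx, ITERATIONS
.loop:
    movaps xmm0, [some_mem_0]   ; body under test (listing 1)
    addps  xmm0, xmm1
    movss  [m32], xmm0
    movaps xmm2, [some_mem_1]
    addps  xmm2, xmm3
    movss  [m32], xmm2
    dec ecx
    jnz .loop

    xor eax, eax
    cpuid                       ; serialize again (clobbers ecx, loop is done)
    rdtsc
    sub eax, [tsc_start]        ; 64-bit subtraction: low half, then
    sbb edx, [tsc_start + 4]    ; high half with borrow
    mov [tsc_delta], eax
    mov [tsc_delta + 4], edx
    pop ebx
    ret
```

After the call, tsc_delta holds the elapsed time-stamp counter ticks for all 300,000 iterations, including the loop overhead of dec/jnz. Note that the time-stamp counter on this CPU ticks at a fixed rate, so the result is in reference ticks rather than core clock cycles.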
Best regards.
Iman.