push CPU to its maximum throughput by SSE reordering

iman
Member
Posts: 84
Joined: Wed Feb 06, 2019 10:41 am
Libera.chat IRC: ImAn

push CPU to its maximum throughput by SSE reordering

Post by iman »

Dear all,

In my own OS I have added SSE support. I would like to get the maximum possible throughput out of the CPU when using SSE instructions, and to that end I tried to reorder the instructions based on their latency and reciprocal throughput from Agner Fog's excellent instruction tables: https://www.agner.org/optimize/instruction_tables.pdf.

I have a minimal piece of code that shows what I want to achieve and what I tried.

listing 1:

Code:

; CPU: Intel Core i7, MICROARCHITECTURE: NEHALEM
movaps xmm0, [some_mem_0] ;latency = 2 
addps xmm0, xmm1 ;latency = 3 
movss [m32], xmm0 ;latency = 3 
movaps xmm2, [some_mem_1] ;latency = 2 
addps xmm2, xmm3 ;latency = 3 
movss [m32], xmm2 ;latency = 3 
; total latency = 16
listing 2:

Code:

; CPU: Intel Core i7, MICROARCHITECTURE: NEHALEM
movaps xmm0, [some_mem_2] ;latency = 2 
movaps xmm2, [some_mem_3] ;latency = 2 
addps  xmm0,  xmm1 ;latency = 3 
addps  xmm2,  xmm3 ;latency = 3 
movss  [m32], xmm0 ;latency = 3 
movss  [m32], xmm2 ;latency = 3 
; total latency = 10
In the first listing I have the initial ordering of the instructions; in the second listing I reordered them to reduce the total latency count. For instance, addps xmm0, xmm1 has a latency of 3, and so does movss [m32], xmm0. Since they form a dependence chain, their combined latency counts as 6; but if I schedule a completely independent instruction, such as addps xmm2, xmm3, in between, the total latency should drop, because that instruction can execute during the cycles the CPU spends waiting on addps xmm0, xmm1.

If I put these small pieces of code in a loop running 300,000 times, I expect the total latency to drop from 4,800,000 to 3,000,000 cycles. I benchmarked both versions, but they ended up taking exactly the same time to execute.
Did I miscalculate the latency counts, or are there other factors limiting the maximum throughput of the CPU in this case?
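
For reference, a minimal sketch of the kind of timing loop I mean is below (NASM syntax; the data values, the lfence/rdtsc fencing and the assumption that xmm1 and xmm3 already hold the addends are only illustrative, this is not my exact benchmark):

Code:

; NASM, 64-bit; a sketch only. xmm1 and xmm3 are assumed to already
; hold the packed addends, as in the listings above.
section .data
align 16
some_mem_0: dd 1.0, 2.0, 3.0, 4.0
some_mem_1: dd 5.0, 6.0, 7.0, 8.0
m32:        dd 0.0

section .text
global time_listing1
time_listing1:
    lfence                         ; keep rdtsc from executing early
    rdtsc                          ; start timestamp in edx:eax
    shl    rdx, 32
    or     rdx, rax
    mov    r8, rdx                 ; save 64-bit start count

    mov    ecx, 300000             ; iteration count from the post
.loop:
    movaps xmm0, [rel some_mem_0]
    addps  xmm0, xmm1
    movss  [rel m32], xmm0
    movaps xmm2, [rel some_mem_1]
    addps  xmm2, xmm3
    movss  [rel m32], xmm2
    dec    ecx
    jnz    .loop

    lfence                         ; wait for the loop before reading the TSC again
    rdtsc                          ; end timestamp in edx:eax
    shl    rdx, 32
    or     rax, rdx
    sub    rax, r8                 ; elapsed TSC ticks returned in rax
    ret

Note that rdtsc counts TSC ticks, which on Nehalem run at a constant rate that is not necessarily the same as the current core clock.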

I would really appreciate it if somebody could explain how to push the CPU to its limit, i.e. how to get the maximum throughput out of it by keeping the latency counts low.

Best regards.
Iman.
Iman Abdollahzadeh
Github
Codeberg
iansjack
Member
Posts: 4704
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: push CPU to its maximum throughput by SSE reordering

Post by iansjack »

Have you read the Intel Optimization Manual? You probably should. It's not a simple question of adding up latencies, and the processor itself is probably doing some instruction reordering. If the tests show that your order is no more efficient than another, for your processor, that's the real test.
iman
Member
Posts: 84
Joined: Wed Feb 06, 2019 10:41 am
Libera.chat IRC: ImAn

Re: push CPU to its maximum throughput by SSE reordering

Post by iman »

iansjack wrote:Have you read the Intel Optimization Manual? You probably should. It's not a simple question of adding up latencies, and the processor itself is probably doing some instruction reordering. If the tests show that your order is no more efficient than another, for your processor, that's the real test.
I read some chapters of the Intel Optimization Manual quite some time ago, as much as I could.
iansjack wrote:If the tests show that your order is no more efficient than another, for your processor, that's the real test.
Does it mean that the processor now is working at its maximum throughput?
Iman Abdollahzadeh
Github
Codeberg
iansjack
Member
Posts: 4704
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: push CPU to its maximum throughput by SSE reordering

Post by iansjack »

iman wrote: Does it mean that the processor now is working at its maximum throughput?
No - just that the two orderings that you tested are equally efficient. But, as I pointed out, the processor may well have reordered them to be the same.

I'm not convinced that getting the absolute maximum throughput is paramount for an operating system, particularly if doing so makes the code less clear.
LtG
Member
Posts: 384
Joined: Thu Aug 13, 2015 4:57 pm

Re: push CPU to its maximum throughput by SSE reordering

Post by LtG »

You expected 4.8M and 3M cycles; what were the actual results?

Are you hitting only the cache, or also RAM? In the latter case, RAM bandwidth could be the limiting factor.
Craze Frog
Member
Posts: 368
Joined: Sun Sep 23, 2007 4:52 am

Re: push CPU to its maximum throughput by SSE reordering

Post by Craze Frog »

You should always post your benchmarking code, for two reasons:
1. It makes it easier to help you, because we can just copy your benchmark code, instead of writing our own from scratch.
2. The error could be in your benchmark code.

In this case, you are making a typical error in your benchmark code: using too few iterations. Try using 10000 times as many iterations (300000*10000).

Another way to improve the accuracy of your measurement is to unroll the loop. Try repeating the tested code 10 times inside each loop iteration.
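
For example, a rough sketch of such an unrolled loop (NASM syntax, reusing the labels from listing 2; the %rep count and the trip count are just the numbers suggested above):

Code:

; Sketch: the body of listing 2 repeated 10 times per loop iteration,
; with the trip count raised to keep the total work at 300,000 * 10,000.
    mov    ecx, 300000000           ; (300,000 * 10,000) / 10 unrolled copies
bench_loop:
%rep 10                             ; NASM macro: paste the tested body 10 times
    movaps xmm0, [rel some_mem_2]
    movaps xmm2, [rel some_mem_3]
    addps  xmm0, xmm1
    addps  xmm2, xmm3
    movss  [rel m32], xmm0
    movss  [rel m32], xmm2
%endrep
    dec    ecx
    jnz    bench_loop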

Also, make sure you are compiling with optimizations on if you are using a C compiler.

At the same time, compiling with optimizations on removes code whose inputs are known at compile time or whose output is not used, so make sure the input is read at run time (for example from the command line) and the output is returned in some way.
Craze Frog
Member
Posts: 368
Joined: Sun Sep 23, 2007 4:52 am

Re: push CPU to its maximum throughput by SSE reordering

Post by Craze Frog »

iman wrote:Does it mean that the processor now is working at its maximum throughput?
You could try to improve throughput by interleaving computations that use the normal FPU, or integer calculations.

Assuming you want to process 4 floats per SSE instruction, in a loop over an array with many such 4-float structs: unroll the loop so that each iteration contains 4 of these SSE computations, and within each of them interleave normal FPU instructions that perform the same computation on just 1 additional float.

This way, 4 * 5 = 20 floats get computed per unrolled iteration, i.e. in the time of 4 SSE iterations instead of the 5 it would otherwise take.
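
A rough sketch of one such interleaved group (NASM syntax; the rsi/rdi pointers, the xmm4 packed addend, the scalar_addend label and the 20-floats-per-iteration layout are all illustrative assumptions, not a drop-in implementation):

Code:

; Sketch: one of the four interleaved groups inside an unrolled iteration.
; rsi/rdi point at the source/destination arrays, xmm4 holds the packed
; addend, and scalar_addend is a placeholder for the single-float addend.
interleave_loop:
    movaps xmm0, [rsi]              ; 4 floats through the SSE add unit
    addps  xmm0, xmm4
    fld    dword [rsi + 64]         ; 1 more float through the x87 FPU
    fadd   dword [rel scalar_addend]
    movaps [rdi], xmm0
    fstp   dword [rdi + 64]
    ; ...three more groups like this, with offsets advanced by 16 and 4 bytes...
    add    rsi, 80                  ; 4*16 + 4*4 bytes = 20 floats per iteration
    add    rdi, 80
    dec    ecx
    jnz    interleave_loop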
Octocontrabass
Member
Posts: 5580
Joined: Mon Mar 25, 2013 7:01 pm

Re: push CPU to its maximum throughput by SSE reordering

Post by Octocontrabass »

Craze Frog wrote:Another way to improve the accuracy of your measurement is to unroll the loop. Try repeating the tested code 10 times inside each loop iteration.
Nehalem has a loop buffer (the Loop Stream Detector) that holds decoded μops and can accelerate small loops by removing the cost of instruction decoding. Growing the loop beyond the size of this buffer may hurt performance, so it's a good idea to benchmark several different loop sizes to see what works best.
Craze Frog wrote:Assuming you want to process 4 floats per SSE instruction, in a loop over an array with many such 4-float structs: unroll the loop so that each iteration contains 4 of these SSE computations, and within each of them interleave normal FPU instructions that perform the same computation on just 1 additional float.
On Nehalem, all floating-point addition is handled in the same execution unit, so using x87 instructions to dispatch additional operations can only help if it reduces the load on the instruction decoder. In this particular case, I suspect it won't, since the example code is already small enough to easily fit in the loop buffer.

However, the example code only saves the result from one of the four additions being performed in each ADDPS instruction. Saving the result from all four may provide additional throughput for free, if the code isn't already hitting the limits of memory bandwidth.
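
Something along these lines, where m128_0 and m128_1 are hypothetical 16-byte-aligned destination buffers:

Code:

; Sketch: store all four sums from each ADDPS instead of only the low one.
movaps xmm0, [rel some_mem_2]
movaps xmm2, [rel some_mem_3]
addps  xmm0, xmm1
addps  xmm2, xmm3
movaps [rel m128_0], xmm0          ; 4 results stored instead of 1
movaps [rel m128_1], xmm2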

If rearranging the data to allow processing four floats in parallel would be prohibitively expensive, there may still be some gain by replacing MOVAPS with MOVSS. Additionally, replacing ADDPS with ADDSS will ensure there are no penalties related to special values (especially denormals) for results that are going to be discarded.
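
A sketch of that scalar variant, reusing the labels from the original listings:

Code:

; Sketch: scalar variant for when only one float per chain is needed.
movss  xmm0, [rel some_mem_2]      ; loads 4 bytes instead of 16
movss  xmm2, [rel some_mem_3]
addss  xmm0, xmm1                  ; only the low lane is added, so unused
addss  xmm2, xmm3                  ; upper lanes can't trigger denormal penalties
movss  [rel m32], xmm0
movss  [rel m32], xmm2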

If denormals aren't needed at all, there may be some additional gain by setting the DAZ and FTZ bits in MXCSR.
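
Setting those bits could look roughly like this (FTZ is bit 15 and DAZ is bit 6 of MXCSR; the stack scratch slot is just one way to do it):

Code:

; Sketch: set FTZ (bit 15) and DAZ (bit 6) in MXCSR so denormals are
; flushed to zero / treated as zero. Uses a scratch dword on the stack.
sub     rsp, 8
stmxcsr [rsp]                      ; read the current MXCSR
or      dword [rsp], 0x8040        ; FTZ | DAZ
ldmxcsr [rsp]                      ; load the modified value back
add     rsp, 8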

Of course, for all of these changes, benchmarks must be performed to make sure they don't end up hurting performance instead.