OSDev.org

Posted: **Sun Apr 10, 2022 6:18 pm**

So the Intel manuals aren't very helpful when it comes to actually using instructions like MOVSD (Move or Merge Scalar Double-Precision Floating-Point Value) and the related instructions (ADDSD, SUBSD, MULSD, DIVSD, ROUNDSD, MAXSD, MINSD, etc.). MOVSD, for example, requires either an XMM register or a 64-bit memory location as its operand. Does this mean that to use these instructions properly I need to do something like this (adding 32768 to 32500, for example):

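MOV EAX, 32500        ; plain integer; CVTSI2SD expects an integer, not a double's bit pattern
MOV EBX, 32768
CVTSI2SD XMM0, EAX    ; signed integer -> scalar double: XMM0 = 32500.0
CVTSI2SD XMM1, EBX    ; XMM1 = 32768.0
ADDSD XMM0, XMM1      ; XMM0 = 65268.0
CVTSD2SI EAX, XMM0    ; scalar double -> signed integer: EAX = 65268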

I know I could just use the old FPU instructions, but it makes sense to learn how to use the SSE/AVX ones as a replacement.

Posted: **Sun Apr 10, 2022 8:42 pm**

GCC says that looks right.

SSE (and AVX) floating-point instructions are a lot more like the basic x86 instruction set than the x87 FPU instruction set. SSE moves copy the exact bit pattern into or out of the SSE register, unlike x87 moves that convert data on the way in and out. If you need to convert, SSE has separate conversion instructions. When you want to operate on the contents of an SSE register, you have to choose the instruction according to the format of the data you've placed in the register.
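
To make the move-versus-convert distinction concrete, here is a minimal C sketch. The function names are made up for illustration, and it assumes double is a 64-bit IEEE 754 type. On x86-64, GCC typically compiles the memcpy version to a plain MOVQ and the cast to CVTSI2SD, but the exact output depends on the compiler and flags.

```c
#include <stdint.h>
#include <string.h>

/* Bit copy: the 64 bits are reinterpreted as a double, nothing is converted.
   (Hypothetical helper; assumes double is 64-bit IEEE 754.) */
double copy_bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Value conversion: the integer 32768 becomes the double 32768.0. */
double convert_int_to_double(int64_t n)
{
    return (double)n;
}
```

convert_int_to_double(32768) gives 32768.0, while copy_bits_to_double(32768) gives a tiny subnormal value. The bit pattern that actually encodes 32768.0 is 0x40E0000000000000.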

That last part is pretty important for optimization: it's perfectly valid to mix instructions meant for different types of data, but on modern CPUs there is a performance penalty. (There's also a performance penalty for mixing SSE and AVX instructions.)

Posted: **Mon Apr 11, 2022 8:37 am**

Thanks! On an unrelated note, since we're talking about assembly language, is there a reason that (from my experience) compilers are bad at optimizing certain mathematical operations down to their assembly instruction versions? For example, if I write this code:

sqrt(2.0+5.0+449201938193.500039218);

The fastest optimization I could imagine would be the compiler transforming that into:

MOV RAX, 0x120c667a1255a42
CVTSI2SD XMM0, RAX
SQRTSD XMM0, XMM0

(VSQRTSD would probably be faster, but it needs three operands, not two.)
Is there a reason compilers (to my knowledge and from what I've seen) don't optimize mathematical operations like this down to their instructions (e.g. SQRTSD for square root, ROUNDSD for rounding, etc.) when they encounter patterns of code that might indicate them, like a call to sqrt or round? Is it inaccuracies in the instructions themselves or something else? Or does this happen and I've just never had the right flags turned on or something? I usually compile my software with -mtune=native -march=native -O3 (sometimes -O2 though), but I'd think even at -O2 these kinds of patterns would be detectable. I know that instructions like FSIN aren't used because of possible inaccuracies but maybe that also applies to these?

Posted: **Mon Apr 11, 2022 1:16 pm**

Ethin wrote:On an unrelated note, since we're talking about assembly language, is there a reason that (from my experience) compilers are bad at optimizing certain mathematical operations down to their assembly instruction versions?

C library functions aren't identical to the assembly instructions. In most cases, the only difference is that the C library function sets errno when the argument is outside the function's domain, so a lot of the optimizations you're looking for will appear if you add "-fno-math-errno" to your compiler flags. There is a whole set of floating-point math optimization flags for nonstandard behavior like this.

Ethin wrote:
MOV RAX, 0x120c667a1255a42
CVTSI2SD XMM0, RAX

Is that supposed to be a floating-point constant? If it is, you've got the bytes backwards, and you should use MOVQ instead of CVTSI2SD because it's already a scalar double and doesn't require conversion from signed integer.
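
If you want to check a constant like this yourself, a small C sketch along these lines works. The helper names are hypothetical, and it assumes double is a 64-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Raw IEEE 754 bit pattern of a double (hypothetical helper name). */
uint64_t double_to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Reverse the order of the eight bytes in a 64-bit value. */
uint64_t reverse_bytes(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xFF);  /* lowest byte of v ends up highest in r */
        v >>= 8;
    }
    return r;
}
```

double_to_bits(2.0 + 5.0 + 449201938193.500039218) should come out as 0x425A25A167C62001, and reverse_bytes(0x0120C667A1255A42) gives the same value. That is the immediate you would load into RAX before a MOVQ XMM0, RAX.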

Ethin wrote:(VSQRTSD would probably be faster, but it needs three operands, not two.)

Any half-decent assembler will allow you to specify AVX instructions with two operands as a shorthand for the AVX equivalent of an SSE instruction. GNU AS even has a flag (-msse2avx) to directly translate SSE mnemonics into AVX instructions. I can't promise it would actually be faster, though: that depends on the rest of your program (and maybe your CPU microarchitecture).

Posted: **Tue Apr 12, 2022 1:24 pm**

Ethin wrote:For example, if I write this code:
sqrt(2.0+5.0+449201938193.500039218);
The fastest optimization I could imagine would be the compiler transforming that into:
MOV RAX, 0x120c667a1255a42
CVTSI2SD XMM0, RAX
SQRTSD XMM0, XMM0

Well no, the fastest for this particular call would be to hardcode the result into the output. But the additions, or the calculation inside sqrt(), may have observable side effects on the floating-point environment (raising exception flags, for instance), and where the compiler has to preserve those effects it cannot simply remove the operations.

How do you properly use MOVSD/ADDSD/...?
