Brendan wrote:Hi,
fano1 wrote:The problem is that on x86 CVTSS2SI converts a double to an Int32, not to an Int64! Apparently they extended it to work with Int64 on x64...
Online I've seen references to a CVTSS2SIQ, but NASM refuses to assemble it... I suppose this is for Int64 too.
For AT&T syntax they append a "size" letter to various instructions, like appending a "Q" to "CVTSS2SI" to get a "CVTSS2SIQ" instruction. For Intel syntax the assembler is smarter and figures it out from the operands, so "CVTSS2SI RAX, XMM0" is automatically 64-bit because RAX is a 64-bit register.
For "CVTSS2SI" the destination must be a general purpose register, so you can't do something like (e.g.) "CVTSS2SI XMM1, XMM0".
OK, so CVTSS2SIQ was a pseudo-instruction; the real instruction is CVTSS2SI, and being on 32-bit x86 I have no access to 64-bit registers.
fano1 wrote:So any idea how to do a double to Int64 conversion using SSE on x86 / 32-bit architecture?
For 32-bit code (that can't use 64-bit registers and therefore can't use a 64-bit "CVTSS2SI") it's normal to consider splitting it into a pair of 32-bit halves. The problem is that SSE doesn't seem to support "positive double to unsigned 32-bit integer", and if you use "double to signed 32-bit integer" conversion for the low 32-bits you end up with a sign bit in the middle of your 64-bit integer.
Brendan wrote:
To fix that you'd probably need to:
- extract the sign bit and store it somewhere, and make the double positive if it was negative
- extract the high 31 bits (divide by (1<<32), convert to 32-bit integer), then subtract the high 31-bits from your double (multiply the integer by (1 << 32) and subtract from the original double).
- extract the middle 31 bits, subtract the middle 31 bits from the double
- extract the remaining lowest 1 bit (unless you don't need that extra 1-bit of precision?)
- OR the 3 pieces together to get a "63-bit unsigned integer" (or a "64-bit positive signed integer")
- If the sign bit was originally set; negate the 64-bit signed integer.
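A rough C sketch of the steps above, not code from this thread. The function name double_to_i64_split is made up for illustration. It assumes the input is not NaN and that its magnitude is below 2^63. Each C cast to int32_t truncates, so it stands in for a truncating double to signed 32-bit conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: truncate a double to a 64-bit integer using only
   double -> signed 32-bit conversions, following the list above. */
int64_t double_to_i64_split(double x)
{
    /* Assumption: x is not NaN and |x| < 2^63 (a NaN fails this check). */
    assert(x > -9223372036854775808.0 && x < 9223372036854775808.0);

    /* Step 1: remember the sign and work on the magnitude. */
    int negative = x < 0.0;
    double mag = negative ? -x : x;

    /* Step 2: high 31 bits. mag < 2^63, so mag / 2^32 < 2^31 fits. */
    int32_t hi = (int32_t)(mag / 4294967296.0);
    mag -= (double)hi * 4294967296.0;      /* exact; now 0 <= mag < 2^32 */

    /* Step 3: middle 31 bits. mag / 2 < 2^31 fits. */
    int32_t mid = (int32_t)(mag / 2.0);
    mag -= (double)mid * 2.0;              /* exact; now 0 <= mag < 2 */

    /* Step 4: the remaining lowest bit (0 or 1). */
    int32_t lo = (int32_t)mag;

    /* Step 5: OR the three pieces into a 63-bit unsigned value. */
    uint64_t bits = ((uint64_t)(uint32_t)hi << 32)
                  | ((uint64_t)(uint32_t)mid << 1)
                  | (uint64_t)(uint32_t)lo;

    /* Step 6: negate if the original value was negative. */
    int64_t result = (int64_t)bits;
    return negative ? -result : result;
}
```

For example, double_to_i64_split(-42.42) gives -42, the same as a plain cast to a 64-bit integer.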
Hmm, so this is de facto a software emulation of the truncation operation that SSE should already provide!
Even if it is possible to implement this using SSE / MMX instructions, I'm unsure whether it would be faster than calling the x87 fisttp instruction (added with SSE3) that we are currently using.
Brendan wrote:
Alternatively; if you know the double is in a "nice" range (magnitude not too large or too small, not zero, not NaN) it might be faster to store the double (as a double) in 2 memory locations, remove the exponent and sign from the first copy to get the significand bits and OR the implied bit into the significand, then obtain the exponent from the second copy, subtract the "exponent bias" (and the 52-bit width of the significand field) and use that as a shift count to shift the significand into its correct place; then obtain the sign bit and negate if sign bit set.
This is probably faster, and in the end NaN and +/- infinity would not give meaningful results when converted to a long anyway...
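A minimal C sketch of that bit-level idea, assuming IEEE 754 binary64 doubles (1 sign bit, 11 exponent bits with bias 1023, 52 stored significand bits). The name double_to_i64_bits is made up, and the "nice" range is taken here to mean 1 <= |x| < 2^63, which is my assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: truncate a double to int64_t using integer
   operations on its bit pattern only (no floating-point conversion). */
int64_t double_to_i64_bits(double x)
{
    uint64_t raw;
    memcpy(&raw, &x, sizeof raw);          /* the "store to memory" step */

    int negative = (int)(raw >> 63);                    /* sign bit */
    int exponent = (int)((raw >> 52) & 0x7FF) - 1023;   /* remove the bias */

    /* Significand bits with the implied leading 1 ORed in. */
    uint64_t significand = (raw & 0x000FFFFFFFFFFFFFULL) | (1ULL << 52);

    /* Assumed "nice" range: 1 <= |x| < 2^63. This also rejects zero,
       subnormals, infinities and NaN. */
    assert(exponent >= 0 && exponent <= 62);

    /* The significand is an integer scaled by 2^52, so shift by the
       difference between the exponent and 52. */
    uint64_t mag = (exponent >= 52)
                 ? significand << (exponent - 52)
                 : significand >> (52 - exponent);   /* drops the fraction */

    int64_t result = (int64_t)mag;
    return negative ? -result : result;
}
```

Shifting right discards the fraction bits, so the result is truncated toward zero; double_to_i64_bits(42.42) gives 42.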
Brendan wrote:
Of course using the FPU (which does allow you to convert double to 64-bit integer in 32-bit code) is probably easier and faster than all the other options. Note that:
- if you have SSE2, there's no good reason to bother with MMX, so you wouldn't be switching from FPU to MMX anyway
- FPU is able to work on 80-bit "extended double" format (which has 11 bits more precision than a crusty old "double": a 64-bit significand versus 53 bits) while SSE doesn't, and therefore FPU is better than SSE in cases where the extra precision matters more than performance (even when you're only working with "double" because internally the FPU can/will use the 80-bit format for intermediate values)
- SSE doesn't provide things like sine, cosine, tangent (but FPU does; square root is the exception, which SSE does provide); so without FPU you'd have to implement these using extremely slow algorithms (that do give precise results) and/or lookup tables (where the precision is bad because you can't afford to blow away several GiB of RAM for lookup tables that are able to give "as precise as FPU" results).
In reality, to use the SSE integer instructions you have to write to the MMX registers, so the FPU could interfere... I was interested in the compare instructions, thinking they were faster than the x86 equivalents (as they directly return a result that an OR could easily turn into the 0 or 1 that C# wants), but they operate only on "packed integers", so they probably don't work with scalar values. In the end I'm unsure whether it makes sense to deal with MMX at all.
Yes, the FPU uses 80 bits, but this means that to obtain a real double / float the result has to be converted, so any extra precision is always lost...
Regarding trigonometric functions, for now they are implemented in software, but in the future we could use SSE for this, for example with this library:
http://gruntthepeon.free.fr/ssemath/sse_mathfun.h
Brendan wrote:
In any case; I'd be looking for a way to avoid the need to convert double into a 64-bit integer. For a random example; if you know the value will be within a certain range and don't care about some precision loss; then you can do "K = max_magnitude / (1 << 31)", then divide the double by K, convert that to a 32-bit integer, then multiply the 32-bit integer by K to get a "less precise than possible 64-bit integer" version of the original value. Of course if you choose a "max_magnitude" that is a power of 2 (or round it up to the nearest power of 2) your K will be a power of 2 and you can replace the division and multiplication with shifts.
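A small C sketch of the scaling trick with a power-of-2 K. The bound of 2^40 (MAX_MAGNITUDE_LOG2) and the name double_to_i64_scaled are hypothetical choices for illustration only. The integer multiply by K is what a compiler can turn into a shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bound: the caller promises |x| < 2^40. */
#define MAX_MAGNITUDE_LOG2 40
/* K = max_magnitude / 2^31 = 2^9, a power of 2. */
#define K_SHIFT (MAX_MAGNITUDE_LOG2 - 31)

/* Hypothetical helper: lossy double -> int64_t conversion that needs only
   one double -> signed 32-bit conversion. The low K_SHIFT bits are lost. */
int64_t double_to_i64_scaled(double x)
{
    const int64_t k = (int64_t)1 << K_SHIFT;
    const double limit = (double)((int64_t)1 << MAX_MAGNITUDE_LOG2);

    /* Assumption: x is not NaN and |x| < max_magnitude. */
    assert(x > -limit && x < limit);

    /* Dividing by a power of 2 is exact; |x / K| < 2^31 fits in int32_t. */
    int32_t scaled = (int32_t)(x / (double)k);

    /* Multiply back; with K a power of 2 this is a shift in practice. */
    return (int64_t)scaled * k;
}
```

With these constants the result is always a multiple of 512, so 1000.0 comes back as 512 while 1024.0 comes back exactly.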
To be clear the IL instruction conv.i8 is the equivalent of a cast to long:
double b = 42.42;
long a = (long) b; // a == 42
so the expected result is truncation.
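For comparison, a C cast from double to a 64-bit integer also truncates toward zero, so a tiny sketch (the name truncate_to_i64 is made up) shows what that means for negative values:

```c
#include <stdint.h>

/* Hypothetical helper: a C cast from double to int64_t discards the
   fractional part, rounding toward zero (not toward negative infinity).
   Assumption: x is finite and within the int64_t range. */
int64_t truncate_to_i64(double x)
{
    return (int64_t)x;
}
```

So 42.42 becomes 42 and -42.42 becomes -42, not -43.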
In the end, for now we have decided to use the fisttp instruction, so the legacy FPU must be enabled (and the CPU must support the SSE3 instruction set too).
We will not use the FPU for the x64 version of Cosmos, as other OS vendors have done, since there the truncating conversion (CVTTSS2SI / CVTTSD2SI) will work for all data types.
Brendan wrote:
Cheers,
Brendan
Thank you for your help.