It is possible to write an X86 OS without using X87 FPU?

fano1 · Post by **fano1** » Sun May 01, 2016 7:41 am

Hello!

I'm working on the Cosmos OS project (an operating system written in C#), which in the end AOT-compiles IL to X86 assembly.
Since SSE is mandatory on X64 and X87 is no longer used there, I thought it would make sense, as we were already using SSE for float32 (Single / float in C#), to use it for float64 (Double / double in C#) too, so that we would not need X87 anymore.
All was going well until the conv.i8 instruction needed to be implemented, and I suddenly discovered that there are no SSE instructions to convert float and double to long / Int64! For all the other integer types there is CVTTSS2SI, but this does not work for Int64... On the internet I have seen some sites referring to a CVTTSS2SIQ, but NASM says this instruction does not exist. It was probably added for X64...

I've tried to implement an Int64 conversion by converting the double to two packed integers, but I get a wrong value when interpreting the two integers as a long:

Code: Select all

		SystemVoidCosmosCompilerTestsBclSystemDoubleTestExecute.IL_01BD.0B: ;Asm
			movsd XMM0, [ESP]

		SystemVoidCosmosCompilerTestsBclSystemDoubleTestExecute.IL_01BD.0C: ;Asm
			UNPCKLPD XMM0, XMM0 ; This should create two copies of the same double, in the low and in the high part of the XMM register

		SystemVoidCosmosCompilerTestsBclSystemDoubleTestExecute.IL_01BD.0D: ;Asm
			CVTTPD2DQ XMM1, XMM0

		SystemVoidCosmosCompilerTestsBclSystemDoubleTestExecute.IL_01BD.0E: ;Asm
			movlpd [ESP], XMM1

My idea is to never initialize the X87 FPU, so that MMX instructions and registers can be used freely, without having to clear the MMX registers after every use to avoid leaving the FPU in a corrupt state.

Thanks for your help.

iansjack · Post by **iansjack** » Sun May 01, 2016 8:33 am

You should read the Intel Programmer's Manual; it documents the CVTSS2SI instructions and tells you what you need to know.

fano1 · Post by **fano1** » Sun May 01, 2016 8:55 am

The problem is that on X86 CVTSS2SI converts a double to an Int32, not to an Int64! Apparently they have extended it to work with Int64 on X64...

Online I've seen references to a CVTSS2SIQ, but NASM refuses to assemble it... I suppose this is for Int64 too.

So, any idea how to do a double to Int64 conversion using SSE on the X86 / 32-bit architecture?
Thanks for your help.

Brendan · Post by **Brendan** » Sun May 01, 2016 8:16 pm

Hi,

fano1 wrote:The problem is that on X86 CVTSS2SI converts a double to an Int32, not to an Int64! Apparently they have extended it to work with Int64 on X64...

Online I've seen references to a CVTSS2SIQ, but NASM refuses to assemble it... I suppose this is for Int64 too.

For AT&T syntax they append a "size" letter to various instructions, like appending a "Q" to "CVTSS2SI" to get a "CVTSS2SIQ" instruction. For Intel syntax the assembler is smarter and figures it out from the operands, so "CVTSS2SI RAX, XMM0" is automatically 64-bit because RAX is a 64-bit register.

For "CVTSS2SI" the destination must be a general purpose register, so you can't do something like (e.g.) "CVTSS2SI XMM1, XMM0".

fano1 wrote:So, any idea how to do a double to Int64 conversion using SSE on the X86 / 32-bit architecture?

For 32-bit code (that can't use 64-bit registers and therefore can't use a 64-bit "CVTSS2SI") it's normal to consider splitting it into a pair of 32-bit halves. The problem is that SSE doesn't seem to support "positive double to unsigned 32-bit integer", and if you use "double to signed 32-bit integer" conversion for the low 32-bits you end up with a sign bit in the middle of your 64-bit integer.

To fix that you'd probably need to:

extract the sign bit and store it somewhere, and make the double positive if it was negative
extract the high 31 bits (divide by (1<<32), convert to 32-bit integer), then subtract the high 31-bits from your double (multiply the integer by (1 << 32) and subtract from the original double).
extract the middle 31 bits, subtract the middle 31 bits from the double
extract the remaining lowest 1 bit (unless you don't need that extra 1-bit of precision?)
OR the 3 pieces together to get a "63-bit unsigned integer" (or a "64-bit positive signed integer")
If the sign bit was originally set; negate the 64-bit signed integer.
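
A rough sketch of the steps above in C, for readers who prefer code. The function name `double_to_int64_split` is made up for illustration, each `(int32_t)` cast stands in for a 32-bit CVTTSD2SI, and the input is assumed to be finite with a magnitude below 2^63:

```c
#include <stdint.h>

/* Hypothetical helper: truncate a double to int64 using only
   double -> signed 32-bit conversions (as 32-bit CVTTSD2SI would).
   Assumes x is finite and |x| < 2^63. */
int64_t double_to_int64_split(double x)
{
    const double TWO_32 = 4294967296.0;      /* 1 << 32 as a double */
    int negative = x < 0.0;
    double mag = negative ? -x : x;          /* work on the magnitude */

    /* High 31 bits: mag / 2^32 is below 2^31, so it fits a signed int32. */
    int32_t hi = (int32_t)(mag / TWO_32);
    mag -= (double)hi * TWO_32;              /* now 0 <= mag < 2^32 */

    /* Middle 31 bits: mag / 2 is below 2^31. */
    int32_t mid = (int32_t)(mag / 2.0);
    mag -= (double)mid * 2.0;                /* now 0 <= mag < 2 */

    /* Lowest bit. */
    int32_t lo = (int32_t)mag;

    /* OR the three pieces together, then restore the sign. */
    uint64_t result = ((uint64_t)(uint32_t)hi << 32)
                    | ((uint64_t)(uint32_t)mid << 1)
                    | (uint64_t)(uint32_t)lo;
    return negative ? -(int64_t)result : (int64_t)result;
}
```

For example, `double_to_int64_split(-42.42)` gives -42, which is the truncation a cast to long expects.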

Alternatively; if you know the double is in a "nice" range (magnitude not too large or too small, not zero, not NaN) it might be faster to store the double (as a double) in 2 memory locations, remove the exponent and sign from the first copy to get the significand bits and OR the implied bit into the significand, then obtain the exponent from the second copy and subtract the "exponent bias" and use that as a shift count to shift the significand into its correct place; then obtain the sign bit and negate if sign bit set.
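
The bit-level alternative can be sketched in C as follows. This is only an illustration under stated assumptions: `double_to_int64_bits` is a made-up name, the double is assumed to be an IEEE-754 binary64 value that is finite with a magnitude below 2^63, and values below 1 are simply mapped to 0:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: truncate a double to int64 by picking apart
   its IEEE-754 fields. Assumes x is finite and |x| < 2^63. */
int64_t double_to_int64_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the double's bytes */

    int negative = (int)(bits >> 63);
    int exponent = (int)((bits >> 52) & 0x7FF) - 1023;   /* remove the bias */
    uint64_t significand = (bits & 0x000FFFFFFFFFFFFFULL)
                         | (1ULL << 52);     /* OR in the implied bit */

    uint64_t magnitude;
    if (exponent < 0)
        magnitude = 0;                       /* |x| < 1 truncates to 0 */
    else if (exponent <= 52)
        magnitude = significand >> (52 - exponent);
    else
        magnitude = significand << (exponent - 52);   /* valid up to exponent 62 */

    return negative ? -(int64_t)magnitude : (int64_t)magnitude;
}
```

The shift count is the unbiased exponent measured against the 52 fraction bits, which is why the code shifts right for small values and left for large ones.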

Of course using the FPU (which does allow you to convert double to 64-bit integer in 32-bit code) is probably easier and faster than all the other options. Note that:

if you have SSE2, there's no good reason to bother with MMX, so you wouldn't be switching from FPU to MMX anyway
FPU is able to work on 80-bit "extended double" format (which has 11 bits more precision than a crusty old "double") while SSE doesn't, and therefore FPU is better than SSE in cases where the extra precision matters more than performance (even when you're only working with "double" because internally the FPU can/will use the 80-bit format for intermediate values)
SSE doesn't provide things like sine, cosine, tangent (but FPU does), although it does have square root; so without FPU you'd have to implement these using extremely slow algorithms (that do give precise results) and/or lookup tables (where the precision is bad because you can't afford to blow away several GiB of RAM for lookup tables that are able to give "as precise as FPU" results).

In any case; I'd be looking for a way to avoid the need to convert double into a 64-bit integer. For a random example; if you know the value will be within a certain range and don't care about some precision loss; then you can do "K = max_magnitude / (1 << 31)", then divide the double by K, convert that to a 32-bit integer, then multiply the 32-bit integer by K to get a "less precise than possible 64-bit integer" version of the original value. Of course if you choose a "max_magnitude" that is a power of 2 (or round it up to the nearest power of 2) your K will be a power of 2 and you can replace the division and multiplication with shifts.
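
A minimal C sketch of this scaling idea, assuming K is a power of two. The name `double_to_int64_scaled` and the `shift` parameter are invented for illustration; the multiplication is kept (instead of a shift) only because shifting negative values is not well defined in C:

```c
#include <stdint.h>

/* Hypothetical helper: lossy double -> int64 via one 32-bit conversion.
   K = 2^shift; assumes 0 <= shift <= 32 and |x| < 2^(31 + shift). */
int64_t double_to_int64_scaled(double x, int shift)
{
    int64_t k = (int64_t)1 << shift;         /* power-of-two K */
    int32_t q = (int32_t)(x / (double)k);    /* stands in for 32-bit CVTTSD2SI */
    return (int64_t)q * k;                   /* the low `shift` bits are lost */
}
```

With `shift = 16`, the value 1e12 comes back as 999999995904: the result is off by 4096 because the low 16 bits were discarded.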

Cheers,

Brendan

zdz · Post by **zdz** » Mon May 02, 2016 2:21 am

To support what Brendan said this is explained in the manual (I'm reading from "Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B, 3C and 3D", page 753)

CVTSS2SI has the following forms:

Code: Select all

F3 0F 2D /r => CVTSS2SI r32, xmm/m32, valid in both 32 and 64-bit modes
F3 REX.W 0F 2D /r => CVTSS2SI r64, xmm/m32, valid only in 64-bit mode
VEX.LIG.F3.0F.W0 2D /r => VCVTSS2SI r32, xmm1/m32, valid in both 32 and 64-bit modes
VEX.LIG.F3.0F.W1 2D /r => VCVTSS2SI r64, xmm1/m32, valid only in 64-bit

I don't think there is an instruction that can do what you want, but I can't claim that I'm extremely familiar with SSE.

fano1 · Post by **fano1** » Mon May 02, 2016 1:14 pm

Brendan wrote:Hi,

fano1 wrote:The problem is that on X86 CVTSS2SI converts a double to an Int32, not to an Int64! Apparently they have extended it to work with Int64 on X64...

Online I've seen references to a CVTSS2SIQ, but NASM refuses to assemble it... I suppose this is for Int64 too.
For AT&T syntax they append a "size" letter to various instructions, like appending a "Q" to "CVTSS2SI" to get a "CVTSS2SIQ" instruction. For Intel syntax the assembler is smarter and figures it out from the operands, so "CVTSS2SI RAX, XMM0" is automatically 64-bit because RAX is a 64-bit register.

For "CVTSS2SI" the destination must be a general purpose register, so you can't do something like (e.g.) "CVTSS2SI XMM1, XMM0".

OK, so CVTSS2SIQ was a pseudo-instruction; the real instruction is CVTSS2SI, and being on X86 I have no access to 64-bit registers.

fano1 wrote:So, any idea how to do a double to Int64 conversion using SSE on the X86 / 32-bit architecture?

For 32-bit code (that can't use 64-bit registers and therefore can't use a 64-bit "CVTSS2SI") it's normal to consider splitting it into a pair of 32-bit halves. The problem is that SSE doesn't seem to support "positive double to unsigned 32-bit integer", and if you use "double to signed 32-bit integer" conversion for the low 32-bits you end up with a sign bit in the middle of your 64-bit integer.

Brendan wrote: To fix that you'd probably need to:
extract the sign bit and store it somewhere, and make the double positive if it was negative

extract the high 31 bits (divide by (1<<32), convert to 32-bit integer), then subtract the high 31-bits from your double (multiply the integer by (1 << 32) and subtract from the original double).

extract the middle 31 bits, subtract the middle 31 bits from the double

extract the remaining lowest 1 bit (unless you don't need that extra 1-bit of precision?)

OR the 3 pieces together to get a "63-bit unsigned integer" (or a "64-bit positive signed integer")

If the sign bit was originally set; negate the 64-bit signed integer.

Hmm, this is de facto a software emulation of the truncation operation that SSE should already provide!
Even if it is possible to implement it using SSE / MMX instructions, I'm unsure whether it would be faster than calling the x87 fisttp instruction (added with SSE3) that we are currently using.

Brendan wrote: Alternatively; if you know the double is in a "nice" range (magnitude not too large or too small, not zero, not NaN) it might be faster to store the double (as a double) in 2 memory locations, remove the exponent and sign from the first copy to get the significand bits and OR the implied bit into the significand, then obtain the exponent from the second copy and subtract the "exponent bias" and use that as a shift count to shift the significand into its correct place; then obtain the sign bit and negate if sign bit set.

This is probably faster, and in the end NaN and +/- infinity should not give meaningful results when converted to a long anyway...

Brendan wrote: Of course using the FPU (which does allow you to convert double to 64-bit integer in 32-bit code) is probably easier and faster than all the other options. Note that:

if you have SSE2, there's no good reason to bother with MMX, so you wouldn't be switching from FPU to MMX anyway

FPU is able to work on 80-bit "extended double" format (which has 11 bits more precision than a crusty old "double") while SSE doesn't, and therefore FPU is better than SSE in cases where the extra precision matters more than performance (even when you're only working with "double" because internally the FPU can/will use the 80-bit format for intermediate values)

SSE doesn't provide things like sine, cosine, tangent (but FPU does), although it does have square root; so without FPU you'd have to implement these using extremely slow algorithms (that do give precise results) and/or lookup tables (where the precision is bad because you can't afford to blow away several GiB of RAM for lookup tables that are able to give "as precise as FPU" results).

In reality, as far as I understand, to use the SSE integer instructions you have to write to the MMX registers, so the FPU could interfere... I was interested in the CMP instructions, thinking they were faster than the X86 equivalents (as they directly return a result that, with an OR, could easily become the 0 or 1 that C# wants), but they operate only on "packed integers", so they probably don't work with scalar values. In the end I'm unsure whether it makes sense to deal with MMX at all.

Yes, the FPU uses 80 bits, but this means that to obtain a real double / float you have to cast, so any extra precision is always lost...
Regarding trigonometric functions, for now they are implemented in software, but in the future we could use SSE for this, for example with this library:
http://gruntthepeon.free.fr/ssemath/sse_mathfun.h

Brendan wrote: In any case; I'd be looking for a way to avoid the need to convert double into a 64-bit integer. For a random example; if you know the value will be within a certain range and don't care about some precision loss; then you can do "K = max_magnitude / (1 << 31)", then divide the double by K, convert that to a 32-bit integer, then multiply the 32-bit integer by K to get a "less precise than possible 64-bit integer" version of the original value. Of course if you choose a "max_magnitude" that is a power of 2 (or round it up to the nearest power of 2) your K will be a power of 2 and you can replace the division and multiplication with shifts.

To be clear the IL instruction conv.i8 is the equivalent of a cast to long:

Code: Select all

double b = 42.42;
long a = (long) b; // a == 42

so the expected result is truncation.

In the end, for now we have decided to use the fisttp instruction, so the legacy FPU has to be enabled (and the CPU has to support the SSE3 instruction set too).
We will not use the FPU in the x64 version of Cosmos, as other OS vendors have done, since there CVTTSS2SI / CVTTSD2SI work for all the data types.

Brendan wrote: Cheers,

Brendan

Thank you for your help.

jnc100 · Post by **jnc100** » Mon May 02, 2016 1:17 pm

If you're in 64-bit mode then the instruction is cvtsd2si (rather than cvtss2si) as this utilises 64 bit floats rather than 32 bit ones. In 32-bit mode, there is no direct instruction to do what you want. Typically, you'd want to do it in some run-time library (gcc uses libgcc). There is a function called __fixdfdi that does what you want (see fixdfdi.c for the clang implementation and here for the typedefs used in that code).

Note that you need to be careful if you use this directly as C code however, because the x86 32-bit ABI expects the floating point argument to be passed in FPU registers, and I don't know what ABI 32-bit COSMOS uses for interfacing with external functions.

Regards,
John.

Brendan · Post by **Brendan** » Mon May 02, 2016 10:38 pm

Hi,

fano1 wrote:Yes, the FPU uses 80 bits, but this means that to obtain a real double / float you have to cast, so any extra precision is always lost...

No.

Even if you never use the 80-bit floating point format in your source code; expressions still benefit from "more precise" intermediate values that are kept in FPU registers, and this helps to reduce the "almost every step of evaluating an expression increases the precision lost by previous steps" problem.

fano1 wrote:Regarding trigonometric functions, for now they are implemented in software, but in the future we could use SSE for this, for example with this library:
http://gruntthepeon.free.fr/ssemath/sse_mathfun.h

The library is fine for things like graphics (e.g. "pixel pounding") where you care about performance and don't care much about precision, and want to use SIMD to do many pixels at a time. However; for precise calculations that library is completely unusable.

Consider "result = sin(x) * y". With FPU you'd get a "sin(x)" intermediate value (that uses the 80-bit floating point format and has a 63-bit fraction) that's multiplied by y to get a second intermediate value (that uses the 80-bit floating point format and has a 63-bit fraction), which is then converted to "double"; and the result will have the full precision that a double can have because all the calculations were done in much higher precision. For that library you'd get a "sin(x)" intermediate value (that uses the 32-bit floating point format and has a 23-bit fraction); and then the multiplication will increase the precision loss a little more.

Essentially you'd be comparing "51 bits of precision worst case" to "21 bits of precision best case".
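
The general effect of a narrow intermediate value can be shown with a small C sketch. It does not use the FPU's 80-bit format or the library above; it only compares one multiplication done through a double intermediate with the same multiplication done through a float intermediate, and all function names are invented for illustration:

```c
/* Hypothetical demo: the same product computed through a double
   intermediate and through a float (32-bit) intermediate. */
double product_via_double(double x, double y)
{
    return x * y;
}

double product_via_float(double x, double y)
{
    float narrowed = (float)x;               /* rounded to 24 significant bits */
    return (double)narrowed * y;             /* the rounding error carries over */
}

/* Absolute difference between a computed value and the expected one. */
double abs_error(double value, double expected)
{
    double diff = value - expected;
    return diff < 0.0 ? -diff : diff;
}
```

With `x = 1.0 / 3.0` and `y = 3.0`, the float path ends up roughly 3e-8 away from 1.0, while the double path stays within about 1e-16, so the later steps cannot recover what the narrow intermediate already lost.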

Cheers,

Brendan

fano1 · Post by **fano1** » Tue May 03, 2016 2:18 am

jnc100 wrote: If you're in 64-bit mode then the instruction is cvtsd2si (rather than cvtss2si) as this utilises 64 bit floats rather than 32 bit ones. In 32-bit mode, there is no direct instruction to do what you want. Typically, you'd want to do it in some run-time library (gcc uses libgcc). There is a function called __fixdfdi that does what you want (see fixdfdi.c for the clang implementation and here for the typedefs used in that code).

That way you are de facto doing floating point emulation... well, if it were possible to execute those instructions using SSE, we probably could no longer call it "software". Is this possible in your opinion?

Please note that SSE3 added a specific instruction to do double / float truncation that accepts a 64-bit operand as destination... in the end this is what we are using, but it means we cannot avoid enabling the FPU (and we need SSE3, which could be too "new" an instruction set to serve as a lowest common denominator).

jnc100 wrote: Note that you need to be careful if you use this directly as C code however, because the x86 32-bit ABI expects the floating point argument to be passed in FPU registers, and I don't know what ABI 32-bit COSMOS uses for interfacing with external functions.

Regards,
John.

In Cosmos you cannot call C code; only C# or X# (a special version of assembler) can be used, so in any case the code would have to be ported. If I'm not mistaken, in Cosmos all arguments are passed on the stack, double and float included, following .NET (.NET is a stack-based virtual machine).

@Brendan
I had not noticed that the SSE version uses 32-bit floating point; well, I think it could easily be ported to use double.
Regarding the precision of the FPU sin() function:

https://randomascii.wordpress.com/2014/ ... intillion/

Brendan · Post by **Brendan** » Tue May 03, 2016 4:37 am

Hi,

fano1 wrote:@Brendan
I had not noticed that the SSE version uses 32-bit floating point; well, I think it could easily be ported to use double.
Regarding the precision of the FPU sin() function:

https://randomascii.wordpress.com/2014/ ... intillion/

That's funny; but 1 ulp error when fsin is applied to the reduced argument, and 1 ulp error when fsin is applied to the unreduced argument when the value of "PI" being used is the only value of PI the FPU is intended to use, is many orders of magnitude more precise than you'll get from a "cheap" (in terms of CPU time) approximation using SSE and doubles (not least of all because you'll be stuck with a far worse approximation of PI for argument reduction than Intel's).

Cheers,

Brendan

Schol-R-LEA · Post by **Schol-R-LEA** » Sat May 07, 2016 7:51 am

OK, I am going to be pretty pedantic here, but I think the point is relevant, so please bear with me:

fano1 wrote:OK, so CVTSS2SIQ was a pseudo-instruction; the real instruction is CVTSS2SI

Not exactly; the names are different because the assemblers are different, and have different ways of representing a given opcode, and different ways of identifying categories of opcodes.

You need to understand a few things:

The x86 and x86-64 instruction set designs, and the designs of the assembly languages for them by Intel (and AMD), have been touched by many hands, and aren't particularly consistent overall.
While we usually say that an assembly mnemonic has a one-to-one correspondence to an opcode, this isn't actually true in many if not most cases. Many assemblers use a single mnemonic for several related opcodes, and conversely, different mnemonics, naming what are conceptually different operations, may represent the same opcode if the operations have the same actual effect; e.g., JZ and JE are the same test of the zero flag, and thus represent the same opcode. The real correspondence is with the instruction as a whole to the opcode plus its operands.
An assembler is not required to represent the opcodes the same way that the manufacturer's documentation does; different assemblers can use very different syntax for the same opcodes. For example, to represent the opcode which sets RAX to a value on the stack pointed to by RBP with an offset of 8 bytes, an assembler might use
```
mov rax, [rbp + 8]
```
or
```
movq 8(%rbp), %rax
```
or
```
rax <- rbp[8] 
```
or
```
(move-to rax (index rbp 8))
```
As long as the correspondence from each instruction - not the mnemonic, but the instruction as a whole - is one-to-one to the opcode and set of operands it represents, it is still assembly language.

In the Intel syntax, the assembler chooses the correct opcode implicitly from the arguments, unless given an explicit indicator such as BYTE; if the instruction is ambiguous, it is a syntax error. The AT&T syntax, on the other hand, requires a size indicator on the end of any instruction that can take operands of different sizes, e.g., 'movl' for moving a 32-bit value, or, in the case you are discussing, 'cvtss2siq' for the 64-bit form of CVTSS2SI. Note also that AT&T syntax is case sensitive, and mnemonics are always lowercase.

fano1 · Post by **fano1** » Sun Jun 19, 2016 7:19 am

Thank you for your help. In the end, as you have convinced me that there is no other way, I have kept this hybrid of SSE and x87 instructions in Cosmos OS (at least for the x86 version).

Now I'm having another problem: I need to implement this apparently simple operation:

Code: Select all

ulong x = 42;
double d = (double) x;

Well, it is not so easy, as there is no x87 instruction to convert an unsigned long value to a double (and sadly there is none on x64 either, as SSE does not support this operation! You would need the AVX-512 instruction set for this, but that would require too new a CPU).

The idea seems to be to do the operation as if the value were a signed long and then remove the sign from the double to obtain the correct value; usually this is done by adding a constant to the double, but what is this constant?

Code: Select all

  mov EAX, ESP + 4
  fild  qword ptr [esp]
  shr EAX, 31
  cmp ESP, 0
  jpe LabelSign_Bit_Unset
  LabelSign_Bit_Unset:
  fadd  dword ptr __ulong2double_const
  fstp ESP

But what is the value of '__ulong2double_const'?

I've tried using 'simple' mathematics and I got these numbers, using -1 as it seems simpler (a pattern of all FFF... as a long):

Code: Select all

long                                 double (as hex)   double (as binary)
-1                                    F0BF             1111000010111111
18446744073709551615d                 F043             1111000001000011

Simply doing a difference between F043 and F0BF gives this value: 0xFFFFFFFFFFFFFF93, which converted back to a double is 0x0000000000405BC0.

But when I execute my code I get this weird bit pattern as the result: 0x80CBC0.
That does not make any sense! Any idea what is happening?

Thanks for your help.

Nable · Post by **Nable** » Sun Jun 19, 2016 8:50 am

There are several strange things in your code:

Code: Select all

  mov EAX, ESP + 4 // did you mean [ESP + 4] ?
  fild  qword ptr [esp]
  shr EAX, 31 // why didn't you just do "test eax, eax ++ jns not_signed" ?
  cmp ESP, 0 // why do you compare ESP instead of EAX?
  jpe LabelSign_Bit_Unset // parity isn't about the least-significant bit, it's about the number of set bits
  LabelSign_Bit_Unset: // label is just after the jump, so compare + jump are doing effectively nothing
  fadd  dword ptr __ulong2double_const
  fstp ESP // did you mean "qword[esp]" or "qword ptr esp" ?

Simply doing a difference between F043 and F0BF gives this value

One cannot simply subtract floating point variables as integers. They are structures that consist of three members: sign, mantissa and exponent (see articles about IEEE-754 for the further details). I wonder if you didn't know this before.
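
To make the "three members" concrete, here is a small C sketch that splits a double into its IEEE-754 binary64 fields. The struct and function names are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Field layout of an IEEE-754 binary64 value (names are illustrative). */
typedef struct {
    unsigned sign;        /* 1 bit */
    int      exponent;    /* 11 bits, with the bias of 1023 removed */
    uint64_t fraction;    /* 52 bits, without the implied leading 1 */
} DoubleFields;

/* Return the raw 64-bit pattern of a double. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Split a double into sign, unbiased exponent and fraction. */
DoubleFields decompose(double x)
{
    uint64_t bits = double_bits(x);
    DoubleFields f;
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    f.fraction = bits & 0x000FFFFFFFFFFFFFULL;
    return f;
}
```

With this, -1.0 is 0xBFF0000000000000 (sign 1, exponent 0, fraction 0), and 18446744073709551616.0 (2^64, the double nearest to 18446744073709551615) is 0x43F0000000000000 (sign 0, exponent 64, fraction 0). The F0BF and F043 in the table above look like the top two bytes of these patterns in little-endian order, so subtracting them as integers mixes sign and exponent bits rather than subtracting the values.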

fano1 · Post by **fano1** » Sun Jun 19, 2016 12:55 pm

Yes, I copied the wrong code; this is the one actually used:

Code: Select all

mov dword EAX, [ESP + 4]

fild qword [ESP]

shl dword EAX, 0x1F

cmp dword EAX, 0x0

je near LabelSign_Bit_Unset

fadd qword [__ulong2double_const]

LabelSign_Bit_Unset:

fstp qword [ESP]

I don't understand this note: "parity isn't about the least-significant bit, it's about the number of set bits". I'm not checking whether the number is even (?) but whether it is signed or unsigned. A signed number has the last bit set to 1, or not?
Here I'm pretending that the unsigned long is actually a signed long, so I can use fild, and then I check the sign and do the fadd.

That fadd can be used to "clear" the sign from a double is certain: it is what C# does for unsigned int to double, and what GCC does for this case (sadly I cannot copy that code, as Cosmos uses NASM and it seems to have a bug that misaligns the stack).
For example this code works for unsigned int:

Code: Select all

mov dword EAX, [ESP]

mov dword [EBP - 224], EAX

cvtsi2sd XMM0, [EBP - 224]
	
mov dword ECX, [EBP - 224]

shr dword ECX, 0x1F

cmp dword ECX, 0x0

je near LabelSign_Bit_Unset

addsd xmm0, [__xmm@41f0000000000000]

LabelSign_Bit_Unset:

sub dword ESP, 0x4

movsd [ESP], XMM0

__xmm@41f0000000000000 has the value 0x41F0000000000000 (the double 2^32 = 4294967296.0; in little-endian memory order its last two bytes are F0 41).

My problem is: what is the constant I need to add to obtain the right value? If subtraction does not work, what is the operation to go from 0xF0BF (-1 as double) to 0xF043 (18446744073709551615 as double)?

Thanks for your help.

Octocontrabass · Post by **Octocontrabass** » Sun Jun 19, 2016 1:41 pm

fano1 wrote:I don't understand this note: "parity isn't about the least-significant bit, it's about the number of set bits". I'm not checking whether the number is even (?) but whether it is signed or unsigned. A signed number has the last bit set to 1, or not?

You were using JPE (Jump Parity Even) instead of JE (Jump Equal), so the comparison was checking the parity flag instead of the zero flag.

The sign bit is the most-significant bit, not the least-significant bit. If you want to check the sign bit, you should be shifting to the right instead of shifting to the left.
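
Putting the sign-bit check together with the constant asked about earlier, a C sketch of the whole conversion might look like this. The name `ulong_to_double` is made up, and the constant follows from plain arithmetic: when the top bit is set, reading the same 64 bits as signed gives the value minus 2^64, so adding 2^64 (bit pattern 0x43F0000000000000) restores it, just as 2^32 (0x41F0000000000000) does in the 32-bit code above:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unsigned 64-bit -> double using only a signed
   conversion (what FILD qword provides) plus a fix-up constant. */
double ulong_to_double(uint64_t v)
{
    const double TWO_64 = 18446744073709551616.0;   /* 2^64 = 0x43F0000000000000 */

    int64_t as_signed;
    memcpy(&as_signed, &v, sizeof as_signed);       /* same bits, read as signed */

    double d = (double)as_signed;                   /* signed conversion */
    if (v >> 63)                                    /* sign bit = MOST significant bit */
        d += TWO_64;                                /* undo the two's-complement wrap */
    return d;
}
```

For instance, `ulong_to_double(0xFFFFFFFFFFFFFFFF)` yields 18446744073709551616.0, the nearest double. Note that this sketch rounds twice in double arithmetic, so for some inputs the last bit may differ from a directly rounded conversion.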

OSDev.org
