compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 5:08 am
by dlarudgus20
As we know, until we add floating-point support to our OS, we should stop the compiler from using these instructions when we build with optimization, e.g.:

-O3 -mno-mmx -mno-sse -mno-sse2 -mno-sse3

But once floating point is supported, is it good for speed to enable them? Of course, the compiler only uses them when it thinks they help. But if we use them, the OS has to save and restore the MMX/SSE registers.

Despite that overhead in the OS, is using MMX/SSE good for speed or not? Any advice will be appreciated.

Re: compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 5:21 am
by sortie

-mno-mmx -mno-sse -mno-sse2 -mno-sse3

Note that compilers like i686-elf don't enable any of those by default. Only x86_64 compilers do, because x86_64 requires MMX, SSE and SSE2 (I think, look it up) to be available, so the compiler assumes they are.
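
One way to check what a given toolchain assumes is to look at the macros it predefines. The sketch below relies on `__MMX__`, `__SSE__` and `__SSE2__`, which GCC and Clang define when the matching `-m` option is in effect. The `KERNEL_NO_SIMD` macro and the bitmask encoding are hypothetical names made up for this illustration.

```c
/* Sketch: detect at compile time which SIMD extensions the compiler is
 * allowed to emit. KERNEL_NO_SIMD is a hypothetical macro a kernel build
 * could define to turn a wrong flag set into a build error. */
#if defined(KERNEL_NO_SIMD) && (defined(__MMX__) || defined(__SSE__) || defined(__SSE2__))
#error "kernel built with MMX/SSE code generation enabled; add -mno-mmx -mno-sse -mno-sse2"
#endif

/* Returns a bitmask (hypothetical encoding): bit 0 = MMX, bit 1 = SSE,
 * bit 2 = SSE2. A build with the -mno-* flags above yields 0. */
unsigned simd_codegen_mask(void)
{
    unsigned mask = 0;
#ifdef __MMX__
    mask |= 1u << 0;
#endif
#ifdef __SSE__
    mask |= 1u << 1;
#endif
#ifdef __SSE2__
    mask |= 1u << 2;
#endif
    return mask;
}
```

The same information is available without writing code by dumping the predefined macros of the cross compiler and searching for these names.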

There is absolutely nothing wrong with using floating point registers in the kernel and removing those flags. However, that means your interrupt handlers must take care not to overwrite the interrupted code's values, and must reset the floating point environment to a safe state. Usually, you don't really need floating point stuff in a kernel and can do without.
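
A minimal sketch of that save/reset/restore pattern, assuming an x86 CPU with FXSAVE support and a GCC-compatible compiler. All function names here are hypothetical. In a real interrupt handler the save has to happen before any compiler-generated FPU/SSE code runs, which usually means doing it from the assembly entry stub.

```c
#include <stdint.h>
#include <string.h>

/* FXSAVE/FXRSTOR use a 512-byte area that must be 16-byte aligned. */
struct fx_state {
    uint8_t bytes[512];
} __attribute__((aligned(16)));

#if defined(__i386__) || defined(__x86_64__)
void fx_save(struct fx_state *s)
{
    __asm__ volatile("fxsave %0" : "=m"(*s) : : "memory");
}

void fx_restore(const struct fx_state *s)
{
    __asm__ volatile("fxrstor %0" : : "m"(*s) : "memory");
}

/* Put the FPU and SSE control state back to power-on style defaults. */
void fx_reset(void)
{
    uint32_t mxcsr = 0x1F80;            /* all SSE exceptions masked */
    __asm__ volatile("fninit");
    __asm__ volatile("ldmxcsr %0" : : "m"(mxcsr));
}
#else
/* Stubs so the sketch still compiles on other architectures. */
void fx_save(struct fx_state *s) { memset(s, 0, sizeof *s); }
void fx_restore(const struct fx_state *s) { (void)s; }
void fx_reset(void) { }
#endif

/* Hypothetical handler body that uses floating point. */
void example_body(void *arg)
{
    double *d = arg;
    *d = *d * 1.5 + 1.0;
}

/* Run `body` without disturbing the interrupted code's FPU/SSE state. */
void run_with_fpu_saved(void (*body)(void *), void *arg)
{
    struct fx_state saved;
    fx_save(&saved);     /* keep the interrupted context's registers */
    fx_reset();          /* start from a known-safe environment */
    body(arg);
    fx_restore(&saved);  /* hand the registers back untouched */
}
```

Something like `run_with_fpu_saved(example_body, &value)` would then wrap any handler code that may touch x87, MMX or SSE registers.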

Re: compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 5:35 am
by dlarudgus20
Sometimes the compiler uses these for optimization. (I have hit a bug because of this before: http://stackoverflow.com/questions/1961 ... 0-register :cry: )

My question is whether that is really helpful for optimization...

Re: compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 5:38 am
by Nable
sortie wrote: Usually, you don't really need floating point stuff in a kernel and can do without.
SIMD instructions provide more than floating-point operations. At the very least, a fast memcpy is a very nice thing for a kernel. Some other parts of the kernel can also gain some speed-up from SIMD instructions (e.g. crypto modules).
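
To illustrate the memcpy case, here is a rough sketch of a copy loop that moves 16 bytes per iteration through an XMM register using the SSE2 intrinsics from `<emmintrin.h>`. The name `copy_sse2` is made up for this example, and the code is not tuned; whether it beats a plain `rep movsb` or the compiler's own memcpy depends on the CPU and the copy size.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes from src to dst (regions must not overlap). */
void *copy_sse2(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

#if defined(__SSE2__)
    /* Main loop: one unaligned 128-bit load and store per 16 bytes. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Tail (or the whole copy when SSE2 is disabled): byte by byte. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

In a kernel this only makes sense once the XMM registers are saved and restored around it, which is exactly the overhead under discussion.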

Re: compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 5:55 am
by sortie
Yes, it could potentially speed up memcpy, at the cost of having to save floating point registers. There is a trade-off here that depends on what exactly you want to do. In some ways, it would be cleanest if the kernel could use floating point registers, and I suspect doing so is not nearly as expensive as it used to be, since computers are faster today. Do your own profiling to determine what gains are available. :-)

Re: compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 6:24 am
by Bender
I don't think that should cause a problem (unless you have a hidden bug in your code). It may cause something like 'Invalid Opcode' if you haven't enabled SSE or initialized the FPU (not sure about the latter). Adding FPU support is a trivial task: AFAIK you just need to initialize the FPU and/or SSE, have nice handlers for floatbloats/faults, and let the compiler handle the rest. :)
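
As a sketch of what "initialize the FPU and/or SSE" involves on x86: the control-register bits below are the architectural ones (CR0.MP, CR0.EM, CR0.TS, CR0.NE, CR4.OSFXSR, CR4.OSXMMEXCPT), but the helper names are hypothetical. The helpers only compute the new register values; actually reading and writing CR0/CR4 and executing `fninit` needs ring 0 and is left out here.

```c
#include <stdint.h>

#define CR0_MP          (1u << 1)   /* monitor coprocessor */
#define CR0_EM          (1u << 2)   /* x87 emulation: must be clear */
#define CR0_TS          (1u << 3)   /* task switched: traps FPU use when set */
#define CR0_NE          (1u << 5)   /* native x87 error reporting */
#define CR4_OSFXSR      (1u << 9)   /* OS supports FXSAVE/FXRSTOR, enables SSE */
#define CR4_OSXMMEXCPT  (1u << 10)  /* OS handles SIMD FP exceptions (#XM) */

/* Value to write back to CR0 so x87/MMX instructions execute natively. */
uint32_t cr0_for_fpu(uint32_t cr0)
{
    cr0 &= ~(CR0_EM | CR0_TS);
    cr0 |= CR0_MP | CR0_NE;
    return cr0;
}

/* Value to write back to CR4 so SSE instructions stop raising #UD. */
uint32_t cr4_for_sse(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}
```

Without CR4.OSFXSR set, SSE instructions raise the 'Invalid Opcode' exception mentioned above.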

Re: compiler optimization - With MMX/SSE or without

Posted: Mon Apr 28, 2014 2:02 pm
by Owen
The cost of enabling FPU usage in the kernel has kept steady over time because FPU state has grown. SSE grew it to ~256 bytes, AMD64 grew it to ~512 bytes, AVX ~doubled that; AVX512 will double it again.

Hence, saving the FPU state on every context switch is expensive. Note that, for Haswell as one example, saving AVX state corresponds in cost to saving 3% of L1 cache. This isn't overwhelming, but it is large. Also note that the size of L1 caches has been fairly static (all Intel CPUs have had 64kB L1 for the last decade), so the relative cost is growing.

For a Haswell at max turbo (3.9 GHz), a 1kB burst read or write to RAM takes 156 cycles. That is before setup: expect ~16 *RAM* cycles (~400 MHz) of latency before anything happens while the DRAM's internal state machine prepares the DRAM row for the burst, never mind whatever latency is required for the request to bubble up through L1, L2, L3, maybe L4. Now, either you are doing a "routine" context switch (in which case the state is going to have to be evicted into RAM), or you're context switching back pretty quickly (in which case you haven't done enough work to justify 2kB of L2 traffic). In pure cache bandwidth terms, it takes 32 cycles for that data to transit the L1->L2 bus, and 16 cycles for it to transit back.
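
The 156-cycle figure can be reproduced with simple arithmetic. The post does not state the memory bandwidth behind it; the sketch below assumes 25.6 GB/s (the peak of dual-channel DDR3-1600), which is a guess that happens to match. The function names are hypothetical.

```c
#include <stdint.h>

/* CPU cycles that pass while `bytes` bytes stream at `bytes_per_sec`. */
uint64_t transfer_cycles(uint64_t bytes, uint64_t cpu_hz, uint64_t bytes_per_sec)
{
    return bytes * cpu_hz / bytes_per_sec;
}

/* CPU cycles that pass during `ram_cycles` cycles of a clock at `ram_hz`. */
uint64_t latency_cycles(uint64_t ram_cycles, uint64_t cpu_hz, uint64_t ram_hz)
{
    return ram_cycles * cpu_hz / ram_hz;
}

/* With the numbers from the post (3.9 GHz core, 1 kB of state, ~400 MHz RAM
 * clock) and the assumed 25.6 GB/s:
 *   transfer_cycles(1024, 3900000000, 25600000000) = 156
 *   latency_cycles(16, 3900000000, 400000000)      = 156
 * so under this assumption the setup latency alone costs about as much as
 * the burst itself. */
```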

Never mind the fact that XSAVE/XRSTOR are, like many system-mode instructions, what people in general would call "microcode monstrosities", which require the whole out-of-order pipeline to drain before execution (and no further instructions may start until they complete). Note also that on Hyper-Threaded CPUs the other thread is paused as well.

This is one of those areas where, in real terms, the CPU hasn't really gotten faster. For every increase in speed, a corresponding increase of expense has occurred.