compiler optimization - With MMX/SSE or without

dlarudgus20
Member
Posts: 36
Joined: Sat Oct 26, 2013 4:14 am

compiler optimization - With MMX/SSE or without

Post by dlarudgus20 »

As we know, until we add floating-point support to our OS, we should keep the compiler from using these instruction sets when we turn on optimization. (e.g.

Code: Select all

-O3 -mno-mmx -mno-sse -mno-sse2 -mno-sse3
)

But once floating point is supported, is enabling them good for speed? Of course, the compiler only uses them when it thinks they will be faster. But if we use them, the OS has to save and restore the MMX/SSE registers.

Despite that overhead in the OS, is using MMX/SSE good for speed or not? Any advice will be appreciated.
sortie
Member
Posts: 931
Joined: Wed Mar 21, 2012 3:01 pm
Libera.chat IRC: sortie

Re: compiler optimization - With MMX/SSE or without

Post by sortie »

Code: Select all

-mno-mmx -mno-sse -mno-sse2 -mno-sse3
Note that compilers like i686-elf don't enable any of those by default. Only x86_64 compilers do, because x86_64 requires MMX, SSE and SSE2 (I think, look it up) to be available, so the compiler assumes they are.

There is absolutely nothing wrong with using floating-point registers in the kernel and removing those flags. However, that means your interrupt handlers must take care not to overwrite the values, and must reset the floating-point environment to a safe state. Usually, you don't really need floating point in a kernel and can do without.
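
For illustration, here is a minimal sketch of that kind of save/restore (assuming 32-bit code with SSE already enabled; the function name and the single static 16-byte-aligned buffer are just for the example, since a real kernel would use a per-CPU or per-frame save area so that nested interrupts don't clobber it):

Code: Select all

#include <stdint.h>

/* Sketch only: preserve x87/SSE state across a handler that may
 * itself touch floating-point registers. FXSAVE/FXRSTOR require a
 * 16-byte-aligned 512-byte area and CR4.OSFXSR to be set. */
static uint8_t fxsave_area[512] __attribute__((aligned(16)));

void irq_handler_with_fpu(void)
{
    __asm__ volatile ("fxsave %0" : "=m" (fxsave_area));

    /* ... handler body that may use XMM/x87 registers ... */

    __asm__ volatile ("fxrstor %0" : : "m" (fxsave_area));
}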
dlarudgus20
Member
Posts: 36
Joined: Sat Oct 26, 2013 4:14 am

Re: compiler optimization - With MMX/SSE or without

Post by dlarudgus20 »

Sometimes the compiler uses these for optimization. (I ran into a bug because of this before: http://stackoverflow.com/questions/1961 ... 0-register :cry: )

My question is whether that really helps performance...
Nable
Member
Posts: 453
Joined: Tue Nov 08, 2011 11:35 am

Re: compiler optimization - With MMX/SSE or without

Post by Nable »

sortie wrote:Usually, you don't really need floating point stuff in a kernel and can do without.
SIMD instructions provide more than just floating-point operations. At the very least, a fast memcpy is a very nice thing for a kernel, and other parts of the kernel (e.g. crypto modules) can also gain some speed-up from SIMD instructions.
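
For illustration, the inner loop of such a memcpy might look like this sketch (the function name is made up; it assumes 16-byte-aligned pointers and a size that is a multiple of 16, so a real version needs head/tail handling, and in a kernel it must only run once SSE state saving is in place):

Code: Select all

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

static void *memcpy_sse2(void *dst, const void *src, size_t n)
{
    __m128i *d = dst;
    const __m128i *s = src;

    /* 16 bytes per iteration through the XMM registers */
    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(&d[i], _mm_load_si128(&s[i]));
    return dst;
}

For large copies, non-temporal stores (_mm_stream_si128) can additionally avoid evicting useful data from the cache.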
sortie
Member
Posts: 931
Joined: Wed Mar 21, 2012 3:01 pm
Libera.chat IRC: sortie

Re: compiler optimization - With MMX/SSE or without

Post by sortie »

Yes, it could potentially speed up memcpy, at the cost of having to save the floating-point registers. There is a trade-off here that depends on what exactly you want to do. In some ways it would be cleanest if the kernel can use floating-point registers, and I suspect doing so is not nearly as expensive as it used to be, since computers are faster today. Do your own profiling to determine what gains are available. :-)
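
For example, a rough userspace sketch of such a measurement using RDTSC (for stable numbers you would also serialize with CPUID or LFENCE, pin the CPU frequency, and average many runs):

Code: Select all

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    static char src[65536], dst[65536];

    uint64_t start = rdtsc();
    memcpy(dst, src, sizeof dst);
    uint64_t elapsed = rdtsc() - start;

    printf("memcpy of 64 KiB took %llu cycles\n",
           (unsigned long long)elapsed);
    return 0;
}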
Bender
Member
Posts: 449
Joined: Wed Aug 21, 2013 3:53 am
Libera.chat IRC: bender|
Location: Asia, Singapore

Re: compiler optimization - With MMX/SSE or without

Post by Bender »

I don't think that should cause a problem (unless you have a hidden bug in your code); it may cause something like an 'Invalid Opcode' exception if you haven't enabled SSE or initialized the FPU (not sure about the latter). Adding FPU support is a trivial task: AFAIK you just need to initialize the FPU and/or SSE, have handlers for floating-point faults, and let the compiler handle the rest. :)
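
Roughly, that initialization looks like the sketch below (bit positions per the Intel manuals; this assumes 32-bit protected mode and covers only x87 and legacy SSE, since AVX additionally needs CR4.OSXSAVE and an XSETBV of XCR0):

Code: Select all

#include <stdint.h>

/* CR0.MP (bit 1) set      -> WAIT/FWAIT honours CR0.TS
 * CR0.EM (bit 2) clear    -> x87 instructions execute natively
 * CR4.OSFXSR (bit 9)      -> enable FXSAVE/FXRSTOR and SSE
 * CR4.OSXMMEXCPT (bit 10) -> unmasked SSE exceptions raise #XM */
void fpu_sse_init(void)
{
    uint32_t cr0, cr4;

    __asm__ volatile ("mov %%cr0, %0" : "=r" (cr0));
    cr0 &= ~(1u << 2);               /* clear EM */
    cr0 |=  (1u << 1);               /* set MP   */
    __asm__ volatile ("mov %0, %%cr0" : : "r" (cr0));

    __asm__ volatile ("mov %%cr4, %0" : "=r" (cr4));
    cr4 |= (1u << 9) | (1u << 10);   /* OSFXSR | OSXMMEXCPT */
    __asm__ volatile ("mov %0, %%cr4" : : "r" (cr4));

    __asm__ volatile ("fninit");     /* put the x87 in a known state */
}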
"In a time of universal deceit - telling the truth is a revolutionary act." -- George Orwell
(R3X Runtime VM)(CHIP8 Interpreter OS)
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: compiler optimization - With MMX/SSE or without

Post by Owen »

The cost of enabling FPU usage in the kernel has held steady over time because the FPU state has grown: SSE grew it to ~256 bytes, AMD64 grew it to ~512 bytes, AVX roughly doubled that, and AVX-512 will double it again.

Hence, saving the FPU state on every context switch is expensive. Note that, taking Haswell as one example, saving AVX state costs the equivalent of writing about 3% of the L1 cache. That isn't overwhelming, but it is large. Also note that the size of L1 caches has been fairly static (all Intel CPUs have had 64 kB of L1 for the last decade), so the relative cost is growing.

For a Haswell at max turbo (3.9 GHz), a 1 kB burst read or write to RAM takes 156 cycles, plus setup: expect ~16 *RAM* cycles (~400 MHz) of latency before anything happens while the DRAM's internal state machine prepares the row for the burst, never mind the latency needed for the request to bubble up through L1, L2, L3, and maybe L4. Now, either you are doing a "routine" context switch (in which case the state is going to have to be evicted to RAM), or you are context switching back pretty quickly (in which case you haven't done enough work to justify 2 kB of L2 traffic). In pure cache-bandwidth terms, it takes 32 cycles for that data to transit the L1->L2 bus, and 16 cycles for it to transit back.

Never mind the fact that XSAVE/XRSTOR are, like many system-mode instructions, what people would generally call "microcode monstrosities": they require the whole out-of-order pipeline to drain before execution, and no further instructions may start until they complete. (Note also that on Hyper-Threaded CPUs the other thread is paused as well.)
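
For reference, invoking them looks roughly like this sketch (the buffer size is a placeholder, as the real size must be queried with CPUID leaf 0Dh; the 64-byte alignment is an architectural requirement, and EDX:EAX selects which state components to save):

Code: Select all

#include <stdint.h>

static uint8_t xsave_area[4096] __attribute__((aligned(64)));

void save_ext_state(void)
{
    /* all-ones mask: save every component enabled in XCR0 */
    __asm__ volatile ("xsave %0"
                      : "+m" (xsave_area)
                      : "a" (0xffffffffu), "d" (0xffffffffu));
}

void restore_ext_state(void)
{
    __asm__ volatile ("xrstor %0"
                      : : "m" (xsave_area),
                          "a" (0xffffffffu), "d" (0xffffffffu));
}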

This is one of those areas where, in real terms, the CPU hasn't really gotten faster: every increase in speed has come with a corresponding increase in expense.