rdos wrote: JamesM wrote: rdos wrote: Assembly is best, of course. Some less important and complex drivers could be in C/C++, but nothing critical should be.
Yes, because you're better than an optimizing and auto-vectorising compiler.
****.
Of course. Vectorising or not, C compilers simply cannot handle global optimizations (like automatic register allocation between functions) very well, and therefore will suck at function-call performance. And even if they have an automatic register allocation scheme, it is bound to be less than optimal, requiring shuffling things between memory/stack/registers. I can clearly see the difference (in performance) between my old call interface, which used the stack for parameters and needed an intermediate step, and the new fine-tuned interface that tells the compiler exactly which registers to pass parameters in. Additionally, C compilers also suck when there is more than one out-parameter, as these need to be passed as pointers. With an assembly interface, they would be passed in registers.
Not to speak of the fact that C compilers really suck at handling segmented memory-models.
EDIT: Forgot to mention that C cannot handle using the CY flag to signal return status. This is clearly superior to using a whole return register for status (especially when combined with ordinary results, which then need to be passed by reference).
These problems are built into C, and thus cannot be solved with any kind of optimization available.
Compilers nowadays have options for 'global program optimization', and can optimize code across function calls, even across different files. You can also manage to pass all parameters as a pointer to a structure. That can still pollute the cache and add slight overhead to the code, but pushing one pointer on the stack is not dramatic either, and the number of registers is limited on the 32-bit x86 architecture anyway. I think I have read that there is a new calling convention for x64 that uses registers to pass parameters in a standard way, because x64 has more registers, but then yay for recursive functions =)
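To make the structure-pointer idea concrete, here is a minimal sketch in C. The names `copy_args` and `copy_with_checksum` are made up for illustration and do not come from any real kernel. All inputs, plus a second result, travel behind a single pointer, so only one argument has to be passed whatever the calling convention is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block: every parameter lives behind one pointer. */
struct copy_args {
    const uint8_t *src;      /* input buffer */
    uint8_t       *dst;      /* output buffer */
    size_t         len;      /* number of bytes to copy */
    uint32_t       checksum; /* second result, filled in by the callee */
};

/* Copies len bytes and returns the byte count as the ordinary return value.
   The checksum, a second "out-parameter", comes back through the same block,
   so no extra pointer argument is needed for it. */
size_t copy_with_checksum(struct copy_args *a)
{
    assert(a != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < a->len; i++) {
        a->dst[i] = a->src[i];
        sum += a->src[i];
    }
    a->checksum = sum;
    return a->len;
}
```

If the call sits in a hot path and the compiler can see both sides (same file, or link-time optimization), it is free to inline the call and keep the fields in registers, though that depends on the compiler and its flags.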
For return status, you could just write a macro for the return that sets some bit flags in the return value to indicate the status and stores the actual result in the remaining bits. That would not allow a full 32-bit return type, and it would not work if the function has to return a pointer, unless you do as most hardware does: require a certain alignment from pointers and use the low bits as flags. You could also use a global variable to store the status, even if that's rather ugly, or define some kind of execution context in which functions can store some information. Anyway, it would only really matter if there are a lot of function calls being made, and in that case some other way can be thought of to pass parameters and handle status.
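As a rough sketch of the aligned-pointer variant in C: the status goes into the two low bits of the returned pointer, which are always zero if every returned object is at least 4-byte aligned. All names here (`tagged_result`, `make_result`, `lookup`, the `ST_*` codes) are invented for the example, and it assumes the usual flat-memory behaviour where converting a pointer to `uintptr_t` and back preserves it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low 2 bits are free as long as every returned pointer is 4-byte aligned. */
#define STATUS_MASK ((uintptr_t)0x3)

enum status { ST_OK = 0, ST_NOT_FOUND = 1, ST_BUSY = 2 };

/* Pointer and status packed into one register-sized value. */
typedef uintptr_t tagged_result;

tagged_result make_result(void *ptr, enum status st)
{
    uintptr_t bits = (uintptr_t)ptr;
    assert((bits & STATUS_MASK) == 0); /* caller must respect the alignment rule */
    return bits | (uintptr_t)st;
}

enum status result_status(tagged_result r)
{
    return (enum status)(r & STATUS_MASK);
}

void *result_ptr(tagged_result r)
{
    return (void *)(r & ~STATUS_MASK);
}

/* Example callee: uint32_t objects are 4-byte aligned on common platforms. */
static uint32_t table[4] = {10, 20, 30, 40};

tagged_result lookup(size_t index)
{
    if (index >= 4)
        return make_result(NULL, ST_NOT_FOUND);
    return make_result(&table[index], ST_OK);
}
```

The caller checks `result_status()` first and only then uses `result_ptr()`. The cost is a mask operation on each side instead of a second out-parameter passed by reference.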
There is a good series of books that covers the whole topic in depth. It's called 'how to optimize software in c++', and there are also versions about C and assembler. They are very well made and explain the whole problem.
The way I see it, really optimizing asm nowadays is very different from what it was in the DOS days, before the Pentium architecture, because execution is now pipelined. To optimize asm code correctly, you need to know how each instruction is broken down into micro-ops and how those fit into the different pipelines. There is the whole matter of throughput versus latency to manage: you have to see which instructions will stall the ones that follow. All of that can depend heavily on the type, brand and generation of the CPU, and it is very hard to optimize manually for each architecture.
Instruction ordering, the way branching is done, and good handling of cache lines can have more impact on performance than the traditional 'linear' way of optimizing by counting clock cycles instruction by instruction. And given the number of different CPU architectures around, between Intel and AMD and all their generations of CPUs, it is very hard to write generic asm code that is optimized for all of them.
When you look at what can be done with the Intel compiler, the Math Kernel Library and OpenMP, with whole code paths selected at runtime and optimized for the specific CPU the code is running on, it makes assembler speed optimization pretty useless. And with the intrinsic functions (<xmmintrin.h>), even SSE and SIMD can be handled in pure C. With good management of inlining and global program optimization, that can give very good results.
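For what SIMD in pure C looks like, here is a small sketch using the SSE intrinsics from `<xmmintrin.h>`. The function name `add_floats` is made up. The `__SSE__` check is how GCC and Clang signal SSE support (other compilers may use a different macro), and the scalar loop serves both as the tail handler and as the fallback when SSE is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* dst[i] = a[i] + b[i] for n floats. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration; unaligned loads, so no alignment requirement. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop: handles the leftover elements, or everything without SSE. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

No asm is written by hand here: register allocation and instruction scheduling around the intrinsics are left to the compiler, which can do them differently for each target CPU.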
Even at a scale as large as an OS, global optimization is better achieved with good data organisation, good scheduling, and better high-level handling of a whole lot of features and asynchronous things, rather than by saving a stack access on a function call. I mean, every little gain counts, as they say, but asm can also get in the way of managing all resources at a large scale in a way that is easy to debug and maintain. It can also get in the way of implementing the high-level algorithms that do more for global performance than the little you gain by optimizing function calls. And there are many ways to arrange code so that you don't have to pass many parameters many times to a function within a loop.
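One way to read that last point, sketched in C with made-up names (`blend_params`, `blend_buffer`): instead of calling a per-element function with four settings on every iteration, the loop moves inside the callee, and the settings that stay constant are passed once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical settings that stay constant for the whole loop. */
struct blend_params {
    int32_t gain;
    int32_t offset;
    int32_t min;
    int32_t max;
};

/* One call processes the whole buffer: the four settings are passed once,
   behind one pointer, instead of once per element. */
void blend_buffer(int32_t *buf, size_t n, const struct blend_params *p)
{
    assert(buf != NULL && p != NULL);
    for (size_t i = 0; i < n; i++) {
        int32_t v = buf[i] * p->gain + p->offset;
        if (v < p->min) v = p->min; /* clamp to the allowed range */
        if (v > p->max) v = p->max;
        buf[i] = v;
    }
}
```

With n elements this is one call instead of n, so the cost of the calling convention mostly drops out, whichever way the parameters are passed.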