Page 2 of 2

Posted: Tue Jun 03, 2008 4:21 am
by JamesM
Side note: Out of curiosity, how many people here realize that Intel's newest 80x86 CPU architecture (Atom) does not do "out of order" execution?
Yes, it's in-order, but it can run at 2GHz! ;)

(Also, IIRC ARM didn't use to have OoO execution either; it relied solely on a half-decent branch predictor. I may be wrong though, it's been a while since I looked up that information.)

Posted: Tue Jun 03, 2008 6:50 am
by Combuster
Brendan wrote:How about an assembler with a basic peephole optimizer, that will also track instruction dependencies and rearrange "basic blocks" so that I don't need to write unmaintainable spaghetti code (unless I'm working on the small part of my project where I actually want to hand-optimize)?
Well, the assembler should first allow some parts to be optimized while others are not. There are always cases where you do not want instructions reordered (I/O, locking), while you do want that in other parts.

Even then, reordering instructions is not always trivial. Which bottleneck should the optimizer target? Maximizing the pipes? Decoder fitting? AGI avoidance? Cache size? 586 or NetBurst? Correctly optimizing for one thing can degrade performance in other places. So "basic" is something you should watch out for.

Even your example will still cause stalls: it has two shifts together, which will always stall the Pentium's V-pipe, even though it is possible to reorder the code to avoid that.
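As a rough sketch of that point (the registers and constants are made up for illustration): on the original Pentium, shifts can only issue in the U-pipe, so two adjacent shifts never pair, while a shift paired with an independent simple ALU instruction can.

```
; Unpaired: both shifts need the U-pipe, so the second one waits a cycle.
        shl  eax, 2
        shl  ebx, 3
        add  ecx, 4
        add  edx, 5

; Reordered: each shift pairs with an independent add in the V-pipe.
        shl  eax, 2
        add  ecx, 4
        shl  ebx, 3
        add  edx, 5
```

Both sequences compute the same results; only the pairing (and thus the cycle count on a P5) differs.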
Side note: Out of curiosity, how many people here realize that Intel's newest 80x86 CPU architecture (Atom) does not do "out of order" execution? :shock:
I'm never up to date with those things :)

Posted: Tue Jun 03, 2008 7:36 am
by Brendan
Hi,
Combuster wrote:
Brendan wrote:How about an assembler with a basic peephole optimizer, that will also track instruction dependencies and rearrange "basic blocks" so that I don't need to write unmaintainable spaghetti code (unless I'm working on the small part of my project where I actually want to hand-optimize)?
Well, the assembler should first be able to allow parts to be optimized, while others are not.
Definitely, although I'd be tempted to go further: allow the programmer to explicitly select which optimizations are enabled/disabled for arbitrary pieces of code (potentially including which CPU/s to optimize for).
Combuster wrote:There are always cases where you do not want instructions reordered (I/O, locking), while you do want that in other parts.
The assembler can split the code into "basic blocks", where all instructions within a basic block can be reordered. Certain instructions can't be part of a basic block - things like control flow instructions (JMP, CALL, Jcc, INT n, RET, etc.), targets of control flow instructions (e.g. for "JMP FOO", an instruction that's after the label "FOO" can't be shifted before the label "FOO"), I/O port instructions, anything with a LOCK prefix, LFENCE, MFENCE, SFENCE, RDMSR, RDPMC, RDTSC, WBINVD, INVLPG, MOV CRn, etc.
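A sketch of where those block boundaries would fall (labels and instructions invented for illustration):

```
foo:                        ; label = branch target = block boundary
        mov  eax, [count]   ; }
        add  eax, 1         ; } one basic block: these may be reordered
        mov  ebx, [limit]   ; } amongst themselves (subject to
        cmp  eax, ebx       ; } register/flag dependencies)
        jb   foo            ; control flow instruction = block boundary
        out  0x70, al       ; I/O instruction = never reordered
```

Within the block, dependency tracking still applies (e.g. the CMP must stay before the Jcc that consumes its flags); the boundaries just mark where reordering must stop entirely.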

You'd also need to be very careful with reads and writes to memory. For example, consider the instructions "mov eax,[foo]" and "mov ebx,[ecx]" - the assembler can't change their order, because it can't know that "[ecx]" and "[foo]" aren't the same address (and can't know that neither instruction touches a memory-mapped device).
Combuster wrote:Even then, reordering instructions is not always trivial. Which bottleneck should the optimizer target? Maximizing the pipes? Decoder fitting? AGI avoidance? Cache size? 586 or NetBurst? Correctly optimizing for one thing can degrade performance in other places. So "basic" is something you should watch out for.
Even generic optimizations would be a huge bonus.

Consider, um, me. For most of the (assembly language) code I write, I do "half-assed" hand-optimization, mostly caring about the algorithm and not caring too much about micro-optimizations. The micro-optimization I do end up doing is mostly out of habit (I don't really think about what I'm doing, and I never actually benchmark/profile). This is partly because the code I write is almost always intended for a wide range of different CPUs (for example, there's no real reason to spend ages getting the best performance on a Pentium 4 when the code could be run on a K6, a Pentium M, an 80486, or ...).

The idea (at least initially) would be for the assembler to do "half-assed" micro-optimizations for me, so that I can write easy to read (and easy to maintain) code instead of complicating the source code with my own "half-assed" micro-optimizations.

Of course it would be entirely possible for the optimizer to take into account all sorts of things, but IMHO perfection isn't strictly necessary and the extra complexity may not be worth worrying about for an assembler's initial release (maybe in the second release?). ;)


Cheers,

Brendan

Posted: Tue Jun 03, 2008 8:30 am
by JamesM
It's interesting to note that GAS does do some instruction reordering - on the MIPS and SPARC architectures, several instructions (most notably branches) are followed by a delay slot. By the time the branch has been decoded and the decision to take it has been made, the next instruction has already been fetched, so it is executed anyway.

GAS will reorder your instructions so that delay slots are loaded with useful work and not NOPs, if it can. Again, you can manually override that behaviour (for example, if you're writing an interrupt handler or other low-level code) with the ".set noreorder" directive.
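A minimal MIPS sketch of the difference (register choices and loop bodies are illustrative):

```
# With reordering on (the default), you write the natural order and
# the assembler is free to move the addiu into the delay slot:
        .set reorder
loop:   addiu $t0, $t0, 1
        bne   $t0, $t1, loop        # assembler fills the delay slot

# With reordering off, you fill (or NOP) the slot yourself:
        .set noreorder
loop2:  bne   $t0, $t1, loop2
        addiu $t0, $t0, 1           # delay slot: executes either way
```

The second form is what you want in interrupt handlers and other code where the exact instruction order must match the source.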

Posted: Tue Jun 03, 2008 12:58 pm
by bewing
I'd say that the sort of "half-assed" micro-optimization that Brendan is talking about should be done as a post-processing pass on the object file. That post-processing should be done by a program completely separate from the assembler -- because it can then be run on the object files output by many different compilers/assemblers.

To do a smarter job of optimization than that, the compiler needs more understanding of what the code is doing -- which requires either a somewhat higher-level language or an inordinate number of "hints" from the programmer, and that almost certainly takes more time than it's worth.

In most cases (as has been pointed out many times), it will take each of us so long to create our OS that the chip architecture at the time of the OS's debut is unknowable. We cannot know how many pipes to code for, or what will stall a pipe, or whether the chip will do out-of-order execution, or anything. So I'm writing my code with no hand micro-optimizations at all in the ASM. At release time, if I can run the object file through some kind of micro-optimizer targeted at current CPU designs, then that would be great. Until then, I want my code to be readable by me, and to always single-step in exactly the same opcode order that I wrote it in.

Posted: Tue Jun 03, 2008 1:38 pm
by ollie123
Will this new assembler be able to output a flat binary as well as ELF?

Posted: Tue Jun 03, 2008 7:15 pm
by Ryu
ollie123 wrote:Will this new assembler be able to output a flat binary as well as ELF?
Producing the flat binary is on the linker side. :)
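For instance, with the GNU toolchain a flat binary can come straight out of the linker via a linker script (the addresses and file names here are illustrative, not part of the thread):

```
/* link.ld -- invoke as: ld -T link.ld kernel.o -o kernel.bin */
OUTPUT_FORMAT(binary)       /* emit a flat binary, no ELF headers */
ENTRY(start)
SECTIONS
{
    . = 0x100000;           /* load address, illustrative */
    .text : { *(.text) }
    .data : { *(.data) }
    .bss  : { *(.bss) }
}
```

Alternatively, you can keep ELF output from the linker and flatten it afterwards with objcopy (`objcopy -O binary kernel.elf kernel.bin`).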


It would be cool if the assembler could assemble basic definitions in C style - for example, structure definitions (though the variables can stay basic byte/word/dword etc.) and function prototypes declared like C, with some way to define your own calling convention. Allow C style comments "/* */" too; frankly, I also hate how assemblers like MASM only allow a single instruction per line. Those two things would greatly tidy up assembly code. It's somewhat the same idea as inline assembly in MSVC, except there you're stuck with the cdecl, fastcall and stdcall calling conventions, it annoys you with return values, and it only allows a single instruction per line.

Here's a poorly written example of code for a clearer picture:


DATA SEGMENT USE32;

struct my_struct {
/* 00h */ dword d;
/* 04h */ word w[2];
/* 08h */ byte b[4];
};

my_struct garbage{ 0, 1, 2, 3, 4, 5, 6 };

DATA ENDS;

CODE SEGMENT USE32;

__cdecl InitializeGarbage( dword d, word w1, word w2, byte b1, byte b2, byte b3, byte b4)
{
    mov eax, [d];  /* d translates to [esp+4h] */
    mov [dword ptr garbage + 00h], eax;
    mov ax, [w1];
    mov [word ptr garbage + 04h], ax;
    mov ax, [w2];
    mov [word ptr garbage + 06h], ax;
    mov al, [b1]; mov [byte ptr garbage + 08h], al;
    mov al, [b2]; mov [byte ptr garbage + 09h], al;
    mov al, [b3]; mov [byte ptr garbage + 0Ah], al;
    mov al, [b4]; mov [byte ptr garbage + 0Bh], al;

    mov eax, 1;  /* success */
}

CODE ENDS;

Since assembly usually makes use of more than a simple return register (carry flag etc.), the return registers and flags should be noted per function. "CODE SEGMENT USE32" is MASM's syntax for segments, just to serve as an example of mixing C style into segments; anything to do with actual code, I personally like to keep as straight assembly.

edit: fixed esp comment for cdecl convention