Assembler syntax

SoLDMG · Post by **SoLDMG** » Sat Sep 06, 2014 1:34 pm

Hi everyone,

So as some of you know, I'm developing a toolchain (assembler, linker, C compiler for now). So far for the assembler I have the lexer + part of the preprocessor done, but now I'm stuck on the assembler's syntax. I like the Intel syntax best, but I'd like to hear from you guys what you'd like to see in the assembler, not that any of you would actually use it, but still. Maybe a compromise between AT&T and Intel? Or just pick one and supply a translator of sorts for the other?

With regards,
SoLDMG

Roman · Post by **Roman** » Sat Sep 06, 2014 3:40 pm

In my opinion the Intel syntax is better, because it is more intuitive and human-readable.

Octocontrabass · Post by **Octocontrabass** » Sat Sep 06, 2014 4:56 pm

I prefer Intel syntax. All that extra punctuation in AT&T syntax is a pain.

I have to give AT&T some credit, though; they did manage to use the same syntax for both x86 and 68k.

SoLDMG · Post by **SoLDMG** » Sun Sep 07, 2014 5:30 am

What about other instruction sets, like ARM? Instead of using the AT&T/ARM syntax, use the Intel syntax but with ARM/MIPS/m68k instructions?

Brendan · Post by **Brendan** » Sun Sep 07, 2014 6:50 am

Hi,

Let's look at this in a logical/scientific way!

For programming languages, the single most important thing is how easy it is to read (how easy it is to write is far less important). When you are writing a piece of code you frequently refer back to previously written lines (whether you're aware of it or not); then when you're finished you proofread it to see if you messed something up or overlooked something; then it ends up being maintained (e.g. read many times over several years) afterwards.

For assembly; there are actually 4 distinct cases of "reading code":

1. Determining if the intention is sane. This is the reason for the time honoured tradition of "column of comments down the right", where comments are separated from the code (and where the comments document the programmer's intent, which often has nothing to do with the implementation - e.g. "add eax,eax ;Multiply the number of chickens by 2").
2. Comparing the implementation to the intent. In this case you mostly read the instruction then the comment on a "line by line" basis.
3. Looking at the sequence of instructions (e.g. to get a general feel for the implementation, find "slow" instructions, etc). In this case you're scanning vertically from top to bottom only looking at the instruction (without paying much attention to any instruction's operands). For this reason, it's important to have all the instructions aligned nicely so they form an "easily scanned" column. For this case you also want to make sure control flow (e.g. branch targets) stands out; which is why labels are to the far left of a line.
4. Searching the "destination" operand/s. This happens a lot more than people realise - e.g. you'll see an instruction like "mov eax,ebx" and wonder what EBX contains at that point, and search backwards to see the instruction that modified EBX last. For this reason you'd want a "destination operand" column where they're all lined up nicely.
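To make the last case concrete, here is a minimal Python sketch of the "search backwards for the instruction that modified EBX last" step. The helper name `last_writer` and the small `WRITES_FIRST_OPERAND` table are assumptions made for illustration; the sketch assumes Intel operand order and ignores implicit destinations and sub-register aliasing (e.g. a write to BL also changes EBX).

```python
# Hypothetical, incomplete table: mnemonics whose first operand is written.
WRITES_FIRST_OPERAND = {"mov", "add", "sub", "inc", "dec", "pop", "xor", "lea"}

def last_writer(lines, index, register):
    """Search backwards from lines[index] for the last write to `register`.

    Returns the index of that line, or None if no earlier write is found.
    """
    for i in range(index - 1, -1, -1):
        code = lines[i].split(";", 1)[0].strip()   # drop the comment column
        if not code:
            continue
        parts = code.split(None, 1)
        mnemonic = parts[0].lower()
        operands = parts[1] if len(parts) > 1 else ""
        dest = operands.split(",")[0].strip().lower()
        if mnemonic in WRITES_FIRST_OPERAND and dest == register.lower():
            return i
    return None

listing = [
    "mov ebx, [count]   ;ebx = number of chickens",
    "add ebx, ebx       ;Multiply the number of chickens by 2",
    "push ebx",
    "mov eax, ebx",
]

# Which earlier line last changed EBX before "mov eax, ebx"?
print(last_writer(listing, 3, "ebx"))
```

A human reader does the same scan by eye, which is why a lined-up destination column helps.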

Now; if you think about this, neither Intel syntax nor AT&T syntax is ideal - they're both "not optimal" for the last case (searching the destination operand/s). A "more ideal in theory" syntax might look like this:

Code: Select all

        dx  mov  0
        ax  mov  [nextLBA]                ;dx:ax = next LBA address
        bx  mov  [sectorsPerTrack]        ;bx = sectors per track
    ax, dx  div  ax, dx, bx               ;ax = head number * cylinder number, dx = sector number - 1
            push  dx
        dx  mov  0                        ;dx:ax = head number * cylinder number
    ax, dx  div  ax, dx, di               ;ax = cylinder number, dx = head number
        dh  mov  dl                       ;dh = head number
        cx  pop                           ;cx = sector number - 1
        bx  sub  cx                       ;bx = number of sectors that could be loaded before end of track
        ch  mov  al                       ;ch = cylinder number
        cl  inc                           ;cl = sector number

This makes it easier to scan the destination operand/s vertically, without making any of the other cases harder.
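As a rough illustration of how a tool (rather than the programmer) could produce that layout, the following Python sketch re-lays simple Intel-syntax lines into a right-justified destination column. The function name `to_dest_first` and the `DEST_FIRST` table are hypothetical, and instructions with implicit destinations (such as the `div` lines above) are not handled.

```python
# Hypothetical, incomplete table of mnemonics whose first operand is written.
DEST_FIRST = {"mov", "add", "sub", "inc", "dec", "pop"}

def to_dest_first(line, dest_width=10):
    """Turn 'mov dx, 0' into a 'destination  mnemonic  sources' line."""
    parts = line.split(None, 1)
    mnemonic = parts[0]
    operands = [op.strip() for op in parts[1].split(",")] if len(parts) > 1 else []
    dest = ""
    if mnemonic in DEST_FIRST and operands:
        dest = operands[0]
        # add/sub/inc also read the destination; like the listing above,
        # this layout leaves that implicit and shows only the other operands.
        operands = operands[1:]
    # Right-justify the destination so the column can be scanned vertically.
    return f"{dest:>{dest_width}}  {mnemonic}  {', '.join(operands)}".rstrip()

for intel_line in ["mov dx, 0", "push dx", "sub bx, cx", "inc cl"]:
    print(to_dest_first(intel_line))
```

Instructions without a destination (like `push`) get an empty destination column, so the mnemonic column stays aligned.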

There are 2 relatively obvious problems with this. The first is that nobody likes syntax they aren't used to and everyone's first reaction will be "it's ugly". This is only a temporary phenomenon - it just takes people a while to adjust (and is no worse than someone switching from Intel to AT&T syntax, or from AT&T syntax to Intel syntax).

The other problem is that "plain text" sucks. I remember a presentation (which was actually C++ syntax, and may have been about a code sanitiser that Google built out of parts of the LLVM project) where they investigated where programmers' time is spent and found that most programmers spend about 20% of their time just diddling with white-space. If people are limited to "plain text" editors and need to right-justify the destination operand column by hand, then it would be a significant additional chore.

All of the above went through my head when I was trying to decide what my own assembler's syntax would look like. Of course my project is different in that "source code" is in a tokenised binary format and not plain text (where the IDE formats it, and programmers won't be wasting time diddling with white-space at all); so the main disadvantage of the syntax I described above doesn't exist.

Despite this, I chose Intel syntax in the end. I'm still not quite sure if this was a good idea, or if my decision was based on what I'm used to rather than what is better, or if my decision was caused by simple cowardice.

Cheers,

Brendan

Owen · Post by **Owen** » Sun Sep 07, 2014 7:15 am

SoLDMG wrote:What about other instruction sets, like ARM? Instead of using the AT&T/ARM syntax, use the Intel syntax but with ARM/MIPS/m68k instructions?

Most other architectures have far fewer syntax variations. M68K and PowerPC have two distinct syntaxes (Apple vs AT&T and Apple vs IBM respectively), but otherwise syntax variations are quite rare. There is one MIPS assembler syntax.

ARM (AArch32) has two syntaxes: traditional and unified, and today everyone uses unified. Pretty much every assembler supports them both or just the unified syntax; the GNU assembler defaults to old (for legacy compatibility), but can be switched using the ".syntax unified" directive. ARM AArch64 has one syntax, which everyone implements.

One important thing is to distinguish the syntax from the directives. One advantage of using the GNU Assembler (and therefore most likely the AT&T syntax on x86 - while you can switch it to Intel-style, it's the road less travelled and has some issues) is that the directive syntax is mostly the same across targets (some historical reasons mean that there are some oddities - AT&T seemingly arbitrarily decided whether align took a number of bytes or a power of two across various targets, for example). The other thing is that GAS is generally more capable - it has directives for generating exception unwinding info and DWARF debug info, for example.
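The align oddity is easy to see with a little arithmetic. The Python sketch below (function names are made up for illustration) computes the padding for the same operand under both readings; in GAS the `.balign` and `.p2align` directives spell out which reading is meant.

```python
def padding_for(offset, alignment):
    """Bytes of padding needed to reach the next multiple of `alignment`."""
    return (-offset) % alignment

def align_as_bytes(offset, n):
    # "align n" read as a byte count (the .balign reading)
    return padding_for(offset, n)

def align_as_power_of_two(offset, n):
    # "align n" read as 2**n bytes (the .p2align reading)
    return padding_for(offset, 1 << n)

offset = 0x13
print(align_as_bytes(offset, 4))         # pads up to a multiple of 4
print(align_as_power_of_two(offset, 4))  # pads up to a multiple of 16
```

The same source line therefore produces different layouts depending on which convention the target uses.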

One other bit of reasoning - if you're using GCC or Clang, you already have an AT&T syntax capable assembler. Also, the rest of the toolchain defaults to AT&T syntax.

It's a bit weird, but you get used to it very quickly.

Antti · Post by **Antti** » Sun Sep 07, 2014 11:40 am

I would like to have an easy way to choose a longer version of an instruction. For example, instead of having "mov ax, [si+0x10]", I may want to have "mov ax, [si+0x0010]".
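For reference, the two encodings differ only in the ModRM "mod" field and the displacement size. A minimal Python sketch follows, assuming 16-bit code and opcode 8B /r (MOV r16, r/m16); the `force_disp16` parameter is a hypothetical way of expressing the "longer version" request, not a feature of any existing assembler.

```python
# ModRM register numbers for 16-bit registers (standard x86 encoding).
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}
# r/m field values for the single-register 16-bit addressing forms.
RM16 = {"si": 0b100, "di": 0b101, "bp": 0b110, "bx": 0b111}

def encode_mov_r16_mem(reg, base, disp, force_disp16=False):
    """Encode `mov reg, [base+disp]` for 16-bit code.

    Always emits a displacement (the mod=00 "no displacement" form is not
    handled in this sketch).
    """
    use_disp8 = (-128 <= disp <= 127) and not force_disp16
    mod = 0b01 if use_disp8 else 0b10          # 01 = disp8, 10 = disp16
    modrm = (mod << 6) | (REG16[reg] << 3) | RM16[base]
    if use_disp8:
        disp_bytes = (disp & 0xFF).to_bytes(1, "little")
    else:
        disp_bytes = (disp & 0xFFFF).to_bytes(2, "little")
    return bytes([0x8B, modrm]) + disp_bytes

print(encode_mov_r16_mem("ax", "si", 0x10).hex())                     # short form
print(encode_mov_r16_mem("ax", "si", 0x10, force_disp16=True).hex())  # long form
```

The long form is one byte bigger and behaves identically, which is what makes it usable as padding.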

I have spent a huge amount of time arranging machine instructions really nicely. I start with a sketch unit that is usually less than 64 bytes and is written in "normal" assembly. After the functionality of that unit is ready, I start the "machine code optimization phase". There are several rules I follow: for example, instructions do not cross a 16-byte boundary and jump targets are as "aligned" as reasonably possible. The size of a unit is always 16-byte aligned. I try to write optimal code but at times I do the exact opposite. For example, I may use longer instructions to meet the requirements. I do not use NOPs. Those units are like pieces of a puzzle and I try to make them as unit testable as possible (but that is something I have to improve a lot). I put units together to create bigger units and this step is, of course, more high-level than adjusting single instructions.
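The "no instruction crosses a 16-byte boundary" rule can be checked mechanically once the instruction lengths are known. Here is a small Python sketch; the function name and the example lengths are invented for illustration.

```python
def crossing_instructions(lengths, start=0, boundary=16):
    """Return the indexes of instructions whose bytes straddle a boundary."""
    crossing = []
    offset = start
    for index, length in enumerate(lengths):
        first, last = offset, offset + length - 1
        # An instruction crosses if its first and last byte fall in
        # different boundary-sized blocks.
        if first // boundary != last // boundary:
            crossing.append(index)
        offset += length
    return crossing

# Instruction lengths in bytes for a made-up unit.
print(crossing_instructions([3, 4, 5, 3, 2, 3]))
# Same unit with the fourth instruction encoded one byte longer.
print(crossing_instructions([3, 4, 5, 4, 2, 3]))
```

In the second call, lengthening one instruction pushes the next one to start exactly on the boundary, which is the kind of adjustment described above.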

sandras · Post by **sandras** » Sun Sep 07, 2014 1:39 pm

This is similar to what I'm thinking of doing. I want to develop a set of machine code macros. These macros would be primitives for a Forth-like language. In fact the Forth inventor himself has done something like this. Macros would mostly be operandless - the implied arguments would be found on the stack. As for the macro that puts stuff on the stack - the programmer sees no "push 123", he only needs to write "123" and it will be pushed to the stack; in other words, the compiler can figure out what is a number and what is a name of a primitive. Underneath however, there would have to be a machine code chunk, with a place to insert an operand into. What I'm getting at is that there would not need to be any kind of syntax, which is one of the defining characteristics of Forth. Just words separated by spaces.
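A minimal Python sketch of that idea, assuming 32-bit x86 and a made-up two-entry primitive table: each word is either looked up as a machine code chunk or treated as a number and dropped into the operand slot of a "push imm32" chunk. The names `PRIMITIVES` and `compile_words` are illustrative only.

```python
import struct

# Hypothetical primitive table: word -> 32-bit x86 machine code chunk.
PRIMITIVES = {
    "+":   bytes.fromhex("58010424"),  # pop eax ; add [esp], eax
    "dup": bytes.fromhex("ff3424"),    # push dword [esp]
}
PUSH_IMM32 = b"\x68"                   # push imm32; the operand slot follows

def compile_words(source):
    """Compile space-separated words into one machine code string."""
    out = bytearray()
    for word in source.split():
        if word in PRIMITIVES:
            out += PRIMITIVES[word]
        else:
            # Anything that is not a primitive must be a number;
            # int() raises ValueError for unknown words.
            out += PUSH_IMM32 + struct.pack("<i", int(word, 0))
    return bytes(out)

print(compile_words("123 dup +").hex())
```

There is no grammar beyond splitting on whitespace, which matches the "just words separated by spaces" point.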

As to answer the original question, Intel syntax seems clearly more suited for a human being, while AT&T seems more suited for the computer. But as we all know, human time is much more valuable than computer time, so I would have to go with Intel.

Octocontrabass · Post by **Octocontrabass** » Sun Sep 07, 2014 3:35 pm

Antti wrote:instructions do not cross 16-byte boundary

I will admit I haven't read all the way through these optimization manuals, but I don't think that improves performance on any x86 CPU.

SpyderTL · Post by **SpyderTL** » Sun Sep 07, 2014 8:40 pm

This thread gives me some ideas for my ideal ASM format, although it may be a bit off topic for this particular thread. If anyone is interested in discussing it, I'll start a new thread.

Expanding on the ideas above, I'd like to throw in a few ideas of my own:

1. One thing that ASM is missing is the concept of code blocks. The obvious example is "functions", but an even simpler example would be, say, a comment stating what the next few lines will be doing, and then the instructions to perform that particular task. The initial "comment" could even be "promoted" to a higher-level "pseudo-code" line that could be used by the IDE to summarize the code block so that you could quickly scan to the block that you wanted to view the details on.

2. Since the ASM format essentially forces one instruction per line, it would be nice if the IDE could give you some visual clues about what is on each line. Putting an icon in the left margin of each line, or color coding the line, or both based on whether the line was an instruction that a) modified a register, b) modified a memory address, c) wrote to an I/O port, d) read from an I/O port, e) tested a value and set flags, f) performed an operation based on particular flags being set, etc. This would make it easy to scan for all of the places that, say, a value was written to the IDE controller in a large block of code.

3. Visual Studio has an option to highlight all references to a variable, class or function. ASM could benefit from something similar, where all references to a label, a register, or an instruction would be highlighted by simply clicking one of these elements.

4. Tooltips showing all of the details of an instruction including a description, the registers and flags that are modified, and the exceptions that could be triggered by that instruction.
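As a rough sketch of how the per-line categories in item 2 might be computed, the Python below classifies simple Intel-syntax lines. The category names and mnemonic tables are assumptions for illustration and cover only a handful of instructions.

```python
# Hypothetical, incomplete mnemonic tables.
WRITES_FIRST_OPERAND = {"mov", "add", "sub", "inc", "dec", "pop"}
FLAG_TESTS = {"cmp", "test"}
FLAG_USERS = {"je", "jne", "jz", "jnz", "jc", "jnc"}

def classify(line):
    """Return a coarse category an editor could map to an icon or colour."""
    parts = line.split(None, 1)
    mnemonic = parts[0].lower()
    operands = [op.strip() for op in parts[1].split(",")] if len(parts) > 1 else []
    if mnemonic == "out":
        return "io-write"
    if mnemonic == "in":
        return "io-read"
    if mnemonic in FLAG_TESTS:
        return "sets-flags"
    if mnemonic in FLAG_USERS:
        return "uses-flags"
    if mnemonic in WRITES_FIRST_OPERAND and operands:
        # A bracketed first operand means the destination is memory.
        return "writes-memory" if "[" in operands[0] else "writes-register"
    return "other"

for asm_line in ["out dx, al", "mov [di], al", "mov ax, [si]", "jne retry"]:
    print(f"{classify(asm_line):16} {asm_line}")
```

Scanning for every "io-write" line would then find all the places a value is sent to a device.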

I realize after looking over these items that these are mostly IDE features, but I still think that they are worth considering.

As for the particular ASM flavor, you might want to consider allowing the user to choose which one they want, and automatically converting between them when the user switches from one format to another.
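For the simplest cases the conversion is mechanical. The Python sketch below handles only 32-bit register and immediate operands (memory operands are deliberately rejected), and the function name `intel_to_att` is made up for illustration.

```python
REG32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

def intel_to_att(line):
    """Translate a register/immediate-only 32-bit Intel line to AT&T syntax."""
    parts = line.split(None, 1)
    mnemonic = parts[0].lower()
    if len(parts) == 1:
        return mnemonic                      # no operands: left unchanged here
    converted = []
    for op in (o.strip().lower() for o in parts[1].split(",")):
        if op in REG32:
            converted.append("%" + op)       # registers get a % prefix
        else:
            int(op, 0)                       # ValueError if not an immediate
            converted.append("$" + op)       # immediates get a $ prefix
    converted.reverse()                      # AT&T puts the destination last
    return mnemonic + "l " + ", ".join(converted)  # 'l' = 32-bit operand size

for intel_line in ["mov eax, ebx", "add eax, 4", "push eax"]:
    print(intel_to_att(intel_line))
```

Memory operands, other operand sizes, and instructions whose mnemonics differ between the two syntaxes would need real parsing rather than string shuffling.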

Just some ideas. Maybe it'll help.

If all of this already existed in a free IDE, I wouldn't have had to write my own XML based solution.

sandras · Post by **sandras** » Sun Sep 07, 2014 9:52 pm

SpyderTL wrote:3. Visual Studio has an option to highlight all references to a variable, class or function. ASM could benefit from something similar, where all references to a label, a register, or an instruction would be highlighted by simply clicking one of these elements.

That seems like an overcomplication to me. Ctrl+F and just enter the name of the label, register, or an instruction. That works across many applications.

SpyderTL · Post by **SpyderTL** » Mon Sep 08, 2014 1:22 am

Ctrl-F doesn't understand your code. But the IDE does. A simple text search is going to give you a lot of "false positives".

Plus, clicking on a word is a lot simpler than pressing Ctrl-F and typing in a word, and hitting Enter over and over and over and over.
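The false-positive problem is easy to demonstrate. In the Python sketch below (the sample source and function names are invented), a plain text search for "ax" also hits "eax", the label "max_count" and a comment, while a search that understands tokens and comments does not.

```python
import re

source = """\
max_count: dw 0
        mov eax, [max_count]
        mov ax, bx      ; copy bx into ax
        add ax, 1
"""

def text_search(text, needle):
    """What Ctrl+F does: every position where the characters appear."""
    return [m.start() for m in re.finditer(re.escape(needle), text)]

def register_references(text, register):
    """Line numbers where `register` appears as a whole token in code."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        code = line.split(";", 1)[0]                       # ignore comments
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
        if register in tokens:
            hits.append(number)
    return hits

print(len(text_search(source, "ax")))     # raw character matches
print(register_references(source, "ax"))  # lines that really use AX
```

Whether a write to EAX should also count as touching AX is a separate question that only a tool aware of the register model can answer.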

SoLDMG · Post by **SoLDMG** » Wed Sep 10, 2014 1:23 pm

*crowdsurfs you around*

It'd be a little like Ken Thompson's car.

Code: Select all

Ken Thompson has an automobile which he helped design. Unlike most automobiles, it has neither speedometer, nor gas gauge, nor any of the numerous idiot lights which plague the modern driver. Rather, if the driver makes any mistake, a giant "?" lights up in the center of the dashboard. "The experienced driver", he says, "will usually know what's wrong."

Just in case anyone doesn't know what I'm talking about.

Jezze · Post by **Jezze** » Thu Sep 11, 2014 12:40 am

I totally agree with you there, SoLDMG. C is so close to the perfect language for me that there isn't much I would change in the language itself, besides adding more syntactic restrictions and something better for defining data structures than structs and/or bitfields combined with enums or defines for register definitions. Where I think the problem lies today is in the tools, which are just too big and too bloated with options.

Bencz · Post by **Bencz** » Thu Sep 11, 2014 7:20 am

SoLDMG wrote:Hi everyone,

So as some of you know, I'm developing a toolchain (assembler, linker, C compiler for now). So far for the assembler I have the lexer + part of the preprocessor done, but now I'm stuck on the assembler's syntax. I like the Intel syntax best, but I'd like to hear from you guys what you'd like to see in the assembler, not that any of you would actually use it, but still. Maybe a compromise between AT&T and Intel? Or just pick one and supply a translator of sorts for the other?

With regards,
SoLDMG

Hi!

Why not generate a ".obj" file, using the OMF object format?
http://en.wikipedia.org/wiki/Relocatabl ... ule_Format

In the code generator of your C compiler, you can make a struct that holds both the machine code and the assembly text. That struct can carry both syntaxes, AT&T and Intel, and the user chooses whether to generate machine code or assembly text.

Code: Select all

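/* One index per supported instruction form. */
enum
{
    push_eax = 0, /* .... */
};

/* Machine code (as hex text) plus the Intel and AT&T spellings. */
typedef struct
{
    const char *machine_code;
    const char *intel;
    const char *att;
} instructions;

instructions instru[] =
{
    { "50", "push eax", "pushl %eax" }, /* .... */
};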

OSDev.org
