Proper parsing of assembly

neon · Post by **neon** » Sun Jul 17, 2011 9:31 pm

Hello everyone,

I am in need for parsing an assembly language file but am finding it difficult to find a proper solution. I was first going with a recursive decent following the EBNF of MASM however I am unsure if this would be proper and would result in a parser that is quite complex. Another solution would be just to parse it in a ad-hock way; not following any former parsing algorithms. This can work; unsure on what negative effect this might have on scalability though.

I have been on this particular problem for over a month looking into different methods and am now looking for community feedback, suggestions, or recommended approaches for what you would consider the proper method of parsing would be.

Thanks for any feedback.

turdus · Post by **turdus** » Tue Jul 19, 2011 9:33 am

First, you should tokenize the ascii source, things become easier if you do. Parsing assembly is quite simple, uses dummy sentences:
(name) (dblpunct) - a label
(mnemomic) (arg) - mnemonic with 1 op
"begin" (name) - start of a macro
...etc.
Once you have collected all the variations, you're done. I suggest to implement syntax parser (the one that get tokens as input) in a way that it just prints out what it gets. You can began to match sentences, and if found, do not print the tokens. This way you'll always see what's left out.
Oh, and also don't forget to pre-tokenize (I mean if your most complicated sentece contains let's say 5 tokens, then do a 5 tokens read-ahead before passing it to syntax analyzer).

neon · Post by **neon** » Tue Jul 19, 2011 10:51 am

Hello,

I appreciate your feedback. We already have implemented a lexer that recognizes characters, numerical values, symbols, and terminals. The issue in question is the proper parsing of the tokens.

What concerns me is the possibility of a multitude of variations of a syntax meaning the same thing but all valid. The parser would need to watch for all possible variations; Im unsure if this can be done in a clean and efficient way if the parser if written in a -- what appears to be suggested in your post -- an ad-hoc way. It'll work though if its the only good solution.

Any more suggestions are appreciated.

Owen · Post by **Owen** » Tue Jul 19, 2011 2:54 pm

I have two suggestions:

Boost::Spirit, recommended, is awesome, except for the compiler times (Keep the parser core in its own source file!)
Hand code it. A custom recursive descent parser isn't that difficult

Other options will probably end in pain. In particular, my time with Lex+Yacc has convinced me that never again. (However, RE2C makes a good lexer. So does Boost::Spirit)

bluemoon · Post by **bluemoon** » Tue Jul 19, 2011 5:07 pm

neon wrote:What concerns me is the possibility of a multitude of variations of a syntax meaning the same thing but all valid. The parser would need to watch for all possible variations

Wait, that would be more than a parser IMO. A parse don't care variations, it turns texts into programming friendly data structures; it's the assembler who handle the business logic - and you simple handle all possible cases there (including defaults/omitted token).

eddyb · Post by **eddyb** » Wed Jul 20, 2011 1:28 pm

Are you kidding me? Assembly? In JavaScript, that's like a few lines with a loop and some basic regexp. It would be harder in C++, but under 50 lines (I think/hope).
</shameless_plug>
PS: Thing I learnt about parsers: just f*cking do it. It's fun and you'll learn a thing or two. Right now I have my own weird parser that can handle almost decently some C++ code (in 40 lines).

Love4Boobies · Post by **Love4Boobies** » Wed Jul 20, 2011 1:48 pm

A MASM parser would be slightly bigger than just 50 lines---JWasm's is around 3000 lines long IIRC (hand-written, too). I don't know exactly how big the C++ subset you're handling is, but it can't be much in 40 lines; C++ is one of the most complicated grammars around, there's a reason for which so few parsers for it exist.

neon · Post by **neon** » Wed Jul 20, 2011 3:38 pm

Hello,

I appreciate the suggestions so far.

eddyb wrote:PS: Thing I learnt about parsers: just f*cking do it.

The purpose of the discussion is the proper parsing of a language not "how to write a basic parser".

3000 lines for a parser for MASM syntax sounds about right considering the grammer. This was my concern about going recursive decent - having to debug several thousand lines of highly recursive code.

Brendan · Post by **Brendan** » Thu Jul 21, 2011 3:29 am

Hi,

Start with about 5 routines:

One to discard white space from the start of a string
One to remove an identifier from the start of a string (where the identifier could be a label, a mnemonic or a directive at this stage), and return it
One to remove a numerical constant from the start of a string and return it. This would handle all bases, and could treat both integers and floating point as a "big number" type (until later)
One to remove a string literal from the start of a string and return it.
One to remove a special symbol from the start of a string and return it. This is characters in the "[]()-+*/%&|^!~<>=" set, but you need to take the longest possible substring - e.g. "<<" is one symbol, and not two '<' symbols.

Start from the start of the string (a source line) and call all these routines in any order you like until the entire string is gone or until the remainder of the string begins with the comment character (';' or '#' depending on dialect); and while you're doing that generate a set of tokens representing the full line.

Once that's done, try to extract a label and an instruction from the tokens, and group any remaining tokens into "operands". Maybe something like:

Code: Select all

 struct sourceLine {
    int lineNumber;      // Only needed for displaying errors
    char *fileName;      // Only needed for displaying errors
    char *label;
    int directive_or_mnemonic_number;
    int operands;                // Number of operands
    TOKEN **operandList;         // Array of tokenised expressions representing operands
}

You'd probably also check if the number of operands is valid for the directive or mnemonic.

Now, for each tokenised expression in the operand list, try to simplify it. First try to replace any identifiers with numbers (e.g. if you know that "foo" equals 123 but don't know what "bar" is yet, then the expression "foo * 4 + (8 + bar) + eax" would become "123 * 4 + (8 + bar) + eax"). This is mostly for later on because initially you won't know the value/s of any labels. Then try to simplify the expression (e.g. "123 * 4 + (8 + bar) + eax" becomes "500 + bar + eax").

That's about all you need to do for each line.

Once you've done this for every line (and you've got maybe a linked list of those structures in memory) you'd start assembling; which is basically just trying to convert those structures into a series of bytes. You'd also want to add an "address at start of line" field to the structure and set that field if the address is known; and as you assemble more you can finish simplifying operands/expressions because the values of labels become known. That's the idea of "multi-pass" assembly anyway.

Cheers,

Brendan

neon · Post by **neon** » Fri Jul 22, 2011 10:48 am

Hello,

I appreciate your responses. The original parser [this is a rewrite] was somewhat like what Brendan posted above. Although we separated the lexer from the parser. Complications in design arose with the parsing of operands; the thought to separate them each as a separate expression (token array) is an excellent idea.

Thanks again for the suggestions.

Love4Boobies · Post by **Love4Boobies** » Sat Jul 23, 2011 7:02 am

Parsing assembly instructions is obviously very easy. A MASM grammar needs to properly handle many other things, however. For example:

Macros
Typedef's
High-level control flow constructs (remember, MASM is a high-level assembler)
Segment definitions, which have optional parameters that can appear in any order
Nested segments, subroutines, etc.
etc.

I wrote an ASM386 (Intel's original assembler; it's very similar to MASM) clone using TBNF.

OSDev.org

Proper parsing of assembly

Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly

Re: Proper parsing of assembly