Proper parsing of assembly
Proper parsing of assembly
Hello everyone,
I am in need for parsing an assembly language file but am finding it difficult to find a proper solution. I was first going with a recursive decent following the EBNF of MASM however I am unsure if this would be proper and would result in a parser that is quite complex. Another solution would be just to parse it in a ad-hock way; not following any former parsing algorithms. This can work; unsure on what negative effect this might have on scalability though.
I have been on this particular problem for over a month looking into different methods and am now looking for community feedback, suggestions, or recommended approaches for what you would consider the proper method of parsing would be.
Thanks for any feedback.
I am in need for parsing an assembly language file but am finding it difficult to find a proper solution. I was first going with a recursive decent following the EBNF of MASM however I am unsure if this would be proper and would result in a parser that is quite complex. Another solution would be just to parse it in a ad-hock way; not following any former parsing algorithms. This can work; unsure on what negative effect this might have on scalability though.
I have been on this particular problem for over a month looking into different methods and am now looking for community feedback, suggestions, or recommended approaches for what you would consider the proper method of parsing would be.
Thanks for any feedback.
OS Development Series | Wiki | os | ncc
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
Re: Proper parsing of assembly
First, you should tokenize the ascii source, things become easier if you do. Parsing assembly is quite simple, uses dummy sentences:
(name) (dblpunct) - a label
(mnemomic) (arg) - mnemonic with 1 op
"begin" (name) - start of a macro
...etc.
Once you have collected all the variations, you're done. I suggest to implement syntax parser (the one that get tokens as input) in a way that it just prints out what it gets. You can began to match sentences, and if found, do not print the tokens. This way you'll always see what's left out.
Oh, and also don't forget to pre-tokenize (I mean if your most complicated sentece contains let's say 5 tokens, then do a 5 tokens read-ahead before passing it to syntax analyzer).
(name) (dblpunct) - a label
(mnemomic) (arg) - mnemonic with 1 op
"begin" (name) - start of a macro
...etc.
Once you have collected all the variations, you're done. I suggest to implement syntax parser (the one that get tokens as input) in a way that it just prints out what it gets. You can began to match sentences, and if found, do not print the tokens. This way you'll always see what's left out.
Oh, and also don't forget to pre-tokenize (I mean if your most complicated sentece contains let's say 5 tokens, then do a 5 tokens read-ahead before passing it to syntax analyzer).
Re: Proper parsing of assembly
Hello,
I appreciate your feedback. We already have implemented a lexer that recognizes characters, numerical values, symbols, and terminals. The issue in question is the proper parsing of the tokens.
What concerns me is the possibility of a multitude of variations of a syntax meaning the same thing but all valid. The parser would need to watch for all possible variations; Im unsure if this can be done in a clean and efficient way if the parser if written in a -- what appears to be suggested in your post -- an ad-hoc way. It'll work though if its the only good solution.
Any more suggestions are appreciated.
I appreciate your feedback. We already have implemented a lexer that recognizes characters, numerical values, symbols, and terminals. The issue in question is the proper parsing of the tokens.
What concerns me is the possibility of a multitude of variations of a syntax meaning the same thing but all valid. The parser would need to watch for all possible variations; Im unsure if this can be done in a clean and efficient way if the parser if written in a -- what appears to be suggested in your post -- an ad-hoc way. It'll work though if its the only good solution.
Any more suggestions are appreciated.
OS Development Series | Wiki | os | ncc
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
- Contact:
Re: Proper parsing of assembly
I have two suggestions:
- Boost::Spirit, recommended, is awesome, except for the compiler times (Keep the parser core in its own source file!)
- Hand code it. A custom recursive descent parser isn't that difficult
Re: Proper parsing of assembly
Wait, that would be more than a parser IMO. A parse don't care variations, it turns texts into programming friendly data structures; it's the assembler who handle the business logic - and you simple handle all possible cases there (including defaults/omitted token).neon wrote:What concerns me is the possibility of a multitude of variations of a syntax meaning the same thing but all valid. The parser would need to watch for all possible variations
Re: Proper parsing of assembly
Are you kidding me? Assembly? In JavaScript, that's like a few lines with a loop and some basic regexp. It would be harder in C++, but under 50 lines (I think/hope).
</shameless_plug>
PS: Thing I learnt about parsers: just f*cking do it. It's fun and you'll learn a thing or two. Right now I have my own weird parser that can handle almost decently some C++ code (in 40 lines).
</shameless_plug>
PS: Thing I learnt about parsers: just f*cking do it. It's fun and you'll learn a thing or two. Right now I have my own weird parser that can handle almost decently some C++ code (in 40 lines).
- Love4Boobies
- Member
- Posts: 2111
- Joined: Fri Mar 07, 2008 5:36 pm
- Location: Bucharest, Romania
Re: Proper parsing of assembly
A MASM parser would be slightly bigger than just 50 lines---JWasm's is around 3000 lines long IIRC (hand-written, too). I don't know exactly how big the C++ subset you're handling is, but it can't be much in 40 lines; C++ is one of the most complicated grammars around, there's a reason for which so few parsers for it exist.
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
[ Project UDI ]
Re: Proper parsing of assembly
Hello,
I appreciate the suggestions so far.
3000 lines for a parser for MASM syntax sounds about right considering the grammer. This was my concern about going recursive decent - having to debug several thousand lines of highly recursive code.
I appreciate the suggestions so far.
The purpose of the discussion is the proper parsing of a language not "how to write a basic parser".eddyb wrote:PS: Thing I learnt about parsers: just f*cking do it.
3000 lines for a parser for MASM syntax sounds about right considering the grammer. This was my concern about going recursive decent - having to debug several thousand lines of highly recursive code.
OS Development Series | Wiki | os | ncc
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
Re: Proper parsing of assembly
Hi,
Start with about 5 routines:
Once that's done, try to extract a label and an instruction from the tokens, and group any remaining tokens into "operands". Maybe something like:
You'd probably also check if the number of operands is valid for the directive or mnemonic.
Now, for each tokenised expression in the operand list, try to simplify it. First try to replace any identifiers with numbers (e.g. if you know that "foo" equals 123 but don't know what "bar" is yet, then the expression "foo * 4 + (8 + bar) + eax" would become "123 * 4 + (8 + bar) + eax"). This is mostly for later on because initially you won't know the value/s of any labels. Then try to simplify the expression (e.g. "123 * 4 + (8 + bar) + eax" becomes "500 + bar + eax").
That's about all you need to do for each line.
Once you've done this for every line (and you've got maybe a linked list of those structures in memory) you'd start assembling; which is basically just trying to convert those structures into a series of bytes. You'd also want to add an "address at start of line" field to the structure and set that field if the address is known; and as you assemble more you can finish simplifying operands/expressions because the values of labels become known. That's the idea of "multi-pass" assembly anyway.
Cheers,
Brendan
Start with about 5 routines:
- One to discard white space from the start of a string
- One to remove an identifier from the start of a string (where the identifier could be a label, a mnemonic or a directive at this stage), and return it
- One to remove a numerical constant from the start of a string and return it. This would handle all bases, and could treat both integers and floating point as a "big number" type (until later)
- One to remove a string literal from the start of a string and return it.
- One to remove a special symbol from the start of a string and return it. This is characters in the "[]()-+*/%&|^!~<>=" set, but you need to take the longest possible substring - e.g. "<<" is one symbol, and not two '<' symbols.
Once that's done, try to extract a label and an instruction from the tokens, and group any remaining tokens into "operands". Maybe something like:
Code: Select all
struct sourceLine {
int lineNumber; // Only needed for displaying errors
char *fileName; // Only needed for displaying errors
char *label;
int directive_or_mnemonic_number;
int operands; // Number of operands
TOKEN **operandList; // Array of tokenised expressions representing operands
}
Now, for each tokenised expression in the operand list, try to simplify it. First try to replace any identifiers with numbers (e.g. if you know that "foo" equals 123 but don't know what "bar" is yet, then the expression "foo * 4 + (8 + bar) + eax" would become "123 * 4 + (8 + bar) + eax"). This is mostly for later on because initially you won't know the value/s of any labels. Then try to simplify the expression (e.g. "123 * 4 + (8 + bar) + eax" becomes "500 + bar + eax").
That's about all you need to do for each line.
Once you've done this for every line (and you've got maybe a linked list of those structures in memory) you'd start assembling; which is basically just trying to convert those structures into a series of bytes. You'd also want to add an "address at start of line" field to the structure and set that field if the address is known; and as you assemble more you can finish simplifying operands/expressions because the values of labels become known. That's the idea of "multi-pass" assembly anyway.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Proper parsing of assembly
Hello,
I appreciate your responses. The original parser [this is a rewrite] was somewhat like what Brendan posted above. Although we separated the lexer from the parser. Complications in design arose with the parsing of operands; the thought to separate them each as a separate expression (token array) is an excellent idea.
Thanks again for the suggestions.
I appreciate your responses. The original parser [this is a rewrite] was somewhat like what Brendan posted above. Although we separated the lexer from the parser. Complications in design arose with the parsing of operands; the thought to separate them each as a separate expression (token array) is an excellent idea.
Thanks again for the suggestions.
OS Development Series | Wiki | os | ncc
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}
- Love4Boobies
- Member
- Posts: 2111
- Joined: Fri Mar 07, 2008 5:36 pm
- Location: Bucharest, Romania
Re: Proper parsing of assembly
Parsing assembly instructions is obviously very easy. A MASM grammar needs to properly handle many other things, however. For example:
- Macros
- Typedef's
- High-level control flow constructs (remember, MASM is a high-level assembler)
- Segment definitions, which have optional parameters that can appear in any order
- Nested segments, subroutines, etc.
- etc.
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
[ Project UDI ]