~ wrote:
Solar wrote:
Schol-R-LEA explicitly asked you about the token data structures.
I will simply make a file called "elements.dat" that will contain a long at the start with the number of elements, followed by the sequence of elements as they appear in the code, each with its type (start of line, end of line, preprocessor, keyword, identifier, number, operator, blank space block, comment block, string, open parenthesis, close parenthesis, backslash, asterisk...).
OK, I can see that this is indeed a table of the tokens... sort of... but why are you putting it in a file? Unless there is a memory crunch - and generally speaking, even large programs won't eat 64KiB for their symbol tables, and even in real mode I think you can spare a whole data segment for something this important - there is no reason for a C compiler to save it to a file, and every reason for it to keep it as a tree in memory, unless you intend to have the lexer and the parser as separate programs with no sharing of memory.
Such compiler designs have existed in the past; the Microsoft Pascal and Fortran 77 compilers for MS-DOS, circa 1983, come to mind. But they were designed that way to accommodate the CP/M version of the compiler, and the design was retained for the first few MS-DOS versions to allow them to run on 64KiB IBM PCs; even by 1983, those were only a small fraction of PCs, with newer ones shipping with at least 256KiB and many IBM PC/XTs and Compaq Deskpros already hitting the 640KiB limit (in fact, memory-hungry programs such as Lotus 1-2-3 were already running into problems with that limit, and in 1984 both bank-switched Expanded Memory for 8088s and Extended Memory for 80286s were introduced to get around it).
It wasn't really necessary even before that, though. After Turbo Pascal came along in late 1983 - a single-pass, all-in-memory compiler that ran in 64KiB under both 8080 CP/M and 8088 MS-DOS, even including its simple full-screen text editor, and which blew the older multi-pass compilers away in terms of speed and useful error messages - that technique vanished even in the 8-bit world (where there were still plenty of Apple //es, IIcs, and Commodore 64s in use).
Why you think that bringing that approach back is a good idea isn't at all clear to me, unless you are actually talking about the program listing rather than the symbol table, in which case my question becomes: why are you talking about that instead of the symbol table and the in-memory Token struct/class?
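For reference, a token record for an in-memory table (or a tree) doesn't need to be complicated. Here is a minimal sketch in C; every name in it (Token, TokenType, the field names) is mine, purely for illustration:

Code:
#include <stddef.h>

/* Minimal sketch of an in-memory token record; all names are
   illustrative, not taken from anyone's actual code. */
typedef enum {
    TOK_PREPROCESSOR, TOK_KEYWORD, TOK_IDENTIFIER, TOK_NUMBER,
    TOK_STRING, TOK_OPERATOR, TOK_LPAREN, TOK_RPAREN,
    TOK_COMMENT, TOK_EOL, TOK_EOF
} TokenType;

typedef struct Token {
    TokenType     type;
    const char   *start;   /* first character in the source buffer */
    size_t        length;  /* number of characters in the lexeme   */
    int           line;    /* for error messages                   */
    struct Token *next;    /* simple list; the parser builds its
                              tree above these, not instead of them */
} Token;

Note that each token just points back into the source buffer instead of copying text, so the whole table for even a large source file fits comfortably in memory.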
~ wrote:I need a main loop with helper functions capable of recognizing every element with precedence (spaces, comments, preprocessor, strings, numbers, keywords, identifiers...).
In other words... the tokens. And yes, you would definitely need a set of helper functions for this - specifically, a set of functions that combine to form a lexical analyzer.
Also, you would usually handle precedence later, in the parser. More on this below.
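To make "helper functions" concrete, the character classifiers and scanners in a hand-written lexer usually look something like this sketch (the function names are mine; ctype.h does the actual character tests):

Code:
#include <ctype.h>

/* Sketch of typical hand-written lexer helpers. */
static const char *skip_whitespace(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

static int is_ident_start(int c) { return isalpha(c) || c == '_'; }
static int is_ident_char(int c)  { return isalnum(c) || c == '_'; }

/* Scan an identifier or keyword starting at p;
   returns a pointer one past its last character. */
static const char *scan_identifier(const char *p)
{
    if (is_ident_start((unsigned char)*p)) {
        do {
            p++;
        } while (is_ident_char((unsigned char)*p));
    }
    return p;
}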
~ wrote:They have to record the start and end of each element, then a tree of IFs will call a specialized program for each element to fully process it separately from the rest of the compiler/language elements.
In other words, a lexical analyzer - specifically, an ad-hoc lexer. This is indeed one of the things I was talking about, though I get the sense that you don't know all of the English names for these things, which may be part of the problem we are having. I can't tell whether this is due to a communication problem (from your README file on Archefire, I gather that your native tongue is Spanish, and I get the impression that your English isn't particularly strong - though if so, your writing is still better than that of many native English speakers), or because you haven't read up on the existing techniques, or both, and I am willing to give you some benefit of the doubt on this.
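In fact, the "tree of IFs" you describe is just the dispatch at the top of an ad-hoc lexer's main loop. Here is a sketch of that loop, building on the Token struct and helpers above (again, every name is illustrative), recording the start and end of each element as you described:

Code:
/* Sketch of an ad-hoc lexer's main loop: inspect the current
   character, dispatch to a specialized scanner, and record where
   each element starts and ends.  Uses the Token struct and the
   helpers from the earlier sketches. */
Token next_token(const char **cursor)
{
    const char *p = skip_whitespace(*cursor);
    Token tok = {0};
    tok.start = p;

    if (*p == '\0') {
        tok.type = TOK_EOF;
    } else if (*p == '#') {                        /* preprocessor line */
        tok.type = TOK_PREPROCESSOR;
        while (*p && *p != '\n')
            p++;
    } else if (is_ident_start((unsigned char)*p)) {
        tok.type = TOK_IDENTIFIER;                 /* keywords filtered later */
        p = scan_identifier(p);
    } else if (isdigit((unsigned char)*p)) {
        tok.type = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        tok.type = TOK_OPERATOR;                   /* catch-all for the sketch */
        p++;
    }

    tok.length = (size_t)(p - tok.start);          /* start + length gives the
                                                      start/end you mention */
    *cursor = p;
    return tok;
}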
~ wrote:The end of the processing of an element or set of elements results in default assembly, which is then written out to an assembly file to assemble with NASM. You will be able to see the on-file structure array to handle #includes because that's what I need to implement now.
OK, now this is a worrying statement, because it sounds as if you are skipping a few steps. My impression is that you are combining three roles - the lexer, the parser, and the code generator - by doing substring matches on the input stream: the parser calls the matching functions repeatedly against the input, walking through the set of possible matches and working from there until it has collected a complete expression (a full production of the grammar, in parsing terms), at which point you output one or more lines of assembly code.
It is entirely possible to do it this way - it is how the original versions of Small C did it - but you seem to be missing some details as to how you can make that approach work.
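For the sake of discussion, that "match and immediately emit assembly" pattern looks roughly like the sketch below - not Small C's actual code, just the general shape, restricted to left-to-right sums of single-letter variables:

Code:
#include <stdio.h>

/* Sketch of the parse-and-emit-as-you-go style: no token table,
   no tree - NASM lines are printed the moment each piece of the
   expression is recognized.  All names are illustrative. */
static void emit_load(char var) { printf("        mov ax, [%c]\n", var); }
static void emit_add(char var)  { printf("        add ax, [%c]\n", var); }

static const char *parse_expr(const char *p)
{
    emit_load(*p++);          /* first operand goes into AX      */
    while (*p == '+') {
        p++;                  /* consume the '+'                 */
        emit_add(*p++);       /* fold the next operand into AX   */
    }
    return p;                 /* caller continues from here      */
}

Calling parse_expr("a+b+c") prints the mov/add sequence directly; nothing is ever stored, which is precisely what limits how much such a compiler can analyze or optimize.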
The approach in question is called 'recursive descent parsing', a type of top-down, left-to-right, leftmost-derivation (LL) parsing. It is an old, tried, and true method for writing a simple compiler by hand, and is the starting point for almost every compiler course and textbook that doesn't jump directly into using tools like flex and bison. It was developed in the early to mid 1960s, and was probably first investigated by Edsger Dijkstra some time before his 1961 paper on the topic; a number of others experimented with it around that time, and Tony Hoare seems to have been one of the first to write a complete compiler in that fashion, the Elliott ALGOL compiler.
In the early 1970s, Niklaus Wirth popularized it for use in the first formal compiler courses, as a method that was easier to use when writing a parser by hand than the earlier canonical LR parsing method developed in 1965 by Donald Knuth (canonical LR parsers, and bottom-up parsers in general, require large tables to represent the grammar and are an unholy nightmare to develop entirely by hand, but they are much more efficient than recursive-descent parsers and are well suited to generating the parser automatically).
Recursive-descent works pretty well... for small projects done by hand. It is where just about everyone studying compilers starts out, and I can't fault you for going that route... except that I am not sure you really understand it yet, as I get the impression that you still haven't read up on formal compiler design.
This is almost certainly a mistake. Lexical analysis and parsing are, far and away, the best understood topics in the entire field of computer programming, with the possible exception of relational algebra, and they have uses far beyond compilers and interpreters. Notice the dates I quoted - most of them are from over 50 years ago. These are topics that academic computer scientists and working programmers alike understand better than anything else, and the techniques for them are varied, effective, and solid.
If you don't at least try to learn more about the prior art before tackling writing an actual compiler, even a toy one, then you are doing yourself a disservice.
Maybe I am wrong, and you are simply having trouble expressing what you are doing in a foreign language. But you have to understand that we are trying to help you, trying to give you what we consider the best advice possible.
It's dangerous to go alone - take this!
hands ~ a copy of the free PDF version of Compiler Construction by Wirth