embryo2 wrote: I have a strong feeling of the incompleteness not only of my picture, but also of the "standard" compiler development approach in universities and other computer science organizations. Some impression of a shaky foundation was backed by the way compiler development is taught. I see some theoretical roots in linguistics, I see the detailed parser development approach based on the grammar, and I see a lot of heuristics and experience-based advice for how to build your compiler. But I can't see the well-shaped theory behind it all. There was some profitable adoption from linguistics which ended up in the very useful grammar generalization, but all the rest just looks so pale in contrast with the neat parser foundation.
Your typical undergrad course, and to some extent a typical graduate-level intro course (they were actually the same class at CSU-East Bay, though the grad students got additional exercises), won't come close to even reviewing every topic in one semester or quarter, never mind going into much depth on one of the most heavily studied areas in computer science. The course I took, which was typical and perhaps a little better than most, did the standard coverage of the Church-Turing thesis, the Chomsky hierarchy of languages, and the relation between regular languages and regular expressions (material that was covered, though less thoroughly, in the department's sophomore Programming Languages Concepts course, which was supposed to be a prereq, though you could have fooled me from how some of the other students were acting); had us diagram several increasingly complex FSAs and write increasingly complex regexes, and then convert one to the other and back (also done in the PLC course I took, though apparently at least one of the other professors teaching it didn't cover it); covered how to define a deterministic FSA which can be used in designing a lexical analyzer; slid through a little material on converting an NFSA to a DFSA; spent a week or so on the basics of context-free grammars and normal-form descriptions thereof (again, the basics of this should have been covered in PLC, but apparently the idea of a consistent core curriculum is beyond most universities); managed to give an overview of the three or four most important of the literally dozens of parsing methods, but like most focused primarily on
LL(1) grammars suitable for top-down parsing by recursive descent; spent another week having us define a grammar for basic arithmetic expressions, including encoding the operator precedences, and showed us how to formally determine the ε-productions (that is, which non-terminals are nullable), followed by a week showing us how to eliminate the FIRST-FIRST and FIRST-FOLLOW conflicts we'd created; and then finally three weeks having us finish implementing our direct-emit, recursive-descent parser for said expressions, then extending the parser to handle simple conditionals, indefinite iteration, and calls to built-in routines, with optional support for actual procedure definitions and call-by-value as extra credit.
Grad students read the Dragon Book instead of
the one we covered, had to use their choice of parser generator (usually Bison) instead of writing the parser by hand, and the procedure part was required for them, but otherwise it was the same class.
Even as rushed and truncated as this was, most of the students were lost after week one. I had a leg up, in that I'd spent over a decade reading bits and pieces out of different textbooks on the subject and had gone over the original 8080 Small-C code with a fine-toothed comb (both Southern Connecticut University and CSU-East Bay had the reprint collections for DDJ in their stacks, so I read lots of the early articles from them), but then my ambition got the better of me and I got bogged down writing an excessively complex FSM implementation for the lexer. My compiler, which I still tweak from time to time, can be found
in my GitHub repo.
alexfru wrote: Perhaps the best thing one can do in order to get this is to actually implement the whole thing once or twice. A short course will definitely avoid certain areas or lack certain activities that compiler writers (and even software developers in general) would do in real life. A parser is a relatively small and simple thing (there are some ugly language features that complicate parsing at times), and IMHO spending much time studying parsing of various grammars is probably a large waste of resources, as there are more interesting and pressing problems of practical importance.
I can see how the emphasis on lexical analysis and parsing seems overdone, and I rather agree. Spending tons of time in a compiler course on parsing sounds, on the surface, a lot like spending a lot of time in an OS design course on the bootloader: it's a single part of a bigger picture, and not even an especially complex or central part at that.
However, there are several reasons for this overemphasis on these issues. First, there simply isn't time to really cover the full material in one course.
Second, even the basic material requires a huge pile of theory to actually understand (you can use some parts without it, but you're almost certain to get lost somewhere along the way if you try).
Third, lexical analysis and parsing are (by CS standards) solved problems; courses focus on them because they are known quantities with less ambiguity, whereas intermediate representations are heavy wizardry requiring a lot more information and time than a single course provides, and optimization is flat-out black magic, the sort of bleeding-edge material where the course notes would be obsolete before you even got to them, and which until the late 1990s was mostly taught word-of-mouth from professor to grad student like some sort of esoteric rite. As for automatic semantic analysis, last I checked that was still pretty much a research topic, and one where advances were slow to develop. Those heuristics for optimization and semantic analysis that do exist have filtered down to some of the people actually writing compilers professionally, though rarely do any practical techniques filter back up, as even open-source compiler developers tend to take a very proprietary view of their secrets.
Fourth, compiler courses generally haven't changed much since the mid-1980s, and in many cases are taught by the same professors who taught them then, too. There's a lot of resistance to modifying such a well-established model if there's no real need, and since the course is mostly for laying groundwork rather than vocational training (very few CS students will ever go on to write a compiler again), the need isn't seen by most of those making such decisions. While there is actually a lot of disagreement in the field on how best to teach compiler design (as demonstrated by the number of different textbooks), you would never know it given the uniformity with which courses are laid out in 99% of universities.