How to make a compiler

Schol-R-LEA · Post by **Schol-R-LEA** » Thu Mar 28, 2019 5:31 pm

Actually, in terms of the time spent, the dominant part of most compilers is the lexical analyzer - which in most cases, amounts to a Deterministic Finite State Automaton, whether formally constructed or as an ad-hoc equivalent. Either way, I would have a lot of trouble seeing that as a graph coloring problem.

alexfru · Post by **alexfru** » Thu Mar 28, 2019 6:50 pm

mariuszp wrote:It's literally just lots of graph problems; graph coloring, vertex reachability, etc.

Only if your compiler is an optimizing one.

Korona · Post by **Korona** » Fri Mar 29, 2019 12:15 am

Schol-R-LEA wrote:Actually, in terms of the time spent, the dominant part of most compilers is the lexical analyzer - which in most cases, amounts to a Deterministic Finite State Automaton, whether formally constructed or as an ad-hoc equivalent. Either way, I would have a lot of trouble seeing that as a graph coloring problem.

That's only true if you either use an existing backend or don't care about optimization. Lexical analysis is solved and there are so many algorithms to do it easily: recursive descend, PEG, LR parsers generators, or even Earley parsers that can handle any CFG.

mariuszp wrote:It's literally just lots of graph problems; graph coloring, vertex reachability, etc. Prove me wrong.

A LLVM-style Greedy RA is superior to graph coloring RAs in practice, prove me wrong. Graph coloring just cannot easily handle non-homogeneuos register files (and of course, it's also quite expensive). When I wrote an implementation (only a bit over 1k SLOC, can do live range splitting but not register spilling right now) inspired by the LLVM one, I was surprised (i) how natural it is, and (ii) how good the results are if you tune it a bit (for smallish functions with only a few basic blocks/variables, the result is often optimal). It's incredibly easy to tune -- when you see a missed optimization, it's usually straightforward to see how to extend the RA to handle it.

Schol-R-LEA · Post by **Schol-R-LEA** » Fri Mar 29, 2019 9:48 am

mariuszp wrote:
Schol-R-LEA wrote:Actually, in terms of the time spent, the dominant part of most compilers is the lexical analyzer - which in most cases, amounts to a Deterministic Finite State Automaton, whether formally constructed or as an ad-hoc equivalent. Either way, I would have a lot of trouble seeing that as a graph coloring problem.
All problems in NP have a polynomial-time reduction to graph coloring.

facepalm I can't believe I fell for that. Yeah, that can describe more or less any practical algorithm, can't it (as P is a subset of NP).

Korona · Post by **Korona** » Fri Mar 29, 2019 12:04 pm

Doesn't hold when you want to parse C++. (And context-sensitivity gives you a PSPACE-complete problem.)

Qbyte · Post by **Qbyte** » Sat Jun 01, 2019 8:16 am

Schol-R-LEA wrote:Actually, in terms of the time spent, the dominant part of most compilers is the lexical analyzer - which in most cases, amounts to a Deterministic Finite State Automaton, whether formally constructed or as an ad-hoc equivalent. Either way, I would have a lot of trouble seeing that as a graph coloring problem.

The lexer is actually the simplest and least time consuming part of writing a compiler. There's also pretty much only two ways to design a lexer; one which transforms the entire source file into a token array in a single step which is then handed off to the parser, or one which generates tokens for the parser on the fly. I personally prefer the former design for reasons of performance and code simplicity.

Schol-R-LEA · Post by **Schol-R-LEA** » Sat Jun 01, 2019 4:54 pm

Qbyte wrote:
Schol-R-LEA wrote:Actually, in terms of the time spent, the dominant part of most compilers is the lexical analyzer - which in most cases, amounts to a Deterministic Finite State Automaton, whether formally constructed or as an ad-hoc equivalent. Either way, I would have a lot of trouble seeing that as a graph coloring problem.
The lexer is actually the simplest and least time consuming part of writing a compiler. There's also pretty much only two ways to design a lexer; one which transforms the entire source file into a token array in a single step which is then handed off to the parser, or one which generates tokens for the parser on the fly. I personally prefer the former design for reasons of performance and code simplicity.

Sorry if I wasn't clear; I wasn't referring to the implementation time, but the running time during a compile. Depending on the number of potential transitions, and the amount of optimization the compiler performs, lexical analysis can consume up to 80% of the running time (the textbook Modern Compiler Design by Grune, et. al., explains why in detail).

Writing a lexical analyzer is pretty easy, comparatively. Writing an efficient one is another story, though as you said, it still is usually simpler and faster to implement than the other components.

Thomas · Post by **Thomas** » Sun Jun 02, 2019 11:21 pm

Hi mariuszp,

With availability of lot of video courses online, you can do anything you set your mind to. You will need to invest some time and energy for it. With a combination of a book ( a good free one suggested already by others ) and online resources, you should be able to get it done. The classic text by Aho et el is another good reference. Hopefully you can find a low priced used copy online.

I think, these videos are very instructional.
https://courses.cs.washington.edu/cours ... index.html

--Thomas

Thomas · Post by **Thomas** » Sun Jun 02, 2019 11:31 pm

I remember enjoying reading this too.

https://holub.com/compiler/. The author has released the book and sources in public domain. Read this if you want to learn how to write a lex and yacc clone. However i did not like the design of the final compiler he came up with. There are far more modern and more modular ways to do it. But a good read for sure.

--Thomas

Solar · Post by **Solar** » Mon Jun 03, 2019 1:44 am

Boost.Spirit is (IMHO) a strong C++ alternative to the C (lex/flex, yacc/bison) way of going about things.

Upsides:

Integration. No additional syntaxes, no additional tools, no additional building steps.
Direct generation of C++ object hierarchy.
Modularity and re-usability of subparsers. (For example, I've created a Boost.Spirit parser for "quoted strings" that includes the usual C escapes, octal, hexadecimal, and Unicode character constants, and generates a UTF-8 std::string. And re-used it in multiple projects since.)
In combination with Boost iterators, can give very good error messages on failed parses with very little effort.

Downsides:

Compile times. (Can be alleviated significantly by modularization, see above.)
"Classic" C++ template error messages (when building the parser) can be daunting. (Might get better with Contracts.)

If you're willing to really tackle the learning curve, Boost.Spirit can be very rewarding, for both quick off-the-cuff oneliners or complex parse trees. But you have to dig deep to familiarize yourself with its intricacies. As with many other of my favourite tools (Vim, LaTeX, C++ in general), you won't see a payoff if you're only scratching the surface. The benefits are to be found at the deep end.

Additionally, let me mention Crafting Interpreters by Bob Nystrom as an introduction on how to make an interpreter (which isn't that different from making a compiler, really). I found that piece to be more enlightening than many more theoretical approaches, especially if you're a beginner in the art and can't wrap your head around all the related lingo yet.

alexfru · Post by **alexfru** » Mon Jun 03, 2019 3:19 am

Schol-R-LEA wrote:Depending on the number of potential transitions, and the amount of optimization the compiler performs, lexical analysis can consume up to 80% of the running time (the textbook Modern Compiler Design by Grune, et. al., explains why in detail).

I presume, that figure of 80% corresponds to non- or nearly non-optimizing compilers.
The languages that allow writing code with exponential compile times (where the compiler needs to try this or that in order to resolve ambiguity, potentially many times nested) are probably a separate category of their own.

nyc · Post by **nyc** » Thu Jan 23, 2020 6:52 pm

Korona wrote:A LLVM-style Greedy RA is superior to graph coloring RAs in practice, prove me wrong. Graph coloring just cannot easily handle non-homogeneuos register files (and of course, it's also quite expensive). When I wrote an implementation (only a bit over 1k SLOC, can do live range splitting but not register spilling right now) inspired by the LLVM one, I was surprised (i) how natural it is, and (ii) how good the results are if you tune it a bit (for smallish functions with only a few basic blocks/variables, the result is often optimal). It's incredibly easy to tune -- when you see a missed optimization, it's usually straightforward to see how to extend the RA to handle it.

I like things like http://people.cs.pitt.edu/~soffa/research/Comp/gurr.ps as ways to make graph colouring or analysis go through, though I'm admittedly coming from the point of view of someone not really interested enough in compilers to code up an implementation of it or a more recent algorithm in the same vein.

OSDev.org

How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler

Re: How to make a compiler