Can I do (one-pass) code generation with a tree parser? What is the normal code generation strategy?

Ian Kaplan

If your language is complicated and difficult to compile (e.g., C++, Verilog, VHDL) then the [one-pass "syntax-directed" translation] technique is not powerful enough. It also does not work well if you want to generate optimized code. For a complicated language or optimized code, one compiler architecture that works well is to divide the compiler into phases:

  • Parsing. An ANTLR parser/lexer that produces annotated abstract syntax trees. This pass takes care of all the syntax analysis and syntactic error reporting.
  • Semantic analysis. This is a pass over the AST that further annotates or changes the AST and does semantic analysis and error reporting. For example, C++ type incompatibility might be reported here.
  • Transformation. Some language features are really difficult to compile. They can be processed by doing tree to tree transformation in one or more passes. An example from C++ might be templates. In the Hardware Design Languages (Verilog and VHDL) there are many features that must be transformed (or expanded) into trees that later phases of the compiler can handle.
  • Control Flow Graph construction. Transform the ASTs into a control flow graph of extended basic blocks, where each block consists of single assignment statements. The single assignment statements reflect the data flow in the extended basic blocks. Common sub-expression elimination would be done here as well.
  • Optimization.
  • Code selection.
  • Translate the single assignment statements into single assignment machine instructions.
  • Peephole optimization.
  • Register allocation.
  • Code output.

I've left out some detail (e.g., constant propagation, expression optimization). Compiler writers love to argue about pass structure (e.g., where should register allocation go relative to other backend optimizations). So I'm sure that there are other pass structures which work equally well or better, especially for demanding processor architectures.

Of course a compiler like the one described above results in a much larger piece of software. A compiler that does statement at a time code generation is usually around five to ten thousand lines. A compiler with all the phases above will be much larger - probably fourty to one hundred thousand lines. And the bigger a piece of software is, the more bugs it will have, so if you can get away with a "quick and dirty" compiler, you are probably better off.

In theory ANTLR could be used to create tree walkers for many of of the passes I've described above. For example, an ANTLR tree walker could do semantic analysis and tree to tree transformation. I don't understand yet whether ANTLR provides a lot of advantage here, since I've never used a tree parser for these phases.