What's the difference between a parse tree and an abstract syntax tree (AST)? Why doesn't ANTLR generate trees with nodes for grammar rules like JJTree does?

Terence Parr

A parse tree is a record of the rules (and tokens) used to match some input text, whereas a syntax tree records the structure of the input and is insensitive to the grammar that produced it. There are infinitely many grammars for any single language, and each one generally yields a differently shaped parse tree for the same input sentence because its intermediate rules differ. An abstract syntax tree is a far superior intermediate form precisely because of this insensitivity and because it highlights the structure of the language, not the grammar.

It's best to look at an example. For input 3+4 you really want to use the following intermediate form:

  +
 / \
3   4
That is, the operator at the root and 3 and 4 as operands (children). In ANTLR child-sibling form, you'd have
+
|
3 -- 4
Ok, so now a parse tree. I'll pick an extremely simple one out of the infinite number:
   expr
 /  |  \
3   +   4
Of course, in a real grammar, you'd have about 8 more levels of nesting in the grammar and hence in the tree. All of these rule nodes are noise and still don't tell you anything about the structure of the input; they only tell you which rules were applied to match it (sometimes that is enough to work with, but only for simple syntax-directed translations).
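To make the contrast concrete, here is a minimal Python sketch that encodes both trees for 3+4 as nested tuples of the form (node, child, ...). The rule names expr, term, and atom are hypothetical, chosen only for illustration, and evaluate and count_nodes are helpers invented for this sketch; none of this is ANTLR output.

```python
# AST for "3+4": operator at the root, operands as children.
ast = ("+", 3, 4)

# One possible parse tree. The rule names (expr, term, atom) are
# hypothetical; a different grammar would give a different shape.
parse_tree = ("expr",
              ("term", ("atom", "3")),
              "+",
              ("term", ("atom", "4")))

def evaluate(node):
    """Evaluate an AST made of ints and binary '+' / '*' nodes."""
    if isinstance(node, int):
        return node
    op, left, right = node
    if op == "+":
        return evaluate(left) + evaluate(right)
    if op == "*":
        return evaluate(left) * evaluate(right)
    raise ValueError(f"unknown operator: {op!r}")

def count_nodes(node):
    """Count every node, interior and leaf, in a tuple-encoded tree."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(count_nodes(child) for child in node[1:])

print(evaluate(ast))            # the AST can be walked directly
print(count_nodes(ast))         # 3 nodes
print(count_nodes(parse_tree))  # 8 nodes for the same input
```

Even with only two levels of made-up rules, the parse tree carries more than twice as many nodes, and a walker over it would have to know the rule names to find the operator.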

In summary, ASTs are superior because they record the structure of the input, not the artifacts of your grammar; consequently they are insensitive to changes in your input grammar, easier to manipulate during translation, and much smaller.

Loring Craymer adds:
ANTLR's annotation approach for constructing trees is much nicer than the conventional approach--nonterminals represent language features (like "if" for conditional statement blocks) rather than the grammar writer's view of the language. Why should

foo : A B C bar ; bar : D E F ;

generate a different tree than

foobar : A B C D E F ;

by default?

Worse: it is possible to generate both trees in one grammar. Recognizing that a #( foo A B C #(bar D E F) ) tree is the same as #( foobar A B C D E F ) is just an added burden to the programmer.
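The burden can be seen in a small Python sketch. It assumes parse trees are encoded as nested tuples of the form (rule, child, ...), and tokens is a helper invented here; it is not part of ANTLR or JJTree.

```python
# Rule-node trees for the same token sequence A B C D E F.
nested = ("foo", "A", "B", "C", ("bar", "D", "E", "F"))
flat = ("foobar", "A", "B", "C", "D", "E", "F")

def tokens(node):
    """Return the leaf tokens of a tuple-encoded tree, left to right."""
    if not isinstance(node, tuple):
        return [node]
    result = []
    for child in node[1:]:
        result.extend(tokens(child))
    return result

# The trees differ structurally even though the input is identical,
# so code consuming them has to handle both shapes.
print(nested == flat)                  # False
print(tokens(nested) == tokens(flat))  # True
```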

With ANTLR, "C" can be identified as an important feature for decoding "A B" and "D E F" semantics via the structuring

foobar : A B^ C^ D E F ;

which results in a tree matched by

tfoo : #( C #( B A ) D E F ) ;

that effectively distinguishes C (root), B A (B as first child of C, A as a child of B), and D E F (other children).
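The way the ^ annotations produce that shape can be modeled in a few lines of Python. This is a simplified sketch of the behavior described above, not ANTLR's actual implementation: Node and build are hypothetical names, and the model only handles a flat sequence of tokens in a single rule.

```python
class Node:
    """A tree node with a token text and an ordered list of children."""
    def __init__(self, text):
        self.text = text
        self.children = []

    def __str__(self):
        if not self.children:
            return self.text
        kids = " ".join(str(child) for child in self.children)
        return f"#( {self.text} {kids} )"

def build(elements):
    """Build a tree from token names; a trailing '^' marks a root token.

    Simplified model: a '^' token becomes the new root and adopts
    everything built so far as its first child(ren); an unmarked token
    becomes the next child of the current root.
    """
    root = None    # current root, once some token has been marked with ^
    pending = []   # unmarked tokens seen before any root exists
    for element in elements:
        if element.endswith("^"):
            new_root = Node(element[:-1])
            new_root.children = [root] if root is not None else pending
            pending = []
            root = new_root
        elif root is not None:
            root.children.append(Node(element))
        else:
            pending.append(Node(element))
    return [root] if root is not None else pending

tree = build("A B^ C^ D E F".split())[0]
print(tree)  # #( C #( B A ) D E F )
```

With no ^ markers at all, this model simply returns the tokens as a flat list of sibling nodes.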

Because ANTLR matches trees from the top down, the "rule identifier" nodes are unnecessary--they can be considered implicit in the tree walker rules. Also, ANTLR trees can be manipulated from one language structure to another--tree transformation is a very clean approach for language translation.
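As a rough illustration of why the rule nodes are redundant, here is a hand-written Python analogue of the tfoo tree rule. It assumes trees are encoded as nested tuples of the form (root, child, ...); the function is a hypothetical sketch, not code that ANTLR generates. The function itself plays the role a "tfoo" node would play in a parse tree, and it returns the tokens in their original order as a trivial example of a transformation.

```python
def tfoo(tree):
    """Hand-written analogue of: tfoo : #( C #( B A ) D E F ) ;

    Matches from the top down: check the root first, then each child
    in order. Returns the tokens in source order on success.
    """
    if not isinstance(tree, tuple) or len(tree) != 5:
        raise ValueError("tree does not match tfoo")
    root, first, *rest = tree
    if root != "C" or first != ("B", "A") or rest != ["D", "E", "F"]:
        raise ValueError("tree does not match tfoo")
    b, a = first
    return [a, b, root, *rest]

print(tfoo(("C", ("B", "A"), "D", "E", "F")))
```

A tree with explicit rule nodes, such as ("foo", "A", "B", "C", ("bar", "D", "E", "F")), is rejected by this walker, and nothing is lost: the walker already knows which rule it is matching.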