What is a tree parser and why would I want to use one?
Created May 7, 2012
Terence Parr Only simple, so-called syntax directed translations can be done with
actions within the parser. These kinds of translations can only spit
out constructs that are functions of information already seen at that
point in the parse. Tree parsers allow you to walk an intermediate
form and manipulate that tree, gradually morphing it over several
translation phases to a final form that can be easily printed back out
as the new translation.
Imagine a simple translation problem where you want to print out an html page whose title is "There are n items" where n is the number of identifiers you found in the input stream. The ids must be printed after the title like this:
Ok, now you know that a tree is a good thing for anything but simple translations. Given an AST, how do you get output from it? Imagine simple expression trees. One way is to make the nodes in the tree specific classes like PlusNode, IntegerNode and so on. Then you just ask each node to print itself out. For input, 3+4 you would have tree:
The main weakness of the heterogeneous AST approach is that it cannot conveniently access context information. In a recursive-descent parser, your context is easily accessed because it can be passed in as a parameter. You also know precisely which rule can invoke which other rule (e.g., is this expression a WHILE condition or an IF condition?) by looking at the grammar. The PlusNode class above exists in a detached, isolated world where it has no idea who will invoke it's toString() method. Worse, the programmer cannot tell in which context it will be invoked by reading it.
In summary, adding actions to your input parser works for very straightforward translations where:
Imagine a simple translation problem where you want to print out an html page whose title is "There are n items" where n is the number of identifiers you found in the input stream. The ids must be printed after the title like this:
<html> <head> <title>There are 3 items</title> </head> <body> <ol> <li>Dog</li> <li>Cat</li> <li>Velociraptor</li> </body> </html>from input
Dog Cat VelociraptorSo with simple actions in your grammar how can you compute the title? You can't without reading the whole input. Ok, so now we know we need an intermediate form. The best is usually an AST I've found since it records the input structure. In this case, it's just a list but it demonstrates my point.
Ok, now you know that a tree is a good thing for anything but simple translations. Given an AST, how do you get output from it? Imagine simple expression trees. One way is to make the nodes in the tree specific classes like PlusNode, IntegerNode and so on. Then you just ask each node to print itself out. For input, 3+4 you would have tree:
+ | 3 -- 4and classes
class PlusNode extends CommonAST { public String toString() { AST left = getFirstChild(); AST right = left.getNextSibling(); return left + " + " + right; } } class IntNode extends CommonAST { public String toString() { return getText(); } }Given an expression tree, you can translate it back to text with t.toString(). SO, what's wrong with this? Seems to work great, right? It appears to work well in this case because it's simple, but I argue that, even for this simple example, tree grammars are more readable and are formalized descriptions of precisely what you coded in the PlusNode.toString().
expr returns [String r] { String left=null, right=null; } : #("+" left=expr right=expr) {r=left + " + " + right;} | i:INT {r=i.getText();} ;Note that the specific class ("heterogeneous AST") approach actually encodes a complete recursive-descent parser for #(+ INT INT) by hand in toString(). As parser generator folks, this should make you cringe. ;)
The main weakness of the heterogeneous AST approach is that it cannot conveniently access context information. In a recursive-descent parser, your context is easily accessed because it can be passed in as a parameter. You also know precisely which rule can invoke which other rule (e.g., is this expression a WHILE condition or an IF condition?) by looking at the grammar. The PlusNode class above exists in a detached, isolated world where it has no idea who will invoke it's toString() method. Worse, the programmer cannot tell in which context it will be invoked by reading it.
In summary, adding actions to your input parser works for very straightforward translations where:
- the order of output constructs is the same as the input order
- all constructs can be generated from information parsed up to the point when you need to spit them out