All I know is that I need to build a parser or translator. Give me an overview of what ANTLR does and how I need to approach building things with ANTLR.
Created May 14, 2012
Terence Parr
[See also Getting started ]
To build a language recognizer, you specify the structure of that langage with a grammar and then have ANTLR generate a Java or C++ class definition that you can use to recognize sentences in that language. You may add simple operators to your grammar to have ANTLR automatically build intermediate form trees, which you can walk later to perform a translation. You may also embed Java or C++ actions within a grammar to collect information or perform a translation.
For simple translations, you will build two grammars: a lexer grammar and a parser grammar from which ANTLR generates a lexer (often called a scanner or tokenizer) and a parser. The lexer breaks up the input character stream into a stream of tokens (vocabulary symbols) and then the parser applies grammatical structure (syntax) to the token stream. Here is a simple lexer that knows how to match commas, integers, and identifiers:
class IntAndIDLexer extends Lexer; INT : ('0'..'9')+ ; ID : ('a'..'z')+ ; COMMA: ',' ;
By repeatedly asking the lexer to get a token, the parser will see a stream of tokens. In particular, the parser will verify that the stream has the appropriate syntactic structure. If your definition of grammatical structure is "a series of integers and identifiers separated by commas", you might want a grammar that looks like:
You will probably want to embed actions in the grammar so that little snippets of code are executed upon seeing certain input constructs. If you want to print out how many elements were found, you could add actions as follows: So what the hell does ANTLR do with the grammars? Well, looking at the SeriesParser, ANTLR would generate the following (minus error handling stuff): Chew on that for a while and you will start to see the one-to-one relationship between the grammar and the ouput, which is similar to what you might build by hand. To use this lexer and parser, you need a main() that creates an instance of each, hooks them up and invokes a rule in the parser: To print out the text for a token matched in the parser, you must label the item whose matched text you want to print out. The label will refer to the Token object built by the lexer. In an action, ask the token object for its text: Given input 32,a,size,28923,i, you would see the following output: More complicated translations typically require multiple passes over the input and so programmers typically build an intermediate form tree called an abstract syntax tree (AST), a structured representation of the text input. You may walk trees by hand or specify an ANTLR tree grammar that describes the tree structure. Actions embedded within the tree grammar are executed when the tree parser has reached the associated position in the input tree. How could we build up a simple tree (albeit a flat one, a list) of the identifier and integer input? Easy. Tell ANTLR to build trees and it will create a flat tree, with one tree node per input token; in other words, ANTLR will build a linked list of your input tokens. To make things a bit more interesting, suffix the COMMA token reference with a '!' to indicate it is not to be included in the output tree: What can you do with this AST? Well, you could subclass ANTLR's common AST definition and add a walk() method or something, but a better way is to use another grammar to specify the AST structure. Tree grammars are like executable comments describing the structure of your intermediate form. Here is a trivial tree grammar that would match trees as generated by our parser grammar: Note that the tree grammar is simpler than the parser grammar. Normally, you will construct trees that are simpler to walk than the text input with all the whitespace and other syntactic sugar used to makes things easier for a human to read. To invoke the tree parser, augment your main(): You can add actions to a tree grammar just as easily as you can to a parser grammar. To print out a list of the integers and identifiers just like you did in the parser grammar, add a few actions (the labels refer to the associated matched AST node in tree grammars): What if you want to walk your AST another time, but with different actions? One way is to define another identical rule and put in different actions: Then, your main() needs to invoke the new rule after invoking the series rule: You would see output: class SeriesParser extends Parser;
/** Match an element (INT or ID) with possibly a
* bunch of ", element" pairs to follow matching
* input that looks like 32,a,size,28923,i
*/
series : element (COMMA element)* ;
/** Match either an INT or ID */
element: INT | ID ;
class SeriesParser extends Parser;
// I'm using Java...
/** Match an element (INT or ID) with possibly a
* bunch of ", element" pairs to follow matching
* input that looks like 32,a,size,28923,i
*/
series
{ /* this is considered an initialization action
* and is done before recognition of this rule
* begins. These look like local variables to
* the resulting method SeriesParser.series()
*/
int n = 1; // how many elements? At least 1
}
: element (COMMA element {n++;})*
{System.out.println("there were "+n+" elements");}
;
/** Match either an INT or ID */
element: INT | ID ;
public class T extends antlr.LLkParser implements TTokenTypes {
// I cut out the usual set of constructors and a few
// other details.
/** Match an element (INT or ID) with possibly a
* bunch of ", element" pairs to follow matching
* input that looks like 32,a,size,28923,i
*/
public final void series() {
element(); // match an element
_loop3:
do {
if ((LA(1)==COMMA)) {
match(COMMA);
element();
}
else {
break _loop3;
}
} while (true);
}
/** Match either an INT or ID */
public final void element() {
switch ( LA(1)) {
case INT:
{
match(INT);
break;
}
case ID:
{
match(ID);
break;
}
}
}
}
main(String[] args) {
DataInputStream input = new DataInputStream(System.in);
// attach lexer to the input stream
IntAndIDexer lexer = new IntAndIDexer(input);
// Create parser attached to lexer
SeriesParser parser = new SeriesParser(lexer);
// start up the parser by calling the rule
// at which you want to begin parsing.
parser.series();
}
class SeriesParser extends Parser;
series : element (COMMA element)* ;
element
: a:INT {System.out.println(a.getText());}
| b:ID {System.out.println(b.getText());}
;
32
a
size
28923
i
class SeriesParser extends Parser;
options {
buildAST = true;
}
series : element (COMMA! element)* ;
element: INT | ID ;
class SeriesTreeParser extends TreeParser;
/** Match a flat tree (a list) of one or more INTs or IDs.
* This rule differs from SeriesParser.series(), which
* is in a different grammar.
*/
series : ( INT | ID )+ ;
main(String[] args) {
DataInputStream input = new DataInputStream(System.in);
// attach lexer to the input stream
IntAndIDexer lexer = new IntAndIDexer(input);
// Create parser attached to lexer
SeriesParser parser = new SeriesParser(lexer);
// start up the parser by calling the rule
// at which you want to begin parsing.
parser.series();
// Get the tree out of the parser
AST resultTree = parser.getAST();
// Make an instance of the tree parser
SeriesTreeParser treeParser = new SeriesTreeParser();
// Begin tree parser at only rule
treeParser.series(resultTree);
}
class SeriesTreeParser extends TreeParser;
series
: ( a:INT {System.out.println(a.getText());}
| b:ID {System.out.println(b.getText());}
)+
;
class SeriesTreeParser extends TreeParser;
series
: ( a:INT {System.out.println(a.getText());}
| b:ID {System.out.println(b.getText());}
)+
;
/** Sum up all the integers for fun. */
passTwo
{
int sum = 0;
}
: ( a:INT {sum += Integer.parseInt(a.getText());}
| ID
)+
{System.out.println("sum is "+sum);}
;
main(String[] args) {
...
// Get the tree out of the parser
AST resultTree = parser.getAST();
// Make an instance of the tree parser
SeriesTreeParser treeParser = new SeriesTreeParser();
treeParser.series(resultTree); // walk AST once
treeParser.passTwo(resultTree); // walk AST again!!
}
32
a
size
28923
i
sum is 28955