All I know is that I need to build a parser or translator. Give me an overview of what ANTLR does and how I need to approach building things with ANTLR.

Terence Parr

 

 

[See also Getting started ]

To build a language recognizer, you specify the structure of that langage with a grammar and then have ANTLR generate a Java or C++ class definition that you can use to recognize sentences in that language. You may add simple operators to your grammar to have ANTLR automatically build intermediate form trees, which you can walk later to perform a translation. You may also embed Java or C++ actions within a grammar to collect information or perform a translation.

For simple translations, you will build two grammars: a lexer grammar and a parser grammar from which ANTLR generates a lexer (often called a scanner or tokenizer) and a parser. The lexer breaks up the input character stream into a stream of tokens (vocabulary symbols) and then the parser applies grammatical structure (syntax) to the token stream. Here is a simple lexer that knows how to match commas, integers, and identifiers:

class IntAndIDLexer extends Lexer;

INT : ('0'..'9')+ ;

ID  : ('a'..'z')+ ;

COMMA: ',' ;

By repeatedly asking the lexer to get a token, the parser will see a stream of tokens. In particular, the parser will verify that the stream has the appropriate syntactic structure. If your definition of grammatical structure is "a series of integers and identifiers separated by commas", you might want a grammar that looks like:

class SeriesParser extends Parser;

/** Match an element (INT or ID) with possibly a
 *  bunch of ", element" pairs to follow matching
 *  input that looks like 32,a,size,28923,i
 */
series : element (COMMA element)* ;

/** Match either an INT or ID */ 
element: INT | ID ;

You will probably want to embed actions in the grammar so that little snippets of code are executed upon seeing certain input constructs. If you want to print out how many elements were found, you could add actions as follows:

class SeriesParser extends Parser;

// I'm using Java...

/** Match an element (INT or ID) with possibly a
 *  bunch of ", element" pairs to follow matching
 *  input that looks like 32,a,size,28923,i
 */
series
{ /* this is considered an initialization action
   * and is done before recognition of this rule
   * begins.  These look like local variables to
   * the resulting method SeriesParser.series()
   */
  int n = 1; // how many elements?  At least 1
}
    :  element (COMMA element {n++;})*
       {System.out.println("there were "+n+" elements");}
    ;

/** Match either an INT or ID */ 
element: INT | ID ;

So what the hell does ANTLR do with the grammars? Well, looking at the SeriesParser, ANTLR would generate the following (minus error handling stuff):

public class T extends antlr.LLkParser implements TTokenTypes {

// I cut out the usual set of constructors and a few
// other details.

/** Match an element (INT or ID) with possibly a
 *  bunch of ", element" pairs to follow matching
 *  input that looks like 32,a,size,28923,i
 */
public final void series() {
  element(); // match an element
  _loop3:
  do {
    if ((LA(1)==COMMA)) {
      match(COMMA);
      element();
    }
    else {
      break _loop3;
    }
  } while (true);
}
  
/** Match either an INT or ID */
public final void element() {
  switch ( LA(1)) {
  case INT:
  {
    match(INT);
    break;
  }
  case ID:
  {
    match(ID);
    break;
  }
  }
}
}

Chew on that for a while and you will start to see the one-to-one relationship between the grammar and the ouput, which is similar to what you might build by hand.

To use this lexer and parser, you need a main() that creates an instance of each, hooks them up and invokes a rule in the parser:

main(String[] args) {
  DataInputStream input = new DataInputStream(System.in);

  // attach lexer to the input stream
  IntAndIDexer lexer = new IntAndIDexer(input);

  // Create parser attached to lexer
  SeriesParser parser = new SeriesParser(lexer);

  // start up the parser by calling the rule
  // at which you want to begin parsing.
  parser.series();
}

To print out the text for a token matched in the parser, you must label the item whose matched text you want to print out. The label will refer to the Token object built by the lexer. In an action, ask the token object for its text:

class SeriesParser extends Parser;

series : element (COMMA element)* ;

element
    :    a:INT  {System.out.println(a.getText());}
    |    b:ID   {System.out.println(b.getText());}
    ;

Given input 32,a,size,28923,i, you would see the following output:

32
a
size
28923
i

More complicated translations typically require multiple passes over the input and so programmers typically build an intermediate form tree called an abstract syntax tree (AST), a structured representation of the text input. You may walk trees by hand or specify an ANTLR tree grammar that describes the tree structure. Actions embedded within the tree grammar are executed when the tree parser has reached the associated position in the input tree.

How could we build up a simple tree (albeit a flat one, a list) of the identifier and integer input? Easy. Tell ANTLR to build trees and it will create a flat tree, with one tree node per input token; in other words, ANTLR will build a linked list of your input tokens. To make things a bit more interesting, suffix the COMMA token reference with a '!' to indicate it is not to be included in the output tree:

class SeriesParser extends Parser;
options {
  buildAST = true;
}

series : element (COMMA! element)* ;

element: INT | ID ;

What can you do with this AST? Well, you could subclass ANTLR's common AST definition and add a walk() method or something, but a better way is to use another grammar to specify the AST structure. Tree grammars are like executable comments describing the structure of your intermediate form. Here is a trivial tree grammar that would match trees as generated by our parser grammar:

class SeriesTreeParser extends TreeParser;

/** Match a flat tree (a list) of one or more INTs or IDs.
 *  This rule differs from SeriesParser.series(), which
 *  is in a different grammar.
 */
series : ( INT | ID )+ ;

Note that the tree grammar is simpler than the parser grammar. Normally, you will construct trees that are simpler to walk than the text input with all the whitespace and other syntactic sugar used to makes things easier for a human to read. To invoke the tree parser, augment your main():

main(String[] args) {
  DataInputStream input = new DataInputStream(System.in);

  // attach lexer to the input stream
  IntAndIDexer lexer = new IntAndIDexer(input);

  // Create parser attached to lexer
  SeriesParser parser = new SeriesParser(lexer);

  // start up the parser by calling the rule
  // at which you want to begin parsing.
  parser.series();

  // Get the tree out of the parser
  AST resultTree = parser.getAST();

  // Make an instance of the tree parser
  SeriesTreeParser treeParser = new SeriesTreeParser();

  // Begin tree parser at only rule
  treeParser.series(resultTree);
}

You can add actions to a tree grammar just as easily as you can to a parser grammar. To print out a list of the integers and identifiers just like you did in the parser grammar, add a few actions (the labels refer to the associated matched AST node in tree grammars):

class SeriesTreeParser extends TreeParser;

series
    :  (  a:INT  {System.out.println(a.getText());}
       |  b:ID   {System.out.println(b.getText());}
       )+
    ;

What if you want to walk your AST another time, but with different actions? One way is to define another identical rule and put in different actions:

class SeriesTreeParser extends TreeParser;

series
    :  (  a:INT  {System.out.println(a.getText());}
       |  b:ID   {System.out.println(b.getText());}
       )+
    ;

/** Sum up all the integers for fun. */
passTwo
{
  int sum = 0;
}
    :  (  a:INT  {sum += Integer.parseInt(a.getText());}
       |  ID
       )+
       {System.out.println("sum is "+sum);}
    ;

Then, your main() needs to invoke the new rule after invoking the series rule:

main(String[] args) {
  ...
  // Get the tree out of the parser
  AST resultTree = parser.getAST();

  // Make an instance of the tree parser
  SeriesTreeParser treeParser = new SeriesTreeParser();

  treeParser.series(resultTree);  // walk AST once
  treeParser.passTwo(resultTree); // walk AST again!!
}

You would see output:

32
a
size
28923
i
sum is 28955

 

0 Comments  (click to add your comment)
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

About | Sitemap | Contact