Why can't I call a random non-start rule in my grammar, such as expression, and have the parser match an isolated subphrase of my language?

Terence Parr

Another way to ask this question is: "When does ANTLR think end-of-file (EOF) can follow execution of a rule?"

ANTLR assumes that EOF can follow any start rule (the rule(s) that you invoke from your Java/C++/Sather code to start the parsing process). Which rules are start rules? ANTLR considers any rule a start rule if it is not referred to by another rule in the grammar.

An example of why you would want to invoke a non-start rule would be reusing the expression evaluator in a Java grammar to parse input from a form text field. You only have a Java expression, not a full Java program. You might want to invoke rule expr, but there can be problems calling a non-start rule.

To make things concrete, take a look at the following parser and lexer grammars:

class P extends Parser;
options {
	k=2;  // notice k>1!!!
}

expr:	atom (PLUS atom)*
	;

atom:	ID
	|	ID DOT ID
	;

class L extends Lexer;

ID : ('a'..'z')+ ;

PLUS : '+' ;

LPAREN : '(' ;

RPAREN : ')' ;

DOT : '.' ;

WS : (' '|'\n')+ {$setType(Token.SKIP);} ;
Here, rule expr is a start rule because no other rule in the grammar refers to it. The parser will match input such as "a" or "a.b" given a main program like:
public static void main(String[] args)
    throws Exception
{
	L lex = new L(System.in);
	P p = new P(lex);
	p.expr();
}
The key thing I want you to notice is the code generated for the alternatives of rule atom:
if ((LA(1)==ID) && (LA(2)==EOF||LA(2)==PLUS)) {
    match(ID);
}
else if ((LA(1)==ID) && (LA(2)==DOT)) {
    match(ID);
    match(DOT);
    match(ID);
}
ANTLR thinks that EOF can follow the reference to the isolated ID in atom because expr calls atom and expr can be followed by EOF. This is crucial to deciding between ID EOF and ID DOT ID.

What happens when expr is no longer a start rule? EOF is no longer possible after an isolated ID in atom because ANTLR doesn't think expr will be called from the outside world anymore (most of the time this is what you want). Adding a reference to expr anywhere in the grammar will prevent the simple input ID from matching. For example, adding the following rule makes it impossible to recognize "a", though "a.b" still works since ANTLR knows DOT can follow "a".
ifst:	"if" expr "then" expr "endif"
	;
ANTLR sees that expr is called from ifst and presumes it will not be called from outside the grammar directly. Note that ANTLR can compute the lookahead only one way: either with EOF or without it. Unless ANTLR duplicated the rule for each context (referenced within the grammar and as a start rule), it must pick one. For the two alternatives of atom, ANTLR now generates:
if ((LA(1)==ID) &&
    (LA(2)==PLUS||LA(2)==LITERAL_then||LA(2)==LITERAL_endif)) {
    match(ID);
}
else if ((LA(1)==ID) && (LA(2)==DOT)) {
    match(ID);
    match(DOT);
    match(ID);
}
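The difference between the two generated tests can be reproduced with a small stand-alone model. This is only a sketch, not ANTLR's generated code: the class name AtomPrediction, the method predict, and the use of plain strings for token types are assumptions made for illustration. The two follow sets are copied from the lookahead tests shown above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the k=2 decision in rule atom (not ANTLR output).
class AtomPrediction {
    // Returns the predicted alternative of atom (1 or 2), or 0 when
    // neither lookahead test passes, which the parser reports as an error.
    static int predict(List<String> tokens, Set<String> followOfId) {
        String la1 = tokens.get(0);
        String la2 = tokens.get(1);
        if (la1.equals("ID") && followOfId.contains(la2)) {
            return 1; // isolated ID
        }
        if (la1.equals("ID") && la2.equals("DOT")) {
            return 2; // ID DOT ID
        }
        return 0;
    }

    public static void main(String[] args) {
        // Tokens that may follow an isolated ID when expr is a start rule.
        Set<String> asStartRule =
            new HashSet<>(Arrays.asList("EOF", "PLUS"));
        // Tokens that may follow it once expr is only called from ifst.
        Set<String> calledFromIfst =
            new HashSet<>(Arrays.asList("PLUS", "LITERAL_then", "LITERAL_endif"));

        List<String> a = Arrays.asList("ID", "EOF");                // input "a"
        List<String> ab = Arrays.asList("ID", "DOT", "ID", "EOF");  // input "a.b"

        System.out.println(predict(a, asStartRule));     // 1
        System.out.println(predict(ab, asStartRule));    // 2
        System.out.println(predict(a, calledFromIfst));  // 0
        System.out.println(predict(ab, calledFromIfst)); // 2
    }
}
```

In this model, input "a" stops matching as soon as EOF leaves the follow set, while "a.b" is unaffected because its decision depends only on DOT.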
If you haven't guessed yet, the answer is to duplicate a rule. Either duplicate expr (less than optimal) or simply add a new start rule whose only task is to invoke expr. For example, adding startExpr and changing main() to call it instead of expr makes the grammar handle both contexts:
startExpr
	:	expr
	;
This simple addition makes ANTLR believe EOF can follow expr, since a start rule (startExpr) invokes it. The lookahead test for alternative one of atom again includes EOF, so the parser will accept "a" again. The main() would now look like:
public static void main(String[] args)
    throws Exception
{
	L l = new L(System.in);
	P p = new P(l);
	p.startExpr();
}
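To see why the wrapper rule works, it can help to apply the start-rule definition from the beginning of this entry to the three versions of the grammar. The following is a minimal sketch under stated assumptions: the class StartRules and the map-of-references representation are hypothetical and are not part of ANTLR. It only encodes the statement that a rule is a start rule if no other rule refers to it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not an ANTLR API: finds unreferenced rules.
class StartRules {
    // refs maps each rule name to the rules it references.
    static Set<String> startRules(Map<String, List<String>> refs) {
        Set<String> result = new TreeSet<>(refs.keySet());
        for (Map.Entry<String, List<String>> e : refs.entrySet()) {
            for (String callee : e.getValue()) {
                // Only a reference from *another* rule disqualifies a rule.
                if (!callee.equals(e.getKey())) {
                    result.remove(callee);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grammar = new LinkedHashMap<>();
        grammar.put("expr", Arrays.asList("atom"));
        grammar.put("atom", Collections.<String>emptyList());
        System.out.println(startRules(grammar)); // [expr]

        // Adding ifst turns expr into an ordinary, referenced rule.
        grammar.put("ifst", Arrays.asList("expr", "expr"));
        System.out.println(startRules(grammar)); // [ifst]

        // The wrapper rule is itself unreferenced, so it is a start rule.
        grammar.put("startExpr", Arrays.asList("expr"));
        System.out.println(startRules(grammar)); // [ifst, startExpr]
    }
}
```

Because startExpr is a start rule and does nothing but invoke expr, anything that can follow startExpr, including EOF, can follow expr as well.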