Can you explain more about ANTLR's tricky lexical lookahead issues related to seeing past the end of a token definition into the start of another?

Terence Parr

Consider the following parser grammar:

class MyParser extends Parser;

sequence : (r1|r2)+ ;

r1 : A (B)? ;

r2 : B ;

ANTLR reports

m.g:3: warning: nondeterminism upon
m.g:3:         k==1:B
m.g:3:         between alts 1 and 2 of block

because rule r1 can match B or not, letting r2 match it the next time through the loop in rule sequence.

Unfortunately, the same nondeterminism is not reported for the analogous lexer grammar:

class MyLexer extends Lexer;

T1 : 'a' ('b')? ;

T2 : 'b' ;

[We don't need a rule analogous to sequence since lexers always generate a sequence of tokens.] Why is there no warning in the lexer? Because ANTLR cannot really be sure that T2 will follow T1 in sequence. You might instead have T1 T1. Most of the time the nondeterminism warning is bogus when trying to look past the end of a token definition rule so ANTLR shuts it off.

[In more detail] ANTLR has a problem seeing past the end of a token definition since, depending on what happens in the invoking parser, just about any token could be seen next. This implies that just about any character could start the next token and, hence, just about any character can follow the recognition of a token definition in your grammar. This "any char can follow" situation contributes little to ANTLR's understanding of your lexer grammar rules. Consequently, ANTLR does not mention problems it sees that arise due to characters contributed from (future) tokens match down the input stream. While, ANTLR limits the "follow set" to your character vocabulary, you should assume that trying to see past the end of a rule is not a good idea.

The only time ANTLR does something bizarre is when you need more than a single character of lookahead. In this case, you might run into a limitation of the approximate LL(k) lookahead ANTLR users.

Here is a small grammar that illustrates this special situation with two characters of lookahead (k=2):

MAIN       // only token seen upstream (e.g., by the parser)
  : A
  | B

protected  // only called from another rule, MAIN in this case
A : 'y' 'b'
  | 'a' 'x'

B : 'a'
  | 'b'

ANTLR correctly thinks that sequence 'a' 'b' cannot be matched by A, but when lumped together as a single lookahead set for A (for use in rule MAIN), 'a' 'b' predicts rule A. The real lookahead set is compressed into the approximate lookahead of {'y','a'} followed by {'b','x'}. This is the approximate LL(k>1) lookahead issue. What lookahead predicts rule B? While rule B could be called twice in a row, ANTLR does not think about what follows B; ANTLR only looks k=1 into B. The lookahead set for B is, therefore, {'a','b'}. No error is generated since sequence 'a' 'b' predicts only rule A. The problem is that 'a' 'b' is not valid for A nor B. It is only the sequence B B that can recognize 'a' 'b'. Since there is nothing in your grammar that says "loop around A or B" like (A|B)+, ANTLR has no way to accurately yell about such nondeterminisms.