Why does ANTLR say that two of my lexer rules are ambiguous? Or, why am I having so much trouble with DOT or PERIOD at the left-edge of lexer rules?

Terence Parr

Imagine that your grammar (for Prolog, say) has two types of tokens that begin with DOT (the period symbol). Specifically, the "." may appear alone as the end token, or it may appear as the start of a graphic token (such as ".$&%"). You may be tempted to simply define the rules as follows:


GRAPHIC_TOKEN: '.' (GRAPHIC_CHAR)+ ;
END_TOKEN: '.' ;

Clearly, at k=1 ANTLR cannot distinguish between the tokens, since all it sees is the dot. At k=2, ANTLR can see past the dot to whatever follows. In the GRAPHIC_TOKEN case, it sees the graphic char(s) beyond the dot. For rule END_TOKEN, however, ANTLR looks beyond the end of the token to the character that will begin the next token, and it must assume that any character could appear there. Even if the result would be an invalid char sequence, ANTLR must assume any vocabulary char can follow recognition of a lexer rule.

This "anything can follow" concept presents a problem. How can ANTLR distinguish between GRAPHIC_TOKEN and END_TOKEN when at lookahead depth k=2 both rules appear to have GRAPHIC_CHAR in their lookahead computations?!
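To make the conflict concrete, here is a rough Python sketch of the per-depth lookahead sets for the two rules. It is a simplified model, not ANTLR's actual analysis code, and both the VOCABULARY and GRAPHIC_CHARS sets are assumptions chosen for illustration (they are not the real Prolog definitions).

```python
# Hypothetical vocabulary and GRAPHIC_CHAR set, assumed for illustration only.
VOCABULARY = {chr(c) for c in range(32, 127)}
GRAPHIC_CHARS = set("#$&%*+-/:<=>?@^~")

# Lookahead for each rule as one character set per depth (k=1, then k=2).
graphic_token_lookahead = [{"."}, GRAPHIC_CHARS]
# After END_TOKEN's '.', the token is finished, so depth 2 is
# "whatever starts the next token", which must be taken as any char.
end_token_lookahead = [{"."}, VOCABULARY]

def overlap_at_each_depth(a, b):
    """Return the intersection of two rules' lookahead sets, depth by depth."""
    return [x & y for x, y in zip(a, b)]

overlap = overlap_at_each_depth(graphic_token_lookahead, end_token_lookahead)

# The rules are distinguishable only if the overlap is empty at some depth.
ambiguous = all(len(chars) > 0 for chars in overlap)
print("ambiguous at k=2:", ambiguous)
```

Because the depth-2 set for END_TOKEN is the whole vocabulary, it necessarily contains every GRAPHIC_CHAR, so the overlap is non-empty at both depths and no amount of k=2 lookahead separates the rules.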

The answer is that ANTLR cannot decide. You must modify your grammar to overcome this limitation. How? Well, just mimic what an automaton-based lexer generator such as Flex/lex would do for you: left-factor the two rules, combining their identical left prefixes into a single rule:

END_TOKEN
    :    '.' (GRAPHIC_CHAR {$setType(GRAPHIC_TOKEN);})*
    ;

Here you essentially assume the token will be an END_TOKEN unless you see a bunch of GRAPHIC_CHAR, in which case you reset the token type to GRAPHIC_TOKEN. Note that you must define the type GRAPHIC_TOKEN somewhere in your parser or lexer. Presumably you reference GRAPHIC_TOKEN in your parser grammar, thus implicitly defining it.
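If it helps to see the left-factored decision spelled out by hand, the following Python sketch mimics the behavior described above: match the dot, start with END_TOKEN as the type, and switch to GRAPHIC_TOKEN as soon as a graphic char is consumed. The function name match_dot_token and the GRAPHIC_CHARS set are hypothetical, introduced only for this illustration; this is not the code ANTLR generates.

```python
# Hypothetical GRAPHIC_CHAR set, assumed for illustration only.
GRAPHIC_CHARS = set("#$&%*+-/:<=>?@^~")

def match_dot_token(text, pos=0):
    """Match one token starting with '.' at text[pos].

    Returns a (token_type, lexeme) pair.
    """
    if pos >= len(text) or text[pos] != ".":
        raise ValueError("expected '.' at position %d" % pos)
    end = pos + 1
    token_type = "END_TOKEN"  # default assumption: a lone dot
    # Consume any graphic chars that follow; seeing even one
    # changes the token type, like $setType(GRAPHIC_TOKEN).
    while end < len(text) and text[end] in GRAPHIC_CHARS:
        token_type = "GRAPHIC_TOKEN"
        end += 1
    return token_type, text[pos:end]

print(match_dot_token("."))
print(match_dot_token(".$&% foo"))
```

The point of the design is that only one decision is ever made on the dot itself; the choice between the two token types is deferred until the characters after the dot have actually been examined.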