Why does my lexer not recognize a token that is a substring of another token? I get no nondeterminism warnings, but I do get a run-time recognition exception.

Terence Parr

Imagine a grammar with k=3 and the following rules:

STAR : '*' ;
TRISTAR : "***" ;
Given input **, you want the lexer to match STAR twice when in fact you get a run-time exception indicating that it was looking for another * (i.e., it tried to match the TRISTAR). The confusion is, "Do you mean TRISTAR or STAR STAR STAR"? It decides that once it sees two '*' characters, then you are on your way to matching a TRISTAR.

Obviously STAR is a substring of TRISTAR. The language itself is ambiguous, resulting in a nondeterministic lexer decision. So why don't you get a nondeterminism warning? For example, the following parser rule:
a : ( S | S S S )+ ;
gets the following warning:
t.g:6: warning: nondeterminism upon
t.g:6:  k==1:S
t.g:6:  k==2:S
t.g:6:  k==3:S
t.g:6:  between alts 1 and 2 of block
where the (...)+ loop simulates the repeated call to nextToken in the lexer to get more tokens.

The reason that the lexer cannot generate a warning is that ANTLR must assume that literally any character in your vocabulary can follow the recognition of a token. So, literally any character can follow the recognition of the STAR rule, which includes '*'. Generating a warning when the FOLLOW set of a lexer rule collides with another rule means that you'd get a lot of meaningless warnings. For example, with k=2 you would not want a warning here that '=' can follow the PLUS rule:
PLUS : '+' ;
PLUS_ASSIGN : "+=" ;
The generated nextToken rule generates a switch that checks k=2 lookahead and jumps to the PLUS_ASSIGN when it sees a "+=". The difference in our STAR example is that STAR must be matched twice rather than just once like PLUS vs PLUS_ASSIGN. TRISTAR is like the closure of STAR; that is, (STAR)+.

Without backtracking, a single-character lookahead (k=1) lexer cannot properly distinguish between the long string and the single character. Once it sees that the third char is not another *, it must rewind the input and back up to say "oh, it's STAR". Lex and DFA-based lexers do that automatically for you. In ANTLR, we have k=3 lookahead, but ANTLR cannot figure out for itself it needs all k=3 due to the "any char can follow STAR" problem mentioned above. It decides that only two '*' characters predicts TRISTAR since the k=2 lookahead depth is deterministic.

You'll have to be specific about the amount of lookahead with a semantic predicate:
	:	"*" ({LA(1)=='*'&&LA(2)=='*'}? "**")?
which results (partially) in:
	if (((LA(1)=='*'))&&(LA(1)=='*'&&LA(2)=='*')) {
	else {
I left-factored the rules for fun and to reduce the lookahead from k=3 to k=2. Anyway, you can see that I have forced ANTLR to check k=2 lookahead characters after it has seen the first *.

I believe the proper way to handle this is to use a syntactic predicate that specifies the required lookahead (it's easier to read and is less reliant on ANTLR implementation methods like LA(i)):
	:	("***")=>"***"
	|	'*'
ANTLR will generate a backtracking decision that will slow things down. Naturally, ANTLR could easily optimize this fixed lookahead language down to a simple k=3 lookahead test, thus, saving the backtrack.