Why does my lexer not recognize a token that is a substring of another token? I get no nondeterminism warnings, but I do get a run-time recognition exception.
Created Jul 20, 2001
Terence Parr Imagine a grammar with k=3 and the following rules:
Obviously STAR is a substring of TRISTAR. The language itself is ambiguous, resulting in a nondeterministic lexer decision. So why don't you get a nondeterminism warning? For example, the following parser rule:
The reason that the lexer cannot generate a warning is that ANTLR must assume that literally any character in your vocabulary can follow the recognition of a token. So, literally any character can follow the recognition of the STAR rule, which includes '*'. Generating a warning when the FOLLOW set of a lexer rule collides with another rule means that you'd get a lot of meaningless warnings. For example, with k=2 you would not want a warning here that '=' can follow the PLUS rule:
Without backtracking, a single-character lookahead (k=1) lexer cannot properly distinguish between the long string and the single character. Once it sees that the third char is not another *, it must rewind the input and back up to say "oh, it's STAR". Lex and DFA-based lexers do that automatically for you. In ANTLR, we have k=3 lookahead, but ANTLR cannot figure out for itself it needs all k=3 due to the "any char can follow STAR" problem mentioned above. It decides that only two '*' characters predicts TRISTAR since the k=2 lookahead depth is deterministic.
You'll have to be specific about the amount of lookahead with a semantic predicate:
I believe the proper way to handle this is to use a syntactic predicate that specifies the required lookahead (it's easier to read and is less reliant on ANTLR implementation methods like LA(i)):
STAR : '*' ; TRISTAR : "***" ;Given input **, you want the lexer to match STAR twice when in fact you get a run-time exception indicating that it was looking for another * (i.e., it tried to match the TRISTAR). The confusion is, "Do you mean TRISTAR or STAR STAR STAR"? It decides that once it sees two '*' characters, then you are on your way to matching a TRISTAR.
Obviously STAR is a substring of TRISTAR. The language itself is ambiguous, resulting in a nondeterministic lexer decision. So why don't you get a nondeterminism warning? For example, the following parser rule:
... a : ( S | S S S )+ ;gets the following warning:
t.g:6: warning: nondeterminism upon t.g:6: k==1:S t.g:6: k==2:S t.g:6: k==3:S t.g:6: between alts 1 and 2 of blockwhere the (...)+ loop simulates the repeated call to nextToken in the lexer to get more tokens.
The reason that the lexer cannot generate a warning is that ANTLR must assume that literally any character in your vocabulary can follow the recognition of a token. So, literally any character can follow the recognition of the STAR rule, which includes '*'. Generating a warning when the FOLLOW set of a lexer rule collides with another rule means that you'd get a lot of meaningless warnings. For example, with k=2 you would not want a warning here that '=' can follow the PLUS rule:
PLUS : '+' ; PLUS_ASSIGN : "+=" ;The generated nextToken rule generates a switch that checks k=2 lookahead and jumps to the PLUS_ASSIGN when it sees a "+=". The difference in our STAR example is that STAR must be matched twice rather than just once like PLUS vs PLUS_ASSIGN. TRISTAR is like the closure of STAR; that is, (STAR)+.
Without backtracking, a single-character lookahead (k=1) lexer cannot properly distinguish between the long string and the single character. Once it sees that the third char is not another *, it must rewind the input and back up to say "oh, it's STAR". Lex and DFA-based lexers do that automatically for you. In ANTLR, we have k=3 lookahead, but ANTLR cannot figure out for itself it needs all k=3 due to the "any char can follow STAR" problem mentioned above. It decides that only two '*' characters predicts TRISTAR since the k=2 lookahead depth is deterministic.
You'll have to be specific about the amount of lookahead with a semantic predicate:
STARS
: "*" ({LA(1)=='*'&&LA(2)=='*'}? "**")?
;
which results (partially) in:
match("*");
{
if (((LA(1)=='*'))&&(LA(1)=='*'&&LA(2)=='*')) {
match("**");
}
else {
}
I left-factored the rules for fun and to reduce the lookahead from k=3
to k=2. Anyway, you can see that I have forced ANTLR to check k=2
lookahead characters after it has seen the first *.
I believe the proper way to handle this is to use a syntactic predicate that specifies the required lookahead (it's easier to read and is less reliant on ANTLR implementation methods like LA(i)):
STARS
: ("***")=>"***"
| '*'
;
ANTLR will generate a backtracking decision that will slow things
down. Naturally, ANTLR could easily optimize this fixed lookahead
language down to a simple k=3 lookahead test, thus, saving the
backtrack.