Why is the (...)* loop below nondeterministic for any k (is this an example of ANTLR computing linear approximate lookahead)?
Created May 4, 2012
With k=2 and beyond, it's trickier. Input B A again predicts the loop but, because of the way ANTLR computes lookahead, it looks like B A can be matched by exiting rule b as well; it can't in practice, but ANTLR doesn't know that.
ANTLR generates a class of parsers that we call LL(k) for simplicity, but are actually
approximate LALL(k) (you may have heard of the LALR(1) critter used by YACC
and the ilk). First, the LALL parst. LALL means that when ANTLR computes
what input can follow a
reference to a rule, it checks and
combines what follows all references to that rule. Here, ANTLR would compute:
So, still no problem, right? Input S A and B <EOF> can follow the matching
of a reference to rule b. Why do we get a collision with B A? Enter the
"linear approximate lookahead" feature of ANTLR that allows it to approximate
full LALL(k) with a computation that is linear in complexity.
Truly computing LL(k) lookahead (PCCTS is the only one that I know of that actually
works in practice; although Josef Grosch has decoded my dissertation--his work
may do full LR(k) now) is exponentially complex, so ANTLR approximates the computation by
collapsing all kazillion input sequences into k sets of tokens. Here that means we
ge {S B} {A <EOF>}. Now, ask again the question "can B A be matched?" and you'll see
that B is in the first set and A is in the second. I call these "imaginary input
sequences" as they are artifacts of the computation.
So that is the complicated story...your brain is computing LL(k) and ANTLR is actually
computing approximate LALL(k).
FOLLOW_2(b) = { FIRST_2(S b B), FIRST_2(B) }
= { S FIRST_1(b B), B FOLLOW_1(a) }
= { S A, B <EOF> }