Why is the (...)* loop below nondeterministic for any k (is this an example of ANTLR computing linear approximate lookahead)?

Terence Parr

With k=1, it's easy to see. B predicts matching another iteration of the (...)* loop but also predicts exiting rule b and matching B in rule a.

With k=2 and beyond, it's trickier. Input B A again predicts the loop but, because of the way ANTLR computes lookahead, it looks like B A can be matched by exiting rule b as well; it can't in practice, but ANTLR doesn't know that.

ANTLR generates a class of parsers that we call LL(k) for simplicity, but are actually approximate LALL(k) (you may have heard of the LALR(1) critter used by YACC and the ilk). First, the LALL parst. LALL means that when ANTLR computes what input can follow a reference to a rule, it checks and combines what follows all references to that rule. Here, ANTLR would compute:

FOLLOW_2(b) = { FIRST_2(S b B), FIRST_2(B) }
            = { S FIRST_1(b B), B FOLLOW_1(a) }
            = { S A, B <EOF> }

So, still no problem, right? Input S A and B <EOF> can follow the matching of a reference to rule b. Why do we get a collision with B A? Enter the "linear approximate lookahead" feature of ANTLR that allows it to approximate full LALL(k) with a computation that is linear in complexity.

Truly computing LL(k) lookahead (PCCTS is the only one that I know of that actually works in practice; although Josef Grosch has decoded my dissertation--his work may do full LR(k) now) is exponentially complex, so ANTLR approximates the computation by collapsing all kazillion input sequences into k sets of tokens. Here that means we ge {S B} {A <EOF>}. Now, ask again the question "can B A be matched?" and you'll see that B is in the first set and A is in the second. I call these "imaginary input sequences" as they are artifacts of the computation.

So that is the complicated story...your brain is computing LL(k) and ANTLR is actually computing approximate LALL(k).