Flaw in ANTLR v3 LL(*) analysis algorithm
Update Oct 2012: resolved with correct definition of when to terminate prediction lookahead. Turns out we don't need to push this extra stack element.
Over the Christmas holidays, I've been busy building example grammars for ANTLR v4. The thing I noticed immediately is that grammars just work. There are no error messages from ANTLR when generating code and all we can get are true ambiguity errors at runtime. E.g., if you can recognize T(i) as both a function call and a constructor call. The lexers work extremely well and I'm a happy guy. Nothing scares ANTLR v4, "it's pretty bad-ass" like the The Crazy Nastyass Honey Badger.
And then, building an R language grammar, I ran into an ambiguity warning and an improperly parsed R phrase. Since the runtime warnings had already found 2 other errors in my grammar, I was not worried. Then, I noticed that the input was not only unambiguous, so should not have given a message, it should have parsed differently. 1.5 days later, I finally have figured out exactly what's going on and exactly what's wrong and exactly how to fix it. The only issue is that the fix is going to add more overhead to parsing not only in memory but in speed. Well, maybe.
The flaw
Imagine we have R expression a(i)<-x
that we want to parse starting at rule prog in the following grammar.
prog: expr_or_assign* ; expr_or_assign : expr '++' | expr ; expr : expr_primary ('<-' ID)? ; expr_primary : '(' ID ')' | ID '(' ID ')' | ID ;
I got a mismatch token error at '++'. Somehow, ANTLR decided that it should choose alternative 1 in rule expr_or_assign. That was weird. There is no '++' any input and so ANTLR should not have chosen that alternative. If it's confused, it always consumes more input trying to make a decision. What the hell?
I turned on the diagnostic error strategy and it reported ambiguity warning upon "a(i)<-". What? After some playing around, I remembered that ANTLR delays reporting ambiguities now until a situation which the parser can only match ambiguous input. In ANTLR v3, we would have detected the ambiguity immediately at "a(i)", but as Sam Harwell pointed out, that unnecessarily cuts out some opportunities (it would fail over to backtracking unnecessarily occasionally). (That was flaw or lack of strength problem #1 found in v3's LL(*) algorithm). Once I realized that it was actually ambiguous on "a(i)", it made total sense. It's true, "a(i)" can be matched in 2 different ways by the parser.
Here's how. (Let's assume full context parsing here.) Starting in rule prog, we call expr_or_assign, which calls expr, which calls expr_primary. expr_primary can match "a(i)" completely using the second alternative. OR, you can match just the "a" using the third alternative. To do that, means it has to exit and return to expr. Since "(" not "<-" is the next symbol, it exits expr and returns to expr_or_assign. Since "++" it is also not the next token, we exit and return to the loop in prog. That immediately loops back and jumps into expr_or_assign again. This must enter expr, which immediately calls expr_primary. Finally, we match "(i)" to the first alternative. To finish things off, we return to expr, match "<-x" with the optional sub rule and exit back to expr_or_assign. Since there is no more input, we would assume that we had matched expr the of the second alternative not the first (EOF not "++" is the next symbol). Exiting expr_or_assign returns us to prog where we fall out of the loop because we have no more input.
Yes, the grammar is ambiguous because it can match the same input in more than one way. But, the decision in expr_or_assign is never ambiguous. It doesn't matter what we match in expression, even if there are multiple ways to match some input. If it's followed by "++", it should choose the first alternative of expr_or_assign. Period. Any other token following expression should force the parser to predict the second alternative. Whoops!
After much experimentation and creation of software to trace prediction through the augmented transition network (ATN) representation of grammar, I discovered the source of the problem. To highlight the issue, all we have to do is use the recursive equivalent of prog:
prog: expr_or_assign prog | ;
Or, for simplicity, the following manual repetition of expr_or_assign exhibits the same behavior for our input "a(i)<-x".
prog: expr_or_assign expr_or_assign ;
To understand the problem, note that ANTLR uses a specific definition of ambiguity:
If the parser can reach the same state with the same rule invocation stack (context) but starting from more than one alternative, the decision is ambiguous."
The problem is that ANTLR destroys context information when it converts the recursive looping rule into the equivalent EBNF expr_or_assign* loop. Imagine returning from rule expr_or_assign in this version of prog:
prog: expr_or_assign expr_or_assign ;
It will immediately enter expr_or_assign again, but from a different spot in prog so that the rule invocation stack will have a different return address. At the start of prog, the rule invocation state might be state p whereas the rule invocation state for the second invocation of expr_or_assign might be q. The stacks will be different down in expr_or_assign--one will have p and one will have q. That is context that he needs to correctly choose between alternatives in expr_or_assign, believe it or not.
We lose contact information with a expr_or_assign* loop. At the start of prog, we invoke expr_or_assign from state p, let's say, and if we return, the loop sends us right back to state p. That means that two invocations of expr_or_assign back-to-back look identical because they have the same stack! Ooops. Apparently, this is not really a problem very often. It only came up in this case because my grammar is ambiguous (derived from that infernal tendency of language designers to allow optional semicolons!).
To see how this works, since I know you are all on the edge of your seats, check out the following image and its green path. (Graphviz, has alternative to on top of alternative one for expr_or_assign...sorry about that.) That path is how ANTLR can get to state 17 by going through rule and matching "a(i)<-x". So far so good. We can identify the configuration of our ATN simulator as (17,1,[8]) because we called expr_or_assign from state 8 in prog. (I have left out that path to avoid even more lines on the graph.) If we find a configuration (17,2,[8]), indicating we have gotten to the exact same state with the same rule invocation stack but predicting a second alternative, we have an ambiguity.
Now, to see the ambiguity we need to follow the path matching "a(i)<-x" via alternative 2 in expr_or_assign (starting in state 19). In the following, I have created an Orange line emanating from state 19 leading into expr then expr_primary. Is one of my choices, I can pick the alternative that matches a single identifier via state 45. Then I reached the end of that rule and return to state 29 then 2 state 30 and state 5, the stopped state for expr. Because we invoked expr from state 19 in expr_or_assign, we return to the following state 22, not 17. We fall off the end of the rule and end up in state 11 of prog. We loop around back to the rule invocation state 8. We return to expr_or_assign and reach the decision point state 21 again. The crucial piece of information here is that the context is a stack of [8]. This is exactly the same context as when we entered expr_or_assign originally. In other words, we do not differentiate between the first and second invocations of a rule within a * loop. More below.
Remember that we are pursuing alternative 2 of expr_or_assign at the moment so any path we take from 21 is still considered part of the possibilities for our original prediction of decision 1 (d=1 in the graph). Choosing to pursue the first alternative, state 15, is what leads to trouble. The blue line shows how to match the remaining input, "(i)<-x".
It eventually returns to state 17, the return state from rule invocation state 15. The configuration of the ATN is (17,1,[8]), which is directly in conflict with (17,1,[8]) associated with the first alternative of expr_or_assign. ANTLR declares an ambiguity. (actually, ANTLR figures it out a little bit sooner at state 27.) The problem is that it's not ambiguity. The only way to get back into expr_or_assign from prog is after having matched an ID to expr_or_assign and looping back around. The context stack of just [8] is the same both times. It does not encode the fact that we have already called expr_or_assign once already--the input position is different. The first time, we are trying to match starting at "a" and the second time we are trying to match starting at "(".
Because only one rule, prog, calls expr_or_assign, Strong LL(*) and LL(*) are the same. There's no point in pursuing a full context parse. Strong LL(*) should be able to handle this without resorting to a full context parse. On the other hand, the situation is presumably very rare. I think it comes because of a fairly ambiguous grammar. We could probably let it fail with the regular prediction mechanism and then simply update the full context parse to properly deal with context for * loops.
The fix
A solution involves changing the context from a simple rule invocation state p to include the iteration number. That would mean tracking how many times each loop is gone around and for each context and so on. Making sure to use the right iteration number would be complicated. The better solution is simpler but a little more expensive. All we have to do is mimic what the tail recursive version does: push something onto the invocation stack for each iteration.
Of then we have to pop this pseudo-rule invocations off the stack. The ATN has 2 convenient states associated with loops: StarLoopbackState/PlusLoopbackState and LoopEndState. The ATNs built by ANTLR for loops look like this:
To simulate tail recursion, we do implicit call to the start of the loop at StarLoopbackState/PlusLoopbackState by pushing that state onto the stack as if we had called the start of the loop from that position. Now, as we step through, say, 'x'* we will get a larger and larger stack that must be unwound after the loop. The LoopEndState is a great place to do this. Instead of doing an actual return to the loopback state, LoopEndState pops all elements from the stack associated with the loopback state for that loop.
To illustrate, imagine input "x" and loop 'x'*. At entry, the stack is empty []. We enter the block, which is just 'x', and then reach the loopback state. We push that state, say, q to get [q] context at StarLoopbackState. We finds no more input and so we exit to the LoopEndState, which pops all q on the stack, leaving [].
Now, imagine input "xx" and loop 'x'*. As before, the stack starts out empty, [], and then we enter the block to match 'x'. At the loopback, the stack becomes [q]. We match another 'x' and hit the loopback. Push to get [q q] at the entry state. No more input, so we jump to the LoopEndState, which pops all q, leaving []. Etc.
The same trick works for + loops. Imagine input "xx" for loop 'x'+. We start out with an empty stack and match 'x' in the block. Looping back around to push q to get stack [q] at the block start. We match the second 'x' and hit the loopback. In this case, it's a decision state and those 2 exit to the LoopEndState. That state pops all q (just one this time).
This requires that I alter my closure routine so that it add this functionality to the loopback and loop and states. Should not be too bad but requires an extra parameter to express this "mode". The closure method has to work for both regular protection and full context production. Maybe I will do some cut-and-paste
Ok, the point of this huge exercise was simply to convince myself I understood the problem and the solution. I think I got. Parser prediction and grammar analysis are extremely complicated, which is why I'm still finding edge cases years after ANTLR v3 came out. The adaptive LL(*) in v4 is truly awesome, but even more complicated!
Wow. 15 lines of code in closure(). Took about an hour.