Gracefully handling syntax errors

As I rebuild ANTLR (v4) using v3, I'm trying to do a good job of their recovery. Jim Idle has done a great job of explaining what are the key problems concerning early termination of loops upon bad input: Custom Syntax Error Recovery

As I have looked through the v3 analysis and generated code, I realized that Jim has highlighted a fairly serious problem in the error recovery mechanism. I tried to run away some of the look ahead for efficiency, but it is caused some weaknesses in error recovery. So, I can fix that for v4 code generation.

I'm also exploring techniques for doing good error recovery beyond what ANTLR can do for you automatically. It turns out that good error recovery is quite challenging. For example, I'm trying to handle the situation where people forget the terminating ';' on the end of the rule

grammar A;
a : b
   catch [Exception e] {...}
b : B ;

Before the catch we need the ';'. The problem is that we are deeply nested (down in the "element" rule) looking for an element of an alternative. the rule stack is

[grammarSpec, rules, rule, ruleBlock, altList, alternative, elements, element]

When we see 'catch', we need to report an error and then blow all the way out back to the "ruleBlock" rule. That screams out for an exception object. All we have to do is throw it

throw new ResyncToEndOfRuleBlock();

when we detect a missing semicolon and then catch it at the right spot:

ruleBlock
    : altList
    ;
    catch [ResyncToEndOfRuleBlock e] {}
/** Used to throw us out of deeply nested element back to end of a rule's
 *  alt list. Note it's not under RecognitionException.
 */
public class ResyncToEndOfRuleBlock extends RuntimeException {
}

how do we detect the missing ';'? It's not as simple as looking for a ':' in rule element. In our case here, element has to look beyond 'b' in rule 'a' to see if there is an argument following (a[34], for example). It sees "a catch" instead and throws NoViableAltException. The problem is that the current input symbol is "b", not "catch". What we actually need to know is how far ahead it looked before it decided to fail. That token is the problem token (catch). So, I've updated v3 ANTLR runtime so that TokenStream knows how to ask for the "high water mark" via range(). Here is my handler for element that checks for the unterminated rule case:

element
    : ...
    ;
catch [RecognitionException re] {
    int ttype = input.get(input.range()).getType();
    // look for anything that really belongs at the start of the rule minus the initial ID
    if ( ttype==COLON || ttype==RETURNS || ttype==CATCH || ttype==FINALLY || ttype==AT ) { 
        RecognitionException missingSemi =
            new v4ParserException("unterminated rule (missing ';') detected at '"+ 
                                  input.LT(1).getText()+" "+input.LT(2).getText()+"'", input);
        reportError(missingSemi);
        throw new ResyncToEndOfRuleBlock();
    }   
    reportError(re);
    recover(input,re);
    retval.tree = (GrammarAST)adaptor.errorNode(input, retval.start, input.LT(-1), re);
}

Upon the above input, ANTLR v4 gives the tasty error message:

error(17): A.g:2:4: unterminated rule (missing ';') detected at 'b catch' while looking for rule element

Oh, and it correctly recovers and creates the AST for rule 'b'.

(COMBINED_GRAMMAR A
 (RULES <mismatched token: [@17,46:46='b',<63>,4:0], resync=a : b catch Exception e {...}>
  (RULE b (BLOCK (ALT B)))
 )
)

We need ResyncToEndOfRuleBlock to avoid recovering in element. input "b catch" starts with a rule ref, which is a valid element. it consumes until FOLLOW(element),which has rule ref in it so it consumes nothing. elements rule will descend back into element. must get it all the back to rule block level to recover properly by consuming all the exception crap. Ultimately I'd like to resync by parsing backwards from the ':'.

Notes...

"grammar A;;" says

error(17): A;.g:1:10: extraneous input ';' expecting EOF

in v3:

CodeGenerator.generateLocalFOLLOW():

if ( follow.member(Label.EOF) ) {
	// TODO: can we just remove?  Seems needed here:
	// compilation_unit : global_statement* EOF
	// Actually i guess we resync to EOF regardless
	follow.remove(Label.EOF);
}

I think we do need EOF. dang. I'm running into an error recovery set that doesn't have EOF when it's needed.

Havng trouble figuring out what "exact" parameter does in combineFollows() in runtime.
it seems to only add sets from above on stack if EOR is in set i. When it sees a set w/o EOR, it stops adding. Why would we ever want them all? Maybe no viable alt instead of mismatched token?

a : sync ( stuff sync )* ;

should be part of ()* DFA (for entry/exit/choice). check looks for mismatch (via dfa). If fails, consumeUntil start of loop or follow of loop. Recheck to see if dfa predicts loop. If it does, enter. If predicts exit, then just exit and proceed normally.

Code templates say:

/** Same as a normal DFA state except that we don't examine lookahead
 *  for the bypass alternative.  It delays error detection but this
 *  is faster, smaller, and more what people expect.  For (X)? people
 *  expect "if ( LA(1)==X ) match(X);" and that's it.
 */
dfaOptionalBlockState(k,edges,eotPredictsAlt,description,stateNumber,semPredState) ::= <<
int LA<decisionNumber>_<stateNumber> = input.LA(<k>);<\n>
<edges; separator="\nelse ">
>>

I'm reversing my opinion. We must not delay error attention. If we Don't see the start of an EBNF constructs, we should generate an error and trigger recovery. In fact, we should try to sync/recover at the start of these constructs and within just like this:

a : sync ( stuff sync )* ;

That way, we can get rid of crap in front of an optional element instead of bypassing the optional element and then getting a really strange error that we are missing whatever followed that optional element.

element in antlr grammar had acyclic k>1 predictor. Upon a : A b : B ; (missing (wink), it gets stuck at "b:" not ":". Hard to add error alt. had to do .* COLON. Actually makes sense. it's (TOKEN_REF|RULE_REF)COLON that says we are missing a ';'.