As I rebuild ANTLR (v4) using v3, I'm trying to do a good job of their recovery. Jim Idle has done a great job of explaining what are the key problems concerning early termination of loops upon bad input: Custom Syntax Error Recovery
As I have looked through the v3 analysis and generated code, I realized that Jim has highlighted a fairly serious problem in the error recovery mechanism. I tried to run away some of the look ahead for efficiency, but it is caused some weaknesses in error recovery. So, I can fix that for v4 code generation.
I'm also exploring techniques for doing good error recovery beyond what ANTLR can do for you automatically. It turns out that good error recovery is quite challenging. For example, I'm trying to handle the situation where people forget the terminating ';' on the end of the rule
grammar A; a : b catch [Exception e] {...} b : B ;
Before the catch we need the ';'. The problem is that we are deeply nested (down in the "element" rule) looking for an element of an alternative. the rule stack is
[grammarSpec, rules, rule, ruleBlock, altList, alternative, elements, element]
When we see 'catch', we need to report an error and then blow all the way out back to the "ruleBlock" rule. That screams out for a specialized recognition exception object. All we have to do is throw it
throw new ResyncToEndOfRuleBlock();
when we detect a missing semicolon and then catch it at the right spot:
ruleBlock : altList ; catch [ResyncToEndOfRuleBlock e] {}
how do we detect the missing ';'? It's not as simple as looking for a ':' in rule element. In our case here, element has to look beyond 'b' in rule 'a' to see if there is an argument following (a[34], for example). It sees "a catch" instead and throws NoViableAltException. The problem is that the current input symbol is "b", not "catch". What we actually need to know is how far ahead it looked before it decided to fail. That token is the problem token (catch). So, I've updated v3 ANTLR runtime so that TokenStream knows how to ask for the "high water mark" via range(). Here is my handler for element that checks for the unterminated rule case:
element : ... ; catch [RecognitionException re] { int ttype = input.get(input.range()).getType(); // look for anything that really belongs at the start of the rule minus the initial ID if ( ttype==COLON || ttype==RETURNS || ttype==CATCH || ttype==FINALLY || ttype==AT ) { RecognitionException missingSemi = new v4ParserException("unterminated rule (missing ';') detected at '"+ input.LT(1).getText()+" "+input.LT(2).getText()+"'", input); reportError(missingSemi); throw new ResyncToEndOfRuleBlock(); } reportError(re); recover(input,re); retval.tree = (GrammarAST)adaptor.errorNode(input, retval.start, input.LT(-1), re); }
Upon the above input, ANTLR v4 gives the tasty error message:
error(17): A.g:2:4: unterminated rule (missing ';') detected at 'b catch' while looking for rule element
Notes...
"grammar A;;" says
error(17): A;.g:1:10: extraneous input ';' expecting EOF
in v3:
CodeGenerator.generateLocalFOLLOW():
if ( follow.member(Label.EOF) ) { // TODO: can we just remove? Seems needed here: // compilation_unit : global_statement* EOF // Actually i guess we resync to EOF regardless follow.remove(Label.EOF); }
I think we do need EOF. dang. I'm running into an error recovery set that doesn't have EOF when it's needed.
Havng trouble figuring out what "exact" parameter does in combineFollows() in runtime.
it seems to only add sets from above on stack if EOR is in set i. When it sees a set w/o EOR, it stops adding. Why would we ever want them all? Maybe no viable alt instead of mismatched token?
a : sync ( stuff sync )* ;
should be part of ()* DFA (for entry/exit/choice). check looks for mismatch (via dfa). If fails, consumeUntil start of loop or follow of loop. Recheck to see if dfa predicts loop. If it does, enter. If predicts exit, then just exit and proceed normally.
Code templates say:
/** Same as a normal DFA state except that we don't examine lookahead * for the bypass alternative. It delays error detection but this * is faster, smaller, and more what people expect. For (X)? people * expect "if ( LA(1)==X ) match(X);" and that's it. */ dfaOptionalBlockState(k,edges,eotPredictsAlt,description,stateNumber,semPredState) ::= << int LA<decisionNumber>_<stateNumber> = input.LA(<k>);<\n> <edges; separator="\nelse "> >>
I'm reversing my opinion.
element in antlr grammar had acyclic k>1 predictor. Upon a : A b : B ; (missing , it gets stuck at "b:" not ":". Hard to add error alt. had to do .* COLON. Actually makes sense. it's (TOKEN_REF|RULE_REF)COLON that says we are missing a ';'.