As I rebuild ANTLR (v4) using v3, I'm trying to do a good job of error recovery. Jim Idle has done a great job of explaining the key problems concerning early termination of loops upon bad input: Custom Syntax Error Recovery
As I looked through the v3 analysis and generated code, I realized that Jim has highlighted a fairly serious problem in the error recovery mechanism. I tried to pare away some of the lookahead for efficiency, but that has caused some weaknesses in error recovery. I can fix that in v4 code generation.
I'm also exploring techniques for doing good error recovery beyond what ANTLR can do for you automatically. It turns out that good error recovery is quite challenging. For example, I'm trying to handle the situation where people forget the terminating ';' at the end of a rule:
Code Block |
---|
grammar A;
a : b
catch [Exception e] {...}
b : B ;
|
Before the catch we need the ';'. The problem is that we are deeply nested (down in the "element" rule) looking for an element of an alternative. The rule stack is:
Code Block |
---|
[grammarSpec, rules, rule, ruleBlock, altList, alternative, elements, element]
|
When we see 'catch', we need to report an error and then blow all the way back out to the "ruleBlock" rule. That screams out for a specialized recognition exception object. All we have to do is throw it:
throw new ResyncToEndOfRuleBlock();
when we detect a missing semicolon and then catch it at the right spot:
Code Block |
---|
ruleBlock
: altList
;
catch [ResyncToEndOfRuleBlock e] {}
|
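To make the unwinding concrete, here is a minimal plain-Java sketch of the same idea without any ANTLR machinery. Everything in it is an assumption made for illustration: the hand-written rule methods, the string tokens, the `log` list, and the choice to make `ResyncToEndOfRuleBlock` an unchecked exception with no stack trace. The real generated parser will look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the specialized exception. It is unchecked so the
// intermediate rule methods need no throws clauses, and it skips stack trace
// capture because it is used purely for control flow.
class ResyncToEndOfRuleBlock extends RuntimeException {
    ResyncToEndOfRuleBlock() {
        super(null, null, false, false);
    }
}

// Toy recursive-descent "parser" whose call chain mirrors part of the rule stack.
class ResyncDemo {
    final List<String> log = new ArrayList<>();
    private final String[] tokens;
    private int p = 0; // index of the current token

    ResyncDemo(String... tokens) { this.tokens = tokens; }

    void ruleBlock() {
        log.add("enter ruleBlock");
        try {
            altList();
        } catch (ResyncToEndOfRuleBlock e) {
            // The nested rules were abandoned; resume here at the offending token.
            log.add("resync in ruleBlock at '" + tokens[p] + "'");
        }
        log.add("exit ruleBlock");
    }

    void altList()     { alternative(); log.add("exit altList"); }
    void alternative() { elements();    log.add("exit alternative"); }

    void elements() {
        while (p < tokens.length) element();
        log.add("exit elements");
    }

    void element() {
        String t = tokens[p];
        if (t.equals("catch") || t.equals(":")) {
            log.add("error: unterminated rule (missing ';') at '" + t + "'");
            throw new ResyncToEndOfRuleBlock(); // unwind straight to ruleBlock
        }
        log.add("element " + t);
        p++;
    }

    public static void main(String[] args) {
        ResyncDemo parser = new ResyncDemo("b", "catch");
        parser.ruleBlock();
        parser.log.forEach(System.out::println);
    }
}
```

None of the "exit" lines for altList, alternative, or elements show up in the log, because the exception skips the rest of those methods and control resumes in ruleBlock.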
How do we detect the missing ';'? It's not as simple as looking for a ':' in rule element. In our case here, element has to look beyond 'b' in rule 'a' to see if there is an argument following (a[34], for example). It sees 'catch' instead and throws NoViableAltException. The problem is that the current input symbol is "b", not "catch". What we actually need to know is how far ahead the parser looked before it decided to fail. That token is the problem token (catch). So, I've updated the v3 ANTLR runtime so that TokenStream can report the "high water mark" via range(). Here is my handler for element that checks for the unterminated rule case:
Code Block |
---|
element
    : ...
    ;
    catch [RecognitionException re] {
        // Type of the farthest token the parser looked at before failing
        int ttype = input.get(input.range()).getType();
        // These tokens can only start or follow a rule, so the previous rule is unterminated
        if ( ttype==COLON || ttype==RETURNS || ttype==CATCH || ttype==FINALLY || ttype==AT ) {
            RecognitionException missingSemi =
                new v4ParserException("unterminated rule (missing ';') detected at '"+
                                      input.LT(1).getText()+" "+input.LT(2).getText()+"'", input);
            reportError(missingSemi);
            throw new ResyncToEndOfRuleBlock(); // caught in ruleBlock
        }
        // Otherwise fall back to the usual report, recover, and error node
        reportError(re);
        recover(input,re);
        retval.tree = (GrammarAST)adaptor.errorNode(input, retval.start, input.LT(-1), re);
    }
|
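The high water mark idea can be sketched independently of ANTLR. The class below is a hypothetical, stripped-down token stream over strings, not the runtime's TokenStream. It only borrows the method names LT, get, consume, and range, and records the farthest index any lookahead request has touched.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stream that remembers how far ahead anyone has looked.
class HighWaterTokenStream {
    private final List<String> tokens;
    private int p = 0;      // index of the current token, i.e. LT(1)
    private int range = -1; // farthest index touched so far; -1 means none yet

    HighWaterTokenStream(String... tokens) { this.tokens = Arrays.asList(tokens); }

    /** Look ahead k tokens (k >= 1) without consuming; "<EOF>" past the end. */
    String LT(int k) {
        int i = p + k - 1;
        // Clamp so that range() always stays a valid index into the buffer.
        range = Math.max(range, Math.min(i, tokens.size() - 1));
        return i < tokens.size() ? tokens.get(i) : "<EOF>";
    }

    /** Token at absolute index i. */
    String get(int i) { return tokens.get(i); }

    /** Advance past the current token. */
    void consume() { if (p < tokens.size()) p++; }

    /** The high water mark: the farthest index any LT call has reached. */
    int range() { return range; }

    public static void main(String[] args) {
        HighWaterTokenStream input =
            new HighWaterTokenStream("a", ":", "b", "catch", "[");
        input.consume();  // match 'a'
        input.consume();  // match ':'
        input.LT(1);      // element looks at 'b' ...
        input.LT(2);      // ... and one token beyond it, then fails
        System.out.println(input.LT(1));              // prints b
        System.out.println(input.get(input.range())); // prints catch
    }
}
```

The current token is still 'b' when the prediction fails, but the token at range() is the 'catch' that actually caused the failure, which is why the handler tests the token type there rather than at LT(1).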
Notes...
"grammar A;;" says
error(17): A;.g:1:10: extraneous input ';' expecting EOF
...