Error productions proposal for ANTLR v4

Introduction

I'm abandoning this post mid-stream...seems that regular alternatives can match erroneous input just as easily as so-called error alternatives. Because of adaptive LL(*), it shouldn't affect production speed at all once it gets warmed up.

ANTLR has a built-in mechanism to detect, report, and recover from syntax errors. It seems to do a pretty good job. Certainly it's better than PEG, which can't detect errors until EOF. It does single token insertion and deletion automatically to deal with lots of common cases where people forget a symbol or have an extra one like "f(x;" and "f((x);".

The default mechanism recovers by scanning for a token that can follow the current rule or the invoking rule and so on. This work surprisingly well and is pulled almost directly from Niklaus Wirth's early Pascal parsing discussions in the 70s.

Yacc has error productions like this:

| error SEMI

that mean consume tokens until you reach SEMI, replacing everything but the SEMI with the error nonterminal. The error rule acts like a non-greedy ".*" loop.

My intuition is that scanning with the power of the parser rather than ".*" will allow a great deal of control over resynchronization. At the very least, it it will provide a great deal of flexibility in terms of generating good messages. We can simply list the common mistakes and add actions to emit very specific error messages. With only ".*" to distinguish erroneous input, yacc it might have trouble distinguishing between different erroneous sequences (in order to generate specific error messages).

The error productions serve both to recognize the erroneous input and to resynchronize the parser. After specifying the grammatical structure of the erroneous input, the error production can include the grammatical structure needed to resynchronize. After detecting an incorrect variable declaration prefix, a programmer could specify how to consume all the way through the rest of the declaration, even if it's not terminated by a \n or ';'.

decl: 'int' '[' ']' ID '=' expr
    | 'int' '[' ']' ID
    / 'int' '[' ID ('=' expr)?
    ;

where '/' is a "bent" or "not quite right" alternative operator, to distinguish from the PEG ordered alternative operator. One could even imagine skipping over the entire method body if the header is messed up:

method
    : header body
    / .* body
    ;

Here, we match any kind of header and then skipped past the entire body. There's no way a simple "scan until resynchronization set found" loop could scan past a nested construct like a method body.

Another important thing to mention is that the ".*" is not scanning until it sees the 1st token of rule body. It's a non-greedy loop that skips anything until a valid body is seen.

Proposed mechanism

To support their productions, rule alternative blocks will be extended to allow 0 or more error productions at the end. Rules would not support error productions. (I looked through a number of grammars and couldn't really find a situation in which the default error mechanism was seriously lacking.)

The normal mechanism, without error productions, traps mismatched token and no viable alternative errors within the productions of that rule. After reporting an error, it resynchronizes to follow set. (assuming we could not repair the error with single token insertion or deletion inline). If the user specified a catch expression, that code was executed in lieu of the standard mechanism. Any finally clause is executed after everything else, right before returning from the rule method. Rules look like this:

stat:   'return' INT ';'
    |   ID '=' expr ';'
    |   ID '(' expr (',' expr)* ')' ';'
    ;   
    catch[Exception e] { }
    finally { }

Here's a rule with a few error productions:

atom:   '(' expr ')'
    |   INT
    |   ID '(' expr (',' expr)* ')'
    /   '(' expr        // missing ')'
    /   ID '(' (expr|',') ')'         // missing >=1 comma somewhere in arg list
    ;

Boy, I'm sure having a hard time coming up with decent examples.