Custom Syntax Error Recovery

Custom Syntax Error Recovery

An important part of a robust and useful parser is the behavior it exhibits when the next token in its incoming token stream is not one that the grammar specifies should be there. This is generally known as a syntax error. Here we look at how ANTLR recovers from the various mismatch cases and what techniques you can use to override or influence that recovery. The examples shown use the Java language, but the same techniques should apply to all ANTLR targets.

ANTLR implements good error recovery mechanisms by default within the runtimes of all target languages but in some cases the way a grammar is structured impairs the ability of the algorithms to recover to exactly where you might expect or wish. Sometimes your grammar rules cause a loop in a parsing rule to be exited earlier than you would expect or want; sometimes when a certain construct is in error you want to skip everything up to the end of that construct instead of resuming at wherever ANTLR sees a token that looks like it is a valid recovery point. There are many reasons you may wish to influence or override the standard recovery behavior. However, before we can examine how to implement your own recovery, we need to know something about how ANTLR recovers from mismatch problems.

Recognition Exception

Once a mismatch is detected, the generated code causes the target language equivalent of the Java Runtime's RecognitionException to be thrown or otherwise handled. If you examine a method generated for a parser rule in the Java target, you will see that the rule logic is encapsulated within a try {} catch {} block, which catches RecognitionException:

Generated Exception Handler
catch (RecognitionException re) {
    reportError(re);
    recover(input,re);
}

You can see that the exception handler does two things, the first of which is to report the error via whatever mechanism is available. Error reporting is covered elsewhere in the Wiki but essentially, if the parser is not already in error recovery mode, then reportError() increments the count of errors it has seen and calls displayRecognitionError(), which is generally the method that one overrides to provide custom display/capturing of parser error messages.

The second thing that happens after reporting the error is that the recover() method is called - it is this method that attempts to resync the token input stream to some place that makes sense within the context of the current parser rule and its parsing point within that rule. In order to do this, it must compute a Follow Set for the current parsing point - note that I am trying to avoid technical definitions and jargon here in favor of understanding. A follow set is basically a list of all the tokens that would be valid if they were seen at the this point in the parse. The recover() method then basically consumes tokens from the input stream until it locates a token that exists within this set. It may consume no tokens if it feels that there is actually a token missing.

Implementing Your Own Exception Handler

So, one obvious way we could influence recovery is to override the default implementation of the recover() method by either adding your own implementation in @parser::members() or better yet, in your own superClass (see options {}). However, this is fine if the recovery should always take place in the same way regardless of what the parse point is, but if that was the case, you probably would not have found this article, as the default implementation of recover() probably does what you want anyway.

When you want to do something special for a particular rule, you implement your own handler for the RecognitionException, and recover in whatever manner seems appropriate. Generally, you will still call reportError(), unless you wish to suppress the error, but then you can use your own method/algorithm to recover the input point. To do this, you add a catch clause to a rule:

Custom Exception Handler
myRule:
    TOK TOK1 TOK2 subRule?
;
catch [RecognitionException re] {
  
    // First, let's report the error as the user needs to know about it
    //
    reportError(re);
    myRecoverMethodForMyRule();
}

You will see the kind of things that can be done in recover() by examining the source code for the default implementation, or reading further in this article (recommended first).

Early Termination of Parsing Loops - Finer Grained Recovery

There are many situations where a parsing loop aborts prematurely using the standard implementations of recovery. An example helps to illustrate this and here we use a fairly typical rule that parses some kind of class definition (this is actually extracted from the JavaFX parser).

Grammar for Class Definition
class
: CLASS 
    name 
    supers			
    LBRACE 
        (
            classMember
        )*
    RBRACE
;

This grammar is perfectly sound and will indeed parse a class definition correctly, assuming the subrules are correct. However it creates a recovery problem as it stands in that quite a large number of the possible syntax errors found within the rule classMember will cause the * loop to exit, issue an error about an expected RBRACE and return to the calling rule. Basically an error in the classMember will abandon the parse of the class definition.

Clearly, when we return from an attempted parse of a classMember, we need to ensure that we resync the input stream to either the start of a new classMember, or the closing RBRACE. Now, this example also serves to highlight the limitations of any recovery method - we are not going to parse the input stream to try and make sense of it, just try to discard things that cannot possibly make sense at this point in the parse. However, by doing this, we will at least keep the classMember* loop functioning, even if we issue more errors while we try to get back in sync. This should be our goal because when we issue errors, we wish them to be as local to the actual problem as possible. We can now issue errors that will say "While trying to parse class XYZ" rather than "While parsing a script", and within classMember we can apply similar techniques, so that we stay within the class member parsing loops as long as we can.

So, how can we implement such a resync? Well, we could of course write a custom method for each parse point, and call it in an action directly after the classMember rule. However, as well as being tedious, it is prone to error as the grammar develops because you must remember to manually update the set of tokens that are valid recovery points for any case. So, before attempting this, let us pause and examine what ANTLR can provide us automatically.

First and Follow sets

I have tried to keep this article as free from jargon as possible, but here we must use a little of it. As ANTLR parses your grammar it constructs an NFA for each block of alternatives in a rule or subrule. When this process is complete, ANTLR can then compute the First and Follow sets for rules.

The First set, as its name implies, is the computation of all the tokens that indicate a rule should be selected (or similar decisions), for instance: X ruleY? Z - the First set for ruleY is the set of tokens that show ruleY should be invoked. The First set is used by ANTLR for code generation and analysis and is not available to us when the parser is running. The Follow set is used for standard error recovery and as its name also implies, is the set of tokens that should follow on when a rule has finished parsing. At runtime, before any rule is invoked from any other rule, the pre-calculated follow set is made available on a stack, and we can use the top set (or indeed any number of stacked sets) for whatever we like.

In our example rule, we clearly want to know what the First set is for the rule classMember, as when we apply our resync, this is the set of tokens that we should resync to (as well as the RBRACE that closes the class definition). However, as just noted, we do not have access to the First set at runtime, so how can we get ANTLR to do all the hard work for us?

Hijacking Follow Sets as First Sets

In order to have ANTLR compute the required recovery set for us, we first note that the set we want is not just the First set of the rule classMember, but also must include the closing RBRACE. This in fact is the Follow Set for the rule classMember in this production, and hence we can hijack that set and use it as our resync set. But, we want all this to happen automatically, and it would be better if we could write one resync rule that we could use (judiciously) anywhere in our grammar. In fact, all we need do is utilize a simple trick, which is perfectly 'legal' and will do all the calculations for us.

The key to this trick is to create a parser rule that consists only of a single empty alt (epsilon set in jargon), then use action code within the rule to resync the input stream to the Follow Set that it finds on the top of the stack. ANTLR will then invoke this empty rule but will first place the Follow set for that rule on the recovery stack. Because the rule itself is empty, it will add no additional tokens to the set and we can infer that - for our example - the Follow Set it computes will be the combination of the First set for the classMember rule and the RBRACE that ends the class definition. It now only remains to sync the input stream to this set within our empty rule. Because our rule is generic, we can use it anywhere in our grammar that it is required, and ANTLR will supply us with correct set of tokens for the location it is used in our rules.

Here is what the empty rule looks like:

Empty Grammar Rule for Custom Error Recovery
syncClass[ListBuffer<JFXTree> mems]
@init
{
    // Consume any garbled tokens that come before the next statement
    // or the end of the block. The only slight risk here is that the
    // block becomes MORE inclusive than it should but as the script is
    // in error, this is a better course than throwing out the block
    // when the error occurs and screwing up the whole meaning of
    // the rest of the token stream.
    //
    syncToSet();}
    :   // Deliberately match nothing, causing this rule always to be 
        // entered.
    ;

We can now use this new rule in our class rule as follows;

Grammar for Class Definition
class
: CLASS 
    name 
    supers			
    LBRACE 
        // Consume any garbled declarations so that we don't drop out
        // of parsing the class until we hit '}' for the class definition
        // or get to something that is so garbled we have no choice.
        // This first invocation consumes any garbage before the first member
	syncClass
        (
            classMember
            // Now invoke our sync rule again, this time within the loop
            // so that if the classMember rule could not recover anything
            // sensible, then we will at least have a good shot at resyncing
            // to the next definition and not dropping out of this loop.
            // This approach will fail if the class member was very badly 
            // in error, but we will keep resyncing until we hit the end of the
            // the class or really find a new class member. We could of course
            // use a custom action instead if there were special rules about 
            // recovery for this language - here we just don't want to drop out
            // of the loop unless there is no real choice.
            //
            syncClass
        )*
    RBRACE
;

The Recovery Code

We have our grammar in order now, so it remains only to implement the syncToSet() method in our target language. The followSet is implemented in all language targets, so if you are not using Java, then you can still do this, but must adapt the code to the semantics of your target language. Here is what the code looks like:

Java Implementation of Custom Resync
    /**
     * Use the current stacked followset to work out the valid tokens that
     * can follow on from the current point in the parse, then recover by
     * eating tokens that are not a member of the follow set we compute.
     *
     * This method is used whenever we wish to force a sync, even though
     * the parser has not yet checked LA(1) for alt selection. This is useful
     * in situations where only a subset of tokens can begin a new construct
     * (such as the start of a new statement in a block) and we want to
     * proactively detect garbage so that the current rule does not exit on
     * on an exception.
     *
     * We could override recover() to make this the default behavior but that
     * is too much like using a sledge hammer to crack a nut. We want finer
     * grained control of the recovery and error mechanisms.
     */
    protected void syncToSet()
    {
        // Compute the followset that is in context wherever we are in the
        // rule chain/stack
        //
         BitSet follow = state.following[state._fsp]; //computeContextSensitiveRuleFOLLOW();

         syncToSet(follow);
    }

    protected void syncToSet(BitSet follow)
    {
        int mark = -1;

        try {

            mark = input.mark();

            // Consume all tokens in the stream until we find a member of the follow
            // set, which means the next production should be guaranteed to be happy.
            //
            while (! follow.member(input.LA(1)) ) {

                if  (input.LA(1) == Token.EOF) {

                    // Looks like we didn't find anything at all that can help us here
                    // so we need to rewind to where we were and let normal error handling
                    // bail out.
                    //
                    input.rewind();
                    mark = -1;
                    return;
                }
                input.consume();

                // Now here, because you are consuming some tokens, yu will probably want
                // to raise an error message such as "Spurious elements after the class member were discarded"
                // using whatever your override of displayRecognitionError() routine does to record
                // error messages. The exact error my depend on context etc.
                //
            }
        } catch (Exception e) {

          // Just ignore any errors here, we will just let the recognizer
          // try to resync as normal - something must be very screwed.
          //
        }
        finally {

            // Always release the mark we took
            //
            if  (mark != -1) {
                input.release(mark);
            }
        }
    }

Using this code, we can sync to either the follow set currently on the stack, or to a set of our own construction (should we have anything more special to do). It should be obvious that our method can do anything it likes of course and specialized recovery for a particular rule is entirely possible.

Other Recovery Mechanisms Within ANTLR Runtimes

There is one other aspect of recovery which you may need to customize, and that is what happens when a mismatch() occurs. You will see in the generated code that there are lots of calls to the match() method. Inspecting the default implementation (in the Java runtime) we find that the match method will call the method recoverFromMismatchedToken() and this in turn will try to use the current Follow set stack to determine if the reason we mismatched is that there was a spurious token in the input: X Y Z when we wanted just X Z, or a missing token: X Z when we wanted X Y Z. If ANTLR can determine, using the Follow sets, that by skipping a token, it would see valid syntax, then it will consume the spurious token, report the extra token, but will not raise a RecognitionException. Similarly, if ANTLR can see that there is exactly one token missing from the input stream, which if present, would make the syntax valid, then it will manufacture this missing token, report the error, but again will not raise the RecognitionException.

If you want behavior that is different to this, then you can override the match() method, or more likely, the recoverFromMismatchedToken() method. Perhaps you do not want the spurious/missing error detection? Or, as you will see from the default implementation, ANTLR will first see if it can fix things by ignoring a token, then go on to see if it can fix things by adding a token. However, there are some syntax errors that can be recovered using either method - perhaps you want to reverse the order that these strategies are tried?

Conclusion

The built-in recovery mechanisms of ANTLR are suitable for almost all general parsing problems; however we can clearly see that there are cases where you must take things in to your own hands. Don't be afraid to do this - generic mechanisms, by their very nature cannot cover all your specialized cases and all the language targets allow you to override the default behaviors.