6. Advanced processing

Advanced processing: Scopes, Validating Predicates and Error handling

As Johannes observed, the parser grammar we have so far does not check whether start and end tags match. This means that XML like the following is accepted, even though it should be rejected:

<root>
  <tag1>error</tag2>
  <tag1>ok</tag1>
</root>

Note that the end tag </tag2> does not match the start tag <tag1>! Fortunately, Johannes also had a solution: in the rule that matches the end tag, we compare its name with the name of the element currently open. We do this using a so-called validating semantic predicate. We also need a way to store the name of the element currently open, and since open elements can be nested, a stack is the natural choice. ANTLR supports this with so-called "scopes". Before we go into details, here is the complete grammar:

scope ElementScope {
  String currentElementName;
}

document : element ;

element
scope ElementScope;
    : ( startTag^
            (element
            | PCDATA
            )*
            endTag!
        | emptyElement
        )
    ;

startTag
    : TAG_START_OPEN GENERIC_ID attribute* TAG_CLOSE
            {$ElementScope::currentElementName = $GENERIC_ID.text; }
        -> ^(ELEMENT GENERIC_ID attribute*)
    ;

attribute : GENERIC_ID ATTR_EQ ATTR_VALUE -> ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) ;

endTag!
    : { $ElementScope::currentElementName.equals(input.LT(2).getText()) }?
      TAG_END_OPEN GENERIC_ID TAG_CLOSE
    ;

emptyElement : TAG_START_OPEN GENERIC_ID attribute* TAG_EMPTY_CLOSE
        -> ^(ELEMENT GENERIC_ID attribute*)
    ;

Rule endTag now contains the predicate, which checks whether the text stored in currentElementName equals the name of the end tag. Here input.LT(2) is the second token of lookahead, that is, the GENERIC_ID following TAG_END_OPEN. The name is stored by the action in rule startTag. ElementScope refers to the scope declared right at the top of the grammar, which contains a single string named currentElementName. Every time rule element is invoked, a new instance of that scope is created and pushed onto an implicit stack. Any rule called from element can access the members of that scope, and that is what rules startTag and endTag do. At the end of rule element the scope is popped off the stack again, so the name of the enclosing element becomes the current one once more.
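
To make the push and pop behaviour more concrete, here is a minimal plain-Java sketch of what the scope stack does conceptually. It is not the code ANTLR generates; the class ScopeStackSketch and all of its method names are made up for illustration, and the comparison is written null-safe, unlike the predicate in the grammar.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; this is NOT the code ANTLR generates.
class ScopeStackSketch {
    // Mirrors "scope ElementScope { String currentElementName; }" from the grammar.
    static class ElementScope {
        String currentElementName;
    }

    // The implicit stack: one entry per invocation of rule element still in progress.
    private final Deque<ElementScope> stack = new ArrayDeque<>();

    // Entering rule element pushes a fresh, empty scope.
    void enterElement() { stack.push(new ElementScope()); }

    // Leaving rule element pops it again, exposing the enclosing element's scope.
    void exitElement() { stack.pop(); }

    // The action in startTag stores the tag name in the topmost scope.
    void startTag(String name) { stack.peek().currentElementName = name; }

    // The predicate in endTag compares the end tag name with the topmost scope.
    boolean endTagMatches(String name) {
        return name.equals(stack.peek().currentElementName);
    }

    public static void main(String[] args) {
        ScopeStackSketch s = new ScopeStackSketch();
        s.enterElement(); s.startTag("root");        // <root>
        s.enterElement(); s.startTag("tag1");        //   <tag1>
        System.out.println(s.endTagMatches("tag2")); // false: </tag2> is wrong here
        System.out.println(s.endTagMatches("tag1")); // true
        s.exitElement();                             // inner element is done
        System.out.println(s.endTagMatches("root")); // true: outer scope is on top again
    }
}
```

Because each nested element gets its own scope instance, an inner element never overwrites the name recorded for its parent.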

Running this grammar on the XML file above, the parser now reports

line 2:13 rule endTag failed predicate: { $ElementScope::currentElementName.equals(input.LT(2).getText()) }?

which is perfectly correct in itself, but likely to confuse a user who knows nothing about ANTLR, grammars, or predicates. ANTLR does offer a mechanism for adding custom error handling, though. We add it to rule endTag:

endTag!
    : { $ElementScope::currentElementName.equals(input.LT(2).getText()) }?
      TAG_END_OPEN GENERIC_ID TAG_CLOSE
    ;
catch [FailedPredicateException fpe] {
    String hdr = getErrorHeader(fpe);
    String msg = "end tag (" + input.LT(2).getText() +
                 ") does not match start tag (" +
                 $ElementScope::currentElementName +
                 ") currently open, closing it anyway";
    emitErrorMessage(hdr+" "+msg);
    consumeUntil(input, TAG_CLOSE);
    input.consume();
}

This makes ANTLR generate a catch block for the FailedPredicateException in the code for rule endTag. Our handler does two useful things:

  1. It displays a much more meaningful error message that tells the user which tag was closed and which tag should have been closed.
  2. As announced in that message, it manually resynchronizes the input stream by skipping the offending end tag, silently assuming it was meant to be the correct closing tag. This allows the parse to continue in a meaningful way.
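
The resynchronization in step 2 can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: the token stream is modelled as a list of token type names, and ResyncSketch and skipPastTagClose are hypothetical names, not part of the ANTLR runtime.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the recovery step; not ANTLR runtime code.
class ResyncSketch {
    // Returns the position just past the next TAG_CLOSE at or after pos.
    static int skipPastTagClose(List<String> tokens, int pos) {
        // Like consumeUntil(input, TAG_CLOSE): stop AT the TAG_CLOSE token
        // (or at the end of input) without consuming it.
        while (pos < tokens.size() && !tokens.get(pos).equals("TAG_CLOSE")) {
            pos++;
        }
        // Like input.consume(): step over the TAG_CLOSE token itself.
        if (pos < tokens.size()) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Token types for "</tag2>" followed by the start of "<tag1>".
        List<String> tokens = Arrays.asList(
            "TAG_END_OPEN", "GENERIC_ID", "TAG_CLOSE",
            "TAG_START_OPEN", "GENERIC_ID", "TAG_CLOSE");
        // The predicate fails before anything of the end tag is consumed, so pos is 0.
        System.out.println(skipPastTagClose(tokens, 0)); // 3
    }
}
```

After the skip, parsing resumes at the first token following the mismatched end tag, exactly as if the end tag had been matched.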

There is still much more to ANTLR. Expect this tutorial to be extended with every issue that pops up.