2. Parser

Parsing XML with ANTLR3

As soon as you have your lexer tuned to deliver the right tokens you need to think about the structure expressed by them. Or, you might say, reveal the inherent structure of the token stream coming from the lexer. This is the task of the token parser.  Most people simply call it parser, but some people also call the combination of the lexer/parser/tree parser the parser. That's why I like to use the term token parser. Anyway, to illustrate the job of the token parser consider the programming language Pascal. In Pascal the keywords begin and end mark the beginning and the end of a block. begin and end would be tokens delivered by the lexer and the token parser now has to reveal that anything in between is a block. Or, to stick to our XML example, a name between a TAG_START_OPEN and a TAG_CLOSE together is are a start tag. In ANTLR we describe it like that:

startTag  : TAG_START_OPEN GENERIC_ID TAG_CLOSE ;

Whereas the GENERIC_ID is the name of that specific start tag. Note that token names are usually all upper case and token parser rule names at least start with a lower case character. Other than that, syntax of the lexer and token parser description is unified. This means the general rule structure and most of the expressions - like ()+ and ()* - are the same both in the lexer as well as in the token parser. Which is good as you only have to learn one language! But wait, our rule isn't quite complete, we forgot about attributes:

startTag  : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;

attribute  : GENERIC_ID ATTR_EQ ATTR_VALUE ;

You have put the attribute definition into a separate rule to reuse it as a sub rule in the definition for the empty element tag:

emptyElement : TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;

Finally,  the definition for the end tag, which is easy:

endTag :  TAG_END_OPEN GENERIC_ID TAG_CLOSE ;

Using this grammar ANTLR can now identify all kinds of tags. This is the complete first version of the grammar:

parser  grammar XMLParser;
startTag  : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
attribute  : GENERIC_ID ATTR_EQ ATTR_VALUE ;
endTag :  TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
emptyElement : TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;

Save it to a file - I use xmlParser-version1.g - and run it through ANTLR just like the lexer grammar:

java  org.antlr.Tool xmlLParser-version1.g

ANTLR will know what to do with it because of the specific header.

So,  you have a parser now. But what's the use? What exactly is the effect of these token parser rules? In the lexer rules produce tokens once the characters coming in match it. Obviously, this is not the case here. Remember, what we wanted from the parser was to find out about the structure of our XML document. What you have is the structure of the tags inside that document which is only part of our overall structure. You also need to know which element is the root element, where the text and generally where all those tags are relatively to each other. It is important to know which tag is a child of another, what tags are siblings and so on. Marvel at this beautifully simple rule for the overall structure of an XML document:

element
    : startTag
        (element
        | PCDATA
        )*
        endTag
    | emptyElement
    ;

It matches all structures that either are an empty element or a complete element headed by a start tag and ended by an end tag. Such a complete element may contain text as PCDATA and other elements as sub elements. Both text and sub element can be mixed. That's it. That's the whole structure of an XML document! OK, we add this rule and also another to tell ANTLR where to start parsing. We also need to tune our parser to the types of the token coming from the lexer:

parser  grammar XMLParser;

options {      tokenVocab=XMLLexer; }
document  : element ;

element
    : startTag
        (element
        | PCDATA
        )*
        endTag
    | emptyElement
    ;

startTag  : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;

attribute  : GENERIC_ID ATTR_EQ ATTR_VALUE ;

endTag :  TAG_END_OPEN GENERIC_ID TAG_CLOSE ;

emptyElement : TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;

And this is the glue code to feed the token parser with the tokens coming from the lexer:

import  org.antlr.runtime.*;
import  org.antlr.runtime.tree.*;

public  class MainParser {
    public static void main(String[] args)  {
     try {
           CharStream input = new ANTLRFileStream(args[0]);
           XMLLexer lex = new XMLLexer(input);
           CommonTokenStream tokens = new  CommonTokenStream(lex);

           XMLParser parser = new XMLParser(tokens);
           parser.document();
       }  catch(Throwable t) {
           System.out.println("exception: "+t);
           t.printStackTrace();
       }
    }
}

Save this class, compile it and start the whole parser on the XML we were using for the lexer. Like this:

java  MainParser simpleXML.xml

And see: nothing! It may not seem so, but this is good as it indicates that we have successfully parsed the XML input. The structure actually has been revealed to be a tree, but we do not see much of it, yet.

In the next section we will construct a tree data structure that exactly matches the revealed one.

My siblings (including me):