/
4. Tree Parsing

4. Tree Parsing

Adding a tree parser stage

In the section about the token parser we revealed the structure of an XML document to be a tree. You did not see pretty much of it, though. The reason is that you did not actually build up any data structure from what you revealed. Such a data structure usually is described as an AST (abstract syntax tree), a tree structure that mimics the structure of the parsed input. In this section you will see how to build up such a structure and also how to traverse it using tree parsers.

Tree parsers do not consume flat streams of tokens, both rather structured trees of tokens generated by the token parser. Usually, all redundant data is stripped from those trees and their internal structure is most natural. One example of redundant data is an end tag. Its purpose in a flat stream is to indicate where an element ends. Such structure can, however, be expressed in a two dimensional tree. No need for an end tag. An example of a token that can be completely skipped is the equal sign (=) between attribute name and value. It is there for better human readability, but actually does not carry any meaning. Finally, there is good reason to consider an empty element equivalent to a start tag directly followed by an end tag. This can be unified.

Even though this is not a common approach, let us start with the tree grammar that looks most natural and simple. Later you can find out how tree construction must look like to produce a tree matching this grammar. To achieve this, let us remember that an XML document is a representation of a tree structure. Leaf nodes are either empty elements or text data. Non-leaf nodes are always elements that enclose other elements or text. Additionally, there are attributes and finally names that are associated to elements. It turns out that you can easily write all this in a single rule:

element
    : ^( ELEMENT
            GENERIC_ID
            ( ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) )*
            (element | PCDATA )*
        )
    ;

Cute, isn't it? This rule expresses the whole stammering above. The ^ symbol before an opening bracket indicates a hierarchical tree structure with the first item after the bracket being the root of the tree and the rest being the children. In this case we have a token called ELEMENT that tells us what follows actually is an element. The first child is GENERIC_ID which is the name of the element, then we have a list of attributes and finally the element's sub elements potentially mixed with text. All attributes are again structured as a tree.

To complete the tree grammar we just need to add the header that tells the tree parser to use the same token vocabulary as the lexer and the parser and we are done:

tree grammar XMLTreeParser;
options {
    tokenVocab=XMLLexer;
}

For real processing of the tree you certainly would want to access the text of the tokens. In ANTLR 3 you do this by

$name.text

where name is the label of a tree element like name=GENERIC_ID. You need to tell ANTLR about the type of your tree and also need to introduce a start rule. This is the complete tree grammar that simply dumps the whole AST back to XML:

tree grammar XMLTreeParser;
options {
    tokenVocab=XMLLexer;
    ASTLabelType = Tree;
}

document : element ;

element
    : ^( ELEMENT name=GENERIC_ID
            { System.out.print("<"+$name.text); }
            (
                ^(ATTRIBUTE attrName=GENERIC_ID value=ATTR_VALUE)
                { System.out.print(" "+$attrName.text+"="+$value.text); }
            )*
            { System.out.println(">"); }
            (element
            | text=PCDATA
                { System.out.print($text.text); }
            )*
            { System.out.println("</"+$name.text+">"); }
        )
    ;

You now have to augment the parser you have already seen above with tree construction statements. In ANTLR 3 tree creation and matching syntax is very similar. You can actually declare how the constructed tree is supposed to look like by duplicating the grammar fragment that matches the tree. E.g. for the tree construction of the attribute part we could write:

attribute : GENERIC_ID ATTR_EQ ATTR_VALUE -> ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) ;

where the tree construction part that comes after the -> is exactly the same as the fragment seen in the tree grammar above. Isn't that cool? It, however, turns out that tree construction declaration by example isn't sufficient in all scenarios. For example we want our end tag to be omitted completely. That's what the ! operator is for. You can either apply it to a full rule by placing it behind a rule name like in

endTag! : TAG_END_OPEN GENERIC_ID TAG_CLOSE;

or to a single item (endTag again) in a rule like in

element
    : ( startTag^
            (element
            | PCDATA
            )*
            endTag!
        | emptyElement
        )
    ;

You will notice that this is the element rule you already are familiar with. Next to the ! we can see the double ^ here. This operator declares startTag to be the root of the generated tree. As you can see, this does not only work for a single token, but also for complete sub trees! Additionally, we make use of the automatic tree construction feature already known from ANTLR 2. Left hand items without declaration will simply result in a flat tree node.

Finally, we have the two rules for start tags and empty elements

startTag : TAG_START_OPEN GENERIC_ID attribute* TAG_CLOSE
        -> ^(ELEMENT GENERIC_ID attribute*)
    ;

emptyElement : TAG_START_OPEN GENERIC_ID attribute* TAG_EMPTY_CLOSE
        -> ^(ELEMENT GENERIC_ID attribute*)
    ;

which generate a unified tree for both complete as well as for empty elements. As you might have noticed, we are using new token names which we have to declare. This results in this grammar head:

parser grammar XMLParser;
options {
    tokenVocab=XMLLexer;
    output=AST;
}

tokens {
    ELEMENT;
    ATTRIBUTE;
}

That's our complete parser grammar that actually creates an AST:

parser grammar XMLParser;
options {
    tokenVocab=XMLLexer;
    output=AST;
}

tokens {
    ELEMENT;
    ATTRIBUTE;
}

document : element ;

element
    : ( startTag^
            (element
            | PCDATA
            )*
            endTag!
        | emptyElement
        )
    ;

startTag : TAG_START_OPEN GENERIC_ID attribute* TAG_CLOSE
        -> ^(ELEMENT GENERIC_ID attribute*)
    ;

attribute : GENERIC_ID ATTR_EQ ATTR_VALUE -> ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) ;

endTag! : TAG_END_OPEN GENERIC_ID TAG_CLOSE;

emptyElement : TAG_START_OPEN GENERIC_ID attribute* TAG_EMPTY_CLOSE
        -> ^(ELEMENT GENERIC_ID attribute*)
    ;

Never has tree construction been easier! Finally, the glue code:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
    public static void main(String[] args) {
     try {
         CharStream input = new ANTLRFileStream(args[0]);
         XMLLexer lex = new XMLLexer(input);

         CommonTokenStream tokens = new CommonTokenStream(lex);
         XMLParser parser = new XMLParser(tokens);
         XMLParser.document_return root = parser.document();
         System.out.println("tree="+((Tree)root.tree).toStringTree());

         CommonTreeNodeStream nodes = new CommonTreeNodeStream((Tree)root.tree);
         XMLTreeParser walker = new XMLTreeParser(nodes);
         walker.document();
     } catch(Throwable t) {
         System.out.println("exception: "+t);
         t.printStackTrace();
     }
    }
}

When you feed in an example that is a bit more complex, like:

<t>Starting text inside t<t attrt1="valuet1" attrt2="valuet2"><t1>Text inside t1<empty/></t1><empty2/><empty3/></t></t>

This is what your parser spits out:

tree=
(ELEMENT t Starting text inside t (ELEMENT t (ATTRIBUTE attrt1 "valuet1") (ATTRIBUTE attrt2 "valuet2")
(ELEMENT t1 Text inside t1 (ELEMENT empty)) (ELEMENT empty2) (ELEMENT empty3)))
<t>
Starting text inside t<t attrt1="valuet1" attrt2="valuet2">
<t1>
Text inside t1<empty>
</empty>
</t1>
<empty2>
</empty2>
<empty3>
</empty3>
</t>
</t>

The first part of the output is the result of the toStringTree() method call from our glue code. The formatted XML is printed by the actions inside our tree parser. And, wow! So, what you have built is an XML unifier and pretty printer.

As you have now mastered this complete trail through all ANTLR parser types you may ask yourself where to go from here. A good idea is to have a look at the examples provided with ANTLR3. Another good starting point is the ANTLR3 main page and of course the Wiki documentation which this tutorial is part of. If you have any questions concerning ANTLR or this tutorial or any fixes or enhancements to this tutorial feel free to contact the ANTLR mailing list. Obvious corrections to this tutorial can be applied directly to the Wiki.

My siblings (including me):