Can you explain ANTLR's tree construction facilities?

Introduction

ANTLR v3 has moved away from using the ANTLR v2 child-sibling trees for the common tree implementation. Instead, ANTLR v3 uses a simple list-of-children approach by default. To avoid locking everyone into the same implementation, ANTLR uses an approach similar to the TreeModel in Java's Swing tree viewing classes. All tree nodes are of type Object and an adaptor, TreeAdaptor, indicates how to create nodes and constructs trees--ANTLR does not care about the type at all. You could even make the Token objects double as AST nodes to avoid allocating a second tree node object to hold the Token as payload.

ANTLR builds trees when you use the option output=AST.

grammar T;
options {output=AST;}
a : ID INT;
ID : 'a'..'z'+ ;
INT : '0'..'9'+;
WS : (' '|'\n') {channel=99;} ;

By default, a CommonTreeAdaptor is created for the parser:

protected TreeAdaptor adaptor = new CommonTreeAdaptor();

The generated code for rule a() has calls to the adaptor to construct nodes and add them to the current rules subtree:

root_0 = (Object)adaptor.nil();
ID1=(Token)input.LT(1);
match(input,ID,FOLLOW_ID_in_a16); 
ID1_tree = (Object)adaptor.create(ID1);
adaptor.addChild(root_0, ID1_tree);
INT2=(Token)input.LT(1);
match(input,INT,FOLLOW_INT_in_a18); 
INT2_tree = (Object)adaptor.create(INT2);
adaptor.addChild(root_0, INT2_tree);

A nil node is used to hold a flat tree--a list of children. Here, ID and INT nodes are created and added to the nil root_0.

All rules return a tree with output=AST. In this case, the following return value structure is created and used by a():

public static class a_return extends ParserRuleReturnScope {
    Object tree;
    public Object getTree() { return tree; }
};

Before returning the tree, a rule does some postprocessing:

// post processing is only ^(nil x) -> x so far
retval.tree = (Object)adaptor.rulePostProcessing(root_0);
adaptor.setTokenBoundaries(retval.tree, retval.start, retval.stop);

An important advantage that v3 has over v2 is that v3 automatically computes and stores the range of tokens associated with the subtree created for each rule. For example, imagine the simple expression 15 + 21. The root "plus" node created by the associated grammar rule will store token boundaries 0..4 assuming that the expression as the only input and that is broken down into five tokens (including the whitespaces). This information is extremely useful later when generating code because sometimes you want to replace sections of the input with a translation computed from the tree; you need to know the corresponding input tokens to replace.

Default tree types

ANTLR does not assume any particular tree structure or data type, but of course it must have a default implementation. v3 chooses a simple but powerful implementation where each node has a list of children and a payload consisting of the token from which the node was created. For imaginary nodes, a CommonToken is created from the imaginary token type and used as the payload.

The CommonTreeAdaptor creates CommonTree objects that satisfy interface Tree, which among a few other things has:

Tree getChild(int i);

int getChildCount();

/** Add t as a child to this node.  If t is null, do nothing.  If t
 *  is nil, add all children of t to this' children.
 */      
void addChild(Tree t);

/** Indicates the node is a nil node but may still have children,
 *  meaning the tree is a flat list.
 */      
boolean isNil();

The tree part of the functionality is abstracted into BaseTree, which is a generic tree implementation with no payload. You must subclass it to have any user data. An empty, but non-null node is called "nil". A flat tree (a list) is an empty nil node whose children represent the list. BaseTree will be useful to you if you want the core tree functionality but a payload other than the Token.