Tree construction

There are two mechanisms in v3 for building abstract syntax trees (ASTs): operators and rewrite rules.

Operators

Nodes created for unmodified tokens and trees for unmodified rule references are added to the current subtree as children.

Operator

Description

!

do not include node or subtree (if referencing a rule) in subtree

^

make node root of subtree created for entire enclosing rule even if nested in a subrule

additiveExpression
	:	multiplicativeExpression ('+'^ multiplicativeExpression)*
	;

That is the same as the following in rewrite notation from the following section:

additiveExpression
	:	(a=multiplicativeExpression->$a) // set result
                (    '+' b=multiplicativeExpression
                     -> ^('+' $additiveExpression $b) // use previous rule result
                )*
	;

Rewrite rules

The rewrite syntax is more powerful than the operators. It suffices for most common tree transformations.
While the parser grammar specifies how to recognize input, the rewrites are generational grammars, specifying how to generate output. ANTLR figures out how to map input to output grammar. To create an imaginary node, just mention it like the following example (UNIT is a node created from an imaginary token and is used to group the compilation unit chunks):

compilationUnit
    :   packageDefinition? importDefinition* typeDefinition+
        -> ^(UNIT packageDefinition? importDefinition* typeDefinition*)
    ;

ANTLR tracks all elements with the same name into a single implicit list:

formalArgs
	:	formalArg (',' formalArg)* -> formalArg+
	|
	;

If the same rule or token is mentioned twice you generally must label the elements to distinguish them. If you want to combine multiple elements into a single list, list labels are very handy (though in this case since they have the same name ANTLR will automatically combine them):

('implements' i+=typename (',' i+=typename)*)?

Here is the entire rule:

classDefinition[MantraAST mod]
	:	'class' cname=ID
		('extends' sup=typename)?
		('implements' i+=typename (',' i+=typename)*)?
		'{'
		(	variableDefinition
		|	methodDefinition
		|	ctorDefinition
		)*
		'}'
		-> ^('class' ID {$mod} ^('extends' $sup)? ^('implements' $i+)?
		     variableDefinition* ctorDefinition* methodDefinition*
		    )
	;

Note that using a simple action in a rewrite means evaluate the expression and use as a tree node or subtree. The mod argument is a set of modifiers passed in from an enclosing rule.

Deleting tokens or rules is easy: just don't mention them:

packageDefinition
	:	'package' classname ';' -> ^('package' classname)
	;

If you need to build different trees based upon semantic information, use a semantic predicate:

variableDefinition
	:	modifiers typename ID ('=' completeExpression)? ';'
		-> {inMethod}? ^(VARIABLE ID modifiers? typename completeExpression?)
		->             ^(FIELD ID modifiers? typename completeExpression?)
	;

where inMethod is set by the method rule.

Often you will need to build a tree node from an input token but with the token type changed:

compoundStatement
	:	lc='{' statement* '}' -> ^(SLIST[$lc] statement*)
	;

SLIST by itself is a new node based upon token type SLIST but it has no line/column information nor text. By using SLIST[$lc], all information except the token type is copied to the new node.

Using a rewrite rule at a non-extreme-right-edge-of-production location is ok, but it still always sets the overall subtree for the enclosing rule.

'if' '(' equalityExpression ')' s1=statement
( 'else' s2=statement -> ^('if' ^(EXPR equalityExpression) $s1 $s2)
|                     -> ^('if' ^(EXPR equalityExpression) $s1)
)

You may reference the previous subtree for the enclosing rule using $rulename syntax

postfixExpression
	:	(primary->primary) // set return tree
		(	lp='(' args=expressionList ')' -> ^(CALL $postfixExpression $args)
		|	lb='[' ie=expression ']'       -> ^(INDEX $postfixExpression $ie)
		|	dot='.' p=primary              -> ^(FIELDACCESS $postfixExpression $p)
		|	c=':' cl=closure[false]        -> ^(APPLY ^(EXPR $postfixExpression) $cl)
		)*
	;

Imaginary nodes

References to tokens with rewrite not found on left of -> are imaginary tokens.

d : type ID ';' -> ^(DECL type ID) ; // DECL is imaginary

or

call : lp='(' ID args ')' -> ^(CALL[$lp] ID args) ;

Here, the CALL node has its line/column info set with info from '(' token. CALL node is "derived" from the '('.

Even tokens referenced within alternative result in nodes disassociated with tokens from left of -> if you put arguments on the references:

a : INT -> INT["99"] ; // node created from adaptor.create(INT, "99")

Tree construction during tree parsing

ANTLR 3.0.1 could not create trees during tree parsing. 3.1 introduces the ability to create a new AST from an incoming AST using rewrites rules:

  • Each rule returns a new tree.
  • An alternative without a rewrite duplicates the incoming tree.
  • The tree returned from the start rule is the new tree.
  • The new tree created with output=AST in a tree grammar is completely independent of the input tree as all nodes are duplicated (with and without rewrite -> operator).

The rewrites work just like they do for normal parsing:

a : INT ; // duplicate INT node and return
a : ID -> ; // delete ID node from tree
a : INT ID -> ID INT ; // reorder nodes
a : ^(ID INT) -> ^(INT ID) ; // flip order of nodes in tree
a : INT -> INT["99"] + // make new INT node
a : (^(ID INT))+ -> INT+ ID+ ; // break apart trees into sequences

Predicates can be used to choose between rewrites as well:

a : ^(ID INT) -> {some test}? ^(ID["ick"] INT)
              -> INT
  ;

Don't forget the wildcard (smile)

s : ^(ID c=.) -> $c ; // new tree is whatever matched wildcard

Polynomial differentiation example

For translations whose input and output languages are the same, it often makes sense to build a tree and them morph it towards the final output tree, which can then be converted to text. Polynomial differentiation is a great example of this. Recall that:

  • d/dx(n) = 0
  • d/dx(x) = 1
  • d/dx(nx) = n
  • d/dx(nx^m) = nmx^m-1
  • d/dx(foo + bar) = d/dx(foo) + d/dx(bar)

Ok, here's a parser that builds nice trees.

grammar Poly;
options {output=AST;}
tokens { MULT; } // imaginary token

poly: term ('+'^ term)*
    ;

term: INT ID  -> ^(MULT["*"] INT ID)
    | INT exp -> ^(MULT["*"] INT exp)
    | exp
    | INT
    | ID
    ;
exp : ID '^'^ INT
    ;

ID  : 'a'..'z'+ ;
INT : '0'..'9'+ ;
WS  : (' '|'\t'|'\r'|'\n')+ {skip();} ;

Then we differentiate:

tree grammar PolyDifferentiator;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=AST;
//  rewrite=true; // works either in rewrite or normal mode
}

poly:   ^('+' poly poly)
    |   ^(MULT INT ID)      -> INT
    |   ^(MULT c=INT ^('^' ID e=INT))
        {
        String c2 = String.valueOf($c.int*$e.int);
        String e2 = String.valueOf($e.int-1);
        }
                            -> ^(MULT["*"] INT[c2] ^('^' ID INT[e2]))
    |   ^('^' ID e=INT)
        {
        String c2 = String.valueOf($e.int);
        String e2 = String.valueOf($e.int-1);
        }
                            -> ^(MULT["*"] INT[c2] ^('^' ID INT[e2]))
    |   INT                 -> INT["0"]
    |   ID                  -> INT["1"]
    ;

then we simplify (a little anyway):

tree grammar Simplifier;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=AST;
    backtrack=true;
//  rewrite=true; // works either in rewrite or normal mode
}
/** Match some common patterns that we can reduce via identity
 *  definitions.  Since this is only run once, it will not be perfect.
 *  We'd need to run the tree into this until nothing
 *  changed to make it correct.
 */
poly:   ^('+' a=INT b=INT)  -> INT[String.valueOf($a.int+$b.int)]

    |   ^('+' ^('+' a=INT p=poly) b=INT)
                            -> ^('+' $p INT[String.valueOf($a.int+$b.int)])
    |   ^('+' ^('+' p=poly a=INT) b=INT)
                            -> ^('+' $p INT[String.valueOf($a.int+$b.int)])
    |   ^('+' p=poly q=poly)-> {$p.tree.toStringTree().equals("0")}? $q
                            -> {$q.tree.toStringTree().equals("0")}? $p
                            -> ^('+' $p $q)
    |   ^(MULT INT poly)    -> {$INT.int==1}? poly
                            -> ^(MULT INT poly)
    |   ^('^' ID e=INT)     -> {$e.int==1}? ID
                            -> {$e.int==0}? INT["1"]
                            -> ^('^' ID INT)
    |   INT
    |   ID
    ;

Finally we walk the tree to print it back out using simple templates:

tree grammar PolyPrinter;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=template;
}

poly:   ^('+'  a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a>+<b>"
    |   ^(MULT a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a><b>"
    |   ^('^'  a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a>^<b>"
    |   INT                     -> {%{$INT.text}}
    |   ID                      -> {%{$ID.text}}
    ;

Here is a test rig:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
    public static void main(String[] args) throws Exception {
        CharStream input = null;
        if ( args.length>0 ) {
            input = new ANTLRFileStream(args[0]);
        }
        else {
            input = new ANTLRInputStream(System.in);
        }

        // BUILD AST
        PolyLexer lex = new PolyLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lex);
        PolyParser parser = new PolyParser(tokens);
        PolyParser.poly_return r = parser.poly();
        System.out.println("tree="+((Tree)r.tree).toStringTree());

        // DIFFERENTIATE
        CommonTreeNodeStream nodes = new CommonTreeNodeStream((Tree)r.tree);
        nodes.setTokenStream(tokens);
        PolyDifferentiator differ = new PolyDifferentiator(nodes);
        PolyDifferentiator.poly_return r2 = differ.poly();
        System.out.println("d/dx="+((Tree)r2.tree).toStringTree());

        // SIMPLIFY / NORMALIZE
        nodes = new CommonTreeNodeStream((Tree)r2.tree);
        nodes.setTokenStream(tokens);
        Simplifier reducer = new Simplifier(nodes);
        Simplifier.poly_return r3 = reducer.poly();
        System.out.println("simplified="+((Tree)r3.tree).toStringTree());
        // CONVERT BACK TO POLYNOMIAL
        nodes = new CommonTreeNodeStream((Tree)r3.tree);
        nodes.setTokenStream(tokens);
        PolyPrinter printer = new PolyPrinter(nodes);
        PolyPrinter.poly_return r4 = printer.poly();
        System.out.println(r4.st.toString());
    }
}

Running the rig on "2x+3x^5" shows:

tree=(+ (* 2 x) (* 3 (^ x 5)))
d/dx=(+ 2 (* 15 (^ x 4)))
simplified=(+ 2 (* 15 (^ x 4)))
2+15x^4

Rewriting an existing AST

For efficiency, option rewrite=true does an in-line replacement for rewrite rules so you can avoid making a copy of an entire tree just to tweak a few nodes. For example, if you have a huge expression tree but only want to rewrite ^('+' INT INT) to be a single INT node, it's better not to duplicate the entire huge tree. The rewrite mode behaves exactly the same as nonrewrite mode except that rewrites stitch changes into the incoming tree. Nodes are not duplicated for rules w/o rewrites.

The result of a rule with a rewrite is the newly created tree. The result of a rule w/o a rewrite is simply the incoming tree. For chains of rule invocations as in the next example, ANTLR copies rewrites upwards so that the action in rule 's' prints out the tree created in b:

tree grammar TP;
options {output=AST; ASTLabelType=CommonTree; tokenVocab=T; rewrite=true;}
s : a {System.out.println($a.tree.toStringTree());} ;
a : b ;
b : ID INT -> INT ID ;

Heterogeneous tree nodes

By default, with output=AST, ANTLR creates trees of type CommonTree. To create different nodes depending on the incoming token type, you can override create(Token) and YourTreeClass.dupNode(Object) and errorNode() in a subclass of CommonTreeAdaptor or implement your own TreeAdaptor. Unfortunately, this only allows you to change the node type based upon the token type and not grammatical context. Sometimes you want to have ID become a VarNode and sometimes they MethodNode object. As of v3.1, you can use the node token option to indicate node type (in both parsers and tree parsers):

decl : 'int'<node=TypeNode> ID<node=VarNode> ';' ;

or equivalently

decl : 'int'<TypeNode> ID<VarNode> ';' ;

because node is assumed if there is only one option and it is not an option assignment. Token references with node options and invoke the following constructor during tree construction:

public V(Token t); // NEED SPECIAL CTOR for ID<V> on left of ->

You can specify the node type on any token reference including liberals:

a : ID<V> ';'<V>

The "become root" operator ^ is used following the token options:

e : INT '+'<PlusNode>^ INT ;

Labels are available as usual; e.g., x=ID<V> and x+=ID<V>.

Heterogeneous tree nodes are labeled on the right-hand side of the -> rewrite operator as well:

decl : 'int' ID -> ^('int'<TypeNode> ID<VarNode>) ;

You can also specify arguments on node type constructors on the right of -> rewrite operator. For example, the following two token references:

ID<V>[42,19,30] ID<V>[$ID,99]

invoke the following two constructors of V:

public V(int ttype, int x, int y, int z)
public V(int ttype, Token t, int x)

The TreeAdaptor is not called; instead for constructors are invoked directly. This is much more flexible because the list of arguments can change per type whereas the TreeAdaptor interface is fixed. Note that parameters are not allowed on token references to the left of ->:

a : ID<V>[23,21] ; // ILLEGAL

Use imaginary nodes as you normally would, but with the addition of the node type:

block : lc='{' stat+ '}' -> ^(BLOCK<StatementList>[$lc] stat+) ;

Here is a complete simple example:

grammar T;
options {output=AST;}
@members {
static class V extends CommonTree {
  public int x,y,z;
  public V(int ttype, int x, int y, int z) {
    this.x=x; this.y=y; this.z=z; token=new CommonToken(ttype,"");
  }
  public V(int ttype, Token t, int x) { token=t; this.x=x; }
  public String toString() {
    return (token!=null?token.getText():"")+"<V>;"+x+y+z;
  }
}
}
a : ID -> ID<V>[42,19,30] ID<V>[$ID,99] ;
ID : 'a'..'z'+ ;
WS : (' '|'\\n') {$channel=HIDDEN;} ;

Sometimes ANTLR must duplicate nodes to avoid cycles and to provide useful semantics. In the next example, the trees returned from rule type are type V. There is only one type specification in the input (e.g., "int a,b,c;") but multiple identifiers. to create multiple trees with 'int' at the root, that node must be duplicated by the rewrite rule ^(type ID)+.

grammar T;
options {output=AST;}
a : type ID (',' ID)* ';' -> ^(type ID)+;
type : 'int'<V> ;
ID : 'a'..'z'+ ;
INT : '0'..'9'+;
WS : (' '|'\\n') {$channel=HIDDEN;} ;

We want 3 trees, one for each identifier:

(int<V> a) (int<V> b) (int<V> c)

We need to override dupNode() in our node class definition as well as two constructors:

class V extends CommonTree {
    public V(Token t) { token=t;}                 // for 'int'<V>
    public V(V node) { super(node); }             // for dupNode
    public Tree dupNode() { return new V(this); } // for dup'ing type
    public String toString() { return token.getText()+"<V>";}
}

Here is a test rig:

public class Test {
    public static void main(String[] args) throws Exception {
        CharStream input = new ANTLRFileStream(args[0]);
        TLexer lex = new TLexer(input);
        TokenRewriteStream tokens = new TokenRewriteStream(lex);
        TParser parser = new TParser(tokens);
        TParser.a_return r = parser.a();
        if ( r.tree!=null ) {
            System.out.println(((Tree)r.tree).toStringTree());
            ((CommonTree)r.tree).sanityCheckParentAndChildIndexes();
        }
    }
}

Using custom AST node types

/** An adaptor that tells ANTLR to build CymbalAST nodes */
public static TreeAdaptor cymbalAdaptor = new CommonTreeAdaptor() {
    public Object create(Token token) {
        return new CymbalAST(token);
    }
    public Object dupNode(Object t) {
        if ( t==null ) {
            return null;
        }
        return create(((CymbalAST)t).token);
    }
    public Object errorNode(TokenStream input, Token start, Token stop,
                            RecognitionException e)
    {
        CymbalErrorNode t = new CymbalErrorNode(input, start, stop, e);
        return t;
    }
};

Here's a suitable error node:

/** A node representing erroneous token range in token stream */
public class CymbolErrorNode extends CymbolAST {
        org.antlr.runtime.tree.CommonErrorNode delegate;

        public CymbolErrorNode(TokenStream input, Token start, Token stop,
                                           RecognitionException e)
        {
                delegate = new CommonErrorNode(input,start,stop,e);
        }

        public boolean isNil() { return delegate.isNil(); }

        public int getType() { return delegate.getType(); }

        public String getText() { return delegate.getText(); }
        public String toString() { return delegate.toString(); }
}

Error Node Insertion Upon Syntax Error

Prior to v3.1, ANTLR AST-building parsers did not alter the resulting AST upon syntax error. After v3.1 ANTLR adds an error node as created by TreeAdaptor.errorNode(...) to represent the missing nodes or confusing input sequences. The first token in the error sequence is the token at which the parser first detected an error. The last token in the sequence is the last token consumed during error recovery. ANTLR creates a CommonErrorNode by default, but you can obviously create your own tree adapter and override this.

Let me demonstrate the new mechanism by example. Referring to the attached SimpleC.g from the v3 examples, here is some good input:

int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}

That input results in the following tree output:

tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

The grammar and tree construction for the FOR loop is as follows:

forStat
    :   'for' '(' start=assignStat ';' expr ';' next=assignStat ')' block
        -> ^('for' $start expr $next block)
    ;

Now, remove the first '(' of the or loop:

int foo() {
  for i=0; i<3; i=i+1) {
    x=9;
  }
}

You will see that ANTLR detects an error, but magically inserts the missing token. In this case, the parser was not asked to insert the '(' into the tree so there is no evidence of the error in the output tree:

line 2:6 missing '(' at 'i'
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

What about when you have a random extra token such as "22" before the '(':

int foo() {
  for 22 (i=0; i<3; i=i+1) {
    x=9;
  }
}
line 2:6 extraneous input '22' expecting '('
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

Again, ANTLR detects the error and is able to ignore the extraneous token to yield a valid tree.

If you forget a token that must go into the output tree, however, you will see an error node. Given a missing identifier at the start of the FOR loop:

int foo() {
  for (=0; i<3; i=i+1) {
    x=9;
  }
}

The parser emits:

line 2:7 missing ID at '='
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= <missing ID> 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

The toString() method of the error node yields "<missing ID>".

When the parser gets really confused, such as when it gets a NoViableAltException, you will see that it consumes a whole bunch of input and adds it to the tree as an error node (it indicates what tokens it consumes during resynchronization). Input:

);
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}

yields:

line 1:0 required (...)+ loop did not match anything at input ')'
tree=<error: );
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}>

Making custom error nodes

Just override errorNode() in TreeAdaptor. The default handling is as follows:

public Object errorNode(TokenStream input, Token start, Token stop,
			RecognitionException e)
{
	CommonErrorNode t = new CommonErrorNode(input, start, stop, e);
	return t;
}

Make sure that your error node type is a subclass of your node type so that you do not get class cast exceptions.

See the next section for example of how to override.

Turning off error node construction

To turn this off, just override errorNode:

class MyAdaptor extends CommonTreeAdaptor {
    public Object errorNode(TokenStream input, Token start, Token stop,
                            RecognitionException e)
    {
        return null;
    }
}

and then set

parser.setTreeAdaptor(new MyAdaptor());