Interfacing AST with Java

The power of ANTLR is in enabling you to parse and gather information from external sources. These external sources will have their own syntax, and as a good programmer, you want to keep the grammar and implementation separate from each other. Doing this allows you to adapt the interface between the data and grammar without changing the base program. One method of gathering information into data tables that the base program will analyze. This intermediate representation (IR) is useful and often used in other grammar generators. ANTLR further allows you to create two other forms automatically: Parse Tree and Abstract Syntax Tree (AST). The parse tree closely follow what the original data looks like, only in tree form. The AST is more abstract, leaving out useless tokens like braces, parentheses, etc. The IR, AST, and parse tree serves different purposes and will increasely lose the original form (and thus information) as the translation approaches IR.

This section assumes that you have successfully written your parser and added the rewriting rules.  Now, you need to interface with your working language.  This section attempts to describe the process of setting up your program and interfacing with Java.  For other languages you may be able to extrapolate from the information that is here.

The Grammar 

Suppose that you have a simple grammar for reading an HTML document.  I'm sure that this has "never" been done before, so don't yawn and blink too much. 

grammar HtmlDoc;
options { output=AST; }
tokens {
	DOC='doc';
	TITLE='title';
	BODY='body';
}

html_doc
	: '<html>' html_header html_body '</html>' -> ^('doc' html_header html_body);

html_header
	: '<title>' TEXT '</title>' -> ^('title' TEXT) ;

html_body
	: '<body>' TEXT '</body>' -> ^('body' TEXT)	;

TEXT : (~('<'))*;

This grammar indicates the title and body to be output. The '->' is new to ANTLR 3.0 and is a cleaner way to "rewrite" the data. See XXX for more detail on rewrites. The next step is writing the Java interfacing code.

The Java Interface 

Writing the interface to get the ANTLR output is a little more complicated than earlier versions, but the result is more powerful. The basic glue to get things going has changed as well:

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.TokenRewriteStream;
...
ANTLRFileStream fs = new ANTLRFileStream("test.html");
HtmlDocLexer lex = new HtmlDocLexer(fs);
TokenRewriteStream tokens = new TokenRewriteStream(lex);
HtmlDoc grammar = new HtmlDoc(tokens);

First, you need to create an "ANTLRFileStream" with the filename. Generally, it's a good idea to use the java.io.File, because it gives you more flexibility with accessing different files. Like earlier versions, you first open the lexer (HtmlDocLexer) then open the grammar (HtmlDoc). You now have a connection to your full grammar. Of course, this is not enough, since you want data back from it.

The AST Changes 

Before plowing into AST generation, you need to understand how the data is supposed to be stored. In the earlier versions of ANTLR, you could use "siblings" and "children" to walk the tree. ANTLR 3 combines these into simply arrays of children. So where are siblings and how do you move "down" the tree? It's actually fairly simple: the arrays represent the siblings of children, and the children can have children.

The AST Magic 

The main gist of getting data back is through call-backs, or rather calls through a class you provide. This section starts off with a simpler form that helps you get the idea. First you need to create a tree adaptor (i.e., CommonTreeAdaptor). This class interfaces with the rewrite code that keeps track of the data and its location in the tree (remember the arrays of children and children of children mentioned above?).

static final TreeAdaptor adaptor = new CommonTreeAdaptor() {
	public Object create(Token payload) {
		return new CommonTree(payload);
	}
};

There is no need to keep more than one copy of this class, because it acts like a object factory for nodes in the tree. This is the "magic" of the new AST manupulation. This "magic" code will be expanded a little later in this article.

Hooking Things Together 

The magic code now needs to be hooked in to the grammar, so that the grammar knows what to use to create the tree. The first statement, below, does the hook up. The next step engages the grammar and lexer to parse over the file provided it earlier.

grammar.setTreeAdaptor(adaptor);
HtmlDoc.html_doc_return ret = grammar.html_doc();
CommonTree tree = (CommonTree)ret.getTree();

After the parsing completes, you can get the tree from the grammar by calling "getTree()." The result of this call is always Object so you can create your own CommonTree-like classes (see below). Also the result is the head of the tree, so walking it is fairly easy.

Walking Through the Tree 

If you recall, the tree is organized in arrays of children and children can have arrays of children. Effectively, then, you can write a simple recursive procedure to walk the tree:

public void printTree(CommonTree t, int indent) {
	if ( t != null ) {
		StringBuffer sb = new StringBuffer(indent);
		
		if (t.getParent() == null){
			System.out.println(sb.toString() + t.getText().toString());	
		}
		for ( int i = 0; i < indent; i++ )
			sb = sb.append("   ");
		for ( int i = 0; i < t.getChildCount(); i++ ) {
			System.out.println(sb.toString() + t.getChild(i).toString());
			printTree((CommonTree)t.getChild(i), indent+1);
		}
	}
}

Similarly, you can find the needed information by searching the tree and finding the subelements you created in the grammar's rewrite syntax.

Advanced Magic 

The magic that was mentioned earlier really has an extended purpose and not meant to make your life more, um, challenging. Suppose that you wanted to use the tree as a medium to process information. The tree, and thus the individual nodes, would then need to be flexible enough to handle the annotations. You can do this by extending the CommonTree class:

import org.antlr.runtime.tree.*;
import org.antlr.runtime.Token;

public class HtmlAST extends CommonTree {
	public String text; 
	
	public HtmlAST(Token t) {
		super(t);
		text = (t != null? t.getText(): "[]");
	}

	public String toString() {
		return text + super.toString();
	}
}

In this case 'text' would simply echo what the parent (CommonTree) already has. But, you can see how this would be useful, allowing you to add your own custom fields. The hook up is even more simple:

static final TreeAdaptor adaptor = new CommonTreeAdaptor() {
	public Object create(Token payload) {
		return new HtmlAST(payload);
	}
};

The rest of the code remains the same. Or, well, the example code does. Your, on the other hand, coding is never going to be the same.