Tree rewriting in ANTLR v4

Because most ANTLR users don't build compilers, I decided to focus on the other applications for ANTLR v4: parsing and extracting information and then translations. For compilers, we need to convert everything into operations and operands--that means ASTs are easier. For example, 3+4 should become a tree with + at the root and 3 and 4 as operands/children: (+ 3 4). The parse tree in contrast is probably (expr 3 + 4) where rule reference expr is the root node and the other elements are children. The parse tree is indeed less useful for compiler work.

On the other hand, parse trees are usually what we want for data extraction and translation tasks. Moreover, the parse trees can be automatically generated and ANTLR can generate listener and visitor tree walker mechanisms automatically as well. Users of ANTLR v4 don't have to learn AST construction syntax or AST tree grammar syntax and semantics.

ANTLR itself is a very complicated translator. It reads in grammar files and generates multiple Java files as output. The translation is complicated by the fact that the output looks very different from the input and the order of the output can be very different from the order the input. There can be nonlocal transformations. For example, ANTLR can decide it needs a local variable or parameter deep within a rule. Token reference x=T defines a label x that ANTLR would translate as something like:

_localctx.x = _input.LT(1);
match(T);

That code goes somewhere inside the rule function for the surrounding rule. At the same time, ANTLR must inject a definition of x into the rule context object for the surrounding rule, which must be generated outside of the rule function. A single element, x=T, generates code in multiple locations. The order of computation of the output is different than the order of the input elements.

The point is that ANTLR does all this without gradually transforming a grammar AST into a Java AST. That would be extremely cumbersome from lots of personal experience. Instead, ANTLR walks the tree multiple times to extract information and also to annotate the nodes. Once all of the information it needs is available, it does another walk over the tree and builds up a model of the output. The output objects are built up as the information becomes available during the tree walk of the input tree. The result of model construction is a complete tree of output model objects, which must be then rendered to text via StringTemplate templates. The name of the object model class, such as RuleFunction, corresponds to a template with the same name that knows how to render that model object to text. I built an automatic walker that walks the model and builds up the giant output template. The final step is to ask StringTemplate to render that template the text, effectively generating code. It works great!

That does not say there is no use for tree to tree transformations. ANTLR does so itself. I believe that the time to use such transformation is when the transformation results in a tree representing the same language, such as C to C or ANTLR to ANTLR. For example, it's extremely appropriate to normalize or optimize input by rewriting the tree. x+0 can be converted to just x, which means converting AST (+ x 0) to x or parse tree (expr x + 0) to (expr x). ANTLR transforms left recursive grammars to non-left recursive grammars using tree rewriting, but it does so manually at the moment. (boo)

ANTLR does not have tree transformation support at this time and it will not look like it did in v3 when it arrives. Many powerful rewriting systems let you perform rewrites in the concrete syntax of the language you are translating. E.g., see ASF+SDF by example. I imagine we'll do something very similar and allow a series of declarative rules like:

"<x:expr> + 0" -> "<x>"
"<x:expr> * 1" -> "<x>"
"<x:expr> * 0" -> "0"

Working with parse trees make such concrete transformation rules much easier to implement.

Sam Harwell and I are also talking about an xpath-like or other query mechanism to specify paths and nodes within parse trees. That way you could say: ``go get me all variable declarations'' and so on. There are lots of possibilities, given all of the research that has been done in this area.