Automatic StringTemplate construction in ANTLR grammars

Currently, ANTLR does not create templates for you automatically when you use the output=template option. This is because, when I first implemented it, I had no idea what the right answer was here; I did not know how to deal with whitespace and so on. I think I have the answer now. First, let me remind you that output=AST builds a completely flat tree given no instructions to the contrary. Similarly, template output should reproduce the input given no instructions to the contrary.

Templates from parser grammars

Some cases seem obvious. What should the output template be for this rule?

d : 'int' ID ';' ;

The answer is not just to concatenate the tokens, because of whitespace. I tried a simple mechanism that added a little bit of code to each token and rule reference. The code snippet would copy the token object into some default template, which the user can specify by overriding a method called getDefaultTemplate(String ruleName). The problem is that the output came out as intx; rather than int x; or whatever the original spacing was.

The answer seems to be a simple matter of inserting any off-channel tokens into the output template before inserting the real token. What happens, though, when you invoke a rule that invokes another rule? You cannot simply add the whitespace before the starting token of a rule reference, because chains of rule invocations would insert the same whitespace multiple times:

a : b ID ;
b : c ;
c : 'int' ;

The template for c would be 'int' plus any whitespace to the left of that token. Rule b's template would again include any whitespace before its first token, which is the same 'int' matched by c. This would duplicate the whitespace. I think a simple index into the token stream could track which tokens have already been added to the output. The little snippet of code for a rule reference would then find all off-channel tokens between the start token of the rule reference and either the first real token before it or the last emitted token, whichever comes later (to prevent duplicates).
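The bookkeeping described above can be sketched in a few lines. This is only an illustration in Python, not ANTLR-generated code: the names Token, Emitter, hidden_before, and token, the channel numbers, and the hard-coded token indexes are all assumptions made for the example.

```python
from dataclasses import dataclass

DEFAULT, HIDDEN = 0, 99  # channel numbers assumed for this sketch


@dataclass
class Token:
    text: str
    channel: int = DEFAULT


class Emitter:
    """Remembers the index of the last token already copied to the output."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.last_emitted = -1

    def hidden_before(self, index):
        """Return not-yet-emitted off-channel text directly left of `index`."""
        start = index
        # Walk left until we hit a real token or the last emitted token.
        while (start - 1 > self.last_emitted
               and self.tokens[start - 1].channel != DEFAULT):
            start -= 1
        text = "".join(t.text for t in self.tokens[start:index])
        self.last_emitted = max(self.last_emitted, index - 1)
        return text

    def token(self, index):
        """Return the real token at `index`, preceded by pending whitespace."""
        text = self.hidden_before(index) + self.tokens[index].text
        self.last_emitted = max(self.last_emitted, index)
        return text


# Token stream for the input "  int x" (indexes 0..3).
tokens = [Token("  ", HIDDEN), Token("int"), Token(" ", HIDDEN), Token("x")]


def rule_c(em):  # c : 'int' ;
    return em.token(1)


def rule_b(em):  # b : c ;
    return em.hidden_before(1) + rule_c(em)


def rule_a(em):  # a : b ID ;
    return em.hidden_before(1) + rule_b(em) + em.token(3)


result = rule_a(Emitter(tokens))
print(repr(result))  # '  int x'
```

All three rules ask for the whitespace in front of 'int', but only the first request returns it; the later ones see that index 0 has already been emitted and contribute nothing.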

ST construction in tree grammars

Tree grammars match subtrees constructed by a parser. In order to create an output template using a tree grammar, the tree grammar must know about the token stream from which its trees were created. If you rewrite the tree, all of the token indexes will be incorrect. If a node for ID was originally created from the token at index 32 but you move it around in the tree, that pretty much prevents ANTLR from creating a valid string derived from the input. So, automatic construction of templates only works if you have not manipulated the tree.

ANTLR tree grammar rules compute the automatic template by asking for the default template, as with a parser grammar. The elements inserted into the output template are a sequence of token objects, including the whitespace tokens. Each subtree root has a start and stop index into the token stream, and that range naturally includes all of the off-channel tokens in between the real tokens. The automatic templates do not include whitespace before or after the tokens associated with the nodes matched by a tree grammar rule.

prog : ^(PROGRAM (d+=decl)+) -> file(decls={$d}) ;
decl : ^(DECL type ID) ; // auto creates template from input tokens for decl
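To make the start/stop idea concrete, here is a minimal sketch in Python. The Node class, its start and stop fields, and the original_text function are hypothetical names for this example, and tokens are modeled as plain strings for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    label: str
    start: int  # index of the first token covered by this subtree
    stop: int   # index of the last token covered by this subtree


def original_text(node, tokens):
    """Join every token, on or off channel, from start through stop."""
    return "".join(tokens[node.start:node.stop + 1])


# Token stream for "\nint x;\n"; whitespace tokens sit at indexes 0, 2, and 5.
tokens = ["\n", "int", " ", "x", ";", "\n"]
decl = Node("DECL", start=1, stop=4)

decl_text = original_text(decl, tokens)
print(repr(decl_text))  # 'int x;'
```

The space between int and x is included because it falls inside the range, while the newlines before and after the declaration are left out.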

What about when a referenced rule returns a template? That output must be included rather than the original input associated with the subtree matched by that rule reference.

decl : ^(DECL type ID) ;
type : 'int' -> float(...) ;

The automatically created template for decl cannot be the original input matched for that declaration. We have to build up the output template piecemeal again, just like in the parser. The order of the elements will be the order in which they are encountered in the tree, so if you built a tree that had type as the last child instead of the first, the output would change. Here, the output would be whatever whitespace appeared before the first token associated with the subtree matched by type, followed by the template returned by type, followed by the whitespace in front of the ID, followed by the text of the ID node. If this is not what you want, then you must specify what template to create. I am just trying to do something that will work in the common case.
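A rough sketch of that piecemeal construction, again in Python with made-up names: auto_template, the (start, stop, template) child tuples, and the hidden index set are assumptions for illustration only. A child whose rule returned a template contributes that template; any other child contributes its original text.

```python
def auto_template(children, tokens, hidden):
    """Build output from children given in tree order.

    children: list of (start, stop, template_or_None) tuples
    tokens:   token texts, indexed as in the token stream
    hidden:   set of indexes that hold off-channel tokens
    """
    out = []
    for start, stop, template in children:
        # Collect the whitespace directly to the left of the child.
        i = start
        while i > 0 and (i - 1) in hidden:
            i -= 1
        out.append("".join(tokens[i:start]))
        if template is not None:
            out.append(template)  # output of the referenced rule
        else:
            out.append("".join(tokens[start:stop + 1]))  # original input
    return "".join(out)


# Input "int x;" matched as ^(DECL type ID); type rewrites 'int' to "float".
tokens = ["int", " ", "x", ";"]
hidden = {1}

type_first = auto_template([(0, 0, "float"), (2, 2, None)], tokens, hidden)
type_last = auto_template([(2, 2, None), (0, 0, "float")], tokens, hidden)
print(repr(type_first))  # 'float x'
print(repr(type_last))   # ' xfloat'
```

The second call shows why child order matters: with type as the last child, the same pieces come out in a different order.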

The mechanism should also create templates for alternatives that do not have template specifications even when others do:

e : ^('+' e e) // auto create template
  | ^('*' e e)
  | INT -> intval(...)
  | ID  -> load(...)
  ;

A warning

Each tree grammar rule knows the text from which the associated subtree was created, but only if the subtree has a single root. The following rule, because it does not have a single root, gives ANTLR a problem.

decls : decl+ ;

ANTLR cannot currently figure out the last sibling automatically, which means that it cannot figure out the complete text for an arbitrary rule; so, stick to rules with a single root node. Fortunately, that is the common case; e.g.,

decl : ^(DECL type ID) ;