Blog from April, 2008

Rewrite rules

Just had some cool ideas about a semantic rule specification language... I should write that up too. (Also think about multi-threaded rewriting; no side effects required.)

Some language translation problems can be described with a few rewrite rules that are predicated upon their context and potentially an arbitrary Boolean expression. For example, consider a few identity transformations such as

expr: // in context of expr, match left side, transformed to right side
  "x + 0" -> "x"
  "0 + x" -> "x"

You do not have to specify these rewrite rules inside a grammar, assuming you have a parser grammar that can parse the input (and has rule expr). The use of such concrete syntax notation is very easy for humans to look at. The only problem comes in when you have perhaps 100 of these rules. Their interactions can sometimes lead to bugs that are nearly impossible to track down. Nonetheless, a number of academic systems exist that use a series of rewrite rules that live separately from the syntax grammar (e.g., the Meta-Environment (ASF+SDF) and Stratego). At least from the journal papers, they seem very powerful. Something must be preventing the average developer from using them, however.

Sometimes I think I will spend a couple of weeks and try to build one myself, ANTLR-style. The first thing I realize is that concrete syntax, while beautiful, is not always powerful enough to do what we want (I suppose it could be an option). For example, what if you want to match a series of identifiers and transform them to something else? You need some kind of repetition operator inside the strings of concrete syntax, but why invent new syntax? We have grammars to specify languages. Grammars both recognize and generate language, as you see in ANTLR's parser-grammar-to-tree-grammar rewrite rules. So, I propose something more along the lines of grammar-to-grammar transformations. Elements on the left are available as streams of data to the template on the right-hand side:

var:
  type ID (',' ID)* -> (type ID ';')+

In this case, whatever was matched for the type rule would be replicated for each identifier match on the input stream. Input tokens int i,j would become output token stream int i; int j;. This token stream could then be repeatedly processed by the rules until nothing had changed, signifying the end of translation. Naturally, one could design rewrite rules that flip back and forth between two patterns causing an infinite loop. This is a well-known problem with rewrite systems. I'm satisfied to simply call that a bug in your rewrite rules, though I'm sure I could come up with something nice to let you know precisely which rules were the problem.
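Concretely, the driver for such a system is just a fixed-point loop. Here is a minimal Java sketch of the idea; Rule and TokenStream are made-up interfaces standing in for whatever the real system would use, and the pass limit is a crude stand-in for proper loop detection:

import java.util.List;

interface TokenStream { }

interface Rule {
    /** Return a rewritten stream, or null if this rule changed nothing. */
    TokenStream apply(TokenStream tokens);
}

class RewriteEngine {
    /** Apply the rules in order, repeatedly, until a full pass changes nothing. */
    static TokenStream rewrite(List<Rule> rules, TokenStream tokens) {
        int passes = 0;
        boolean changed = true;
        while (changed) {
            if (passes++ > 1000) { // rules are probably flipping back and forth
                throw new IllegalStateException("rewrite rules appear to loop");
            }
            changed = false;
            for (Rule r : rules) { // earlier rules take precedence
                TokenStream result = r.apply(tokens);
                if (result != null) { tokens = result; changed = true; }
            }
        }
        return tokens;
    }
}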

Rewrite rule precedence

Rewrite rules specified first would be given precedence over those specified later. There is therefore no notion of an ambiguous rewrite rule to match. The first one that matches, wins. So, be careful that you get the order right. In the following case, a cast expression will never be matched because it is mentioned after the parenthesized expression.

expr:
  '(' expr ')' -> whatever
  '(' ID ')' expr -> somethingelse // can't match

In general, you should put the longest matches first.

Generating text rather than tokens

You could also go to text instead of a token stream. In this case, the output grammar would actually be a StringTemplate (an "unparser"). For example, the following transformation would convert the token stream matched on the left-hand side, pushing all of the matched elements into the attributes of the template on the right:

var:
  type ID (',' ID)* -> "<type> <ID; separator=\", \">"
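That right-hand side really is just a StringTemplate. Here is a sketch of what the engine would do for input int i, j; the attribute pushing is hypothetical, but the StringTemplate v3 calls are real:

import org.antlr.stringtemplate.StringTemplate;
import org.antlr.stringtemplate.language.AngleBracketTemplateLexer;

public class UnparseDemo {
    public static void main(String[] args) {
        // the right-hand side of the rewrite, using angle-bracket delimiters
        StringTemplate st = new StringTemplate(
            "<type> <ID; separator=\", \">", AngleBracketTemplateLexer.class);
        st.setAttribute("type", "int"); // whatever rule type matched
        st.setAttribute("ID", "i");     // repeated sets aggregate into a list
        st.setAttribute("ID", "j");
        System.out.println(st.toString()); // prints: int i, j
    }
}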

Grammatical context

A rewrite rule is applicable when the left-hand side matches the input in the context of the specified rule, such as expr above. The context means "what rule am I matching?" Context is important, because you might want to do something different depending upon context. For example, you might want to do something different for variable declarations matched at the class level (fields) and for variables matched in methods (locals):

[classDef var]: // context is "field"
  type ID -> blort

[... method var]: // context is "local"
  type ID -> blech

The "..." means anything on the stack of rule vacations. The entire context for the local variables then is a var rule match within a method rule.

One could also imagine wanting to combine contexts into a single translation if fields and locals are separated in the grammar:

[localVar | field]:
  type ID -> blort

In general, the stack context language is regular, and rewrite rule candidates could be found for a given stack using a DFA scanning the stack from right to left. In other words, if the stack top is rule var, then the DFA would rule out any contexts that do not end in var or "...".
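Skipping the DFA, the check itself is easy to picture. Here is a sketch under one plausible interpretation, where pattern elements match consecutive stack entries and a leading "..." acts as a wildcard prefix; ContextPattern is a made-up class:

import java.util.List;

class ContextPattern {
    static final String ELLIPSIS = "...";
    private final String[] pattern; // e.g., {"...", "method", "var"}

    ContextPattern(String... pattern) { this.pattern = pattern; }

    /** Match right to left; stack.get(size-1) is the rule currently matching. */
    boolean matches(List<String> stack) {
        int si = stack.size() - 1;
        for (int pi = pattern.length - 1; pi >= 0; pi--) {
            if (pattern[pi].equals(ELLIPSIS)) return true; // anything below is fine
            if (si < 0 || !stack.get(si--).equals(pattern[pi])) return false;
        }
        return si < 0; // without "...", the pattern must consume the whole stack
    }
}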

Note that the rewrite patterns only have to match starting at the left edge of the specified context rule. They do not have to match an exact alternative within that rule, nor a complete alternative. You might only want to alter the first or second token:

classDef:
  'class' ID -> "class _<ID>" // prefix with '_'

Arbitrary semantic content

Sometimes grammatical context is not enough. For example, you might want to change an identifier reference in an expression depending on symbol table information:

expr:
  {is local or arg}? ID -> "locals.<ID>"
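The predicate itself would reduce to ordinary symbol-table code. A tiny sketch with made-up Symbol classes (none of these are real ANTLR types):

abstract class Symbol { String name; }
class LocalSymbol extends Symbol { }
class ArgSymbol extends Symbol { }

class Predicates {
    /** What {is local or arg}? would boil down to after resolving the ID. */
    static boolean isLocalOrArg(Symbol s) {
        return s instanceof LocalSymbol || s instanceof ArgSymbol;
    }
}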

What if the right-hand side needs to reference something other than an element matched on the left-hand side? In this case, we need to allow actions to set attributes.

expr:
  {is local or arg}? ID -> (fname={currentFunc.name}) "<fname>_locals.<ID>"

What is currentFunc, though? Normally that is a parser class instance variable set to the currently matched function by an action in the parser grammar. For constructs that can nest, such as classes in some languages, the "current class" pointer needs to be treated like a stack. ANTLR has dynamically scoped attributes for this purpose; perhaps we could do something similar.

method:
  scope { FunctionSymbol currentFunc; }
  blah blah blah
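Under the hood, a dynamically scoped attribute is just a stack pushed on rule entry and popped on rule exit, so nested constructs see the innermost value. A minimal sketch, with FunctionSymbol standing in for whatever symbol class the grammar uses:

import java.util.ArrayDeque;
import java.util.Deque;

class FunctionSymbol {
    final String name;
    FunctionSymbol(String name) { this.name = name; }
}

class ScopedAttributes {
    private final Deque<FunctionSymbol> currentFunc = new ArrayDeque<FunctionSymbol>();

    void enterMethod(FunctionSymbol f) { currentFunc.push(f); } // rule entry
    void exitMethod()                  { currentFunc.pop(); }   // rule exit

    /** What an action reading currentFunc.name would see: the innermost method. */
    String currentFuncName() { return currentFunc.peek().name; }
}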

The result of a pattern match could be an action or perhaps even just a context with no pattern:

method:
  {System.out.println("heh, I just entered a method");} // always exec in method context
  type ID {currentFunc = currentScope.resolve($ID.text);} // exec this but refer to ID

We might also need to be able to specify a predicate on token references:

expr:
  ID<"null"> -> "nil" // test for null keyword as ID
  ^('*' INT<{is power of 2}> expr) -> (lg2={$INT.int}) "<expr> << <lg2>"
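The {is power of 2} predicate and the lg2 attribute are ordinary bit tricks; for example, the second pattern turns x * 8 into x << 3:

class ShiftRewrite {
    /** True if v has exactly one bit set, i.e., v is a power of two. */
    static boolean isPowerOfTwo(int v) { return v > 0 && (v & (v - 1)) == 0; }

    /** Floor of log base 2; for a power of two, this is the exact shift amount. */
    static int lg2(int v) { return 31 - Integer.numberOfLeadingZeros(v); }

    public static void main(String[] args) {
        int INT = 8; // the matched INT token's integer value
        if (isPowerOfTwo(INT)) {
            System.out.println("x * " + INT + "  ->  x << " + lg2(INT)); // x << 3
        }
    }
}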

Rewrites for tree grammars

There is no reason we could not do rewrite patterns on trees. The left would always be a tree pattern but the right could be a tree pattern or template.

expr:
  ^('+' expr expr) -> add(blah blah blah) // rewrite with template ref
  INT -> "0" // differentiate any integer to constant "0"

To rewrite to another tree:

expr:
  ^('+' INT<"0"> expr) -> expr
  ^('*' INT<"1"> expr) -> expr

Currently, ANTLR does not create templates for you automatically when you use the output=template option. This is because, when I first implemented it, I had no idea what the right answer was here. I did not know how to deal with whitespace and so on. I think I have the answer now. First, let me remind you that output=AST builds a completely flat tree given no instructions to the contrary. Similarly, the template output should reproduce the input given no instructions.

Templates from parser grammars

Some cases seem obvious. What should the output template be for this rule?

d : 'int' ID ';' ;

The answer is not just concatenating the tokens, because of whitespace. I tried a simple mechanism that added a little bit of code to each token and rule reference. The code snippet would copy the token object into some default template, which the user can specify by overriding a method called getDefaultTemplate(String ruleName). The problem is that the output came out as intx; rather than int x; or whatever.

The answer seems to be a simple matter of inserting any off-channel tokens into the output template before inserting the real token. What happens, though, when you invoke a rule that invokes another rule? You cannot simply add any whitespace before the starting token of a rule reference, because chains of rule invocations will insert the same whitespace multiple times:

a : b ID ;
b : c ;
c : 'int' ;

The template for c would be 'int' plus any whitespace to the left of that token. Rule b's template would again be any whitespace before the first token of c, 'int'. This would duplicate the whitespace. I think a simple index into the token stream could track whether a token has already been added to the output. So the little snippets of code for a rule reference would find all off-channel tokens between the start token of the rule reference and either the first real token before it or the index of the last emitted token (to prevent duplicates).
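Here is a minimal sketch of that bookkeeping over an ANTLR v3 CommonTokenStream; the WhitespaceEmitter class is made up, but the token-stream calls are the real runtime API:

import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.Token;

class WhitespaceEmitter {
    private final CommonTokenStream tokens;
    private int lastEmitted = -1; // index of the last token copied to the output

    WhitespaceEmitter(CommonTokenStream tokens) { this.tokens = tokens; }

    /** Copy t to buf, first copying any off-channel tokens between the last
     *  emitted token and t. The lastEmitted check is what prevents duplicates
     *  when chains of rule invocations all start at the same real token. */
    void emit(Token t, StringBuffer buf) {
        for (int i = lastEmitted + 1; i < t.getTokenIndex(); i++) {
            Token hidden = tokens.get(i);
            if (hidden.getChannel() != Token.DEFAULT_CHANNEL) {
                buf.append(hidden.getText());
            }
        }
        buf.append(t.getText());
        lastEmitted = t.getTokenIndex();
    }
}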

ST construction in tree grammars

Tree grammars match subtrees constructed by a parser. In order to create an output template using a tree grammar, the tree grammar must know about the token stream from which its trees were created. If you rewrite the tree, all of the token indexes will be incorrect. If a node for ID was originally created from a token at index 32 but you move it around in the tree, the indexes no longer reflect the input, pretty much preventing ANTLR from creating a valid string derived from the input. So, automatic construction of templates only works if you have not manipulated the tree.

ANTLR tree grammar rules compute the automatic template by asking for the default template, as with a parser grammar. The elements inserted into the output templates are a sequence of token objects, including the whitespace tokens. Each subtree root has a start and stop index into the token stream, which naturally includes all of the off-channel tokens in between the real tokens. The automatic templates do not include whitespace before or after the tokens associated with the nodes matched by a tree grammar rule.

prog : ^(PROGRAM (d+=decl)+) -> file(decls={$d}) ;
decl : ^(DECL type ID) ; // auto creates template from input tokens for decl

What about when a referenced rule returns a template? That output must be included rather than the original input associated with the subtree matched by that rule reference.

decl : ^(DECL type ID) ;
type : 'int' -> float(...) ;

The automatically created template for decl cannot be the original input matched for that declaration. We have to build up the output template piecemeal again, just like in the parser. The order of the elements will be the order in which they are encountered in the tree, so if you built a tree that had type as the last instead of the first child, the output would change. Here, the output would be whatever whitespace appeared before the first token associated with the subtree matched by type, followed by the template returned by type, followed by the whitespace in front of the ID, followed by the text of the ID node. If this is not what you want, then you must specify what template to create. I am just trying to do something that will work in the common case.

The mechanism should also create templates for alternatives that do not have template specifications even when others do:

e : ^('+' e e) // auto create template
  | ^('*' e e)
  | INT -> intval(...)
  | ID  -> load(...)
  ;

A warning

Each tree grammar rule knows the text from which the associated subtree was created, but only if the subtree has a single root. The following rule, because it matches a list of sibling subtrees rather than a single root, gives ANTLR a problem.

decls : decl+ ;

ANTLR cannot currently figure out the last sibling automatically, which means that it cannot figure out the complete text for an arbitrary rule; so, stick to rules with a single root node. Fortunately, that is the common case; e.g.,

decl : ^(DECL type ID) ;

Ok, Kay Roepke is in town and we've been discussing faster expression parsing, among other things. Look for another entry on default StringTemplate generation for parsing and tree parsing.

Found a ref to Keith Clarke's original recursive-descent precedence work

More on precedence climbing

ANTLR v3.2 will allow special rules for specifying expressions that are particularly efficient in both speed and space. Special rules will define unary suffix operators, binary operators, or ternary operators by virtue of how they recurse. Precedence of operators is specified by the order in which the alternatives appear in the special rule. Left associativity is assumed. Right associativity is done with a token-level option, TOKEN<associativity=right>. The alternatives that do not specify operators or that specify unary operators are grouped into a new rule. A special rule is rewritten into one that does not have left recursion. Semantic predicates are used to recurse more or less depending on the precedence of the next operator (first symbol of lookahead). Here is the special rule:

e : parse_expr[0] ; // rewrite e to this rule, which invokes the precedence engine

/** "Precedence climbing" engine
parse_expr[int p]
   :   primary
       (   {prec[input.LA(1)]>=p}?=>     bop r= parse_expr[nextp(p)]
       |   {postprec[input.LA(1)]>=p}?=> suffix
       )*
   ;

where rule suffix is described below (basically we remove e references from the left edge). If there are no suffix operators, the 2nd alt of the (...)* is not generated.
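For anyone who has not seen precedence climbing before, here is a standalone Java sketch of the same algorithm with toy one-character tokens; nothing here is generated code. The nextp computation demands one level higher precedence of the right operand for left-associative operators and the same level for right-associative ones:

import java.util.HashMap;
import java.util.Map;

class PrecClimb {
    static final Map<Character,Integer> prec = new HashMap<Character,Integer>();
    static final Map<Character,Boolean> rightAssoc = new HashMap<Character,Boolean>();
    static {
        prec.put('=', 1); rightAssoc.put('=', true);   // lowest, right-associative
        prec.put('+', 2); rightAssoc.put('+', false);
        prec.put('*', 3); rightAssoc.put('*', false);  // highest
    }

    private final String in; // toy input: one-letter operands, one-char operators
    private int p = 0;
    PrecClimb(String in) { this.in = in; }
    private char la() { return p < in.length() ? in.charAt(p) : '$'; }

    /** Like parse_expr[minPrec]: a primary, then operators of precedence >= minPrec. */
    String parseExpr(int minPrec) {
        String left = String.valueOf(in.charAt(p++)); // primary is a single operand
        while (prec.containsKey(la()) && prec.get(la()) >= minPrec) {
            char op = in.charAt(p++);
            int nextp = rightAssoc.get(op) ? prec.get(op) : prec.get(op) + 1;
            left = "(" + left + op + parseExpr(nextp) + ")"; // group explicitly
        }
        return left;
    }

    public static void main(String[] args) {
        // prints (a=(b=(c+(d*e)))): '=' groups to the right, '*' binds tighter than '+'
        System.out.println(new PrecClimb("a=b=c+d*e").parseExpr(1));
    }
}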

If you specify a rule with option parser=precedence, ANTLR does the new magical thing:

e
options {parser=precedence;}
    :   primary  // highest precedence
    |   e '++'
    |   e '.' ID // higher than array/method call
    |   suffix
    |   ('++'|'--') e
    |   '(' type ')' e // cast
    |   e '*' e
    |   e ('+'|'-') e
    |   e '='<associativity=right> e  // lowest precedence
    ;

suffix
    :   e '[' e ']'
    |   e '(' e (',' e)* ')'
    ;

primary
    :   '(' e ')'
    |   INT
    |   ID
    ;

Within the special rule, ANTLR looks for binary operators and the suffix operators. It also looks one level deep into rules referenced from the special rule. In this case, ANTLR identifies '*' '+' '-' '=' as binary operators and '.' '[' '(' as suffix operators. All other alternatives are lumped together into a new rule:

parse_expr_alts
options {backtrack=true;}
    :   primary
    |   ('++'|'--') e
    |   '(' type ')' e // cast
    ;

The backtrack option is specified in case it is needed, which it is in this case because of the cast versus the parenthesized expression in primary. Oops, I just realized that the order of the alternatives is wrong: primary will match the parenthesized expression before trying the cast. We need a way to figure this out.

The suffix rule has left-recursive references to e, so those must be removed. The parse_expr rule invokes suffix after having matched e already:

suffix
    :  '[' e ']'
    |  '(' e (',' e)* ')'
    ;

Effect on lookahead

ANTLR can only compute LL(1) lookahead sets for these special rules. So, rules that need to look into expressions might have to backtrack:

s : e ';' // nondeterministic; must use option backtrack=true!!!
  | e '!'
  ;

Tree construction and actions

ANTLR can allow actions at the end of binary operator alternatives as well as anywhere else in the other rules. The binary operators are all combined into the parse_expr rule so ANTLR must collect all actions on the right-hand side and make a "switch on the operator" action within the parse_expr loop:

parse_expr[int p]
   :   primary
       (   {prec[input.LA(1)]>=p}?=>     bop r= parse_expr[nextp(p)]
           {
           switch ( $bop.start.getType() ) {
               case '+' :
               case '-' :
                  action-for-this-alt
                  break;
               ...
               case '=' :
                  action-for-this-alt
                  break;
               default : // no action
           }
           }
       |   {postprec[input.LA(1)]>=p}?=> suffix
       )*
   ;

Trees for the binary operators could also be automatically done using the standard ANTLR mechanisms.

parse_expr[int p]
   :   (primary->primary)
       (   {prec[input.LA(1)]>=p}?=>     bop r=parse_expr[nextp(p)] -> ^(bop $parse_expr $r)
       |   {postprec[input.LA(1)]>=p}?=> suffix[$e.tree]            -> {$suffix.tree}
       )*
   ;

The other alternatives and rules create trees as usual:

    |   primary
    |   ('++'^|'--'^) e
    |   '(' e ')' -> e

Rule suffix is left alone except for the removal of the left recursion, so you can add whatever tree construction you want, but you must allow a $lhs parameter to come in. This parameter is the primary matched immediately to the left.

suffix[CommonTree lhs]
   :   t='[' e ']'             -> ^(INDEX[$t] {$lhs} e)
   |   t='(' (e (',' e)*)? ')' -> ^(CALL[$t] {$lhs} e*)
   |   t='++'                  -> ^(POSTINC[$t] {$lhs})
   ;

Implementation

Have ANTLR build parse_expr rules as templates and then call ANTLRParser.rule() on that text. It will be like an include. This is the mechanism I use for nextToken generation.

It's possible that I could make the precedence tables and the parse_expr rule itself members of the Parser superclass.