Token stream rewriting with rewrite rules

Let's talk about tree and token stream rewriting, starting with simple token rewriting. I've thought about this and I think we can fold rewriting patterns into grammar composition and the token stream machinery. Here's a simple rule that just replaces a token:

decl : 'int' ID ';' -> 'var' ID ';' ;

Presumably a variant of TokenRewriteStream could insert tokens, not just text. This would just replace the start..stop token range with the right-hand side. If 'var' is a string, it would have to be tokenized. Hmm...the only reason to go back to tokens is if you are going to do multiple parses. If going to text, just use output=template.
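For the to-text case, ANTLR's existing TokenRewriteStream already covers this; a minimal sketch of what the generated code for decl might do (the start, stop, and id tokens are assumptions about what the rule would have matched):

import org.antlr.runtime.Token;
import org.antlr.runtime.TokenRewriteStream;

// replace the matched 'int' ID ';' span with the 'var' form; start, stop,
// and id are the tokens the decl rule would have in hand after matching
void rewriteDecl(TokenRewriteStream tokens, Token start, Token stop, Token id) {
	tokens.replace(start.getTokenIndex(), stop.getTokenIndex(),
	               "var " + id.getText() + ";");
	// tokens.toString() later renders the rewritten text
}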

Ok, if going token to token, we should really be able to use the same lexer, since we're going from language L to L.

term	:	ID		-> INT["1"]
	|	ID '^' INT	-> INT ID '^' INT[ival($INT.text)-1]
	|	INT ID		-> INT
	|	c=INT ID '^' e=INT
	        -> INT[ival($c.text)*ival($e.text)] ID '^' INT[ival($e.text)-1]
	;

where ival(token) returns the integer value of the token. (These rules rewrite each polynomial term to its derivative via the power rule.)
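Something like this trivial helper, assuming ANTLR's Token interface:

import org.antlr.runtime.Token;

// returns the integer value of a token's text, e.g., ival(INT["42"]) == 42
public static int ival(Token t) {
	return Integer.parseInt(t.getText());
}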

Syntax-directed token rewrites

That is not as obvious as an exemplar-based rewrite where i, j, k, x, y, z, v are IDs, c, n, m are ints, and s, t, u are strings (an idea I took from Andy Tripp):

term:
	"<x>" -> "1"
	"<x>^<n>" -> "<n><x>^" {ival($n.text)-1}
	"<c><x>^<n>" -> {ival($c.text)*ival($n.text)} "<x>^" {ival($n.text)-1}

The right hand side would be tokenized statically (given a pre-existing lexer) except the actions, which must be tokenized after evaluation:

term:
	"<x>" -> INT["1"]
	"<x>^<n>" -> $n $x '^' {ival($n.text)-1]}
	"<c><x>^<n>" -> {ival($c.text)*ival($n.text)} "<x>^" {ival($n.text)-1}

Andy has rules like:

v + x --> v.add(x)

But, for speed and for the sake of context, I like specifying the rule that matches the input. I also like delimiters, as I might want to match whitespace at some point:

expr:
	"<x> + <n>" -> "<x>.add(<n>)"

We need the <...> because we can't correctly/generally identify x and n without them. We need to identify the elements on the left and tokenize the right if we are going to a token stream. First, break the pattern up into rule elements and literals:

x=ID '+' n=INT -> $x '.add(' $n ')'

then tokenize:

x=ID '+' n=INT -> $x '.' ID["add"] '(' $n ')'

where we assume that the ".add()" notation is also valid in the language we are translating.

If going to text, we can leave the right-hand side as a template. In that case, the rewrite string is converted so that x and n become template attribute refs <x> and <n>:

x=ID '+' n=INT -> template(x={$x}, n={$n}) "<x>.add(<n>)"
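At run-time that template case is just StringTemplate; a sketch, assuming ST v3's API (x and n are the matched tokens):

import org.antlr.runtime.Token;
import org.antlr.stringtemplate.StringTemplate;
import org.antlr.stringtemplate.language.AngleBracketTemplateLexer;

// fill the rewrite template with the matched token text
String rewrite(Token x, Token n) {
	StringTemplate st =
		new StringTemplate("<x>.add(<n>)", AngleBracketTemplateLexer.class);
	st.setAttribute("x", x.getText());
	st.setAttribute("n", n.getText());
	return st.toString();  // e.g., "a.add(3)" for input "a + 3"
}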

A StringTemplate thought: if we can do that, why can't we use the same notation for output=template? The elements in <...> refer to labels. We can't do <ID> and such though, can we? Sure, why not:

ID '+' INT -> "<ID>.add(<INT>)"

What about property refs? Sure. That string is really:

ID '+' INT -> template(ID={$ID}, INT={$INT}) "<ID>.add(<INT>)"

where ID and INT can be template attributes like any other (just reusing the names).

Uh oh. We want to keep the same tokens if possible rather than retokenizing, I'd say. So we can't convert to a string and then retokenize. We need to keep tokens intact, as somebody might annotate tokens with info during translation. We'll accept a string with holes but no ST extensions, unless you jump into a {...} action and use %{...}.

Repeated elements

Take TXL's example of dot product: (1 2 3).(3 2 1).

// grammar rules
dot : vector '.' vector ;
vector : '(' INT+ ')' ;

// rewrite rules
dot:
'(' n+=ID+ ").(" m+=INT+ ')' -> {dotProduct($n, $m)}

That rewrite would get translated (double quote string tokenized) to

'(' n+=INT+ ')' '.' '(' m+=INT+ ')' -> {dotProduct($n, $m)}

where the {...} action would return a string that is tokenized at run-time. The $n and $m are List objects per usual ANTLR interpretation.
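A sketch of dotProduct, assuming the lists hold ANTLR Token objects and reusing the ival helper from above:

import java.util.List;
import org.antlr.runtime.Token;

// dot product of two equal-length INT-token vectors, returned as a string
// to be re-tokenized; e.g., (1 2 3).(3 2 1) -> "10"
String dotProduct(List<Token> n, List<Token> m) {
	int sum = 0;
	for (int i = 0; i < n.size(); i++) {
		sum += ival(n.get(i)) * ival(m.get(i));
	}
	return String.valueOf(sum);
}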

Ok, now, let's turn to rule references.

Rule references

stat:
	"if ( true ) <s1=stat> else <s2=stat>" -> "<s1>"
	"if ( true ) <stat>" -> "<stat>"

Also,

expr:
	"<multExpr> + 0" -> "<multExpr>" // or $multExpr
	"0 + <multExpr>" -> "<multExpr>"

The 0 will be tokenized as INT. That has to be converted to a predicated element. See next section.

More Andy rewrites:

strcasecmp(v1, v2) != 0 --> !v1.equalsIgnoreCase(v2)
strcasecmp(v1, v2) == 0 --> v1.equalsIgnoreCase(v2)
strcasecmp(v1, v2) --> v1.compareToIgnoreCase(v2)

Seems like I need to use <expr>, not v1 and v2:

"strcasecmp(<v1=expr>, <v2=expr>) != 0" --> "!<v1>.equalsIgnoreCase(<v2>)"
"strcasecmp(<v1=expr>, <v2=expr>) == 0" --> "<v1>.equalsIgnoreCase(<v2>)"
"strcasecmp(<v1=expr>, <v2=expr>)" --> "<v1>.equalsIgnoreCase(<v2>)"
 function:
	"void main(int argc, char *argv[]) {" stat+ '}' ->
<<
public class MyClass {
    public static void main(String[] args) {
	<stat+>
    }
}
>>
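In ST terms, <stat+> would just be a multi-valued <stat> attribute; a sketch, assuming StringTemplate 3 (emitMain and translatedStats are hypothetical names):

import java.util.List;
import org.antlr.stringtemplate.StringTemplate;
import org.antlr.stringtemplate.language.AngleBracketTemplateLexer;

// render the class template; repeated setAttribute calls aggregate the
// translated statements into a multi-valued attribute
String emitMain(List<String> translatedStats) {
	StringTemplate body = new StringTemplate(
		"public class MyClass {\n" +
		"    public static void main(String[] args) {\n" +
		"        <stat; separator=\"\\n\">\n" +
		"    }\n" +
		"}\n", AngleBracketTemplateLexer.class);
	for (String s : translatedStats) {
		body.setAttribute("stat", s);
	}
	return body.toString();
}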

Predicates

What about conditions that patterns must satisfy? Should it be just a predicate on the front, or predicates on the individual elements? We need to constrain tokens:

"x + 0" -> "<x>"

becomes

x=ID '+' i=INT {$i.text.equals("0")}? -> $x

We also have to allow predicates on the left:

atom:
	{inMethod}? ID -> "locals[" {$ID.localIndex} "]"
		    ID -> "global_<ID>"

Input traversal and multiple passes

How many times are the rules applied and in what order? First, the rewrites are only applied in the grammatical context specified by the rule. Then, the rules are ordered by precedence, just like in autobacktracking mode.

After each replacement, I assume we'd continue parsing after the rewritten input construct and look for the next match. Would we then just rewind the stream and do another pass until nothing changes? Perhaps, but it's pretty inefficient to pass over the entire input again just to look for a few patterns on expr. I think TXL repeatedly applies the expr rewrite rules to a single expr until nothing changes and then continues. Maybe we could do this as an option? Or, for efficiency, we could map all locations in the input that start an expr and then do an incremental parse, jumping from expr to expr. The problem is that we'd lose context; i.e., which rule invoked that expr.

Another way would be to make the parse tree during a first pass and then alter the parse tree during each pass. We could then jump straight to the expr subtree root nodes and look upwards for the context.

Actually, repeatedly applying the expr rule set until nothing changes à la TXL is probably the right idea. No other rewrite rules can match the expr input, so only a single overall pass is necessary. So there is local iteration, but each rule set represents a single pass. You could define multiple rule sets to define multiple passes; e.g., remove identity operations, then define implicit scalars.
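A sketch of that driver shape (RuleSet and applyOnce are assumptions here, not ANTLR API):

import java.util.List;
import org.antlr.runtime.TokenStream;

// hypothetical interface: one rewrite pass; true means something changed
interface RuleSet { boolean applyOnce(TokenStream input); }

class RewriteDriver {
	// each rule set is a single pass, reapplied until a fixed point, TXL-style
	void run(List<RuleSet> passes, TokenStream input) {
		for (RuleSet pass : passes) {  // e.g., remove identity ops, then scalars
			boolean changed = true;
			while (changed) {
				input.seek(0);  // rewind to the start of the stream
				changed = pass.applyOnce(input);  // true if any rewrite fired
			}
		}
	}
}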

Avoid parse tree construction to save time/space.

When we have input f(f(x)) and we are translating f() to g(), when does the arg list get translated?

primary:
	"f(<expr>)" -> "g(<expr>)"

First we apply the rewrite to the outside and get g(f(x)). Then we try the rewrites again for primary. We can't jump over the args. OH! The pattern invokes rule expr, so it will get back down to primary and apply the rewrite recursively. Marvelous. This way the pattern itself effectively says which rewrites to apply.

Pattern matching implementation

I think we can just list the patterns first in the indicated rule and turn on autobacktracking.

expr
options {backtrack=true;}
	:	x=ID '+' y=INT {$y.text.equals("0")}? -> $x
	|	y=INT {$y.text.equals("0")}? '+' x=ID -> $x
	|	normal-expr-alternatives
	;

How do we make it spin around expr though until it doesn't change? Perhaps a method that invokes the rule:

boolean changed = true;
while ( changed ) {
	changed = false;  // rewrite actions set this back to true when they fire
	int m = input.mark();
	expr();
	input.rewind(m);
}

But, every reference to a rule with rewrites would have to call a meta-function like this instead of the rule directly. Hmm...does that mean the grammar must be regenerated every time I add a rewrite to a new rule? Is this related to grammar composition where I pull in components of a previous grammar?

Hidden channels

How do we handle comments in between tokens we rewrite? Ric Klaren had the idea of matching off-channel tokens on the left so we can reference them on the right.

// move comments in front of assignments to end
assign:
	.COMMENT ID '=' expr ';' -> ID '=' expr ';' COMMENT

where the dot in front means "hidden" like in UNIX filenames.
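A sketch of what that rewrite might do with TokenRewriteStream (the comment and semi labels are what the rule would have matched):

import org.antlr.runtime.Token;
import org.antlr.runtime.TokenRewriteStream;

// move a leading off-channel COMMENT token to the end of the assignment
void moveComment(TokenRewriteStream tokens, Token comment, Token semi) {
	tokens.delete(comment.getTokenIndex());             // drop it from the front
	tokens.insertAfter(semi, " " + comment.getText());  // re-emit after ';'
}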