tree pattern matching grammars

Filter mode for Lexers

So, filter mode for Lexers is incredibly useful. For example, I use it for my wiki to HTML translation. you just specify some rules in the lexer with filter=true and it tries all of the rules against the input stream. Precedence is given to rules specified first. In other words, the lexer tries to match the first rule against the current input location. If it fails, it moves to the next rule in tries it. It tries rules until it finds one that matches or it fails. Upon failure, it advances the character pointer by 1 and repeats. It is like a sliding window across the string of characters. Works great. Here is something from the fuzzy Java parser in the examples-v3 dir:

lexer grammar FuzzyJava;
options {filter=true;}

IMPORT
        :       'import' WS name=QIDStar WS? ';'
        ;

RETURN
        :       'return' (options {greedy=false;}:.)* ';'
        ;

fragment
QID :   ID ('.' ID)*
        ;
...

Note the use of the fragment rule. As usual, this says that it is a help or rule and the lexer should not interpret it as a pattern to match.

Tree matching would work the same way.

Tree pattern matching

One of the problems with a full tree grammar, as Andy Tripp will point out with pleasure, is that you have to specify the full tree structure just to execute a single action for a single pattern. Let's say all you want to do is track whether or not they multiply is found. You would be specifying an entire grammar when all you want to do is say

tree grammar Foo;
options {filter=true;}
// match ANY multiply
mult : ^('*' . .) {System.out.println("found *");} ;

The filter mode would automatically do a depth first walk of the tree, looking for pattern matches on the way back up. This complete specification here would print "found *" for every multiply found anywhere in the tree.

We would also need fragment rules in this case. For example, we might want to specify what the subtrees look like that we are interested in (note they may not be the full syntax--they are just the ones we are interested in):

tree grammar Foo;
options {filter=true;}
// match multiply with either ID or INT as operands
mult : ^('*' e e) {System.out.println("found *");} ;

fragment
e : ID | INT ;

Like the lexer filter mode, the tree pattern matching filter mode would try the rules in order (except for the fragment rules). Is nothing matches, it would simply move on to the next node in the tree.

Context

We often need to match tree patterns only in a certain context. For example, the following rule has two alternatives predicated on the context. The first alternative only matches a variable definition immediately under a CLASS node (i.e. as a field). The second alternative only matches a variable definition if it's immediate parent is a BLOCK node.

vars : {inContext("CLASS")}? ^(VARDEF ID)  // fields
     | {inContext("BLOCK")}? ^(VARDEF ID)  // locals
     ;

Tree rewriting

With output=AST and rewrite=true, we could easily do tree rewriting as well. For example:

tree grammar Foo;
options {filter=true; output=AST; rewrite=true;}
simplify
    :    ^('+' a=. INT {$INT.int==0}?) -> $a  // x+0 -> x
    |    ^('+' INT {$INT.int==0}? b=.) -> $b  // 0+x -> x
    ;

flip:    ^('+' a=. b=.) -> ^('*' $a $b) ; // x+y -> x*y

Pattern matching order

Andy Tripp brought up a good point. How do we deal with simplifications to operator sequences such as "A + B * 0". If we have two rules that simplify expressions:

x+0 -> x  // not valid grammar; just showing the rules in English
x*0 -> 0

We can't execute the rewrites in the order the operators appear in the input sequence: + before *. x+0 does not match. We'd end up with A+0 not A. It turns out that trees just do the right thing. The AST already encodes precedence, which ensures that the multiply subtree is lower than the addition subtree. The A+B*0 tree is:

The pattern matching engine checks B*0 subtree first, replacing it with 0 and then tries the addition simplification.

Implementation

Tree grammars with filter mode we generate an extra method that created a tree Walker and invoked a special patterns rule (which would be the collection of all nonfragment rules in the grammar).

public static void match(CommonTree t) {
    System.out.println("match "+t.toStringTree());
    CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
    MyTreeWalker p = new MyTreeWalker(nodes);
    try {
        MyTreeWalker.patterns_return r=p.patterns();
        if ( r.tree!=null ) System.out.println("result="+((CommonTree)r.tree).toStringTree());
    }   
    catch (RecognitionException re) { ; } // ignore errors
}

We'd also need to override match() so that it did not try to recover.

public Object match(IntStream input, int ttype, BitSet follow)
    throws RecognitionException
{   
    state.backtracking=1;
    Object r = super.match(input, ttype, follow);
    state.backtracking=0;
    return r;
}

We also need the depth first walk method:

public static void walk(CommonTree t) {
    if ( t.getChildCount()>0 ) { 
        List children = t.getChildren();
        for (int i=0; i<children.size(); i++) {
            CommonTree child = (CommonTree)children.get(i);
            walk(child);
        }   
    }   
    MyTreeWalker.match(t);
}

My current prototype implementation is not particularly efficient, but is a suitable first start. My eventual implementation what it minimum need to turn on backtracking to avoid throwing exceptions. It would also need to avoid creating a new CommonTreeNodeStream each time (that is very expensive I think because it walks the entire subtree to create a stream). Perhaps O(n^2) matching cost.