Squirrel away the trees, call on the visitors

After a few weeks away from ANTLR v4 coding, I'm back to thinking about tree grammars and the automated generation of tree visitors. I recently replaced a number of tree grammars in ANTLR v4 itself with much simpler visitor implementations. The visitors don't require a separate specification and are much easier to debug. I made an ubervisitor that actually matches patterns in the tree rather than individual nodes (using a single prototype tree grammar) and then calls listener functions. The ubervisitor has the combined pattern-matching functionality needed by the various listeners.

A tree grammar is nothing more than a formal specification of the tree structure created by a parser grammar. From this tree grammar, I generate an external visitor (see Language Implementation Patterns). People always ask if there is an automated way to create a tree grammar from the tree-construction instructions sitting in the parser grammar. It's a pretty hard task given the complexity of the rewrite instructions and so on. I was thinking that I could do something that observed the trees you created and combined that with knowledge of the grammar to produce a tree grammar. Sounded pretty complicated. But that got me thinking about using the structure implied by the grammar to automatically produce high-level visitors (i.e., visitors that don't simply tell you that they found an ID node, which is pretty worthless). What we want is to match phrases per what you had in the grammar. Oliver Zeigermann and I have been tossing some ideas around, which also led to my thoughts on this blog page.

Newbie mode

What if we had a newbie mode that literally did not expose an internal representation (tree or otherwise)? Let's face it: most of the time all we want to do is parse something and then visit the elements multiple times. We'd reparse all of the time if it weren't so slow and didn't sound so unsophisticated. Also, trees or some other internal data structure are really great for keeping track of little pieces of information. We like heterogeneous tree nodes because we can have different information per node. To replicate that in a visitor, we could simply deconstruct the nodes (or whatever the internal representation is) and send that information back to the visitor as parameters of a function. None of this needs to be exposed to the user. Here is a simple expression grammar thingie:

grammar Ollie;

stat : lhs=ID '=' e=expr ';' ;

expr : left=mexpr (op='+' right=mexpr)* ;

mexpr : left=atom (op='*' right=atom)* ;

atom : INT
     | ID
     | array=ID '[' index=expr ']'
     ;

We don't need to build a tree per se. What we need is to be able to visit the phrases and tokens multiple times, and efficiently. We also don't want to have to build a tree grammar, which requires tree construction by the user, and we don't want to have to build a visitor by hand.

So, perhaps we automatically generate visitor methods that mirror the grammar. This has the advantage that we're not using a generic visitor that visits node by node. Here's what we could easily generate automatically from the above grammar:

public class OllieListener {
    Object stat(Token lhs, Object e) { return null; }

    Object expr(Object left, Token op, Object right) { return null; }

    Object mexpr(Object left, Token op, Object right) { return null; }

    Object atom1(Token t) { return null; } // INT

    Object atom2(Token t) { return null; } // ID

    Object atom3(Token array, Object index) { return null; } // array index
}

The return values get passed as parameters to other visitor methods. For example, rule stat calls rule expr and saves the return value in label e. We pass that e value to the stat() visitor method. You can return nothing or a payload of interest, depending on the visitation task. We could even let these functions set data "local" to each particular phrase matched by the parser. Subclasses of OllieListener could maintain state using instance variables. If we wanted to persist the parameters and return values, that could easily be added to the internal representation.
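To make this concrete, here's a minimal sketch of what a subclass might look like. Everything here is hypothetical: Token is a tiny stand-in for a real token class, and OllieListener is the base class we imagine generating from the grammar above. The subclass evaluates assignments by treating each return value as the computed value of the phrase that produced it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a real token class; just carries the matched text.
class Token {
    final String text;
    Token(String text) { this.text = text; }
}

// The base listener we imagine generating from the Ollie grammar.
class OllieListener {
    Object stat(Token lhs, Object e)                  { return null; }
    Object expr(Object left, Token op, Object right)  { return null; }
    Object mexpr(Object left, Token op, Object right) { return null; }
    Object atom1(Token t)                   { return null; } // INT
    Object atom2(Token t)                   { return null; } // ID
    Object atom3(Token array, Object index) { return null; } // array index
}

// An evaluator as a subclass: each method returns the value of the phrase it
// matched, and the `memory` instance variable carries state across calls.
class Evaluator extends OllieListener {
    final Map<String,Integer> memory = new HashMap<>();

    @Override Object stat(Token lhs, Object e) {
        memory.put(lhs.text, (Integer)e);  // lhs '=' expr ';'
        return null;
    }
    @Override Object expr(Object left, Token op, Object right) {
        return right == null ? left : (Integer)left + (Integer)right;
    }
    @Override Object mexpr(Object left, Token op, Object right) {
        return right == null ? left : (Integer)left * (Integer)right;
    }
    @Override Object atom1(Token t) { return Integer.valueOf(t.text); }
    @Override Object atom2(Token t) { return memory.get(t.text); }
}
```

Parsing `x = 3 + 4*5;` would then trigger atom1 for each INT, mexpr for the multiplication, expr for the addition, and finally stat, leaving 23 in memory under key x.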

The point is that we don't have to specify tree structure. We don't have to build a tree grammar that calls our listener functions (or has actions embedded directly inside). We don't need to create heterogeneous tree nodes; the parameters of the various visitor methods represent the elements of the tree. Actually, as I say, we don't care what's on the inside; it could be a list of tokens that we re-parse continuously.

The other thing to note here is that I don't call them visitStat, etc.; these are actually listener methods triggered by a visitor. We don't want the user to have to make the right calls inside these methods to ensure that we walk the entire intermediate representation.

From looking at the work of my friends, Jurgen Vinju (ASF+SDF) and Eelco Visser (Stratego/XT), there are a couple of very interesting lessons:

  • you don't always have to care what the internal structure of the input is
  • when doing transformations, using concrete syntax like "x+x -> 2*x" is useful; probably implies a parse tree as the internal structure
  • it's a great idea to separate how we walk the input from what we do while discovering nodes or phrases.

The negatives

I can imagine this being less efficient because I would likely build a parse tree instead of an AST. Parse trees have internal nodes that represent rule applications/invocations, but for many applications we care more about the relationship between the tokens than we do about which rules were used to match them. So, these internal nodes are a waste. What we really want to do is to create operator-operand trees such as (+ 3 4) where + is at the root and has 2 children, 3 and 4. Power users will still want the ability to create their own ASTs and tree grammars / tree filters for maximum efficiency and control.

The other problem is that every time you change the grammar, the visitor would change. As with automatically generated GUIs, we have to worry about user-supplied code that no longer applies or should move into a different method. So, if you rearrange alternatives one and two of atom above, all of a sudden the code would break since atom1 and atom2 would be flipped. Worse, it might sort of work. I guess one way to mitigate this is to make sure that you get a decent grammar first and then generate the visitor. Minor tweaks or reordering from that point shouldn't be a big deal. I can't think of a good automated way to reduce the sensitivity of the visitor to the original grammar since we are relying on the grammar for structural information. If we left special markers in comments, we could do what the GUI designer tools do and try to shuffle programmer-supplied code around to the right spot.

For ANTLR v4, I had a single tree grammar that walked the nice orderly tree created from my parser grammar. That isolated all of my visitors from changes to the tree and changes from the original parser grammar. I don't see a way to isolate visitor code from grammars automatically.

If we want to do transformations, however, we can use concrete syntax, which not only hides the internal structure of the tree but also leaves no visitor code for programmers to modify. All the functionality derives from the "this goes to that" transformations themselves. Of course, this only works when you're doing translation. If you need to trigger calls from a configuration file into your massive server, there's no escaping the fact that we need to execute code snippets. Where and how to specify those snippets is the key question.

Pattern-based visitors

Maybe we should borrow a page from the term rewriting guys and use patterns instead of special method names. For example, we could specify contexts (rules) and then patterns followed by actions to execute if those are seen:

stat:
    <lhs=ID> = <e=expr>;    { operate on $lhs and $e }

expr:
    <INT>           { }
    1               { }  // match INT with value 1
    <ID>            { }
    <ID>[<expr>]    { } 

This would work very well for operating on pieces of the input, such as all variable declarations. Of course, this requires an extra step to convert the patterns to Java code, and without a plug-in an IDE must treat this as a plain text file.

One of the negatives I can see here is that there might be a lot of cut-and-paste from the grammar into these patterns. The pattern for a method declaration can be rather large. In contrast, a visitor would simply say visitMethodDecl() or whatever. Obviously, one could simply put a call to visitMethodDecl() inside the action, isolating the code from the patterns.

Contrast this pattern-based approach with a tree filter grammar that triggers actions. I would have to specify how to build trees in the parser grammar but could get away with specifying only the patterns of interest in the tree filter, like this:

tree grammar Blort;
options { filter=true; }

topdown : expr ; // apply this rule on the way down the tree

expr: INT                    { }
    | INT {$INT.int==1}?     { }
    | ID                     { }
    | ^(ARRAY_INDEX ID expr) { }
    ;

The pattern-based approach seems easier and, because it uses concrete syntax, could be easier to use.

Java annotations approach

I wonder if there's a way to use annotations to specify patterns of interest, to avoid having to run ANTLR or some other tool on those patterns or rules to get the visitor. (This won't help the other target languages, but Java is probably the largest market for ANTLR, so I will investigate this option.)

public class OllieListener {
    @Visit("stat")
    Object doThisUponMatchingStatPattern(Token lhs, Object e) { return null; }

    @Visit("expr")
    Object add(Object left, Token op, Object right) { return null; }

    @Visit("mexpr")
    Object mult(Object left, Token op, Object right) { return null; }

    @Visit("expr","INT")
    Object visitINT(Token t) { return null; }

    @Visit("expr","ID")
    Object visitID(Token t) { return null; }

    @Visit("expr","ID '[' expr ']'")
    Object visitArrayIndex(Token array, Object index) { return null; }
}

From the collection of annotations and their optional patterns, we could run a static annotation processor to derive a visitor that triggers these methods. We could also do it at runtime so that there is no tool to run whatsoever. Running the static analyzer would simply make it go faster.
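The runtime half is easy to prototype with plain reflection. Everything below is a hypothetical sketch, not part of ANTLR: the @Visit annotation and Dispatcher are invented names. The dispatcher scans a listener's annotations once, maps each rule/pattern key to its method, and invokes that method whenever the (imagined) walker reports a match.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical annotation: a rule name, optionally followed by a pattern.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Visit { String[] value(); }

// Scans a listener's annotations once, then routes matches to the right method.
class Dispatcher {
    final Map<String,Method> handlers = new HashMap<>();
    final Object listener;

    Dispatcher(Object listener) {
        this.listener = listener;
        for (Method m : listener.getClass().getDeclaredMethods()) {
            Visit v = m.getAnnotation(Visit.class);
            if (v != null) {
                m.setAccessible(true);
                handlers.put(String.join("/", v.value()), m); // e.g. "expr/INT"
            }
        }
    }

    // Called by the imagined tree walker when it matches rule/pattern `key`.
    Object trigger(String key, Object... args) {
        Method m = handlers.get(key);
        if (m == null) return null; // no handler registered; ignore the match
        try { return m.invoke(listener, args); }
        catch (ReflectiveOperationException e) { throw new RuntimeException(e); }
    }
}

// A listener using the annotation; the walker never sees its method names.
class MyOllieListener {
    @Visit("expr")
    Object add(Object left, Object right) { return (Integer)left + (Integer)right; }
}
```

Here `new Dispatcher(new MyOllieListener()).trigger("expr", 3, 4)` returns 7; a static annotation processor would just precompute the same lookup table at build time.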

I can see that the names of the parameters could be an issue because changes to the labels in the grammar would force changes in the method signatures. I guess at least we would know about it because of the annotations.

Mixed-mode rules and listener triggers

Finally, it's probably worth exploring a simpler and more direct approach: adding triggers to the original grammar but having ANTLR actually invoke them during its traversal of an internal data structure.

grammar Ollie;
options { output=visitor; } // inventing an option here

stat : lhs=ID '=' e=expr ';' -> visitStat;

expr : left=mexpr (op='+' right=mexpr -> visitAdd)*  ;

mexpr : left=atom (op='*' right=atom -> visitMult)* ;

atom : INT                         -> visitInt
     | ID                          -> visitID
     | array=ID '[' index=expr ']' -> visitArrayIndex
     ;

Here, I'm using the -> rewrite operator to indicate the name of a function to trigger. The parameters to the function are the labels, if any, on the elements to the left of the -> operator. To be more general, we could allow arbitrary actions to the right of the rewrite that call the listener function after performing some operations and passing custom values to it.

ANTLR would generate a visitor as well as the parser. The parser would have embedded actions to build a parse tree. The programmer could request multiple visits over that tree without having to re-parse the input. It only looks like we are triggering actions during the parse.
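To illustrate "multiple visits without re-parsing," here's a toy sketch (all names invented) of the kind of structure the parser could record: each node remembers which trigger fired and its labeled children, and a walk replays the triggers bottom-up as many times as you like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// What the imagined -> triggers reduce to: one method per trigger name.
interface OllieTriggers {
    Object visitAdd(Object left, Object right);
    Object visitMult(Object left, Object right);
    Object visitInt(String text);
}

// A recorded phrase: the trigger to replay plus its children (tokens or subtrees).
class Phrase {
    final String trigger;
    final List<Object> kids;

    Phrase(String trigger, Object... kids) {
        this.trigger = trigger;
        this.kids = Arrays.asList(kids);
    }

    // Replay the triggers bottom-up; call as often as you like, no re-parse.
    Object walk(OllieTriggers t) {
        List<Object> vals = new ArrayList<>();
        for (Object k : kids) {
            vals.add(k instanceof Phrase ? ((Phrase)k).walk(t) : k);
        }
        switch (trigger) {
            case "visitAdd":  return t.visitAdd(vals.get(0), vals.get(1));
            case "visitMult": return t.visitMult(vals.get(0), vals.get(1));
            case "visitInt":  return t.visitInt((String)vals.get(0));
            default: throw new IllegalArgumentException(trigger);
        }
    }
}
```

A recorded `3 + 4*5` would be `new Phrase("visitAdd", new Phrase("visitInt","3"), new Phrase("visitMult", new Phrase("visitInt","4"), new Phrase("visitInt","5")))`; walking it with an evaluating listener yields 23, and a second walk with, say, a pretty-printing listener needs no second parse.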

I'm not sure I like overloading/abusing the -> rewrite notation, but I guess it's okay.

This approach is nice because, as you might have noticed, we don't call visitAdd unless we actually see an addition operator. Some of the previous versions call visitExpr even when all it does is match an INT down in atom.

The user doesn't have to figure out a tree to build; ANTLR would simply build a parse tree. Because we are specifying the names of the functions to call, the grammar and the resulting visitor functions should be less tightly coupled than with the visitor methods atom1, atom2, and atom3 generated for the three alternatives of rule atom, for example.