Blog from August, 2011

After a few weeks away from ANTLR v4 coding, I'm back to thinking about tree grammars and the automated generation of tree visitors. I recently replaced a number of tree grammars in ANTLR v4 itself with much simpler visitor implementations. Doesn't require a separate specification and is much easier to debug. I made an ubervisitor that actually matches patterns in the tree rather than nodes (using a single prototype tree grammar) and then calls listener functions. The ubervisitor has the combined pattern matching functionality necessary for the various listeners.

A tree grammar is nothing more than a formal specification of the tree structure created by parser grammar. From this tree grammar, I generate an external visitor (See Language implementation patterns). People always ask if there is an automated way to create tree grammar from the tree construction instructions sitting in the parser grammar. It's a pretty hard task given the complexity of the rewrite instructions and so on. I was thinking that I could do something that observed the trees you created and combined it with knowledge of the grammar to produce a tree grammar. Sounded pretty complicated. But, that got me thinking about using the structure implied by the grammar to automatically produce high-level visitors (i.e., visitors that don't simply tell you that they found an ID node, which is pretty worthless). What we want is to match phrases per what you had in the grammar. Oliver Zeigermann and I have been tossing some ideas around, which also led to my thoughts on this blog page.

Newbie mode

What if we had a newbie mode that literally did not expose an internal representation (tree or otherwise). Let's face it. most of the time all we want to do is parse something and then visit the elements multiple times. We'd reparse all of the time if it weren't so slow and it didn't sound so unsophisticated. Also, trees or some other internal data structure are really great for keeping little pieces of information. We like heterogeneous tree nodes because we can have different information per node. To replicate that in a visitor, we could simply deconstruct the nodes or whatever the internal representation is, and simply send that information back to the visitor as parameters of a function. No need to have any of this exposed to the user. Here is a simple expression grammar thingie:

grammar Ollie;

stat : lhs=ID '=' e=expr ';' ;

expr : left=mexpr (op='+' right=mexpr)* ;

mexpr : left=atom (op='*' right=atom)* ;

atom : INT
     | ID
     | array=ID '[' index=expr ']'
     ;

We don't need to build a tree per se. What we need is to be able to visit the phrases and tokens multiple times, and efficiently. We also don't want to have to build a tree grammar, which requires tree construction by the user, and we don't want to have to build a visitor by hand.

So, perhaps we automatically generate visitor methods that mirror the grammar. This has the advantage that were not using the generic visitor that visits node by node. Here's what we could easily generate automatically from the above grammar:

public class OllieListener {
    Object stat(Token lhs, Object e) {
    }

    Object expr(Object left, Token op, Object right) {
    }

    Object mexpr(Object left, Token op, Object right) {
    }

    Object atom1(Token t) { // INT
    }

    Object atom2(Token t) { // ID
    }

    Object atom3(Token array, Object index) { // array index
    }
}

The return values get passed as parameters to other visitor methods. For example, rule stat calls rule expr and saves the return value in label e. We pass that e value to the stat() visitor method. You can return nothing or a payload of interest, depending on the application of the visitation task. We could even let these functions set data "local" to each particular phrase matched by the parser. Subclasses of OllieListener could maintain state by using instance variables. If we want to persist the parameters and return values, this could easily be added to the functionality of the internal representation.

The point is that we don't have to specify tree structure. We don't have to build a tree grammar that calls our listener functions (or has actions directly inside). We don't need to create heterogeneous tree nodes; the parameters of the various visitor methods represent the elements of tree. Actually, as I say, we don't actually care what's on the inside. could be a list of tokens that we re-parse continuously.

The other thing to note here is that I don't call them visitStat, etc. these are actually listener methods triggered by a visitor. We don't want the user to have to make the right calls inside these methods to make sure that we walk the entire intermediate representation.

From looking at the work of my friends, Jurgen Vinju (ASF+SDF) and Eelco Visser (Stratego/XT), there are a couple of very interesting lessons:

  • you don't always have to care what the internal structure of the input is
  • when doing transformations, using concrete syntax like "x+x -> 2*x" is useful; probably implies a parse tree as the internal structure
  • it's a great idea to separate how we walk the input from what we do while discovering nodes or phrases.

The negatives

I can imagine this being less efficient because I would likely build a parse tree instead of an AST. Parse trees have internal nodes that represent rule applications/invocations, but for many applications we care more about the relationship between the tokens than we do about which rules were used to match them. So, these internal nodes are a waste. What we really want to do is to create operator-operand trees such as (+ 3 4) where + is at the root and has 2 children, 3 and 4. Power users will still want the ability to create their own ASTs and tree grammars / tree filters for maximum efficiency and control.

The other problem is that every time you change the grammar, the visitor would change. Like automatically generated GUIs, we have to worry about code that users supplied that no longer apply or should go into a different method. So, if you rearrange alternatives one and two of atom above, all of a sudden the code would break since atom1 and atom2 would be flipped. Worse, it might sort of work. I guess one way to mitigate this is to make sure that you get a decent grammar 1st and then generate the visitor. Minor tweaks or reordering from that point shouldn't be a big deal. I can't think of a good automated way to reduce the sensitivity of the visitor to the original grammar since we are relying on the grammar for structural information. If we left special markers in comments, we could do like the GUI designer tools do and try to shuffle programmer supplied code around to the right spot.

For ANTLR v4, I had a single tree grammar that walked the nice orderly tree created from my parser grammar. That isolated all of my visitors from changes to the tree and changes from the original parser grammar. I don't see a way to isolate visitor code from grammars automatically.

If we want to do transformations, however, we can use concrete syntax which hides not only the internal structure of the tree but does not have any visitors that programmers can modify. All the functionality is extracted from the side of "this goes to that" transformations. Of course, this only works when you're doing translation. If you need to trigger calls from the configuration file into your massive server, there's no escaping the fact that we need to execute code snippets. Where to specify and how to specify those snippets is the key question.

Pattern-based visitors

Maybe we should borrow a page from the term rewriting guys and use patterns instead of special method names. For example, we could specify contexts (rules) and then patterns followed by actions to execute if those are seen:

stat:
    <lhs=ID> = <e=expr>;    { operate on $lhs and $e }

expr:
    <INT>           { }
    1               { }  // match INT with value 1
    <ID>            { }
    <ID>[<expr>]    { } 

This would work very well for operating on pieces of the input, such as all variable declarations. Of course, requires an extra step to convert this to Java code and without a plug-in an IDE must treat this as a text file.

One of the negatives I can see here is that there might be a lot of cut-and-paste from the grammar into these patterns. The pattern for a method declaration can be rather large. In contrast, a visitor would simply say visitMethodDecl() or whatever. Obviously, one could simply put a call to visitMethodDecl() inside the action, isolating the code from the patterns.

Contrast this pattern-based approach, with a tree filter grammar that triggers actions. I would have to specify how to build trees in the parser grammar but could get away with specifying only the patterns of interest in the tree filter like this:

tree grammar Blort;
options { filter=true; }

topdown : expr ; // this rule on the way down the tree

expr: INT                    { }
    | INT {$INT.int==1}?     { }
    | ID                     { }
    | ^(ARRAY_INDEX ID expr) { }
    ;

The pattern-based approach seems easier and, because it uses concrete syntax, could be easier to use.

Java annotations approach

I wonder if there's a way to use annotations to specify patterns of interest to avoid having to run ANTLR or some other tool on those patterns or rules to get the visitor. (This won't help the other languages but Java is probably the largest market for ANTLR, so I will investigate this option).

public class OllieListener {
    @Visit("stat")
    Object doThisUponMatchingStatPattern(Token lhs, Object e) {
    }

    @Visit("expr")
    Object add(Object left, Token op, Object right) {
    }
    
    @Visit("mexpr")
    Object mult(Object left, Token op, Object right) {
    }

    @Visit("expr","INT")
    Object visitINT(Token t) { // INT
    }

    @Visit("expr","ID")
    Object visitID(Token t) { // ID
    }

    @Visit("expr","ID '[' expr ']'")
    Object visitArrayIndex(Token array, Object index) { // array index
    }
}

From the collection of annotations, and optional patterns, we could run a static annotation analyzer to derive a visitor that would trigger these patterns. We could also do it at runtime so that there was no tool to run whatsoever to make this work. Running the static analyzer would simply make it go faster.

I can see that the names of the parameters could be an issue because changes to the labels in the grammar would force changes in the method signatures. I guess at least we would know about it because the annotations.

Mixed-mode rules and listener triggers

Finally, it's probably worth exploring a simpler and more direct approach: adding triggers to the original grammar but having ANTLR actually invoke them during its traversal of an internal data structure.

grammar Ollie;
option { output=visitor; } // inventing an option here

stat : lhs=ID '=' e=expr ';' -> visitStat;

expr : left=mexpr (op='+' right=mexpr -> visitAdd)*  ;

mexpr : left=atom (op='*' right=atom -> visitMult)* ;

atom : INT                         -> visitInt
     | ID                          -> visitID
     | array=ID '[' index=expr ']' -> visitArrayIndex
     ;

Here, I'm using the -> rewrite operator to indicate the name of a function to trigger. The parameters to the function are the labels if any on the elements to the left of the -> operator. To be more general, we can allow arbitrary actions to the right of the rewrite that would call the listener function after performing some operations and passing custom values to the listener.

ANTLR would generate a visitor as as well as the parser. The parser would have embedded actions to build a parse tree. The programmer could request multiple visits over that tree without having to re-parse the input. It only looks like we are triggering actions during the parser.

I'm not sure I like overloading/abusing the -> rewrite notation, but I guess it's okay.

This approach is nice, because you might have noticed that we don't call visitAdd unless we see an addition operator. Some of the previous versions call visitExpr even when all it does is match INT down in atom.

The user doesn't have to figure out a tree to build, ANTLR would simply build a parse tree. Because we are specifying the names of functions to call, the grammar and resulting visitor functions should be less tightly coupled than the visitor methods atom1, atom2, atom3 for the 3 alternatives of rule atom, for example.

Introduction

I'm abandoning this post mid-stream...seems that regular alternatives can match erroneous input just as easily as so-called error alternatives. Because of adaptive LL(*), it shouldn't affect production speed at all once it gets warmed up.

ANTLR has a built-in mechanism to detect, report, and recover from syntax errors. It seems to do a pretty good job. Certainly it's better than PEG, which can't detect errors until EOF. It does single token insertion and deletion automatically to deal with lots of common cases where people forget a symbol or have an extra one like "f(x;" and "f((x);".

The default mechanism recovers by scanning for a token that can follow the current rule or the invoking rule and so on. This work surprisingly well and is pulled almost directly from Niklaus Wirth's early Pascal parsing discussions in the 70s.

Yacc has error productions like this:

| error SEMI

that mean consume tokens until you reach SEMI, replacing everything but the SEMI with the error nonterminal. The error rule acts like a non-greedy ".*" loop.

My intuition is that scanning with the power of the parser rather than ".*" will allow a great deal of control over resynchronization. At the very least, it it will provide a great deal of flexibility in terms of generating good messages. We can simply list the common mistakes and add actions to emit very specific error messages. With only ".*" to distinguish erroneous input, yacc it might have trouble distinguishing between different erroneous sequences (in order to generate specific error messages).

The error productions serve both to recognize the erroneous input and to resynchronize the parser. After specifying the grammatical structure of the erroneous input, the error production can include the grammatical structure needed to resynchronize. After detecting an incorrect variable declaration prefix, a programmer could specify how to consume all the way through the rest of the declaration, even if it's not terminated by a \n or ';'.

decl: 'int' '[' ']' ID '=' expr
    | 'int' '[' ']' ID
    / 'int' '[' ID ('=' expr)?
    ;

where '/' is a "bent" or "not quite right" alternative operator, to distinguish from the PEG ordered alternative operator. One could even imagine skipping over the entire method body if the header is messed up:

method
    : header body
    / .* body
    ;

Here, we match any kind of header and then skipped past the entire body. There's no way a simple "scan until resynchronization set found" loop could scan past a nested construct like a method body.

Another important thing to mention is that the ".*" is not scanning until it sees the 1st token of rule body. It's a non-greedy loop that skips anything until a valid body is seen.

Proposed mechanism

To support their productions, rule alternative blocks will be extended to allow 0 or more error productions at the end. Rules would not support error productions. (I looked through a number of grammars and couldn't really find a situation in which the default error mechanism was seriously lacking.)

The normal mechanism, without error productions, traps mismatched token and no viable alternative errors within the productions of that rule. After reporting an error, it resynchronizes to follow set. (assuming we could not repair the error with single token insertion or deletion inline). If the user specified a catch expression, that code was executed in lieu of the standard mechanism. Any finally clause is executed after everything else, right before returning from the rule method. Rules look like this:

stat:   'return' INT ';'
    |   ID '=' expr ';'
    |   ID '(' expr (',' expr)* ')' ';'
    ;   
    catch[Exception e] { }
    finally { }

Here's a rule with a few error productions:

atom:   '(' expr ')'
    |   INT
    |   ID '(' expr (',' expr)* ')'
    /   '(' expr        // missing ')'
    /   ID '(' (expr|',') ')'         // missing >=1 comma somewhere in arg list
    ;

Boy, I'm sure having a hard time coming up with decent examples.