Auto tree construction and visitors

Ok, been doing some thinking and playing around and also talking to Sam Harwell / Oliver Zeigermann.

The first modification I've made is to turn parse tree construction on or off with a simple Boolean, rather than having to regenerate the parser with -debug. Also, the parsers fire methods enterMethod/exitMethod with the rule index all the time now since it is so convenient to have these. No more needing -trace and regenerating to get debug output.

The parse tree is useful for debugging because it records a complete trace of how the input was matched by the grammar. Because it can be automatically generated, it is also a convenient way for newbies and experts alike to get something up and running quickly. It nicely isolates their actions from the grammar itself, which can then be reused for other projects without regenerating and recompiling it.

Basic parse tree listeners

ANTLR can generate a visitor that walks a parse tree, triggering methods of a listener. The most brain-dead listener looks like this:

interface ParseTreeListener {
    void visitToken(Token token);
    void enterRule(ParserRuleContext ctx);
    void exitRule(ParserRuleContext ctx);
}

With the enter/exit methods, programmers can specify what to do on the way down and on the way up the tree. By using a listener mechanism, I can hide all of the details of a visitor within the ANTLR runtime library. It's too easy to forget that visitor methods must check their children. In this way, listeners just manage the event, not the walking of the tree.

Parse tree implementation

The amazing thing is that I really didn't have to implement parse trees because we create RuleContext objects for every rule invocation. These objects contain the parameters, return values, labels, and any local variables you specify in the rule. These can act like a parse tree if we simply track them and add a list of children. We only need to create a leaf node pointing at token objects. To add properties to that particular node, just add a new local. For example, consider rule r:

r[int x] returns [int y]
locals [int z]
        : name=ID
        ;

we get the following context object:

public static class r_ctx extends ParserRuleContext {
    public int x;      // <-- parameter
    public int y;      // <-- return value
    public int z;      // <-- local variable
    public Token name; // <-- token label
    public r_ctx(RuleContext parent, int state) { super(parent, state); }
    public r_ctx(RuleContext parent, int state, int x) {
        super(parent, state);
        this.x = x;
    }   
}   

Interface ParseTree defines two sub interfaces

public interface RuleNode extends ParseTree {
    RuleContext getRuleContext();
}       
public interface TokenNode extends ParseTree {
    Token getToken();
}       

The visitor

We want the visitor mechanism itself to be in the ANTLR library. We might have multiple strategies for walking the tree, distinct from firing events from the tree. The following recursive depth first implementation fires the basic events in ParseTreeListener and any rule specific events that the programmer has defined in their listener:

public class ParseTreeVisitor {
    public void visit(ParseTreeListener listener, ParseTree t) {
        if ( t instanceof ParseTree.TokenNode) {
            visitToken(listener, (ParseTree.TokenNode) t);
            return;
        }
        ParseTree.RuleNode r = (ParseTree.RuleNode)t;
        enterRule(listener, r);
        int n = r.getChildCount();
        for (int i = 0; i<n; i++) {
            visit(listener, r.getChild(i));
        }
        exitRule(listener, r);
    }

    protected void visitToken(ParseTreeListener listener, ParseTree.TokenNode t) {
        listener.visitToken(t.getToken());
    }

    /** The discovery of a rule node, involves sending two events:
     *  the generic enterRule and a RuleContext-specific event.
     *  First we trigger the generic and then the rule specific.
     *  We to them in reverse order upon finishing the node.
     */
    protected void enterRule(ParseTreeListener listener, ParseTree.RuleNode r) {
        ParserRuleContext ctx = (ParserRuleContext)r.getRuleContext();
        listener.enterRule((ParserRuleContext)r.getRuleContext());
        ctx.enterRule(listener);
    }

    protected void exitRule(ParseTreeListener listener, ParseTree.RuleNode r) {
        ParserRuleContext ctx = (ParserRuleContext)r.getRuleContext();
        ctx.exitRule(listener);
        listener.exitRule(ctx);
    }
}

We want ANTLR to generate an interface like the following that has the rule specific events:

interface TListener extends ParseTreeListener {
void enterRule(s_ctx ctx);
void exitRule(s_ctx ctx);
void enterRule(ifstat_ctx ctx);
void exitRule(ifstat_ctx ctx);
}

ANTLR always generate this interface in case it's needed

So, to specify what to do upon discovering any rule, programmers would override the visitor's enterRule() method. To respond only to the discovery of a specific rule, they would implement a method from the grammar specific interface. For example, imagine the following simple parser grammar:

grammar T;
s : i=ifstat ;
ifstat : 'if' '(' INT ')' ID '=' ID ';' ;

That would result in RuleContext objects with visitor double-dispatch methods in them (normally I don't like these, but they're automatically generated in this case):

public static class s_ctx extends ParserRuleContext {
    public ifstat_ctx i;
    public s_ctx(RuleContext parent, int state) {
        super(parent, state);
    }   
    public void enterRule(TListener listener) { listener.enterRule(this); }
    public void exitRule(TListener listener) { listener.exitRule(this); }
}   

Naturally, we can easily generate a blank listener implementation of TListener so programmers only have to implement the events they care about:

TListener listener = new BlankTListener() {
    public void enterRule(s_ctx ctx) { ... }
    ...
};

Then, to use the visitor on a particular parse tree root, we can do the following.

startRule_ctx root = ... ;
ParseTreeVisitor visitor = new ParseTreeVisitor();
visitor.visit(listener, root);

Listener event return values

Listeners can track whatever state information they want as fields of that listener object. But, what if they want to compute information and have that percolate back up to the root of the tree? Instead of having a return value from the various listener methods and have the visitor pass them back up, the listener methods can simply alter the parse tree nodes. Normally, this is a bad idea because side effects free code is much easier to maintain and read and so on. But, in this case, the nodes are actually rule context objects that store all of the variables and labels defined in rule. If your listener methods need to compute values used by listener methods triggered higher up in the tree, they can save the values in a field stored in the rule context object. Locals, parameters, return values, and labels and get stored in that context. The parse tree is like a freeze-dried version of the entire stack frame used during the parse. Storing return values in the nodes is simply freeze-drying all of the return values. More often than not, however, the result of the visitor is kept in the state of the listener object, which can easily be returned to the invoking code. It also has the benefit of not modifying the trees at all. That would be my recommendation. We can always simulate return values with a map: Map<RuleContext, Object>.

Looking back at my previous post on the visitors and trees, I noticed that I can't put listener events anywhere but the right-hand side of the outermost alternatives. This would be illegal:

expr : left=mexpr (op='+' right=mexpr -> visitAdd)*  ;

It looks nice, but does not make any sense. The visitor only walks the tree after the entire parse. Triggering event inside a parsing loop has no meaning. To truly deal well with expression trees, I recommend using abstract syntax tree is not parse trees. I suppose the tail recursion could be used here to trigger an event upon each add operation.

Dealing with multiple alternatives

Rules with multiple alternatives should have a different event for each alternative, but using automatically generated names like alt1(), alt2(), etc. is a bad idea because a change in the grammar will silently break the listener. That means we need to do something like I was talking about previously and specify the name of the event in the listener. Something like this (NEW SYNTAX):

atom : INT                         # anInt
     | ID                          # anID
     | array=ID '[' index=expr ']' # anArrayIndex
     ;

ANTLR can collect all of these names and trigger discover and finish methods for them. I don't think it's as useful to limit ourselves to just a single event that occurs after we've processed a rule's nodes. For example, what if you want to do something before the visitor reaches the index of the array reference? We need a listener method for the start and stop of the alternative. Notice that I changed the name from the previous incarnation from visitInt to anInt. This way, ANTLR can generate methods like enter_anInt(), etc. We say what the parser has matched, rather than the specific name of a listener or visitor. Having a discover and finish for alternatives with a single token reference doesn't make much sense but we could ignore one or the other method.

These names can also be used on single alternative rules as well to isolate name changes in the grammar from causing compilation errors in the listener. It would prevent runtime errors for ANTLR targets based upon dynamically typed languages.

This '#' allows AST rewrites with -> and visitors for parse trees.