Blog from July, 2007

Heterogeneous AST node types

I just finished adding heterogeneous AST node types to the tree construction. I did not get a response on the mailing list about what people needed, so I build what I thought I would need. The mechanism is extremely simple. If you want a specific node type for a token type, simply switch on the token type inside the TreeAdapter. When you need to use a specific node type for the same token type depending on grammatical context, then ANTLR needs to let you specify that. Here's a simple example that illustrates how to set the node type for token references and string literals:

a : ID<Var> ID<Method> 'main'<Method> ;

This will also work with the rewrite operator. Any token or string literal on the right-hand side can specify the node type, including arguments if necessary:

// imaginary node
m : type ID '(' args ')' body -> ^(METHOD<MethodNode> type ID args body) ;
// real node
d : type ID ';' -> ^(DECL type ID<VarNode>) ;

Can use literals too:

a : 'int' -> 'int'<TypeNode> ;

Args just get passed to constructor:

b : ID -> ID<VarNode>[3,4] BLORT<IckNode>["hi"];

Can even do in loops:

g : ID+ -> ID<VarNode>+ ;

Cool, eh? This is all still in my dev branch. My intention is that this will be used in rewrites for tree grammars too, but not on left side of -> in tree grammars.

In terms of implementation, ID<Var> becomes new Var(ID) and ID<Var>[foo] becomes new Var(ID, foo) instead of the usual stream_ID.next() etc...

Working on design for tree grammar rewriting. I want to be able to tweak a tree w/o having to rebuild whole tree. Imagine

a : ^(A B c D) ;
c : C -> E F ;

which means replace the C with E F using a rewrite rule. Suppose we only want to alter the 2nd child of A and not the whole tree. I want to avoid a bunch of extra generated code in favor of making the tree node stream track current child number and start/stop index for rule invocation. Those could be made available to actions if you want, though I don't want to make TreeNodeStream interface too large. Anyway, I suggest added enterRule(), exitRule() methds to TreeNodeStream so it can keep track of the variables I need to do a rewrite

void a() {
  match(A);
  match(DOWN);
  r=c();
  // replace tree associated with last rule invocation
  input.rewrite(r.tree);  // NEW
  match(D);
  match(UP);
}

c_return c() {
  input.enterRule();  // NEW
  match(C);
  return ^(nil E F);
  input.exit();  // NEW
}

The rewrite(r.tree) amounts to:

input.rewrite(startIndex, stopIndex, startChild, stopChild, r.tree);

but those variables are tracked within the stream. If done immediately after rule invocation, stream only needs to track one set of variables. Further, this doesn't really require much in code gen.

The rewrite method should not alter the stream if the tree is the same. How do I indicate difference between no return tree and blank (delete) rewrite? Oh, the rewrite only happens in the actual rewrite. Duh:

void a() {
  match(A);
  match(DOWN);
  r=c();
  match(D);
  match(UP);
}

c_return c() {
  input.enterRule();  // NEW
  match(C);
  // replace tree associated with last rule invocation
  input.rewrite(^(nil E F));  // NEW
  return ^(nil E F);
  input.exit();  // NEW
}

That way, we only have to track start index and start child. (smile)

The key here is that the stream is rewritten as is the tree. Next use of the stream will see the modified tree. I would just keep the serialized streams for rewrite sections and use those instead of the old stream, just like TokenRewriteStream does. Very efficient.

On the reuse of grammars

Spent a few hours talking to Kay Roepke as he is in town for 10 days. He and I started looking at the tree diff mechanism he found on the net and we discussed how it would be included into antlr. Also we discussed grammar composition. Seems to me there are four problems in the area of reusing grammars:

  1. Island grammars (probably best handled by a scannerless parser); ignored for the purposes of this discussion
  2. Combining and sharing grammars (multiple variations on C, SQL, etc...); good also on a practical level to reduce size of any one particular generated file.
  3. Deriving a new grammar from an existing standard grammar such as Java.g. Changes to prototype grammar should be pulled into derived grammar.
  4. n-phase translators with a single prototypical tree grammar. Changes to the prototype should be "pushed" to all derivative grammars.

The following sections are some terse notes to remind myself later when I implement this stuff.

Grammar composition

In v2, we used an inheritance mechanism that was really a glorified dumb include. After discussing with the number of people including Ari Steinberg (who has a lot of experience with large SQL grammars), I have formulated the following mechanism with Kay. The mechanism is based on the idea of delegation rather than inheritance, however, rephrase it as an import that pulls in all rules making them available to the grammar that imports them.

Parsers

Imagine a simple grammar with three rules:

parser grammar JavaDecl;
type : 'int' ;
decl : type ID ';'
     | type ID init ';'
     ;
init : '=' INT ;

Now imagine that you want to build another grammar by reusing some of the rules and that grammar. You can use an import statement that at least in the abstract includes all the rules from the other grammar that are not overridden:

parser grammar Java;
import JavaDecl;
prog : decl ;
type : 'int' | 'float' ;

ANTLR will aggressively optimize out the rules that are not needed. It must still include rules whose lookahead DFA change as a result of an overridden rule. In this case, the change in rule type alters the lookahead prediction for rule decl. Consequently, decl must be included in the generated code for grammar Java. Here is the output ANTLR would generate for Java:

class JavaParser extends Parser {
  JavaDecl delegate = new JavaDecl(...); // probably set in ctor actually
  public void type() { ... }
  public void prog() { decl(); } // uses overridden version.
  public void decl() {
    int alt = predict-alt-of-decl; // DFA changed; must copy whole rule here
    switch (alt) {
    case 1 :
      type(); match(ID); match(';');
    case 2 :
      type(); match(ID); init(); match(';');
    }
  }
  void init() { delegate.init(); }
}

The code generation would look the same and delegation would be handled by having a rule with the proper name that delegated to the other grammar; e.g., init(). This is nice because it shows you the list of all rules delegated to other grammars.

Notice that you cannot use combined grammars for this; lexers have to be handled with a separate delegation.

Lexers

Lexer composition works exactly the same way. The import in the abstract copies all of the rules into the new lexer:

lexer grammar L;
import S,T;

Precedence of rules for the same name is given to the grammar listed first in the import statement; e.g., ID, INT, WS, etc... All rules would be copied in because there's no way to optimize out which ones you want.

We will add the ability to specify the implicitly-defined Tokens rule so that you can specify only the set of tokens you want from the merged lexers.

Propogating grammar changes

Basically, we will need tree-based grammar versions of diff and diff3: gpatch, gdiff, gdiff3. Also, we will need a tool called gderive that effectively does a copy of an original grammar into a special location, ~/.grammar-prototypes, so that later gsync calls will know what the original looked like.

When propagating the change, there is a possibility of collision. The user might have altered an action, or inserted an action into a location that the original grammar changes. Collisions in actions or any change within an alt that has actions
in proto or derived causes a marker in the merged file. ANTLR could even look for that cookie and warn that the grammar may not work.

Branching off a new version of an existing grammar

Let's say you want to do a project using the existing Java.g grammar. First, make sure that you make a copy of the original grammar with:

gderive Java.g MyJava.g

Then make whatever changes you want to the grammar or actions. When you want to get fixes from the original Java.g, all you need to do is the following:

gsync Java.g MyJava.g

N-phase translators with a single prototypical tree grammar

Imagine a collocated translator that has multiple passes over the tree to collect information and build ancillary data structures. Changes to the tree construction in the parser grammar, affect the underlying grammar of all phases. This is a change to manually propagate the changes to each of the tree grammars and can introduce lots of errors. Instead, we should build a single prototypical tree grammar and keep it around (not up in the gderive directory though). Later, all of the phases can be changed simply by altering the proto.g tree grammar and invoking:

gsync proto.g p1.g ... p2.g

That effectively pushes all of the changes to the different phases. Collisions result in a marker in a grammar file.

I've been pretty sick this last week so I decided to work on something that was fun. After talking with Don Bradshaw at http://www.quipoz.com in Sydney and after trying to convert ANTLR v3's grammar (written in v2) into v3, I decided to build TreeWizard. ANTLR has a nice facility for creating trees from input symbols like:

decl : 'int' ID ';' ; -> ^(DECL 'int' ID)

But, sometimes outside of the grammar file (or in severe cases in a grammar action) you need to build a tree in raw code. Further, there are many cases, as Andy Tripp will tell you, that building a tree grammar is a overkill just to examine a few subtrees. You might want to simply print out all of the declarations defined in the tree. You don't care about the tree structure except for the declaration itself. A grammar for the full tree is a lot of unnecessary work.

TreeWizard is my first attempt at a utility class that lets you create and navigate trees without relying on an ANTLR grammar. You will find it attached with about 30 unit tests. This is only my development branch so we are free to update, change, and otherwise mangle the definition. I just want to make sure that we have something out there that people can try out to make sure I'm going in the right direction. See the unit tests for more examples than I give here. Also note that I think I changed CommonTree's getText to not call toString(), but call getText():

public String getText() {
	return token.getText();
}

Creating trees

Creating a tree is a simple matter of supplying a pattern for the tree using the token names from an ANTLR grammar:

TreeAdaptor adaptor = new CommonTreeAdaptor();
String[] tokens = myparser.getTokenNames();
TreeWizard wiz = new TreeWizard(adaptor, tokens);
CommonTree t = (CommonTree)wiz.create("(A B C)");

where the tokens from your parser look something like:

public static final String[] tokenNames = new String[] {
    "<invalid>", "<EOR>", "<DOWN>", "<UP>", "A", "B", "C"
};

Once you create a TreeWizard, you can keep calling create as it will know what kind of nodes to make, how to hook them up, and what the token types are for the token names. The wizard really only has write-once data in it...just a convenience so you don't have to keep passing stuff to each of the methods.

Oh, if you want to set the text of a node, use the argument syntax with no quotes:

CommonTree t = (CommonTree)wiz.create("(A B[foo] C[bar])");

Tree Equality

If you want to compare two trees, you'd have to write your own method to do so right now. The equals() method in Java is not necessarily the right way to implement either because that is probably just matching to nodes not trees. The TreeWizard provides a method to check the equality two trees (one static and one instance method):

/** Compare t1 and t2; return true if token types/text, structure match exactly.
 *  The trees are examined in their entirety so that (A B) does not match
 *  (A B C) nor (A (B C)). 
public static boolean equals(Object t1, Object t2, TreeAdaptor adaptor);

/** Compare type, structure, and text of two trees, assuming adaptor
 *  in this instance of a TreeWizard.
 */
public boolean equals(Object t1, Object t2);

If you would only like to match the structure of a tree, but not the text associated with each node (so that, say, any ID node will match), use the parse method:

boolean matched = wiz.parse(t, "(A B C)");

Wildcards work too:

CommonTree t = (CommonTree)wiz.create("(A B C)");
boolean matched = wiz.parse(t, "(A . .)");

Sometimes you want some of the nodes to match the text exactly, but most of the time match just the structure. You can do that too:

CommonTree t = (CommonTree)wiz.create("(A B[foo] C)");
boolean matches = wiz.parse(t, "(A B[foo] C)");

Getting pointers to subtree nodes

Matching trees is great, but most of the time you want to get a pointer to some of the children not just determine if the tree structure matches. To accomplish this, label the elements in the tree pattern and pass in a hash table that the parse() method can fill in:

CommonTree t = (CommonTree)wiz.create("(A B C)");
Map labels = new HashMap();
boolean valid = wiz.parse(t, "(%a:A %b:B %c:C)", labels);

Then labels.get("a") points to the root etc... Cool, eh?

Indexing trees

Sometimes you want a list of all nodes with a certain type like VARDECL or METHOD:

List nodes = wiz.find(t, wiz.getTokenType("VARDECL"));

Sometimes you want all lists for each token type:

// from the unit tests
CommonTree t = (CommonTree)wiz.create("(A B (A C B) B D D)");
Map m = wiz.index(t);
String found = m.toString();
String expecting = "{8=[D, D], 6=[B, B, B], 7=[C], 5=[A, A]}";

Visitors

In order to visit all nodes a particular type for all subtree that match a particular pattern, use the visit() method. For each node or subtree that matches, TreeWizard invokes the method and the following interface:

public interface ContextVisitor {
	public void visit(Object t, Object parent, int childIndex, Map labels);
}

Notice that it passes you a lot of interesting contact information including "who's your daddy?" and "what child am I?" The labels printer is not set unless you're matching against a tree pattern (i.e., won't work for matching against a token type).

Here is some code that visits every B node and puts them into a list, which is of course how I implemented the find() that returns a list.

CommonTree t = (CommonTree)wiz.create("(A B (A C B) B D D)");
final List elements = new ArrayList();
wiz.visit(t, wiz.getTokenType("B"),
	   new TreeWizard.Visitor() {
			public void visit(Object t) {
				elements.add(t);
			}
	   });

where Visitor is just a convenience class that makes it easier for you to implement if all you want to do is visit without context information:

public static abstract class Visitor implements ContextVisitor {
	public void visit(Object t, Object parent, int childIndex, Map labels) {
		visit(t);
	}
	public abstract void visit(Object t);
}

Now, let's say that you want to match a bunch of subtrees for declarations and pick out the identifier in each. No problem:

wiz.visit(t, "(VARDECL %type:. %id:ID)",
	   new TreeWizard.ContextVisitor() {
		   public void visit(Object t, Object parent, int childIndex, Map labels) {
		       // labels.get("type") is first child
		       // labels.get("id") is 2nd child
		   }
	   });