On the reuse of grammars

Spent a few hours talking to Kay Roepke as he is in town for 10 days. He and I started looking at the tree diff mechanism he found on the net and we discussed how it would be included into antlr. Also we discussed grammar composition. Seems to me there are four problems in the area of reusing grammars:

  1. Island grammars (probably best handled by a scannerless parser); ignored for the purposes of this discussion
  2. Combining and sharing grammars (multiple variations on C, SQL, etc...); good also on a practical level to reduce size of any one particular generated file.
  3. Deriving a new grammar from an existing standard grammar such as Java.g. Changes to prototype grammar should be pulled into derived grammar.
  4. n-phase translators with a single prototypical tree grammar. Changes to the prototype should be "pushed" to all derivative grammars.

The following sections are some terse notes to remind myself later when I implement this stuff.

Grammar composition

In v2, we used an inheritance mechanism that was really a glorified dumb include. After discussing with the number of people including Ari Steinberg (who has a lot of experience with large SQL grammars), I have formulated the following mechanism with Kay. The mechanism is based on the idea of delegation rather than inheritance, however, rephrase it as an import that pulls in all rules making them available to the grammar that imports them.

Parsers

Imagine a simple grammar with three rules:

parser grammar JavaDecl;
type : 'int' ;
decl : type ID ';'
     | type ID init ';'
     ;
init : '=' INT ;

Now imagine that you want to build another grammar by reusing some of the rules and that grammar. You can use an import statement that at least in the abstract includes all the rules from the other grammar that are not overridden:

parser grammar Java;
import JavaDecl;
prog : decl ;
type : 'int' | 'float' ;

ANTLR will aggressively optimize out the rules that are not needed. It must still include rules whose lookahead DFA change as a result of an overridden rule. In this case, the change in rule type alters the lookahead prediction for rule decl. Consequently, decl must be included in the generated code for grammar Java. Here is the output ANTLR would generate for Java:

class JavaParser extends Parser {
  JavaDecl delegate = new JavaDecl(...); // probably set in ctor actually
  public void type() { ... }
  public void prog() { decl(); } // uses overridden version.
  public void decl() {
    int alt = predict-alt-of-decl; // DFA changed; must copy whole rule here
    switch (alt) {
    case 1 :
      type(); match(ID); match(';');
    case 2 :
      type(); match(ID); init(); match(';');
    }
  }
  void init() { delegate.init(); }
}

The code generation would look the same and delegation would be handled by having a rule with the proper name that delegated to the other grammar; e.g., init(). This is nice because it shows you the list of all rules delegated to other grammars.

Notice that you cannot use combined grammars for this; lexers have to be handled with a separate delegation.

Lexers

Lexer composition works exactly the same way. The import in the abstract copies all of the rules into the new lexer:

lexer grammar L;
import S,T;

Precedence of rules for the same name is given to the grammar listed first in the import statement; e.g., ID, INT, WS, etc... All rules would be copied in because there's no way to optimize out which ones you want.

We will add the ability to specify the implicitly-defined Tokens rule so that you can specify only the set of tokens you want from the merged lexers.

Propogating grammar changes

Basically, we will need tree-based grammar versions of diff and diff3: gpatch, gdiff, gdiff3. Also, we will need a tool called gderive that effectively does a copy of an original grammar into a special location, ~/.grammar-prototypes, so that later gsync calls will know what the original looked like.

When propagating the change, there is a possibility of collision. The user might have altered an action, or inserted an action into a location that the original grammar changes. Collisions in actions or any change within an alt that has actions
in proto or derived causes a marker in the merged file. ANTLR could even look for that cookie and warn that the grammar may not work.

Branching off a new version of an existing grammar

Let's say you want to do a project using the existing Java.g grammar. First, make sure that you make a copy of the original grammar with:

gderive Java.g MyJava.g

Then make whatever changes you want to the grammar or actions. When you want to get fixes from the original Java.g, all you need to do is the following:

gsync Java.g MyJava.g

N-phase translators with a single prototypical tree grammar

Imagine a collocated translator that has multiple passes over the tree to collect information and build ancillary data structures. Changes to the tree construction in the parser grammar, affect the underlying grammar of all phases. This is a change to manually propagate the changes to each of the tree grammars and can introduce lots of errors. Instead, we should build a single prototypical tree grammar and keep it around (not up in the gderive directory though). Later, all of the phases can be changed simply by altering the proto.g tree grammar and invoking:

gsync proto.g p1.g ... p2.g

That effectively pushes all of the changes to the different phases. Collisions result in a marker in a grammar file.