Composite Grammars
Introduction
Some grammars are so large that the generated recognizers are too big to load into development environments or debug. For example, with ANTLR v2, one Oracle grammar generated 129,000 lines of code! ANTLR's strength that it produces human readable parser can be a serious weakness for very large grammars.
ANTLR v3.1 introduces a grammar composition mechanism to simultaneously allow the logical organization of large grammars, provide more opportunities for grammar reuse, and allow the programmer to control the size of the generated classes.
A root grammar imports dependent grammars that it needs and, like traditional inheritance, the root grammar imports only those rules that are not overridden in the root grammar. Duplicate rule definitions imported from multiple delegate grammars get resolved in favor of the rule imported first. Imported grammars may, in turn, import other delegate grammars.
ANTLR's composition mechanism behaves like multiple inheritance and is implemented using a delegation model. Given root grammar R and imported grammars A and B, ANTLR generates classes R, R_A, and R_B. Class R has two delegate pointers to instances of type R_A and R_B so that rules in R can access the imported rules. The delegate classes have delegator pointers back to the root so that, with a little bit of indirection, any rule can get to any other rule. There is no need to copy any rules from the delegate grammars into the root grammar.
Because of this "whiteboard" approach, each grammar should be imported only once. If B relies on rules in A, it should not import A if R already imports B. Otherwise it would cause some duplicate members in generated code.
Every import of a delegate grammar results in the generation of a delegate parser not just the root parser. At first this might seem a waste, because if two root grammars import A, ANTLR will generate two different classes from A. In general, though, ANTLR must do this because a rule overridden in the root grammar often changes the prediction lookahead sets of rules in the imported grammar. Additionally, all token types must be consistent across rules from all delegate grammars. There is little choice but to regenerate classes for delegate grammars. Reuse is at the source grammar level not at the compiled binary level. Consider the following grammar for simplified Java declarations.
parser grammar JavaDecl; type : 'int' ; decl : type ID ';' | type ID init ';' ; init : '=' INT ;
Another grammar, Java
, can import JavaDecl
to override rule type
, thus, altering the prediction decision in rule decl
:
parser grammar Java; import JavaDecl; prog : decl ; type : 'int' | 'float' ;
From the root grammar (only run ANTLR on the root grammar because it automatically processes the delegates), ANTLR generates code akin to the following (using the Java target).
class Java extends Parser { JavaDecl gJavaDecl; // delegate pointer void prog() { decl(); } void type() { match int or float } void decl() { gJavaDecl.decl(); } void init() { gJavaDecl.init(); } }
and generates the following code from the imported grammar.
class JavaDecl extends Parser { public Java gJava; // delegator pointer void decl() {predict alts using Java.type} void init() {match = INT} }
There is no method associated with rule type
in the JavaDecl
delegate because the root grammar overrides it. More to the point, the lookahead DFA for rule JavaDecl.decl
differs because type
has been overridden. decl
can now start with 'float' not just 'int'.
Rule references are polymorphic: references to rule r in delegate grammar A see R.r not A.r if r is overridden in the root R. In this case, rule decl
from grammar JavaDecl
invokes rule type
from grammar Java
via the delegator pointer: gJava.type()
.
See the composite-java example in the examples-v3 tarball.
Rules
Rule definition for r overrides any r definition from an imported grammar.
Grammar imports behave like multiple inheritance, including and possibly overriding rules from all delegate/imported grammars. Rules in delegate grammars that reference rules overridden in a delegator reference the overridden rule. This is identical to the rules for overridden methods in an object oriented programming language.
Import order defines rule precedence. If more than one imported grammar defines a rule, the rule from the grammar imported first takes precedence.
The import statement adds all rules from the specified grammar even if they are not referenced in the delegator grammar in order to satisfy the expected rule interface. If you import JavaDecl into a grammar, users of that grammar will expect to be able to call any of the rules from JavaDecl (any rule can be a start symbol).
Lexers
Lexers in delegator grammars invoke the Tokens
rule from the delegates as well as matching any tokens specified or overridden in the delegator grammar. Consider the following combines grammar that introduces a single token definition:
grammar Root; import Delegate; a : 'int' ID ;
Token ID gets imported with all the other lexer rules from the delegate grammar:
lexer grammar Delegate; ID : 'a'..'z'+ ; INT : '0'..'9'+ ; WS : (' '|'\n')+ {skip();} ;
Running ANTLR on the root grammar generates the following three files: RootLexer.java, RootParser.java, Root_Delegate.java. The generated lexer is comprised of RootLexer and Root_Delegate. Here is the main lexer:
public class RootLexer extends Lexer { // tokens from all imported grammars and the root grammar public static final int T__7=7; public static final int INT=5; public static final int EOF=-1; public static final int WS=6; public static final int Tokens=8; public static final int ID=4; public Root_Delegate gDelegate; public RootLexer(CharStream input, RecognizerSharedState state) { super(input,state); gDelegate = new Root_Delegate(input, state, this); } public final void mT__7() throws RecognitionException { // match 'int' } public void mTokens() throws RecognitionException { // implement Root.g:1:8: ( T__7 | Delegate.Tokens ) if ( is int ) mT__7(); else gDelegate.Tokens(); } }
There are a number of interesting features of this lexer. First, the root grammar inherits all delegates' token vocabularies as they must all synced up. ID must mean the same token type in all generated code. Secondly, notice that mTokens() delegates token recognition to Delegate.g for any token not defined within Root. Finally, notice that the constructor for the delegates is assumed to be the input stream, recognizer state object, and the delegator. If your delegate needs a different constructor executed,, you must create a new constructor in the root/delegator grammar that invokes the appropriate constructor.
The tokenVocab grammar option is ignored within delegate grammars.
A lexer is only generated if the root grammar is a lexer or combined grammar.
Legal imports
The syntax is
import A;
or
import A,B,...,Z;
grammar | can import which grammars? |
---|---|
lexer | lexer |
parser | parser |
tree parser | tree parser |
combined lexer/parser | lexer, parser |
Please note that a combined grammar may not import a combined grammar. This may seem to limit the depth of imports but it's not the case. Parser grammars don't need to explicitly import the lexer grammar they rely on, this is done only once in the root composite grammar which "glues" its dependent grammars.
lexer grammar L ; LETTER : 'a'..'z' ; SPACE : ' ' ;
parser grammar P1 ; letter : LETTER ; spaces : SPACE+ ;
parser grammar P2 ; import P1 ; letters : letter+ ;
grammar C ; import L, P2 ; stuff : ( letters spaces )+ ;
Options
Grammar options are discovered by searching upwards towards the root in the grammar delegation hierarchy, with root options used last. So a delegate can use an option but the default is to use root's. Normally don't want options in delegates but a k value might be necessary.
When looking for delegates, ANTLR searches in the -lib directory specified on the command line.