Migrating from ANTLR 2 to ANTLR 3
Terence built a tool that helps you convert grammars from v2 to v3. It is not perfect and cannot handle actions or trees for you, but it does some of the grunt work. One known problem: it converts ^ to ^^, which is wrong; it should leave ^ alone.
ANTLR Tool Changes
How the command-line interface to ANTLR differs between version 2 and version 3.
Tool Invocation
The package in which the 'Tool' class is located is now different:
ANTLR 2 | ANTLR 3 |
---|---|
java antlr.Tool ... | java org.antlr.Tool ... |
v3 also allows multiple grammar files on the same command line. See also ANTLR 3 Command line options.
Changes in ANTLR Syntax
Some tips for migrating a grammar developed with ANTLR 2 over to ANTLR 3 syntax:
Parser and Lexer in One Definition
You don't need separate sections defining the parser and lexer; ANTLR 3 puts each rule in the appropriate place based on the case of the rule name's initial letter ('foo' is a parser production, 'Foo' is a token definition):
ANTLR 2 | ANTLR 3 |
---|---|
class FooParser extends Parser; | grammar Foo; |
(ANTLR 3 actually creates the lexer in Foo__.g for you behind the scenes)
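As a rough sketch of what a combined grammar can look like (the grammar name Foo follows the example above; the rule and token names are made up for illustration):

```antlr
grammar Foo;

// Lowercase initial letter: parser rule, generated into FooParser
sum : INT ('+' INT)* ;

// Uppercase initial letter: token rules, generated into FooLexer
INT : '0'..'9'+ ;
WS  : (' '|'\t'|'\r'|'\n')+ { $channel=HIDDEN; } ;
```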
'protected' lexer rules are now called 'fragment'
'protected' lexer rules are rules that do not produce a separate token and are only called from other lexer rules. They are now called 'fragment' rules, since they represent a 'fragment' of a token.
ANTLR 2 | ANTLR 3 |
---|---|
protected | fragment |
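A minimal illustration, using hypothetical DIGIT and INT rules: DIGIT never reaches the parser as a token of its own, it only serves as a building block for INT.

```antlr
// ANTLR 2 would have declared this as:  protected DIGIT : '0'..'9' ;
fragment DIGIT : '0'..'9' ;

// Only INT is emitted as a token; DIGIT is just a piece of it
INT : DIGIT+ ;
```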
Renamed 'header' to '@header'
ANTLR 3 generally prefixes the label on named code sections with an '@'.
ANTLR 2 | ANTLR 3 |
---|---|
header { | @header { |
@header is only used by the parser, so in a combined parser/lexer definition, you are likely to need to duplicate some of the above in a @lexer::header section.
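For example, a combined grammar that places both generated classes in the same package might look roughly like this (the package name and the import are placeholders, not taken from any real project):

```antlr
grammar Foo;

// Goes at the top of the generated FooParser
@header {
package com.example.foo;   // hypothetical package
import java.util.Map;      // hypothetical import used by parser actions
}

// Goes at the top of the generated FooLexer
@lexer::header {
package com.example.foo;   // repeated so the lexer lands in the same package
}
```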
Token Skipping / Hiding
The way to skip a token has changed from a SKIP token type to a more generic system allowing multiple 'channels' of tokens within the token stream. The parser normally only sees tokens on the 'default' channel, so changing a token's channel to anything else will hide it from the parser. When not playing tricks with multiple token channels, tokens should be hidden by putting them on a channel above Token.DEFAULT_CHANNEL (0), which ANTLR supports by providing a constant 'HIDDEN' (from Token.HIDDEN_CHANNEL).
ANTLR 2 | ANTLR 3 |
---|---|
$setType(Token.SKIP); | $channel=HIDDEN; |
If you are accessing the token stream directly, or the 'channel' mechanism is otherwise insufficient, it's also possible in ANTLR 3 to drop tokens entirely from the token stream by using skip() in a lexer action:
WS : (' '|'\t')+ {skip();} ;
See org.antlr.runtime.CommonTokenStream for more information. Also see its discardTokenType() and discardOffChannelTokens() methods.
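The two approaches can be mixed in one lexer. In this sketch (the rule names are illustrative) whitespace stays in the token stream on the hidden channel, while comments are dropped entirely:

```antlr
// Kept in the token stream, but invisible to the parser
WS      : (' '|'\t'|'\r'|'\n')+ { $channel=HIDDEN; } ;

// Never reaches the token stream at all
COMMENT : '/*' ( options {greedy=false;} : . )* '*/' { skip(); } ;
```

Hiding rather than skipping is the usual choice when a later stage, such as a pretty-printer, still needs to get at the original whitespace.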
Code section for members must now be labelled
In ANTLR 2, code surrounded by curly braces preceding the parser productions would be added to the body of the parser class, allowing the grammar to define member fields and functions in the parser. In ANTLR 3, this section must be labelled '@members':
ANTLR 2 | ANTLR 3 |
---|---|
class FooParser extends Parser; | grammar Foo; |
To inject members into the lexer of a combined lexer-parser, use @lexer::members {}.
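A sketch of both sections in a combined grammar; the field names are invented for illustration:

```antlr
grammar Foo;

// ANTLR 2 would have used a bare { ... } block after the class header
@members {
    // becomes a field of the generated FooParser
    int errorCount = 0;
}

@lexer::members {
    // becomes a field of the generated FooLexer
    int nesting = 0;
}
```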
Code sections for rules must now be labelled
In ANTLR 2 you could write initialization code for a rule directly after the rule name. This section now has to be labelled '@init':
ANTLR 2 | ANTLR 3 |
---|---|
foo | foo |
There is also an @after action, which is executed after all rule elements have been matched (and after all the rule cleanup code that sets return values etc...), but before any finally clause.
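A hedged sketch of a rule that uses both sections. The rule body and the local variable are hypothetical, and ID is assumed to be defined elsewhere in the grammar:

```antlr
// ANTLR 2 equivalent of the init part:  foo { int n = 0; } : ( ID { n++; } )+ ;
foo
@init  { int n = 0; }                                    // runs before anything is matched
@after { System.out.println("matched " + n + " ids"); }  // runs once the rule has matched
    : ( ID { n++; } )+
    ;
```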
Literals
Literals are always in single quotes, not double quotes:
ANTLR 2 | ANTLR 3 |
---|---|
x : y \| z ; | x : y \| z ; |
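To make the quoting difference concrete, here is a hypothetical rule written both ways (stat, expr and SEMI are illustrative names only):

```antlr
// ANTLR 2:  stat : "return" expr SEMI ;

// ANTLR 3: every literal, whatever its length, goes in single quotes
stat : 'return' expr ';' ;
```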
Labels
Labels on elements within a production are denoted with an equals-sign, not a colon:
ANTLR 2 | ANTLR 3 |
---|---|
lp:LPAREN a:arguments RPAREN | lp=LPAREN a=arguments RPAREN {String t = $a.text+$lp.text;} |
From the label, you may access all of the return values from a rule reference as well as the normal properties of the rule reference such as text. You no longer differentiate between getting the return value with '=' and labeling the element with ':'. It is always '=' now.
Multiple Elements Sharing a Label Name
In ANTLR 2, it was necessary to give elements in a production unique label names. ANTLR 3 allows several elements to share the same label.
ANTLR 2 | ANTLR 3 |
---|---|
statement | statement |
The value of $e is of type Token and refers to the token object matched in either the first or the second alternative.
NB: this only works reliably for labels referencing tokens. The return type of each rule is different, and since ANTLR declares the generated variable's type based on the first rule reference seen, attempting to hold references to the return values of other rules will result in generated code that doesn't compile.
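A sketch of a shared token label, assuming hypothetical ID and INT tokens. Both alternatives assign to the same label, so a single action after the subrule can use it:

```antlr
atom
    : ( e=ID | e=INT )
      { System.out.println("matched: " + $e.text); }   // $e is a Token in either case
    ;
```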
Parentheses no Longer Mandatory With Cardinality Operators
When a single element has '?', '+' or '*' in a production, you don't have to put () around it, as was required in ANTLR 2:
ANTLR 2 | ANTLR 3 |
---|---|
compilationUnit | compilationUnit |
This is true in the lexer also: '\r'? and '0'..'9'+. Note that .* works as well and is an idiom understood by ANTLR to mean match until you see what follows. Here's a rule that matches the single-line comment:
LINE_COMMENT : '//' .* '\n' ;
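For a parser-side comparison, a rule along these lines shows the difference (the sub-rule names are placeholders, not taken from a real grammar):

```antlr
// ANTLR 2:  compilationUnit : (packageDecl)? (importDecl)* (typeDecl)+ EOF ;

// ANTLR 3: the operators apply directly to a single element
compilationUnit : packageDecl? importDecl* typeDecl+ EOF ;
```

Parentheses are still needed when the operator should apply to more than one element, as in (COMMA expr)*.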
Tree Building
The option that turns on AST building code has changed:
ANTLR 2 | ANTLR 3 |
---|---|
options | options |
You can also build output templates by using output=template;.
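Spelled out, the change looks roughly like this (the grammar name Foo is a placeholder):

```antlr
// ANTLR 2:
//   options { buildAST = true; }

// ANTLR 3: the options section follows the grammar declaration
grammar Foo;
options { output = AST; }
```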
AST References
In ANTLR 2 you would use a name with a '#' prefix to refer to a labelled AST node; ANTLR 3 uses a '$' for all attribute access.
ANTLR 2 | ANTLR 3 |
---|---|
typeBlock | typeBlock |
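A small sketch of '$' label references inside a v3 rewrite rule. It assumes output=AST is set and that PLUS is declared as an imaginary token; the grammar and rule are hypothetical:

```antlr
grammar Foo;
options { output = AST; }
tokens  { PLUS; }           // imaginary token used only as a tree root

addition
    : a=INT '+' b=INT -> ^(PLUS $a $b)   // $a and $b refer to the labelled tokens
    ;

INT : '0'..'9'+ ;
```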
Tree Rewrite Rules Replace Rewrite Actions
While you can still use the ^ and ! tree construction operators to build trees, v3 introduces an entirely new syntax for tree construction that avoids the special syntax that was used in ANTLR 2 actions:
ANTLR 2 | ANTLR 3 |
---|---|
arrayLiteral : LBRACK! (elementList)? RBRACK! {## = #([ARRAY_LITERAL, "ARRAY_LITERAL"],##);} ; | arrayLiteral : LBRACK (elementList)? RBRACK -> ^(ARRAY_LITERAL elementList?) ; |
Changing the Type of AST Nodes
Within a rewrite rule, there is a new syntax to replace ANTLR 2's setType() method call in an action:
ANTLR 2 | ANTLR 3 |
---|---|
in:INC^ {#in.setType(POST_INC);} | in=INC -> ^(POST_INC[$in]) |
The POST_INC[$in] constructs a new POST_INC node, and copies the text, line/col, etc. from the node labeled 'in'.
Changing the Type of Tokens in the Lexer
ANTLR 2 | ANTLR 3 |
---|---|
{ $setType(TOKEN); } | { $type = TOKEN; } |
Tree parser uses ^ instead of #
Within a tree parsing rule, subtrees are indicated by ^ instead of #.
ANTLR 2 | ANTLR 3 |
---|---|
expr : #(PLUS expr expr) \| INT; | expr : ^(PLUS expr expr) \| INT; |
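In context, a complete v3 tree grammar built around that rule might look something like the following. The grammar names and the evaluation actions are assumptions added for illustration; PLUS and INT are expected to come from the parser grammar named in tokenVocab:

```antlr
tree grammar ExprWalker;

options {
    tokenVocab   = Expr;        // hypothetical parser grammar that built the tree
    ASTLabelType = CommonTree;  // node type used for labels and $TOKEN references
}

expr returns [int value]
    : ^(PLUS a=expr b=expr) { $value = $a.value + $b.value; }
    | INT                   { $value = Integer.parseInt($INT.text); }
    ;
```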
Error handling
To disable generation of the standard exception handling code in the parser:
ANTLR 2 | ANTLR 3 |
---|---|
options | @rulecatch { } |
Further, in ANTLR 3, to cause an exception to be raised on mismatched tokens in the middle of an alternative, the parser must override the mismatch() method of BaseRecognizer. The default implementation looks like this:
protected void mismatch(IntStream input, int ttype, BitSet follow)
    throws RecognitionException
{
    MismatchedTokenException mte = new MismatchedTokenException(ttype, input);
    recoverFromMismatchedToken(input, mte, ttype, follow);
}
To immediately fail on error, override this with code that constructs an exception as above, but then throws it, rather than calling the recoverFromMismatchedToken() method:
protected void mismatch(IntStream input, int ttype, BitSet follow)
    throws RecognitionException
{
    throw new MismatchedTokenException(ttype, input);
}
For now, ANTLR's Java code generation directly calls the following method, which you must override so that it does not recover:
public void recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)
    throws RecognitionException
{
    throw e;
}
Also, make sure that ANTLR does not generate its normal rule try/catch:
@rulecatch { catch (RecognitionException e) { throw e; } }
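Putting these pieces together: the overrides would normally live in a @members section, since that is how code gets into the generated parser class. This is only a sketch, assuming a combined grammar named Foo and the runtime method signatures quoted in this section:

```antlr
grammar Foo;

@members {
    // Fail immediately instead of attempting single-token recovery
    protected void mismatch(IntStream input, int ttype, BitSet follow)
        throws RecognitionException
    {
        throw new MismatchedTokenException(ttype, input);
    }

    // Same idea for mismatched sets
    public void recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)
        throws RecognitionException
    {
        throw e;
    }
}

// Replace the default per-rule catch block so errors propagate to the caller
@rulecatch {
    catch (RecognitionException e) { throw e; }
}
```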
To alter the way your messages appear, override:
public String getErrorMessage(RecognitionException e, String[] tokenNames) {...}
See Book excerpt: Error Reporting and Recovery
catch blocks
Catch blocks are no longer prefixed with the "exception" keyword. Just list the catch blocks:
r : ... ; catch [FailedPredicateException fpe] {...} catch [RecognitionException re] {...}
Case-Insensitivity
ANTLR 2 | ANTLR 3 |
---|---|
options { | No equivalent option, but see How do I get case insensitivity? |
Changes in ANTLR Runtime Support Code
Java
General API Reorganisation
Runtime classes are now all under org.antlr.runtime.
ANTLR 2 Type | ANTLR 3 Type |
---|---|
interface antlr.collections.AST | interface org.antlr.runtime.tree.Tree |
class antlr.Token | interface org.antlr.runtime.Token |
Lookahead in Actions and Semantic Predicates
If your actions or semantic predicates used the LT() or LA() methods of ANTLR 2, these will need to be prefixed with 'input.' in ANTLR 3, as the methods are no longer defined by the parser class.
ANTLR 2 | ANTLR 3 |
---|---|
{LA(1)==LCURLY}? (block) | {input.LA(1)==LCURLY}? (block) |
{LT(1).getText().equals("namespace")}? IDENT | {input.LT(1).getText().equals("namespace")}? IDENT |
// in lexer, | // in lexer, |
Newline Tracking in Lexical Actions
ANTLR 3 tracks newlines by itself, so if your ANTLR 2 lexical actions included calls to 'newline()', these must be removed (the method has gone).
No More XXXTokenTypes Interface
ANTLR 3 doesn't generate the XXXTokenTypes interface for grammar 'XXX' any more. The constants are now generated directly in both the parser and lexer implementation classes.
ANTLR 2 | ANTLR 3 |
---|---|
MyGrammarTokenTypes.LBRACK | MyGrammarParser.LBRACK (or MyGrammarLexer.LBRACK) |
AWOL
Stuff that existed in ANTLR 2 which has no equivalent in ANTLR 3 yet (or which Ter just hasn't explained enough times on the mailing list for it to sink in):
Paraphrase
To give slightly more comprehensible error messages, ANTLR 2 let you set a paraphrase option on a token:
RIGHT_PAREN options { paraphrase = "a closing parenthesis '}'"; } : '}' ;
You can still do something like this manually with very little work. This is described in the reference book, Chapter 10, "Error Reporting and Recovery", under "Altering Recognizer Error Messages". The new v3 mechanism lends itself to a more flexible "top down" paraphrase mechanism: rules as well as tokens may be paraphrased. As a simple start, the paraphrase option may be replaced trivially...
LPAREN options { paraphrase = "("; } : '(';
...by...
...
@members {
    ...
    Stack<String> paraphrase = new Stack<String>();
    ...
}
...
LPAREN
@init  { paraphrase.push("("); }
@after { paraphrase.pop(); }
    : '(';
Per-Token AST Type Specs
ANTLR 2 allowed the grammar to specify an AST implementation class per token type.
tokens { COMPILATION_UNIT; TYPE_BLOCK<AST=uk.co.badgersinfoil.metaas.impl.ParentheticAST>; "import"<AST=uk.co.badgersinfoil.metaas.impl.ExprStmtAST>; }
A workaround in ANTLR 3 might be to implement this 'by hand' in a custom TreeAdaptor implementation.
Terence will investigate heterogeneous tree construction after the v3.0 release.