ANTLR v3 printable documentation

An inline version of ANTLR v3 documentation.

ANTLR3 Code Generation Targets

Code generation for the following target languages is currently in development, in testing, or complete. Visit the page for each target language for more information; hopefully the people responsible for each target will keep their rows in this table current.

See also Target API documentation and How to build an ANTLR code generation target.

Language (Irresponsible Person): Status

Ada (Luke A. Guest): Currently dormant.
ActionScript (George Scott; initial port, not actively maintaining): In sync up to 3.2, but currently not in active development.
C (Jim Idle): In sync with ANTLR3 development. Use the .tgz files under the dist subdirectory to build the runtime.
C++ (Gokulakannan Somasundaram; was Jim Idle & Ric Klaren): Created for antlr-3.4 and hence in sync with antlr-3.4 only.
C#; C# 2 (maintainer: Johannes Luber; contributed by Kunle Odutola and Micheal Jordan): In sync with ANTLR3 development up to 3.3, but a few errors make it beta for 3.3. There are separate targets for .NET 1.1 and .NET 2.
C# 3 (maintainer: Sam Harwell): Added post-release 3.1.3. In sync with ANTLR3 development, except no support for the -debug or -profile flags yet.
D: No maintainer or status reported.
Emacs ELisp (Ola Bini): In active development; see http://github.com/olabini/antlr-elisp
Objective C (Alan Condit, Kay Roepke): Current with version 3.3.
Java (Terence Parr, parrt at cs usfca edu): In sync with ANTLR3 development.
JavaScript (Joey Hurst): In sync with ANTLR3 development.
Python (Benjamin Niemann): Current with 3.1.3.
Ruby (Kyle Yetter, previously Martin Traverso): Current with 3.3.
Perl6 (Bernhard Schmalhofer, Bernhard.Schmalhofer@gmx.de): Inactive. No code produced yet. Takers wanted.
Perl (Ron Blaschke, ron at rblasch.org): Early prototyping. A simple lexer is working.
PHP (Sidharth Kuruvila, Yauhen Yakimovich, Geoff Speicher, Rolland Brunec): The primary milestone aims at verifying lexer and parser generation; work towards a StringTemplate implementation is in progress.
Oberon (yes, Oberon; Dominik Holenstein): Planning and analyzing. First version expected for Q1/2007.
Scala (Matthew Lloyd): No status reported.

Command line options

Usage:

java org.antlr.Tool [args] file.g [file2.g file3.g ...]

Option: description

-o outputDir: specify the output directory where all output is generated; token vocabularies are searched for here as well
-fo outputDir: same as -o, but force even files with relative paths into the directory
-depend: generate file dependencies; don't actually run ANTLR
-lib dir: specify the location of token files and imported grammars
-report: print a report about the grammar(s) processed
-print: print the grammar without actions
-trace: generate a parser with trace output; if the default output is not enough, you can override the traceIn and traceOut methods
-debug: generate a parser that emits debugging events
-profile: generate a parser that computes profiling information
-nfa: generate an NFA for each rule
-dfa: generate a DFA for each decision point
-message-format name: specify the output style for messages
-X: display the extended option list
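For example, a typical invocation (with illustrative file and directory names) generates code into a build directory and pulls token vocabularies from a shared library directory:

java org.antlr.Tool -o build -lib ../tokens -report MyGrammar.g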

There are a bunch of less often used "extended" options as well.

Extended option: description

-Xgrtree: print the grammar AST
-Xdfa: print DFA as text
-Xnoprune: do not test EBNF block exit branches
-Xnocollapse: do not collapse incident edges into DFA states
-Xdbgconversion: dump lots of info during NFA conversion
-Xmultithreaded: run the analysis in 2 threads
-Xnomergestopstates: do not merge stop states
-Xdfaverbose: generate DFA states in DOT with NFA configs
-Xwatchconversion: print a message for each NFA before converting
-XdbgST: put tags at start/stop of all templates in output
-Xm m: max number of rule invocations during conversion
-Xmaxdfaedges m: max "comfortable" number of edges for a single DFA state
-Xconversiontimeout t: set NFA conversion timeout for each decision
-Xmaxinlinedfastates m: max DFA states before a table is used rather than inlining
-Xnfastates: for nondeterminisms, list NFA states for each path

Attribute and Dynamic Scopes

Token attributes

attribute: description

text: the text of the token
type: the token type (an integer)
line: the line number on which the token occurs, counting from 1
index: the overall index of the token in the token stream, counting from 0
pos: the character position of the token within its line, counting from 0
channel: the channel number the lexer assigned the token
tree: the tree node created for the token when building ASTs
int: the integer value of the token's text
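For example (a sketch with a hypothetical rule), an action can report where a labeled token appeared:

decl : t=ID ';' {System.out.println("ID "+$t.text+" at line "+$t.line+":"+$t.pos);} ;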

 

Rule attributes

Parsers

attribute: description

text: the text matched by the rule
start: the first token matched by the rule
stop: the last token matched by the rule
tree: the AST computed for the rule (with output=AST)
st: the StringTemplate computed for the rule (with output=template)
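An illustrative use (hypothetical rules; assumes output=AST and ASTLabelType=CommonTree so that $r.tree has toStringTree()):

stat : r=expr ';' {System.out.println("matched "+$r.text+" -> "+$r.tree.toStringTree());} ;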

 

Tree parsers

attribute: description

text: the text derived from the token range of the rule's start node (see "The Rule text Attribute in Tree Grammars" below)
start: the first node matched by the rule
tree: the AST computed for the rule (with output=AST)
st: the StringTemplate computed for the rule (with output=template)

Lexers

attribute: description

text: the text matched so far by the rule
type: the token type of the token being created; settable via $type=...
line: the line number of the token's first character, counting from 1
index: the index of the token in the token stream
pos: the character position within the line of the token's first character, counting from 0
channel: the token's channel; settable, e.g. $channel=HIDDEN
start: the starting character index of the token in the char stream
stop: the last character index of the token in the char stream
int: the integer value of the text matched so far
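For example (a sketch), lexer rules can read and set these attributes directly in actions:

ID : ('a'..'z'|'A'..'Z')+ { System.out.println("ID "+$text+" on line "+$line); } ;
WS : (' '|'\t'|'\n')+     { $channel = HIDDEN; } ;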

 

The Rule text Attribute in Tree Grammars

In a parser grammar, the relationship between the elements matched by a rule and the associated input text is very clear. A rule begins parsing at a particular token and stops parsing at a particular token. The text attribute for a rule, $text, is simply the concatenated text from all tokens in that range, including hidden channel tokens. What does $text mean in a tree grammar, though?

Tree grammar rules match nodes and trees, not tokens. Fortunately, each node has an associated start and stop token index (see TreeAdaptor). As the parser builds trees, each rule sets the token indexes of its return AST to the start and stop tokens of that rule. We can then define the text attribute for a tree grammar rule as the text concatenated from the range of tokens indicated in the root of the first tree matched by the rule. This definition may seem strange, but it is the most efficient implementation and works in almost all situations. Here are a few examples:

/** match tree created from, e.g., "int x;"
 *  $text would, therefore, be "int x;"
 *  $start node is VAR node.
 */
variable
    :   ^(VAR type ID) // $text derived from indexes in VAR node
    ;
/** match tree node created from, e.g., "int"
 *  $text would, therefore, be "int"
 *  $start node is 'int' node.
 */
type:   'int'          // $text derived from indexes in 'int' node
    |   'void'
    ;

The following code embodies the text attribute definition. The token range from a rule's start node defines the range of text for the entire rule.

// input is a TreeNodeStream implementation
int start = input.getTreeAdaptor().getTokenStartIndex($start);
int stop = input.getTreeAdaptor().getTokenStopIndex($start);
String text = input.getTokenStream().toString(start, stop);

Be careful when referencing the text of a rule that happens to match the root of a tree. The text of a rule is the text of all tokens underneath the first root matched by the rule. In the following example, rule op matches a single node, but $op.text will include the text associated with the two operands as well. The parser that built the plus and multiply operator nodes set the token range to include all tokens for that expression.

/** match subtrees for + and * created from input such as "1+4*2"
 *  $text and $op.text is "1+4*2" for first alternative.
 *  $text is just the INT node for second alternative.
 */
expr:   ^(op expr expr)   // $op.text is same as $text!
    |   INT
    ;
op  :   o='+' | o='*'     // $text includes text of operands
    ;                     // $o.text is just node's text

Note that the text for a node label is always just the string returned from getText() invoked on that node, whereas the text for a rule reference is always the text for the tree matched by that rule.
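For example (a sketch in the same vein), contrast a node label with a rule reference label in one action:

add :   ^(o='+' a=expr b=expr)
        {
        System.out.println($o.text); // node label: just "+", via getText()
        System.out.println($a.text); // rule reference: text of the entire left operand subtree
        }
    ;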

Finally, here is a case where the definition of the text attribute does not do what you expect. The text attribute is derived from the first node matched by a rule, so a rule such as slist below that matches multiple subtrees has an ill-defined text attribute: it only gives you the text of the first statement subtree:

func:   'void' ID '()' slist ; // $slist.text is text from first tree only
slist:  stat+ ;

In general, you just need to keep this in mind: the text attribute is natural in most cases.

Rule scopes

Global shared scopes

Lexical filters

ANTLR has a lexical filter mode that lets you sift through an input file looking for certain grammatical structures. The rules are prioritized in the order specified, in case an input construct matches more than a single rule; the first rule has the highest priority. The filter proceeds character by character looking for a match among the rules; if there is no match, it consumes a character and tries again. The following example prints "found var foo" for every field foo in the input:

lexer grammar FuzzyJava;
options {filter=true;}

FIELD
    :   TYPE WS name=ID '[]'? WS? (';'|'=')
        {System.out.println("found var "+$name.text);}
    ;

fragment
TYPE:   ID ('.' ID)*
    ;

fragment
ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
    ;

WS  :   (' '|'\t'|'\n')+
    ;

Don't forget that you must ignore text in comments, so add another rule:

COMMENT
    :   '/*' (options {greedy=false;} : . )* '*/'
        {System.out.println("found comment "+getText());}
    ;

Grammars

Grammar syntax

All grammars are of the form:

/** This is a grammar doc comment */
grammar-type grammar name;
options { name1 = value; name2 = value2; ... }
import delegateName1=grammar1, ..., delegateNameN=grammarN; // can omit delegateName
tokens { token-name1; token-name2 = value; ... }
scope global-scope-name-1 { «attribute-definitions» }
scope global-scope-name-2 { «attribute-definitions» }
...
@header { ... }
@lexer::header { ... }
@members { ... }

«rules»

The type of the grammar, specified via the grammar-type modifier above, can be one of: lexer, parser, tree, and combined (no modifier). To set the superclass of the generated parser class, use the superClass option. See Grammar options for a list of valid grammar options and their semantics.
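For instance, here is a minimal combined grammar (illustrative names) exercising most of these sections:

/** A tiny expression grammar */
grammar Expr;
options { output = AST; }
tokens { UNIT; }
@header { package demo; }          // placed above the generated parser class
@lexer::header { package demo; }   // placed above the generated lexer class
@members { int exprCount = 0; }

expr : INT ('+'^ INT)* { exprCount++; } ;
INT  : '0'..'9'+ ;
WS   : (' '|'\t')+ { $channel = HIDDEN; } ;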

Rule syntax

/** rule comment */
access-modifier rule-name[«arguments»] returns [«return-values»] throws name1, name2, ...
options {...}
scope {...}
scope global-scope-name, ..., global-scope-nameN;
@init {...}
@after {...}
    : «alternative-1» -> «rewrite-rule-1»
    | «alternative-2» -> «rewrite-rule-2»
    ...
    | «alternative-n» -> «rewrite-rule-n»
    ;
    catch [«exception-arg-1»] {...}
    catch [«exception-arg-2»] {...}
    finally {...}

See Rule and subrule options for a list of valid rule options and their semantics.
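Here is an illustrative rule (hypothetical names) exercising arguments, return values, the @init and @after actions, and an exception handler:

/** match a list of IDs, counting them and warning past max */
idList[int max] returns [int n]
@init  { $n = 0; }
@after { if ($n > $max) System.out.println("too many IDs: "+$n); }
    :   ( ID { $n++; } )+
    ;
    catch [RecognitionException re] { reportError(re); recover(input, re); }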

Lexer, Parser and Tree Parser rules

Rules in a grammar are special cases of identifier names. Lexer rules must start with an upper-case letter; parser and tree parser rules must start with a lower-case letter.

LexerRuleName : ('A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;

ParserRuleName : ('a'..'z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;

TreeParserRuleName : ('a'..'z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;

Here are some common lexical rules for programming languages:

WS  : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;}
    ;
COMMENT
    : '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;
LINE_COMMENT
    : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

The $channel=HIDDEN; action places those tokens on a hidden channel. They are still sent to the parser, but the parser does not see them. Actions, however, can ask for the hidden channel tokens. If you want to literally throw out tokens, use the skip() action (see org.antlr.runtime.Lexer.skip()).
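For example (a sketch), a whitespace rule that discards its tokens entirely rather than hiding them:

WS : (' '|'\t'|'\n')+ { skip(); } ; // no token ever reaches the parser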

Sometimes you will need helper rules to make your lexer grammar more readable. Use the fragment modifier in front of the rule:

HexLiteral : '0' ('x'|'X') HexDigit+ ;
fragment HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;

In this case, HexDigit is not a token in its own right; it can only be called from HexLiteral.

Warning: T__ is considered a reserved token name and token-name prefix. Please don't use it as one of your rule names.

Tree grammar rules

Rules in tree grammars are identical to parser grammars except that they can specify a tree element to match. The syntax is ^( root child1 child2 ... childn ). For example:

decl : ^(DECL type declarator) {System.out.println($type.text+" "+$declarator.text);}
     ;

Attribute scope syntax

Attribute scopes are a set of attribute definitions of the form:

scope name {
    type1 attribute-name1;
    type2 attribute-name2;
}
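For instance (an illustrative sketch), a global scope holding symbol names, shared by any rule that declares scope SymbolScope;:

scope SymbolScope {
    List names;
}

block
scope SymbolScope;
@init { $SymbolScope::names = new ArrayList(); }
    :   '{' stat* '}'
    ;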

Grammar action syntax

Grammar actions are, in general, of the form:

@action-name { ... }
@scope-name::action-name { ... }

The default scope-name is parser. For instance, @header is the same as @parser::header. Valid scope-names differ depending on the target, but most targets support parser and lexer. Two common action-names are header and members. The header action is placed at the top of a generated class definition and the members action is inserted within the body of a generated class definition. For example, the following grammar actions ensure that the generated parser and lexer Java classes include a package declaration:

@parser::header { package my.example.package; }
@lexer::header { package my.example.package; }

Rule elements

Rules may reference:

Element: description

T: token reference; an upper-case identifier
T<node=V> or T<V>: token reference with the optional token option node indicating the tree node type to construct; can be followed by arguments on the right-hand side of a -> rewrite rule
T[«args»]: lexer rule (token rule) reference; lexer grammars may use optional arguments for fragment token rules
r[«args»]: rule reference; a lower-case identifier with optional arguments
'«one-or-more-char»': string or char literal in single quotes; in a parser, a token reference, in a lexer, a match of that string
{«action»}: an action written in the target language, executed right after the previous element and right before the next element
{«action»}?: semantic predicate
{«action»}?=>: gated semantic predicate
(«subrule»)=>: syntactic predicate
(«x»|«y»|«z»): subrule, like a call to a rule with no name
(«x»|«y»|«z»)?: optional subrule
(«x»|«y»|«z»)*: zero-or-more subrule
(«x»|«y»|«z»)+: one-or-more subrule
«x»?: optional element
«x»*: zero-or-more element
«x»+: one-or-more element

Grammar options

Taken from /org/antlr/tool/Grammar.java, the allowed options are:

Option: description

language: the target language for code generation. Default is Java. See Code Generation Targets for the list of currently supported target languages.

tokenVocab: where ANTLR should get predefined tokens and token types; the value names a .tokens file, searched for in the -lib directory. Tree grammars need it to get the token types from the parser that creates their trees.

output: the type of output the generated parser should return. Valid values are AST (rules return trees built via the construction operators or rewrite rules) and template (rules return StringTemplate instances). By default, rules return nothing beyond their declared return values.

ASTLabelType: set the type of all tree labels and tree-valued expressions. Without this option, trees are of type Object; in the Java target the usual setting is CommonTree (org.antlr.runtime.tree.CommonTree).

TokenLabelType: set the type of all token-valued expressions. Without this option, tokens are of type org.antlr.runtime.Token in Java (IToken in C#).

superClass: set the superclass of the generated recognizer. In the Java target, the default is org.antlr.runtime.Parser for parser grammars, Lexer for lexer grammars, and tree.TreeParser for tree grammars.

filter: in the lexer, this allows you to try a list of lexer rules in order. The first one that matches wins and is the token that nextToken() returns. If nothing matches, the lexer consumes a single character and tries the list of rules again. See Lexical filters for more.

rewrite: valid values are true and false; default is false. Use this option when your translator output looks very much like the input. Your actions can modify the TokenRewriteStream to insert, delete, or replace ranges of tokens with another object. Used in conjunction with output=template, you can very easily build translators that tweak input files.

k: limit the lookahead depth for the recognizer to at most k symbols. This keeps decisions from building cyclic LL(*) DFA; with a fixed k, decisions use acyclic LL(k) DFA.

backtrack: valid values are true and false; default is false. Taken from http://www.antlr.org:8080/pipermail/antlr-interest/2006-July/016818.html : "The new feature (a big one) is the backtrack=true option for grammar, rule, and block that lets you type in any old crap and ANTLR will backtrack if it can't figure out what you meant. No errors are reported by antlr during analysis. It implicitly adds a syn pred in front of every production, using them only if static grammar LL* analysis fails. Syn pred code is not generated if the pred is not used in a decision. This is essentially a rapid prototyping mode. It is what I have used on the java.g. Oh, it doesn't memoize partial parses (i.e. rule parsing results) during backtracking automatically now. You must also say memoize=true. Can make a HUGE difference to turn on."

memoize: valid values are true and false; default is false. When backtracking, remember whether or not rule references succeed so that the same input position cannot be parsed more than once by the same rule. This effectively guarantees linear parsing when backtracking, at the cost of more memory.
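For example (illustrative values), an options block near the top of a grammar might read:

grammar Calc;
options {
    language = Java;
    output = AST;
    ASTLabelType = CommonTree;
    backtrack = true;
    memoize = true;
}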

Rule and subrule options

option: description

k: specify the exact lookahead to be used by the rule or subrule
greedy: valid values are true and false; default is true. Normally symbols are matched greedily: if input can be matched either immediately or by exiting the subrule, the parser matches it immediately.
backtrack: rule-specific version of the backtrack grammar option; see Grammar options for details
memoize: rule-specific version of the memoize grammar option; see Grammar options for details
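The classic use of the greedy option is the dangling else (a sketch); marking the optional else subrule greedy binds each else to the nearest if and silences the ambiguity warning:

stat:   'if' expr 'then' stat
        ( options {greedy=true;} : 'else' stat )?
    |   ID '=' expr ';'
    ;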

Special symbols in actions

This table describes the complete set of special symbols you can use in actions within your grammar. They are translated by the codegen/action.g ANTLR v3 grammar (in filter mode); the rules mentioned below are found in action.g.

Syntax

Description

$enclosingRule.attr

attr is a return value, parameter, or predefined property of the enclosing rule. Rule ENCLOSING_RULE_SCOPE_ATTR.

r[int i] returns [int j]
  :    {$r.i, $r.j, $r.start, $r.stop, $r.st, $r.tree}
  ;

$tokenLabel.prop
$tokenRef.prop

Token scope attribute. Rule TOKEN_SCOPE_ATTR.

$rulelabel.attr
$ruleref.attr

Rule RULE_SCOPE_ATTR.

$label

either a token label or token/rule list label like label+=expr. Rule LABEL_REF.

$tokenref

In a non-lexer grammar, yields the Token object matched for that token reference. Rule ISOLATED_TOKEN_REF.

$lexerruleref

Yields a Token object created from that rule or fragment rule. Rule ISOLATED_LEXER_RULE_REF.

$y

return value, parameter, predefined rule property, or token/rule
reference within enclosing rule's outermost alt.
y must be a "local" reference; i.e., it must be referring to
something defined within the enclosing rule. Rule LOCAL_ATTR.

r[int i] returns [int j]
  :    {$i, $j, $start, $stop, $st, $tree}
  ;

$x::y

the only way to access the attributes within a dynamic scope
regardless of whether or not you are in the defining rule. Rule DYNAMIC_SCOPE_ATTR.

scope Symbols { List names; }
r
scope {int i;}
scope Symbols;
    :   {$r::i=3;} s {$Symbols::names;}
    ;
s   :   {$r::i; $Symbols::names;}
    ;

$x[-1]::y

previous (just under top of stack). Rule DYNAMIC_NEGATIVE_INDEXED_SCOPE_ATTR.

$x[-i]::y

top of stack - i where the '-' MUST BE PRESENT;
i.e., i cannot simply be negative without the '-' sign! Rule DYNAMIC_NEGATIVE_INDEXED_SCOPE_ATTR.

$x[i]::y

absolute index i (0..size-1). Rule DYNAMIC_ABSOLUTE_INDEXED_SCOPE_ATTR.

$x[0]::y

the absolute 0-indexed element (bottom of the stack). Rule DYNAMIC_ABSOLUTE_INDEXED_SCOPE_ATTR.

$x.size()

returns the size of the current stack of the scope. Note: This particular syntax is target-dependent. Look at the target page for other targets than Java.

$r

r is a rule's dynamic scope or a global shared scope.
An isolated $rulename is not allowed unless the rule has a dynamic scope and there is no reference to rulename in the enclosing alternative, which would be ambiguous. Rule ISOLATED_DYNAMIC_SCOPE.
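As a sketch of the stack indexing and size() syntax above (hypothetical rule, Java target), a recursive rule can compute its nesting depth from the scope of the enclosing invocation:

block
scope { int level; }
@init { $block::level = $block.size() > 1 ? $block[-1]::level + 1 : 0; }
    :   '{' stat* '}'
    ;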

The following symbols relate to StringTemplate templates.

Syntax

Description

%foo(a={},b={},...)

Create an instance of template foo, setting attribute arguments. Rule TEMPLATE_INSTANCE.

%({name-expr})(a={},...)

Indirect template constructor reference. Rule INDIRECT_TEMPLATE_INSTANCE.

%x.y = z;

Set template attribute y of x (always sets, never gets, the attribute) to z. Languages like Python that do not use ';' must still write it here; the code generator is free to remove it during code generation. Rule SET_ATTRIBUTE.

%{expr}.y = z;

Set template attribute y of the StringTemplate-typed expression expr to z. Rule SET_EXPR_ATTRIBUTE.

%{string-expr}

Anonymous template built from a string expression. Rule TEMPLATE_EXPR.

Template construction

ANTLR v3 has built-in support for constructing StringTemplate templates. There are two forms: the special symbols in actions shown above, and rewrite rules similar to AST construction. I am including a number of rules from the mantra example.

Sometimes you just need a string to become a template:

'void' -> {%{"void"}}

The following tree grammar rule illustrates some of the basic rewrite rules:

primary
    :   ID -> {%{$ID.text}} // create template from token text

        // create template using rule results as template attributes
    |   ^('new' typename args=expressionList)
            -> new(type={$typename.st},args={$args.st})

    |   listliteral -> {$listliteral.st} // reuse template built for listliteral

        // create template using token text as template attribute
    |   NUM_INT   -> int_literal(v={$NUM_INT.text})
    ;

And here are some more complicated examples:

assignment
    :   // special case "a[i] = expr;"
        ^('=' ^(EXPR ^(INDEX a=expression i=expression)) rhs=completeExpression)
        -> indexed_assignment(list={$a.st}, index={$i.st}, rhs={$rhs.st})
    |   ^('=' lvalue completeExpression)
        -> assignment(
                lhs={$lvalue.st},
                rhs={$completeExpression.st})
    |   ^(assign_op lvalue completeExpression)
        -> assignment_with_op(
                type={$assign_op.start.type.name},
                op={$assign_op.text},
                lhs={$lvalue.st},
                rhs={$completeExpression.st})
    ;

When you need to append multiple strings or templates into another template, use the += operator for a rule's return value (the former use of toTemplates is no longer required). For example, adding variable declarations inside a struct template:

structDeclaration
    :   name=Ident (decls+=typeDecls)+ -> structDecl(name={$name.text},declList={$decls})
    ;

and the template for a struct declaration may look something like this:

structDecl(name,declList) ::= <<
struct <name> {
    <declList; separator="\n">
}
>>

More on string templates can be found here: String Template

Tree construction

There are two mechanisms in v3 for building abstract syntax trees (ASTs): operators and rewrite rules.

Operators

Nodes created for unmodified tokens, and trees created for unmodified rule references, are added to the current subtree as children.

Operator

Description

!

do not include node or subtree (if referencing a rule) in subtree

^

make node root of subtree created for entire enclosing rule even if nested in a subrule

additiveExpression
	:	multiplicativeExpression ('+'^ multiplicativeExpression)*
	;

That is the same as the following, written in the rewrite notation of the next section:

additiveExpression
	:	(a=multiplicativeExpression->$a) // set result
                (    '+' b=multiplicativeExpression
                     -> ^('+' $additiveExpression $b) // use previous rule result
                )*
	;

Rewrite rules

The rewrite syntax is more powerful than the operators and suffices for most common tree transformations. While the parser grammar specifies how to recognize input, the rewrites are generational grammars, specifying how to generate output; ANTLR figures out how to map input to output. To create an imaginary node, just mention it, as in the following example (UNIT is a node created from an imaginary token and is used to group the compilation unit chunks):

compilationUnit
    :   packageDefinition? importDefinition* typeDefinition+
        -> ^(UNIT packageDefinition? importDefinition* typeDefinition*)
    ;

ANTLR tracks all elements with the same name into a single implicit list:

formalArgs
	:	formalArg (',' formalArg)* -> formalArg+
	|
	;

If the same rule or token is mentioned twice you generally must label the elements to distinguish them. If you want to combine multiple elements into a single list, list labels are very handy (though in this case since they have the same name ANTLR will automatically combine them):

('implements' i+=typename (',' i+=typename)*)?

Here is the entire rule:

classDefinition[MantraAST mod]
	:	'class' cname=ID
		('extends' sup=typename)?
		('implements' i+=typename (',' i+=typename)*)?
		'{'
		(	variableDefinition
		|	methodDefinition
		|	ctorDefinition
		)*
		'}'
		-> ^('class' ID {$mod} ^('extends' $sup)? ^('implements' $i+)?
		     variableDefinition* ctorDefinition* methodDefinition*
		    )
	;

Note that using a simple action in a rewrite means: evaluate the expression and use the result as a tree node or subtree. The mod argument is a set of modifiers passed in from an enclosing rule.

Deleting tokens or rules is easy: just don't mention them:

packageDefinition
	:	'package' classname ';' -> ^('package' classname)
	;

If you need to build different trees based upon semantic information, use a semantic predicate:

variableDefinition
	:	modifiers typename ID ('=' completeExpression)? ';'
		-> {inMethod}? ^(VARIABLE ID modifiers? typename completeExpression?)
		->             ^(FIELD ID modifiers? typename completeExpression?)
	;

where inMethod is set by the method rule.

Often you will need to build a tree node from an input token but with the token type changed:

compoundStatement
	:	lc='{' statement* '}' -> ^(SLIST[$lc] statement*)
	;

SLIST by itself is a new node based upon token type SLIST but it has no line/column information nor text. By using SLIST[$lc], all information except the token type is copied to the new node.

Using a rewrite rule somewhere other than at the extreme right edge of a production is OK, but it still always sets the overall subtree for the enclosing rule.

'if' '(' equalityExpression ')' s1=statement
( 'else' s2=statement -> ^('if' ^(EXPR equalityExpression) $s1 $s2)
|                     -> ^('if' ^(EXPR equalityExpression) $s1)
)

You may reference the previous subtree for the enclosing rule using the $rulename syntax:

postfixExpression
	:	(primary->primary) // set return tree
		(	lp='(' args=expressionList ')' -> ^(CALL $postfixExpression $args)
		|	lb='[' ie=expression ']'       -> ^(INDEX $postfixExpression $ie)
		|	dot='.' p=primary              -> ^(FIELDACCESS $postfixExpression $p)
		|	c=':' cl=closure[false]        -> ^(APPLY ^(EXPR $postfixExpression) $cl)
		)*
	;

Imaginary nodes

Token references in a rewrite that do not appear to the left of the -> are imaginary tokens.

d : type ID ';' -> ^(DECL type ID) ; // DECL is imaginary

or

call : lp='(' ID args ')' -> ^(CALL[$lp] ID args) ;

Here, the CALL node has its line/column info set from the '(' token; the CALL node is "derived" from the '('.

Even tokens referenced within the alternative result in nodes dissociated from the tokens to the left of -> if you put arguments on the references:

a : INT -> INT["99"] ; // node created from adaptor.create(INT, "99")

Tree construction during tree parsing

ANTLR 3.0.1 could not create trees during tree parsing. 3.1 introduces the ability to create a new AST from an incoming AST using rewrite rules:

  • Each rule returns a new tree.
  • An alternative without a rewrite duplicates the incoming tree.
  • The tree returned from the start rule is the new tree.
  • The new tree created with output=AST in a tree grammar is completely independent of the input tree as all nodes are duplicated (with and without rewrite -> operator).

The rewrites work just like they do for normal parsing:

a : INT ; // duplicate INT node and return
a : ID -> ; // delete ID node from tree
a : INT ID -> ID INT ; // reorder nodes
a : ^(ID INT) -> ^(INT ID) ; // flip order of nodes in tree
a : INT -> INT["99"] + // make new INT node
a : (^(ID INT))+ -> INT+ ID+ ; // break apart trees into sequences

Predicates can be used to choose between rewrites as well:

a : ^(ID INT) -> {some test}? ^(ID["ick"] INT)
              -> INT
  ;

Don't forget the wildcard:

s : ^(ID c=.) -> $c ; // new tree is whatever matched wildcard

Polynomial differentiation example

For translations whose input and output languages are the same, it often makes sense to build a tree and then morph it towards the final output tree, which can then be converted to text. Polynomial differentiation is a great example of this. Recall that:

  • d/dx(n) = 0
  • d/dx(x) = 1
  • d/dx(nx) = n
  • d/dx(nx^m) = nmx^(m-1)
  • d/dx(foo + bar) = d/dx(foo) + d/dx(bar)

Ok, here's a parser that builds nice trees.

grammar Poly;
options {output=AST;}
tokens { MULT; } // imaginary token

poly: term ('+'^ term)*
    ;

term: INT ID  -> ^(MULT["*"] INT ID)
    | INT exp -> ^(MULT["*"] INT exp)
    | exp
    | INT
    | ID
    ;
exp : ID '^'^ INT
    ;

ID  : 'a'..'z'+ ;
INT : '0'..'9'+ ;
WS  : (' '|'\t'|'\r'|'\n')+ {skip();} ;

Then we differentiate:

tree grammar PolyDifferentiator;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=AST;
//  rewrite=true; // works either in rewrite or normal mode
}

poly:   ^('+' poly poly)
    |   ^(MULT INT ID)      -> INT
    |   ^(MULT c=INT ^('^' ID e=INT))
        {
        String c2 = String.valueOf($c.int*$e.int);
        String e2 = String.valueOf($e.int-1);
        }
                            -> ^(MULT["*"] INT[c2] ^('^' ID INT[e2]))
    |   ^('^' ID e=INT)
        {
        String c2 = String.valueOf($e.int);
        String e2 = String.valueOf($e.int-1);
        }
                            -> ^(MULT["*"] INT[c2] ^('^' ID INT[e2]))
    |   INT                 -> INT["0"]
    |   ID                  -> INT["1"]
    ;

then we simplify (a little anyway):

tree grammar Simplifier;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=AST;
    backtrack=true;
//  rewrite=true; // works either in rewrite or normal mode
}
/** Match some common patterns that we can reduce via identity
 *  definitions.  Since this is only run once, it will not be perfect.
 *  We'd need to run the tree through this repeatedly until nothing
 *  changed to make it correct.
 */
poly:   ^('+' a=INT b=INT)  -> INT[String.valueOf($a.int+$b.int)]

    |   ^('+' ^('+' a=INT p=poly) b=INT)
                            -> ^('+' $p INT[String.valueOf($a.int+$b.int)])
    |   ^('+' ^('+' p=poly a=INT) b=INT)
                            -> ^('+' $p INT[String.valueOf($a.int+$b.int)])
    |   ^('+' p=poly q=poly)-> {$p.tree.toStringTree().equals("0")}? $q
                            -> {$q.tree.toStringTree().equals("0")}? $p
                            -> ^('+' $p $q)
    |   ^(MULT INT poly)    -> {$INT.int==1}? poly
                            -> ^(MULT INT poly)
    |   ^('^' ID e=INT)     -> {$e.int==1}? ID
                            -> {$e.int==0}? INT["1"]
                            -> ^('^' ID INT)
    |   INT
    |   ID
    ;

Finally we walk the tree to print it back out using simple templates:

tree grammar PolyPrinter;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=template;
}

poly:   ^('+'  a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a>+<b>"
    |   ^(MULT a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a><b>"
    |   ^('^'  a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a>^<b>"
    |   INT                     -> {%{$INT.text}}
    |   ID                      -> {%{$ID.text}}
    ;

Here is a test rig:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
    public static void main(String[] args) throws Exception {
        CharStream input = null;
        if ( args.length>0 ) {
            input = new ANTLRFileStream(args[0]);
        }
        else {
            input = new ANTLRInputStream(System.in);
        }

        // BUILD AST
        PolyLexer lex = new PolyLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lex);
        PolyParser parser = new PolyParser(tokens);
        PolyParser.poly_return r = parser.poly();
        System.out.println("tree="+((Tree)r.tree).toStringTree());

        // DIFFERENTIATE
        CommonTreeNodeStream nodes = new CommonTreeNodeStream((Tree)r.tree);
        nodes.setTokenStream(tokens);
        PolyDifferentiator differ = new PolyDifferentiator(nodes);
        PolyDifferentiator.poly_return r2 = differ.poly();
        System.out.println("d/dx="+((Tree)r2.tree).toStringTree());

        // SIMPLIFY / NORMALIZE
        nodes = new CommonTreeNodeStream((Tree)r2.tree);
        nodes.setTokenStream(tokens);
        Simplifier reducer = new Simplifier(nodes);
        Simplifier.poly_return r3 = reducer.poly();
        System.out.println("simplified="+((Tree)r3.tree).toStringTree());
        // CONVERT BACK TO POLYNOMIAL
        nodes = new CommonTreeNodeStream((Tree)r3.tree);
        nodes.setTokenStream(tokens);
        PolyPrinter printer = new PolyPrinter(nodes);
        PolyPrinter.poly_return r4 = printer.poly();
        System.out.println(r4.st.toString());
    }
}

Running the rig on "2x+3x^5" shows:

tree=(+ (* 2 x) (* 3 (^ x 5)))
d/dx=(+ 2 (* 15 (^ x 4)))
simplified=(+ 2 (* 15 (^ x 4)))
2+15x^4

Rewriting an existing AST

For efficiency, option rewrite=true does an in-line replacement for rewrite rules so you can avoid making a copy of an entire tree just to tweak a few nodes. For example, if you have a huge expression tree but only want to rewrite ^('+' INT INT) to be a single INT node, it's better not to duplicate the entire huge tree. The rewrite mode behaves exactly the same as nonrewrite mode except that rewrites stitch changes into the incoming tree. Nodes are not duplicated for rules w/o rewrites.

The result of a rule with a rewrite is the newly created tree. The result of a rule w/o a rewrite is simply the incoming tree. For chains of rule invocations as in the next example, ANTLR copies rewrites upwards so that the action in rule 's' prints out the tree created in b:

tree grammar TP;
options {output=AST; ASTLabelType=CommonTree; tokenVocab=T; rewrite=true;}
s : a {System.out.println($a.tree.toStringTree());} ;
a : b ;
b : ID INT -> INT ID ;

Heterogeneous tree nodes

By default, with output=AST, ANTLR creates trees of type CommonTree. To create different nodes depending on the incoming token type, you can override create(Token), YourTreeClass.dupNode(Object), and errorNode() in a subclass of CommonTreeAdaptor, or implement your own TreeAdaptor. Unfortunately, this only allows you to change the node type based upon the token type, not the grammatical context. Sometimes you want an ID to become a VarNode and sometimes a MethodNode object. As of v3.1, you can use the node token option to indicate the node type (in both parsers and tree parsers):

decl : 'int'<node=TypeNode> ID<node=VarNode> ';' ;

or equivalently

decl : 'int'<TypeNode> ID<VarNode> ';' ;

because node is assumed if there is only one option and it is not an option assignment. Token references with node options invoke the following constructor during tree construction:

public V(Token t); // NEED SPECIAL CTOR for ID<V> on left of ->

You can specify the node type on any token reference, including literals:

a : ID<V> ';'<V> ;

The "become root" operator ^ is used following the token options:

e : INT '+'<PlusNode>^ INT ;

Labels are available as usual; e.g., x=ID<V> and x+=ID<V>.

Heterogeneous tree nodes are labeled on the right-hand side of the -> rewrite operator as well:

decl : 'int' ID -> ^('int'<TypeNode> ID<VarNode>) ;

You can also specify arguments on node type constructors on the right of -> rewrite operator. For example, the following two token references:

ID<V>[42,19,30] ID<V>[$ID,99]

invoke the following two constructors of V:

public V(int ttype, int x, int y, int z)
public V(int ttype, Token t, int x)

The TreeAdaptor is not called; instead, the constructors are invoked directly. This is much more flexible because the list of arguments can change per type, whereas the TreeAdaptor interface is fixed. Note that parameters are not allowed on token references to the left of ->:

a : ID<V>[23,21] ; // ILLEGAL

Use imaginary nodes as you normally would, but with the addition of the node type:

block : lc='{' stat+ '}' -> ^(BLOCK<StatementList>[$lc] stat+) ;

Here is a complete simple example:

grammar T;
options {output=AST;}
@members {
static class V extends CommonTree {
  public int x,y,z;
  public V(int ttype, int x, int y, int z) {
    this.x=x; this.y=y; this.z=z; token=new CommonToken(ttype,"");
  }
  public V(int ttype, Token t, int x) { token=t; this.x=x; }
  public String toString() {
    return (token!=null?token.getText():"")+"<V>;"+x+y+z;
  }
}
}
a : ID -> ID<V>[42,19,30] ID<V>[$ID,99] ;
ID : 'a'..'z'+ ;
WS : (' '|'\n') {$channel=HIDDEN;} ;

Sometimes ANTLR must duplicate nodes to avoid cycles and to provide useful semantics. In the next example, the trees returned from rule type are of type V. There is only one type specification in the input (e.g., "int a,b,c;") but multiple identifiers. To create multiple trees with 'int' at the root, that node must be duplicated by the rewrite rule ^(type ID)+.

grammar T;
options {output=AST;}
a : type ID (',' ID)* ';' -> ^(type ID)+;
type : 'int'<V> ;
ID : 'a'..'z'+ ;
INT : '0'..'9'+;
WS : (' '|'\n') {$channel=HIDDEN;} ;

We want 3 trees, one for each identifier:

(int<V> a) (int<V> b) (int<V> c)

We need to override dupNode() in our node class definition, as well as define two constructors:

class V extends CommonTree {
    public V(Token t) { token=t;}                 // for 'int'<V>
    public V(V node) { super(node); }             // for dupNode
    public Tree dupNode() { return new V(this); } // for dup'ing type
    public String toString() { return token.getText()+"<V>";}
}

Here is a test rig:

public class Test {
    public static void main(String[] args) throws Exception {
        CharStream input = new ANTLRFileStream(args[0]);
        TLexer lex = new TLexer(input);
        TokenRewriteStream tokens = new TokenRewriteStream(lex);
        TParser parser = new TParser(tokens);
        TParser.a_return r = parser.a();
        if ( r.tree!=null ) {
            System.out.println(((Tree)r.tree).toStringTree());
            ((CommonTree)r.tree).sanityCheckParentAndChildIndexes();
        }
    }
}

Using custom AST node types

/** An adaptor that tells ANTLR to build CymbolAST nodes */
public static TreeAdaptor cymbolAdaptor = new CommonTreeAdaptor() {
    public Object create(Token token) {
        return new CymbolAST(token);
    }
    public Object dupNode(Object t) {
        if ( t==null ) {
            return null;
        }
        return create(((CymbolAST)t).token);
    }
    public Object errorNode(TokenStream input, Token start, Token stop,
                            RecognitionException e)
    {
        CymbolErrorNode t = new CymbolErrorNode(input, start, stop, e);
        return t;
    }
};

Here's a suitable error node:

/** A node representing erroneous token range in token stream */
public class CymbolErrorNode extends CymbolAST {
        org.antlr.runtime.tree.CommonErrorNode delegate;

        public CymbolErrorNode(TokenStream input, Token start, Token stop,
                                           RecognitionException e)
        {
                delegate = new CommonErrorNode(input,start,stop,e);
        }

        public boolean isNil() { return delegate.isNil(); }

        public int getType() { return delegate.getType(); }

        public String getText() { return delegate.getText(); }
        public String toString() { return delegate.toString(); }
}

Error Node Insertion Upon Syntax Error

Prior to v3.1, ANTLR AST-building parsers did not alter the resulting AST upon a syntax error. As of v3.1, ANTLR adds an error node, as created by TreeAdaptor.errorNode(...), to represent missing nodes or confusing input sequences. The first token in the error sequence is the token at which the parser first detected an error. The last token in the sequence is the last token consumed during error recovery. ANTLR creates a CommonErrorNode by default, but you can create your own tree adaptor and override this.

Let me demonstrate the new mechanism by example. Referring to the attached SimpleC.g from the v3 examples, here is some good input:

int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}

That input results in the following tree output:

tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

The grammar and tree construction for the FOR loop is as follows:

forStat
    :   'for' '(' start=assignStat ';' expr ';' next=assignStat ')' block
        -> ^('for' $start expr $next block)
    ;

Now, remove the first '(' of the for loop:

int foo() {
  for i=0; i<3; i=i+1) {
    x=9;
  }
}

You will see that ANTLR detects an error, but magically inserts the missing token. In this case, the parser was not asked to insert the '(' into the tree so there is no evidence of the error in the output tree:

line 2:6 missing '(' at 'i'
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

What about when you have a random extra token such as "22" before the '(':

int foo() {
  for 22 (i=0; i<3; i=i+1) {
    x=9;
  }
}

The parser reports:

line 2:6 extraneous input '22' expecting '('
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

Again, ANTLR detects the error and is able to ignore the extraneous token to yield a valid tree.

If you forget a token that must go into the output tree, however, you will see an error node. Given a missing identifier at the start of the FOR loop:

int foo() {
  for (=0; i<3; i=i+1) {
    x=9;
  }
}

The parser emits:

line 2:7 missing ID at '='
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= <missing ID> 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

The toString() method of the error node yields "<missing ID>".

When the parser gets really confused, such as when it gets a NoViableAltException, you will see that it consumes a whole bunch of input and adds it to the tree as an error node (it indicates what tokens it consumes during resynchronization). Input:

);
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}

yields:

line 1:0 required (...)+ loop did not match anything at input ')'
tree=<error: );
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}>

Making custom error nodes

Just override errorNode() in TreeAdaptor. The default handling is as follows:

public Object errorNode(TokenStream input, Token start, Token stop,
			RecognitionException e)
{
	CommonErrorNode t = new CommonErrorNode(input, start, stop, e);
	return t;
}

Make sure that your error node type is a subclass of your node type so that you do not get class cast exceptions.

See the next section for an example of how to override it.

Turning off error node construction

To turn this off, just override errorNode:

class MyAdaptor extends CommonTreeAdaptor {
    public Object errorNode(TokenStream input, Token start, Token stop,
                            RecognitionException e)
    {
        return null;
    }
}

and then set

parser.setTreeAdaptor(new MyAdaptor());

What makes a language problem hard?

Given a source to target mapping. How can you characterize the difficulty of the translation?

  • Is the set of all input fixed? If you have a fixed set of files to convert, your job is much easier because the set of language construct combinations is fixed. For example, building a general Pascal to Java translator is much harder than building a translator for a set of 50 existing Pascal files.
  • Forward or external references? I.e., multiple passes needed? Pascal has a "forward" reference to handle intra-file procedure references, but references to procedures in other files via the USES clauses etc... require special handling.
  • Is input order of sentences close to output order? Are there multiple files to generate from a single input file or vice versa?
  • Context-sensitive lexer? You can't decide what vocabulary symbol to match unless you know what kind of sentence you are parsing.
  • Are delimiters non-fixed for things like strings and comments? That makes it tough to build an efficient lexer.
  • Is the language big, with lots of statement types?
  • Are the source statements really similar, like declarations vs expressions in C++?
  • Column-sensitive input? E.g., are newlines significant, like lines in a log file, and does the position of an item change its meaning?
  • Case sensitivity problems, as in Fortran?
  • Do you need good error recovery? Good reporting?
  • Well-defined language, or no manual and hacked on for ages by non-language designers, like gnu cc? Is your language VisualBasic-like?
  • How fast does your translator have to be? It is often the case that building lots of translator phases simplifies your problem, but it can slow down the translation.
  • Does your input have comments, as in programming languages, that can occur anywhere in the input and need to go into the output in a sane location?
  • How much semantic information do you need to do the translation? For example, do you need to simply know that something is a type name, or do you need to know that it is, say, an array whose indices are a set like (day,week,month) and contains records? Sometimes syntax alone is enough to do translation.
  • Equivalent syntaxes? In C there are many different ways to dereference pointers. You can normalize the language to a standard representation, but you might lose the original representation. The choice usually hinges on whether the output will be human-edited or not. Designing the right tree structure has to incorporate decisions like this.
  • Jurgen Pfundt points out: the considered language might be small, but the mapping may be targeted at the conversion of huge files, and this is really a challenge. An input file with a size of several megabytes restricts the usage of tree parsers or any other kind of memory-consuming features. The transformation should be done in one single pass due to performance requirements, and extremely good and comfortable error reporting and error recovery is a must.

Integration with development environments

VisualStudio

C# Projects

For C# projects, you can integrate ANTLR with Visual Studio 2005/2008 by pasting the following snippet of XML near the end of your .csproj file. The <Target> tag must appear before <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets"/> for the build to run successfully. Adjust the Include and OutputFiles values to fit your project.

<ItemGroup>
    <Antlr3 Include="SimpleCalc.g">
      <OutputFiles>SimpleCalcLexer.cs;SimpleCalcParser.cs</OutputFiles>
    </Antlr3>
    <Antlr3 Include="BigCalc.g">
      <OutputFiles>BigCalcLexer.cs;BigCalcParser.cs</OutputFiles>
    </Antlr3>
  </ItemGroup>
  <Target Name="GenerateAntlrCode" Inputs="@(Antlr3)" Outputs="%(Antlr3.OutputFiles)">
    <Exec Command="java org.antlr.Tool -message-format vs2005 @(Antlr3)" Outputs="%(Antlr3.OutputFiles)"/>
  </Target>
  <PropertyGroup>
    <BuildDependsOn>GenerateAntlrCode;$(BuildDependsOn)</BuildDependsOn>
  </PropertyGroup>

To finish off the integration, look higher in the same file and find the XML block that contains AssemblyInfo.cs:

<ItemGroup>
    <Compile Include="Program.cs" />
    <Compile Include="Properties\AssemblyInfo.cs" />
</ItemGroup>

and add the additional Compile entries below to have your .cs output files added to the project as dependent on your .g files. Again, adjust the input and output names to fit your project.

<ItemGroup>
    <Compile Include="Program.cs" />
    <Compile Include="Properties\AssemblyInfo.cs" />
    <Compile Include="SimpleCalcLexer.cs">
        <AutoGen>True</AutoGen>
        <DesignTime>True</DesignTime>
        <DependentUpon>SimpleCalc.g</DependentUpon>
    </Compile>
    <Compile Include="SimpleCalcParser.cs">
        <AutoGen>True</AutoGen>
        <DesignTime>True</DesignTime>
        <DependentUpon>SimpleCalc.g</DependentUpon>
    </Compile>
    <Compile Include="BigCalcParser.cs">
        <AutoGen>True</AutoGen>
        <DesignTime>True</DesignTime>
        <DependentUpon>BigCalc.g</DependentUpon>
    </Compile>
    <Compile Include="BigCalcLexer.cs">
        <AutoGen>True</AutoGen>
        <DesignTime>True</DesignTime>
        <DependentUpon>BigCalc.g</DependentUpon>
    </Compile>
</ItemGroup>

Finally, do not forget to add the InitialTargets attribute to the Project node.

<Project DefaultTargets="Build" InitialTargets="GenerateAntlrCode" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

With ANTLR 3.0, there are some minor integration issues. Sometimes error messages have an empty "location" string, so Visual Studio will not detect the error. Also, it is possible for ANTLR to generate the output file successfully even though your grammar has errors in it. When this happens, the grammar will not be recompiled until it is edited again (since the output is more recent than the grammar).

C/C++ .rules Files for Visual Studio

The C runtime distribution - use the 3.1 distribution, which you may need to get from the latest interim build at the time of writing - comes with a set of .rules files, which you can add in to your Visual Studio configuration. When you add a .g file to your Visual Studio project, it will ask you which rule file you wish to use. Note that you must tell Visual Studio whether this is a lexer only, parser only, parser+lexer, or tree grammar (you can change it later if you get this wrong).

The rulefiles are called:

antlr3lexer.rules
antlr3lexerandparser.rules
antlr3parser.rules
antlr3treeparser.rules

Right-clicking on a .g file in Solution Explorer and selecting Properties will allow you to configure any of the ANTLR command line options to suit your needs. The defaults are usually fine, but you may wish to configure the output and lib directories to conform to the usual directory layouts that Visual Studio assumes, and -Xconversiontimeout is useful for more complicated grammars. The base directory will be the .vcproj directory, so whether your output lands there depends essentially on whether you store your .g files in the same directory as the .vcproj file.

The examples solution (C.sln), which is part of the downloadable examples tar/zip on the main ANTLR downloads page, uses this technique to build the ANTLR grammars, if you are looking for some examples. Look under the C subdirectory for the C target examples.

N.B.

These .rules files only work for the C target.


Eclipse

Eclipse 3.3+ for ANTLR 3.x

AntlrDT is a standard Eclipse plugin implementing an ANTLR 3.1+ specific grammar editor, outline, and builder. It also includes a StringTemplate group file editor and outline view.

See http://www.certiv.net/projects/plugins/antlrdt.html

ANTLR IDE. An eclipse plugin for ANTLRv3 grammars.

Features

  • Support for ANTLR 3.0.x/3.1.x
  • Integrated ANTLR/Java launcher and debugger (beta). Note: ANTLR breakpoints are not supported yet
  • ANTLR built-in interpreter, Java runner and debugger
  • Railroad diagrams
  • Custom targets
  • Automatic (Ctrl+S) or manual (Ctrl+Shift+G) code generation
  • Problem markers for errors and warnings in grammar files
  • Advanced text editor, code selection (F3) and code completion (Ctrl+Space)
  • Simple syntax highlighting for the target language (action code)
  • Outline and quick outline (Ctrl+O) views for options, tokens, scopes, actions and rules
  • Search rule references
  • Mark generated resources as derived

More information? Please visit http://antlrv3ide.sourceforge.net/

IntelliJ

In version 7:

Navigate to File->Settings->IDE Settings->Plugins and install the "ANTLRWorks" plugin.

NetBeans

A plugin that provides some support for editing ANTLR v3 grammar files is available from the NetBeans Plugin Portal.  The author notes that the plugin supports the following while editing: Coloring, Code folding, Code completion, Hyperlink, Mark Occurrences, Navigator.

The plugin does not provide direct support for generating Java source files, nor for compiling those files; however, NetBeans uses an Ant-based build system, so support for these operations can be added to individual projects. This functionality is added by editing the project's build.xml file (available from the "Files" tab).

Adding Build and Clean Support to an Individual Project

This is one possible method of adding support for ANTLR v3 to an individual project. Those who are familiar with Ant build scripts and the NetBeans build chain can modify this to suit their own needs.

Prerequisites

Install and Add Support to Project

First, download the ANTLR v3 Ant task and copy the task's antlr3.jar file into the NetBeansInstallDir/java2/ant/lib directory, where NetBeansInstallDir is the directory where NetBeans is installed (e.g. C:\Program Files\NetBeans 6.5 on a Windows system).

Next, download the Ant-contrib tasks and copy the ant-contrib-1.0b3.jar to the same directory as above.

Next, start NetBeans, and open the project that you'd like to add ANTLR support to. Click on the "Files" tab, and find the build.xml file. Double click to open it. Be careful not to change anything at the top of this file, especially the contents of the <project> and <import> tags. Scroll down past the comments, and find the closing </project> tag. Add the following above the closing </project> tag, but below the closing comment tag (-->). In the first three lines, below, replace AntlrInstallDir with the location of your ANTLR installation (e.g., mine is C:\java\antlr-3.1.2).

build.xml fragment for Antlr v3 support
    <property name="antlr.libdir" location="AntlrInstallDir/lib" />
    <property name="antlr.tooldir" location="AntlrInstallDir/lib" />
    <property name="antlr.runtimedir" location="AntlrInstallDir/lib" />


    <patternset id="antlr.libs">
        <include name="stringtemplate-3.1.jar" />
        <include name="antlr277.jar" />
    </patternset>

    <patternset id="antlr.tool">
        <include name="antlr-3.1.2.jar" />
    </patternset>

    <patternset id="antlr.runtime">
        <include name="antlr-runtime-3.1.2.jar" />
    </patternset>

    <path id="antlr.path">
        <fileset dir="${antlr.tooldir}" casesensitive="yes">
            <patternset refid="antlr.tool" />
        </fileset>

        <fileset dir="${antlr.runtimedir}" casesensitive="yes">
            <patternset refid="antlr.runtime" />
        </fileset>

        <fileset dir="${antlr.libdir}" casesensitive="yes">
            <patternset refid="antlr.libs" />
        </fileset>
    </path>


    <target name="-pre-init">
        <taskdef resource="net/sf/antcontrib/antlib.xml"/>
    </target>


    <target name="-post-clean">
        <fileset id="antlr.grammars" dir="${src.dir}" includes="**/*.g"/>

        <pathconvert property="antlr.clean.files" pathsep=',' refid="antlr.grammars">
            <compositemapper>
                <globmapper from="${basedir}${file.separator}${src.dir}${file.separator}*.g"
                    to="*.tokens"/>
                <globmapper from="${basedir}${file.separator}${src.dir}${file.separator}*.g"
                    to="*Parser.java"/>
                <globmapper from="${basedir}${file.separator}${src.dir}${file.separator}*.g"
                    to="*Lexer.java"/>
                <chainedmapper>
                    <globmapper from="${basedir}${file.separator}${src.dir}${file.separator}*.g"
                        to="*.g"/>
                    <compositemapper>
                        <regexpmapper from="(([^/]*/)*).*\.g"
                            to="\1__Test__.java" handledirsep="true"/>
                        <regexpmapper from="(([^/]*/)*).*\.g"
                            to="\1__Test___input.txt" handledirsep="true"/>
                    </compositemapper>
                </chainedmapper>
            </compositemapper>
        </pathconvert>
        <pathconvert property="antlr.clean.dirs" pathsep=',' refid="antlr.grammars">
            <chainedmapper>
                <globmapper from="${basedir}${file.separator}${src.dir}${file.separator}*.g"
                    to="*.g"/>
                <compositemapper>
                    <regexpmapper from="(([^/]*/)*).*\.g"
                        to="\1classes/**/*" handledirsep="true"/>
                    <regexpmapper from="(([^/]*/)*).*\.g"
                        to="\1classes" handledirsep="true"/>
                </compositemapper>
            </chainedmapper>
        </pathconvert>

        <if>
            <not>
                <equals arg1="${antlr.clean.files}" arg2=""/>
            </not>
            <then>
                <echo level="info">Cleaning ANTLR- and ANTLRWorks-generated files (if any exist):${line.separator}</echo>
                <delete quiet="true" verbose="true">
                    <FileSet dir="${src.dir}" includes="${antlr.clean.files}" excludes="${src.dir}"/>
                </delete>
            </then>
        </if>
        <if>
            <not>
                <equals arg1="${antlr.clean.dirs}" arg2=""/>
            </not>
            <then>
                <echo level="info">Cleaning ANTLRWorks-generated directories (if any exist):${line.separator}</echo>
                <delete quiet="true" verbose="true" includeemptydirs="true">
                    <FileSet dir="${src.dir}" includes="${antlr.clean.dirs}" excludes="${src.dir}"/>
                </delete>
            </then>
        </if>
    </target>


    <target name="-pre-compile-single">
        <basename property="javac.includes.base" file="${javac.includes}"/>
        <if>
            <equals arg1="${javac.includes.base}" arg2="*"/>
            <then>
                <for param="antlr.target">
                    <path>
                        <fileset dir="${src.dir}" includes="${javac.includes}.g"/>
                    </path>
                    <sequential>
                        <antlr:antlr3 xmlns:antlr="antlib:org/apache/tools/ant/antlr"
                            target="@{antlr.target}">
                            <classpath>
                                <path refid="antlr.path" />
                            </classpath>
                        </antlr:antlr3>
                    </sequential>
                </for>
            </then>
        </if>
    </target>


    <target name="-pre-compile">
        <for param="antlr.target">
            <path>
                <fileset dir="${src.dir}" includes="**/*.g"/>
            </path>
            <sequential>
                <antlr:antlr3 xmlns:antlr="antlib:org/apache/tools/ant/antlr"
                    target="@{antlr.target}">
                    <classpath>
                        <path refid="antlr.path" />
                    </classpath>
                </antlr:antlr3>
            </sequential>
        </for>
    </target>

Usage

The above integrates with the regular Java build/clean cycle in NetBeans. Execute the IDE's build operation, and the grammar file(s) will be passed through the ANTLR tool to generate the necessary Java code prior to the compile step. You can access the build operation from the "Run" menu, by pressing <F11>, or by right-clicking on the project in the "Projects" tab and selecting "Build" from the context menu. You can also right-click a package and choose "Compile Package" to generate and compile for a single package. Context menu support is not available for the grammar file itself, unfortunately.

The IDE's clean operation will remove all generated .java and .tokens files, and it will also remove files generated from an ANTLRWorks debug session. The latter allows you to use both NetBeans and ANTLRWorks together on a grammar file, and when ready, clean and build from NetBeans. If your grammar file is named grammarFilename.g, then the following files and subdirectories will be deleted (if they exist) from the directory that contains the grammar file, when you invoke the clean operation:

  • grammarFilename.tokens
  • grammarFilenameLexer.java
  • grammarFilenameParser.java
  • __Test__.java (created from an ANTLRWorks debug session)
  • __Test___input.txt (created from an ANTLRWorks debug session)
  • classes directory found below the directory that grammarFilename.g is found in (created from an ANTLRWorks debug session)

The clean operation is accessed by pressing <Shift>-<F11> (for a clean and build in one step), selecting "Clean and Build" from the "Run" menu, or "Clean" (or "Clean and Build") from the project's context menu. You can control what does and does not get deleted by modifying the -post-clean target in the above example (requires knowledge of Ant build script syntax).

Xcode

will follow...