Five minute introduction to ANTLR 3

What is ANTLR 3?


ANTLR - ANother Tool for Language Recognition - is a tool used to construct formal language software tools (or just language tools) such as translators, compilers, recognizers, and static/dynamic program analyzers. Developers use ANTLR to reduce the time and effort needed to build and maintain language processing tools. In common terminology, ANTLR is a compiler generator or compiler compiler (in the tradition of tools such as Lex/Flex and Yacc/Bison), and it is used to generate the source code for language recognizers, analyzers and translators from language specifications. ANTLR takes as its input a grammar - a precise description of a language augmented with semantic actions - and generates source code files and other auxiliary files. The target language of the generated source code (e.g. Java, C/C++, C#, Python, Ruby) is specified in the grammar.

Software developers and language tool implementors can use ANTLR to implement Domain-Specific Languages, to generate parts of language compilers and translators, or even to help them build tools that parse complex XML.

As stated above, ANTLR 3 generates the source code for various tools that can be used to recognize, analyze and transform input data relative to a language that is defined in a specified grammar file. The basic types of language processing tools that ANTLR can generate are Lexers (a.k.a. scanners, tokenizers), Parsers, and TreeParsers (a.k.a. tree walkers, cf. visitors).

What exactly does ANTLR 3 do?

ANTLR reads a language description file called a grammar and generates a number of source code files and other auxiliary files. Most uses of ANTLR generate at least one (and quite often both) of these tools:

A Lexer: This reads an input character or byte stream (i.e. characters, binary data, etc.), divides it into tokens using patterns you specify, and generates a token stream as output. It can also flag some tokens such as whitespace and comments as hidden using a protocol that ANTLR parsers automatically understand and respect.
A Parser: This reads a token stream (normally generated by a lexer), and matches phrases in your language via the rules (patterns) you specify, and typically performs some semantic action for each phrase (or sub-phrase) matched. Each match could invoke a custom action, write some text via StringTemplate, or generate an Abstract Syntax Tree for additional processing.
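The lexer's job can be sketched by hand in a few lines of Python. This is only an illustration of the idea, not the code ANTLR generates: the Token class, the channel numbers and the tokenize/visible helper names are assumptions made for this example, and the token names are borrowed from the calculator grammar used later on this page.

```python
import re
from typing import Iterator, List, NamedTuple

# Illustrative channel numbers; not read from the ANTLR runtime.
DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 99


class Token(NamedTuple):
    type: str
    text: str
    channel: int


# One named group per token pattern, in the spirit of lexer rules.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>[0-9]+)"
    r"|(?P<PLUS>\+)"
    r"|(?P<MINUS>-)"
    r"|(?P<MULT>\*)"
    r"|(?P<DIV>/)"
    r"|(?P<WHITESPACE>[\t \r\n\x0c]+)"
)


def tokenize(text: str) -> Iterator[Token]:
    """Divide a character stream into tokens, flagging whitespace as hidden."""
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind = match.lastgroup
        channel = HIDDEN_CHANNEL if kind == "WHITESPACE" else DEFAULT_CHANNEL
        yield Token(kind, match.group(), channel)
        pos = match.end()


def visible(tokens) -> List[Token]:
    """Return only the tokens a parser would look at (hidden ones are skipped)."""
    return [t for t in tokens if t.channel == DEFAULT_CHANNEL]
```

Whitespace tokens are still produced, so a later stage could inspect them, but a parser reading only the default channel never has to mention them in its rules.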

ANTLR's Abstract Syntax Tree (AST) processing is especially powerful. If you also specify a tree grammar, ANTLR will generate a Tree Parser for you that can contain custom actions or StringTemplate output statements. The next version of ANTLR (3.1) will support rewrite rules that can be used to express tree transformations.

Most language tools will:

Use a Lexer and Parser in series to check the word-level and phrase-level structure of the input and, if no fatal errors are encountered, create an intermediate tree representation such as an Abstract Syntax Tree (AST),
Optionally modify (i.e. transform or rewrite) the intermediate tree representation (e.g. to perform optimizations) using one or more Tree Parsers, and
Produce the final output using a Tree Parser to process the final tree representation. This might mean generating source code or another textual representation from the tree (perhaps using StringTemplate), or performing some other custom actions driven by the final tree representation.
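The three stages above can be sketched as a small hand-written pipeline. This is a rough Python analogy under stated assumptions, not ANTLR output: the function names (lex, parse, simplify, evaluate) and the nested-tuple tree shape are invented for illustration, and the expr/term/factor structure mirrors the calculator grammar on this page.

```python
import re

# Skip leading whitespace, then match either a number or an operator.
TOKEN = re.compile(r"\s*(?:(?P<NUMBER>[0-9]+)|(?P<OP>[-+*/]))")


def lex(text):
    """Stage 1a: turn characters into (type, text) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse(tokens):
    """Stage 1b: check phrase structure and build a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():  # factor : NUMBER ;
        kind, text = peek()
        if kind != "NUMBER":
            raise SyntaxError(f"expected NUMBER, found {text!r}")
        advance()
        return ("num", int(text))

    def term():  # term : factor ( ( MULT | DIV ) factor )* ;
        node = factor()
        while peek() in (("OP", "*"), ("OP", "/")):
            _, op = advance()
            node = (op, node, factor())
        return node

    def expr():  # expr : term ( ( PLUS | MINUS ) term )* ;
        node = term()
        while peek() in (("OP", "+"), ("OP", "-")):
            _, op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return tree


def simplify(node):
    """Stage 2 (optional): rewrite the tree; here x*1 and 1*x become x."""
    if node[0] == "num":
        return node
    op, left, right = node[0], simplify(node[1]), simplify(node[2])
    if op == "*" and right == ("num", 1):
        return left
    if op == "*" and left == ("num", 1):
        return right
    return (op, left, right)


def evaluate(node):
    """Stage 3: walk the final tree to produce the output (a number here)."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b
```

A call such as evaluate(simplify(parse(lex("2 + 3 * 4")))) runs all three stages in order; dropping simplify gives the two-stage variant without tree rewriting.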

Simpler language tools may omit the intermediate tree and build the actions or output stage directly into the parser. The calculator shown below uses only a Lexer and a Parser.

ANTLR, Then and Now

ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with its new LL parsing engine. ANTLR provides a framework for the generation of recognizers, compilers, and translators from grammatical descriptions. ANTLR grammatical descriptions can optionally include action code written in what is termed the target language (i.e. the implementation language of the source code artifacts generated by ANTLR).

When it was released, PCCTS supported C as its only target language, but through consulting with NeXT Computer, PCCTS gained C++ support after 1994. PCCTS's immediate successor was ANTLR 2 and it supported Java, C# and Python as target languages in addition to C++.

Target languages

ANTLR 3 already supports Java, C#, Objective C, C, Python and Ruby as target languages. Support for additional target languages including C++, Perl6 and Oberon (yes, Oberon) is either expected or already in progress. This is all due in part to the fact that it is much easier to add support for a target language (or customize the code generated by an existing target) in ANTLR 3.

Why should I use ANTLR 3?

Because it can save you time and resources by automating significant portions of the effort involved in building language processing tools. It is well established that generative tools such as compiler compilers have a major, positive impact on developer productivity. In addition, many of ANTLR v3's new features, including an improved analysis engine, significantly enhanced parsing strength via LL parsing with arbitrary lookahead, vastly improved tree construction rewrite rules and the availability of the simply fantastic AntlrWorks IDE, offer productivity benefits over other comparable generative language processing toolkits.

How do I use ANTLR 3?

1. Get ANTLR 3

Download and install ANTLR 3 from the ANTLR website.

2. Run ANTLR 3 on a simple grammar

2.1 Create a simple grammar

To be written. Volunteers?

Java

Code Block


grammar SimpleCalc;

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

C#

Note: use language=CSharp2 with ANTLR 3.1; ANTLR 3.0.1 uses the older CSharp target.

Code Block


grammar SimpleCalc;

options {
    language=CSharp2;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void Main(string[] args) {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            Console.Error.WriteLine(e.StackTrace);
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

Objective-C

To be written. Volunteers?

Code Block
grammar SimpleCalc;
options { language=ObjC; }
OR : '||' ;

C

Code Block


grammar SimpleCalc;

options
{
    language=C;
}

tokens
{
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members
{

 #include "SimpleCalcLexer.h"

 int main(int argc, char * argv[])
 {

    pANTLR3_INPUT_STREAM           input;
    pSimpleCalcLexer               lex;
    pANTLR3_COMMON_TOKEN_STREAM    tokens;
    pSimpleCalcParser              parser;

    input  = antlr3AsciiFileStreamNew          ((pANTLR3_UINT8)argv[1]);
    lex    = SimpleCalcLexerNew                (input);
    tokens = antlr3CommonTokenStreamSourceNew  (ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
    parser = SimpleCalcParserNew               (tokens);

    parser  ->expr(parser);

    // Must manually clean up
    //
    parser ->free(parser);
    tokens ->free(tokens);
    lex    ->free(lex);
    input  ->close(input);

    return 0;
 }

}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term   ( ( PLUS | MINUS )  term   )*
        ;

term	: factor ( ( MULT | DIV   )  factor )*
        ;

factor	: NUMBER
        ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	    : (DIGIT)+
            ;

WHITESPACE  : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+
              {
                 $channel = HIDDEN;
              }
            ;

fragment
DIGIT	    : '0'..'9'
            ;

Python

Code Block

grammar SimpleCalc;

options {
	language = Python;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@header {
import sys
import traceback

from SimpleCalcLexer import SimpleCalcLexer
}

@main {
def main(argv, otherArg=None):
  char_stream = ANTLRFileStream(sys.argv[1])
  lexer = SimpleCalcLexer(char_stream)
  tokens = CommonTokenStream(lexer)
  parser = SimpleCalcParser(tokens)

  try:
    parser.expr()
  except RecognitionException:
    traceback.print_stack()
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

...

Code Block
java org.antlr.Tool SimpleCalc.g

ANTLR will generate source files for the lexer and parser (e.g. SimpleCalcLexer.java and SimpleCalcParser.java). Copy these into the appropriate places for your development environment and compile them.

2.3 Revisit the simple grammar and learn basic ANTLR 3 syntax

...

Note: Before you start

You will learn best by following along, experimenting, and looking at the generated source code. To do so, you'll need:

A simple text editor,
An installed copy of ANTLR 3.0.1, or
An installed copy of AntlrWorks (free, highly recommended, and contains its own copy of ANTLR)

...

NUMBER defines a token (named "NUMBER") that contains any character between 0 and 9, inclusive, repeated one or more times. `..` creates a character range, while `+` means "one or more times". (This suffix should look familiar if you know regular expressions.)
PLUS defines a token with a single character: `+`.
add defines a parser rule that says "expect a NUMBER token, a PLUS token, and a NUMBER token in that order." Any other tokens, or tokens in a different order, will trigger an error message.
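For readers who think in regular expressions, the three rules just described can be approximated in Python as follows. This is only an analogy, assuming the rules are NUMBER (one or more digits), PLUS (a single '+') and add (NUMBER PLUS NUMBER) as described above; ANTLR keeps lexing and parsing as separate stages, whereas a single regex folds them together. The matches_add helper is a hypothetical name.

```python
import re

NUMBER = r"[0-9]+"  # '0'..'9' is the range; the trailing + means "one or more"
PLUS = r"\+"        # a single literal '+' character

# add : NUMBER PLUS NUMBER ;
ADD = re.compile(f"({NUMBER})({PLUS})({NUMBER})")


def matches_add(text: str) -> bool:
    """True only if the whole input is NUMBER PLUS NUMBER, in that order."""
    return ADD.fullmatch(text) is not None
```

Inputs such as "12+" or "+12" are rejected because tokens are missing or out of order, and "12 + 34" is rejected as well because no whitespace rule has been defined at this point.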

...

First, we have to define white space:

A space is ' '
A tab is written '\t'
A newline (line feed) is written '\n'
A carriage return is written '\r'
A form feed has a decimal value of 12 and a hexadecimal value of 0C. ANTLR uses Unicode, so we define this as 4 hex digits: '\u000C'

Put these together with an "or", allow one or more to occur together, and you have

...

You hide the token by setting the token's $channel flag to the constant HIDDEN. This requires adding a little code to the lexer, which you do by adding curly brackets:

Code Block: Defining whitespace

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };
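The character set in this rule can be cross-checked with a short Python sketch. It assumes nothing about ANTLR itself: it only confirms that '\u000C' is the form feed (decimal 12) and shows the "one or more of these characters" pattern as a regular expression. The names FORM_FEED, WHITESPACE and is_whitespace_run are made up for this example.

```python
import re

FORM_FEED = "\u000C"  # four hex digits, as in the grammar's '\u000C'

# ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+
WHITESPACE = re.compile("[\t \r\n\u000C]+")


def is_whitespace_run(text: str) -> bool:
    """True if the whole input is one or more of the five whitespace characters."""
    return WHITESPACE.fullmatch(text) is not None
```

Because of the trailing +, an empty string does not count as whitespace here, which matches the "one or more" wording of the rule.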

...

Code Block: Main entry point for Java

@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

...

Code Block: Typical options block

grammar SimpleCalc;

options {
    language=CSharp2;
}

Your five minutes are up!

...

How to write lexer rules
How to write basic parser rules
How to direct tokens away from the parser (to ignore them)
How to insert executable code into a parser

Some points to consider:

You can insert custom actions anywhere.
Most of your custom code winds up in the last stage of the parsing process. Here it was in the Parser; if you used an AST, it would be in the tree parser.

What next?

This covers the majority of the things you need to know to develop a grammar. You may want to work through another of the tutorials:

See Test-Driven Development with ANTLR for an example of building a grammar from the ground up.
Try the JSON Interpreter or the Simple tree-based interpreter to learn about Abstract Syntax Trees.

...

You could also:

Order The Definitive ANTLR Reference from the Pragmatic Programmers
Read the ANTLR 3 Documentation
Chew on the Quick Starter on Parser Grammars
Browse the list of questions frequently asked about ANTLR 3
Try the AntlrWorks IDE for ANTLR 3. AntlrWorks can be downloaded from the AntlrWorks page on the ANTLR website
See Presentations on ANTLR

Special constructs (reference)

Construct	Description	Example
`(...)*`	Kleene closure - matches zero or more occurrences	`LETTER DIGIT*` - match a `LETTER` followed by zero or more occurrences of `DIGIT`
`(...)+`	Positive Kleene closure - matches one or more occurrences	`('0'..'9')+` - match one or more occurrences of a numerical digit; `LETTER (LETTER|DIGIT)+` - match a `LETTER` followed by one or more occurrences of either `LETTER` or `DIGIT`
`fragment`	`fragment` in front of a lexer rule tells ANTLR that the rule is only used as part of another lexer rule (i.e. it only builds a fragment of a recognized token)	`fragment DIGIT : '0'..'9' ;` `NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;`
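The constructs in this table map closely onto regular-expression operators, which the following Python sketch tries to show. It is an approximation for intuition only: LETTER is assumed to mean an ASCII letter (the table does not define it), and the pattern names are invented here.

```python
import re

LETTER = "[A-Za-z]"  # assumption: LETTER is not defined in the table
DIGIT = "[0-9]"      # fragment DIGIT : '0'..'9' ;  (a building block only)

# LETTER DIGIT*  : a letter followed by zero or more digits
LETTER_DIGIT_STAR = re.compile(f"{LETTER}{DIGIT}*")

# LETTER (LETTER|DIGIT)+  : a letter followed by one or more letters or digits
LETTER_THEN_MORE = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})+")

# NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;
NUMBER = re.compile(f"(?:{DIGIT})+(?:\\.(?:{DIGIT})+)?")
```

DIGIT is kept as a plain string that other patterns embed, much as a fragment rule never produces a token of its own and only contributes to rules such as NUMBER.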
