Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

Five minute introduction to ANTLR 3

What is ANTLR 3?

ANTLR – ANother Tool for Language REcognition – is a tool that helps you write language processing tools. It's commonly categorised as - ANother Tool for Language Recognition - is a tool that is used in the construction of formal language software tools (or just language tools) such as translators, compilers, recognizers and, static/dynamic program analyzers. Developers use ANTLR to reduce the time and effort needed to build and maintain language processing tools. In common terminology, ANTLR is a compiler generator or compiler compiler (in the tradition of tools such as Lex/Flex and Yacc/Bison) . ANTLR takes a grammar description (which may define both a language and how to process the language's constructs) and emits multiple files in your chosen target language and it is used to generate the source code for language recognizers, analyzers and translators from language specifications. ANTLR takes as its input a grammar - a precise description of a language augmented with semantic actions - and generates source code files and other auxiliary files. The target language of the generated source code (e.g. Java, C/C++, C#, Python, Ruby...).Developers ) is specified in the grammar.

Software developers and language tool implementors can use ANTLR to implement Domain-Specific Languages, to write generate parts of language compilers and translators, and or even to help them build tools that parse complex XML.

As stated above, ANTLR 3 can generate generates the source code for various tools that can be used to recognize, analyze and transform input in the language defined by the input grammardata relative to a language that is defined in a specified grammar file. The basic types of language processing tools that ANTLR can generates are Lexers (a.k.a scanners, tokenizers), Parsers and, TreeParsers (a.k.a tree walkers, c.f. visitors).

What exactly does ANTLR 3 do?

ANTLR reads a grammar language description file called a grammar and generates a number of source code files and other auxiliary files. Most uses of ANTLR generates at least two files for youone (and quite often both) of these tools:

  • A Lexer: This reads an input character or byte stream (i.e. characters, binary data, etc.), divides it into tokens using patterns you specify, and generates a token stream as output. It can also hide information flag some tokens such as whitespace and comments from the next stageas hidden using a protocol that ANTLR parsers automatically understand and respect.
  • A Parser: This reads the a token stream (normally generated by a lexer), and matches it phrases in your language via the rules (patterns) you specify, and typically performs some semantic action for each rulephrase (or sub-phrase) matched. Each rule match could invoke a custom action, write some text via StringTemplate, or generate an Abstract Syntax Tree for additional processing.

ANTLR's Abstract Syntax Tree (AST) processing is especially powerful. If you also specify a tree grammar, ANTLR will generate a Tree Parser for you that can contain custom actions or StringTemplate output statements. The next version of ANTLR (3.1) will include rewriting rules to alter the tree into new formssupport rewrite rules that can be used to express tree transformations.

Most language tools will:

  1. Use a Lexer and Parser in series to create an Abstract Syntax Tree,Modify (rewrite) the tree (e.check the word-level and phrase-level structure of the input and if no fatal errors are encountered, create an intermediate tree representation such as an Abstract Syntax Tree (AST),
  2. Optionally modify (i.e tranform or rewrite) the intermediate tree representation (e.g. to perform optimizations) using one or more Tree Parsers, and
  3. Use Produce the final output using a Tree Parser at the end to read process the final tree and either perform custom actions or write out the result via StringTemplate.representation. This might be to generate source code or other textual representation from the tree (perhaps using StringTemplate) or, performing some other custom actions driven by the final tree representation.

Simpler language tools may omit the intermediate tree and build the actions or output stage directly into the parser. The calculator shown below uses only a Lexer and a Parser.

...

ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with it's its new LL(*) (star) parsing engine. ANTLR (ANother Tool for Language Recognition) provides a framework for the generation of recognizers, compilers, and translators from grammatical descriptions. ANTLR grammatical descriptions can optionally include action code written in what is termed the target language (i.e. the implementation language of the source code artifacts generated by ANTLR).

...

Because it can save you time and resources by automating significant portions of the effort involved in building language processing tools. It is well established that generative tools such as compiler compilers have a major, positive impact on developer productivity. In addition, many of ANTLR v3's new features including an improved analysis engine, it's its significantly enhanced parsing strength via LL(star) parsing with arbitrary lookahead, it's its vastly improved tree construction rewrite rules and the availability of the simply fantastic AntlrWorks IDE offers productivity benefits over other comparable generative language processing toolkits.

...

Download and install ANTLR 3 from the ANTLR 3 page of the ANTLR website.

2. Run ANTLR 3 on a simple grammar

...

grammar SimpleCalc; options { language=CSharp

Note: language=CSharp2 with ANTLR 3.1; ANTLR 3.0.1 uses the older CSharp target

Java

Code Block
grammar SimpleCalc;

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

C#

Code Block
Code Block

grammar SimpleCalc;

options {
    language=CSharp2;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void Main(string[] args) {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            Console.Error.WriteLine(e.StackTrace);
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDENHidden; } ;

fragment DIGIT	: '0'..'9' ;

Objective-C


To be written. Volunteers?

Code Block
grammar SimpleCalc;

options
{
    language=ObjC;
}

OR : '||' ;

C

Code Block
grammar SimpleCalc;

options
{
    language=C;
}

tokens
{
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members
{

 #include "SimpleCalcLexer.h"

 int main(int argc, char * argv[])
 {

    pANTLR3_INPUT_STREAM        {   input;
    pSimpleCalcLexer             pANTLR3_INPUT_STREAM input = antlr3AsciiFileStreamNew(argv[1]) lex;
    pANTLR3_COMMON_TOKEN_STREAM    tokens;
    pSimpleCalcParser              parser;

    input  = antlr3AsciiFileStreamNew          ((pANTLR3_UINT8)argv[1]);
   pSimpleCalcLexer lex    = SimpleCalcLexerNew                (input);
    tokens = antlr3CommonTokenStreamSourceNew  (ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
    parser = SimpleCalcParserNew               (tokens);

    parser  ->expr(parser);

    // Must manually clean up
    //
    parser ->free(parser);
    tokens ->free(tokens);
    lex    ->free(lex);
    input  ->close(input);

    return 0;
 }

}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term   ( ( PLUS | MINUS )  term   )*
        ;

term	: factor ( ( MULT | DIV   )  factor )*
        ;

factor	: NUMBER
        ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	    :  pANTLR3_COMMON_TOKEN_STREAM tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, lex->pLexer->tokSource);(DIGIT)+
            ;

WHITESPACE  : ( '\t' | ' ' | '\r' | pSimpleCalcParser parser = SimpleCalcParserNew(tokens);'\n'| '\u000C' )+
               {
     parser->expr(parser);              $channel = HIDDEN;
     // Must manually clean up     }
            ;

 parser->free(parser);fragment
DIGIT	    : '0'..'9'
            ;

Python

Code Block
 tokens->free(tokens);
                    lex->free(lex);
                    input->close(input);

                    return 0;
 grammar SimpleCalc;

options {
	language = Python;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@header {
import sys
import traceback

from SimpleCalcLexer import SimpleCalcLexer
}

@main {
def main(argv, otherArg=None):
  char_stream = ANTLRFileStream(sys.argv[1])
  lexer = SimpleCalcLexer(char_stream)
  tokens = CommonTokenStream(lexer)
  parser = SimpleCalcParser(tokens);

  try:
        parser.expr()
  except  }RecognitionException:
	traceback.print_stack()
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

...

Code Block
java org.antlr.Tool SimpleCalc.g

ANTLR will generate source files for the lexer and parser (e.g. SimpleCalcLexer.java and SimpleCalcParser.java). Copy these into the appropriate places for your development environment and compile them.

2.3 Revisit the simple grammar and learn basic ANTLR 3 syntax

...

Note
titleBefore you start

You can learn best by following along, experimenting, and looking at the generated source code. If so, you'll need:

  • A simple text editor,
  • An installed copy of ANTLR 3.01, or
  • An installed copy of ANTLR Works (free, highly recommended, and contains its own copy of ANTLR)

...

  • NUMBER defines a token (named "NUMBER") that contains any character between 0 and 9, inclusive, repeated one or more times. .. creates a character range, while { + } means "one or more times". (This suffix should look familiar if you know regular expressions.)
  • PLUS defines a token with a single character: { +}.
  • add defines a parser rule that says "expect a NUMBER token, a PLUS token, and a NUMBER token in that order." Any other tokens, or tokens in a different order, will trigger an error message.

...

First, we have to define white space:

  • A space is ' '
  • A tab is written '\t'
  • A newline (line feed) is written '\n'
  • A carriage return is written '\r'
  • A Form Feed has a decimal value of 12 and a hexidecimal value of $0C. ANTLR uses Unicode, so we define this as 4 hex digits: {{
    u000C }} '\u000C'

Put these together with an "or", allow one or more to occur together, and you have

...

You hide the token by setting the token's $channel flag to the constant HIDDEN. This requires adding a little code to the lexer, which you do by adding curly brackets:

Code Block
titleDefining whitespace
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };

...

Code Block
titleMain entry point for Java
@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcSimpleCalcParser parser = new SimpleCalcSimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

...

Code Block
titleTypical options block
grammar SimpleCalc;

options {
    language=CSharpCSharp2;
}

Your five minutes are up!

...

Construct

Description

Example

(...)*

Kleene closure - matches zero or more occurrences

LETTER DIGIT* - match a LETTER followed by zero or more occurrences of DIGIT

(...)+

Positive Kleene closure - matches one or more occurrences

('0'..'9')+ - match one or more occurrences of a numerical digit
LETTER (LETTER|DIGIT)+ - match a LETTER followed one or more occurrences of either LETTER or DIGIT

fragment

fragment in front of a lexer rule instructs ANTLR that the rule is only used as part of another lexer rule (i.e. it only builds a fragment of a recognized token)

fragment {{ DIGIT : '0'..'9' ;

NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;}}