...

ANTLR - ANother Tool for Language Recognition - is a tool used to construct formal language software tools (or just language tools) such as translators, compilers, recognizers, and static/dynamic program analyzers. Developers use ANTLR to reduce the time and effort needed to build and maintain language processing tools. In common terminology, ANTLR is a compiler generator or compiler-compiler (in the tradition of tools such as Lex/Flex and Yacc/Bison); it generates the source code for language recognizers, analyzers, and translators from language specifications. ANTLR takes as its input a grammar - a precise description of a language augmented with semantic actions - and generates source code files and other auxiliary files. The target language of the generated source code (e.g. Java, C/C++, C#, Python, Ruby) is specified in the grammar.

...

ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with its new LL(star) parsing engine. ANTLR provides a framework for generating recognizers, compilers, and translators from grammatical descriptions. ANTLR grammatical descriptions can optionally include action code written in what is termed the target language (i.e. the implementation language of the source code artifacts generated by ANTLR).

...

Because it can save you time and resources by automating significant portions of the effort involved in building language processing tools. It is well established that generative tools such as compiler-compilers have a major, positive impact on developer productivity. In addition, many of ANTLR v3's new features, including an improved analysis engine, significantly enhanced parsing strength via LL(star) parsing with arbitrary lookahead, vastly improved tree construction rewrite rules, and the availability of the simply fantastic ANTLRWorks IDE, offer productivity benefits over other comparable generative language processing toolkits.

...

Download and install ANTLR 3 from the ANTLR 3 page of the ANTLR website.

2. Run ANTLR 3 on a simple grammar

...

Java

Code Block
grammar SimpleCalc;

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

C#

Note: language=CSharp2 with ANTLR 3.1; ANTLR 3.0.1 uses the older CSharp target

Code Block
grammar SimpleCalc;

options {
    language=CSharp2;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void Main(string[] args) {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            Console.Error.WriteLine(e.StackTrace);
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = Hidden; } ;

fragment DIGIT	: '0'..'9' ;

Objective-C


To be written. Volunteers?

Code Block
grammar SimpleCalc;

options
{
    language=ObjC;
}

OR : '||' ;

C

Code Block
grammar SimpleCalc;

options
{
    language=C;
}

tokens
{
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members
{

 #include "SimpleCalcLexer.h"

 int main(int argc, char * argv[])
 {

    pANTLR3_INPUT_STREAM           input;
    pSimpleCalcLexer               lex;
    pANTLR3_COMMON_TOKEN_STREAM    tokens;
    pSimpleCalcParser              parser;

    input  = antlr3AsciiFileStreamNew          ((pANTLR3_UINT8)argv[1]);
    lex    = SimpleCalcLexerNew                (input);
    tokens = antlr3CommonTokenStreamSourceNew  (ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
    parser = SimpleCalcParserNew               (tokens);

    parser  ->expr(parser);

    // Must manually clean up
    //
    parser ->free(parser);
    tokens ->free(tokens);
    lex    ->free(lex);
    input  ->close(input);

    return 0;
 }

}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term   ( ( PLUS | MINUS )  term   )*
        ;

term	: factor ( ( MULT | DIV   )  factor )*
        ;

factor	: NUMBER
        ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	    : (DIGIT)+
            ;

WHITESPACE  : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+
              {
                 $channel = HIDDEN;
              }
            ;

fragment
DIGIT	    : '0'..'9'
            ;

Python

Code Block
grammar SimpleCalc;

options {
	language = Python;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@header {
import sys
import traceback

from SimpleCalcLexer import SimpleCalcLexer
}

@main {
def main(argv, otherArg=None):
    char_stream = ANTLRFileStream(sys.argv[1])
    lexer = SimpleCalcLexer(char_stream)
    tokens = CommonTokenStream(lexer)
    parser = SimpleCalcParser(tokens)

    try:
        parser.expr()
    except RecognitionException:
        # Print the traceback of the exception that was just caught.
        traceback.print_exc()
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

...

Code Block
java org.antlr.Tool SimpleCalc.g

ANTLR will generate source files for the lexer and parser (e.g. SimpleCalcLexer.java and SimpleCalcParser.java). Copy these into the appropriate places for your development environment and compile them.
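To get a feel for what the generated lexer and parser do together, the following is a hand-written Python sketch of a recognizer for the same SimpleCalc grammar. It is not ANTLR output and does not use the ANTLR runtime; the names `tokenize`, `Recognizer` and `recognizes` are hypothetical and chosen only for this illustration. Unlike `parser.expr()` in the examples above, the sketch also insists that the whole input is consumed.

```python
import re

# Hypothetical hand-written equivalent of the SimpleCalc grammar. This is
# not ANTLR output, only an illustration of what the recognizer accepts.
TOKEN_RE = re.compile(r"(?P<NUMBER>[0-9]+)|(?P<OP>[-+*/])|(?P<WS>[\t \r\n\f]+)")


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character %r at offset %d" % (text[pos], pos))
        if match.lastgroup != "WS":  # whitespace is skipped, like the hidden channel
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


class Recognizer:
    """One method per parser rule, mirroring the grammar."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)

    def expr(self):
        # expr : term ( ( PLUS | MINUS ) term )* ;
        self.term()
        while self.peek()[1] in ("+", "-"):
            self.pos += 1
            self.term()

    def term(self):
        # term : factor ( ( MULT | DIV ) factor )* ;
        self.factor()
        while self.peek()[1] in ("*", "/"):
            self.pos += 1
            self.factor()

    def factor(self):
        # factor : NUMBER ;
        kind, value = self.peek()
        if kind != "NUMBER":
            raise SyntaxError("expected NUMBER, found %r" % (value,))
        self.pos += 1


def recognizes(text):
    """Return True if the whole input is a valid SimpleCalc expression."""
    try:
        parser = Recognizer(tokenize(text))
        parser.expr()
        return parser.pos == len(parser.tokens)
    except SyntaxError:
        return False
```

For example, `recognizes("1 + 2 * 3")` is true, while `recognizes("1 +")` is false because `term` finds no NUMBER after the operator.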

2.3 Revisit the simple grammar and learn basic ANTLR 3 syntax

...

First, we have to define white space:

  • A space is ' '
  • A tab is written '\t'
  • A newline (line feed) is written '\n'
  • A carriage return is written '\r'
  • A form feed has a decimal value of 12 and a hexadecimal value of 0C. ANTLR uses Unicode, so we write it with 4 hex digits: '\u000C'
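The character values in the list above can be checked with a few lines of Python. This is only a side illustration (the dictionary name `WHITESPACE_CHARS` is made up for this sketch); it prints each character's code point in decimal and as 4 hex digits, the form used in the '\u000C' escape.

```python
# The five whitespace characters from the list above.
WHITESPACE_CHARS = {
    "space": " ",
    "tab": "\t",
    "line feed": "\n",
    "carriage return": "\r",
    "form feed": "\u000C",
}

for name, char in WHITESPACE_CHARS.items():
    # ord() returns the Unicode code point of a one-character string.
    print("%-16s decimal %2d  hex %04X" % (name, ord(char), ord(char)))
```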

Put these together with an "or", allow one or more to occur together, and you have

...

Code Block
Main entry point for Java
@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

...

Construct

Description

Example

(...)*

Kleene closure - matches zero or more occurrences

LETTER DIGIT* - match a LETTER followed by zero or more occurrences of DIGIT

(...)+

Positive Kleene closure - matches one or more occurrences

('0'..'9')+ - match one or more occurrences of a numerical digit
LETTER (LETTER|DIGIT)+ - match a LETTER followed by one or more occurrences of either LETTER or DIGIT

fragment

fragment in front of a lexer rule instructs ANTLR that the rule is only used as part of another lexer rule (i.e. it only builds a fragment of a recognized token)

fragment DIGIT : '0'..'9' ;
NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;
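Readers who know regular expressions may find it helpful to see rough analogues of these constructs in Python's `re` module. This is a sketch for comparison only, not how ANTLR implements lexer rules. The definition of `LETTER` as ASCII letters is an assumption, since the table does not define it, and the plain string constants stand in for fragment rules because they are only building blocks for the compiled patterns.

```python
import re

# Building blocks playing the role of fragment rules: they are never used
# as patterns on their own, only composed into the patterns below.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"  # assumption: the table does not define LETTER

# LETTER DIGIT*  -> a letter followed by zero or more digits
LETTER_DIGITS = re.compile(LETTER + DIGIT + "*")

# ('0'..'9')+  -> one or more digits
DIGITS = re.compile(DIGIT + "+")

# NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;  -> digits with an optional fraction
NUMBER = re.compile("(?:%s)+(?:\\.(?:%s)+)?" % (DIGIT, DIGIT))
```

With these definitions, `NUMBER.fullmatch("3.14")` and `NUMBER.fullmatch("3")` both succeed, while `NUMBER.fullmatch("3.")` returns `None` because the optional group requires at least one digit after the dot.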