Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 34 Next »

Five minute introduction to ANTLR 3

What is ANTLR 3?

ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with it's new LL(star) parsing engine. ANTLR (ANother Tool for Language Recognition) provides a framework for the generation of recognizers, compilers, and translators from grammatical descriptions. ANTLR grammatical descriptions can optionally include action code written in what is termed the target language (i.e. the implementation language of the source code artifacts generated by ANTLR).

When it was released, PCCTS supported C as its only target language, but through consulting with NeXT Computer, PCCTS gained C++ support after 1994. It's immediate successor ANTLR 2 supported Java, C# and Python in addition to C++. ANTLR 3 supports Java, C#, Objective C, C, Python and Ruby as target languages and you can add your own. Support for additional target languages including C++, Perl6 and Oberon (yes, Oberon) is either expected or already in progress.

What does ANTLR 3 do?

Put simply, ANTLR 3 generates - the source code for - language processing tools from a grammatical description. To this end, it is commonly categorised as a compiler generator or compiler compiler in the tradition of tools such as Lex/Flex and Yacc/Bison). ANTLR 3 can generate the source code for various tools that can be used to analyze and transform input in the language defined by the input grammar. The basic types of language processing tools that ANTLR can generates are Lexers (a.k.a scanners, tokenizers), Parsers and TreeParsers (a.k.a tree walkers, c.f. visitors).

Why should I use ANTLR 3?

Because it can save you time and resources by automating significant portions of the effort involved in building language processing tools. It is well established that generative tools such as compiler compilers have a major, positive impact on developer productivity. In addition, many of ANTLR v3's new features including an improved analysis engine, it's significantly enhanced parsing strength via LL(star) parsing with arbitrary lookahead, it's vastly improved tree construction rewrite rules and the availability of the simply fantastic AntlrWorks IDE offers productivity benefits over other comparable generative language processing toolkits.

How do I use ANTLR 3?

1. Get ANTLR 3

Download and install ANTLR 3 from the ANTLR 3 page of the ANTLR website

2. Run ANTLR 3 on a simple grammar

2.1 Create a simple grammar

Java

grammar SimpleCalc;

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

C#

grammar SimpleCalc;

options {
    language=CSharp;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {
    public static void Main(string[] args) {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalcParser parser = new SimpleCalcParser(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            Console.Error.WriteLine(e.StackTrace);
        }
    }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

Objective-C


To be written. Volunteers?

grammar SimpleCalc;

options
{
    language=ObjC;
}

OR : '||' ;

C

grammar SimpleCalc;

options
{
    language=C;
}

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

@members {     int main(int argc, char * argv[])
               {
                    pANTLR3_INPUT_STREAM input = antlr3AsciiFileStreamNew(argv[1]);
                    pSimpleCalcLexer lex = SimpleCalcLexerNew(input);
                    pANTLR3_COMMON_TOKEN_STREAM tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, lex->pLexer->tokSource);
                    pSimpleCalcParser parser = SimpleCalcParserNew(tokens);

                    parser->expr(parser);

                    // Must manually clean up
                    parser->free(parser);
                    tokens->free(tokens);
                    lex->free(lex);
                    input->close(input);

                    return 0;
               }
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;


/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;

2.2 Run ANTLR 3 on the simple grammar

java org.antlr.Tool SimpleCalc.g

2.3 Revisit the simple grammar and learn basic ANTLR 3 syntax

Let's break this example down and develop the simple calculator from the beginning. This can help you learn ANTLR by example.

Before you start

You can learn best by following along, experimenting, and looking at the generated source code. If so, you'll need:

  • A simple text editor,
  • An installed copy of ANTLR 3.0, or
  • An installed copy of ANTLR Works (free, highly recommended, and contains its own copy of ANTLR)
Define a trivial grammar

Any language processing system has at least two components:

  1. A lexer that takes a stream of characters and divides the stream into tokens according to pre-set rules, and
  2. A parser that reads the tokens and interprets them according to its rules.

Let's start by defining the rules for a simple arithmetic expression: 100+23:

Trivial calculator
grammar SimpleCalc;

add	: NUMBER PLUS NUMBER;

NUMBER	: ('0'..'9')+ ;

PLUS 	: '+';

This example contains two lexer rules - NUMBER and PLUS - and the parser rule add. Lexer rules always start with an uppercase letter, while parser rules start with lowercase letters.

  • NUMBER defines a token (named "NUMBER") that contains any character between 0 and 9, inclusive, repeated one or more times. .. creates a character range, while {+} means "one or more times". (This suffix should look familiar if you know regular expressions.)
  • PLUS defines a token with a single character: {+}.
  • add defines a parser rule that says "expect a NUMBER token, a PLUS token, and a NUMBER token in that order." Any other tokens, or tokens in a different order, will trigger an error message.
Flesh out the calculator

Let's first allow more complex expressions such as 1 or 1+2 or 1+2+3+4 This starts with a single number, then can add a plus sign and a number (possibly more than once):

repeated addition
add: NUMBER (PLUS NUMBER)*

The * symbol means "zero or more times".

If you want to implement both addition and subtraction, you can make a small adjustment:

Addition and subtraction
add: NUMBER ((PLUS | MINUS) NUMBER)*

MINUS : '-';

As you may have guessed, | means "or" as in "PLUS or MINUS".

If you want to parse complete arithmetic expressions such as 1+2*3, there's a standard recursive way to do it:

Recursive expression definition
expr	: term ( ( PLUS | MINUS )  term )* ;
term	: factor ( ( MULT | DIV ) factor )* ;
factor	: NUMBER ;

MULT : '*';
DIV  : '/';

To evaluate an expression, always start with expr.

Handle white space

Our grammar is intolerant of white space: it will give warnings about spaces, tabs, returns, etc. Let's tell the Lexer that it's safe to discard any white space it finds.

First, we have to define white space:

  • A space is ' '
  • A tab is written '\t'
  • A newline (line feed) is written '\n'
  • A carriage return is written '\r'
  • A Form Feed has a decimal value of 12 and a hexidecimal value of $0C. ANTLR uses Unicode, so we define this as 4 hex digits: {{
    u000C}}

Put these together with an "or", allow one or more to occur together, and you have

Defining whitespace
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+;

However, if we write the expression 3 + 4*5, the lexer will generate NUMBER WHITESPACE PLUS WHITESPACE NUMBER MULT NUMBER and this will cause the parser to complain about the unknown WHITESPACE tokens. We need a way to hide them from the parser.

ANTLR maintains two channels of communication between the lexer and the parser - a default channel and a hidden channel. The parser listens to only one channel at a time (usually the default one), so you can "hide" a token by assigning it to the hidden channel.

There can be more than two channels and the parser can listen to them individually or get the text from all the channels merged together. This is useful when you are writing a text-processing tool that needs to pass through the whitespace and comments to the output while letting the parser ignore those elements.

You hide the token by setting the token's $channel flag to the constant HIDDEN. This requires adding a little code to the lexer, which you do by adding curly brackets:

Defining whitespace
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; };

If you're following along at the keyboard, try generating the Lexer and Parser code now and search the Lexer for channel = HIDDEN

ANTLR generates Java code by default. You'll learn how to change that in just a minute.

Tidy up the code

You can use a few techniques to make your grammar more readable:

  1. Add comments including single-line // and multi-line /* ... */
  2. Gather your simple token definitions (single characters, single words, etc.) into a tokens section at the top of the file.
  3. Consider defining sub-parts of tokens with fragment rules. A fragment will never generate a token by itself but can be used as part of the rule defining another token.

Here's a tidied-up copy:

Tidier grammar
grammar SimpleCalc;

tokens {
	PLUS 	= '+' ;
	MINUS	= '-' ;
	MULT	= '*' ;
	DIV	= '/' ;
}

/*------------------------------------------------------------------
 * PARSER RULES
 *------------------------------------------------------------------*/

expr	: term ( ( PLUS | MINUS )  term )* ;

term	: factor ( ( MULT | DIV ) factor )* ;

factor	: NUMBER ;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

NUMBER	: (DIGIT)+ ;

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ 	{ $channel = HIDDEN; } ;

fragment DIGIT	: '0'..'9' ;
Turn this into a stand-alone program

If you try to run this parser, nothing happens.

You need to add some code to make this work as a stand-alone tool, and you may want to add some variables at the top of the parser. All of this happens in a {{@header { ... }}} block at the top of the file:

Main entry point for Java
@members {
    public static void main(String[] args) throws Exception {
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0]));
       	CommonTokenStream tokens = new CommonTokenStream(lex);

        SimpleCalc parser = new SimpleCalc(tokens);

        try {
            parser.expr();
        } catch (RecognitionException e)  {
            e.printStackTrace();
        }
    }
}

This shows the usual pattern: take an input stream, feed it to the generated Lexer, get a token stream from the lexer, feed that to the parser, and then call one of the methods on the parser. (Each parser rule adds a corresponding method to the parser.)

Don't like Java? You can use ANTLR to generate code for Java, C, C++, C#, Objective-C, Python, Ruby, and other languages (see Code Generation Targets) as well as add your own generator. Use an options block to switch languages:

Typical options block
grammar SimpleCalc;

options {
    language=CSharp;
}

Your five minutes are up!

You've just seen:

  • How to write lexer rules
  • How to write basic parser rules
  • How to direct tokens away from the parser (to ignore them)
  • How to insert executable code into a parser.

This covers the majority of the things you need to know to develop a grammar. You may want to work through another of the tutorials:

What next?

Read the Antlr 3 Documentation

Chew on the Quick+Starter+on+Parser+Grammars

Browse the list of questions frequently asked about ANTLR 3

Try the AntlrWorks IDE for ANTLR 3. AntlrWorks can be downloaded from the AntlrWorks page on the ANTLR website

Special constructs (reference)

Construct

Description

Example

(...)*

Kleene closure - matches zero or more occurrences

LETTER DIGIT* - match a LETTER followed by zero or more occurrences of DIGIT

(...)+

Positive Kleene closure - matches one or more occurrences

('0'..'9')+ - match one or more occurrences of a numerical digit
LETTER (LETTER|DIGIT)+ - match a LETTER followed one or more occurrences of either LETTER or DIGIT

fragment

fragment in front of a lexer rule instructs ANTLR that the rule is only used as part of another lexer rule (i.e. it only builds a fragment of a recognized token)

fragment {{DIGIT : '0'..'9' ;

NUMBER : (DIGIT)+ ('.' (DIGIT)+ )? ;}}

  • No labels