ANTLR - ANother Tool for Language Recognition - is a tool that is used in the construction of formal language software tools (or just language tools) such as translators, compilers, recognizers and, static/dynamic program analyzers. Developers use ANTLR to reduce the time and effort needed to build and maintain language processing tools. In common terminology, ANTLR is a compiler generator or compiler compiler (in the tradition of tools such as Lex/Flex and Yacc/Bison) and it is used to generate the source code for language recognizers, analyzers and translators from language specifications. ANTLR takes as its input a grammar - a precise description of a language augmented with semantic actions - and generates source code files and other auxiliary files. The target language of the generated source code (e.g. Java, C/C++, C#, Python, Ruby) is specified in the grammar.
Software developers and language tool implementors can use ANTLR to implement Domain-Specific Languages, to generate parts of language compilers and translators, or even to help them build tools that parse complex XML.
As stated above, ANTLR 3 generates the source code for various tools that can be used to recognize, analyze and transform input data relative to a language that is defined in a specified grammar file. The basic types of language processing tools that ANTLR can generates are Lexers (a.k.a scanners, tokenizers), Parsers and, TreeParsers (a.k.a tree walkers, c.f. visitors).
ANTLR reads a language description file called a grammar and generates a number of source code files and other auxiliary files. Most uses of ANTLR generates at least one (and quite often both) of these tools:
ANTLR's Abstract Syntax Tree (AST) processing is especially powerful. If you also specify a tree grammar, ANTLR will generate a Tree Parser for you that can contain custom actions or StringTemplate output statements. The next version of ANTLR (3.1) will support rewrite rules that can be used to express tree transformations.
Most language tools will:
Simpler language tools may omit the intermediate tree and build the actions or output stage directly into the parser. The calculator shown below uses only a Lexer and a Parser.
ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with its new LL parsing engine. ANTLR provides a framework for the generation of recognizers, compilers, and translators from grammatical descriptions. ANTLR grammatical descriptions can optionally include action code written in what is termed the target language (i.e. the implementation language of the source code artifacts generated by ANTLR).
When it was released, PCCTS supported C as its only target language, but through consulting with NeXT Computer, PCCTS gained C++ support after 1994. PCCTS's immediate successor was ANTLR 2 and it supported Java, C# and Python as target languages in addition to C++.
ANTLR 3 already supports Java, C#, Objective C, C, Python and Ruby as target languages. Support for additional target languages including C++, Perl6 and Oberon (yes, Oberon) is either expected or already in progress. This is all due in part to the fact that it is much easier to add support for a target language (or customize the code generated by an existing target) in ANTLR 3.
Because it can save you time and resources by automating significant portions of the effort involved in building language processing tools. It is well established that generative tools such as compiler compilers have a major, positive impact on developer productivity. In addition, many of ANTLR v3's new features including an improved analysis engine, its significantly enhanced parsing strength via LL parsing with arbitrary lookahead, its vastly improved tree construction rewrite rules and the availability of the simply fantastic AntlrWorks IDE offers productivity benefits over other comparable generative language processing toolkits.
Download and install ANTLR 3 from the ANTLR website.
Java |
|
|
---|---|---|
C# |
Note: language=CSharp2 with ANTLR 3.1; ANTLR 3.0.1 uses the older CSharp target
|
|
Objective-C |
|
|
C |
|
|
Python |
|
java org.antlr.Tool SimpleCalc.g |
ANTLR will generate source files for the lexer and parser (e.g. SimpleCalcLexer.java and SimpleCalcParser.java). Copy these into the appropriate places for your development environment and compile them.
Let's break this example down and develop the simple calculator from the beginning. This can help you learn ANTLR by example.
You can learn best by following along, experimenting, and looking at the generated source code. If so, you'll need:
|
Any language processing system has at least two components:
Let's start by defining the rules for a simple arithmetic expression: 100+23
:
grammar SimpleCalc; add : NUMBER PLUS NUMBER; NUMBER : ('0'..'9')+ ; PLUS : '+'; |
This example contains two lexer rules - NUMBER
and PLUS
- and the parser rule add
. Lexer rules always start with an uppercase letter, while parser rules start with lowercase letters.
..
creates a character range, while + means "one or more times". (This suffix should look familiar if you know regular expressions.)Let's first allow more complex expressions such as 1
or 1+2
or 1+2+3+4
This starts with a single number, then can add a plus sign and a number (possibly more than once):
add: NUMBER (PLUS NUMBER)* |
The *
symbol means "zero or more times".
If you want to implement both addition and subtraction, you can make a small adjustment:
add: NUMBER ((PLUS | MINUS) NUMBER)* MINUS : '-'; |
As you may have guessed, |
means "or" as in "PLUS or MINUS".
If you want to parse complete arithmetic expressions such as 1+2*3
, there's a standard recursive way to do it:
expr : term ( ( PLUS | MINUS ) term )* ; term : factor ( ( MULT | DIV ) factor )* ; factor : NUMBER ; MULT : '*'; DIV : '/'; |
To evaluate an expression, always start with |
Our grammar is intolerant of white space: it will give warnings about spaces, tabs, returns, etc. Let's tell the Lexer that it's safe to discard any white space it finds.
First, we have to define white space:
' '
'\t'
'\n'
'\r'
'\u000C'
Put these together with an "or", allow one or more to occur together, and you have
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+; |
However, if we write the expression 3 + 4*5
, the lexer will generate NUMBER WHITESPACE PLUS WHITESPACE NUMBER MULT NUMBER
and this will cause the parser to complain about the unknown WHITESPACE
tokens. We need a way to hide them from the parser.
ANTLR maintains two channels of communication between the lexer and the parser - a default channel and a hidden channel. The parser listens to only one channel at a time (usually the default one), so you can "hide" a token by assigning it to the hidden channel.
There can be more than two channels and the parser can listen to them individually or get the text from all the channels merged together. This is useful when you are writing a text-processing tool that needs to pass through the whitespace and comments to the output while letting the parser ignore those elements. |
You hide the token by setting the token's $channel
flag to the constant HIDDEN
. This requires adding a little code to the lexer, which you do by adding curly brackets:
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; }; |
If you're following along at the keyboard, try generating the Lexer and Parser code now and search the Lexer for ANTLR generates Java code by default. You'll learn how to change that in just a minute. |
You can use a few techniques to make your grammar more readable:
//
and multi-line /* ... */
tokens
section at the top of the file.fragment
rules. A fragment
will never generate a token by itself but can be used as part of the rule defining another token.Here's a tidied-up copy:
grammar SimpleCalc; tokens { PLUS = '+' ; MINUS = '-' ; MULT = '*' ; DIV = '/' ; } /*------------------------------------------------------------------ * PARSER RULES *------------------------------------------------------------------*/ expr : term ( ( PLUS | MINUS ) term )* ; term : factor ( ( MULT | DIV ) factor )* ; factor : NUMBER ; /*------------------------------------------------------------------ * LEXER RULES *------------------------------------------------------------------*/ NUMBER : (DIGIT)+ ; WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ; fragment DIGIT : '0'..'9' ; |
If you try to run this parser, nothing happens.
You need to add some code to make this work as a stand-alone tool, and you may want to add some variables at the top of the parser. All of this happens in a {{ @header { ... } }} block at the top of the file:
@members { public static void main(String[] args) throws Exception { SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0])); CommonTokenStream tokens = new CommonTokenStream(lex); SimpleCalcParser parser = new SimpleCalcParser(tokens); try { parser.expr(); } catch (RecognitionException e) { e.printStackTrace(); } } } |
This shows the usual pattern: take an input stream, feed it to the generated Lexer, get a token stream from the lexer, feed that to the parser, and then call one of the methods on the parser. (Each parser rule adds a corresponding method to the parser.)
Don't like Java? You can use ANTLR to generate code for Java, C, C++, C#, Objective-C, Python, Ruby, and other languages (see Code Generation Targets) as well as add your own generator. Use an options
block to switch languages:
grammar SimpleCalc; options { language=CSharp2; } |
You've just seen:
Some points to consider:
This covers the majority of the things you need to know to develop a grammar. You may want to work through another of the tutorials:
You could also:
Construct |
Description |
Example |
---|---|---|
|
Kleene closure - matches zero or more occurrences |
|
|
Positive Kleene closure - matches one or more occurrences |
|
|
|
|