
Five minute introduction to ANTLR 3

What is ANTLR 3?

ANTLR 3 (ANother Tool for Language Recognition) is a tool that helps you write language processing tools: put simply, it generates the source code for such tools from a grammatical description. It is commonly categorised as a compiler generator or compiler compiler, in the tradition of tools such as Lex/Flex and Yacc/Bison. ANTLR takes a grammar description (which may define both a language and how to process the language's constructs) and emits multiple files in your chosen target language (e.g. Java, C/C++, C#, Python, Ruby).

Developers use ANTLR to implement Domain-Specific Languages, to write language compilers and translators, and even to parse complex XML.

ANTLR 3 can generate the source code for various tools that analyze and transform input in the language defined by the input grammar. The basic types of language processing tools that ANTLR can generate are Lexers (a.k.a. scanners or tokenizers), Parsers and TreeParsers (a.k.a. tree walkers; cf. visitors).

What exactly does ANTLR 3 do?

ANTLR reads a grammar description file and generates at least two files for you:

  • A Lexer: This reads an input stream (characters, binary data, etc.), divides it into tokens using patterns you specify, and generates a token stream as output. It can also hide information such as whitespace and comments from the next stage.
  • A Parser: This reads the token stream, matches it via the rules (patterns) you specify, and performs some action for each rule. Each rule could invoke a custom action, write some text via StringTemplate, or generate an Abstract Syntax Tree for additional processing.
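
To make the division of labour concrete, here is a minimal hand-written sketch in Python. It is not code generated by ANTLR and does not use the ANTLR runtime; the token names (INT, PLUS, WS), the `Token` type and the `lex` and `parse_sum` functions are all assumptions made up for illustration. The lexer splits characters into tokens and marks whitespace as hidden; the parser matches the single rule `sum : INT (PLUS INT)*` and performs an action (addition) as it goes.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str     # token type, e.g. "INT" or "PLUS"
    text: str     # the characters that were matched
    hidden: bool  # True for tokens the parser should not see


# Lexer patterns, tried in order (hypothetical token names).
TOKEN_PATTERNS = [
    ("INT", r"\d+"),
    ("PLUS", r"\+"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def lex(text: str) -> Iterator[Token]:
    """Lexer: turn a character stream into a token stream."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind = m.lastgroup
        # Whitespace is still emitted, but flagged so the parser can ignore it.
        yield Token(kind, m.group(), hidden=(kind == "WS"))
        pos = m.end()


def parse_sum(tokens) -> int:
    """Parser: match  sum : INT (PLUS INT)*  and add the operands as its action."""
    visible = [t for t in tokens if not t.hidden]
    pos = 0

    def expect(kind: str) -> Token:
        nonlocal pos
        if pos >= len(visible) or visible[pos].kind != kind:
            found = visible[pos].kind if pos < len(visible) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        tok = visible[pos]
        pos += 1
        return tok

    total = int(expect("INT").text)
    while pos < len(visible):
        expect("PLUS")
        total += int(expect("INT").text)
    return total
```

Calling `parse_sum(lex("1 + 22"))` runs both stages in series. In a real ANTLR project you would write only the grammar, and both classes would be generated for you.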

ANTLR's Abstract Syntax Tree (AST) processing is especially powerful. If you also specify a tree grammar, ANTLR will generate a Tree Parser for you that can contain custom actions or StringTemplate output statements. The next version of ANTLR (3.1) will include rewriting rules to alter the tree into new forms.

Most language tools will:

  1. Use a Lexer and Parser in series to create an Abstract Syntax Tree,
  2. Modify (rewrite) the tree (e.g. to perform optimizations), and
  3. Use a Tree Parser at the end to read the final tree and either perform custom actions or write out the result via StringTemplate.
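
The three steps above can be sketched by hand in a few lines of Python. This is only an illustration of the pipeline's shape, not ANTLR output: the tuple-based tree representation, the function names `parse`, `fold_constants` and `emit`, and the stack-machine instructions are all hypothetical, and plain strings stand in for StringTemplate.

```python
import re

# AST nodes are plain tuples: ("INT", 3), ("ID", "x"), ("+", left, right).


def parse(text):
    """Step 1: lexer and parser in series, producing an AST for sums."""
    # Toy lexer: whitespace (and, in this sketch, any unknown character) is skipped.
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|\+", text)

    def atom(tok):
        if tok == "+":
            raise SyntaxError("expected an operand, found '+'")
        return ("INT", int(tok)) if tok.isdigit() else ("ID", tok)

    if not tokens:
        raise SyntaxError("empty input")
    tree = atom(tokens[0])
    rest = tokens[1:]
    if len(rest) % 2:
        raise SyntaxError("dangling token at end of input")
    for op, operand in zip(rest[0::2], rest[1::2]):
        if op != "+":
            raise SyntaxError(f"expected '+', found {op!r}")
        tree = ("+", tree, atom(operand))  # left-associative
    return tree


def fold_constants(node):
    """Step 2: rewrite the tree, replacing INT + INT with a single INT."""
    if node[0] != "+":
        return node
    left, right = fold_constants(node[1]), fold_constants(node[2])
    if left[0] == "INT" and right[0] == "INT":
        return ("INT", left[1] + right[1])
    return ("+", left, right)


def emit(node):
    """Step 3: tree walker that writes code for a hypothetical stack machine."""
    if node[0] == "INT":
        return [f"push {node[1]}"]
    if node[0] == "ID":
        return [f"load {node[1]}"]
    return emit(node[1]) + emit(node[2]) + ["add"]
```

For example, `emit(fold_constants(parse("1 + 2 + x")))` walks the rewritten tree and yields the instructions `push 3`, `load x`, `add`. With ANTLR, the parser grammar would describe step 1 and a tree grammar would describe step 3.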

Simpler language tools may omit the intermediate tree and build the actions or output stage directly into the parser. The calculator shown below uses only a Lexer and a Parser.

ANTLR, Then and Now

ANTLR 3 is the latest version of a language processing toolkit that was originally released as PCCTS in the mid-1990s. As was the case then, this release of the ANTLR toolkit advances the state of the art with its new LL(*) parsing engine. ANTLR (ANother Tool for Language Recognition) provides a framework for generating recognizers, compilers, and translators from grammatical descriptions. These descriptions can optionally include action code written in the target language (i.e. the implementation language of the source code that ANTLR generates).

When it was released, PCCTS supported C as its only target language, but through consulting with NeXT Computer it gained C++ support after 1994. PCCTS's immediate successor, ANTLR 2, supported Java, C# and Python as target languages in addition to C++.

Target languages

ANTLR 3 already supports Java, C#, Objective-C, C, Python and Ruby as target languages. Support for additional target languages, including C++, Perl6 and Oberon (yes, Oberon), is either expected or already in progress. This is due in part to the fact that adding support for a target language (or customizing the code generated by an existing target) is much easier in ANTLR 3.

Why should I use ANTLR 3?

...