What is the difference between ANTLR v2 and v3?

[See also Migrating from ANTLR 2 to ANTLR 3#NewinANTLR3]

There are number of things have changed for the better in v3 including:

  • a brand-new very powerful extension to LL(k) called LL(*)
  • an auto backtracking mode
  • partial parsing result memoization to increase the speed of backtracking
  • a really nice AST rewrite rule mechanism
  • integration of the StringTemplate template engine for generating structured text
  • improved error reporting and recovery
  • a truly retargetable code generator that makes it easy to build backends ("targets"); currently we have the following Code Generation Targets: Java, C#, C, Objective-C, Python with others in development
  • BSD license

Perhaps most importantly, Jean Bovet built the amazing ANTLRWorks grammar development environment.

For ANTLR v3 I decided to make the most common tasks easy by default rather. This means that some of the basic objects are heavier weight than some speed demons would like, but they are free to pare it down leaving most programmers the luxury of having it "just work." For example, to read in some input, tweak it, and write it back out preserving whitespace, is easy in v3.

LL(*) parsing

ANTLR v3 does not require you to specify a lookahead depth (k option in v2). ANTLR's new LL(*) algorithm allows parser lookahead to roam arbitrarily far ahead looking for a token or sequence that disambiguates a decision. LL(k) can only look a fixed k symbols ahead. It is the difference between using a cyclic DFA and an acyclic DFA to make lookahead decisions.

method : type ID '(' arg* ')' ';'
       | type ID '(' arg* ')' '{' body '}'
       ;

This is not LL(k) for any finite k. From the left edge, lookahead unbounded to see the ';' vs '{'. We need arbitrary lookahead because of the arg*. If you have actions after ID, you can't easily refactor the rule. ANTLR v3 handles is no problem as long as rule arg is not recursive. ANTLR builds a tight little DFA to scan ahead looking for ';' vs '{'.

Auto-backtracking

Sometimes even LL(*) will not be powerful enough to handle a particular decision or multiple decisions within a grammar. Rather than have to add a bunch of syntactic predicates in those decisions as you would in v2, you can turn on the auto backtracking feature (set option backtrack=true). This feature essentially tells ANTLR that if it fails to analyze a decision, simply figure it out at run time by backtracking across the alternatives within that decision.

Take the method rule from the previous section. If rule arg is now recursive (as it is in C), ANTLR will fail to build a DFA because DFA have no stacks. A full parser is required to match the argument list. Just turn on the auto backtracking feature and ANTLR will gladly accept just about any non-left-recursive grammar.

grammar C;
options {
  backtrack=true;
}
method : type ID '(' arg* ')' ';'
       | type ID '(' arg* ')' '{' body '}'
       ;

AST rewrite rules

v3 introduces a really nice way to build ASTs. The following example demonstrates the AST rewrite mechanism. Anything after -> is considered a tree grammar. ANTLR figures out how to map the parser grammar to tree grammar.

packageDefinition
	:	'package' classname ';' -> ^('package' classname)
	;

This builds a tree with 'package' at the root and classname as its first child. Here's another example:

formalArgs
	:	typename declarator (',' typename declarator )*
		-> ^(ARG typename declarator)+
	;

This builds tree sequences like:

^(ARG int v1) ^(ARG int v2)

StringTemplate integration

ANTLR v3 tightly integrates the StringTemplate template engine, which is useful for generating any kind of structure text such as source code or XML. In the following grammar, the rule matches a simple assignment statement and implicitly returns a template. The template is an instance of the assign template pulled from a group of templates (set by you in your main()). The arguments to the template are a series of template attribute assignments. This is the interface between the template mechanism and the parser attribute mechanism.

assign_statement : ID '=' INT ';' -> assign(x={$ID.text},y={$INT.text}) ;

Here's a rule that creates a template called 'import' for each import definition found in the input stream:

grammar Java;
options {
  output=template;
}
...
importDefinition
    :   'import' identifierStar SEMI
        -> import(name={$identifierStar.st},
                begin={$identifierStar.start},
                end={$identifierStar.stop})
    ;

The attributes are set via assignments in the argument list. The arguments are actions with arbitrary expressions in the target language. The .st label property is the result template from a rule reference. There is a nice shorthand in actions too:

  • %foo(a={},b={},...) ctor
  • %({name-expr})(a={},...) indirect template ctor reference
  • %{string-expr} anonymous template from string expr
  • %{expr}.y = z; template attribute y of StringTemplate-typed expr to z
  • %x.y = z; set template attribute y of x (always set never get attr)
    to z [languages like python without ';' must still use the
    ';' which the code generator is free to remove during code gen]
    Same as '(x).setAttribute("y", z);'

See Template construction for more information.

Integrated lexers and parsers

Lexers are much easier due to the LL(*)algorithm as well. Previously these two lexer rules would cause trouble because ANTLR couldn't distinguish between them with finite lookahead to see the decimal point:

INT : ('0'..'9')+ ;
FLOAT : INT '.' INT ;

The syntax is almost identical for features in common, but you should note that labels are always '=' not ':'. So do id=ID not id:ID.

You can do combined lexer/parser grammars again (ala PCCTS) both lexer and parser rules are defined in the same file. See the examples. Really nice. You can reference strings and characters in the grammar and ANTLR will generate the lexer for you.

Semantic predicates hoisting

Introduced in ANTLR v1, semantic predicates hoisting is back in v3 (it was absent in v2).

Attributes

The attribute structure has been enhanced. Rules may have multiple return values, for example. Further, there are dynamically scoped attributes whereby a rule may define a value usable by any rule it invokes directly or indirectly w/o having to pass a parameter all the way down.

finally block

Code in the finally {...} action executes in the finally block (java target) after all other stuff like rule memoization. Example:

foo : FOO ;
    finally { System.out.println("on the way out of foo"); }

Source code

The ANTLR source code is much prettier and there are about 800 unit tests so far.

You'll also note that the run-time classes are conveniently encapsulated in the org.antlr.runtime package.