ANTLR v3.1

August 12, 2008

Terence Parr
ANTLR project lead and supreme dictator for life
University of San Francisco
Credits

ANTLR v3.1 introduces a number of important new features, improvements, and bug fixes. v3.1 also introduces a new tool to the ANTLR family: gUnit. As of v3.1, there are six code generation targets:

Java; current maintainer Terence Parr
C#; current maintainer Johannes Luber
Python; current maintainer Benjamin Niemann
C; current maintainer Jim Idle
ActionScript; current maintainer George Scott
JavaScript; current maintainer Joey Hurst

Two new examples are available: composite-java, which demonstrates grammar imports, and polydiff, which demonstrates tree rewriting with tree grammars.

ANTLR 3.1 should read your 3.0.1 grammars, but you must regenerate your recognizers from the grammars to be compatible with the new runtime. See the section below on incompatibilities for more information.

New features

gUnit Grammar unit testing tool (Jen-Yuan Leon Su). Here are a few of the possible things you can test:

```
gunit C; // use C.g grammar
decl:
    "int x" FAIL     // expect failure, because of missing ';' in the input string
    "int x;" -> "x"  // expect standard output "x" from rule
    "int x;" -> (DECL int x) // test AST construction
funcHeader:
    "void bar(int x)" returns ["int"]  // expect return string "int" from funcHeader
```

Heterogeneous tree construction. Token references can be modified with a node option; the option name can be omitted because node is the default option. ID<VarNode> is the same as ID<node=VarNode> and creates an AST node of type VarNode rather than CommonTree, the default without the token option. The modifiers are available during automatic AST construction as well as in rewrites:
```
a : ID INT -> ^(INT<V> ID<W>) ;
```

AST construction and rewriting for tree parsers. In 3.0.1, only parser grammars could construct ASTs. 3.1 allows you to build new ASTs from existing ASTs. This amounts to a tree rewriting ability. For efficiency, option rewrite=true does an in-line replacement for rewrite rules so you can avoid making a copy of an entire tree just to tweak a few nodes. From the polynomial differentiator example, here are a few simple rewrites:
```
poly:
    |   ^(MULT INT ID)      -> INT      // 2x -> 2
    |   INT                 -> INT["0"] // 34 -> 0
    |   ID                  -> INT["1"] // x -> 1
   ...
    ;
```
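
To make the three rewrites above concrete, here is a minimal plain-Java sketch of the same transformations on a hand-rolled tree. The Node and PolyDiffSketch classes are hypothetical stand-ins written for illustration; they are not ANTLR runtime classes, and the real example expresses this logic in a tree grammar.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an AST node; not an ANTLR runtime class.
final class Node {
    final String type;          // "MULT", "INT", or "ID"
    final String text;          // token text, e.g. "2" or "x"
    final List<Node> children;

    Node(String type, String text, Node... children) {
        this.type = type;
        this.text = text;
        this.children = Arrays.asList(children);
    }

    // LISP-style rendering of the tree, root first.
    String toStringTree() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("(" + text);
        for (Node c : children) sb.append(' ').append(c.toStringTree());
        return sb.append(')').toString();
    }
}

final class PolyDiffSketch {
    // Mirrors the three rewrites shown above; any other tree is returned unchanged.
    static Node diff(Node t) {
        if (t.type.equals("MULT") && t.children.size() == 2
                && t.children.get(0).type.equals("INT")
                && t.children.get(1).type.equals("ID")) {
            return t.children.get(0);                          // ^(MULT INT ID) -> INT
        }
        if (t.type.equals("INT")) return new Node("INT", "0"); // INT -> INT["0"]
        if (t.type.equals("ID"))  return new Node("INT", "1"); // ID  -> INT["1"]
        return t;
    }

    public static void main(String[] args) {
        Node twoX = new Node("MULT", "*", new Node("INT", "2"), new Node("ID", "x"));
        System.out.println(twoX.toStringTree() + " -> " + diff(twoX).toStringTree());
    }
}
```

The sketch returns a new or existing node for each match, which is closer to the copying behavior; the rewrite=true option described above instead replaces nodes in place.
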
Grammar composition. ANTLR v3.1 introduces a grammar composition mechanism to simultaneously allow the logical organization of large grammars, provide more opportunities for grammar reuse, and allow the programmer to control the size of the generated classes. From the composite-java example, here is the root grammar that imports all of the delegates:
```
grammar Java;
options {backtrack=true; memoize=true;}

import JavaDecl, JavaAnnotations, JavaExpr, JavaStat, JavaLexerRules;

compilationUnit
    :   annotations? packageDeclaration? importDeclaration* typeDeclaration*
    ;
```

Improvements

Errors and warnings

Added UnwantedTokenException and MissingTokenException to make match() problems more precise in case you want to catch them differently. Updated getErrorMessage() to be more precise. It now says:
```
line 2:9 missing EQ at '0'
```
instead of
```
line 2:9 mismatched input '0' expecting EQ
```
Input "x=9 9;" gives
```
line 3:8 extraneous input '9' expecting ';'
```
When the parser is very confused, as with "x=9 for;", you still get the old mismatched message:
```
line 3:8 extraneous input 'for' expecting ';'
line 3:11 mismatched input ';' expecting '('
```
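
The three kinds of message correspond to three recovery outcomes: an unwanted token, a missing token, or neither. A minimal sketch of how such a decision could be made is shown below, assuming tokens are plain strings and the follow set is supplied by the caller. The class and method names are hypothetical and this is not the ANTLR runtime code; the message text is only modeled on the examples above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of single-token error recovery in match().
final class MismatchSketch {
    /**
     * @param input    token texts
     * @param i        index of the current (mismatched) token
     * @param expected the token text the parser wanted to match
     * @param follow   tokens that may legally appear right after expected
     */
    static String describe(List<String> input, int i, String expected, Set<String> follow) {
        String current = input.get(i);
        // Deletion: the token after the current one is the one we wanted,
        // so the current token is unwanted.
        if (i + 1 < input.size() && input.get(i + 1).equals(expected)) {
            return "extraneous input '" + current + "' expecting '" + expected + "'";
        }
        // Insertion: the current token could follow the expected one,
        // so the expected token is simply missing.
        if (follow.contains(current)) {
            return "missing '" + expected + "' at '" + current + "'";
        }
        // Neither single-token fix applies.
        return "mismatched input '" + current + "' expecting '" + expected + "'";
    }

    public static void main(String[] args) {
        Set<String> afterEq = new HashSet<String>(Arrays.asList("0", "9"));
        System.out.println(describe(Arrays.asList("x", "0", ";"), 1, "=", afterEq));
        Set<String> afterSemi = new HashSet<String>();
        System.out.println(describe(Arrays.asList("x", "=", "9", "9", ";"), 3, ";", afterSemi));
    }
}
```
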

Improved insufficiently covered alt warnings from:

warning(203): T.g:2:3: The following alternatives are insufficiently covered with predicates: 1

to:

warning(203): T.g:2:3: Input B is insufficiently covered with predicates
at locations: alt 1: line 3:15, alt 2: line 2:9

Insufficiently covered (with semantic predicates) alt warnings are now emitted before nondeterminisms so it's clear the nondeterminism is a result of insufficient preds.
Unreachable alt warnings are now errors.
AST construction now inserts error nodes into trees upon syntax error.

Runtime libraries

Added syntaxError recognizer state var so you can easily tell if a recognizer failed. Added getNumberOfSyntaxErrors() to recognizers.
Interpreter throws FailedPredicateException now when it sees a predicate; before it was silently failing. I'll make it work one of these days.

Added reset() to CommonTreeNodeStream, token stream too. This allows you to reuse the same recognizer object to avoid object construction costs. Just reset the stream and call the start symbol. Should behave just like any newly created recognizer object.

Added errorNode to TreeAdaptor and various debug events/listeners. Had to add new class runtime.tree.CommonErrorNode to hold all the goodies: input stream, start/stop objects. See Tree construction.

Improved parse trees; added toInputString() method to show original input source using real nodes and hidden tokens.

ANTLR Tool speed

Working with Kay Roepke, we got about 15% speed improvement in overall ANTLR exec time. Memory footprint seems to be about 50% smaller. Some speed improvements for grammar processing are more dramatic but not always. One user reports that parser generation takes 5 minutes with v3.0.1 but 55 seconds with v3.1.

Element properties

Made refs to rule/token properties use conditional operator ?: to avoid null ptr exceptions. $label.st, for example, is now label!=null?label.st:null not label.st. This is useful not only for optional rule/token refs, but also during error recovery. If ID is not matched, $ID.text won't cause a null ptr.
Added "int" property for token and lexer rule refs. Super convenient. E.g.,
```
  a : b=INT {int x = $b.int;} ;
```
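
The guarded access can be pictured in plain Java as follows. This is a sketch of the idea rather than verbatim generated code: the Tok class is a hypothetical stand-in for a token, and using 0 as the fallback value for a missing token in the "int" case is an assumption.

```java
// Hypothetical stand-in for a token; not an ANTLR runtime class.
final class Tok {
    private final String text;
    Tok(String text) { this.text = text; }
    String getText() { return text; }
}

final class PropertyGuardSketch {
    // Unguarded style: throws NullPointerException if the token was never matched.
    static String textUnguarded(Tok b) {
        return b.getText();
    }

    // Guarded style, in the spirit of label!=null?label.st:null.
    static String textGuarded(Tok b) {
        return b != null ? b.getText() : null;
    }

    // What an "int" property might expand to (fallback value 0 is an assumption).
    static int intGuarded(Tok b) {
        return b != null ? Integer.parseInt(b.getText()) : 0;
    }

    public static void main(String[] args) {
        System.out.println(intGuarded(new Tok("42"))); // token matched
        System.out.println(textGuarded(null));         // token missing, no exception
    }
}
```
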
Added token options to terminals: ID<node=V; foo="Big bob"> etc... node option is common enough that you can omit it so you can do ID<V> for hetero tree types.

Miscellaneous

With -dfa on a combined grammar T.g, ANTLR builds T.dec-*.dot and TLexer.dec-*.dot.
Added {{...}} forced action that is executed even during backtracking. Useful for managing a symbol table even when backtracking; those actions cannot be turned off because the results of symbol table lookups often direct the parse.
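
The difference between an ordinary action and a forced action can be sketched in plain Java. This is an illustration under assumptions, not generated ANTLR code: the class, the backtracking counter, and the declRule method are hypothetical names, and the only behavior modeled is that ordinary actions are skipped while the parser is speculating whereas forced actions always run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of action gating; names do not come from generated ANTLR code.
final class ForcedActionSketch {
    int backtracking = 0;                        // > 0 while speculating
    final List<String> log = new ArrayList<String>();

    // Models a rule body containing one {...} action and one {{...}} action.
    void declRule(String name) {
        if (backtracking == 0) {                 // {...}: skipped while backtracking
            log.add("emit " + name);
        }
        log.add("define " + name);               // {{...}}: always executed
    }

    public static void main(String[] args) {
        ForcedActionSketch p = new ForcedActionSketch();
        p.backtracking = 1;                      // speculative pass
        p.declRule("x");
        p.backtracking = 0;                      // real pass
        p.declRule("x");
        System.out.println(p.log);
    }
}
```

Note that the forced action runs on both passes, so a symbol table updated this way has to tolerate seeing the same definition more than once.
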

From bug tracking system

ANTLR-43 - set proper exit code when error condition has been encountered
ANTLR-44 - I do not think that I generate dangling state errors anymore
ANTLR-111 - improve return value, parameter action parsing; can't do generics for example
ANTLR-143 - find a way to deal with optional rule/token references in actions
ANTLR-165 - make lexer grammar TLexer generate TLexer not TLexerLexer.java
ANTLR-169 - delete tmp lexer file T__.g when finished
ANTLR-193 - add error node to AST upon syntax error
ANTLR-210 - actions execute even after syntax error recovery
ANTLR-213 - Do single token insertion and deletion only on those tokens not referenced in actions or rewrite rules
ANTLR-218 - antlr got slower?
ANTLR-220 - need to implement dupTree and dupNode debugging event listener events
ANTLR-225 - setting token types in invoked lexer rules is wacky
ANTLR-226 - make generated classes from imported grammars use full path in name
ANTLR-233 - make all lexer vars like $type global
ANTLR-234 - Generated delegate parsers should have prefix of delegator
ANTLR-235 - 12-31.14 release is 10x faster than 12-31.17
ANTLR-239 - allow token level options not just single word for tree node type
ANTLR-267 - ParseTreeBuilder does not respect decision contexts

Bug fixes

Single token error recovery was not properly taking into consideration EOF.
ANTLR-39 - ANTLR not catching non-LL(*) decision, spinning forever
ANTLR-105 - Extra backslash generated for literals in lexer
ANTLR-119 - superClass for lexer doesn't work
ANTLR-123 - getting epsilon in k=4 decision ambiguity
ANTLR-124 - ErrorManager.setErrorListener() leaks
ANTLR-125 - NullPointerException when processing an empty or incomplete grammar
ANTLR-130 - ANTLR doesn't finish
ANTLR-140 - Not possible to declare arrays as rule arguments
ANTLR-157 - Duplicate Hashtable using alias generated for grammar with backtracking and template output
ANTLR-177 - inconsistent equals/hashcode for Label
ANTLR-178 - bad DFA generated without error in rule statement's decision (22)
ANTLR-179 - missing attribute in combined grammar
ANTLR-180 - dependency on v2
ANTLR-181 - AST rewrite doesn't ref label of string literal properly
ANTLR-182 - extra commas in tree parser attribute reference
ANTLR-185 - Ref to scoped var on left and right of assignment causes problem.
ANTLR-188 - check float type auto init values. 0.0 is a double not float.
ANTLR-190 - $p for undefined rule ref gives null ptr
ANTLR-194 - scoped attribute in rule parameter list parsed improperly
ANTLR-195 - scoped attribute translation issue regarding type
ANTLR-197 - off by one ack error with remote debug event proxy
ANTLR-199 - underscore in rule name not allowed as attribute reference in rule reference argument list
ANTLR-202 - no warning on ambiguous reference to self-recursive rule reference
ANTLR-206 - ANTLR fails to detect left recursion
ANTLR-207 - not inheriting token vocab properly
ANTLR-208 - include -o dir in path for .tokens files
ANTLR-209 - lexer consuming two characters instead of one upon error
ANTLR-211 - translation problem with parameter actions in syntactic predicates
ANTLR-212 - Copy ctor not copying start and stop for common token
ANTLR-217 - scope list needs comma
ANTLR-219 - AST construction from rule block sets not working
ANTLR-221 - Exception when using AST operators ! or ^ without output=AST
ANTLR-222 - Semantic predicates are hoisted over actions
ANTLR-223 - error node has type issue with user-defined types
ANTLR-224 - give better error when ! AST operator used with rewrite
ANTLR-227 - can ref args in rewrite rules
ANTLR-228 - missing wildcard templates
ANTLR-229 - interpreter bug
ANTLR-230 - single quote mixed up in actions?
ANTLR-237 - Build dependency thing doesn't know about imports
ANTLR-240 - can't escape < in template
ANTLR-241 - extra comma in parameter / arg action
ANTLR-246 - RandomPhrase (and probably interpreter) no longer works
ANTLR-249 - semantic predicate hoisting stops hoisting all predicates

Analysis engine changes

I abort entire DFA construction now when I see recursion in > 1 alt. Decision is non-LL(*) even if some pieces are LL(*). Safer to bail out and try with fixed k. If user set fixed k then it continues because analysis will eventually terminate for sure. If a pred is encountered and k=* and it's non-LL(*), it aborts and retries at k=1 but does NOT emit an error.

Decided that recursion overflow while computing a lookahead DFA is serious enough that I should bail out of entire DFA computation. Previously analysis tried to keep going and made the rules about how analysis worked more complicated. Better to simply abort when decision can't be computed with current max stack (-Xm option). User can adjust or add predicate etc... This is now an error not a warning.

Optimized the analysis engine for LL(1). It doesn't attempt LL(*) unless LL(1) fails. If a decision is not LL(1) but uses autobacktracking and no other kind of predicate, it also avoids LL(*). This is only important for really big 4000 line grammars etc...
Dangling states ("decision cannot distinguish between alternatives for at least one input sequence") are now an error not a warning.

Changed -Xnoinlinedfa to -Xmaxinlinedfastates m where m is the maximum number of states a DFA can have before ANTLR avoids inlining it. Instead, you get a table-based DFA. This effectively avoids some acyclic DFA that still have many states with multiple incident edges. The combinatorial explosion smacks of infinite loop.
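
To illustrate the distinction between an inlined and a table-based DFA, the sketch below encodes one small lookahead DFA both ways. It is a hand-written illustration, not ANTLR output: the class name, the three-symbol alphabet (a, b, c), and the decision itself (predict alt 1 on "ab", alt 2 on "ac") are all assumptions chosen for the example.

```java
// Hypothetical sketch contrasting two encodings of the same lookahead DFA.
final class DfaEncodingSketch {
    // Inlined form: the DFA is compiled into nested conditionals.
    // Code size grows with the number of states and edges.
    static int predictInline(String la) {
        if (la.length() > 0 && la.charAt(0) == 'a') {
            if (la.length() > 1 && la.charAt(1) == 'b') return 1;
            if (la.length() > 1 && la.charAt(1) == 'c') return 2;
        }
        return -1; // no viable alternative
    }

    // Table form: one row per state, one column per input symbol (a, b, c).
    // Entries >= 0 are next states; -1 means error.
    private static final int[][] NEXT = {
        {  1, -1, -1 },  // s0: 'a' -> s1
        { -1,  2,  3 },  // s1: 'b' -> s2, 'c' -> s3
        { -1, -1, -1 },  // s2: accept state
        { -1, -1, -1 },  // s3: accept state
    };
    // Predicted alternative per state; 0 means the state is not an accept state.
    private static final int[] ACCEPT = { 0, 0, 1, 2 };

    static int predictTable(String la) {
        int s = 0;
        for (int i = 0; i < la.length(); i++) {
            if (ACCEPT[s] != 0) break;           // decision already made
            int c = la.charAt(i) - 'a';
            if (c < 0 || c > 2) return -1;       // symbol outside the alphabet
            s = NEXT[s][c];
            if (s < 0) return -1;                // no transition
        }
        return ACCEPT[s] != 0 ? ACCEPT[s] : -1;
    }

    public static void main(String[] args) {
        System.out.println(predictInline("ab") + " " + predictTable("ab"));
        System.out.println(predictInline("ac") + " " + predictTable("ac"));
    }
}
```

The table form keeps the driver loop fixed in size no matter how many states the DFA has, which is the motivation for capping the number of states that get inlined.
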

Miscellaneous

ANTLR no longer tries to recover inline in tree parsers using single node deletion or insertion; it throws an exception instead. Trees should be well formed because they are not created by users.

Incompatibility issues

For the most part, v3.1 should be a drop in replacement for v3.0.1, but you will definitely have to regenerate output code from your grammars to be consistent with the new runtime libraries.

The debug event listener interface has changed; the debug protocol was updated for debugging composite grammars. The enter/exit rule events now need the grammar so that AW (ANTLRWorks) knows when to flip its display.

Added getSourceName to IntStream and TokenSource interfaces and also the BaseRecognizer. We have to know where chars come from for error messages.

Added get/setInputStream to Token interface and affected classes.

In 3.0.1, $channel was a global variable, unlike $type, which did not affect an invoking lexer rule. Now $channel is local too. Only $type and $channel are ever set with regularity. Setting them should not affect an invoking lexer rule, so the following should work:
```
  X : ID WS? '=' ID ;  // result is X on normal channel
  WS : ' '+ {$channel = HIDDEN; } ;

  STRING : '"' (ESC|.)* '"' ;  // result is STRING not ESC

  FLOAT : INT '.' INT? ; // should be FLOAT
  INT : Digit+ ;
  fragment
  Digit : '0'..'9' ;
```

For those who override BaseRecognizer.match(): I had turned off single token insertion and deletion because I could not figure out how to make it work with trees and actions. I figured that out and turned it back on. match() now returns the Object matched (parser, tree parser) so we can set labels on token refs properly after single token insertion/deletion error recovery. This allows actions and tree construction to proceed normally even though we recover in the middle of an alternative. Added methods for conjuring up missing symbols: getMissingSymbol(). The recover methods etc... also return an Object now.

public Object dupTree(Object tree) moved to BaseTreeAdaptor from BaseTree.
