ANTLR 3.1 Release Notes
ANTLR v3.1
August 12, 2008
Terence Parr
ANTLR project lead and supreme dictator for life
University of San Francisco
Credits
ANTLR v3.1 introduces a number of important new features, improvements, and bug fixes. v3.1 also introduces a new tool to the ANTLR family: gUnit. There are six Code Generation Targets up to v3.1:
- Java; current maintainer Terence Parr
- C#; current maintainer Johannes Luber
- Python; current maintainer Benjamin Niemann
- C; current maintainer Jim Idle
- ActionScript; current maintainer George Scott
- JavaScript; current maintainer Joey Hurst
Two new examples are available: composite-java, which demonstrates grammar imports, and polydiff, which demonstrates tree rewriting with tree grammars.
ANTLR 3.1 should read your 3.0.1 grammars, but you must regenerate your recognizers from the grammars to be compatible with the new runtime. See the section below on incompatibilities for more information.
New features
- gUnit Grammar unit testing tool (Jen-Yuan Leon Su). Here are a few of the possible things you can test:
    gunit C; // use C.g grammar

    decl:
    "int x" FAIL // expect failure, because of missing ';' in the input string
    "int x;" -> "x" // expect standard output "x" from rule
    "int x;" -> (DECL int x) // test AST construction

    funcHeader:
    "void bar(int x)" returns ["int"] // expect return string "int" from funcHeader
- Heterogeneous tree construction. Token references can be modified with a node option, which can be omitted because it is the default option. ID<VarNode> is the same as ID<node=VarNode> and creates an AST node of type VarNode rather than CommonTree, the default without the token option. The modifiers are available during automatic AST construction as well as in rewrites:
    a : ID INT -> ^(INT<V> ID<W>) ;
- AST construction and rewriting for tree parsers. In 3.0.1, only parser grammars could construct ASTs. 3.1 allows you to build new ASTs from existing ASTs. This amounts to a tree rewriting ability. For efficiency, option rewrite=true does an in-line replacement for rewrite rules so you can avoid making a copy of an entire tree just to tweak a few nodes. From the polynomial differentiator example, here are a few simple rewrites:
    poly:
        | ^(MULT INT ID) -> INT // 2x -> 2
        | INT -> INT["0"] // 34 -> 0
        | ID -> INT["1"] // x -> 1
        ...
        ;
- Grammar composition. ANTLR v3.1 introduces a grammar composition mechanism to simultaneously allow the logical organization of large grammars, provide more opportunities for grammar reuse, and allow the programmer to control the size of the generated classes. From the composite-java example, here is the root grammar that imports all of the delegates:
    grammar Java;
    options {backtrack=true; memoize=true;}
    import JavaDecl, JavaAnnotations, JavaExpr, JavaStat, JavaLexerRules;
    compilationUnit : annotations? packageDeclaration? importDeclaration* typeDeclaration* ;
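To make the polydiff rewrites above concrete, here is a minimal plain-Java sketch of what those three alternatives do to a tree. It does not use the ANTLR runtime: PolyNode and PolyDiffSketch are hypothetical names invented for this illustration, and the alternatives elided by "..." in the grammar are simply left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical miniature tree node; NOT ANTLR's CommonTree. */
final class PolyNode {
    final String type; // "MULT", "INT", or "ID"
    final String text;
    final List<PolyNode> children = new ArrayList<>();

    PolyNode(String type, String text, PolyNode... kids) {
        this.type = type;
        this.text = text;
        children.addAll(Arrays.asList(kids));
    }

    /** LISP-style rendering, e.g. (* 2 x). */
    @Override
    public String toString() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("(").append(text);
        for (PolyNode c : children) sb.append(' ').append(c);
        return sb.append(')').toString();
    }
}

final class PolyDiffSketch {
    /** Applies the three rewrites quoted from the polydiff example. */
    static PolyNode diff(PolyNode t) {
        // ^(MULT INT ID) -> INT        // 2x -> 2
        if (t.type.equals("MULT") && t.children.size() == 2
                && t.children.get(0).type.equals("INT")
                && t.children.get(1).type.equals("ID")) {
            return t.children.get(0);
        }
        // INT -> INT["0"]              // 34 -> 0
        if (t.type.equals("INT")) return new PolyNode("INT", "0");
        // ID -> INT["1"]               // x -> 1
        if (t.type.equals("ID")) return new PolyNode("INT", "1");
        // Alternatives elided in the release notes: leave the tree as-is.
        return t;
    }

    public static void main(String[] args) {
        PolyNode twoX = new PolyNode("MULT", "*",
                new PolyNode("INT", "2"), new PolyNode("ID", "x"));
        System.out.println(twoX + " -> " + diff(twoX)); // prints "(* 2 x) -> 2"
    }
}
```

In a real tree grammar the rewrite rules express this declaratively; the sketch only shows the input and output shapes.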
Improvements
Errors and warnings
- Added UnwantedTokenException and MissingTokenException to make match() problems more precise in case you want to catch differently. Updated getErrorMessage() to be more precise. Says:
    line 2:9 missing EQ at '0'
now instead of
    line 2:9 mismatched input '0' expecting EQ
Input "x=9 9;" gives
    line 3:8 extraneous input '9' expecting ';'
When very confused, "x=9 for;", you still get old mismatched message:
    line 3:8 extraneous input 'for' expecting ';'
    line 3:11 mismatched input ';' expecting '('
- Improved insufficiently covered alt warnings from:
    warning(203): T.g:2:3: The following alternatives are insufficiently covered with predicates: 1
to:
    warning(203): T.g:2:3: Input B is insufficiently covered with predicates at locations: alt 1: line 3:15, alt 2: line 2:9
- Insufficiently covered (with semantic predicates) alt warnings are now emitted before nondeterminisms so it's clear the nondeterminism is a result of insufficient preds.
- Unreachable alt warnings are now errors.
- AST construction now inserts error nodes into trees upon syntax error.
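The difference between the "missing", "extraneous input", and "mismatched input" messages above can be sketched as a small decision procedure. This is a simplified illustration in plain Java under stated assumptions, not the runtime's actual code: MismatchSketch, Tok, and describe() are hypothetical names, tokens are identified by display name, and the follow set is passed in by hand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MismatchSketch {
    /** Hypothetical token: display name of its type plus matched text. */
    static final class Tok {
        final String type; // e.g. "EQ", "INT", or "';'"
        final String text;
        Tok(String type, String text) { this.type = type; this.text = text; }
    }

    /**
     * Classifies a failed match of `expected` at position i.
     * `follow` holds the token types assumed legal right after `expected`.
     */
    static String describe(List<Tok> input, int i, String expected, Set<String> follow) {
        Tok current = input.get(i);
        String nextType = i + 1 < input.size() ? input.get(i + 1).type : "EOF";
        if (nextType.equals(expected)) {
            // Deleting the current token lets the match succeed: unwanted token.
            return "extraneous input '" + current.text + "' expecting " + expected;
        }
        if (follow.contains(current.type)) {
            // The current token is what should come next: expected one is missing.
            return "missing " + expected + " at '" + current.text + "'";
        }
        // Neither single-token deletion nor insertion explains the input.
        return "mismatched input '" + current.text + "' expecting " + expected;
    }

    public static void main(String[] args) {
        List<Tok> input = Arrays.asList(
                new Tok("ID", "x"), new Tok("INT", "0"), new Tok("';'", ";"));
        Set<String> followOfEq = new HashSet<>(Arrays.asList("INT", "ID"));
        System.out.println(describe(input, 1, "EQ", followOfEq)); // prints "missing EQ at '0'"
    }
}
```

Checking for an unwanted token before a missing one is a design choice of this sketch; the line and column prefix of the real messages is omitted.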
Runtime libraries
- Added syntaxError recognizer state var so you can easily tell if a recognizer failed. Added getNumberOfSyntaxErrors() to recognizers.
- Interpreter throws FailedPredicateException now when it sees a predicate; before it was silently failing. I'll make it work one of these days.
- Added reset() to CommonTreeNodeStream, token stream too. This allows you to reuse the same recognizer object to avoid object construction costs. Just reset the stream and call the start symbol. Should behave just like any newly created recognizer object.
- Added errorNode to TreeAdaptor and various debug events/listeners. Had to add new class runtime.tree.CommonErrorNode to hold all the goodies: input stream, start/stop objects. See Tree construction.
- Improved parse trees; added toInputString() method to show original input source using real nodes and hidden tokens.
ANTLR Tool speed
- Working with Kay Roepke, we got about 15% speed improvement in overall ANTLR exec time. Memory footprint seems to be about 50% smaller. Some speed improvements for grammar processing are more dramatic but not always. One user reports that v3.0.1 parser generation takes 5 minutes. v3.1 takes 55 seconds.
Element properties
- Made refs to rule/token properties use conditional operator ?: to avoid null ptr exceptions. $label.st, for example, is now label!=null?label.st:null not label.st. This is useful not only for optional rule/token refs, but also during error recovery. If ID is not matched, $ID.text won't cause a null ptr.
- Added "int" property for token and lexer rule refs. Super convenient. E.g.,
a : b=INT {int x = $b.int;} ;
- Added token options to terminals: ID<node=V; foo="Big bob"> etc... The node option is common enough that you can omit it so you can do ID<V> for hetero tree types.
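As an illustration of the null-guarded property references described in this section, the following plain-Java sketch shows the shape of the guarded expressions. PropertyRefSketch and its nested Tok class are hypothetical stand-ins, not generated code or runtime classes, and the default of 0 for an unmatched "int" reference is an assumption of this sketch.

```java
final class PropertyRefSketch {
    /** Hypothetical stand-in for a matched token; null means "not matched". */
    static final class Tok {
        private final String text;
        Tok(String text) { this.text = text; }
        String getText() { return text; }
    }

    /** Shape of a guarded $ID.text reference: null-safe via ?: */
    static String textOf(Tok label) {
        return label != null ? label.getText() : null;
    }

    /** Shape of a guarded $b.int reference; the fallback value 0 is assumed. */
    static int intOf(Tok label) {
        return label != null ? Integer.parseInt(label.getText()) : 0;
    }

    public static void main(String[] args) {
        Tok b = new Tok("42");
        int x = intOf(b);                 // like {int x = $b.int;}
        System.out.println(x + 1);        // prints "43"
        System.out.println(textOf(null)); // prints "null" instead of throwing
    }
}
```

The point of the guard is that an action can still run after error recovery left a label unset, without throwing a NullPointerException.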
Miscellaneous
- If you use -dfa with a combined grammar T.g, ANTLR now builds T.dec-*.dot and TLexer.dec-*.dot
- Added {{...}} forced action that is executed even during backtracking. Useful for managing a symbol table even when backtracking; those actions cannot be turned off because the results of symbol table lookups often direct the parse.
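The difference between an ordinary action and a {{...}} forced action can be pictured as a guard around the action body. The sketch below is plain Java under stated assumptions: ForcedActionSketch, its backtracking counter, and matchDecl() are hypothetical names for illustration and are not taken from generated ANTLR code.

```java
import java.util.ArrayList;
import java.util.List;

final class ForcedActionSketch {
    /** Assumed convention: greater than 0 while speculatively matching. */
    int backtracking = 0;
    final List<String> log = new ArrayList<>();

    /** Mimics a rule containing one ordinary action and one forced action. */
    void matchDecl(String name) {
        // Ordinary action {...}: skipped while backtracking.
        if (backtracking == 0) {
            log.add("action:" + name);
        }
        // Forced action {{...}}: always runs, e.g. a symbol table update
        // whose result later lookups depend on.
        log.add("forced:" + name);
    }

    public static void main(String[] args) {
        ForcedActionSketch p = new ForcedActionSketch();
        p.backtracking = 1;
        p.matchDecl("x");          // only the forced action fires
        p.backtracking = 0;
        p.matchDecl("y");          // both actions fire
        System.out.println(p.log); // prints "[forced:x, action:y, forced:y]"
    }
}
```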
From bug tracking system
- ANTLR-43 - set proper exit code when error condition has been encountered
- ANTLR-44 - I do not think that I generate dangling state errors anymore
- ANTLR-111 - improve return value, parameter action parsing; can't do generics for example
- ANTLR-143 - find a way to deal with optional rule/token references in actions
- ANTLR-165 - make lexer grammar TLexer generate TLexer not TLexerLexer.java
- ANTLR-169 - delete tmp lexer file T__.g when finished
- ANTLR-193 - add error node to AST upon syntax error
- ANTLR-210 - actions execute even after syntax error recovery
- ANTLR-213 - Do single token insertion and deletion only on those tokens not referenced in actions or rewrite rules
- ANTLR-218 - antlr got slower?
- ANTLR-220 - need to implement dupTree and dupNode debugging event listener events
- ANTLR-225 - setting token types in invoked lexer rules is wacky
- ANTLR-226 - make generated classes from imported grammars use full path in name
- ANTLR-233 - make all lexer vars like $type global
- ANTLR-234 - Generated delegate parsers should have prefix of delegator
- ANTLR-235 - 12-31.14 release is 10x faster than 12-31.17
- ANTLR-239 - allow token level options not just single word for tree node type
- ANTLR-267 - ParseTreeBuilder does not respect decision contexts
Bug fixes
- Single token error recovery was not properly taking into consideration EOF.
- ANTLR-39 - ANTLR not catching non-LL(*) decision, spinning forever
- ANTLR-105 - Extra backslash generated for literals in lexer
- ANTLR-119 - superClass for lexer doesn't work
- ANTLR-123 - getting epsilon in k=4 decision ambiguity
- ANTLR-124 - ErrorManager.setErrorListener() leaks
- ANTLR-125 - NullPointerException when processing an empty or incomplete grammar
- ANTLR-130 - ANTLR doesn't finish
- ANTLR-140 - Not possible to declare arrays as rule arguments
- ANTLR-157 - Duplicate Hashtable using alias generated for grammar with backtracking and template output
- ANTLR-177 - inconsistent equals/hashcode for Label
- ANTLR-178 - bad DFA generated without error in rule statement's decision (22)
- ANTLR-179 - missing attribute in combined grammar
- ANTLR-180 - dependency on v2
- ANTLR-181 - AST rewrite doesn't ref label of string literal properly
- ANTLR-182 - extra commas in tree parser attribute reference
- ANTLR-185 - Ref to scoped var on left and right of assignment causes problem.
- ANTLR-188 - check float type auto init values. 0.0 is a double not float.
- ANTLR-190 - $p for undefined rule ref gives null ptr
- ANTLR-194 - scoped attribute in rule parameter list parsed improperly
- ANTLR-195 - scoped attribute translation issue regarding type
- ANTLR-197 - off by one ack error with remote debug event proxy
- ANTLR-199 - underscore in rule name not allowed as attribute reference in rule reference argument list
- ANTLR-202 - no warning on ambiguous reference to self-recursive rule reference
- ANTLR-206 - ANTLR fails to detect left recursion
- ANTLR-207 - not inheriting token vocab properly
- ANTLR-208 - include -o dir in path for .tokens files
- ANTLR-209 - lexer consuming two characters instead of one upon error
- ANTLR-211 - translation problem with parameter actions in syntactic predicates
- ANTLR-212 - Copy ctor not copying start and stop for common token
- ANTLR-217 - scope list needs comma
- ANTLR-219 - AST construction from rule block sets not working
- ANTLR-221 - Exception when using AST operators ! or ^ without output=AST
- ANTLR-222 - Semantic predicates are hoisted over actions
- ANTLR-223 - error node has type issue with user-defined types
- ANTLR-224 - give better error when ! AST operator used with rewrite
- ANTLR-227 - can ref args in rewrite rules
- ANTLR-228 - missing wildcard templates
- ANTLR-229 - interpreter bug
- ANTLR-230 - single quote mixed up in actions?
- ANTLR-237 - Build dependency thing doesn't know about imports
- ANTLR-240 - can't escape < in template
- ANTLR-241 - extra comma in parameter / arg action
- ANTLR-246 - RandomPhrase (and probably interpreter) no longer works
- ANTLR-249 - semantic predicate hoisting stops hoisting all predicates
Analysis engine changes
- I abort entire DFA construction now when I see recursion in > 1 alt. Decision is non-LL(*) even if some pieces are LL(*). Safer to bail out and try with fixed k. If user set fixed k then it continues because analysis will eventually terminate for sure. If a pred is encountered and k=* and it's non-LL(*), it aborts and retries at k=1 but does NOT emit an error.
- Decided that recursion overflow while computing a lookahead DFA is serious enough that I should bail out of entire DFA computation. Previously analysis tried to keep going and made the rules about how analysis worked more complicated. Better to simply abort when decision can't be computed with current max stack (-Xm option). User can adjust or add predicate etc... This is now an error not a warning.
- Optimized the analysis engine for LL(1). Doesn't attempt LL(*) unless LL(1) fails. If a decision is not LL(1) and uses autobacktracking but no other kind of predicate, it also avoids LL(*). This is only important for really big 4000 line grammars etc...
- Dangling states ("decision cannot distinguish between alternatives for at least one input sequence") is now an error not a warning.
- Changed -Xnoinlinedfa to -Xmaxinlinedfastates m where m is the maximum number of states a DFA can have before ANTLR avoids inlining it. Instead, you get a table-based DFA. This effectively avoids some acyclic DFA that still have many states with multiple incident edges. The combinatorial explosion smacks of infinite loop.
Miscellaneous
- ANTLR no longer tries to recover in tree parsers inline using single node deletion or insertion; it throws an exception instead. Trees should be well formed as they are not created by users.
Incompatibility issues
For the most part, v3.1 should be a drop in replacement for v3.0.1, but you will definitely have to regenerate output code from your grammars to be consistent with the new runtime libraries.
- Debug event listener interface has changed; Updated debug protocol for debugging composite grammars. enter/exit rule needs grammar to know when to flip display in AW.
- Added getSourceName to IntStream and TokenSource interfaces and also the BaseRecognizer. Have to know where chars come from for error messages.
- Added get/setInputStream to Token interface and affected classes.
- In 3.0.1, $channel was a global variable, unlike $type, which did not affect an invoking lexer rule. Now $channel is local too. Only $type and $channel are ever set with regularity, and setting them should not affect an invoking lexer rule, so the following should work:
    X : ID WS? '=' ID ; // result is X on normal channel
    WS : ' '+ {$channel = HIDDEN; } ;
    STRING : '"' (ESC|.)* '"' ; // result is STRING not ESC
    FLOAT : INT '.' INT? ; // should be FLOAT
    INT : Digit+ ;
    fragment Digit : '0'..'9' ;
- For those that override BaseRecognizer.match(). I had turned off single token insertion and deletion because I could not figure out how to work with trees and actions. Figured that out and so I turned it back on. match() returns the Object matched now (parser, tree parser) so we can set labels on token refs properly after single token ins/del error recovery. Allows actions and tree construction to proceed normally even though we recover in the middle of an alternative. Added methods for conjuring up missing symbols: getMissingSymbol(). Recover methods etc... also return an Object now.
- public Object dupTree(Object tree) moved to BaseTreeAdaptor from BaseTree.