How to remove global backtracking from your grammar
More complex grammars, especially those taken from sources that provide only LR grammars and were translated literally into ANTLR notation, produce errors such as the ones below:
"[fatal] rule compilation_unit has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option." "Alternative 1: after matching input such as OP_LT decision cannot predict what comes next due to recursion overflow to rule2 from rule1"
The simple solution of enabling global backtracking (option: backtrack=true) may mask errors and lead to inefficient parsers. This tutorial explains the basic steps to resolve the issue without enabling global backtracking. It is assumed that the reader already has a syntactically correct ANTLR grammar available. ANTLRworks is used as the IDE; for this kind of work it is nearly ideal.
The guidelines covered in this tutorial originate from Jim Idle but are enriched with explanations that stem from my personal experience with ANTLR. This also means that I may not have seen every possible problem, or that I haven't found the optimal solution for a particular problem. Feedback is always welcome!
Preparations
Create a new temporary file and copy the header of the old file (options, @members and the like) into it. If a separate .tokens file is referenced, either save the new file in the same directory or copy the .tokens file. Also change the grammar name. Make sure that no (mutually) left-recursive rules are included - those can't be solved with the techniques detailed below.
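A minimal sketch of such a temporary file; the grammar and vocabulary names are placeholders, not taken from any particular grammar:

grammar CSharpTemp;                 // renamed copy of the original grammar

options {
    tokenVocab = CSharpLexer;       // hypothetical; matches the referenced .tokens file
}

@members {
    // header members copied verbatim from the original grammar
}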
First step
Copy a small chunk of rules from the original grammar. It is advisable to include the start rule right from the beginning. If you happen to have sorted the rules into groups (a facility available in ANTLRworks), use the groups as delimiters. If the groups are too big, you can divide them further.
Second step
Copying only parts of the grammar will make ANTLRworks complain about unknown rules - rules which haven't been copied yet. To remove the warnings, create stub rules. It is recommended to use a keyword as the sole content of each stub, and a different keyword for each rule, as shown below. This suppresses the warnings in ANTLR(works) and prevents new, misleading errors caused by empty rules which aren't empty in the original grammar.
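A sketch of such stub rules (the rule names and keywords are placeholders):

// stubs for rules not yet copied; each gets a distinct keyword so the
// rule body is non-empty and no new ambiguities are introduced
type_declaration : CLASS ;
statement        : IF ;
expression       : RETURN ;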
Third step
Now check the grammar. Beware: ANTLRworks reports successful parser generation even if ANTLR emits a warning (newer versions of ANTLRworks may have fixed this already). Check whether any rules have turned red; if so, solve the issue at hand. After solving all open issues, repeat these three steps for the next chunk of rules until all rules have been copied into the temporary file. Look at chapter 14 in TDAR (The Definitive ANTLR Reference) about syntactic predicates for some tips and general explanations. However, some issues are not mentioned there; I only became aware of them while working on my C# grammar.
There are four ways to influence ANTLR's ability to generate an unambiguous parser: left-factoring, syntactic predicates, backtracking and the k option. The last one doesn't actually turn an ambiguous grammar into an unambiguous one - rather, a too small k can make an unambiguous grammar appear ambiguous. k determines the look-ahead: if no value is specified, ANTLR uses a flexible number of look-ahead tokens (unbounded look-ahead).
Unbounded look-ahead tends to be slower than bounded look-ahead; many grammars, however, defeat this optimization in their original form. The recommendation is to use unbounded look-ahead until everything else has been resolved. Then determine the lowest global value of k that causes no warnings, reduce it by one, and give the rules that now complain a local k option with the previous value. Repeat this until you reach "k=1;" or you are content with the number of rules treated this way.
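For illustration, a hypothetical rule that keeps a local k of 2 after the global value has been lowered to 1 (all names are made up):

// both alternatives start with IDENTIFIER; only the second token decides
label_or_assignment
options { k = 2; }
    : IDENTIFIER COLON statement
    | IDENTIFIER ASSIGN expression SEMI
    ;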
Left-factoring
Now to the three real tools against non-determinism. The first is left-factoring. Look at the following rule:
a : L+ K | L+ M ;
As you can see, the common L tokens at the beginning prevent ANTLR from deciding which path is the right one, and as a consequence ANTLR disables the second alternative. Left-factoring can solve this. The name of this technique stems from the fact that you factor out the common part, which is always on the left side:
a : L+ (K | M) ;
Left-factoring can be done even across several rules. The starting point is then slightly different:
a : b | c ;
b : L+ K ;
c : L+ M ;
are turned into
a : L+ (b | c) ;
b : K ;
c : M ;
As the rules change significantly, I create commented-out copies before changing them. Furthermore, there is another way to solve this issue - syntactic predicates.
Syntactic Predicates
Syntactic predicates tell ANTLR: "Before you go further, check whether the input conforms to the following rules. If yes, take this path; if not, try the next alternative." This is done via the following syntax:
a : (L K)=> b | c ;
b : L K ;
c : L M ;
It is advisable to order the affected alternatives such that the last one doesn't have a predicate. This speeds things up, as ANTLR falls through to this alternative when none of the predicated ones is chosen.
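As a sketch, reusing the rules from above plus a hypothetical third alternative d:

a : (L K)=> b
  | (L M)=> c
  | d          // no predicate - chosen when the predicated alternatives fail
  ;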
An issue remains: two solutions for the same problem raise the question of when to use which. I prefer left-factoring when it can be done easily, even across rule boundaries, since left-factoring is faster than predicates and backtracking. But there are cases in which separating the common subset is too complicated or impossible without changing the meaning of the grammar, or in which the distinguishing context isn't represented by the existing grammar rules.
Then a predicate is the right solution. Predicates appear in parentheses and may contain any regular combination of lexer and parser rules. This allows for the following snippet, where cast_expression and primary_expression require disambiguation:
unary_expression
    // Syntactic predicate rule to disambiguate "(x)y" from cast and expression.
    // | if the sequence of tokens is correct grammar for type, but not for an expression
    : (scan_for_cast_generic_precedence | OPEN_PARENS predefined_type)=> cast_expression
    | primary_expression
    | PLUS unary_expression
    | MINUS unary_expression
    | BANG unary_expression
    | TILDE unary_expression
    | pre_increment_expression
    | pre_decrement_expression
    | pointer_indirection_expression
    | addressof_expression
    ;

// The sequence of tokens is correct grammar for a type, and the token immediately
// following the closing parentheses is the token '~', the token '!', the token '(',
// an IDENTIFIER, a literal, or any keyword except AS and IS.
scan_for_cast_generic_precedence
options { k = 1; }
    : OPEN_PARENS type CLOSE_PARENS cast_disambiguation_token
    ;

// One of these tokens must follow a valid cast in an expression, in order to
// eliminate a grammar ambiguity.
cast_disambiguation_token
    : ( TILDE | BANG | OPEN_PARENS | IDENTIFIER | literal
      | ABSTRACT | BASE | BOOL | BREAK | BYTE | CASE | CATCH | CHAR | CHECKED
      | CLASS | CONST | CONTINUE | DECIMAL | DEFAULT | DELEGATE | DO | DOUBLE
      | ELSE | ENUM | EVENT | EXPLICIT | EXTERN | FINALLY | FIXED | FLOAT
      | FOR | FOREACH | GOTO | IF | IMPLICIT | IN | INT | INTERFACE | INTERNAL
      | LOCK | LONG | NAMESPACE | NEW | OBJECT | OPERATOR | OUT | OVERRIDE
      | PARAMS | PRIVATE | PROTECTED | PUBLIC | READONLY | REF | RETURN
      | SBYTE | SEALED | SHORT | SIZEOF | STACKALLOC | STATIC | STRING
      | STRUCT | SWITCH | THIS | THROW | TRY | TYPEOF | UINT | ULONG
      | UNCHECKED | UNSAFE | USHORT | USING | VIRTUAL | VOID | VOLATILE | WHILE )
    ;
Backtracking
Backtracking is the slowest of all alternatives: the parser simply tries each alternative until it finds one that matches. In fact, I haven't had to use it so far myself, so I can't say when it is required. My recommendation is to use left-factoring and syntactic predicates as much as possible. I have also noted some confusion regarding backtracking versus syntactic predicates.
Terence Parr uses the image that backtracking is like walking into every branch of a maze, marking every taken turn on the walls, before jotting down the right path in a notebook, while syntactic predicates are like sending trained monkeys into selected branches and waiting for an OK from them. As far as my understanding goes, the difference between backtracking and predicates is that backtracking actually executes the participating rules (although without running their actions), implicitly using a stack. Backtracking also creates a predicate for each (non-ambiguous) alternative and checks the entire rule instead of only the necessary tokens.
Predicates, on the other hand, let the parser look ahead and check whether the token stream matches by computing a DFA (deterministic finite automaton) that recognizes the token sequence described by the predicate. Only if the DFA recognizes the sequence is the corresponding subrule selected. In effect, backtracking is more powerful, but also slower, than syntactic predicates.
It is possible to enable backtracking locally for individual rules, which is preferable to the global option.
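A sketch with made-up rule names; the option applies to this rule only:

troublesome_rule
options { backtrack = true; }
    : first_long_alternative
    | second_long_alternative
    ;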
Techniques to Solve Ambiguities
A) Do not create new ambiguities. It may sound simple or silly, but solving one ambiguity only to have another show up is a fairly common occurrence. It is possible that ANTLR(works) silently swallowed the second warning before, in which case the chosen solution is actually correct. But the more common case, in my experience, is the accidental creation of an ambiguity that was not in the original grammar. Usually this means that you have chosen the wrong way to solve the first ambiguity.
I had a case where the rule for the instance constructor became a true superset of the static constructor rule after I added the "static" modifier to the allowed instance constructor modifiers. This allowed the instance constructor rule to match the static constructor sequence. The solution was to subsume the static constructor rule into the instance constructor rule and to check later which kind of constructor it actually is (see B).
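A simplified, hypothetical sketch of such a subsumed rule:

// STATIC is now allowed syntactically; a later pass rejects combinations
// that are illegal for a static constructor (e.g. access modifiers)
constructor_declaration
    : constructor_modifier* IDENTIFIER OPEN_PARENS formal_parameter_list? CLOSE_PARENS constructor_body
    ;

constructor_modifier
    : PUBLIC | PROTECTED | INTERNAL | PRIVATE | EXTERN | STATIC
    ;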
B) Create a two-pass parser if your grammar includes context-dependent rules which are unsolvable with local information. An example is the decision whether "foo" refers to a variable or a type. You can require, as C does, that the symbol table already contain an entry for "foo" in order to determine the validity of "i = foo(1.0)", and reject the program if the entry is absent. Other languages like Java allow a more free-form placement: the method "foo" may come later in the file, so the first pass can merely record names and their types and check that the input conforms to a syntactic grammar. Only in the second pass can the parser know whether "foo" has been defined, or whether a constructor-looking method actually has the name of the class.
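The recording half of such a first pass might look like this sketch (the @members field and all rule names are hypothetical):

@members {
    java.util.Set<String> declaredNames = new java.util.HashSet<String>();
}

// pass one merely records the declared name; pass two (not shown) can
// then decide whether a use of "foo" refers to a variable or a type
method_declaration
    : type id=IDENTIFIER { declaredNames.add($id.text); }
      OPEN_PARENS formal_parameter_list? CLOSE_PARENS block
    ;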
The general case arises when two or more recognition paths start from the same point and cover the same token set, or at least overlap in a subset. The solution is to create a syntactic superset and left-factor the commonalities. The second pass then looks for forbidden combinations and can even produce better error messages.
An example from C#: both classes and methods take the "public" modifier, but only methods take the "virtual" modifier. The C# grammar gives each rule a specialized modifier rule; these have to be replaced with a common modifier rule that allows all modifiers and is left-factored into the parent rules, as sketched below.
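A simplified, hypothetical version of such a common rule:

// syntactic superset: every modifier is allowed everywhere; a second
// pass rejects e.g. VIRTUAL on a class declaration
modifier
    : NEW | PUBLIC | PROTECTED | INTERNAL | PRIVATE
    | ABSTRACT | SEALED | STATIC | VIRTUAL | OVERRIDE | EXTERN
    ;

class_declaration
    : modifier* CLASS IDENTIFIER class_body
    ;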
C) Respect dependencies when working on a left-edge ambiguity - one symbol appearing in first position in several alternatives. The problem in such a case is that the context is provided by other rules and may cause a new ambiguity. A real example is "s: A | (A B)?;". If one were to solve this with left-factoring, one would have to create "A? B?" and check for invalid combinations. Unfortunately, there is an interaction in this situation: "r: A (s | A? B);". Now the input sequence "A B" can be recognized via s, too - which was not possible in the original variant. The proper way to handle this rule is to use syntactic predicates, not left-factoring, as sketched below.
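One possible predicated form - it preserves the original language of s (exactly A, A B, or nothing), so bare B still cannot be matched via s:

s : (A B)=> A B   // taken only when the full sequence is present
  | A
  |               // empty alternative, from the original (A B)?
  ;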
D) Never use "." in syntactic predicates. In many cases the first path will still be chosen over the second one, because a key sequence for the first path may show up much later in the source file and "." simply swallows the key sequence that identifies the second path.
E) You can use a predicate to suppress unnecessary warnings. See here:
if_statement
    : IF OPEN_PARENS bool_expression CLOSE_PARENS embedded_statement
      ((ELSE)=> ELSE embedded_statement)?
    ;
This kind of ambiguity is not limited to if-then-else. I had a case where the ternary operator "?:" and the "?" for nullable types interacted, and I did not recognize it at first. Assuming you do not find any feasible alternative beyond global backtracking, there is a way to find out for sure that you have encountered such a situation: global backtracking suppresses the warning, but local backtracking does not.
F) Solve all ambiguities, but start with the ones you know how to solve. Solving the easier ambiguities first may make the harder ones easier later. If you can't think of a good way, let the problem rest for a while and try again later. Also ask on the mailing list about those which seem unsolvable.
Fourth step
Once you are satisfied with the results, copy the contents back into the original file.