Left-Recursion Removal - Print Edition
IntroductionÂ
Writing grammars, one will encounter left-recursive rules. As ANTLR is a recursive-descent parser, it cannot cope with them and thus they have to be removed. But what is a left-recursive rule exactly? It is a recursive rule which calls itself on the left-edge of a global alternative. An example would be:
a : a b | b ;
ANTLRworks can resolve this problem and turns rule a into
a : b+;
But ANTLRworks can do this automatically only for the above kind of left-recursive rules, which exhibit the left recursion within the implementation of the recursive rule itself. I would call these single left-recursive (SLR) rules. The other kind are mutually left-recursive (MLR) rules where the left-recursion occurs over several rule invocations. The simplest example would be the following:
a : b | c ; b : a | c ;
Rule a references rule b, while rule b references rules a. A more complex example would show such a 're-reference' taking place after many subsequent rule invocations. As stated above, the resolvement has to be done manually here.
As the strategy to solve this problem isn't that obvious, I decided to write a tutorial about two real grammar cases I have encountered while working on a C# grammar. It should be noted that this tutorial describes what I wished to have happened, as I visited many more dead-ends than noted. Furthermore, the strategy to resolve MLR rules is based on the description by Gavin Lambert.
First Example
Preparations
It is important to work with a copy of the rules, because changing the grammar into the normal form can become quite messy. Create a new grammar file and copy only the rules into the file which are reported in the error message. After that, create rules for missing rule references with ANTLRworks, so that the tool doesn't complain about them.
These rules may neither be empty nor be duplicates - I find the use of keywords here simple and efficient. It is advisable to move those rules to the end of the file so they don't get mixed up with the recursive ones. As every grammar with mutually left-recursive rules is kind of unique in its own sense, the following procedure is to be viewed as a heuristic instead of an algorithm.
After the execution of the first step I had the following C# grammar snippet:
grammar CSharpRecursion; type : value_type | reference_type | type_parameter | pointer_type ; value_type : struct_type | enum_type ; struct_type : type_name | simple_type | nullable_type ; nullable_type : non_nullable_value_type INTERR ; non_nullable_value_type : type ; reference_type : class_type | interface_type | array_type | delegate_type ; array_type : non_array_type rank_specifier+ ; non_array_type : type ; pointer_type : unmanaged_type STAR | VOID STAR ; unmanaged_type : type ; ///////////////////////// // $<OutOfSight type_parameter : AS ; delegate_type : BASE ; interface_type : BOOL ; class_type : BREAK ; simple_type : BYTE ; type_name : CASE ; enum_type : CATCH ; rank_specifier : CHAR ; ABSTRACT : 'abstract'; AS : 'as'; BASE : 'base'; BOOL : 'bool'; BREAK : 'break'; BYTE : 'byte'; CASE : 'case'; CATCH : 'catch'; CHAR : 'char'; CHECKED : 'checked'; CLASS : 'class'; CONST : 'const'; CONTINUE : 'continue'; DECIMAL : 'decimal'; DEFAULT : 'default'; DELEGATE : 'delegate'; DO : 'do'; DOUBLE : 'double'; ELSE : 'else'; ENUM : 'enum'; EVENT : 'event'; EXPLICIT : 'explicit'; EXTERN : 'extern'; FALSE : 'false'; FINALLY : 'finally'; FIXED : 'fixed'; FLOAT : 'float'; FOR : 'for'; FOREACH : 'foreach'; GOTO : 'goto'; IF : 'if'; IMPLICIT : 'implicit'; IN : 'in'; INT : 'int'; INTERFACE : 'interface'; INTERNAL : 'internal'; IS : 'is'; LOCK : 'lock'; LONG : 'long'; NAMESPACE : 'namespace'; NEW : 'new'; NULL : 'null'; OBJECT : 'object'; OPERATOR : 'operator'; OUT : 'out'; OVERRIDE : 'override'; PARAMS : 'params'; PRIVATE : 'private'; PROTECTED : 'protected'; PUBLIC : 'public'; READONLY : 'readonly'; REF : 'ref'; RETURN : 'return'; SBYTE : 'sbyte'; SEALED : 'sealed'; SHORT : 'short'; SIZEOF : 'sizeof'; STACKALLOC : 'stackalloc'; STATIC : 'static'; STRING : 'string'; STRUCT : 'struct'; SWITCH : 'switch'; THIS : 'this'; THROW : 'throw'; TRUE : 'true'; TRY : 'try'; TYPEOF : 'typeof'; UINT : 'uint'; ULONG : 'ulong'; UNCHECKED : 'unchecked'; UNSAFE : 'unsafe'; USHORT : 'ushort'; USING : 'using'; VIRTUAL : 'virtual'; VOID : 'void'; VOLATILE : 'volatile'; WHILE : 'while'; INTERR : '?'; STAR: '*'; // $>
After this all mutually left-recursive rules are checked to see if they are referenced in any other rules besides themselves in the original grammar. The five rules being referenced elsewhere are type, array_type, non_array_type, pointer_type and unmanaged_type. To simplify my future work I inline all unreferenced rules first and save the result:
type : type_name | simple_type | type INTERR | enum_type | class_type | interface_type | array_type | delegate_type | type_parameter | pointer_type ; array_type : non_array_type rank_specifier+ ; non_array_type : type ; pointer_type : unmanaged_type STAR | VOID STAR ; unmanaged_type : type ;
Now I don't have to repeat as many steps. This would be a good point to make a copy of this grammar file because when a mistake can't be easily undone one can start easily anew. Saving intermediate copies of the grammar once you complete each important step, would be a good practice, too. It is important to note that inlining leads always to only one single survivor so the process has to be repeated for every required rule.
It is also possible that rules become single left-recursive and have to be changed by ANTLRworks first (unless that rule is supposed to be the last rule - then postponing this to the last possible step is a good idea). Rules may become ambiguous, too, so doing grammar checks after single left-recursion removal (SLRR) is recommended. Analyze the cause and remove it before proceeding further.
When inlining is done, ANTLRworks automatically adds extra parentheses around the inlined statement. Those parentheses may be superfluous and should be removed as soon as possible, since SLRR may not be possible otherwise. A related problem is that after an SLRR that the trailing prevents the naive deletion of the parentheses around the main body.
This can be solved with "in-factoring": "(a|b) c" => "ac|bc". Further analysis of the syntax diagram can point out other simplifications like "(ac|bc) (c)*" => "(a|b) c+". Don't do them aggressively until only the last rule is left - further inlining may solve asymmetries where not all alternatives share the exact same parts.
Rule Inlining type
Every rule besides type (and the ones in the OutOfSight group) is inlined and all parentheses have been removed. The result is:
type : type_name | simple_type | type INTERR | enum_type | class_type | interface_type | type rank_specifier+ | delegate_type | type_parameter | type STAR | VOID STAR ;
Now SLRR leads to:
type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID STAR ) (INTERR | rank_specifier+ | STAR)* ;
At this point, a grammar check would reveal an ambiguity, which can be easily resolved by the removal of the '+':
type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID STAR ) (INTERR | rank_specifier | STAR)* ;
Rule Inlining array_type
This time we remove the left-recursion of the rule type, right at the beginning:
type : (type_name | simple_type | enum_type | class_type | interface_type | array_type | delegate_type | type_parameter | pointer_type ) INTERR* ;
Now we inline type and simplify further. I inline unmanaged_type as next and in-factor "INTERR* STAR" to each subrule, so that I can remove the left-recursion in pointer_type. Because of the assymmetric "VOID STAR" subrule I inline pointer_type instead doing some optimization, except turning the trailing into "(INTERR | STAR)*". Then we arrive at
non_array_type : (type_name | simple_type | enum_type | class_type | interface_type | array_type | delegate_type | type_parameter | (type_name INTERR* STAR | simple_type INTERR* STAR | enum_type INTERR* STAR | class_type INTERR* STAR | interface_type INTERR* STAR | array_type INTERR* STAR | delegate_type INTERR* STAR | type_parameter INTERR* STAR | VOID STAR ) (INTERR | STAR)* ) INTERR* ;
Taking the near duplicates into account, "INTERR* STAR (INTERR | STAR)" is transformable into "INTERR* STAR? (INTERR | STAR)" and this is "(INTERR | STAR)*". We then obtain
non_array_type : (type_name (INTERR | STAR)* | simple_type (INTERR | STAR)* | enum_type (INTERR | STAR)* | class_type (INTERR | STAR)* | interface_type (INTERR | STAR)* | array_type (INTERR | STAR)* | delegate_type (INTERR | STAR)* | type_parameter (INTERR | STAR)* | VOID STAR (INTERR | STAR)* ) INTERR* ;
The last INTERR* can be safely deleted. Then after some rule-inling, SLRR and simplification we arrive at
array_type : type_name (STAR | INTERR)* rank_specifier+ | simple_type (STAR | INTERR)* rank_specifier+ | enum_type (STAR | INTERR)* rank_specifier+ | class_type (STAR | INTERR)* rank_specifier+ | interface_type (STAR | INTERR)* rank_specifier+ | array_type (STAR | INTERR)* rank_specifier+ | delegate_type (STAR | INTERR)* rank_specifier+ | type_parameter (STAR | INTERR)* rank_specifier+ | VOID STAR (STAR | INTERR)* rank_specifier+ ;
Now repeat SLRR and simplify the result to
array_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID STAR ) ((STAR | INTERR)* rank_specifier)+ ;
"((STAR | INTERR)* rank_specifier)" is similar to "(+STAR +| INTERR | rank_specifier+)+", if there is at least one rank_specifier. This can be checked with:
array_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID STAR ) (STAR | INTERR | r+=rank_specifier)+ {$r.Count >= 1}? ;
Rule Inlining unmanaged_type
We start here with the two saved unmanaged_type and array_type rules. We remove the left-recursion from array_type and simplify it to
array_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | unmanaged_type STAR | VOID STAR ) (INTERR* rank_specifier)+ ;
Then we only need to inline it and simplify unmanaged_type to
unmanaged_type : type_name (INTERR | rank_specifier)* | simple_type (INTERR | rank_specifier)* | enum_type (INTERR | rank_specifier)* | class_type (INTERR | rank_specifier)* | interface_type (INTERR | rank_specifier)* | delegate_type (INTERR | rank_specifier)* | type_parameter (INTERR | rank_specifier)* | unmanaged_type STAR (INTERR | rank_specifier)* | VOID STAR (INTERR | rank_specifier)* ;
Now SLRR and simplification steps have to be repeated:
unmanaged_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID STAR ) (STAR | INTERR | rank_specifier)* ;
Rule Inlining non_array_type
We repeat the known procedure, where non_array_type won't have its left-recursion removal until the last step. We then get
non_array_type : (type_name | simple_type | enum_type | class_type | interface_type | non_array_type rank_specifier+ | delegate_type | type_parameter | pointer_type ) INTERR* ; pointer_type : type_name INTERR* STAR | simple_type INTERR* STAR | enum_type INTERR* STAR | class_type INTERR* STAR | interface_type INTERR* STAR | non_array_type rank_specifier+ INTERR* STAR | delegate_type INTERR* STAR | type_parameter INTERR* STAR | pointer_type INTERR* STAR | VOID STAR ;
pointer_type is turned into
pointer_type : (type_name INTERR* STAR | simple_type INTERR* STAR | enum_type INTERR* STAR | class_type INTERR* STAR | interface_type INTERR* STAR | non_array_type rank_specifier+ INTERR* STAR | delegate_type INTERR* STAR | type_parameter INTERR* STAR | VOID STAR ) (INTERR | STAR)* ;
Inlining pointer_type leads with some simplification to
non_array_type : (type_name | simple_type | enum_type | class_type | interface_type | non_array_type rank_specifier+ | delegate_type | type_parameter | type_name INTERR* STAR (INTERR | STAR)* | simple_type INTERR* STAR (INTERR | STAR)* | enum_type INTERR* STAR (INTERR | STAR)* | class_type INTERR* STAR (INTERR | STAR)* | interface_type INTERR* STAR (INTERR | STAR)* | non_array_type rank_specifier+ INTERR* STAR (INTERR | STAR)* | delegate_type INTERR* STAR (INTERR | STAR)* | type_parameter INTERR* STAR (INTERR | STAR)* | VOID STAR (INTERR | STAR)* ) INTERR* ;
Following the next few steps as done when inlining array_type results in:
non_array_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID STAR ) (rank_specifier | INTERR | STAR)* ;
Rule Inlining pointer_type
Starting with the first intermediate result of non_array_type, we turn non_array_type into
non_array_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | pointer_type ) (rank_specifier | INTERR)* ;
After inlining and simplifying we arrive at
pointer_type : type_name (rank_specifier | INTERR)* STAR | simple_type (rank_specifier | INTERR)* STAR | enum_type (rank_specifier | INTERR)* STAR | class_type (rank_specifier | INTERR)* STAR | interface_type (rank_specifier | INTERR)* STAR | delegate_type (rank_specifier | INTERR)* STAR | type_parameter (rank_specifier | INTERR)* STAR | pointer_type (rank_specifier | INTERR)* STAR | VOID STAR ;
This can be turned into
pointer_type : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID ) (rank_specifier | INTERR | STAR)* STAR ;
Note that I moved the STAR to the end to allow additional transformations, but in this form there can be rank_specifier and INTERR between VOID and the first STAR. This is alleviated by
pointer_type @init { bool allowAll = true; } : (type_name | simple_type | enum_type | class_type | interface_type | delegate_type | type_parameter | VOID {allowAll = false;} ) ({allowAll}?=> rank_specifier \|{allowAll}?=> INTERR | STAR {allowAll = true;})* STAR ;
Second Example
The first example showed only a limited pool of possibilities, so I decided to use the second mutual left-recursive rules loop in the C# grammar as another example.
Preparations
After the execution of the first step, as explained in the first example, I had the following C# grammar snippet. I realized at this point that it would be a good idea to use more of the original rules instead stand-ins because I wanted to be sure that there is no hidden interaction between the two recursion loops.
grammar CSharpRecursion2; primary_expression : array_creation_expression | primary_no_array_creation_expression ; primary_no_array_creation_expression : literal | simple_name | parenthesized_expression | member_access | invocation_expression | element_access | this_access | base_access | post_increment_expression | post_decrement_expression | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression | pointer_member_access | pointer_element_access ; member_access : primary_expression DOT IDENTIFIER type_argument_list? | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? ; invocation_expression : primary_expression OPEN_PARENS argument_list? CLOSE_PARENS ; post_increment_expression : primary_expression OP_INC ; post_decrement_expression : primary_expression OP_DEC ; pointer_member_access : primary_expression OP_PTR IDENTIFIER ; pointer_element_access : primary_no_array_creation_expression OPEN_BRACKET expression CLOSE_BRACKET ; element_access : primary_no_array_creation_expression OPEN_BRACKET expression_list CLOSE_BRACKET ; // $<OutOfSight // Here additional rules needed to resolve errors. argument_list :COMMA ; expression :DEFAULT ; expression_list :THIS ; array_creation_expression :DELEGATE ; anonymous_method_expression :DO ; default_value_expression :DOUBLE ; unchecked_expression :ELSE ; checked_expression :ENUM ; typeof_expression :EVENT ; anonymous_object_creation_expression :EXPLICIT ; delegate_creation_expression :EXTERN ; object_creation_expression :FALSE ; base_access :FINALLY ; this_access :FIXED ; parenthesized_expression :FLOAT ; simple_name :FOR ; literal :FOREACH ; contextual_keyword[string identifier] : { input.LT(1).Text == $identifier }? IDENTIFIER ; predefined_type : BOOL ; right_shift : OP_LT OP_LT ; sizeof_expression : SIZEOF OPEN_PARENS unmanaged_type CLOSE_PARENS ; unmanaged_type : (namespace_or_type_name | simple_type | OBJECT | STRING | VOID STAR ) (STAR | INTERR | rank_specifier)* ; namespace_or_type_name : (IDENTIFIER type_argument_list? | qualified_alias_member) (DOT IDENTIFIER type_argument_list?)* ; IDENTIFIER : 'a'..'z' ; simple_type : BOOL ; rank_specifier : OPEN_BRACKET COMMA* CLOSE_BRACKET ; qualified_alias_member : IDENTIFIER DOUBLE_COLON IDENTIFIER type_argument_list? ; type_argument_list : OP_LT type_arguments OP_GT ; type_arguments : type_argument (COMMA type_argument)* ; type_argument : type ; type : (namespace_or_type_name | simple_type | OBJECT | STRING | VOID STAR ) (STAR | INTERR | rank_specifier)* ; ABSTRACT:'abstract'; AS:'as'; BASE:'base'; BOOL:'bool'; BREAK:'break'; BYTE:'byte'; CASE:'case'; CATCH:'catch'; CHAR:'char'; CHECKED:'checked'; CLASS:'class'; CONST:'const'; CONTINUE:'continue'; DECIMAL:'decimal'; DEFAULT:'default'; DELEGATE:'delegate'; DO:'do'; DOUBLE:'double'; ELSE:'else'; ENUM:'enum'; EVENT:'event'; EXPLICIT:'explicit'; EXTERN:'extern'; FALSE:'false'; FINALLY:'finally'; FIXED:'fixed'; FLOAT:'float'; FOR:'for'; FOREACH:'foreach'; GOTO:'goto'; IF:'if'; IMPLICIT:'implicit'; IN:'in'; INT:'int'; INTERFACE:'interface'; INTERNAL:'internal'; IS:'is'; LOCK:'lock'; LONG:'long'; NAMESPACE:'namespace'; NEW:'new'; NULL:'null'; OBJECT:'object'; OPERATOR:'operator'; OUT:'out'; OVERRIDE:'override'; PARAMS:'params'; PRIVATE:'private'; PROTECTED:'protected'; PUBLIC:'public'; READONLY:'readonly'; REF:'ref'; RETURN:'return'; SBYTE:'sbyte'; SEALED:'sealed'; SHORT:'short'; SIZEOF:'sizeof'; STACKALLOC:'stackalloc'; STATIC:'static'; STRING:'string'; STRUCT:'struct'; SWITCH:'switch'; THIS:'this'; THROW:'throw'; TRUE:'true'; TRY:'try'; TYPEOF:'typeof'; UINT:'uint'; ULONG:'ulong'; UNCHECKED:'unchecked'; UNSAFE:'unsafe'; USHORT:'ushort'; USING:'using'; VIRTUAL:'virtual'; VOID:'void'; VOLATILE:'volatile'; WHILE:'while'; OPEN_PARENS:'('; CLOSE_PARENS:')'; OPEN_BRACKET:'['; CLOSE_BRACKET:']'; OPEN_BRACE:'{'; CLOSE_BRACE:'}'; STAR:'*'; INTERR:'?'; DOT:'.'; COMMA:','; OP_PTR:'->'; DOUBLE_COLON:'::'; OP_LT:'<'; OP_GT:'>'; OP_INC:'++'; OP_DEC:'--'; // $>
After this all mutually left-recursive rules are checked if they are referenced in any other rule besides themselves in the original grammar. The original list consists of invocation_expression, primary_expression, primary_no_array_creation_expression, post_increment_expression, post_decrement_expression, element_access, member_access, pointer_element_access and pointer_member_access.
Out of those, only element_access, pointer_element_access and pointer_member_access aren't referenced elsewhere, so they are inlined before I save a copy. Next to the absence of the aforementioned rules primary_no_array_creation_expression looks like this. Its single left-recursion points it to be the first surviving rule.
primary_no_array_creation_expression : literal | simple_name | parenthesized_expression | member_access | invocation_expression | primary_no_array_creation_expression OPEN_BRACKET expression_list CLOSE_BRACKET | this_access | base_access | post_increment_expression | post_decrement_expression | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression | primary_expression OP_PTR IDENTIFIER | primary_no_array_creation_expression OPEN_BRACKET expression CLOSE_BRACKET ;
Rule Inlining primary_no_array_creation_expression
Inlining primary_expression leads to in-factoring at every other rule. The inlining of the other rules is straight-forward and gives us
primary_no_array_creation_expression : literal | simple_name | parenthesized_expression | array_creation_expression DOT IDENTIFIER type_argument_list? | primary_no_array_creation_expression DOT IDENTIFIER type_argument_list? | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS | primary_no_array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS | primary_no_array_creation_expression OPEN_BRACKET expression_list CLOSE_BRACKET | this_access | base_access | array_creation_expression OP_INC | primary_no_array_creation_expression OP_INC | array_creation_expression OP_DEC | primary_no_array_creation_expression OP_DEC | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression | array_creation_expression OP_PTR IDENTIFIER | primary_no_array_creation_expression OP_PTR IDENTIFIER | primary_no_array_creation_expression OPEN_BRACKET expression CLOSE_BRACKET ;
After SLRR we get
primary_no_array_creation_expression : (literal | simple_name | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | this_access | base_access | array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS | array_creation_expression DOT IDENTIFIER type_argument_list? | array_creation_expression OP_INC | array_creation_expression OP_DEC | array_creation_expression OP_PTR IDENTIFIER | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | parenthesized_expression | typeof_expression | sizeof_expression | default_value_expression | anonymous_method_expression | checked_expression | unchecked_expression ) (DOT IDENTIFIER type_argument_list? | OPEN_PARENS argument_list? CLOSE_PARENS | OP_INC | OP_DEC | OP_PTR IDENTIFIER | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET )* ;
Since this doesn't show any prospect of an ability to simplify, so we go on with the next rule.
Rule Inlining primary_expression
Firstly, we can do SLRR and simplification with primary_no_array_creation_expression:
primary_no_array_creation_expression : (literal | simple_name | parenthesized_expression | member_access | invocation_expression | this_access | base_access | post_increment_expression | post_decrement_expression | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression | primary_expression OP_PTR IDENTIFIER ) (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* ;
The rest of the rule inlining is simple. It leads to
primary_expression : array_creation_expression | (literal | simple_name | parenthesized_expression | primary_expression DOT IDENTIFIER type_argument_list? | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | primary_expression OPEN_PARENS argument_list? CLOSE_PARENS | this_access | base_access | primary_expression OP_INC | primary_expression OP_DEC | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression | primary_expression OP_PTR IDENTIFIER ) (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* ;
This point seems to be problematic. primary_expression is actually SLR. But due to the parentheses ANTLR considers them MLR. In-factoring is the only and somewhat ugly solution to allow SLRR:
primary_expression : (array_creation_expression | literal (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | simple_name (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | parenthesized_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | predefined_type DOT IDENTIFIER type_argument_list? (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | qualified_alias_member DOT IDENTIFIER type_argument_list? (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | this_access (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | base_access (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | object_creation_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | delegate_creation_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | anonymous_object_creation_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | typeof_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | checked_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | unchecked_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | default_value_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | anonymous_method_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* | sizeof_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)* ) ( (DOT IDENTIFIER type_argument_list? | OPEN_PARENS argument_list? CLOSE_PARENS | OP_INC | OP_DEC | OP_PTR IDENTIFIER ) (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*)* ;
The only simplification which preserves the equivalence without checking for validity would be the out-factoring of "(OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*", but as I use a second pass for my compiler a further simplification can be done.
The first move is to pretend the line "array_creation_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*" and out-factor the inner trailing. Thus the outer trailing can be turn into:
primary_expression : (array_creation_expression | literal | simple_name | parenthesized_expression | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | this_access | base_access | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (DOT IDENTIFIER type_argument_list? | OPEN_PARENS argument_list? CLOSE_PARENS | OP_INC | OP_DEC | OP_PTR IDENTIFIER | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET )* ;
For now we can leave this rule like this and move on.
Rule Inlining member_access
We inline primary_expression and in-factor all rules before we save a new copy of the grammar:
primary_no_array_creation_expression : literal | simple_name | parenthesized_expression | member_access | invocation_expression | primary_no_array_creation_expression OPEN_BRACKET expression_list CLOSE_BRACKET | this_access | base_access | post_increment_expression | post_decrement_expression | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression | array_creation_expression OP_PTR IDENTIFIER | primary_no_array_creation_expression OP_PTR IDENTIFIER | primary_no_array_creation_expression OPEN_BRACKET expression CLOSE_BRACKET ; member_access : array_creation_expression DOT IDENTIFIER type_argument_list? | primary_no_array_creation_expression DOT IDENTIFIER type_argument_list? | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? ; invocation_expression : array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS | primary_no_array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS ; post_increment_expression : array_creation_expression OP_INC | primary_no_array_creation_expression OP_INC ; post_decrement_expression : array_creation_expression OP_DEC | primary_no_array_creation_expression OP_DEC ;
We postpone the SLRR on primary_no_array_creation_expression and its inlining until the last moment as this rule is referenced by other to-be-inlined rules. We inline now invocation_expression, post_increment_expression and post_decrement_expression and remove SLRR on primary_no_array_creation_expression to get:
primary_no_array_creation_expression : (literal | simple_name | parenthesized_expression | member_access | this_access | base_access | array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS | array_creation_expression OP_INC | array_creation_expression OP_DEC | array_creation_expression OP_PTR IDENTIFIER | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (OPEN_PARENS argument_list? CLOSE_PARENS | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_DEC | OP_PTR IDENTIFIER )* ;
Now after inlining member_access, it has to be in-factored to allow SLRR. Due to the size of the trailing it was out-factored into a separate rule temporarily:
member_access : (array_creation_expression DOT IDENTIFIER type_argument_list? | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | literal member_access_trailing | simple_name member_access_trailing | parenthesized_expression member_access_trailing | this_access member_access_trailing | base_access member_access_trailing | array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS member_access_trailing | array_creation_expression OP_INC member_access_trailing | array_creation_expression OP_DEC member_access_trailing | array_creation_expression OP_PTR IDENTIFIER member_access_trailing | object_creation_expression member_access_trailing | delegate_creation_expression member_access_trailing | anonymous_object_creation_expression member_access_trailing | typeof_expression member_access_trailing | checked_expression member_access_trailing | unchecked_expression member_access_trailing | default_value_expression member_access_trailing | anonymous_method_expression member_access_trailing | sizeof_expression member_access_trailing ) member_access_trailing* ; member_access_trailing : (OPEN_PARENS argument_list? CLOSE_PARENS | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_DEC | OP_PTR IDENTIFIER )* DOT IDENTIFIER type_argument_list? ;
As one can easily see, most alternatives of member_access include member_access_trailing. Looking more closely, the only three alternatives, which don't include the member_access_trailing, use only the zero-or-more-closure of member_access_trailing.
The simplification is complicated by the fact that there are four other array_creation_expression-s. Taking all five alternatives into account, we can actually combine all array_creation_expression-s into one. Using our "we check it in the next pass"-trick we can simplify member_access to:
member_access : (array_creation_expression | predefined_type | qualified_alias_member | literal | simple_name | parenthesized_expression | this_access | base_access | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (OPEN_PARENS argument_list? CLOSE_PARENS | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_DEC | OP_PTR IDENTIFIER | DOT IDENTIFIER type_argument_list? )+ ;
The second pass has to check for the alternatives predefined_type and qualified_alias_member, that they are directly followed by a DOT. array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that DOT IDENTIFIER type_argument_list? is the last matched sequence.
Rule Inlining invocation_expression
We start by inlining member_access, post_increment_expression and post_decrement_expression. Then we do SLRR on primary_no_array_creation_expression:
primary_no_array_creation_expression : (literal | simple_name | parenthesized_expression | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | invocation_expression | this_access | base_access | array_creation_expression DOT IDENTIFIER type_argument_list? | array_creation_expression OP_INC | array_creation_expression OP_DEC | array_creation_expression OP_PTR IDENTIFIER | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (DOT IDENTIFIER type_argument_list? | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_DEC | OP_PTR IDENTIFIER )* ;
Then we inline primary_no_array_creation_expression itself. Again we extract the trailing into a temporary subrule and in-factor its reference, before we do another SLRR.
invocation_expression : (array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS | literal invocation_expression_trailing | simple_name invocation_expression_trailing | parenthesized_expression invocation_expression_trailing | predefined_type DOT IDENTIFIER type_argument_list? invocation_expression_trailing | qualified_alias_member DOT IDENTIFIER type_argument_list? invocation_expression_trailing | this_access invocation_expression_trailing | base_access invocation_expression_trailing | array_creation_expression DOT IDENTIFIER type_argument_list? invocation_expression_trailing | array_creation_expression OP_INC invocation_expression_trailing | array_creation_expression OP_DEC invocation_expression_trailing | array_creation_expression OP_PTR IDENTIFIER invocation_expression_trailing | object_creation_expression invocation_expression_trailing | delegate_creation_expression invocation_expression_trailing | anonymous_object_creation_expression invocation_expression_trailing | typeof_expression invocation_expression_trailing | checked_expression invocation_expression_trailing | unchecked_expression invocation_expression_trailing | default_value_expression invocation_expression_trailing | anonymous_method_expression invocation_expression_trailing | sizeof_expression invocation_expression_trailing ) invocation_expression_trailing* ; invocation_expression_trailing : (DOT IDENTIFIER type_argument_list? | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_DEC | OP_PTR IDENTIFIER)* OPEN_PARENS argument_list? CLOSE_PARENS ;
The situation is analogous to member_expression above.
invocation_expression : (array_creation_expression | literal | simple_name | parenthesized_expression | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | this_access | base_access | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (DOT IDENTIFIER type_argument_list? | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_DEC | OP_PTR IDENTIFIER | OPEN_PARENS argument_list? CLOSE_PARENS )+ ;
The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OPEN_PARENS argument_list? CLOSE_PARENS is the last matched sequence.
Rule Inlining post_increment_expression
We have to repeat the same steps again as for invocation_expression to get:
post_increment_expression : (array_creation_expression | literal | simple_name | parenthesized_expression | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | this_access | base_access | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (DOT IDENTIFIER type_argument_list? | OPEN_PARENS argument_list? CLOSE_PARENS | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_DEC | OP_PTR IDENTIFIER | OP_INC )+ ;
The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OP_INC is the last matched sequence.
Rule Inlining post_decrement_expression
We have to repeat the same steps again as for invocation_expression to get:
post_decrement_expression : (array_creation_expression | literal | simple_name | parenthesized_expression | predefined_type DOT IDENTIFIER type_argument_list? | qualified_alias_member DOT IDENTIFIER type_argument_list? | this_access | base_access | object_creation_expression | delegate_creation_expression | anonymous_object_creation_expression | typeof_expression | checked_expression | unchecked_expression | default_value_expression | anonymous_method_expression | sizeof_expression ) (DOT IDENTIFIER type_argument_list? | OPEN_PARENS argument_list? CLOSE_PARENS | OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET | OP_INC | OP_PTR IDENTIFIER | OP_DEC )+ ;
The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OP_DEC is the last matched sequence.
Rule Normalization
The next step would be ordinarily to look for extractable subrules but one additional step is helpful here. The rules often have a few alternatives whose rule structure is a special case of the rule structure of the other alternatives. The best is to make those special cases equal in rule structure to the alternatives to achieve a better subrule extraction.
Of course, the change of the grammar results in the recognition of a different language. To prevent this one can either use actions like shown with pointer_type to disallow invalid structures. Or one can use 2-pass compiler and check in the next pass specifically for correctness. This approach permits to turn the rules in the first pass even more uniformous, so maybe only one superset rule remains. An additional advantage are also the already mentioned improved error messages.