Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

...

Code Block
a   :    b
    |    c
    ;

b   :    a
    |    c
    ;

Rule a references rule b, while rule b references rules a. A more complex example would show such a 're-reference' taking place after many subsequent rule invocations. As stated above, the resolvement has to be done manually here.

...

After this all mutually left-recursive rules are checked to see if they are referenced in any other rules besides themselves in the original grammar. The five rules being referenced elsewhere are type, array_type, non_array_type, pointer_type and unmanaged_type. To simplify my future work I inline all unreferenced rules first and save the result:

...

This can be solved with "in-factoring": "(a|b) c" => "ac|bc". Further analysis of the syntax diagram can point out other simplifications like "(ac|bc) (c)*" => "(a|b) c+". Don't do them aggressively until only the last rule is left - further inlining may solve asymmetries where not all alternatives share the exact same parts.

Rule Inlining type

Every rule besides type (and the ones in the OutOfSight group) is inlined and all parentheses have been removed. The result is:

...

This time we remove the left-recursion of the rule type, right at the beginning:

...

Now we inline type and simplify further. I inline unmanaged_type as next and in-factor "INTERR* STAR" to each subrule, so that I can remove the left-recursion in pointer_type. Because of the assymmetric "VOID STAR" subrule I inline pointer_type instead doing some optimization, except turning the trailing into "(INTERR | STAR)*". Then we arrive at

Code Block
non_array_type
    :    (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    array_type
        |    delegate_type
        |    type_parameter
        |   (type_name INTERR* STAR
            |    simple_type INTERR* STAR
            |    enum_type INTERR* STAR
            |    class_type INTERR* STAR
            |    interface_type INTERR* STAR
            |    array_type INTERR* STAR
            |    delegate_type INTERR* STAR
            |    type_parameter INTERR* STAR
            |    VOID STAR
            ) (INTERR | STAR)*
        ) INTERR*
    ;

Taking the near duplicates into account, "INTERR* STAR (INTERR | STAR)" is transformable into "INTERR* STAR? (INTERR | STAR)" and this is "(INTERR | STAR)*". We then obtain

Code Block
non_array_type
    :   (type_name (INTERR | STAR)*
        |    simple_type (INTERR | STAR)*
        |    enum_type (INTERR | STAR)*
        |    class_type (INTERR | STAR)*
        |    interface_type (INTERR | STAR)*
        |    array_type (INTERR | STAR)*
        |    delegate_type (INTERR | STAR)*
        |    type_parameter (INTERR | STAR)*
        |    VOID STAR (INTERR | STAR)*
        ) INTERR*
    ;

The last INTERR* can be safely deleted. Then after some rule-inling, SLRR and simplification we arrive at

...

Code Block
array_type
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    delegate_type
        |    type_parameter
        |    VOID STAR
        ) ((STAR | INTERR)* rank_specifier)+
;

"((STAR | INTERR)* rank_specifier)" is similar to "(+STAR +| INTERR | rank_specifier+)+", if there is at least one rank_specifier. This can be checked with:

...

We start here with the two saved unmanaged_type and array_type rules. We remove the left-recursion from array_type and simplify it to

Code Block
array_type
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    delegate_type
        |    type_parameter
        |    unmanaged_type STAR
        |    VOID STAR
        ) (INTERR* rank_specifier)+
    ;

Then we only need to inline it and simplify unmanaged_type to

Code Block
unmanaged_type
    :    type_name (INTERR | rank_specifier)*
    |    simple_type (INTERR | rank_specifier)*
    |    enum_type (INTERR | rank_specifier)*
    |    class_type (INTERR | rank_specifier)*
    |    interface_type (INTERR | rank_specifier)*
    |    delegate_type (INTERR | rank_specifier)*
    |    type_parameter (INTERR | rank_specifier)*
    |    unmanaged_type STAR (INTERR | rank_specifier)*
    |    VOID STAR (INTERR | rank_specifier)*
    ;

...

We repeat the known procedure, where non_array_type won't have its left-recursion removal until the last step. We then get

Code Block
non_array_type
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    non_array_type rank_specifier+
        |    delegate_type
        |    type_parameter
        |    pointer_type
        ) INTERR*
    ;

pointer_type
    :    type_name INTERR* STAR
    |    simple_type INTERR* STAR
    |    enum_type INTERR* STAR
    |    class_type INTERR* STAR
    |    interface_type INTERR* STAR
    |    non_array_type rank_specifier+ INTERR* STAR
    |    delegate_type INTERR* STAR
    |    type_parameter INTERR* STAR
    |    pointer_type INTERR* STAR
    |    VOID STAR
    ;

pointer_type is turned into

Code Block
pointer_type
    :   (type_name INTERR* STAR
        |    simple_type INTERR* STAR
        |    enum_type INTERR* STAR
        |    class_type INTERR* STAR
        |    interface_type INTERR* STAR
        |    non_array_type rank_specifier+ INTERR* STAR
        |    delegate_type INTERR* STAR
        |    type_parameter INTERR* STAR
        |    VOID STAR
        ) (INTERR | STAR)*
    ;

Inlining pointer_type leads with some simplification to

...

Following the next few steps as done when inlining array_type results in:

Code Block
non_array_type
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    delegate_type
        |    type_parameter
        |    VOID STAR
        ) (rank_specifier | INTERR | STAR)*
    ;

...

Starting with the first intermediate result of non_array_type, we turn non_array_type into

Code Block
non_array_type
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    delegate_type
        |    type_parameter
        |    pointer_type
        ) (rank_specifier | INTERR)*
    ;

...

Code Block
pointer_type
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    delegate_type
        |    type_parameter
        |    VOID
        ) (rank_specifier | INTERR | STAR)* STAR
    ;

Note that I moved the STAR to the end to allow additional transformations, but in this form there can be rank_specifier and INTERR between VOID and the first STAR. This is alleviated by

Code Block
pointer_type
@init {
    bool allowAll = true;
}
    :   (type_name
        |    simple_type
        |    enum_type
        |    class_type
        |    interface_type
        |    delegate_type
        |    type_parameter
        |    VOID {allowAll = false;}
        ) ({allowAll}?=> rank_specifier \|{allowAll}?=> INTERR | STAR {allowAll = true;})* STAR
    ;

...

After this all mutually left-recursive rules are checked if they are referenced in any other rule besides themselves in the original grammar. The original list consists of invocation_expression, primary_expression, primary_no_array_creation_expression, post_increment_expression, post_decrement_expression, element_access, member_access, pointer_element_access and pointer_member_access.

Out of those, only element_access, pointer_element_access and pointer_member_access aren't referenced elsewhere, so they are inlined before I save a copy. Next to the absence of the aforementioned rules primary_no_array_creation_expression looks like this. Its single left-recursion points it to be the first surviving rule.

...

Rule Inlining primary_no_array_creation_expression

Inlining primary_expression leads to in-factoring at every other rule. The inlining of the other rules is straight-forward and gives us

...

Firstly, we can do SLRR and simplification with primary_no_array_creation_expression:

Code Block
primary_no_array_creation_expression
    :   (literal
        |    simple_name
        |    parenthesized_expression
        |    member_access
        |    invocation_expression
        |    this_access
        |    base_access
        |    post_increment_expression
        |    post_decrement_expression
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        |    primary_expression OP_PTR IDENTIFIER
        ) (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*
    ;

...

This point seems to be problematic. primary_expression is actually SLR. But due to the parentheses ANTLR considers them MLR. In-factoring is the only and somewhat ugly solution to allow SLRR:

...

The only simplification which preserves the equivalence without checking for validity would be the out-factoring of "(OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*", but as I use a second pass for my compiler a further simplification can be done.

The first move is to pretend the line "array_creation_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*" and out-factor the inner trailing. Thus the outer trailing can be turn into:

...

Rule Inlining member_access

We inline primary_expression and in-factor all rules before we save a new copy of the grammar:

Code Block
primary_no_array_creation_expression
    :    literal
    |    simple_name
    |    parenthesized_expression
    |    member_access
    |    invocation_expression
    |    primary_no_array_creation_expression OPEN_BRACKET expression_list CLOSE_BRACKET
    |    this_access
    |    base_access
    |    post_increment_expression
    |    post_decrement_expression
    |    object_creation_expression
    |    delegate_creation_expression
    |    anonymous_object_creation_expression
    |    typeof_expression
    |    checked_expression
    |    unchecked_expression
    |    default_value_expression
    |    anonymous_method_expression
    |    sizeof_expression
    |    array_creation_expression OP_PTR IDENTIFIER
    |    primary_no_array_creation_expression OP_PTR IDENTIFIER
    |    primary_no_array_creation_expression OPEN_BRACKET expression CLOSE_BRACKET
    ;

member_access
    :    array_creation_expression DOT IDENTIFIER type_argument_list?
    |    primary_no_array_creation_expression DOT IDENTIFIER type_argument_list?
    |    predefined_type DOT IDENTIFIER type_argument_list?
    |    qualified_alias_member DOT IDENTIFIER type_argument_list?
    ;

invocation_expression
    :    array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS
    |    primary_no_array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS
    ;

post_increment_expression
    :    array_creation_expression OP_INC
    |    primary_no_array_creation_expression OP_INC
    ;

post_decrement_expression
    :    array_creation_expression OP_DEC
    |    primary_no_array_creation_expression OP_DEC
    ;

We postpone the SLRR on primary_no_array_creation_expression and its inlining until the last moment as this rule is referenced by other to-be-inlined rules. We inline now invocation_expression, post_increment_expression and post_decrement_expression and remove SLRR on primary_no_array_creation_expression to get:

Code Block
primary_no_array_creation_expression
    :   (literal
        |    simple_name
        |    parenthesized_expression
        |    member_access
        |    this_access
        |    base_access
        |    array_creation_expression OPEN_PARENS argument_list? CLOSE_PARENS
        |    array_creation_expression OP_INC
        |    array_creation_expression OP_DEC
        |    array_creation_expression OP_PTR IDENTIFIER
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        )   (OPEN_PARENS argument_list? CLOSE_PARENS
            |    OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET
            |    OP_INC
            |    OP_DEC
            |    OP_PTR IDENTIFIER
            )*
    ;

Now after inlining member_access, it has to be in-factored to allow SLRR. Due to the size of the trailing it was out-factored into a separate rule temporarily:

...

As one can easily see, most alternatives of member_access include member_access_trailing. Looking more closely, the only three alternatives, which don't include the member_access_trailing, use only the zero-or-more-closure of member_access_trailing.

The simplification is complicated by the fact that there are four other array_creation_expression-s. Taking all five alternatives into account, we can actually combine all array_creation_expression-s into one. Using our "we check it in the next pass"-trick we can simplify member_access to:

Code Block
member_access
    :   (array_creation_expression
        |    predefined_type
        |    qualified_alias_member
        |    literal
        |    simple_name
        |    parenthesized_expression
        |    this_access
        |    base_access
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        )   (OPEN_PARENS argument_list? CLOSE_PARENS
            |    OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET
            |    OP_INC
            |    OP_DEC
            |    OP_PTR IDENTIFIER
            |    DOT IDENTIFIER type_argument_list?
            )+
    ;

The second pass has to check for the alternatives predefined_type and qualified_alias_member, that they are directly followed by a DOT. array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that DOT IDENTIFIER type_argument_list? is the last matched sequence.

Rule Inlining invocation_expression

We start by inlining member_access, post_increment_expression and post_decrement_expression. Then we do SLRR on primary_no_array_creation_expression:

Code Block
primary_no_array_creation_expression
    :   (literal
        |    simple_name
        |    parenthesized_expression
        |    predefined_type DOT IDENTIFIER type_argument_list?
        |    qualified_alias_member DOT IDENTIFIER type_argument_list?
        |    invocation_expression
        |    this_access
        |    base_access
        |    array_creation_expression DOT IDENTIFIER type_argument_list?
        |    array_creation_expression OP_INC
        |    array_creation_expression OP_DEC
        |    array_creation_expression OP_PTR IDENTIFIER
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        )   (DOT IDENTIFIER type_argument_list?
            |    OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET
            |    OP_INC
            |    OP_DEC
            |    OP_PTR IDENTIFIER
            )*
    ;

Then we inline primary_no_array_creation_expression itself. Again we extract the trailing into a temporary subrule and in-factor its reference, before we do another SLRR.

...

The situation is analogous to member_expression above.

Code Block
invocation_expression
    :   (array_creation_expression
        |    literal
        |    simple_name
        |    parenthesized_expression
        |    predefined_type DOT IDENTIFIER type_argument_list?
        |    qualified_alias_member DOT IDENTIFIER type_argument_list?
        |    this_access
        |    base_access
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        )   (DOT IDENTIFIER type_argument_list?
            |    OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET
            |    OP_INC
            |    OP_DEC
            |    OP_PTR IDENTIFIER
            |    OPEN_PARENS argument_list? CLOSE_PARENS
            )+
    ;

The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OPEN_PARENS argument_list? CLOSE_PARENS is the last matched sequence.

...

We have to repeat the same steps again as for invocation_expression to get:

Code Block
post_increment_expression
    :   (array_creation_expression
        |    literal
        |    simple_name
        |    parenthesized_expression
        |    predefined_type DOT IDENTIFIER type_argument_list?
        |    qualified_alias_member DOT IDENTIFIER type_argument_list?
        |    this_access
        |    base_access
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        )   (DOT IDENTIFIER type_argument_list?
            |    OPEN_PARENS argument_list? CLOSE_PARENS
            |    OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET
            |    OP_DEC
            |    OP_PTR IDENTIFIER
            |    OP_INC
            )+
    ;

The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OP_INC is the last matched sequence.

...

We have to repeat the same steps again as for invocation_expression to get:

Code Block
post_decrement_expression
    :   (array_creation_expression
        |    literal
        |    simple_name
        |    parenthesized_expression
        |    predefined_type DOT IDENTIFIER type_argument_list?
        |    qualified_alias_member DOT IDENTIFIER type_argument_list?
        |    this_access
        |    base_access
        |    object_creation_expression
        |    delegate_creation_expression
        |    anonymous_object_creation_expression
        |    typeof_expression
        |    checked_expression
        |    unchecked_expression
        |    default_value_expression
        |    anonymous_method_expression
        |    sizeof_expression
        )   (DOT IDENTIFIER type_argument_list?
            |    OPEN_PARENS argument_list? CLOSE_PARENS
            |    OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET
            |    OP_INC
            |    OP_PTR IDENTIFIER
            |    OP_DEC
            )+
    ;

The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OP_DEC is the last matched sequence.

...

Of course, the change of the grammar results in the recognition of a different language. To prevent this one can either use actions like shown with pointer_type to disallow invalid structures. Or one can use 2-pass compiler and check in the next pass specifically for correctness. This approach permits to turn the rules in the first pass even more uniformous, so maybe only one superset rule remains. An additional advantage are also the already mentioned improved error messages.