Faster expression parsing for ANTLR

I should be working on something else but got to thinking about how annoying it is specifying expressions in recursive descent parsers. You have to have a new rule for each precedence level. This is also very slow. Just to match 34 it has to descend about 15 method calls. I built a prototype single-rule (plus primary and suffix) operator matching thingie which I enclose below. I should be able to generate it from some metameta syntax in antlr. For example, this is what I would really like to specify when doing expressions rather than the 15 or so rules to define the precedence levels:

e   :   primary
    |   primary '^' e // right associative
    |   e '*' e
    |   ('++'|'-') e
    |   e ('+'|'-') e
    |   suffix
    |   e '++'
    ;

suffix
    :   e '.' ID
    |   e '[' e ']'
    |   e '(' e (',' e)* ')'
    ;

primary
    :   '(' e ')'
    |   INT
    |   ID
    ;

The beauty of this is that it would automatically know how to build ASTs (or with a little bit of help from you for the complicated ones like method calls and array indexing). Further, the precedence specified as usual by the order of the alternatives. Anything that starts and stops with e is considered a binary (or possibly trinary) operator. Any other alternative that starts with e is a suffix operation. Any other alternative that ends with e is a unitary operation. Alts that are simply rule references identify groups of operations at the same level such as the suffix rule.

Perhaps as Gavin Lambert pointed out, we simply need to do associativity with an option:

e   :   primary
    |   exponent
    |   e '*' e
    |   ('++'|'-') e
    |   e ('+'|'-') e
    |   e '.' ID // higher than array/method call
    |   suffix
    |   e '++'
    ;

exponent
options {associativity = right;}
    :   e '^' e
    ;

suffix
    :   e '[' e ']'
    |   e '(' e (',' e)* ')'
    ;

primary
    :   '(' e ')'
    |   INT
    |   ID
    ;

The one funny thing about all this is that we like to see usually things in order from lowest to highest precedence for these kinds of expression rules, but ANTLR assumes everything is highest to lowest precedence.

I could generate something like the following:

/** Test faster recursive-descent expression parsing.
*  Goal: avoid recursing for *each* precedence level.
*  Recurse for changes in precedence, avoiding repeated
*  tests for each level.  The key is passing into expression
*  the min precedence level to match, which is based upon
*  previous operator.
*
*     v3.1 required due to tree bulding bug fix).
*
*  Based upon "precedence climbling" by Theodore Norvell:
*      http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm
*
*  NOTES:
*      1. Holy crap!  This actually works!
*      2. Should be able to autogenerate from list of operators, precedence
*         etc...
*/
grammar T;

options { output=AST; ASTLabelType=CommonTree; }

tokens { PREINC; POSTINC; CALL; INDEX; }

@members {
public static final int LEFT = 1;
public static final int RIGHT = 2;
static int[] prec = new int[tokenNames.length];
static int[] uprec = new int[tokenNames.length];
static int[] postprec = new int[tokenNames.length];
static int[] assoc = new int[tokenNames.length];
static {
   for (int i=0; i<tokenNames.length; i++) { assoc[i]=LEFT; }
   prec[PLUS] = 1;
   prec[MINUS] = 1;
   prec[STAR] = 3;
   prec[CARET] = 4;
   prec[DOT] = 6;
   assoc[CARET] = RIGHT;

   uprec[MINUS] = 2;       // sits between +/- binary and * binary ops
   postprec[LPAREN] = 5;   // lower than DOT for p.f()
   postprec[LBRACK] = 5;
   postprec[INC] = 5;
}

int nextp(int p) {
   int prevOpType = input.LA(-1);
   if ( assoc[prevOpType]==LEFT ) return prec[prevOpType]+1;
   else return prec[prevOpType];
}
}

expr : e[0] {System.out.println($e.tree.toStringTree());} ;

/** This could be autogenerated if you give me primary and suffix and precedence levels */
e[int p]
   :   (primary->primary)
       (   {prec[input.LA(1)]>=p}?=>     bop r=e[nextp(p)] -> ^(bop $e $r)
       |   {postprec[input.LA(1)]>=p}?=> suffix[$e.tree]   -> {$suffix.tree}
       )*
   ;

primary
   :   INT
   |   ID
   |   uop^ {int q=uprec[input.LA(-1)];} e[q]
   |   '(' expr ')' -> expr
   ;

suffix[CommonTree lhs]
   :   t='[' expr ']'                -> ^(INDEX[$t] {$lhs} expr)
   |   t='(' (expr (',' expr)*)? ')' -> ^(CALL[$t] {$lhs} expr*)
   |   t='++'                        -> ^(POSTINC[$t] {$lhs})
   ;

bop :   '+' | '-' | '*' | '^' | '.' ;

uop :   '-' | t='++' -> PREINC[$t];

INC : '++';
LPAREN : '(' ;
LBRACK : '[' ;
PLUS: '+';
MINUS: '-';
STAR: '*';
DOT : '.';
CARET:'^';
INT : '0'..'9'+ ;
ID  : 'a'..'z'+ ;
WS  : (' '|'\n')+ ;

Another really cool feature is that none of the target code generators would have to change because all of this happens sort of like a macro preprocessor within ANTLR.