Grammar Design Patterns
Implementing precedence rules
add  : mult ('+'^ mult)* ;          // left association
mult : pow ('*'^ pow)* ;            // left association
pow  : atom ('^'^ pow)? ;           // right association
atom : ID | INT | '('^ add ')'! ;   // recursion
It is often useful to create imaginary tokens for the branch nodes.
Achieving the same tree shape then requires the more involved rewrite-rule syntax.
tokens { ADD; MULT; POW; ATOM; }
add  : (mult -> mult) ('+' m=mult -> ^(ADD[$add] $add $m))* ;
mult : (pow -> pow) ('*' p=pow -> ^(MULT[$mult] $mult $p))* ;
pow  : (atom '^' p=pow -> ^(POW[$pow] atom $p)) | (atom -> atom) ;
atom : ID | INT | '(' add ')' -> ^(ATOM[$atom] add) ;
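The layering used in the rules above (one rule per precedence level, a loop for left association, recursion for right association) can also be sketched by hand in Java. This is only an illustration: the class and method names are hypothetical, and the sketch returns a LISP-style tree string rather than building ANTLR tree nodes.

```java
// Hypothetical hand-written counterpart of the add/mult/pow/atom rules.
// It returns a LISP-style tree string so the grouping is visible.
final class PrecedenceSketch {
    private final String src;
    private int pos = 0;

    private PrecedenceSketch(String src) {
        this.src = src.replace(" ", "");
    }

    static String toTree(String text) {
        PrecedenceSketch p = new PrecedenceSketch(text);
        String tree = p.add();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return tree;
    }

    // add : mult ('+' mult)* ;  the loop folds operands to the left
    private String add() {
        String tree = mult();
        while (peek() == '+') {
            pos++;
            tree = "(+ " + tree + " " + mult() + ")";
        }
        return tree;
    }

    // mult : pow ('*' pow)* ;  binds tighter than '+' because add() calls it
    private String mult() {
        String tree = pow();
        while (peek() == '*') {
            pos++;
            tree = "(* " + tree + " " + pow() + ")";
        }
        return tree;
    }

    // pow : atom ('^' pow)? ;  recursing on the right operand groups to the right
    private String pow() {
        String left = atom();
        if (peek() == '^') {
            pos++;
            return "(^ " + left + " " + pow() + ")";
        }
        return left;
    }

    // atom : ID | INT | '(' add ')' ;  parentheses restart at the lowest level
    private String atom() {
        if (peek() == '(') {
            pos++;
            String inner = add();
            expect(')');
            return inner;
        }
        int start = pos;
        while (Character.isLetterOrDigit(peek())) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("expected ID or INT at " + pos);
        }
        return src.substring(start, pos);
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
        pos++;
    }
}
```

With this sketch, PrecedenceSketch.toTree("a+b+c") groups to the left and PrecedenceSketch.toTree("2^3^2") groups to the right, which mirrors the association comments in the grammar.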
Detecting statement terminator
In many languages (Python, Visual Basic, Bash, etc.) a statement may be terminated by either a newline or a semicolon.
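As a rough illustration of the idea, the following Java sketch treats both characters as equivalent terminators and drops the empty statements that result from sequences such as ";\n". The class name is hypothetical, and the sketch deliberately ignores string literals, comments and line continuations, which a real lexer would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits source text into statements, treating
// ';' and '\n' as interchangeable terminators.
final class TerminatorSketch {
    static List<String> statements(String source) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == ';' || c == '\n') {
                flush(current, out);   // either terminator ends the statement
            } else {
                current.append(c);
            }
        }
        flush(current, out);           // a final statement needs no terminator
        return out;
    }

    // Adds the pending text as a statement unless it is blank, then resets it.
    private static void flush(StringBuilder current, List<String> out) {
        String s = current.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
        current.setLength(0);
    }
}
```

In a grammar the same effect is usually obtained by letting one terminator token stand for either character, rather than by splitting the text up front.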
String containing escaped characters
This is a case of an 'island-grammar'.
At some point the content of the string will need to be transformed.
For example "\u55\u6e\u69\u63\u6f\u64\u65" would be transformed into "Unicode".
When should this transformation be performed?
By the lexer?
It is generally a bad practice to modify the input stream.
ESC
    : '\\'
      ( 'n'  {this.setText("\n");}
      | 't'  {this.setText("\t");}
      | 'v'  {this.setText("\013");}   // vertical tab
      | 'b'  {this.setText("\b");}
      | 'r'  {this.setText("\r");}
      | 'f'  {this.setText("\f");}     // form feed
      | 'a'  {this.setText("\007");}   // bell
      | '\\' {this.setText("\\");}
      | '?'  {this.setText("?");}
      | '\'' {this.setText("'");}
      | '"'  {this.setText("\"");}
      | OCTDIGIT (OCTDIGIT? OCTDIGIT)?
        { // $text still starts with the backslash, so skip one character
          this.setText(String.valueOf((char) Integer.parseInt($text.substring(1), 8))); }
      | 'x' HEXDIGIT HEXDIGIT?
        { // skip the backslash and the 'x'
          this.setText(String.valueOf((char) Integer.parseInt($text.substring(2), 16))); }
      | 'u' HEXDIGIT ((HEXDIGIT? HEXDIGIT)? HEXDIGIT)?
        { // skip the backslash and the 'u'
          this.setText(String.valueOf((char) Integer.parseInt($text.substring(2), 16))); }
      )
    ;
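An alternative is to have the lexer match the whole string literal as a single token, leaving the escape sequences untouched for a later stage to decode.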
fragment MARKER : '"' ;
ESCCHAR : '\\' ;
LITERAL : MARKER (options {greedy=false;}: ESCCHAR . | .)* MARKER ;
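When the raw text of a string literal is kept intact, the decoding has to happen somewhere after lexing. A minimal Java sketch of such a decoding step follows; the class and method names are hypothetical, octal escapes are left out for brevity, and unknown escapes are assumed to stand for the escaped character itself.

```java
// Hypothetical post-lexing decoder for the raw body of a string literal
// (the text between the quote marks). Octal escapes are not handled here.
final class EscapeSketch {
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i++);
            if (c != '\\' || i == raw.length()) {
                out.append(c);               // ordinary char, or a trailing backslash
                continue;
            }
            char e = raw.charAt(i++);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'f': out.append('\f'); break;
                case 'b': out.append('\b'); break;
                case 'v': out.append('\013'); break;   // vertical tab
                case 'a': out.append('\007'); break;   // bell
                case 'x':
                case 'u': {
                    // \x takes up to 2 hex digits, \u up to 4
                    int max = (e == 'x') ? 2 : 4;
                    int end = i;
                    while (end < raw.length() && end - i < max
                            && Character.digit(raw.charAt(end), 16) >= 0) {
                        end++;
                    }
                    if (end == i) {          // no digits: keep the letter as-is
                        out.append(e);
                        break;
                    }
                    out.append((char) Integer.parseInt(raw.substring(i, end), 16));
                    i = end;
                    break;
                }
                default:
                    out.append(e);           // covers \\ \" \' \? and unknown escapes
            }
        }
        return out.toString();
    }
}
```

Because the token text is never rewritten, the original spelling of the literal stays available for error messages and for tools that need to reproduce the source exactly.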
By the parser?
UNICODE_LITERAL : '\\u' HEXDIGIT ((HEXDIGIT? HEXDIGIT)? HEXDIGIT)? ;
literal returns [char value]
    : UNICODE_LITERAL
      { // skip the backslash and the 'u' before parsing the hex digits
        $value = (char) Integer.parseInt($text.substring(2), 16); }
    ;
By the renderer?
contextBody(foo,bar) ::= << <foo; format="toUpper"> <bar; format="decode"> >>
Here 'bar' is the escaped string.
In this case the renderer itself could be a lexer/parser for the regular language of escape sequences.
Processing regular expressions
Many languages provide support for regular expressions.
This is a case of an 'island-grammar'.
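One way to picture the island is that the host scanner only finds where the regular expression ends and then hands the body to a separate parser. The Java sketch below assumes a /body/ literal syntax and uses java.util.regex.Pattern as the island parser; both choices, and all names, are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical island handling: the host side only locates the literal,
// the regex dialect itself is parsed by java.util.regex.Pattern.
final class RegexIslandSketch {
    // Given source.charAt(start) == '/', returns the index just past the
    // closing '/'. A backslash escapes the next character, so "\/" does
    // not end the literal.
    static int endOfRegexLiteral(String source, int start) {
        int i = start + 1;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\') {
                i += 2;
                continue;
            }
            if (c == '/') {
                return i + 1;
            }
            i++;
        }
        throw new IllegalArgumentException("unterminated regex literal");
    }

    // Extracts the body between the slashes and hands it to the island parser.
    static Pattern compileIsland(String source, int start) {
        int end = endOfRegexLiteral(source, start);
        return Pattern.compile(source.substring(start + 1, end - 1));
    }
}
```

The host grammar never needs to understand the regular expression syntax; it only needs the rule for where the literal stops.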