Can I see a more complete example?

Sure you can! This is a real example from Tapestry 5, which includes a simple property expression grammar. In Tapestry 5.0, the parsing is ad hoc, based on regular expressions; in 5.1 it will be ANTLR-based and much more extensive. My first step was to reproduce the ad hoc parsing using ANTLR.

The grammar supports a few case-insensitive keywords (true, false, null and this). It supports string literals in single quotes, plus integer and decimal literals. It has a range literal (e.g., "1..10") that really gets in the way of parsing decimals. Identifiers are either property names or method invocations (marked by a '()' suffix) and can be strung together with "." or "?." (the latter is a "safe dereference" that won't try to invoke methods on nulls).

Here's what I've come up with:

 lexer grammar PELexer;

tokens
{
	// These are extra tokens recognized by the NUMBER_OR_RANGEOP rule.

	// Read a property or invoke a method.
	DEREF;

	// Establishes a range between two ints.
	RANGEOP;

	// Integer constant
	INTEGER;

	// Decimal constant
	DECIMAL;
}

fragment LETTER
	:	('a'..'z'|'A'..'Z');
fragment DIGIT
	:	'0'..'9';
fragment SIGN
	:	('+'|'-');
LPAREN 	:	'(';
RPAREN 	:	')';

fragment QUOTE
	:	'\'';

// Clumsy but effective approach to case-insensitive identifiers.

fragment A
	:	('a' | 'A');
fragment E
	:	('e' | 'E');
fragment F
	:	('f' | 'F');
fragment H
	:	('h' | 'H');
fragment I
	:	('i' | 'I');
fragment L
	: 	('l' | 'L');
fragment N
	:	('n'|'N');
fragment R
	:	('r' | 'R');
fragment S
	:	('s' | 'S');
fragment T
	:	('t' | 'T');
fragment U
	:	('u' | 'U');

// Keywords are case insensitive

NULL 	:	N U L L;
TRUE 	:	T R U E;
FALSE 	:	F A L S E;
THIS 	:	T H I S;

IDENTIFIER
	:	LETTER (LETTER | DIGIT | '_')+;

// The Safe Dereference operator understands not to de-reference through
// a null.

SAFEDEREF
	: 	'?.';

WS 	:	(' '|'\t'|'\n'|'\r')+ { skip(); };

// Literal strings are always inside single quotes.
STRING
	:	QUOTE (options {greedy=false;} : .)* QUOTE { setText(getText().substring(1, getText().length()-1)); };

// Special rule that uses parsing tricks to identify numbers and ranges; it's all about
// the dot ('.').
// Recognizes:
// '.' as DEREF
// '..' as RANGEOP
// INTEGER (sign? digit+)
// DECIMAL (sign? digits* . digits+)
// Has to watch out for embedded rangeop (i.e. "1..10" is not "1." and ".10").

NUMBER_OR_RANGEOP
	:	SIGN? DIGIT+
		(
			{ input.LA(2) != '.' }? => '.' DIGIT* { $type = DECIMAL; }
			| { $type = INTEGER; }
		)
	|	SIGN '.' DIGIT+ { $type = DECIMAL; }
	|	'.'
		(
			DIGIT+ { $type = DECIMAL; }
			| '.' {$type = RANGEOP; }
			| {$type = DEREF; }
		)
	;

The tricky part is NUMBER_OR_RANGEOP. It is a lexer rule, but it never emits a NUMBER_OR_RANGEOP token; instead it uses lookahead to decide among INTEGER, DECIMAL, RANGEOP and DEREF tokens and sets the token type accordingly.

Let's take it apart:


	SIGN? DIGIT+
		(
			{ input.LA(2) != '.' }? => '.' DIGIT* { $type = DECIMAL; }
			| { $type = INTEGER; }
		)

This starts something that may be an integer or the start of a decimal. We match the optional sign and the digits, and the predicate in the curly braces is evaluated just after the digits. LA(1) is the "current" character and LA(2) is the character after it. Normally, a decimal point at this location indicates a DECIMAL, but the predicate turns that alternative off entirely if the character after the '.' is also a '.': that's two dots in a row, the range operator. When we see a range operator coming, we drop down to the other alternative and force the token to be an INTEGER, stopping on the last digit.


	|	SIGN '.' DIGIT+ { $type = DECIMAL; }

This is straightforward: another form of decimal. We separate it out because we don't want the next alternative to start with SIGN? '.', as "+.." and "-.." are nonsensical.


	|	'.'
		(
			DIGIT+ { $type = DECIMAL; }
			| '.' {$type = RANGEOP; }
			| {$type = DEREF; }
		)

Finally we get to a token that starts with a '.' (and no sign) and may be a DECIMAL, a RANGEOP or a DEREF. It reads pretty well: if we can match digits, it's an (unsigned) decimal. If we can match a second '.', it's a RANGEOP. Otherwise it's a '.' followed by something else, so it's just a DEREF.
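To make the decision logic concrete, here is a minimal hand-written sketch in plain Java that mirrors the three alternatives described above. It is not the code ANTLR generates and not Tapestry code; the names DotScanner, Token and scan are hypothetical, chosen only for illustration. It assumes the input is a plain String and returns null when none of the alternatives match.

```java
/** Hypothetical helper that mimics the choices made by NUMBER_OR_RANGEOP. */
final class DotScanner {
    enum Type { INTEGER, DECIMAL, RANGEOP, DEREF }

    /** A token type plus the text it consumed. */
    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    private static boolean isDigit(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    private static boolean isChar(String s, int i, char c) {
        return i < s.length() && s.charAt(i) == c;
    }

    /** Scans one token starting at pos; returns null if no alternative matches. */
    static Token scan(String s, int pos) {
        int i = pos;
        boolean signed = isChar(s, i, '+') || isChar(s, i, '-');
        if (signed) i++;

        if (isDigit(s, i)) {
            // Alternative 1: SIGN? DIGIT+ followed by an optional fraction.
            while (isDigit(s, i)) i++;
            // Consume the '.' only if the character after it is not another '.'.
            if (isChar(s, i, '.') && !isChar(s, i + 1, '.')) {
                i++;
                while (isDigit(s, i)) i++;
                return new Token(Type.DECIMAL, s.substring(pos, i));
            }
            // Either no '.' follows or a range operator follows: stop on the last digit.
            return new Token(Type.INTEGER, s.substring(pos, i));
        }

        if (signed) {
            // Alternative 2: SIGN '.' DIGIT+
            if (isChar(s, i, '.') && isDigit(s, i + 1)) {
                i++;
                while (isDigit(s, i)) i++;
                return new Token(Type.DECIMAL, s.substring(pos, i));
            }
            return null; // inputs such as "+.." and "-.." are rejected
        }

        if (isChar(s, i, '.')) {
            // Alternative 3: '.' then digits, a second '.', or nothing more.
            i++;
            if (isDigit(s, i)) {
                while (isDigit(s, i)) i++;
                return new Token(Type.DECIMAL, s.substring(pos, i));
            }
            if (isChar(s, i, '.')) {
                return new Token(Type.RANGEOP, s.substring(pos, i + 1));
            }
            return new Token(Type.DEREF, s.substring(pos, i));
        }
        return null;
    }
}
```

Calling DotScanner.scan("1..10", 0) stops after the "1" and reports an INTEGER; scanning again from offset 1 reports the RANGEOP, and from offset 3 the second INTEGER.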

The example also demonstrates a few other ideas: a clumsy way to accomplish case-insensitive keywords, and a way to handle quoted string literals (by matching the enclosing quotes and then stripping them off in action code).
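For comparison, the same two ideas can be sketched outside the grammar in plain Java. This is only an illustration under assumptions: the class name LexerTricks and both method names are hypothetical, and the keyword check works on already-matched text rather than character by character as the fragment rules do. The stripQuotes method uses the same substring call as the action in the string literal rule.

```java
import java.util.Locale;

/** Hypothetical helpers illustrating two tricks used by the lexer grammar. */
final class LexerTricks {
    /** Returns the keyword token name for the text, ignoring case, or null if it is not a keyword. */
    static String keywordType(String text) {
        switch (text.toLowerCase(Locale.ROOT)) {
            case "null":  return "NULL";
            case "true":  return "TRUE";
            case "false": return "FALSE";
            case "this":  return "THIS";
            default:      return null;
        }
    }

    /** Drops the first and last character, i.e. the enclosing single quotes. */
    static String stripQuotes(String matched) {
        return matched.substring(1, matched.length() - 1);
    }
}
```

The per-letter fragments in the grammar achieve the same effect inside the lexer itself, at the cost of one fragment rule per letter used in a keyword.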