Island Grammars Under Parser Control

The Problem

The ANTLR 3 examples include 'island-grammar' which shows how to parse part of the input using an alternative grammar by invoking a new Lexer+Parser combination on the input when a certain token is recognized (specifically, when a start-of-comment marker is seen).

This demonstrates a nice, simple case, where the lexer can identify the embedded language on its own. More dificult to handle is the case where only the parser is able to see the point where the embedded language begins in the input.

Consider that in the 'island-grammar' example, the Lexer of the 'outer' grammar will spot the start of the section to be handled by island grammar when it sees the token '/**' (the start of a Javadoc-style comment section). This token has no other meaning within the language being defined in simple.g.

If the token that marks the start of the island grammar within your input can also has other uses within your language, the basic 'island-grammar' technique can't be used.
For instance, look at an example of a regular expression literal:

r =   / b; f = r/m;

This assigns the regular expression ' b; f = r' (with the flag 'm') to the variable 'r'.

Compare that with the following line of code:

r = a / b; f = r/m;

Here, two statements appear on one line; the assignment of 'a / b' to the variable 'r' and the assignment of 'r/m' to the variable 'f'.

Clearly, the lexer for this language will not be able to determine on seeing '/' if this should represent a DIVIDE token, or the start of a REGEX token, since all that follows the '/' is the same in both of the above examples. It's only the context of what came before the '/' that allows correct recognition.

The Solution

It may be possible to have your lexer track the last significant token emitted, and to process the input following '/' differently depending on whether the token preceding it was an IDENT or an ASSIGN, for instance. This page demonstrates another approach, which is to have the parser direct a temporary switch to an island grammar.

NB Controlling lexer-level behavior from the parser is difficult, and will often not work the way you want. Grammar permitting, it is sometimes possible.

Watch that lookahead

The first problem with altering the way that input is processed from a parser action is that the standard ANTLR 3 TokenStream implementations snarf the entire input in one go at the point of construction, and then feed the parser with tokens one-by-one from an internal list.

Since the tokens of the island grammar are not compatible with tokens of the outer grammar, we need to avoid this behavior (imagine the regular expression literal /"/ which consists of just a double-quote, if the outer-grammar were to try and interpret that, it would bomb-out thinking it was looking at an unterminated string literal) .

The metaas project includes an example implementation of TokenStream that lazily loads tokens from the underlying TokenSource (it also implements a doubly-linked list of Token objects, for other reasons):

http://svn.badgers-in-foil.co.uk/metaas/trunk/src/main/java/uk/co/badgersinfoil/metaas/impl/antlr/LinkedListTokenStream.java

Building a parser to use this TokenStream will then look like,

ANTLRReaderStream charstream = new ANTLRReaderStream(in);
        AS3ParserLexer lexer = new AS3ParserLexer(charstream);
        LinkedListTokenSource linker = new LinkedListTokenSource(lexer);
        LinkedListTokenStream tokenStream = new LinkedListTokenStream(linker);
        AS3Parser parser = new AS3Parser(tokenStream);

The Grammar

We extend our grammar definiton to deal with this kind of literal, and add the output of the island grammar to the AST being built,

constant
    :    regexpLiteral
    |    HEX_LITERAL
    |    DECIMAL_LITERAL
    |    OCTAL_LITERAL
    |    FLOAT_LITERAL
    |    STRING_LITERAL
    |    TRUE
    |    FALSE
    ;

regexpLiteral
    @init { LinkedListTree re = null; }
    :    '/' { re=handleRegexp(); }  ->  ^( {re} )
    ;

Magic needs to happen in the handleRegexp() method, but it will need access to lots of the internal state of the lexer to work. Specifically, we'll have to provide,

The CharStream implementation (ANTLRReaderStream), so that we can get at the as-yet unprocessed input
The lexer, so that we can potentially re-initialize its input at exactly the point where the island grammar input finishes

These are not normally accessible from the parser, so we need to define a few things in the @parser::members section so that they can be supplied by the calling code:

private AS3ParserLexer lexer;
	private CharStream cs;

	public void setInput(AS3ParserLexer lexer, CharStream cs) {
		this.lexer = lexer;
		this.cs = cs;
	}

Now we should have the pieces we need to implement the handleRegexp() method:

private LinkedListTree handleRegexp() throws RecognitionException {
		String tail = cs.substring(cs.index(), cs.size()-1);
		RegexpParser parser;
		try {
			parser = createRegexpParser(new StringReader(tail), stream);
		} catch (IOException e) {
			throw new RuntimeException(e);
		}
		LinkedListTree ast = ASTUtils.tree(parser.regexp());
		// now, restore in input state of the outer grammar to something sane,
		tail = parser.getInputTail();
		try {
			cs = new ANTLRReaderStream(new StringReader(tail));
		} catch (IOException e) {
			throw new RuntimeException(e);
		}
		lexer.setCharStream(cs);
		LinkedListTokenSource source = (LinkedListTokenSource)stream.getTokenSource();
		stream.setTokenSource(source);  // drop any remaining 'Regexp state' from stream
		source.setDelegate(lexer);  // custom method of LinkedListTokenSource
		return ast;
	}

Conclusion

This does seem to be a pretty complex way of doing things, but it also seems to work. Repeated string-copying in this implementation probably also means that it isn't the speediest solution.

I am currently in the process of trying to use this technique to process E4X XML literals in my target language (see outer grammar AS3.g3 and island grammar E4X.g).