Another solution for Island Grammars Under Parser Control

Author: Break

I have read the original article on this same problem, so I just copy the original problem statement here:

The Problem

The ANTLR 3 examples include 'island-grammar' which shows how to parse part of the input using an alternative grammar by invoking a new Lexer+Parser combination on the input when a certain token is recognized (specifically, when a start-of-comment marker is seen).

This demonstrates a nice, simple case, where the lexer can identify the embedded language on its own. More difficult to handle is the case where only the parser is able to see the point where the embedded language begins in the input.

Consider that in the 'island-grammar' example, the Lexer of the 'outer' grammar will spot the start of the section to be handled by the island grammar when it sees the token '/**' (the start of a Javadoc-style comment section). This token has no other meaning within the language being defined in simple.g.
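That lexer-driven dispatch can be sketched in a few lines of plain Java. This is a hypothetical toy of my own (class and method names are mine, not the actual simple.g machinery): on seeing '/**', hand the enclosed text to the island handler and resume the outer scan after '*/'.

```java
import java.util.ArrayList;
import java.util.List;

public class IslandDispatch {
    /** Splits input into outer-language chunks and comment-island chunks. */
    public static List<String> scan(String input) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = input.indexOf("/**", i);
            if (start < 0) { chunks.add("CODE:" + input.substring(i)); break; }
            if (start > i) chunks.add("CODE:" + input.substring(i, start));
            int end = input.indexOf("*/", start + 3);   // end of the island
            if (end < 0) end = input.length();
            // a real implementation would invoke the island Lexer+Parser here
            chunks.add("DOC:" + input.substring(start + 3, Math.min(end, input.length())));
            i = Math.min(end + 2, input.length());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(scan("int x; /** doc */ int y;"));
        // -> [CODE:int x; , DOC: doc , CODE: int y;]
    }
}
```

The point of the toy is that the dispatch decision needs nothing but the characters '/**' themselves, which is exactly why the plain island-grammar technique works there.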

If the token that marks the start of the island grammar within your input can also have other uses within your language, the basic 'island-grammar' technique can't be used.
For instance, look at an example of a regular expression literal:

r = / b; f = r/m;
This assigns the regular expression ' b; f = r' (with the flag 'm') to the variable 'r'.

Compare that with the following line of code:

r = a / b; f = r/m;
Here, two statements appear on one line; the assignment of 'a / b' to the variable 'r' and the assignment of 'r/m' to the variable 'f'.

Clearly, the lexer for this language cannot determine, on seeing '/', whether it should represent a DIVIDE token or the start of a REGEX token, since everything that follows the '/' is the same in both of the above examples. It's only the context of what came before the '/' that allows correct recognition.
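That context rule is easy to state in code. A minimal sketch (the types and names are mine, not from the article): the parser knows what kind of token came last, and '/' can only start a regex where an operand cannot have just ended.

```java
public class SlashContext {
    enum Tok { ID, NUMBER, EQUALS, SEMI, DIVIDE, REGEX }

    /** '/' starts a regex only where an operand cannot have just ended. */
    static boolean slashStartsRegex(Tok prev) {
        return prev != Tok.ID && prev != Tok.NUMBER;
    }

    public static void main(String[] args) {
        // r = / b; f = r/m;   -> '/' follows '=', so it starts a regex
        System.out.println(slashStartsRegex(Tok.EQUALS));  // true
        // r = a / b; f = r/m; -> the first '/' follows the ID 'a': division
        System.out.println(slashStartsRegex(Tok.ID));      // false
    }
}
```

This is precisely the information the lexer lacks and the parser has, which is why the solution below pushes the decision into parser actions.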

My solution

My solution uses gated semantic predicates.

Here is a very simple sample grammar that mocks the problem situation (like ECMAScript with regular-expression support):


grammar Sample;
options{ backtrack=true; }
@header{
    package org.liujing.magdown.parser;
    import java.util.*;
    import java.util.logging.*;
    import org.liujing.parser.*;
}
@synpredgate{state.backtracking==0 || doActionInPredicate}

program : statement+ ;

statement : 'var' ID '=' exp ';' ;

exp : exp1 | exp2 ;

exp1 : Number (Slash Number)?
     | regularExp
     ;

exp2 : '"' ID '"' ;

regularExp
@init{ doActionInPredicate = true; }
@after{ doActionInPredicate = false; }
    : slash='/' { consumeReg($slash); }
    ;

Slash: '/';
VAR: 'var';
Equals: '=';
Comma: ';';
ID: ('a'..'z'|'A'..'Z'|'_')+ ;

Number: ('0'..'9')+ ;

WS: ('\t' | '\v' | '\f' | ' ' | '\u00A0' | '\n' | '\r') {skip();} ;

Comment: '//' (~'\n')* '\n' {skip();} ;

Next is the island grammar for parsing regular expressions. Its parser must return the last token of the valid regex content:

grammar JSRegex;

@header{
    package org.liujing.magdown.parser;
    import java.util.logging.*;
}
@members{
    static Logger log = Logger.getLogger(JSRegexParser.class.getName());
}
@lexer::header{
    package org.liujing.magdown.parser;
    import java.util.logging.*;
}
@lexer::members{
    static Logger log = Logger.getLogger(JSRegexLexer.class.getName());
}

expression returns [Token endT] : body eT=END {endT = $eT;} ;

body : s=ANYCHARS {log.fine($s.getText());} ;

ANYCHARS: (~('/'|'\\') | '\\' '\u0000'..'\uFFFE')+ ;
END: '/' ('a'..'z')? ;
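For intuition, here is a plain-Java rendering of the shape that ANYCHARS and END accept. This is a hypothetical helper of my own, not code generated from JSRegex.g: it scans past backslash escapes until the closing '/', then takes one optional flag letter, mirroring END: '/'('a'..'z')?.

```java
public class RegexEnd {
    /** Returns the index just past the regex literal, given the index of the
     *  first body character (right after the opening '/'). */
    static int endOfRegex(String src, int afterSlash) {
        int i = afterSlash;
        // ANYCHARS: any char except '/' or '\', or a '\'-escaped char pair
        while (i < src.length() && src.charAt(i) != '/') {
            if (src.charAt(i) == '\\') i++;   // skip the escaped character
            i++;
        }
        i++;                                   // END's closing '/'
        if (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') {
            i++;                               // END's optional flag letter
        }
        return i;
    }

    public static void main(String[] args) {
        String line = "var XYZ = /1;\\/2/g ;";
        int open = line.indexOf('/');
        // prints the whole literal, escape and flag included: /1;\/2/g
        System.out.println(line.substring(open, endOfRegex(line, open + 1)));
    }
}
```

The index it returns plays the same role as the END token's stop index in the real solution: it is the position the main char stream must be advanced to.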

Now we need to add the consumeReg method used above to Sample.g. It does four things:

  1. It even needs to be executed during backtracking, so we define @synpredgate to rewrite the action-gating condition.
  2. Call the island grammar parser.
  3. Skip the underlying characters to hide the island-grammar content from the main lexer, so that the main parser can totally ignore them.
  4. Memoize the island parser's result across multiple predicate backtracks. We don't want the island parser to be executed multiple times.
@members{
    static Logger log = Logger.getLogger(SampleParser.class.getName());
    private Map<Integer, Integer> memoRegex = new HashMap();
    private boolean doActionInPredicate = false;

    public void consumeReg(Token last){
        //log.fine("backtracking:" + state.backtracking);
        CommonToken startToken = (CommonToken)last;
        int startCharPos = startToken.getStopIndex() + 1;
        Integer endCharPos = memoRegex.get(startCharPos);
        CharStream cs = ((Lexer)input.getTokenSource()).getCharStream();
        if( endCharPos == null ){
            try{
                // rewind to just after '/' in case the main lexer read ahead
                cs.seek(startCharPos);
                JSRegexLexer jslex = new JSRegexLexer(cs);
                JSRegexParser jsparser =
                    new JSRegexParser(new UnbufferedTokenStream(jslex));
                CommonToken endT = (CommonToken)jsparser.expression();
                log.fine("end of regex: "
                    + cs.substring(endT.getStopIndex(), endT.getStopIndex() + 1));
                // skip the regex characters so the main lexer never sees them
                cs.seek(endT.getStopIndex() + 1);
                cs.setCharPositionInLine(endT.getCharPositionInLine() +
                    (endT.getStopIndex() + 1 - endT.getStartIndex()));
                log.fine("back to main :" + (endT.getStopIndex() + 1));
                memoRegex.put(startCharPos, endT.getStopIndex() + 1);
            }catch(Exception e){
                log.log(Level.SEVERE, "island parse failed", e);
            }
        }else{
            // memoized from an earlier backtracking pass: skip directly
            cs.seek(endCharPos);
        }
    }
}
The JSRegexParser and JSRegexLexer above are the parser and lexer for regular expressions, compiled from JSRegex.g.

RemovableTokenStream is an extended TokenStream that I made to manipulate the underlying char stream.

To execute our main parser, it must use RemovableTokenStream instead of CommonTokenStream.

ANTLRInputStream in = new ANTLRInputStream(/* input */);
SampleLexer lexer = new SampleLexer(in);
RemovableTokenStream tokens = new RemovableTokenStream(lexer);
SampleParser p = new SampleParser(tokens);

RemovableTokenStream extends UnbufferedTokenStream. I also recommend using UnbufferedTokenStream for the island lexer, as consumeReg does, since I haven't tested whether CommonTokenStream works with my solution.

public class RemovableTokenStream extends UnbufferedTokenStream{
    private static Logger log = Logger.getLogger(RemovableTokenStream.class.getName());

    public RemovableTokenStream(TokenSource tokenSource){
        super(tokenSource);
    }

    /**
     * Delete all cached tokens that follow the given lookahead token.<br>
     * This method should be invoked in semantic predicates.
     * @param k number of lookahead
     */
    public void cleanRestTokens(int k){
        int curr = p + k - 1;
        CommonToken startToken = (CommonToken)data.get(curr);
        int start = curr + 1;
        if( start < data.size()){
            log.fine("p=" + p
                + " token= [ " + startToken.getStartIndex()
                + "-" + startToken.getStopIndex() + " ]");
            // drop every token cached after LT(k)
            data.subList(start, data.size()).clear();
            // rewind the char stream to just after LT(k)'s last character
            CharStream cs = ((Lexer)getTokenSource()).getCharStream();
            cs.seek(startToken.getStopIndex() + 1);
            log.fine("reset cs pos:" + (startToken.getStopIndex() + 1));
            cs.setCharPositionInLine(startToken.getCharPositionInLine() +
                (startToken.getStopIndex() + 1 - startToken.getStartIndex()));
        }
    }
}
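The core trick in cleanRestTokens, truncating the lookahead cache in place and treating a character index as the re-seek position, can be seen in isolation with plain Java collections. This is a toy sketch with my own names and no ANTLR types:

```java
import java.util.ArrayList;
import java.util.List;

public class CacheTruncate {
    /** Drops every cached token after index curr, as cleanRestTokens does. */
    static List<String> truncateAfter(List<String> cache, int curr) {
        if (curr + 1 < cache.size()) {
            // subList gives a view, so clear() removes from the backing list
            cache.subList(curr + 1, cache.size()).clear();
        }
        return cache;
    }

    public static void main(String[] args) {
        List<String> cache =
            new ArrayList<>(List.of("var", "ID", "=", "/", "1", ";"));
        // keep everything up to the '/' token (index 3), drop the rest
        System.out.println(truncateAfter(cache, 3));   // [var, ID, =, /]
    }
}
```

In the real class, the dropped entries are the tokens the main lexer mis-lexed out of the regex body, and the seek position comes from the kept token's stop index.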

OK, done.

Now you may try the parser on the following sample content:

var XYZ = /1;2/g ;
var ABC = "xxx";
var LJ = 2/2;