ANTLR and Shroedinger's Tokens

Had a nice lunch with Mihai Surdeanu today at Stanford. Mihai does natural language processing research and has used ANTLR in the past to process English text. He asked for 2 things: tokens that can be in more than one token class (token type) at the same time and the ability to get all interpretations of ambiguous input. Sam Harwell is also interested in getting all interpretations.

Tokens in multiple classes

John Aycock and Nigel Horspool wrote this awesome paperback in 2001: Schrodinger’s token. The idea is to allow tokens to be both identifiers and keywords, for example. In a natural language setting, one might want VERB and NOUN or something more specific to a domain.

The simplest implementation is to split the integer token type in class Token into 2 regions: one for specific token types, and then the upper bits as a bit set to specify which class the token can be in. Of course that would limit us to a certain number of classes because there's a finite number of bits in an integer, but let's start with that. If we want to handle the keywords can be identifiers problem, we would create specific token types for the keywords and then use ID as a token class in the upper bits. That way we could say something like this in the grammar:

tokens {ID<0x00010000>;} // specify ID's token type value (<0x00010000> is new syntax)
// allow IF IF THEN THEN=IF;
stat: IF expr THEN stat
    | ID '=' expr ';'
    ;
expr: ID ;

The lexical rules would look like this:

IF   : 'if'   {$type |= ID;} ; // add to the ID class
THEN : 'then' {$type |= ID;} ; // add to the ID class
ID   : [a-z]+ ;

The code generated for the parser would look like this:

void stat() {
    if ( lookahead==IF ) {
    }
    else if ( lookahead&ID>0 ) { // can't use switch here
        match(ID);
        match(EQUALS);
        expr();
        match(SEMI);
    }
    else error();
}
void expr() { match(ID); }

All of the magic happens in match(t). If the token type requested (the parameter t) is greater than the fixed token type region, then it should treat the token type as a bit mask to use against the incoming token. Otherwise, it just works as normal match(t), checking equality.

Actually, the ParserATNSimulator that does the prediction would also need to understand this splitting of the token type integers so that he could do appropriate comparisons like match().

We could imagine allocating the 1st 12 bits or so of a token type integer for the fixed token types and then we are free to use bits 13 above when we define tokens, as I've done with the new syntax above.

If we really needed many more classes, we could define our own Token implementation that recorded an arbitrary bit set. This would require overriding match() in Parser.java. The key part that ANTLR needs to do is to define the token names like IF, ID, and so on so that the grammar can reference them. Hmm..we also need to specify a new ATN simulator so that we could override the functionality there.

We could do all of this with a handbuilt lexer. Mihai was telling me that for his domain specific application, Law documents, the only fixed token is WORD. Everything else is a domain specific token class. A word might be important in one context, BARRED, but not in another.

One final thought: rather than having a fixed pivot point like the 12 bit, we could do this arbitrarily. If we specify 0x00010000 for a token class, then we are using bits 16 and above. If we use 0x00000100, then we are limiting fixed token types to the and the 1st 8 bits.

Oh crap. ANTLR uses sets sometimes for efficiency. For example if you use (A|B|C), it will generate code to match a set rather than a switch for the subrule. Set (ID|';') would try to add a big number for ID into an IntervalSet. That might cause trouble.  Also, I currently generate arrays that assume contiguous token types up to a maximum. A token type of 0x10000000 would certainly generate a big wasteful array. Sounds to me like we shouldn't try to trick ANTLR you into doing this and we should formalize the notion of Shroedinger's Tokens.   Perhaps we need to be able define token classes like we define tokens:

classes {ID;} // (new syntax)

Actually, maybe we just add a Token implementation that has a separate field, tokenClasses, that we can check separately. match() would have to have some kind of signal however to check the extra field. Probably ANTLR would have to generate match(tokentype, tokenclass), but that's no big deal. It would require formal definitions like the classes section just shown. Hmm...well, this is a good start to remind me of my thoughts if I can get time to implement this.

Specifying an Ambiguity Resolver

This part seems easy. All I have to do is add a hook in ParserATNSimulator that let's people specify a new resolution mechanism. Right now, the code areas that detect ambiguities, all resolve things by calling getMinElement(). Instead of this I could simply make a a call to some kind of ParserAmbiguityHandler.getPredictedAlt(...) with lots of parameters that give enough information to make a decision. To get all possible interpretations, all we have to do is see the same input to the parser multiple times and have the ambiguity handler rotate through the possible alternatives for each ambiguous decision. When there are no more alternatives to choose from at the ambiguous points, we can stop.

Here is everything I passed to the parser listener when I see an ambiguity:

    /** If context sensitive parsing, we know it's ambiguity not conflict */
    public void reportAmbiguity(DFA dfa, DFAState D, int startIndex, int stopIndex,
                                IntervalSet ambigAlts,
                                ATNConfigSet configs)