...

The simplest implementation is to split the integer token type in class Token into two regions: the lower bits for specific token types and the upper bits as a bit set specifying which token classes the token belongs to. Of course, that limits us to a certain number of classes because there is a finite number of bits in an integer, but let's start with that. To handle the keywords-can-be-identifiers problem, we would create specific token types for the keywords and then use ID as a token class in the upper bits. That way we could say something like this in the grammar:

Code Block
languagejava
tokens {ID<0x00010000>;} // specify ID's token type value (new syntax)
// allow IF IF THEN THEN=IF;
stat: IF expr THEN stat
    | ID '=' expr ';'
    ;
expr: ID ;

...

Code Block
languagejava
void stat() {
    if ( lookahead==IF ) {            // stat: IF expr THEN stat
        match(IF); expr(); match(THEN); stat();
    }
    else if ( (lookahead&ID)!=0 ) {   // bit-set test against the ID class; can't use switch here
        match(ID);
        match(EQUALS);
        expr();
        match(SEMI);
    }
    else error();
}
void expr() { match(ID); }

All of the magic happens in match(t). If the requested token type (the parameter t) lies above the fixed token-type region, match() should treat it as a bit mask to test against the incoming token's type. Otherwise, it works like the normal match(t), checking for equality.
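
Here is a minimal sketch of such a match(t), written against the hand-built stat()/expr() style above rather than the real ANTLR runtime; the FIXED_TYPE_MASK value, the lookahead field, and the consume()/error() helpers are assumptions for illustration only.

Code Block
languagejava
// Sketch only (not the ANTLR runtime). Assumes the low 12 bits of a token
// type hold the fixed token types and the upper bits hold token-class flags.
static final int FIXED_TYPE_MASK = 0x0FFF;   // hypothetical boundary

void match(int t) {
    boolean ok;
    if ( t > FIXED_TYPE_MASK ) {
        // t names a token class: treat it as a bit mask over the class bits.
        ok = (lookahead & t) != 0;
    }
    else {
        // t is a specific token type: ordinary equality check on the fixed
        // region (the mask strips any class bits the token might also carry).
        ok = (lookahead & FIXED_TYPE_MASK) == t;
    }
    if ( ok ) consume(); else error();
}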

Actually, the ParserATNSimulator that does the prediction would also need to understand this splitting of the token type integers so that it could do the appropriate comparisons, just like match().

We could imagine allocating the first 12 bits or so of a token type integer for the fixed token types; then we are free to use bits 13 and above when we define tokens, as I've done with the new syntax above.

If we really needed many more classes, we could define our own Token implementation that records an arbitrary bit set. That would require overriding match() in Parser.java. The key thing ANTLR needs to do is define the token names like IF, ID, and so on so that the grammar can reference them.
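
A rough sketch of that variant, again in the hand-built style rather than against ANTLR's real Parser and Token classes; the ClassedToken name, the CLASS_BASE constant, and the lookaheadToken()/consume()/error() helpers are invented for illustration.

Code Block
languagejava
import java.util.BitSet;

// Hypothetical: token types below CLASS_BASE are specific types; values at or
// above it name token classes, stored in a BitSet so any number of classes fits.
static final int CLASS_BASE = 0x1000;

class ClassedToken {
    int type;                        // specific token type, e.g. IF
    BitSet classes = new BitSet();   // class memberships, e.g. ID for keywords
}

void match(int t) {
    ClassedToken lt = lookaheadToken();    // current lookahead token
    boolean ok = (t < CLASS_BASE)
        ? lt.type == t                     // ordinary equality check
        : lt.classes.get(t - CLASS_BASE);  // class membership check
    if ( ok ) consume(); else error();
}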

We could do all of this with a hand-built lexer. Mihai was telling me that for his domain-specific application, law documents, the only fixed token type is WORD. Everything else is a domain-specific token class: a word might be important in one context, e.g. BARRED, but not in another.
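
For the hand-built lexer case, the classification could happen entirely in the lexer: every token starts out as a WORD, and class bits are ORed in based on a lookup table or surrounding context. A tiny sketch, with the WORD and BARRED values and the classBits table invented for illustration:

Code Block
languagejava
import java.util.Map;

// Hypothetical hand-built lexer fragment: WORD is the only fixed token type,
// and domain-specific class bits are added per word (or per context).
static final int WORD   = 1;
static final int BARRED = 0x00020000;   // example domain token class bit

static final Map<String,Integer> classBits = Map.of(
    "barred", BARRED
);

int tokenTypeFor(String text) {
    // Start from the fixed type, then OR in whatever domain classes apply;
    // a real lexer could also consult surrounding context here.
    return WORD | classBits.getOrDefault(text.toLowerCase(), 0);
}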

Specifying an Ambiguity Resolver