Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagejava
tokens {ID<0x00010000>;} // specify ID's token type value (<0x00010000> is new syntax)
// allow IF IF THEN THEN=IF;
stat: IF expr THEN stat
    | ID '=' expr ';'
    ;
expr: ID ;

...

If we really needed many more classes, we could define our own Token implementation that recorded an arbitrary bit set. This would require overriding match() in Parser.java. The key part that ANTLR needs to do is to define the token names like IF, ID, and so on so that the grammar can reference them. Hmm..we also need to specify a new ATN simulator so that we could override the functionality there.

We could do all of this with a handbuilt lexer. Mihai was telling me that for his domain specific application, Law documents, the only fixed token is WORD. Everything else is a domain specific token class. A word might be important in one context, BARRED, but not in another.

One final thought: rather than having a fixed pivot point like the 12 bit, we could do this arbitrarily. If we specify 0x00010000 for a token class, then we are using bits 16 and above. If we use 0x00000100, then we are limiting fixed token types to the and the 1st 8 bits.

Oh crap. ANTLR uses sets sometimes for efficiency. For example if you use (A|B|C), it will generate code to match a set rather than a switch for the subrule. Set (ID|';') would try to add a big number for ID into an IntervalSet. That might cause trouble.  Also, I currently generate arrays that assume contiguous token types up to a maximum. A token type of 0x10000000 would certainly generate a big wasteful array. Sounds to me like we shouldn't try to trick ANTLR you into doing this and we should formalize the notion of Shroedinger's Tokens.   Perhaps we need to be able define token classes like we define tokens:

Code Block
languagejava
classes {ID;} // (new syntax)

Actually, maybe we just add a Token implementation that has a separate field, tokenClasses, that we can check separately. match() would have to have some kind of signal however to check the extra field. Probably ANTLR would have to generate match(tokentype, tokenclass), but that's no big deal. It would require formal definitions like the classes section just shown. Hmm...well, this is a good start to remind me of my thoughts if I can get time to implement this.

Specifying an Ambiguity Resolver

This part seems easy. All I have to do is add a hook in ParserATNSimulator that let's people specify a new resolution mechanism. Right now, the code areas that detect ambiguities, all resolve things by calling getMinElement(). Instead of this I could simply make a a call to some kind of ParserAmbiguityHandler.getPredictedAlt(...) with lots of parameters that give enough information to make a decision. To get all possible interpretations, all we have to do is see the same input to the parser multiple times and have the ambiguity handler rotate through the possible alternatives for each ambiguous decision. When there are no more alternatives to choose from at the ambiguous points, we can stop.

Here is everything I passed to the parser listener when I see an ambiguity:

Code Block
languagejava
    /** If context sensitive parsing, we know it's ambiguity not conflict */
    public void reportAmbiguity(DFA dfa, DFAState D, int startIndex, int stopIndex,
                                IntervalSet ambigAlts,
                                ATNConfigSet configs)