Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

Had a nice lunch with Mihai Surdeanu today at Stanford. Mihai does natural language processing research and has used ANTLR in the past to process English text. He asked for 2 things: tokens that can be in more than one token class (token type) at the same time and the ability to get all interpretations of ambiguous input. Sam Harwell is also interested in getting all interpretations.

Tokens in multiple classes

John Aycock and Nigel Horspool wrote this awesome paperback in 2001: Schrodinger’s token. The idea is to allow tokens to be both identifiers and keywords, for example. In a natural language setting, one might want VERB and NOUN or something more specific to a domain.

The simplest implementation is to split the integer token type in class Token into 2 regions: one for specific token types, and then the upper bits as a bit set to specify which class the token can be in. Of course that would limit us to a certain number of classes because there's a finite number of bits in an integer, but let's start with that. If we want to handle the keywords can be identifiers problem, we would create specific token types for the keywords and then use ID as a token class in the upper bits. That way we could say something like this in the grammar:

// allow IF IF THEN THEN=IF;
stat: IF expr THEN stat
    | ID '=' expr ';'
    ;
expr: ID ;

The lexical rules would look like this:

IF   : 'if'   {$type |= ID;} ; // add to the ID class
THEN : 'then' {$type |= ID;} ; // add to the ID class
ID   : [a-z]+ ;

The code generated for the parser would look like this:

void stat() {
    if ( lookahead==IF ) {
    }
    else {
        match(ID);
        match(EQUALS);
        expr();
        match(SEMI);
    }
}
void expr() { match(ID); }

All of the magic happens in match(t). If the token type requested (the parameter t) is greater than the fixed token type region, then it should treat the token type as a bit mask to use against the incoming token. Otherwise, it just works as normal match(t), checking equality.

Specifying an Ambiguity Resolver

  • No labels