How do I handle abbreviated keywords?

Given a language that accepts abbreviated forms of the following keywords:

Keyword    Abbreviations
print      pr, pri, prin, print
read       re, rea, read
fold       fo, fol, fold
apply      ap, app, appl, apply

There are two basic strategies for supporting these abbreviated keywords in ANTLR - we'll call them the explicit and implicit strategies.

Explicit strategy - handle abbreviations explicitly in a grammar

In the explicit strategy, we add each keyword and all its abbreviations directly to the ANTLR grammar. This is the simpler solution: each keyword, together with its abbreviated forms, appears explicitly as a lexer token rule in our grammar file. All we need to do is ensure that all such rules appear before any generic IDENTIFIER rule, and ANTLR will sort out any potential ambiguities.

PRINT : 'pr' ( 'i' ( 'n' ( 't' )? )? )? ;

READ  : 're' ( 'a' ( 'd' )? )? ;

FOLD  : 'fo' ( 'l' ( 'd' )? )? ;

APPLY : 'ap' ( 'p' ( 'l' ( 'y' )? )? )? ;
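The four rules above follow a regular pattern: a mandatory prefix followed by one nested optional group per remaining character. When there are more than a handful of keywords, the rules could be generated rather than typed by hand. The following is a minimal sketch in Java (ANTLR's default target language); the class AbbreviationRuleGenerator and its rule() method are hypothetical helpers written for this illustration and are not part of ANTLR.

```java
import java.util.Locale;

// Hypothetical helper, not part of ANTLR: builds the text of an explicit
// lexer rule for a keyword whose shortest accepted abbreviation has
// minLength characters.
class AbbreviationRuleGenerator {

    static String rule(String keyword, int minLength) {
        if (minLength < 1 || minLength > keyword.length()) {
            throw new IllegalArgumentException("minLength out of range for " + keyword);
        }
        StringBuilder sb = new StringBuilder();
        // Rule name is the keyword in upper case, followed by the mandatory prefix.
        sb.append(keyword.toUpperCase(Locale.ROOT))
          .append(" : '")
          .append(keyword, 0, minLength)
          .append("'");
        // Each remaining character opens one nested optional group.
        for (int i = minLength; i < keyword.length(); i++) {
            sb.append(" ( '").append(keyword.charAt(i)).append("'");
        }
        // Close the groups from the innermost outwards.
        for (int i = minLength; i < keyword.length(); i++) {
            sb.append(" )?");
        }
        return sb.append(" ;").toString();
    }

    public static void main(String[] args) {
        // Prints one rule per keyword, ready to paste into the grammar.
        for (String keyword : new String[] { "print", "read", "fold", "apply" }) {
            System.out.println(rule(keyword, 2));
        }
    }
}
```

The generated text would be pasted into the grammar ahead of the IDENTIFIER rule. Apart from the alignment spaces after the rule names, it should match the hand-written rules above.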

Implicit strategy - handle abbreviations implicitly with an action

The explicit strategy works very well when we have only a few keywords, as shown above. If our language has many more keywords (posts to the list mention languages with 40, 50 or more keywords and, in one particular case, nearly 1000 keywords!), the explicit strategy can become tiresome to implement.

In this situation, we would like to avoid having to write an explicit rule for each keyword and all its abbreviations in our grammar. Rather, we would like a generic mechanism whereby we declare our keywords and abbreviations and that's it. The keywords and abbreviations are handled implicitly, without explicit rules in the grammar. Because the implicit strategy no longer has a token rule for each keyword, we need to define each keyword's token type explicitly. For this we use ANTLR's tokens block.

tokens
{
	PRINT;
	READ;
	FOLD;
	APPLY;
	.....
}

We then need a generic rule that matches all keywords and their abbreviations. Most grammars already have such a rule in any case - the generic IDENTIFIER rule. What we would like to do is to construct our IDENTIFIER rule such that when it encounters one of our keywords or abbreviations, it tags the resulting token with one of our predefined token types rather than the generic IDENTIFIER token type. To do this we add an action to the IDENTIFIER rule as shown below:

IDENTIFIER
	: 	( 'a'..'z' | ... )
		{ $type = CheckKeywordsTable($text); }
	;

This emulates the testLiterals functionality that was available in ANTLR V2. The method CheckKeywordsTable() is called for each recognized IDENTIFIER string and it has a chance to change the returned token type using whatever logic is required for our application. Please note that CheckKeywordsTable() is not a built-in method. You will have to declare it as a member in your grammar file.

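@lexer::members
{
    private int CheckKeywordsTable(string lexeme)
    {
        // Your custom logic goes here.
        // In most cases, this would be a map/dictionary lookup that
        // returns the keyword's token type when the lexeme is found.

        // Anything that is not a keyword or abbreviation stays an IDENTIFIER.
        return IDENTIFIER;
    }
}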

In most cases, CheckKeywordsTable() might simply consult an IDictionary<string,int> map of all keywords (including abbreviations). It must return IDENTIFIER for every lexeme that is neither a keyword nor an abbreviation.
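One way such a table might be built is sketched below in Java (ANTLR's default target language); the same idea carries over to a C# IDictionary<string,int>. Everything here is an assumption made for illustration: the KeywordTable class, its add(), check() and defaultTable() methods, and the numeric token type values, which in a real lexer are constants generated by ANTLR. Instead of listing every abbreviation by hand, the sketch registers each prefix of a keyword that is at least a given minimum length.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java counterpart of CheckKeywordsTable(), for illustration only.
class KeywordTable {
    // Placeholder token types; in a real lexer these constants are generated by ANTLR.
    static final int IDENTIFIER = 4;
    static final int PRINT = 5;
    static final int READ = 6;
    static final int FOLD = 7;
    static final int APPLY = 8;

    private final Map<String, Integer> table = new HashMap<>();

    // Registers the keyword and every prefix of it with at least minLength characters.
    void add(String keyword, int minLength, int tokenType) {
        for (int len = minLength; len <= keyword.length(); len++) {
            String abbreviation = keyword.substring(0, len);
            Integer previous = table.putIfAbsent(abbreviation, tokenType);
            // Two keywords sharing an abbreviation cannot be told apart by the lexer.
            if (previous != null && previous != tokenType) {
                throw new IllegalStateException("Ambiguous abbreviation: " + abbreviation);
            }
        }
    }

    // Returns the keyword's token type, or IDENTIFIER for any other lexeme.
    int check(String lexeme) {
        return table.getOrDefault(lexeme, IDENTIFIER);
    }

    // Builds the table for the four example keywords, each abbreviable down to two letters.
    static KeywordTable defaultTable() {
        KeywordTable t = new KeywordTable();
        t.add("print", 2, PRINT);
        t.add("read", 2, READ);
        t.add("fold", 2, FOLD);
        t.add("apply", 2, APPLY);
        return t;
    }

    public static void main(String[] args) {
        KeywordTable t = defaultTable();
        System.out.println(t.check("pri") == PRINT);          // abbreviation of print
        System.out.println(t.check("printer") == IDENTIFIER); // not a keyword
    }
}
```

Registering the table once, for example in a static initializer of the lexer, keeps the per-token cost to a single hash lookup. Whether lookups should be case-sensitive depends on the language being parsed and is not addressed in this sketch.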