How do I get case insensitivity?

Discussions

There is some discussion about this subject (ANTLR v3) here: http://www.antlr.org/pipermail/antlr-interest/2006-May/016269.html

In summary, there is no ANTLR option that enables case insensitivity, because it is hard or impossible to do completely correctly once all possible internationalization issues are taken into account. Therefore you need to implement a custom LA(int) to provide the case-insensitive behavior you want.
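
To illustrate the kind of internationalization issue meant here, the following small Java sketch compares a per-character case mapping (what a custom LA(int) typically does) with a full string mapping. It uses only the standard library; the class name CaseMappingPitfalls and the sample word are made up for this illustration and are not part of ANTLR.

```java
import java.util.Locale;

final class CaseMappingPitfalls {
    // Upper-cases one char at a time, as a custom LA(int) typically would.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // German sharp s (U+00DF) has no single-character upper-case form,
        // so the per-character mapping leaves it unchanged, while the
        // full string mapping expands it to "SS".
        System.out.println(upperPerChar("stra\u00DFe"));
        System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT));

        // Turkish dotless i (U+0131) upper-cases to plain 'I', so after
        // upper-casing it can no longer be told apart from 'i'.
        System.out.println(
            Character.toUpperCase('\u0131') == Character.toUpperCase('i'));
    }
}
```

Whether either behavior is acceptable depends on the language being lexed, which is why the decision is left to your own LA(int).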

Java - Implement a custom File or String Stream and Override LA

Specify the keywords that you wish to be recognized in a case insensitive way completely in UPPER case in the lexer.

Then, create your own input stream class whose LA always returns the upper-case version of the lookahead character. Token matching is then done in UPPER case, but the text contained in the tokens preserves the original case. Here is an example:

/*
 * ANTLRNoCaseFileStream.java
 *
 * Created on January 25, 2008, 2:12 PM
 */

package org.antlr.runtime;

import java.io.*;

/**
 *
 * @author Jim Idle
 */
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
    public ANTLRNoCaseFileStream(String fileName) throws IOException {
        super(fileName, null);
    }

    public ANTLRNoCaseFileStream(String fileName, String encoding)
    throws IOException {
        super(fileName, encoding);
    }

    public int LA(int i) {
        if ( i==0 ) {
            return 0; // undefined
        }
        if ( i<0 ) {
            i++; // e.g., translate LA(-1) to use offset 0
            if ( (p+i-1) < 0 ) {
                return CharStream.EOF; // invalid; no char before first char
            }
        }

        if ( (p+i-1) >= n ) {
            return CharStream.EOF;
        }
        return Character.toUpperCase(data[p+i-1]);
    }
}
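
To see the effect without an ANTLR installation, here is a minimal standalone sketch of the same idea. It is not ANTLR code: the class UpperCaseLookaheadDemo, its fields, and the match helper are hypothetical stand-ins for a character stream and a lexer rule, and negative or zero lookahead offsets are deliberately not handled.

```java
final class UpperCaseLookaheadDemo {
    static final int EOF = -1;

    // Hypothetical stand-in for an ANTLR character stream: data holds the
    // original text, p is the index of the next character to consume.
    private final char[] data;
    private int p = 0;

    UpperCaseLookaheadDemo(String input) {
        this.data = input.toCharArray();
    }

    // Same idea as the LA override above, simplified to i >= 1 only:
    // the lookahead is upper-cased, the stored data is left untouched.
    int LA(int i) {
        int index = p + i - 1;
        if (i < 1 || index >= data.length) {
            return EOF;
        }
        return Character.toUpperCase(data[index]);
    }

    // Tries to match an UPPER-case keyword at the current position.
    // Returns the matched text in its original case, or null if no match.
    String match(String upperKeyword) {
        for (int i = 1; i <= upperKeyword.length(); i++) {
            if (LA(i) != upperKeyword.charAt(i - 1)) {
                return null;
            }
        }
        String text = new String(data, p, upperKeyword.length());
        p += upperKeyword.length();
        return text;
    }

    public static void main(String[] args) {
        UpperCaseLookaheadDemo in = new UpperCaseLookaheadDemo("SeLeCt");
        System.out.println(in.match("SELECT")); // prints "SeLeCt"
    }
}
```

The keyword is written in UPPER case only, yet the returned text keeps the case the user typed, which is the behavior described for the tokens above.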

C Target. Set the input stream to use an UPPER case LA

Support for case-insensitive matching is built into the C target input streams. To use it, specify all your keyword/lexer tokens in UPPER CASE only and make the method call shown in the example below before using the input stream. The built-in function is fairly simplistic, so if you need to handle internationalization issues, consult the source for the LA function in antlr3inputstream.c, make any adjustments you need, and install a pointer to your function in place of the standard one (which is really only Latin-1 compatible).

input = antlr3NewAsciiStringInPlaceStream(src, len, fileName);
input->setUcaseLA(input, ANTLR3_TRUE);

C# Target. Implement a custom look ahead (LA)

Here is sample code for C# (which is very easy to port to different languages):

Note: if you use this, the tokens in your grammar must be all lower case!

/// <summary>
/// Look ahead for tokenizing is all lowercase, whereas the original case of an input stream is preserved.
/// </summary>
public class CaseInsensitiveStringStream : ANTLRStringStream {
    public CaseInsensitiveStringStream(char[] data, int numberOfActualCharsInArray) : base(data, numberOfActualCharsInArray) {}

    public CaseInsensitiveStringStream() {}

    public CaseInsensitiveStringStream(string input) : base(input) {}

    // Only the lookahead is converted to lowercase. The original case is preserved in the stream.
    public override int LA(int i) {
        if (i == 0) {
            return 0;
        }

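        if (i < 0) {
            i++; // e.g., translate LA(-1) to use offset 0
            if (((p + i) - 1) < 0) {
                return (int) CharStreamConstants.EOF; // no character before the first one
            }
        }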

        if (((p + i) - 1) >= n) {
            return (int) CharStreamConstants.EOF;
        }

        return Char.ToLowerInvariant(data[(p + i) - 1]); // This is how "case insensitive" is defined, i.e., could also use a special culture...
    }
}

Handle case insensitivity directly in a grammar

Following the FAQ on abbreviated keywords, we can write a token to accept letters of either case:

SELECT : ('S'|'s')('E'|'e')('L'|'l')('E'|'e')('C'|'c')('T'|'t') ;

The following awk script will generate the above:

#!/usr/bin/awk -f
{
  printf("%s : ", toupper($0));
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1);
    if (toupper(c) != tolower(c))
      printf("('%s'|'%s')", toupper(c), tolower(c));
    else
      printf("('%s')", c);
  }
  print " ;";
}

You may find the following easier to use, read, and maintain:

SELECT : S E L E C T ;

fragment C : 'c' | 'C';
fragment E : 'e' | 'E';
fragment L : 'l' | 'L';
fragment S : 's' | 'S';
fragment T : 't' | 'T';

Of course, you will need a fragment for each letter (i.e. A through Z) used in any of the lexical rules for which you want case insensitivity. Also note that calling a fragment rule for each character may affect performance; test with your typical input to see whether it helps or degrades performance.
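
Writing out 26 fragments by hand is tedious, so one option is to generate them. The following Java sketch prints a fragment rule for every ASCII letter and turns a keyword into the corresponding lexer rule. The class name CaseFragmentGenerator and its method names are made up for this example, and it assumes keywords consist of the letters A through Z only.

```java
import java.util.Locale;

final class CaseFragmentGenerator {
    // Builds one line per ASCII letter, e.g. "fragment A : 'a' | 'A';".
    static String fragments() {
        StringBuilder sb = new StringBuilder();
        for (char upper = 'A'; upper <= 'Z'; upper++) {
            char lower = Character.toLowerCase(upper);
            sb.append("fragment ").append(upper)
              .append(" : '").append(lower)
              .append("' | '").append(upper).append("';\n");
        }
        return sb.toString();
    }

    // Turns a keyword such as "select" into "SELECT : S E L E C T ;".
    // Assumes the keyword consists of ASCII letters only.
    static String keywordRule(String keyword) {
        String upper = keyword.toUpperCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(upper).append(" :");
        for (int i = 0; i < upper.length(); i++) {
            sb.append(' ').append(upper.charAt(i));
        }
        return sb.append(" ;").toString();
    }

    public static void main(String[] args) {
        System.out.println(keywordRule("select")); // prints "SELECT : S E L E C T ;"
        System.out.print(fragments());
    }
}
```

Paste the output into the lexer section of your grammar and delete any fragments your rules do not reference.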