...

In summary, there is no ANTLR option that enables case insensitivity, as this is hard or impossible to do completely correctly, taking into account all possible internationalization issues. Therefore you need to implement a custom LA(int) to provide the case-insensitive behavior you want.

Handle case insensitivity directly in a grammar

Following the FAQ on abbreviated keywords, we can write a token to accept letters of either case:

Code Block

SELECT : ('S'|'s')('E'|'e')('L'|'l')('E'|'e')('C'|'c')('T'|'t') ;

The following awk script will generate the above:

Code Block

#!/usr/bin/awk -f
{
  printf("%s : ", toupper($0));
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1);
    if (toupper(c) != tolower(c))
      printf("('%s'|'%s')", toupper(c), tolower(c));
    else
      printf("('%s')", c);
  }
  print " ;";
}

You may find the following form easier to use, read, and maintain:

Code Block

SELECT : S E L E C T ;

fragment C : 'c' | 'C';
fragment E : 'e' | 'E';
fragment L : 'l' | 'L';
fragment S : 's' | 'S';
fragment T : 't' | 'T';

Of course, you will need a fragment for each letter (potentially A through Z) that appears in any lexer rule you want to be case insensitive.
Note also that calling a fragment rule for each character may affect lexer performance; test with your typical input to see whether it helps or hurts.
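
Writing the fragments by hand gets tedious for a long keyword list. As a rough sketch only, the Java helper below prints the fragment-style token rules and just the letter fragments that a set of keywords actually uses. `CaseFragmentGenerator`, `tokenRule`, and `fragmentRules` are hypothetical names invented for this example and are not part of ANTLR; the sketch assumes keywords made of ASCII letters, digits, and underscores.

Code Block

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper (not part of ANTLR): generates fragment-style lexer rules.
// Assumes keywords contain only ASCII letters, digits, and underscores.
final class CaseFragmentGenerator {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** Builds a token rule such as "SELECT : S E L E C T ;" for one keyword. */
    static String tokenRule(String keyword) {
        StringBuilder rule = new StringBuilder(keyword.toUpperCase(Locale.ROOT)).append(" :");
        for (char c : keyword.toCharArray()) {
            if (isAsciiLetter(c)) {
                // Letters refer to the per-letter fragment rule.
                rule.append(' ').append(Character.toUpperCase(c));
            } else {
                // Anything else is matched literally.
                rule.append(" '").append(c).append("'");
            }
        }
        return rule.append(" ;").toString();
    }

    /** Builds one "fragment X : 'x' | 'X';" line per distinct letter, in alphabetical order. */
    static String fragmentRules(List<String> keywords) {
        TreeSet<Character> letters = new TreeSet<>();
        for (String keyword : keywords) {
            for (char c : keyword.toCharArray()) {
                if (isAsciiLetter(c)) {
                    letters.add(Character.toUpperCase(c));
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (char upper : letters) {
            char lower = Character.toLowerCase(upper);
            lines.add("fragment " + upper + " : '" + lower + "' | '" + upper + "';");
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("select", "from");
        for (String keyword : keywords) {
            System.out.println(tokenRule(keyword));
        }
        System.out.println();
        System.out.println(fragmentRules(keywords));
    }
}
```

Generating only the letters in use keeps the grammar short; adding a keyword later just means re-running the helper and pasting the output over the old rules.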

Java - Implement a custom File or String Stream and Override LA

...

Code Block
/// <summary>
/// Look ahead for tokenizing is all lowercase, whereas the original case of an input stream is preserved.
/// </summary>
public class CaseInsensitiveStringStream : ANTLRStringStream {
    public CaseInsensitiveStringStream(char[] data, int numberOfActualCharsInArray) : base(data, numberOfActualCharsInArray) {}

    public CaseInsensitiveStringStream() {}

    public CaseInsensitiveStringStream(string input) : base(input) {}

    // Only the lookahead is converted to lowercase. The original case is preserved in the stream.
    public override int LA(int i) {
        if (i == 0) {
            return 0;
        }

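        if (i < 0) {
            i++; // e.g., LA(-1) refers to the previous character, data[p - 1]
            if (((p + i) - 1) < 0) {
                return (int) CharStreamConstants.EOF; // no character before the start of the stream
            }
        }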

        if (((p + i) - 1) >= n) {
            return (int) CharStreamConstants.EOF;
        }

        return Char.ToLowerInvariant(data[(p + i) - 1]); // This is how "case insensitive" is defined here; a specific culture could be used instead.
    }
}
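
To illustrate why only the lookahead is lowercased, here is a minimal, self-contained Java sketch of the same idea. It deliberately does not extend any ANTLR class: `LowercaseLookaheadDemo`, `la`, `consume`, and `matchKeyword` are stand-ins invented for this example, so treat it as a model of the behavior rather than as runtime code. It uses `Character.toLowerCase`, which is one possible definition of case folding.

Code Block

```java
// Stand-in for a character stream; hypothetical, not an ANTLR runtime class.
final class LowercaseLookaheadDemo {
    static final int EOF = -1;

    private final char[] data;
    private int p = 0; // index of the next character to be consumed

    LowercaseLookaheadDemo(String input) {
        this.data = input.toCharArray();
    }

    /** Returns the character i positions ahead (1-based), folded to lower case. */
    int la(int i) {
        if (i == 0) {
            return 0; // undefined position, mirroring the stream above
        }
        if (i < 0) {
            i++; // la(-1) is the previously consumed character
            if (p + i - 1 < 0) {
                return EOF;
            }
        }
        if (p + i - 1 >= data.length) {
            return EOF;
        }
        return Character.toLowerCase(data[p + i - 1]);
    }

    void consume() {
        if (p < data.length) {
            p++;
        }
    }

    /**
     * Tries to match an all-lowercase keyword using only la().
     * On success, consumes it and returns the ORIGINAL text; otherwise returns null.
     */
    String matchKeyword(String lowercaseKeyword) {
        for (int k = 0; k < lowercaseKeyword.length(); k++) {
            if (la(k + 1) != lowercaseKeyword.charAt(k)) {
                return null;
            }
        }
        int start = p;
        for (int k = 0; k < lowercaseKeyword.length(); k++) {
            consume();
        }
        // The underlying buffer was never modified, so the case is preserved.
        return new String(data, start, p - start);
    }

    public static void main(String[] args) {
        LowercaseLookaheadDemo stream = new LowercaseLookaheadDemo("SeLeCt x");
        System.out.println(stream.matchKeyword("select"));
    }
}
```

Because matching sees only lowercase characters, the grammar can spell its keywords in lowercase, while the text of each token is still read from the untouched buffer.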
