Quick Starter on Parser Grammars - No Past Experience Required

Introduction 

You need a parser and want to use ANTLR, but you have never learned how to write parser grammars? Then the following tutorial should teach you the very basics of the following matters:

  • EBNFs
  • Left recursion
  • Grammar ambiguities
  • Difference between lexer and parser rules
  • How to define tokens
  • How to build a grammar from a language specification

Nonetheless, studying a few books will be helpful: the free Basics of Compiler Design and Compiler Construction PDFs and the famous dragon book in its newest edition for understanding the theory of compilers, and The Definitive ANTLR Reference for an in-depth understanding of the intricate details of the ANTLR software and of how to write complex grammars. The use of ANTLRWorks is heavily recommended. Note: If the downloaded ANTLRWorks file ends with .zip, rename it to .jar! All sample grammars below are fully functional, except that at the beginning of the test file one has to add

grammar Grammarname;

where Grammarname is the name of the test file, minus the suffix .g (the grammar name and the file name must match).

EBNFs 

EBNF means Extended Backus-Naur Form, the notation used for describing parser grammars. The basic form of an EBNF rule is simple:

a : b;

and describes that symbol a on the left side can be replaced with the symbol b on the right side. If b is itself a rule, b is also replaced with its right side, and so on, until no symbol referencing a rule remains. The number and order of symbols, usually called tokens, on the right side is unlimited and arbitrary. It is possible to denote several replacement alternatives with the | symbol - the chosen alternative depends on the input. To discriminate between tokens which occur only in rules and tokens which also occur in the input, a single quote is used as string delimiter (character escaping works as in Java). A basic example of this behaviour is exhibited by this grammar:

utterance : greeting | exclamation;
greeting : interjection subject;
exclamation : 'Hooray!';
interjection : 'Hello';
subject : 'World!';

This grammar matches the inputs "Hello World!" and "Hooray!". (Note the use of lowercase and uppercase letters; the importance of this is explained later.) So far I have only described basic BNF - the extended notations and their meanings are:

  • () - Parentheses. Used to group several elements, so they are treated as one single token
  • ? - Any token followed by ? occurs 0 or 1 times
  • * - Any token followed by * can occur 0 or more times
  • + - Any token followed by + can occur 1 or more times
  • . - Matches any single character/token
  • ~ - The character/token following the ~ may not occur at the current place
  • .. - Placed between two characters, .. spans a range which accepts every character between both boundaries, inclusive

Examples are:

integer : (HEX_PREFIX | OCTAL_PREFIX)? DIGITS;            // Matches HEX_PREFIX DIGITS or OCTAL_PREFIX DIGITS or simply DIGITS
DIGITS : '1'..'9' '0'..'9'*;                              // Matches one digit between 1 and 9 and followed by an arbitrary
                                                          // number of digits between 0 and 9
ML_COMMENT : '/*' ( options {greedy=false;} : . )* '*/' ; // Matches multiline comments: everything from "/*" to the next "*/".
                                                          // (greedy=false makes the dot stop at the next "*/" instead of
                                                          // consuming everything between the first "/*" and the last "*/"
                                                          // when a file contains several multiline comments.)
NO_LOWERCASE_LETTERS : (~('a'..'z'))+;                    // Matches any non-empty character string that contains
                                                          // no lowercase letters
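
The EBNF operators above map closely onto regular-expression operators, which can make them easier to grasp. As a rough illustration (this is Python's re module, not how ANTLR works internally), the DIGITS and NO_LOWERCASE_LETTERS rules can be mimicked like this:

```python
import re

# Rough regex equivalents of the lexer rules above (illustrative only, not
# generated by ANTLR): '1'..'9' becomes the class [1-9], * keeps its meaning,
# and ~('a'..'z') becomes the negated class [^a-z].
DIGITS = re.compile(r"[1-9][0-9]*$")
NO_LOWERCASE_LETTERS = re.compile(r"[^a-z]+$")

print(bool(DIGITS.match("1024")))               # True: digit 1-9, then digits 0-9
print(bool(DIGITS.match("0123")))               # False: may not start with 0
print(bool(NO_LOWERCASE_LETTERS.match("ABC")))  # True: no lowercase letters
print(bool(NO_LOWERCASE_LETTERS.match("AbC")))  # False: contains 'b'
```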

Left recursion

A popular mistake is to use left recursion with LL parsers like ANTLR (LR parsers like yacc can handle this kind of construct). Recursion in general means that an alternative of rule a references rule a again; it is allowed, as otherwise the set of possible inputs would be finite. Left recursion occurs when rule a is referenced directly at the beginning of one of its alternatives, like so:

a : a B
   | C
   ;

The problem is that ANTLR would enter an infinite loop, as every invocation of rule a immediately invokes rule a again. (In reality, parser generation aborts upon detecting a left-recursive rule, so the grammar nonetheless has to be changed.) The solution is a simple rewrite like:

a : C B*;
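
To see why the original rule loops, consider this hand-written recursive-descent sketch (plain Python, not ANTLR output; the rule and token names are the made-up a, B, C from above):

```python
def parse_left_recursive(tokens, depth=0):
    """Direct transcription of 'a : a B | C;'. The first alternative invokes
    rule a again before consuming a single token, so nothing ever changes
    between calls; we cap the depth here instead of blowing the stack."""
    if depth > 5:
        raise RecursionError("left recursion: no token consumed after 5 calls")
    return parse_left_recursive(tokens, depth + 1)

def parse_rewritten(tokens):
    """Transcription of the rewritten rule 'a : C B*;'."""
    if not tokens or tokens[0] != "C":
        raise SyntaxError("expected C")
    pos = 1
    while pos < len(tokens) and tokens[pos] == "B":
        pos += 1                      # B* consumes zero or more trailing Bs
    return pos == len(tokens)         # True if the whole input matched

print(parse_rewritten(["C", "B", "B"]))   # True
try:
    parse_left_recursive(["C", "B"])
except RecursionError as err:
    print(err)
```

Both functions accept the same language (one C followed by any number of Bs), but only the rewritten form makes progress through the input on every step.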

Grammar ambiguities

Even if a grammar describes a certain language, it can still cause problems during recognition. Ambiguities result if a certain input can be matched by two or more different rule invocations. The solution is either to rewrite the grammar, which may result in a less readable grammar, or to add syntactic predicates, which enforce an ordering of the possibilities. Both topics belong to the advanced category and are discussed in detail in the ANTLR book. They are mentioned in this beginner tutorial only because they can explain why a valid grammar doesn't work the way one would expect.

Another problem related to grammar ambiguities occurs when ANTLR doesn't find a viable start rule. Viable start rules are rules that are not referenced by any other rule, but when recursion is used ANTLR may still complain about unreachable alternatives. This can be solved in two ways: the first is to add a rule like "new_start : old_start;". The second and easier way is to add the token "EOF" (End Of File) at the end of the supposed start rule. To prevent the problem from ever occurring, the author recommends always adding EOF, as there are no further side effects unless the start rule is used recursively, as in the following example:

propsHash
    :    '{' atom (COMMA atom)* '}' EOF
    ;

atom
    :    propsHash    |    other_alternatives
    ;

In this case you have to create an extra start rule where you reference both the recursive rule and EOF:

input
    :	propsHash EOF
    ;

propsHash
    :    '{' atom (COMMA atom)* '}'
    ;
atom
    :    propsHash    |    other_alternatives
    ;

Difference between lexer and parser rules

The use of EBNFs doesn't distinguish between lexer and parser - only between terminal and non-terminal symbols. Terminal symbols describe the input, while non-terminal symbols describe the tree structure behind the input. In practice, terminal symbols are recognized by a lexer and non-terminal symbols are recognized by a parser. ANTLR imposes the convention that lexer rules start with an uppercase letter and parser rules with a lowercase letter. Lexer rules contain only literals (alongside the EBNF symbols; literals can be both single characters and longer strings) or references to other lexer rules. Parser rules may reference parser and lexer rules as they wish and may even include literals, but should never consist of literals alone. See the following sample grammar:

decl : 'int' ID '=' INT ';' // E.g., "int x = 3;"
     | 'int' ID ';'         // E.g., "int x;"
     ;
ID   : ('a'..'z' | 'A'..'Z')+;
INT  : '0'..'9'+;
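
This division of labour can be sketched in plain Python (a hand-rolled illustration, not ANTLR-generated code; the token names INT_KW, EQ, and SEMI are made up here): the lexer turns characters into token types, and the parser checks the token sequence against the decl rule.

```python
import re

# Lexer rules as (name, pattern) pairs. Simplification: the first rule that
# matches wins, whereas a real lexer like ANTLR's prefers the longest match.
TOKEN_SPEC = [("INT_KW", r"int"), ("ID", r"[a-zA-Z]+"),
              ("INT", r"[0-9]+"), ("EQ", r"="), ("SEMI", r";"), ("WS", r"\s+")]

def lex(text):
    """Terminal symbols: turn the raw input into a list of token types."""
    tokens, pos = [], 0
    while pos < len(text):
        for ttype, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if ttype != "WS":          # whitespace is skipped
                    tokens.append(ttype)
                pos += m.end()
                break
        else:
            raise SyntaxError("no token matches at position %d" % pos)
    return tokens

def parse_decl(tokens):
    """Non-terminal: decl : 'int' ID '=' INT ';' | 'int' ID ';' ;"""
    return tokens in (["INT_KW", "ID", "EQ", "INT", "SEMI"],
                      ["INT_KW", "ID", "SEMI"])

print(parse_decl(lex("int x = 3;")))   # True
print(parse_decl(lex("int x;")))       # True
print(parse_decl(lex("x = 3;")))       # False
```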

How to define tokens

The default behaviour of ANTLR is that lexer rule names are automatically elevated to token status. The literals are also available, but they are not assigned a symbolic name. If you want to use symbolic names instead of literals in parser rules, you can create lexer rules which match only those literals. A second alternative is the special tokens command. The tokens command has two advantages: the first is that the definitions of symbolic names are all collected in one place. The second is that keywords defined that way take priority over the rule for normal variable names, which in general would also match the keywords. For instance, the keyword "import" would also be matched by the more general rule "('a'..'z')+".

Additionally, one can define imaginary tokens to be used in ASTs (Abstract Syntax Trees) for multipass grammars (another advanced topic not treated here). In such grammars it is also common to factor out subrules for increased readability, although no token for those rules is actually required. In that case the keyword fragment is used before the rule's name, effectively inlining the rule. A few samples to showcase the possibilities:

tokens {
   IMAGINARY_TOKEN;
   EQUALS='=';
}
UNICODE_CHAR
   : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
   ;
fragment
HEX_DIGIT
   : '0'..'9'|'a'..'f'|'A'..'F'
   ;

The author's recommendation is to use ordinary rules and the tokens command. Using a literal and a lexer rule for one and the same input sequence gives the lexer rule priority, thus generating a different token than the one expected by the parser rule. This error is not easy to spot for inexperienced users. It is considered easier to use separate lexer and parser grammars, because then this kind of error cannot happen.

Another point of interest is the order of token declarations. The earlier a token is defined, the higher its precedence when a certain input can be matched by two or more tokens. This means that keywords defined with the tokens command are matched as keywords instead of by a more general ID rule. The following code snippet provides an example:

start
   :  (WS |  FOO)* EOF
   ;

WS : (options {greedy=false;} : ' '+) ;
FOO : ~('x' | 'y' | 'z')+ ;

If you give an input containing only spaces, then WS will be chosen. Should one change the rule order so that FOO comes before WS, then FOO will be chosen. Any input containing characters other than spaces will match FOO, even if two or more WS and FOO tokens could be produced: lexer rules greedily match the maximum number of applicable characters.
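
This precedence behaviour can be mimicked in a small Python lexer (an illustrative sketch, not ANTLR's actual algorithm): at each position every rule is tried, the longest match wins, and when two rules match the same length, the one listed first wins.

```python
import re

def lex(text, rules):
    """rules is an ordered list of (name, regex). Longest match wins;
    ties go to the rule listed earlier."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:          # earlier rules win ties,
            m = re.match(pattern, text[pos:])
            if m and m.end() > best_len:     # because only a strictly
                best_name, best_len = name, m.end()  # longer match replaces
        if best_name is None:
            raise SyntaxError("no rule matches at position %d" % pos)
        tokens.append(best_name)
        pos += best_len
    return tokens

# The WS and FOO rules from above: spaces match both, so order decides.
WS_FIRST = [("WS", r" +"), ("FOO", r"[^xyz]+")]
FOO_FIRST = [("FOO", r"[^xyz]+"), ("WS", r" +")]

print(lex("   ", WS_FIRST))    # ['WS']  - spaces only, WS listed first
print(lex("   ", FOO_FIRST))   # ['FOO'] - same input, FOO listed first
print(lex("a b", WS_FIRST))    # ['FOO'] - FOO greedily swallows the spaces too
```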

How to build a grammar from a language specification

Language specifications often only showcase example inputs. The first step is to abstract the syntax from the input and then to think of suitable rules which validate the language correctly. Let's assume that a parser has to recognize the following set of commands:

  • get data from file
  • put data in file
  • change data in file
  • get metadata about file
  • change metadata about file
  • get dependencies of file
  • get dependents of file
  • get statistics of file

The abstraction boils down to: verb object preposition target. A basic ANTLR3 grammar would be:

grammar CommandLanguage;

tokens {
    GET='get';
    PUT='put';
    CHANGE='change';
    DATA='data';
    METADATA='metadata';
    DEPENDENCIES='dependencies';
    DEPENDENTS='dependents';
    STATISTICS='statistics';
    FROM='from';
    IN='in';
    ABOUT='about';
    OF='of';
}

command
    :    sentence (NEWLINE sentence)* NEWLINE? EOF
    |
    ;

sentence
    :    WS? verb WS object WS preposition WS target WS?
    ;

verb
    :    GET
    |    PUT
    |    CHANGE
    ;

object
    :    DATA
    |    METADATA
    |    DEPENDENCIES
    |    DEPENDENTS
    |    STATISTICS
    ;

preposition
    :    FROM
    |    IN
    |    ABOUT
    |    OF
    ;

target
    :    FILE
    ;

FILE
    : ('a'..'z'|'A'..'Z'|'0'..'9'|'.')+
    ;

NEWLINE
    : '\r'? '\n'
    ;

WS
    : (' '|'\t')+
    ;

This grammar accepts some invalid sentences like "change dependencies in file". It is possible to accept only valid sentences by changing the grammar rules, but this would make the grammar much less readable. Instead, use semantic predicates, which are described in the ANTLR book and on the wiki and mailing list. There are no actions in this simple example that would process the parsed data; right now the above grammar only recognizes valid input by accepting or rejecting it. The Expression Evaluator demonstrates how actions can be used.
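
For comparison, the same abstraction (verb object preposition target) can be checked with a hand-written Python sketch (illustrative only, not equivalent to the generated parser; splitting on whitespace stands in for the WS handling of the grammar). It also shows where a semantic check on valid verb/object combinations would hook in:

```python
# Word classes taken from the tokens section of the grammar above.
VERBS = {"get", "put", "change"}
OBJECTS = {"data", "metadata", "dependencies", "dependents", "statistics"}
PREPOSITIONS = {"from", "in", "about", "of"}

def recognize(line):
    """Mirrors 'sentence : verb object preposition target' for one line."""
    words = line.split()
    if len(words) != 4:
        return False
    verb, obj, prep, target = words
    # A semantic predicate would additionally check that (verb, obj) is a
    # valid combination; like the grammar, this recognizer does not.
    return verb in VERBS and obj in OBJECTS and prep in PREPOSITIONS

print(recognize("get data from file"))           # True
print(recognize("change dependencies in file"))  # True - syntactically valid
print(recognize("delete data from file"))        # False - unknown verb
```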

Todo:

  • Add Java test main() for example grammar
  • Add text how to generate Java files from grammar
  • Add text how to write keyword-less grammars
  • Add warning about too aggressive left recursion removal
  • Add warning about misuse of the dot operator (".+" in a grammar rule)
  • Look if other ANTLR tutorials can provide additional hints for missing pieces
  • Update index