Lexing XML with ANTLR 3

First of all you have to decide which parts of the input XML file our parser needs. Traditionally, you have a so called lexer that reads through all characters and builds larger chunks called tokens. A token can be a keyword like struct or void in the C programming language or, it can be a string enclosed in quotes. A lexer thus is some sort of preprocessor for the next step, the token parser. You will learn more about the token parser later, here we will look more closely at the lexer.

In ANTLR you describe a lexer using a lexer grammar. To declare a grammar called XMLLexer, you create a text file and add the following lines at the top:

lexer grammar XMLLexer;

What follows are a number of descriptions for tokens, like

TAG_START_OPEN : '<' ;

This description, called rule or production, says "generate a token called TAG_START_OPEN every time you see a '<' in the input character stream" to indicate a tag starts here. The following rule is for the '>' that indicates the end of a tag:

TAG_CLOSE : '>' ;

The token parser which we will inspect in detail later has to make sense of these tokens and find out where a full tag is.

You now have seen something about tags, but what about text? This is a little more complicated as text is anything until a tag begins which is indicated by a '<':

PCDATA : (~'<')+ ;

This rule contains two new concepts. First, everything inside (...)+ must be there at least once, but many also may be there many times more. Then there is this ~ before the~ _'<'~. The ~ means "anything but the character that follows". The complete description thus means the PCDATA token is composed of one or more characters that are not '<'. To illustrate that, look at this stream of characters

This is text<tag>

Rule PCDATA would accept the text This is text and generate a PCDATA token from it. That's fine and just what we wanted. The next token will be TAG_START_OPEN upon seeing the '<'. Hooray, still right!. But upon seeing tag it tries to generate an additional PCDATA token which certainly isn't what we want!. OK, no problem, here comes the rule for the tag name. We call it GENERIC_ID as you can use it for attribute names as well:

GENERIC_ID : ( LETTER | '_' | ':') (NAMECHAR)* ;

Before we investigate if this is appropriate to our problem, we have to explain some more concepts. (...)* is the same as (...)+ except that it also allows for no characters at all. The other new thing is the embedded NAMECHAR. It is a reference to another rule. This special rule describes how a character of an id might look like. You will see that rule later. Finally you have the | which is an or. Thus ( LETTER | '' | ':') means the first character can either be a LETTER - another sub rule - a '' or a ':'. Let us complete this grammar with the missing sub rules to let it run through ANTLR:

lexer grammar XMLLexer;

TAG_START_OPEN : '<' ;
TAG_CLOSE : '>' ;

PCDATA : (~'<')+ ;

GENERIC_ID
    : ( LETTER | '_' | ':') (NAMECHAR)*
    ;

fragment NAMECHAR
    : LETTER | DIGIT | '.' | '-' | '_' | ':'
    ;

fragment DIGIT
    :    '0'..'9'
    ;

fragment LETTER
    : 'a'..'z'
    | 'A'..'Z'
    ;

WS  : (' '|'\r'|'\t'|'\u000C'|'\n') {channel=99;}
    ;

You can see that sub rules are prefixed by the keyword fragment. This is important as no sub rule can be used to identify a token. Otherwise your lexer would identify lots of LETTER tokens which you certainly do not want.

The rule WS is special as it recognizes all white spaces, but does not pass it to the next stage, the token parser, because we don't care about white spaces. Sending the token to channel 99 is ANTLR's way of saying that. {{}}

'0'..'9' is a short form of saying (0|1|2|3|4|5|6|7|8|9) and 'a'..'z' is short form for the range of characters from 'a' to 'z'.

This is how you invoke the ANTLR tool. Be careful to have all jars (ANTLR3, ANTLR2.7.6 and stringtemplate) from the ANTLR lib dir in your Java classpath:

java org.antlr.Tool xmlLexer-version1.g

As you can see, the ANTLR grammar is named xmlLexer-version1.g and followed the convention that all grammars have a g as extension. This is what ANTLR is likely to tell you:

ANTLR Parser Generator   Early Access Version 3.0b3 (July 21, 2006)  1989-2006
xmlLexer-version1.g:25:1: The following token definitions are unreachable: GENERIC_ID,WS

Hmmm. The first line simply tells us about the ANTLR version, but the second line looks like there is something wrong. It says that the rules for white spaces and the tag name can not be reached and will thus never apply. You might have guessed already that the reason is the rule for PCDATA. The characters it can match are a superset of those possibly matched by WS and GENERIC_ID. Reconsidering our XML problem we can rephrase it in terms of context. In one context text can be PCDATA and in another it might be the name for a tag.
If you are familiar with lex or flex, you might have heard of lexer modes. Lexer modes are contexts in which certain token definitions hold and others do not. Upon identification of certain tokens you can switch between these contexts/modes.In ANTLR you can do something very similar using gates. Gates can selectively switch off certain rules. What we want for the conflicting rules is to let GENERIC_ID and WS only apply when we are currently matching tags and PCDATA otherwise:

GENERIC_ID : {tagMode}?=> ( LETTER | '_' | ':') (NAMECHAR)* ;

WS         : {tagMode}?=> (' '|'\r'|'\t'|'\u000C'|'\n') {channel=99;} ;

PCDATA     : {!tagMode}?=> (~'<')+ ;

Gates are written inside {...}?=> and contain a boolean expression in the target language (Java by default). In your case this is a simple Boolean variable. Of course we need to define it and also toggle it upon identification of meaningful rules.

Introduction of members goes like this:

@members {
    boolean tagMode = false;
}

Cool. But just when to switch to the tag mode and when to switch back? Easy, every time a tag starts we switch to the tag mode:

TAG_START_OPEN : '<' { tagMode = true; } ;

and every time a tag ends we switch back:

TAG_CLOSE : '>' { tagMode = false; } ;

Everything inside {...} will be executed once a rule has been used to identify a token. It must thus be a valid statement of the target language. This is our complete grammar version #2:

lexer grammar XMLLexer;

@members {
    boolean tagMode = false;
}

TAG_START_OPEN : '<' { tagMode = true; } ;
TAG_CLOSE : '>' { tagMode = false; } ;

PCDATA : {!tagMode}?=> (~'<')+ ;

GENERIC_ID : {tagMode}?=> ( LETTER | '_' | ':') (NAMECHAR)* ;


fragment NAMECHAR
    : LETTER | DIGIT | '.' | '-' | '_' | ':'
    ;

fragment DIGIT
    :    '0'..'9'
    ;

fragment LETTER
    : 'a'..'z'
    | 'A'..'Z'
    ;

WS  :  {tagMode}?=>
       (' '|'\r'|'\t'|'\u000C'|'\n') {channel=99;}
    ;

Again starting ANTLR on that grammar

java org.antlr.Tool xmlLexer-version2.g

brings us

ANTLR Parser Generator   Early Access Version 3.0b3 (July 21, 2006)  1989-2006

and nothing else. Way better! We might give it a try and use it on this simple XML file:

<t>
  Starting text inside t
  <t1>Text inside t1</t1>
</t>

You also need some glue code now to open a file as input, direct it to the lexer and print out each identified token:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class MainLexer {
    public static void main(String[] args) {
        try {
           CharStream input = new ANTLRFileStream(args[0]);
           XMLLexer lexer = new XMLLexer(input);
           Token token;
           while ((token = lexer.nextToken())!=Token.EOF_TOKEN) {
             System.out.println("Token: "+token.getText());
           }
        } catch(Throwable t) {
           System.out.println("Exception: "+t);
           t.printStackTrace();
        }
    }
}

Now compile that code and start it with all libraries in the classpath and the name of the XML file as parameter:

javac *.java
java MainLexer simpleXML.xml

What we expect is a number of lines each containing the text of a token prefixed by the word "Token: ". This is what you get:

Token: <
Token: t
Token: >
Token:
  Starting text inside t

Token: <
Token: t1
Token: >
Token: Text inside t1
Token: <
[mTokens]: line 3:21 1:1: Tokens : ( TAG_START_OPEN | TAG_CLOSE | PCDATA
 | GENERIC_ID | WS ); state 2 (decision=5) no viable alt line 3:21; char='/'
Token: t1
Token: >
Token:

Token: <
[mTokens]: line 4:1 1:1: Tokens : ( TAG_START_OPEN | TAG_CLOSE | PCDATA
 | GENERIC_ID | WS ); state 2 (decision=5) no viable alt line 4:1; char='/'
Token: t
Token: >

Most of the tokens look ok, but ANTLR complains in line 3 and 4. Obviously, our lexer does not know how to deal with the '/' in the end tag. That's because we forgot to describe an end tag in our grammar. We also forgot about empty tags and attributes. This is the final lexer grammar:

lexer grammar XMLLexer;

@members {
    boolean tagMode = false;
}

TAG_START_OPEN : '<' { tagMode = true; } ;
TAG_END_OPEN : '</' { tagMode = true; } ;
TAG_CLOSE : { tagMode }?=> '>' { tagMode = false; } ;
TAG_EMPTY_CLOSE : { tagMode }?=> '/>' { tagMode = false; } ;

ATTR_EQ : { tagMode }?=> '=' ;

ATTR_VALUE : { tagMode }?=>
        ( '"' (~'"')* '"'
        | '\'' (~'\'')* '\''
        )
    ;

PCDATA : { !tagMode }?=> (~'<')+ ;

GENERIC_ID
    : { tagMode }?=>
      ( LETTER | '_' | ':') (NAMECHAR)*
    ;

fragment NAMECHAR
    : LETTER | DIGIT | '.' | '-' | '_' | ':'
    ;

fragment DIGIT
    :    '0'..'9'
    ;

fragment LETTER
    : 'a'..'z'
    | 'A'..'Z'
    ;

WS  :  { tagMode }?=>
       (' '|'\r'|'\t'|'\u000C'|'\n') {channel=99;}
    ;

This grammar allows parsing simple XML documents including empty tags and attributes. You can use this extended version to prove it. It should run through without errors:

<t attr="value">
  <empty/>
  Starting text inside t
  <t1>Text inside t1</t1>
</t>

This much for the lexing stage. Take a break and continue with the token parser that uses this lexer as a preprocessor to really make sense of the XML.

My siblings (including me):