Interfacing StAX to ANTLR

With version 1.6, Java now supports the StAX XML API [1]. Using this API you can ask for events much like you would ask an ANTLR lexer for tokens. To me this sounds like an invitation to interface this XML capability to ANTLR, and that is what I have done prototypically. This short article reports how it more or less worked out.

Why in the first place? 

Like it or not, XML has its place in IT. Every Java configuration file I have seen in the last few years is either a simple name/value property file or written in XML. Java's XML processing has also improved over the years and has become simpler to deal with. Using something like Spring [2], there is often no need to read in XML yourself any more. However, if you do have to, with SAX and also with StAX you still have to keep track of XML events and their context. For many tasks you have to record the current XML element and sometimes even which XML element it is nested in, because you have to know to which element a certain text (PCDATA) event belongs. This requires something like a stack. To be more precise, all XML documents can be described using S-grammars, which are a strict subset of LL(1) grammars. And, as you know, ANTLR supports a superset of LL(1). This means all valid XML documents can easily be parsed by ANTLR using its LL(*) algorithm and LL(*) grammars. The idea is to let StAX take the lexer part and to write an ANTLR3 parser for every XML format you want to read in.

In a nutshell 

Imagine you have an XML document like this (taken from the StAX tutorial)

<?xml version="1.0"?>
<BookCatalogue xmlns="http://www.publishing.org">
  <Book>
    <Title>Yogasana Vijnana: the Science of Yoga</Title>
    <ISBN>81-40-34319-4</ISBN>
    <Cost currency="INR">11.50</Cost>
  </Book>
</BookCatalogue>

and you want to parse it and emit all the important information, for example like this

Book Title:Yogasana Vijnana: the Science of Yoga
ISBN:81-40-34319-4
COST:11.50

Using SAX and even StAX you would at least have to remember the latest element name. This is necessary in order to correctly assign subsequent character data to the right element. It would mean even more work if you had to trace the whole hierarchy an element is nested in. And, you know, keeping state like this really results in ugly code. What about this instead:

catalogue : <bookcatalogue> ( book )* </bookcatalogue> ;

book : <book> title? isbn? cost? </book> ;

title : <title> TEXT </title> {println("Book Title:"+$TEXT.text); } ;

isbn : <isbn> TEXT </isbn> {println("ISBN:"+$TEXT.text); } ;

cost : <cost> TEXT </cost> {println("COST:"+$TEXT.text); } ;

I have to admit: not very obvious at first glance. However, if you take a second look, most of it might remind you of the good old DTD, plus some XML tags, plus some Java code. But this sort of ANTLR3 grammar can actually parse the above XML input and generate the output accordingly! And my first version of the glue code to interface StAX to ANTLR is only about 100 lines of code. Tiny! (smile)

If you are not impressed, that's OK. Keep using the DOM, where the parsing code for the above XML would be longer than the glue code I have talked about. (wink) Or keep using SAX, where the code would be even longer, plus you get a headache on top. StAX alone could easily deal with this example, but more deeply nested structures would lead to quite some code bloat as well. Interestingly, the boilerplate code you would have to write using plain StAX is pretty much the same as the code ANTLR3 generates from the above grammar; a rough sketch of such hand-written code follows below.
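
For comparison, here is a minimal sketch of what such hand-written StAX code could look like for the book example. The class and method names are made up for illustration; this is not the glue code discussed below.

import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class HandWrittenBookCatalogue {

    public static void parse(InputStream in) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        // the state we have to drag along manually
        String currentElement = null;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                currentElement = event.asStartElement().getName().getLocalPart();
            } else if (event.isCharacters() && currentElement != null) {
                String text = event.asCharacters().getData();
                if ("Title".equals(currentElement)) {
                    System.out.println("Book Title:" + text);
                } else if ("ISBN".equals(currentElement)) {
                    System.out.println("ISBN:" + text);
                } else if ("Cost".equals(currentElement)) {
                    System.out.println("COST:" + text);
                }
            } else if (event.isEndElement()) {
                currentElement = null;
            }
        }
    }
}

Even for this flat example the element bookkeeping is already there; for deeper nesting it turns into a stack.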

Finally, ahem, err, if you actually are impressed, I have to confess that the above grammar isn't exactly a working one, but it is pretty close. I have used some make-up to make it more attractive. Now that I have actually managed to attract you, let's go for the real stuff.

Translating StAX XML events to ANTLR tokens

First problem: ANTLR expects tokens with an integer token type. An ANTLR parser uses this type to identify what a token is. Usually an ANTLR-generated lexer takes care of this. As we replace such a lexer with our StAX input, we have to do some of that work ourselves. The code that does it is the core of the glue code: it parses the token file (containing textual token name/type pairs) that ANTLR generates from a grammar along with the parser source code. Like this:

public class StaxTokenSource implements TokenSource {

    protected Map<String, Integer> string2type = new HashMap<String, Integer>();

    ...

    protected void initMapping(InputStream tokenDefinition) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(tokenDefinition));
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("=");
                String tokenName = parts[0];
                String tokenType = parts[1];

                log.debug("Inserting mapping: " + tokenName + " -> " + tokenType);

                Integer iType = new Integer(tokenType);
                string2type.put(tokenName, iType);
            }
        } catch (IOException e) {
            log.fatal(e.getMessage(), e);
        }
    }
}
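
Just for illustration, the BookCatalogue.tokens file that initMapping reads contains one name/type pair per line; for the grammar shown later it might look roughly like this (the actual numbers are assigned by ANTLR and may differ):

BOOKCATALOGUE_START=4
BOOKCATALOGUE_END=5
BOOK_START=6
BOOK_END=7
TITLE_START=8
TITLE_END=9
ISBN_START=10
ISBN_END=11
COST_START=12
COST_END=13
TEXT=14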

Now when the parser asks for the next token using method nextToken, the StaxTokenSource gets the next XML event from StAX and uses the mapping to infer the right token type (slightly simplified):

protected XMLEventReader reader;

    ...

    public Token nextToken() {
        // at the end of the XML input ANTLR expects an EOF token
        if (!reader.hasNext()) {
            return Token.EOF_TOKEN;
        }
        try {
            XMLEvent event = reader.nextEvent();
            int type = getANTLRType(event);
            Token token = new ClassicToken(type);
            if (event.isCharacters()) {
                // character data (PCDATA) becomes the token text
                token.setText(event.asCharacters().getData());
            }
            return token;
        } catch (XMLStreamException e) {
            // nextEvent() declares a checked exception that nextToken() may not throw
            throw new RuntimeException(e);
        }
    }

If the XML event is character data, we pass its text on to the token. Finally, here is getANTLRType, which finds the token type for an XML event:

public final static String TEXT_TOKEN = "TEXT";

    public final static String START_TAG_SUFFIX = "_START";

    public final static String END_TAG_SUFFIX = "_END";

    public final static int UNDEFINED_TYPE = -1;

    ...

    protected int getANTLRType(XMLEvent event) {
        int type = UNDEFINED_TYPE;
        if (event.isCharacters()) {
            Integer iType = string2type.get(TEXT_TOKEN);
            type = iType.intValue();
        } else if (event.isStartElement()) {
            // XXX make it simply here
            String localName = event.asStartElement().getName().getLocalPart();
            localName = localName.toUpperCase() + START_TAG_SUFFIX;
            // FIXME add checks
            Integer iType = string2type.get(localName);
            type = iType.intValue();
        } else if (event.isEndElement()) {
            // XXX make it simply here
            String localName = event.asEndElement().getName().getLocalPart();
            localName = localName.toUpperCase() + END_TAG_SUFFIX;
            Integer iType = string2type.get(localName);
            // FIXME add checks
            type = iType.intValue();
        }
        return type;
    }
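
The FIXME comments mark a real gap: if an element name has no entry in the mapping, string2type.get returns null and the call to intValue() fails with a NullPointerException. One way to add the missing checks would be a small helper along these lines (a sketch, not part of the original glue code):

    protected int lookupType(String tokenName) {
        // hypothetical helper: fall back to UNDEFINED_TYPE instead of
        // running into a NullPointerException for unmapped element names
        Integer iType = string2type.get(tokenName);
        if (iType == null) {
            log.warn("No token type mapping found for: " + tokenName);
            return UNDEFINED_TYPE;
        }
        return iType.intValue();
    }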

That's all.

What would your grammar REALLY look like?

This is the complete, real, no omission, no make-up grammar:

parser grammar BookCatalogue;
@header {
package de.zeigermann.stax2antlr;
}

catalogue : {System.out.println("Found catalogue!"); } BOOKCATALOGUE_START ( book )* BOOKCATALOGUE_END ;

book : BOOK_START title? isbn? cost? BOOK_END ;

title : TITLE_START TEXT TITLE_END {System.out.println("Book Title:"+$TEXT.text); } ;

isbn : ISBN_START TEXT ISBN_END {System.out.println("ISBN:"+$TEXT.text); } ;

cost : COST_START TEXT COST_END {System.out.println("COST:"+$TEXT.text); } ;

Apart from minor differences from the grammar presented before, you can see that the way you express start and end tags is rather ugly: you take the name of the tag in upper case and add "_START" for a start tag or "_END" for an end tag, so <Cost> becomes COST_START and </Cost> becomes COST_END. This is a limitation I have no solution for right now (sad). Maybe changes to ANTLR3 would be necessary to make the grammar look more natural.

Putting it all together

Finally, you need some glue code to put the parts (XML input, token definition file, and parser) together. Here it is:

// this is where my sample.xml lies
String fileName = "D:/workspace/ANTLR3XML/sample.xml";
// this is where the tokens generated by ANTLR are
String tokenFileName = "D:/workspace/ANTLR3XML/src/de/zeigermann/stax2antlr/BookCatalogue.tokens";

InputStream inputStream = new FileInputStream(fileName);
InputStream tokenTypeInputStream = new FileInputStream(tokenFileName);

// StAX stuff
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader reader = factory.createXMLEventReader(inputStream);

// we stick the StAX token source into the standard ANTLR token stream here
TokenSource source = new StaxTokenSource(reader, tokenTypeInputStream);
TokenStream tokens = new CommonTokenStream(source);

// this is how you start the parser generated from ANTLR
BookCatalogueParser parser = new BookCatalogueParser(tokens);
parser.catalogue();

Summary 

OK, we did not quite make it. The grammar looks a little bit ugly and we cannot process attributes yet. But that is something one can work on. Additionally, the generated code is readable for people with some parser knowledge, even if it does not quite look as good as well-written hand-crafted StAX code. An obvious solution would be to simplify the ANTLR output templates, as we can be sure we only need to handle S-grammars here. Might be fun.

Anyway, I hope to have shown that, with the combination of StAX and ANTLR, XML processing can be fast, memory efficient and fun.

[1] StAX Tutorial

[2] Spring