Test-Driven Development with ANTLR

Just as your code can benefit from complete unit tests, so can your grammars. Here's a walkthrough of developing a simple grammar (for CSV parsing) using Java, ANTLR and TestNG.

Prerequisites:

You have ANTLR and TestNG installed. (The same ideas apply to other unit testing frameworks, such as JUnit.)
You can run ANTLR to generate Java sources from a grammar.
You can build and run Java code.

Basic setup

Every CSV line ends in a newline (line feed) or carriage return + line feed. Let's set that up as our basic grammar.

CSV.g

grammar CSV;

line :	NEWLINE;

NEWLINE	:	'\r'? '\n';
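
To make the NEWLINE rule concrete, here is a plain-Java sketch of the same pattern: an optional carriage return followed by a line feed. It is only an illustration and not what ANTLR generates; NewlineSketch and isNewline are hypothetical names.

```java
import java.util.regex.Pattern;

/** Plain-Java approximation of the NEWLINE lexer rule; not generated by ANTLR. */
final class NewlineSketch {
    // Same shape as the grammar rule: an optional '\r' followed by '\n'.
    private static final Pattern NEWLINE = Pattern.compile("\r?\n");

    private NewlineSketch() {}

    /** Returns true when the whole input is exactly one line terminator. */
    static boolean isNewline(String input) {
        return NEWLINE.matcher(input).matches();
    }
}
```

A lone carriage return is rejected, which matches the two cases the tests below exercise: "\n" and "\r\n".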

Create the corresponding test harness.

CSVTests.java

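import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.testng.annotations.Test;

public class CSVTests {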
    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        parser.line();
    }

    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        parser.line();
    }
    
    private CSVParser createParser(String testString) throws IOException {
        byte[] byteArray = testString.getBytes("ISO-8859-1");
        InputStream stream = new ByteArrayInputStream(byteArray);
        ANTLRInputStream input = new ANTLRInputStream(stream);
        CSVLexer lexer = new CSVLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CSVParser parser = new CSVParser(tokens);
        return parser;
    }

}

Generate the Java files for this grammar, then run the test.

The test should pass.

Return a result

We'll want to get a list of strings back from each line of CSV. In true TDD form, we'll alter the test first, then update the grammar to make the test pass again.

CSVTests.java fragment

    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        List<String> result = parser.line(); // won't compile yet: line() is still declared void
        assert result.isEmpty() : "Nothing to return";
    }

    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        List<String> result = parser.line();
        assert result.isEmpty() : "Nothing to return";
    }

As expected, this fails to compile: line is still declared as public final void line(). (The test class also needs import java.util.List from here on.)

Now update the grammar so this compiles.

CSV.g fragment

line returns [List<String> result]
	:	NEWLINE;

Regenerate the parser from the grammar. The test now compiles, but fails at run time with a NullPointerException. Look at the generated code in CSVParser.java:

Generated code...

    public final List<String> line() throws RecognitionException {
        List<String> result = null;

Oops! We need to initialize the result. Edit your grammar again:

CSV.g fragment

line returns [List<String> result]
@init {
    result = new ArrayList<String>();
}
	:	NEWLINE;

NEWLINE	:	'\r'? '\n';

Run ANTLR to regenerate your files and run the test again. The test should pass.
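
One caveat about the tests so far: they use Java's assert statement, which the JVM skips unless assertions are enabled (for example with the -ea flag), so a test can appear to pass without checking anything. Below is a minimal sketch of an explicit check that always runs. Checks and checkEquals are hypothetical names; TestNG's own assertion methods serve the same purpose.

```java
import java.util.Objects;

/** Hypothetical helper: a check that runs whether or not -ea is set. */
final class Checks {
    private Checks() {}

    /** Throws AssertionError when actual and expected differ (null-safe). */
    static void checkEquals(Object actual, Object expected, String message) {
        if (!Objects.equals(actual, expected)) {
            throw new AssertionError(
                message + ": expected <" + expected + "> but found <" + actual + ">");
        }
    }
}
```

With a helper like this, a test could call Checks.checkEquals(result.isEmpty(), true, "Nothing to return") instead of relying on assert.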

Extract a single word

A record in CSV is a series of fields separated by commas and ending in a newline or CRLF. We'll start by testing a single field in isolation then building back up to testing a whole record's worth of fields.

Start by adding a new test:

CSVTests.java

@Test
public void testSingleWord() throws IOException, RecognitionException {
    CSVParser parser = createParser("Red");
    String result = parser.field();
    assert result.equals("Red") : "Expected Red, found " + result;
}

Of course, this won't compile yet. We need to define field in the grammar, doing just enough to make the tests pass:

CSV.g

grammar CSV;

line returns [List<String> result]
@init {
	result = new ArrayList<String>();
}
	:	field NEWLINE;

field returns [String parsedItem]
	:	f=FIELD { $parsedItem = $f.text;}
	|   // nothing
	;
	
NEWLINE	:	'\r'? '\n';

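FIELD	:	NONBREAKING+ ;
	
// Anything except a line-breaking character is allowed.
// Declared as a fragment so it only helps build FIELD and never becomes a token of its own.
fragment NONBREAKING	
	:	~('\r' | '\n');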

What's that funny character?

"~" means "not": it matches any single item that is not in the set that follows it.

See The Definitive ANTLR Reference, page 95.
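
If the "not" operator is new to you, this plain-Java sketch shows the same idea for the set used above. It is an illustration only, not ANTLR output; NotSetSketch, isNonBreaking and leadingRun are hypothetical names.

```java
/** Plain-Java illustration of a negated character set. */
final class NotSetSketch {
    private NotSetSketch() {}

    /** Mirrors ~('\r' | '\n'): true for any character outside the set. */
    static boolean isNonBreaking(char c) {
        return c != '\r' && c != '\n';
    }

    /** Length of the leading run of non-breaking characters, like one-or-more of them. */
    static int leadingRun(String input) {
        int i = 0;
        while (i < input.length() && isNonBreaking(input.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

For the input "Red\n", leadingRun stops at the line feed, so the run covers exactly the three characters of Red.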

Support multiple fields

Let's try multiple simple fields:

CSVTests.java fragment

    @Test
    public void testMultipleWords() throws IOException, RecognitionException {
        CSVParser parser = createParser("Red,Green,,Blue\n");
        List<String> result = parser.line();
        assert result.size() == 4: "Expected 4 items";
        assert result.get(0).equals("Red") : "Expected Red";
        assert result.get(1).equals("Green") : "Expected Green";
        assert result.get(2).equals("") : "Expected empty";
        assert result.get(3).equals("Blue") : "Expected Blue";
    }

This is going to take a few more changes than just defining a line as field,field,field...

We need to add each field's value to the line.
A record such as a,,b should write out an empty field in the middle.
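A line consisting only of NEWLINE is ambiguous: it could be an empty record or a record with one empty field. ANTLR reports this as a nondeterminism between <empty field>NEWLINE and NEWLINE, so the grammar uses a syntactic predicate, (NEWLINE) =>, to pick the empty-record reading.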
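
Before reading the grammar, it may help to see the three requirements above as ordinary Java. This is a hand-written sketch of the intended behaviour, not the generated parser; LineSketch and parseLine are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hand-written sketch of what the line rule should return. */
final class LineSketch {
    private LineSketch() {}

    static List<String> parseLine(String line) {
        // Drop the terminator: either "\r\n" or "\n".
        String body = line;
        if (body.endsWith("\r\n")) {
            body = body.substring(0, body.length() - 2);
        } else if (body.endsWith("\n")) {
            body = body.substring(0, body.length() - 1);
        }
        // A bare terminator is an empty record, not a record with one empty field.
        if (body.isEmpty()) {
            return new ArrayList<String>();
        }
        // A limit of -1 keeps empty fields, so "a,,b" yields three values.
        return new ArrayList<String>(Arrays.asList(body.split(",", -1)));
    }
}
```

The early return for an empty body plays the same role as choosing the NEWLINE-only alternative first in the grammar.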

CSV.g

grammar CSV;

line returns [List<String> result]
@init {
	result = new ArrayList<String>();
}
	: (NEWLINE) => NEWLINE
	| (
		fieldResult=field { result.add(fieldResult); }
		( COMMA fieldResult=field {result.add(fieldResult);} )*
	 	NEWLINE
	  )
	;

field returns [String parsedItem]
@init {
	parsedItem = "";
}
	: f=FIELD {$parsedItem=$f.text;}
	| // nothing 
	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

FIELD:	NONBREAKING+;
	
// Anything except a line-breaking character is allowed.
fragment NONBREAKING	
	:	~('\r' | '\n' | ',');

This works, but line is getting cluttered. Let's collect the values in a parser member (a "global" list) instead, while still returning a string from field so that it can be unit tested on its own:

CSV.g with global list

grammar CSV;

@members {
List<String> fields = new ArrayList<String>();
}

line returns [List<String> result]
	: (
	    (NEWLINE) => NEWLINE
	    | field ( COMMA field )* NEWLINE
	  )
	  { $result = fields; }
	;

/** Adds the field to the master result and also returns it for unit testing */
field returns [String parsedItem]
@init { parsedItem = ""; }
	: (f=FIELD {$parsedItem=$f.text;}
	  | // nothing
	  )
 	{ fields.add($parsedItem); }
	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

FIELD:	NONBREAKING+;
	
// Anything except a line-breaking character is allowed.
fragment NONBREAKING	
	:	~('\r' | '\n' | ',');
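
One consequence of the member list is worth noting: it lives as long as the parser instance, and line hands back that same list. The tests here are unaffected because createParser builds a fresh parser each time, but a parser reused for a second line would keep appending. The sketch below is a hand-written stand-in that shows the effect; MemberListSketch is a hypothetical class, not the generated CSVParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written stand-in for a parser that keeps its results in a member list. */
final class MemberListSketch {
    // Like the grammar's @members list: created once per instance.
    private final List<String> fields = new ArrayList<String>();

    /** Stand-in for the field rule: records the value and also returns it. */
    String field(String text) {
        fields.add(text);
        return text;
    }

    /** Stand-in for the line rule: returns the shared list itself. */
    List<String> line(String... texts) {
        for (String text : texts) {
            field(text);
        }
        return fields;
    }
}
```

Calling line twice on the same instance returns a list holding the values from both calls, so one parser per line (or clearing the list at the start of line) keeps records separate.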

Quoting, part 1

CSV requires that fields containing special characters (newline, return, double-quote, comma, space) be surrounded by double quotes.

CSVTests.java fragment

@Test
public void testQuotedString() throws IOException, RecognitionException {
    CSVParser parser = createParser("\"Red, White, and Blue\"");
    String result = parser.field();
    assert result.equals("Red, White, and Blue") : "Expected <<Red, White, and Blue>>, but found <<" + result + ">>";
}

You can treat this like a multi-line comment, grabbing all characters between the opening quote and the first closing quote found ("nongreedy" behavior):

CSV.g

grammar CSV;

@members {
List<String> fields = new ArrayList<String>();
}

line returns [List<String> result]
	: (
	    (NEWLINE) => NEWLINE
	    | field ( COMMA field )* NEWLINE
	  )
	  { $result = fields; }
	;

/** Adds the field to the master result and also returns it for unit testing */
field returns [String parsedItem]
@init { parsedItem = ""; }
	: ( f=QUOTED
	  | f=UNQUOTED
	  | // nothing
	  )
 	{ $parsedItem = ($f == null) ? "" : $f.text; fields.add($parsedItem); }
	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

QUOTED	: '"' ( options {greedy=false;} : . )* '"' 
	  {
	  	// Strip the surrounding quotes
	  	String txt = getText(); 
	  	setText(txt.substring(1, txt.length() -1)); 
	  };
	
UNQUOTED	:	~('\r' | '\n' | ',' | ' ' | '"')+;

This gets the job done.
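
For comparison, here is the same idea in plain Java: a reluctant (nongreedy) match up to the first closing quote, plus the substring arithmetic used in the lexer action. It is a sketch under those assumptions, not ANTLR output; QuotedSketch, firstQuoted and stripQuotes are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java illustration of the QUOTED rule's nongreedy match and quote stripping. */
final class QuotedSketch {
    // The reluctant quantifier .*? stops at the first closing quote;
    // DOTALL lets '.' match line breaks as well.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"", Pattern.DOTALL);

    private QuotedSketch() {}

    /** Returns the content of a quoted run at the start of input, or null if there is none. */
    static String firstQuoted(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    /** Same arithmetic as the lexer action: drop the first and last character. */
    static String stripQuotes(String txt) {
        return txt.substring(1, txt.length() - 1);
    }
}
```

Because the match is nongreedy, an input with two quoted fields in a row yields only the first one, which is the behaviour the grammar relies on.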

Remove spaces around commas

CSV ignores leading and trailing spaces, so we should do the same. Here's the test case:

CSVTests.java fragment

@Test
public void testSpaceRemoval() throws IOException, RecognitionException {
    CSVParser parser = createParser("Red  ,   Green, ,Blue\n");
    List<String> result = parser.line();
    assert result.size() == 4 : "Expected 4 items";
    assert result.get(0).equals("Red") : "Expected Red";
    assert result.get(1).equals("Green") : "Expected Green";
    assert result.get(2).equals("") : "Expected empty";
    assert result.get(3).equals("Blue") : "Expected Blue";
}

Surprise – it passes! This is unexpected, but a look at the error output shows what's happening:

line 1:3 no viable alternative at character ' '
line 1:4 no viable alternative at character ' '
line 1:6 no viable alternative at character ' '
line 1:7 no viable alternative at character ' '
line 1:8 no viable alternative at character ' '
line 1:15 no viable alternative at character ' '

ANTLR 3's error recovery is taking care of this for us: the lexer reports each space it cannot match, skips it, and carries on. That's convenient while you're developing, but production code should handle spaces deliberately rather than rely on recovery.
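
To see why the test passes, here is a plain-Java approximation of the two behaviours for this unquoted input: a lenient scan that drops each space and records its column, and a strict scan that fails at the first one. It is an illustration, not ANTLR's recovery code; RecoverySketch, dropSpaces and rejectSpaces are hypothetical names.

```java
import java.util.List;

/** Plain-Java contrast between skipping bad characters and failing on them. */
final class RecoverySketch {
    private RecoverySketch() {}

    /** Lenient: drops each space and records the column where it was found. */
    static String dropSpaces(String input, List<Integer> skippedColumns) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ' ') {
                skippedColumns.add(i);
            } else {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    /** Strict: fails at the first space instead of silently continuing. */
    static String rejectSpaces(String input) {
        int i = input.indexOf(' ');
        if (i >= 0) {
            throw new IllegalArgumentException("line 1:" + i + " unexpected character ' '");
        }
        return input;
    }
}
```

Run on the test input, the lenient scan records columns 3, 4, 6, 7, 8 and 15, the same positions as the error messages above, and leaves Red,Green,,Blue followed by the newline.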

I will leave you with the following challenges:

Things to do

Strip leading and trailing spaces around commas (but nowhere else)
Consider disabling automatic error recovery while developing and testing your grammar. Otherwise, you may miss problems that crop up during automated unit testing (e.g. when using CruiseControl).
CSV escapes quotation marks by doubling them, so "" is interpreted as ". How would you implement that? How would it affect the QUOTED lexer rule? (Hint: These can only appear within a quoted string.)
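
As a starting point for the last challenge, here is one possible shape of the logic in plain Java, leaving the grammar changes to you. It assumes the body of a quoted field is any mix of non-quote characters and doubled quotes; DoubledQuoteSketch and unquote are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java sketch of unescaping doubled quotes inside a quoted field. */
final class DoubledQuoteSketch {
    // Body: any mix of non-quote characters and doubled quotes.
    private static final Pattern QUOTED = Pattern.compile("\"((?:[^\"]|\"\")*)\"");

    private DoubledQuoteSketch() {}

    /** Strips the surrounding quotes and turns each "" into a single ". */
    static String unquote(String field) {
        Matcher m = QUOTED.matcher(field);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a quoted field: " + field);
        }
        return m.group(1).replace("\"\"", "\"");
    }
}
```

Note that a nongreedy match up to the first closing quote no longer works once doubled quotes are allowed, which is why the pattern spells out what the body may contain.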