Just as your code can benefit from complete unit tests, so can your grammars. Here's a walkthrough of developing a simple grammar (for CSV parsing) using Java, ANTLR and TestNG.
Prerequisites:
- You have ANTLR and TestNG installed. (You can use the same ideas with other unit testing frameworks such as JUnit.)
- You can run ANTLR to generate Java sources from a grammar.
- You can build and run Java code.
Basic setup
Every CSV line ends in a newline (line feed) or carriage return + line feed. Let's set that up as our basic grammar.
    grammar CSV;

    line : NEWLINE;

    NEWLINE : '\r'? '\n';
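To make the NEWLINE rule concrete, here is a minimal plain-Java sketch of the same pattern, written as a regular expression. It is only an illustration of what the rule accepts; the class and method names are made up for this example and are not part of ANTLR or of the generated code.

```java
import java.util.regex.Pattern;

final class NewlineRuleSketch {
    // Same shape as the grammar rule NEWLINE : '\r'? '\n';
    private static final Pattern NEWLINE = Pattern.compile("\\r?\\n");

    /** Returns true when the whole input is exactly one LF or one CRLF. */
    static boolean isLineEnding(String input) {
        return NEWLINE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLineEnding("\n"));    // line feed only
        System.out.println(isLineEnding("\r\n"));  // carriage return + line feed
        System.out.println(isLineEnding("\r"));    // bare carriage return
    }
}
```

A bare carriage return is rejected, which matches the grammar: the '\r' is optional, the '\n' is not.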
Create the corresponding test harness.
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.antlr.runtime.ANTLRInputStream;
    import org.antlr.runtime.CommonTokenStream;
    import org.antlr.runtime.RecognitionException;
    import org.testng.annotations.Test;

    public class CSVTests {

        @Test
        public void testNewline() throws IOException, RecognitionException {
            CSVParser parser = createParser("\n");
            parser.line();
        }

        @Test
        public void testCRLF() throws IOException, RecognitionException {
            CSVParser parser = createParser("\r\n");
            parser.line();
        }

        /** Builds a lexer and parser over the given test string. */
        private CSVParser createParser(String testString) throws IOException {
            byte[] byteArray = testString.getBytes("ISO-8859-1");
            InputStream stream = new ByteArrayInputStream(byteArray);
            ANTLRInputStream input = new ANTLRInputStream(stream);
            CSVLexer lexer = new CSVLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            return new CSVParser(tokens);
        }
    }
Generate the Java files for this grammar, then run the test.
The test should pass.
Return a result
We'll want to get a list of strings back from each line of CSV. In true TDD form, we'll alter the test then update the grammar to make the test correct again.
    // These tests also need "import java.util.List;" in the test class.

    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        List<String> result = parser.line(); // final public void line...
        assert result.isEmpty() : "Nothing to return";
    }

    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        List<String> result = parser.line();
        assert result.isEmpty() : "Nothing to return";
    }
As expected, this fails to compile: line is declared as final public void line.
Now update the grammar so this compiles.
    line returns [List<String> result] : NEWLINE;
Regenerate the parser from the grammar. The test compiles, but you get a NullPointerException. Look at the generated code in CSVParser.java:
    public final List<String> line() throws RecognitionException {
        List<String> result = null;
Oops! We need to initialize the result. Edit your grammar again:
    line returns [List<String> result]
    @init { result = new ArrayList<String>(); }
        : NEWLINE
        ;

    NEWLINE : '\r'? '\n';
Run ANTLR to regenerate your files and run the test again. The test should pass.
Extract a single word
A record in CSV is a series of fields separated by commas and ending in a newline or CRLF. We'll start by testing a single field in isolation, then build back up to testing a whole record's worth of fields.
Start by adding a new test:
    @Test
    public void testSingleWord() throws IOException, RecognitionException {
        CSVParser parser = createParser("Red");
        String result = parser.field();
        assert result.equals("Red") : "Expected Red, found " + result;
    }
Of course, this won't compile. We need to define field in the grammar, doing just enough to keep the tests working:
    grammar CSV;

    line returns [List<String> result]
    @init { result = new ArrayList<String>(); }
        : field NEWLINE
        ;

    field returns [String parsedItem]
        : f=FIELD { $parsedItem = $f.text; }
        | // nothing
        ;

    NEWLINE : '\r'? '\n';

    FIELD : NONBREAKING* ; // Anything except a line-breaking character is allowed.

    NONBREAKING : ~('\r' | '\n');
What's that funny character?
"~" means "not" and is used to match any item that's not in a set.
See The Definitive ANTLR Reference, page 95.
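As a rough plain-Java analogy (not ANTLR code), a negated set behaves like a character test that accepts everything outside the listed characters, and a rule built from it consumes the longest run of such characters. The class and method names below are invented for illustration.

```java
final class NegatedSetSketch {
    /** True for any character outside the set {'\r', '\n'}, like ~('\r' | '\n'). */
    static boolean isNonBreaking(char c) {
        return c != '\r' && c != '\n';
    }

    /** Returns the longest prefix of input made up only of non-breaking characters. */
    static String leadingRun(String input) {
        int end = 0;
        while (end < input.length() && isNonBreaking(input.charAt(end))) {
            end++;
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        // The run stops at the first line-breaking character.
        System.out.println(leadingRun("Red\nGreen"));
    }
}
```

Note that a comma is not in the excluded set here, so "a,b" would be consumed as a single run.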
Support multiple fields
Let's try multiple simple fields:
    @Test
    public void testMultipleWords() throws IOException, RecognitionException {
        CSVParser parser = createParser("Red,Green,,Blue\n");
        List<String> result = parser.line();
        assert result.size() == 4 : "Expected 4 items";
        assert result.get(0).equals("Red") : "Expected Red";
        assert result.get(1).equals("Green") : "Expected Green";
        assert result.get(2).equals("") : "Expected empty";
        assert result.get(3).equals("Blue") : "Expected Blue";
    }
This is going to take a few more changes than just defining a line as field,field,field...
- We need to add each field's value to the line.
- A record such as a,,b should write out an empty field in the middle.
- Testing shows a potential nondeterminism between <empty field>NEWLINE and NEWLINE.
    grammar CSV;

    line returns [List<String> result]
    @init { result = new ArrayList<String>(); }
        : (NEWLINE) => NEWLINE
        | ( fieldResult=field { result.add(fieldResult); }
            ( COMMA fieldResult=field { result.add(fieldResult); } )*
            NEWLINE
          )
        ;

    field returns [String parsedItem]
    @init { parsedItem = ""; }
        : f=FIELD { $parsedItem = $f.text; }
        | // nothing
        ;

    NEWLINE : '\r'? '\n';
    COMMA : ',';

    FIELD : NONBREAKING+; // Anything except a comma or a line-breaking character is allowed.

    fragment NONBREAKING : ~('\r' | '\n' | ',');
This works, but line is getting cluttered. Let's collect the fields in a parser-level member (a "global") and still return a string from field for testing:
    grammar CSV;

    @members {
        List<String> fields = new ArrayList<String>();
    }

    line returns [List<String> result]
        : ( (NEWLINE) => NEWLINE
          | field ( COMMA field )* NEWLINE
          )
          { $result = fields; }
        ;

    /** Adds the field to the master result and also returns it for unit testing */
    field returns [String parsedItem]
    @init { parsedItem = ""; }
        : ( f=FIELD { $parsedItem = $f.text; }
          | // nothing
          )
          { fields.add($parsedItem); }
        ;

    NEWLINE : '\r'? '\n';
    COMMA : ',';

    FIELD : NONBREAKING+; // Anything except a comma or a line-breaking character is allowed.

    fragment NONBREAKING : ~('\r' | '\n' | ',');
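When writing tests like testMultipleWords, it can help to have an independent way of computing the expected list. The sketch below is one possible plain-Java oracle for simple, unquoted lines only. The class and method names are hypothetical, and it is not how the generated parser works. It relies on String.split with a negative limit, which keeps empty fields (including trailing ones), and it treats a bare line ending as "no fields" to mirror the NEWLINE-only alternative in the grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SimpleLineOracle {
    /** Expected fields for one unquoted CSV line ending in LF or CRLF. */
    static List<String> expectedFields(String line) {
        String body = line;
        // Drop the terminating LF, and a CR directly before it if present.
        if (body.endsWith("\n")) {
            body = body.substring(0, body.length() - 1);
            if (body.endsWith("\r")) {
                body = body.substring(0, body.length() - 1);
            }
        }
        // A line with nothing before the line ending yields no fields at all.
        if (body.isEmpty()) {
            return new ArrayList<String>();
        }
        // A negative limit keeps empty strings, so "a,,b" gives three fields.
        return new ArrayList<String>(Arrays.asList(body.split(",", -1)));
    }

    public static void main(String[] args) {
        System.out.println(expectedFields("Red,Green,,Blue\n"));
    }
}
```

A test could then compare parser.line() against SimpleLineOracle.expectedFields(input) instead of asserting each index by hand.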
Quoting, part 1
CSV requires that fields that contain special characters (newline, return, double-quote, comma, space) be surrounded by double quotes.
    @Test
    public void testQuotedString() throws IOException, RecognitionException {
        CSVParser parser = createParser("\"Red, White, and Blue\"");
        String result = parser.field();
        assert result.equals("Red, White, and Blue")
            : "Expected <<Red, White, and Blue>>, but found <<" + result + ">>";
    }
You can treat this like a multi-line comment, grabbing all characters between the opening quote and the first closing quote found ("nongreedy" behavior):
    grammar CSV;

    @members {
        List<String> fields = new ArrayList<String>();
    }

    line returns [List<String> result]
        : ( (NEWLINE) => NEWLINE
          | field ( COMMA field )* NEWLINE
          )
          { $result = fields; }
        ;

    /** Adds the field to the master result and also returns it for unit testing */
    field returns [String parsedItem]
    @init { parsedItem = ""; }
        : ( f=QUOTED
          | f=UNQUOTED
          | // nothing
          )
          {
              $parsedItem = ($f == null) ? "" : $f.text;
              fields.add($parsedItem);
          }
        ;

    NEWLINE : '\r'? '\n';
    COMMA : ',';

    QUOTED
        : '"' ( options {greedy=false;} : . )* '"'
          {
              // Strip the surrounding quotes
              String txt = getText();
              setText(txt.substring(1, txt.length() - 1));
          }
        ;

    UNQUOTED : ~('\r' | '\n' | ',' | ' ' | '"')+;
This gets the job done.
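The nongreedy match plus quote stripping can be pictured in plain Java as "find the first closing quote after the opening one and keep what lies between". The following is only a sketch of that idea with invented names; the real work is done by the generated lexer.

```java
final class QuotedFieldSketch {
    /**
     * If input starts with a double quote, returns the text between that quote
     * and the first double quote that follows it. Returns null when the input
     * does not start with a quote or the quote is never closed.
     */
    static String firstQuoted(String input) {
        if (input.isEmpty() || input.charAt(0) != '"') {
            return null;
        }
        int close = input.indexOf('"', 1); // first closing quote: nongreedy
        if (close < 0) {
            return null;
        }
        return input.substring(1, close);
    }

    public static void main(String[] args) {
        System.out.println(firstQuoted("\"Red, White, and Blue\""));
    }
}
```

Because the match stops at the first closing quote, an input such as "a""b" (with a doubled quote inside) is cut off after a.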
Remove spaces around commas
CSV ignores leading and trailing spaces, so we should do the same. Here's the test case:
    @Test
    public void testSpaceRemoval() throws IOException, RecognitionException {
        CSVParser parser = createParser("Red , Green, ,Blue\n");
        List<String> result = parser.line();
        assert result.size() == 4 : "Expected 4 items";
        assert result.get(0).equals("Red") : "Expected Red";
        assert result.get(1).equals("Green") : "Expected Green";
        assert result.get(2).equals("") : "Expected empty";
        assert result.get(3).equals("Blue") : "Expected Blue";
    }
Surprise – it passes! This is unexpected, but a look at the error output shows what's happening:
    line 1:3 no viable alternative at character ' '
    line 1:4 no viable alternative at character ' '
    line 1:6 no viable alternative at character ' '
    line 1:7 no viable alternative at character ' '
    line 1:8 no viable alternative at character ' '
    line 1:15 no viable alternative at character ' '
ANTLR 3's error recovery is taking care of this for us. Clearly, it's a nice thing while you're developing, but you probably want to make any production code do the right thing.
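Since the recovered errors only show up as console output, a passing test can hide them. One possible way to catch this from a test, without touching the grammar, is to capture standard error while the parser runs and fail if anything was printed. The helper below is a plain-Java sketch with hypothetical names; it assumes the recognizer reports errors on System.err, as the output above suggests.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

final class StderrCapture {
    /** An action that may throw, such as a call into a generated parser. */
    interface ThrowingAction {
        void run() throws Exception;
    }

    /** Runs the action and returns everything it wrote to System.err. */
    static String capture(ThrowingAction action) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream replacement = new PrintStream(buffer, true);
        System.setErr(replacement);
        try {
            action.run();
        } catch (Exception e) {
            // Wrap checked exceptions so callers can use a simple lambda.
            throw new RuntimeException(e);
        } finally {
            replacement.flush();
            System.setErr(original); // always restore the real stream
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String errors = capture(() -> System.err.println("line 1:3 no viable alternative"));
        System.out.println("captured: " + errors.trim());
    }
}
```

In a test, this might look like String errors = StderrCapture.capture(() -> createParser("Red , Green\n").line()); followed by assert errors.isEmpty() : errors;. This only detects the reported errors; it does not turn off recovery itself.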
I will leave you with the following challenges:
- Strip leading and trailing spaces around commas (but nowhere else).
- Consider disabling automatic error recovery while developing and testing your grammar. Otherwise, you may miss problems that crop up during automated unit testing (e.g. when using CruiseControl).
- CSV escapes quotation marks by doubling them, so "" is interpreted as ". How would you implement that? How would it affect the QUOTED lexer rule? (Hint: These can only appear within a quoted string.)
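Before changing the lexer for the doubled-quote challenge, it may be useful to pin down the expected string transformation in plain Java, so a test has something to compare against. This is a sketch with hypothetical names, and it says nothing about how the QUOTED rule itself should be written: it takes a complete quoted field, strips the surrounding quotes, and collapses each doubled quote into one.

```java
final class DoubledQuoteSketch {
    /** Returns the value of a quoted CSV field given with its surrounding quotes. */
    static String unquote(String quoted) {
        int n = quoted.length();
        if (n < 2 || quoted.charAt(0) != '"' || quoted.charAt(n - 1) != '"') {
            throw new IllegalArgumentException("not a quoted field: " + quoted);
        }
        String inner = quoted.substring(1, n - 1);
        // Each pair of quotes inside the field stands for one literal quote.
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"He said \"\"hi\"\"\""));
    }
}
```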