Just as your code can benefit from complete unit tests, so can your grammars. Here's a walkthrough of developing a simple grammar (for CSV parsing) using Java, ANTLR and TestNG.
CSV defined
CSV (Comma Separated Values) is a file format commonly used for exchanging spreadsheet data. It generally follows these rules:
- A record consists of multiple fields separated by commas. Each record ends with a line feed (a/k/a newline) or carriage return + line feed.
- A field may not contain spaces, commas, double-quotes, line feeds, or carriage returns unless the field itself is wrapped in double-quotes on each end.
- To insert a double quote into a quoted field, double it (i.e. use "")
- Spaces are ignored immediately before and after a comma.
- White space at the front or end of the record is not allowed unless part of a quoted field.
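Before writing any grammar, it can help to have a tiny hand-written reference to compare against. The sketch below is my own illustration, not part of the ANTLR grammar developed in this tutorial: `CsvRecordSketch` and `split` are hypothetical names, and it covers only the comma, quoting, and doubled-quote rules. Space handling is deliberately left out, and the record is assumed to arrive without its line terminator.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reference splitter; a sketch for comparison, not the tutorial's parser. */
final class CsvRecordSketch {

    /** Splits one record (passed without its line terminator) into fields. */
    static List<String> split(String record) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < record.length()) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field stands for one literal quote.
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        current.append('"');
                        i += 2;
                        continue;
                    }
                    inQuotes = false; // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                // A comma outside quotes ends the current field, even an empty one.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
            i++;
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unterminated quoted field: " + record);
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a,,b"));
        System.out.println(split("\"Red, White, and Blue\",plain"));
    }
}
```

A reference like this is only useful as a cross-check for expected values in tests; the grammar remains the real implementation.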
Prerequisites for this tutorial
- You have ANTLR and TestNG installed. (You can use the same ideas with other unit testing frameworks such as JUnit)
- You can run ANTLR to generate Java sources from a grammar.
- You can build and run Java code.
- (optional) You have JDK 1.5 or later installed. If not, you won't be able to use some of the constructs in this example and will need to translate back to 1.4 or earlier.
Develop your grammar
Basic setup
Every CSV line ends in a newline (line feed) or carriage return + line feed. Let's set that up as our basic grammar.
Code Block | ||
---|---|---|
| ||
grammar CSV;
line : NEWLINE;
NEWLINE : '\r'? '\n';
|
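If the lexer rule syntax is new to you, `NEWLINE` reads much like the regular expression `\r?\n`: an optional carriage return followed by a line feed. Here is a minimal Java sketch of that equivalence; `NewlineSketch` and `isLineTerminator` are names I made up for illustration and are not generated by ANTLR.

```java
import java.util.regex.Pattern;

/** Hypothetical helper mirroring the NEWLINE rule with a plain regex. */
final class NewlineSketch {

    // Same shape as the lexer rule: an optional '\r' followed by '\n'.
    private static final Pattern NEWLINE = Pattern.compile("\\r?\\n");

    /** Returns true if the whole string is exactly one LF or one CRLF. */
    static boolean isLineTerminator(String text) {
        return NEWLINE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLineTerminator("\n"));   // true
        System.out.println(isLineTerminator("\r\n")); // true
        System.out.println(isLineTerminator("\r"));   // false
    }
}
```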
Create the corresponding test harness.
Code Block | ||
---|---|---|
| ||
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.testng.annotations.Test;

public class CSVTests {

    // A bare line feed is a valid line terminator.
    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        parser.line();
    }

    // Carriage return + line feed is also a valid line terminator.
    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        parser.line();
    }

    // Builds a parser over the given test string via the generated lexer.
    private CSVParser createParser(String testString) throws IOException {
        byte[] byteArray = testString.getBytes("ISO-8859-1");
        InputStream stream = new ByteArrayInputStream(byteArray);
        ANTLRInputStream input = new ANTLRInputStream(stream);
        CSVLexer lexer = new CSVLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CSVParser parser = new CSVParser(tokens);
        return parser;
    }
}
|
Generate the Java files for this grammar, then run the test.
The test should pass.
Return a result
We'll want to get a list of strings back from each line of CSV. In true TDD form, we'll alter the test then update the grammar to make the test correct again.
...
Run ANTLR to regenerate your files and run the test again. The test should pass.
Extract a single word
A record in CSV is a series of fields separated by commas and ending in a newline or CRLF. We'll start by testing a single field in isolation then building back up to testing a whole record's worth of fields.
...
Info | ||
---|---|---|
| ||
"~" means "not" and is used to match any item that's not in a set. See The Definitive ANTLR Reference, page 95. |
Support multiple fields
Let's try multiple simple fields:
...
- We need to add each field's value to the line.
- A record such as a,,b should write out an empty field in the middle.
- Testing shows a potential nondeterminism between <empty field>NEWLINE and NEWLINE.
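When writing the expected values for the empty-field case, plain Java can act as a quick cross-check for simple unquoted records. This is only a sketch under that assumption (no quotes, no spaces to strip); `EmptyFieldSketch` and `naiveSplit` are hypothetical names, not part of the grammar.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical cross-check for unquoted records; not a general CSV parser. */
final class EmptyFieldSketch {

    /** Splits on every comma and keeps empty fields, including trailing ones. */
    static List<String> naiveSplit(String record) {
        // The -1 limit matters: split(",") on its own drops trailing empty strings.
        return Arrays.asList(record.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(naiveSplit("a,,b").size()); // 3
        System.out.println(naiveSplit("a,b,").size()); // 3
    }
}
```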
Code Block | ||
---|---|---|
| ||
grammar CSV;
line returns [List<String> result]
@init {
result = new ArrayList<String>();
}
: (NEWLINE) => NEWLINE
| (
fieldResult=field { result.add(fieldResult); }
( COMMA fieldResult=field {result.add(fieldResult);} )*
NEWLINE
)
;
field returns [String parsedItem]
@init {
parsedItem = "";
}
: f=FIELD {$parsedItem=$f.text;}
| // nothing
;
NEWLINE : '\r'? '\n';
COMMA : ',';
FIELD: NONBREAKING+;
// Anything except a line-breaking character or a comma is allowed.
fragment NONBREAKING
: ~('\r' | '\n' | ',');
|
This works, but line is getting cluttered. Let's use a global and still return a testing string from field:
Code Block | ||
---|---|---|
| ||
grammar CSV;
@members {
List<String> fields = new ArrayList<String>();
}
line returns [List<String> result]
: (
(NEWLINE) => NEWLINE
| field ( COMMA field )* NEWLINE
)
{ $result = fields; }
;
/** Adds the field to the master result and also returns it for unit testing */
field returns [String parsedItem]
@init { parsedItem = ""; }
: (f=FIELD {$parsedItem=$f.text;}
| // nothing
)
{ fields.add($parsedItem); }
;
NEWLINE : '\r'? '\n';
COMMA : ',';
FIELD: NONBREAKING+;
// Anything except a line-breaking character or a comma is allowed.
fragment NONBREAKING
: ~('\r' | '\n' | ',');
|
Quoting, part 1
CSV requires that fields that contain special characters (newline, return, double-quote, comma, space) be surrounded by double quotes.
Code Block | ||
---|---|---|
| ||
@Test
public void testQuotedString() throws IOException, RecognitionException {
CSVParser parser = createParser("\"Red, White, and Blue\"");
String result = parser.field();
assert result.equals("Red, White, and Blue") : "Expected <<Red, White, and Blue>>, but found <<" + result + ">>";
}
|
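One caveat about tests written this way: Java's `assert` keyword is skipped unless the JVM runs with assertions enabled (the `-ea` flag), so a failing check can pass silently. The sketch below is one possible workaround that keeps the same message style. `ExpectSketch` and `expectEquals` are hypothetical names of my own; TestNG's own `Assert` class is another option.

```java
/** Hypothetical assertion helper that fails whether or not -ea is set. */
final class ExpectSketch {

    /** Throws AssertionError with the tutorial's message style when values differ. */
    static void expectEquals(Object expected, Object actual) {
        boolean same = (expected == null) ? (actual == null) : expected.equals(actual);
        if (!same) {
            throw new AssertionError(
                "Expected <<" + expected + ">>, but found <<" + actual + ">>");
        }
    }

    public static void main(String[] args) {
        expectEquals("Red, White, and Blue", "Red, White, and Blue"); // passes quietly
        try {
            expectEquals("Red", "Blue");
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In a test method, the assert line would become something like `ExpectSketch.expectEquals("Red, White, and Blue", result);`.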
You can treat this like a multi-line comment, grabbing all characters between the opening quote and the first closing quote found ("nongreedy" behavior):
Code Block | ||
---|---|---|
| ||
grammar CSV;
@members {
List<String> fields = new ArrayList<String>();
}
line returns [List<String> result]
: (
(NEWLINE) => NEWLINE
| field ( COMMA field )* NEWLINE
)
{ $result = fields; }
;
/** Adds the field to the master result and also returns it for unit testing */
field returns [String parsedItem]
@init { parsedItem = ""; }
: ( f=QUOTED
| f=UNQUOTED
| // nothing
)
{ $parsedItem = ($f == null) ? "" : $f.text; fields.add($parsedItem); }
;
NEWLINE : '\r'? '\n';
COMMA : ',';
QUOTED : '"' ( options {greedy=false;} : . )* '"'
{
// Strip the surrounding quotes
String txt = getText();
setText(txt.substring(1, txt.length() -1));
};
UNQUOTED : ~('\r' | '\n' | ',' | ' ' | '"')+;
|
This gets the job done.
Quoting, part 2
Let's allow quotes inside quoted fields. CSV uses "" to represent " in the final output.
Code Block | ||
---|---|---|
| ||
@Test
public void testQuoteEscaping() throws IOException, RecognitionException {
CSVParser parser = createParser("\"Before\"\"After\"");
String result = parser.field();
assert result.equals("Before\"After") : "Expected <<Before\"After>>, but found <<" + result + ">>";
}
|
The QUOTED lexer rule is clearly the place to put this. But think about how you would do it. [Seriously, go try some solutions before using mine. – RDC]
Here's one solution, including a bit of post-processing code to convert "" to ":
Code Block | ||
---|---|---|
| ||
QUOTED : ('"' ( options {greedy=false;}: . )+ '"')+
{
StringBuffer txt = new StringBuffer(getText());
// Remove first and last double-quote
txt.deleteCharAt(0);
txt.deleteCharAt(txt.length()-1);
// "" -> "
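// Scan right to left, skipping past each kept quote,
// so that a run such as """" becomes "" rather than "
int probe = txt.length();
while ((probe = txt.lastIndexOf("\"\"", probe - 2)) >= 0) {
txt.deleteCharAt(probe);
}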
setText(txt.toString());
};
|
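The post-processing step is easier to unit test when it lives outside the lexer action. Here is a minimal sketch of the same idea as a plain static method that the action could call; `QuotedTextSketch` and `unquote` are hypothetical names, and the method assumes it receives the full token text including both surrounding quotes.

```java
/** Hypothetical helper: the quote handling from the QUOTED action as plain Java. */
final class QuotedTextSketch {

    /** Strips the surrounding quotes, then turns each "" into a single ". */
    static String unquote(String tokenText) {
        int last = tokenText.length() - 1;
        if (last < 1 || tokenText.charAt(0) != '"' || tokenText.charAt(last) != '"') {
            throw new IllegalArgumentException("Not a quoted field: " + tokenText);
        }
        String inner = tokenText.substring(1, last);
        // replace() works left to right on non-overlapping matches,
        // so four quotes in a row collapse to two, not one.
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"Before\"\"After\""));
    }
}
```

Because the method has no ANTLR dependencies, it can get its own TestNG tests alongside the parser tests.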
Remove spaces around commas
CSV ignores leading and trailing spaces, so we should do the same. Here's the test case:
...
ANTLR 3's error recovery is taking care of this for us. Clearly, it's a nice thing while you're developing, but I'm inclined to make any production code do the right thing anyway, if for no other reason than to make the grammar self-documenting. So I will leave you with the following challenges:
- Strip leading and trailing spaces around commas (but nowhere else).