Just as your code can benefit from complete unit tests, so can your grammars. Here's a walkthrough of developing a simple grammar (for CSV parsing) using Java, ANTLR, and TestNG.
...
CSV (Comma Separated Values) is a file format commonly used for exchanging spreadsheet data. It generally follows these rules:
- A record consists of multiple fields separated by commas. Each record ends with a line feed (a.k.a. newline) or carriage return + line feed.
- A field may not contain spaces, commas, double-quotes, line feeds, or carriage returns unless the field itself is wrapped in double-quotes on each end.
- To insert a double quote into a quoted field, double it (i.e. use "")
- Spaces are ignored immediately before and after a comma.
- White space at the front or end of the record is not allowed unless part of a quoted field.
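To make these rules concrete before any grammar exists, here is a minimal hand-written sketch in plain Java. It is not the ANTLR grammar this walkthrough builds: the class name CsvRulesSketch and the method parseRecord are invented for illustration, the record is assumed to arrive with its line terminator already removed, and the sketch is lenient (it silently drops every space outside quotes rather than rejecting malformed input).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that applies the CSV rules above to a single record. */
final class CsvRulesSketch {

    /** Splits one record (without its line terminator) into field values. */
    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<String>();
        if (record.length() == 0) {
            return fields; // an empty record has no fields
        }
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c); // commas and spaces are literal inside quotes
                } else if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    current.append('"'); // "" inside a quoted field means one "
                    i++;
                } else {
                    inQuotes = false; // closing quote
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // a comma ends the current field
                current.setLength(0);
            } else if (c != ' ') {
                current.append(c); // spaces outside quotes are ignored
            }
        }
        fields.add(current.toString()); // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseRecord("Red , Green, ,Blue"));
    }
}
```

A helper like this can double as a reference when deciding what each grammar test should expect.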
...
- You have ANTLR and TestNG installed. (You can use the same ideas with other unit testing frameworks such as JUnit)
- You can run ANTLR to generate Java sources from a grammar.
- You can build and run Java code.
- (optional) You have JDK 1.5 or later installed. If not, you won't be able to use some of the constructs in this example and will need to translate back to 1.4 or earlier.
...
Every CSV line ends in a newline (line feed) or carriage return + line feed. Let's set that up as our basic grammar.
Code Block |
---|
|
grammar CSV;
line : NEWLINE;
NEWLINE : '\r'? '\n';
|
Create the corresponding test harness.
Code Block |
---|
title | CSVTests.java |
---|
|
import java.io.IOException;
import java.util.List;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.testng.annotations.Test;

public class CSVTests {
    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        parser.line();
    }

    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        parser.line();
    }

    // Builds the lexer -> token stream -> parser chain for a test string.
    private CSVParser createParser(String testString) throws IOException {
        CharStream stream = new ANTLRStringStream(testString);
        CSVLexer lexer = new CSVLexer(stream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CSVParser parser = new CSVParser(tokens);
        return parser;
    }
}
|
Generate the Java files for this grammar, then run the test.
...
Code Block |
---|
title | CSVTests.java fragment |
---|
|
@Test
public void testNewline() throws IOException, RecognitionException {
CSVParser parser = createParser("\n");
List<String> result = parser.line(); // final public void line...
assert result.isEmpty() : "Nothing to return";
}
@Test
public void testCRLF() throws IOException, RecognitionException {
CSVParser parser = createParser("\r\n");
List<String> result = parser.line();
assert result.isEmpty() : "Nothing to return";
}
|
As expected, this fails to compile: line is declared as final public void line
...
Code Block |
---|
|
line returns [List<String> result]
: NEWLINE;
|
...
Code Block |
---|
|
public final List<String> line() throws RecognitionException {
List<String> result = null;
|
...
Code Block |
---|
|
line returns [List<String> result]
@init {
result = new ArrayList<String>();
}
: NEWLINE;
NEWLINE : '\r'? '\n';
|
...
Code Block |
---|
|
@Test
public void testSingleWord() throws IOException, RecognitionException {
CSVParser parser = createParser("Red");
String result = parser.field();
assert result.equals("Red") : "Expected Red, found " + result;
}
|
Of course, this won't compile. We need to define field in the grammar, doing just enough to keep the tests working:
Code Block |
---|
|
grammar CSV;
line returns [List<String> result]
@init {
result = new ArrayList<String>();
}
: field NEWLINE;
field returns [String parsedItem]
: f=FIELD { $parsedItem = $f.text;}
| // nothing
;
NEWLINE : '\r'? '\n';
FIELD : NONBREAKING+ ;
// Anything except a line-breaking character is allowed.
fragment NONBREAKING
: ~('\r' | '\n');
|
Info |
---|
title | What's that funny character? |
---|
|
"~" means "not" and is used to match any item that's not in a set. See The Definitive ANTLR Reference, page 95. |
...
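For readers who think in regular expressions, ANTLR's "~" set behaves much like a negated character class. The sketch below is plain Java rather than ANTLR; the class NotSetSketch and the method leadingRun are invented names, and the pattern `[^\r\n]+` only approximates a run of characters matched by ~('\r' | '\n').

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of a "not in this set" match using a negated character class. */
final class NotSetSketch {

    // One or more characters that are neither carriage return nor line feed.
    private static final Pattern NONBREAKING_RUN = Pattern.compile("[^\\r\\n]+");

    /** Returns the run of non-line-breaking characters at the start of the input, or "" if none. */
    static String leadingRun(String input) {
        Matcher matcher = NONBREAKING_RUN.matcher(input);
        return matcher.lookingAt() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(leadingRun("Red\n")); // prints Red
    }
}
```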
Code Block |
---|
title | CSVTests.java fragment |
---|
|
@Test
public void testMultipleWords() throws IOException, RecognitionException {
CSVParser parser = createParser("Red,Green,,Blue\n");
List<String> result = parser.line();
assert result.size() == 4: "Expected 4 items";
assert result.get(0).equals("Red") : "Expected Red";
assert result.get(1).equals("Green") : "Expected Green";
assert result.get(2).equals("") : "Expected empty";
assert result.get(3).equals("Blue") : "Expected Blue";
}
|
...
- We need to add each field's value to the line.
- A record such as a,,b should write out an empty field in the middle.
- Testing shows a potential nondeterminism between <empty field>NEWLINE and NEWLINE.
Code Block |
---|
|
grammar CSV;
line returns [List<String> result]
@init {
result = new ArrayList<String>();
}
: (NEWLINE) => NEWLINE
| (
fieldResult=field { result.add(fieldResult); }
( COMMA fieldResult=field {result.add(fieldResult);} )*
NEWLINE
)
;
field returns [String parsedItem]
@init {
parsedItem = "";
}
: f=FIELD {$parsedItem=$f.text;}
| // nothing
;
NEWLINE : '\r'? '\n';
COMMA : ',';
FIELD: NONBREAKING+;
// Anything except a line-breaking character is allowed.
fragment NONBREAKING
: ~('\r' | '\n' | ',');
|
This works, but line is getting cluttered.
Introducing scopes
We've been passing a value back from field to line, but there's another way to pass information between rules: dynamic scopes. This is a good place to see how they work.
Code Block |
---|
title | CSV.g with scope |
---|
|
grammar CSV;
/* Old definitions commented out:
line returns [List<String> result]
@init { result = new ArrayList<String>(); }
    : (NEWLINE) => NEWLINE
    | (
      fieldResult=field { result.add(fieldResult); }
      ( COMMA fieldResult=field {result.add(fieldResult);} )*
      NEWLINE
      )
    ;
field returns [String parsedItem]
@init { parsedItem = ""; }
    : f=FIELD {$parsedItem=$f.text;}
    | // nothing
    ;
*/
// New definitions:
line returns [List<String> result]
scope { List fields; }
@init { $line::fields = new ArrayList(); }
    : (
      (NEWLINE) => NEWLINE
      | field (COMMA field)* NEWLINE
      )
      { $result = $line::fields; }
    ;
field
    : ( f=FIELD
      | // nothing
      )
      // An empty field has no token, so store "" rather than null
      { $line::fields.add(($f == null) ? "" : $f.text); }
    ;
NEWLINE : '\r'? '\n';
COMMA : ',';
FIELD: NONBREAKING+;
// Anything except a line-breaking character or a comma is allowed.
fragment NONBREAKING
    : ~('\r' | '\n' | ',');
|
Since field no longer returns a string, we'll need to alter the test to pass the value through line and add a newline to the end of the line:
Code Block |
---|
title | CSVTests.java, new field test via line |
---|
|
@Test
public void testSingleWord() throws IOException, RecognitionException {
String result = parseField("Red");
assert result.equals("Red") : "Expected Red, but found " + result;
}
private String parseField(String testString) throws IOException, RecognitionException {
CSVParser parser = createParser(testString + "\n");
List<String> result = parser.line();
return result.get(0);
}
|
Quoting, part 1
CSV requires that fields that contain special characters (newline, return, double-quote, comma, space) be surrounded by double quotes.
Code Block |
---|
title | CSVTests.java fragment |
---|
|
@Test
public void testQuotedString() throws IOException, RecognitionException {
String result = parseField("\"Red, White, and Blue\"");
assert result.equals("Red, White, and Blue") : "Expected <<Red, White, and Blue>>, but found <<" + result + ">>";
}
|
You can treat this like a multi-line comment, grabbing all characters between the opening quote and the first closing quote found ("nongreedy" behavior):
Code Block |
---|
|
grammar CSV;
line returns [List<String> result]
scope { List fields; }
@init { $line::fields = new ArrayList(); }
: (
(NEWLINE) => NEWLINE
| field (COMMA field)* NEWLINE
)
{ $result = $line::fields; }
;
field
: ( f=QUOTED
| f=UNQUOTED
| // nothing
)
{ $line::fields.add(($f == null) ? "" : $f.text); }
;
NEWLINE : '\r'? '\n';
COMMA : ',';
QUOTED : '"' ( options {greedy=false;} : . )* '"'
{
// Strip the surrounding quotes
String txt = getText();
setText(txt.substring(1, txt.length() -1));
};
UNQUOTED : ~('\r' | '\n' | ',' | ' ' | '"')+;
|
This gets the job done.
Quoting, part 2
Let's allow quotes inside quoted fields. CSV uses "" to represent " in the final output.
Code Block |
---|
title | CSVTests.java fragment |
---|
|
@Test
public void testQuoteEscaping() throws IOException, RecognitionException {
String result = parseField("\"Before\"\"After\"");
assert result.equals("Before\"After") : "Expected <<Before\"After>>, but found <<" + result + ">>";
}
|
The QUOTED lexer rule is clearly the place to put this. But think about how you would do it. [Seriously, go try some solutions before using mine. – RDC]
Here's one solution, including a bit of post-processing code to convert "" to ":
Code Block |
---|
|
QUOTED : ('"' ( options {greedy=false;}: . )+ '"')+
{
StringBuffer txt = new StringBuffer(getText());
// Remove first and last double-quote
txt.deleteCharAt(0);
txt.deleteCharAt(txt.length()-1);
// "" -> "
int probe;
while ((probe = txt.lastIndexOf("\"\"")) >= 0) {
txt.deleteCharAt(probe);
}
setText(txt.toString());
};
|
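The post-processing step can also be tried outside the lexer. The following plain-Java sketch uses invented names (QuoteSketch, unquote) and is not part of the generated lexer. It strips the outer quotes and collapses each "" pair with String.replace, which scans left to right over non-overlapping matches, an assumption worth checking against whatever escaping behavior you want for runs of several quotes.

```java
/** Hypothetical helper showing the quote post-processing on a plain string. */
final class QuoteSketch {

    /** Strips the surrounding double quotes and turns each "" pair into a single ". */
    static String unquote(String tokenText) {
        int length = tokenText.length();
        if (length < 2 || tokenText.charAt(0) != '"' || tokenText.charAt(length - 1) != '"') {
            throw new IllegalArgumentException("Not a quoted field: " + tokenText);
        }
        String inner = tokenText.substring(1, length - 1);
        // Each non-overlapping "" becomes one ", so """" becomes "" and not a single "
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"Before\"\"After\"")); // prints Before"After
    }
}
```

Keeping the conversion in a small static method like this makes it easy to unit test separately from the grammar.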
Remove spaces around commas
CSV ignores leading and trailing spaces, so we should do the same. Here's the test case:
Code Block |
---|
title | CSVTests.java fragment |
---|
|
@Test
public void testSpaceRemoval() throws IOException, RecognitionException {
CSVParser parser = createParser("Red , Green, ,Blue\n");
List<String> result = parser.line();
assert result.size() == 4 : "Expected 4 items";
assert result.get(0).equals("Red") : "Expected Red";
assert result.get(1).equals("Green") : "Expected Green";
assert result.get(2).equals("") : "Expected empty";
assert result.get(3).equals("Blue") : "Expected Blue";
}
|
Surprise – it passes! This is unexpected, but a look at the error output shows what's happening:
No Format |
---|
line 1:3 no viable alternative at character ' '
line 1:4 no viable alternative at character ' '
line 1:6 no viable alternative at character ' '
line 1:7 no viable alternative at character ' '
line 1:8 no viable alternative at character ' '
line 1:15 no viable alternative at character ' '
|
ANTLR 3's error recovery is taking care of this by skipping the space characters (which are unrecognized by the UNQUOTED rule). If we want to catch the error instead, we'll have to alter the error recovery mechanism.
Note |
---|
title | Do you use continuous builds? |
---|
|
Consider disabling automatic error recovery while developing and testing your grammar. Otherwise, you may miss problems that crop up during automated unit testing (e.g. when using CruiseControl). |
Intercepting errors
Since the last test passed when it should have failed, but printed a series of warning messages, let's intercept the error information and retain it for later use.
Start by altering the test to extract stored error messages from the lexer. This also requires restructuring the code that creates the parser and lexer, so that the test can hold on to the lexer:
Code Block |
---|
title | CSVTests.java fragment |
---|
|
@Test
public void testSpaceRemoval() throws IOException, RecognitionException {
CSVLexer lexer = createLexer("Red , Green, ,Blue\n");
CSVParser parser = createParser(lexer);
List<String> result = parser.line();
assert result.size() == 4 : "Expected 4 items";
assert result.get(0).equals("Red") : "Expected Red";
assert result.get(1).equals("Green") : "Expected Green";
assert result.get(2).equals("") : "Expected empty";
assert result.get(3).equals("Blue") : "Expected Blue";
// Now make sure we didn't have any lexing errors, which were failing silently earlier
// The parser drives the lexer, so check for exceptions after
// parsing.
List<RecognitionException> lexerExceptions = lexer.getExceptions();
assert lexerExceptions.isEmpty() : "Lexer threw exceptions -- see output";
}
// and later...
private CSVParser createParser(CSVLexer lexer) throws IOException {
CommonTokenStream tokens = new CommonTokenStream(lexer);
CSVParser parser = new CSVParser(tokens);
return parser;
}
private CSVParser createParser(String testString) throws IOException {
// Reuse the two helpers instead of repeating the construction code
return createParser(createLexer(testString));
}
private CSVLexer createLexer(String testString) throws IOException {
CharStream stream = new ANTLRStringStream(testString);
CSVLexer lexer = new CSVLexer(stream);
return lexer;
}
|
Of course, this doesn't compile: CSVLexer doesn't implement getExceptions.
A little work with a debugger shows that the lexer calls reportError when it encounters an error:
Code Block |
---|
public void reportError(RecognitionException e)
|
Override this method in the lexer and add the exception to a list (all of the changes are in @lexer::members):
Code Block |
---|
|
grammar CSV;
@lexer::members {
List<RecognitionException> exceptions = new ArrayList<RecognitionException>();
public List<RecognitionException> getExceptions() {
return exceptions;
}
@Override
public void reportError(RecognitionException e) {
super.reportError(e);
exceptions.add(e);
}
}
line returns [List<String> result]
scope { List fields; }
@init { $line::fields = new ArrayList(); }
: (
(NEWLINE) => NEWLINE
| field (COMMA field)* NEWLINE
)
{ $result = $line::fields; }
;
field
: ( f=QUOTED
| f=UNQUOTED
| // nothing
)
{ $line::fields.add(($f == null) ? "" : $f.text); }
;
NEWLINE : '\r'? '\n';
COMMA : ',';
QUOTED : ('"' ( options {greedy=false;}: . )+ '"')+
{
StringBuffer txt = new StringBuffer(getText());
// Remove first and last double-quote
txt.deleteCharAt(0);
txt.deleteCharAt(txt.length()-1);
// "" -> "
int probe;
while ((probe = txt.lastIndexOf("\"\"")) >= 0) {
txt.deleteCharAt(probe);
}
setText(txt.toString());
};
// Anything except a line break, comma, space, or double quote is allowed.
UNQUOTED
: ~('\r' | '\n' | ',' | ' ' | '"')+;
|
Regenerate the lexer and parser, then run the tests.
The test fails, as we expected.
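The same override-and-collect pattern can be shown without ANTLR on the classpath. In the sketch below, PrintingRecognizer is a made-up stand-in for a base class that only prints errors (it is not ANTLR's class, and its scan method is a toy), while CollectingRecognizer overrides reportError the way the @lexer::members section does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a recognizer that prints errors and carries on. */
class PrintingRecognizer {

    public void reportError(Exception e) {
        System.err.println("error: " + e.getMessage());
    }

    /** Toy scanner: reports every space as an error, skips it, and keeps the rest. */
    public String scan(String input) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ' ') {
                reportError(new IllegalStateException("no viable alternative at index " + i));
            } else {
                kept.append(c);
            }
        }
        return kept.toString();
    }
}

/** Same behavior, but remembers every reported error so a test can inspect it later. */
class CollectingRecognizer extends PrintingRecognizer {

    private final List<Exception> exceptions = new ArrayList<Exception>();

    public List<Exception> getExceptions() {
        return Collections.unmodifiableList(exceptions);
    }

    @Override
    public void reportError(Exception e) {
        super.reportError(e); // keep the console output
        exceptions.add(e);    // and record the error for later assertions
    }
}
```

The recovered output alone looks fine, which is why a test that checks only the result passes; asserting on getExceptions() is what exposes the skipped characters.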
Ignore whitespace around commas (we mean it this time...)
Now we can alter the grammar to ignore leading and trailing spaces around commas. The answer turns out to be ridiculously simple:
Code Block |
---|
|
COMMA : ( ' '* ',' ' '*);
|
The test passes.
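As a rough analogy, the new COMMA rule treats a comma plus any adjacent spaces as a single separator, much like splitting on the regular expression " *, *". The plain-Java sketch below uses invented names (CommaSketch, splitFields) and ignores quoting entirely, so it illustrates only the separator, not the whole grammar.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of a separator that absorbs the spaces around a comma. */
final class CommaSketch {

    /** Splits on a comma together with any spaces directly before or after it. */
    static List<String> splitFields(String record) {
        // A limit of -1 keeps trailing empty fields instead of dropping them
        return Arrays.asList(record.split(" *, *", -1));
    }

    public static void main(String[] args) {
        System.out.println(splitFields("Red , Green, ,Blue")); // prints [Red, Green, , Blue]
    }
}
```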