Just as your code can benefit from complete unit tests, so can your grammars. Here's a walkthrough of developing a simple grammar (for CSV parsing) using Java, ANTLR, and TestNG.

...

CSV (Comma Separated Values) is a file format commonly used for exchanging spreadsheet data. It generally follows these rules:

  1. A record consists of multiple fields separated by commas. Each record ends with a line feed (a.k.a. newline) or carriage return + line feed.
  2. A field may not contain spaces, commas, double-quotes, line feeds, or carriage returns unless the field itself is wrapped in double-quotes on each end.
  3. To insert a double quote into a quoted field, double it (i.e. use "")
  4. Spaces are ignored immediately before and after a comma.
  5. White space at the front or end of the record is not allowed unless part of a quoted field.
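
The rules above are easier to keep straight with a concrete record in hand. Here is a minimal sketch that writes a record according to rules 1 through 3. The class `CsvRulesSketch` and its method names are made up for illustration; they are not part of the grammar or of ANTLR.

```java
import java.util.Arrays;
import java.util.List;

public class CsvRulesSketch {
    /** Rule 2: these characters force the field to be wrapped in double-quotes. */
    static boolean needsQuoting(String field) {
        for (char c : field.toCharArray()) {
            if (c == ' ' || c == ',' || c == '"' || c == '\n' || c == '\r') {
                return true;
            }
        }
        return false;
    }

    /** Rules 2 and 3: quote the field when required and double any embedded quote. */
    static String encodeField(String field) {
        if (!needsQuoting(field)) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Rule 1: join the fields with commas and end the record with a line feed. */
    static String encodeRecord(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(encodeField(fields.get(i)));
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // Prints: Red,"Light Green","say ""hi""",
        System.out.print(encodeRecord(Arrays.asList("Red", "Light Green", "say \"hi\"", "")));
    }
}
```

Records produced this way make handy inputs for the parser tests, since the expected fields are known up front.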

...

  • You have ANTLR and TestNG installed. (The same ideas work with other unit testing frameworks such as JUnit.)
  • You can run ANTLR to generate Java sources from a grammar.
  • You can build and run Java code.
  • (optional) You have JDK 1.5 or later installed. If not, you won't be able to use some of the constructs in this example and will need to translate back to 1.4 or earlier.

...

ANTLR 3's error recovery is taking care of this for us by skipping the space characters (which are unrecognized by the UNQUOTED rule). If we want to catch the error instead, we'll have to alter the error recovery mechanism.
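
To see why skip-and-continue recovery can hide a problem from a test, here is a toy scanner that behaves in roughly the same spirit. It is only a sketch: `SkippingLexerSketch` is a hypothetical stand-in, not ANTLR's lexer, and it ignores quoted fields and carriage returns.

```java
import java.util.ArrayList;
import java.util.List;

public class SkippingLexerSketch {
    /** Errors noted while scanning; a stand-in for the warnings printed to the console. */
    final List<String> errors = new ArrayList<String>();

    /** Splits one record into fields, skipping (but recording) unexpected spaces. */
    List<String> fields(String line) {
        List<String> result = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ',') {
                result.add(current.toString());
                current.setLength(0);
            } else if (c == '\n') {
                break;
            } else if (c == ' ') {
                // Recovery: note the problem, drop the character, keep going.
                errors.add("unexpected ' ' at column " + i);
            } else {
                current.append(c);
            }
        }
        result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        SkippingLexerSketch lexer = new SkippingLexerSketch();
        List<String> fields = lexer.fields("Red  ,   Green, ,Blue\n");
        System.out.println(fields);              // [Red, Green, , Blue]
        System.out.println(lexer.errors.size()); // 6
    }
}
```

The fields come out exactly as a test would expect, so assertions on the result alone pass. Only the error list reveals that six characters were thrown away along the way.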

Things to do

Note: Do you use continuous builds?

Consider disabling automatic error recovery while developing and testing your grammar. Otherwise, you may miss problems that crop up during automated unit testing (e.g. when using CruiseControl).

Strip leading and trailing spaces around commas (but nowhere else)

Intercepting errors

The last test passed when it should have failed, printing only a series of warning messages. Let's intercept the error information and retain it for later use.

Start by altering the test to extract the stored exceptions from the lexer. This also requires splitting the code that creates the parser and lexer:

CSVTests.java fragment:

@Test
public void testSpaceRemoval() throws IOException, RecognitionException {
    CSVLexer lexer = createLexer("Red  ,   Green, ,Blue\n");
    CSVParser parser = createParser(lexer);
    List<String> result = parser.line();
    assert result.size() == 4 : "Expected 4 items";
    assert result.get(0).equals("Red") : "Expected Red";
    assert result.get(1).equals("Green") : "Expected Green";
    assert result.get(2).equals("") : "Expected empty";
    assert result.get(3).equals("Blue") : "Expected Blue";
    // Now make sure we didn't have any lexing errors, which were failing silently earlier
    // The parser drives the lexer, so check for exceptions after
    // parsing.
    List<RecognitionException> lexerExceptions = lexer.getExceptions();
    assert lexerExceptions.isEmpty() : "Lexer threw exceptions -- see output";
}

// and later...

private CSVParser createParser(CSVLexer lexer) throws IOException {
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    CSVParser parser = new CSVParser(tokens);
    return parser;
}

private CSVParser createParser(String testString) throws IOException {
    return createParser(createLexer(testString));
}

private CSVLexer createLexer(String testString) throws IOException {
    byte[] byteArray = testString.getBytes("ISO-8859-1");
    InputStream stream = new ByteArrayInputStream(byteArray);
    ANTLRInputStream input = new ANTLRInputStream(stream);
    CSVLexer lexer = new CSVLexer(input);
    return lexer;
}

(minus) Of course, this doesn't compile: CSVLexer doesn't implement getExceptions.

A little work with a debugger shows that the lexer calls reportError when it encounters an error:


public void reportError(RecognitionException e)

Override this method in the lexer and add the exception to a list (all of the changes are in @lexer::members):

CSV.g:

grammar CSV;

@members {
List<String> fields = new ArrayList<String>();
}

@lexer::members {
List<RecognitionException> exceptions = new ArrayList<RecognitionException>();

public List<RecognitionException> getExceptions() {
  return exceptions;
}

@Override
public void reportError(RecognitionException e) {
  super.reportError(e);
  exceptions.add(e);
}

}

line returns [List<String> result]
	: (
	    (NEWLINE) => NEWLINE
	    | field ( COMMA field )* NEWLINE
	  )
	  { $result = fields; }
	;

/** Adds the field to the master result and also returns it for unit testing */
field returns [String parsedItem]
@init { parsedItem = ""; }
	: ( f=QUOTED
	  | f=UNQUOTED
	  | // nothing
	  )
 	{ $parsedItem = ($f == null) ? "" : $f.text; fields.add($parsedItem); }
	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

QUOTED	: ('"' ( options {greedy=false;}: . )+ '"')+
	  {
	  	String txt = getText();
	  	// Remove first and last double-quote
	  	txt = txt.substring(1, txt.length() - 1);
	  	// "" -> "
	  	// A single left-to-right pass keeps one quote for each doubled pair,
	  	// even when two escaped quotes sit next to each other.
	  	setText(txt.replace("\"\"", "\""));
	  };
	
// Anything except a line break, comma, or space is allowed.
UNQUOTED	
	:	~('\r' | '\n' | ',' | ' ')+;

Regenerate the lexer and parser, then run the tests.

(minus) The test fails, as we expected.
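
The quote handling inside the QUOTED action is ordinary Java, so it can also be exercised without generating a lexer. A minimal sketch, assuming a hypothetical helper named `unquote` (not part of the original grammar) that receives the full token text, outer quotes included:

```java
public class UnquoteSketch {
    /**
     * Strips the surrounding double-quotes and turns each "" into ".
     * Assumes the input starts and ends with a double-quote.
     */
    static String unquote(String quoted) {
        String inner = quoted.substring(1, quoted.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"say \"\"hi\"\" twice\"")); // say "hi" twice
        System.out.println(unquote("\"a\"\"\"\"b\""));           // a""b
    }
}
```

A left-to-right replacement treats four quotes in a row as two escaped quotes. A loop that deletes one character per `lastIndexOf` hit would keep matching overlapping pairs and collapse them down to a single quote, so that case deserves a test of its own.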

Ignore whitespace around commas (we mean it this time...)

Now we can alter the grammar to ignore leading and trailing spaces around commas. The answer turns out to be ridiculously simple:

CSV.g fragment:

COMMA	:	( ' '* ',' ' '*);

(tick) The test passes.
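
For a quick sanity check outside ANTLR, the same shape can be written as a regular expression. This is only an analogy under a big assumption: it handles unquoted records only, and `CommaRuleSketch` is a hypothetical name, not something the grammar generates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CommaRuleSketch {
    // Same shape as the COMMA rule: optional spaces, a comma, optional spaces.
    private static final Pattern COMMA = Pattern.compile(" *, *");

    /** Splits an unquoted record on COMMA; the -1 limit keeps trailing empty fields. */
    static List<String> split(String record) {
        return Arrays.asList(COMMA.split(record, -1));
    }

    public static void main(String[] args) {
        System.out.println(split("Red  ,   Green, ,Blue")); // [Red, Green, , Blue]
    }
}
```

As in the lexer, the spaces on both sides are swallowed by the separator itself, which is why the empty third field survives as an empty string instead of a single space.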