...

Every CSV line ends in a newline (line feed) or carriage return + line feed. Let's set that up as our basic grammar.

Code Block

title	CSV.g


grammar CSV;

line :	NEWLINE;

NEWLINE	:	'\r'? '\n';
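
To see exactly which inputs the NEWLINE rule accepts, here is a small plain-Java sketch that expresses the same pattern as a regular expression. The class name NewlineSketch and the method isNewline are made up for illustration; they are not part of the generated lexer.

```java
import java.util.regex.Pattern;

final class NewlineSketch {
    // Mirrors the lexer rule NEWLINE : '\r'? '\n' ; as a regular expression.
    private static final Pattern NEWLINE = Pattern.compile("\r?\n");

    // True only if the whole input is a single line terminator.
    static boolean isNewline(String input) {
        return NEWLINE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNewline("\n"));    // line feed only
        System.out.println(isNewline("\r\n"));  // carriage return + line feed
        System.out.println(isNewline("\r"));    // a lone carriage return is not accepted
    }
}
```

A lone carriage return is rejected because the rule makes '\r' optional but always requires '\n'.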

Create the corresponding test harness.

Code Block

title	CSVTests.java


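import java.io.IOException;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
// Also import the @Test annotation from whichever test framework you use.

public class CSVTests {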
    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        parser.line();
    }

    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        parser.line();
    }

    private CSVParser createParser(String testString) throws IOException {
        CharStream stream = new ANTLRStringStream(testString);
        CSVLexer lexer = new CSVLexer(stream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CSVParser parser = new CSVParser(tokens);
        return parser;
    }

}

Generate the Java files for this grammar, then run the test.

...

Code Block

title	CSVTests.java fragment


    @Test
    public void testNewline() throws IOException, RecognitionException {
        CSVParser parser = createParser("\n");
        List<String> result = parser.line(); // final public void line...
        assert result.isEmpty() : "Nothing to return";
    }

    @Test
    public void testCRLF() throws IOException, RecognitionException {
        CSVParser parser = createParser("\r\n");
        List<String> result = parser.line();
        assert result.isEmpty() : "Nothing to return";
    }

As expected, this fails to compile: the generated parser still declares line as final public void line, so there is no return value to assign.

...

Code Block

title	CSV.g fragment


line returns [List<String> result]
	:	NEWLINE;

...

Code Block

title	Generated code...


    public final List<String> line() throws RecognitionException {
        List<String> result = null;

...

Code Block

title	CSV.g fragment


line returns [List<String> result]
@init {
    result = new ArrayList<String>();
}
	:	NEWLINE;

NEWLINE	:	'\r'? '\n';

...

Code Block

title	CSVTests.java


@Test
public void testSingleWord() throws IOException, RecognitionException {
    CSVParser parser = createParser("Red");
    String result = parser.field();
    assert result.equals("Red") : "Expected Red, found " + result;
}

Of course, this won't compile. We need to define field in the grammar, doing just enough to keep the tests working:

Code Block

title	CSV.g


grammar CSV;

line returns [List<String> result]
@init {
	result = new ArrayList<String>();
}
	:	field NEWLINE;

field returns [String parsedItem]
	:	f=FIELD { $parsedItem = $f.text;}
	|   // nothing
	;
	
NEWLINE	:	'\r'? '\n';

FIELD	:	NONBREAKING* ;
	
// Anything except a line-breaking character is allowed.
fragment NONBREAKING
	:	~('\r' | '\n');

Info

title	What's that funny character?

"~" means "not" and is used to match any item that's not in a set.

See The Definitive ANTLR Reference, page 95.

...

Code Block

title	CSVTests.java fragment


    @Test
    public void testMultipleWords() throws IOException, RecognitionException {
        CSVParser parser = createParser("Red,Green,,Blue\n");
        List<String> result = parser.line();
        assert result.size() == 4: "Expected 4 items";
        assert result.get(0).equals("Red") : "Expected Red";
        assert result.get(1).equals("Green") : "Expected Green";
        assert result.get(2).equals("") : "Expected empty";
        assert result.get(3).equals("Blue") : "Expected Blue";
    }

...

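We need to add each field's value to the list returned by line.
A record such as a,,b should yield an empty string for the middle field.
Testing shows a potential nondeterminism between <empty field>NEWLINE and NEWLINE, since field can match nothing; the syntactic predicate (NEWLINE) => picks the bare NEWLINE alternative when the line is empty.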

Code Block

title	CSV.g


grammar CSV;

line returns [List<String> result]
@init {
	result = new ArrayList<String>();
}
	: (NEWLINE) => NEWLINE
	| (
		fieldResult=field { result.add(fieldResult); }
		( COMMA fieldResult=field {result.add(fieldResult);} )*
	 	NEWLINE
	  )
	;

field returns [String parsedItem]
@init {
	parsedItem = "";
}
	: f=FIELD {$parsedItem=$f.text;}
	| // nothing

	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

FIELD:	NONBREAKING+;
	
// Anything except a line-breaking character or a comma is allowed.
fragment NONBREAKING	
	:	~('\r' | '\n' | ',');

This works, but line is getting cluttered.
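
For readers who want to trace what the rule does, here is a hand-written plain-Java sketch of the same logic: an empty line yields no fields, otherwise fields are collected between commas. The class LineSketch and the method parseLine are hypothetical names, and the sketch assumes a single line with no quoting; it is not what ANTLR generates.

```java
import java.util.ArrayList;
import java.util.List;

final class LineSketch {
    // Hand-written equivalent of: line : NEWLINE | field (COMMA field)* NEWLINE ;
    static List<String> parseLine(String input) {
        List<String> result = new ArrayList<String>();
        // Strip the terminating NEWLINE ('\r'? '\n'); reject input without one.
        String body;
        if (input.endsWith("\r\n")) {
            body = input.substring(0, input.length() - 2);
        } else if (input.endsWith("\n")) {
            body = input.substring(0, input.length() - 1);
        } else {
            throw new IllegalArgumentException("line must end in a newline");
        }
        // First alternative: a bare NEWLINE yields no fields at all.
        if (body.isEmpty()) {
            return result;
        }
        // Second alternative: field (COMMA field)*, where a field may be empty.
        int start = 0;
        for (int i = 0; i <= body.length(); i++) {
            if (i == body.length() || body.charAt(i) == ',') {
                result.add(body.substring(start, i));
                start = i + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("Red,Green,,Blue\n"));
        System.out.println(parseLine("\n"));
    }
}
```

The early return for an empty body plays the part of the (NEWLINE) => predicate: without it, an empty line would be reported as one empty field rather than no fields.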

...

Code Block

title	CSV.g with scope


grammar CSV;

/* Old definitions commented out:
line returns [List<String> result]
@init { result = new ArrayList<String>(); }
	: (NEWLINE) => NEWLINE
	| (
		fieldResult=field { result.add(fieldResult); }
		( COMMA fieldResult=field {result.add(fieldResult);} )*
	 	NEWLINE
	  )
	;

field returns [String parsedItem]
@init { parsedItem = ""; }
	: (f=FIELD {$parsedItem=$f.text;}
	  | // nothing
	  )
 	{ fields.add($parsedItem); }
	;

*/

// New definitions:
line returns [List<String> result]
scope { List fields; }

@init { $line::fields = new ArrayList(); }
	: (
	    (NEWLINE) => NEWLINE
	    | field (COMMA  field)* NEWLINE
	  )
	  { $result = $line::fields; }
	;

field
	: ( f=FIELD
	  | // nothing
	  )
 	{ $line::fields.add(($f == null) ? "" : $f.text); }
	;

NEWLINE	:	'\r'? '\n';

COMMA	:	',';

FIELD:	NONBREAKING+;
	
// Anything except a line-breaking character or a comma is allowed.
fragment NONBREAKING	
	:	~('\r' | '\n' | ',');
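
One way to picture a rule scope is as a stack of per-invocation frames: line pushes a frame on entry, field writes into the frame on top, and line pops it on exit. The plain-Java sketch below illustrates that idea only. ScopeSketch and its members are invented names, and the input is passed in as an array of already-split fields instead of being parsed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ScopeSketch {
    // One frame per active invocation of line(); plays the role of "scope { List fields; }".
    private final Deque<List<String>> lineScopes = new ArrayDeque<List<String>>();

    // Plays the role of the line rule: push a frame, let field() fill it, then pop and return it.
    List<String> line(String[] rawFields) {
        lineScopes.push(new ArrayList<String>());   // like @init { $line::fields = new ArrayList(); }
        try {
            for (String raw : rawFields) {
                field(raw);
            }
            return lineScopes.peek();               // like { $result = $line::fields; }
        } finally {
            lineScopes.pop();                       // the frame disappears when line() exits
        }
    }

    // Plays the role of the field rule: no return value, writes into the enclosing line's frame.
    private void field(String raw) {
        lineScopes.peek().add(raw == null ? "" : raw);
    }

    public static void main(String[] args) {
        System.out.println(new ScopeSketch().line(new String[] {"Red", null, "Blue"}));
    }
}
```

Because field no longer returns anything, its callers stay free of bookkeeping actions, which is what tidies up the line rule.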

Since field no longer returns a string, the test has to read the value through line, which means appending a newline to the test input:

Code Block

title	CSVTests.java, new field test via line


@Test
public void testSingleWord() throws IOException, RecognitionException {
    String result = parseField("Red");
    assert result.equals("Red") : "Expected Red, but found " + result;
}

private String parseField(String testString) throws IOException, RecognitionException {
    CSVParser parser = createParser(testString + "\n");
    List<String> result = parser.line();
    return result.get(0);
}

...

Code Block

title	CSVTests.java fragment


@Test
public void testQuotedString() throws IOException, RecognitionException {
    String result = parseField("\"Red, White, and Blue\"");
    assert result.equals("Red, White, and Blue") : "Expected <<Red, White, and Blue>>, but found <<" + result + ">>";
}

You can treat this like a multi-line comment, grabbing all characters between the opening quote and the first closing quote found ("nongreedy" behavior):

Code Block

title	CSV.g


grammar CSV;

line returns [List<String> result]
scope { List fields; }

@init { $line::fields = new ArrayList(); }
	: (
	    (NEWLINE) => NEWLINE
	    | field (COMMA  field)* NEWLINE
	  )
	  { $result = $line::fields; }
	;

field
	: ( f=QUOTED
	  | f=UNQUOTED
	  | // nothing
	  )
 	{ $line::fields.add(($f == null) ? "" : $f.text); }
	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

QUOTED	: '"' ( options {greedy=false;} : . )* '"'

	  {
	  	// Strip the surrounding quotes
	  	String txt = getText();

	  	setText(txt.substring(1, txt.length() -1));

	  };
	
UNQUOTED	:	~('\r' | '\n' | ',' | ' ' | '"')+;

This gets the job done.
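
If the greedy/nongreedy distinction is unfamiliar, Java regular expressions have the same notion under the name "reluctant" quantifiers. The sketch below is an analogy only, not ANTLR code, and NongreedySketch is a made-up class name. One difference to keep in mind: by default "." in a Java regex does not match line terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NongreedySketch {
    // Reluctant quantifier: stop at the first closing quote, like options {greedy=false;}.
    private static final Pattern NONGREEDY = Pattern.compile("\"(.*?)\"");
    // Greedy quantifier: run on to the last quote in the input.
    private static final Pattern GREEDY = Pattern.compile("\"(.*)\"");

    // Returns the text between the quotes of the first match, or null if nothing matches.
    private static String firstQuoted(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    static String nongreedy(String input) {
        return firstQuoted(NONGREEDY, input);
    }

    static String greedy(String input) {
        return firstQuoted(GREEDY, input);
    }

    public static void main(String[] args) {
        String input = "\"Red, White\",\"Blue\"";
        System.out.println(nongreedy(input)); // stops at the first closing quote
        System.out.println(greedy(input));    // swallows the comma and the second field
    }
}
```

With two quoted fields on one line, the greedy version swallows both fields and the comma between them, which is why the lexer rule asks for nongreedy matching.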

Quoting, part 2

...

Code Block

title	CSVTests.java fragment


@Test
public void testQuoteEscaping() throws IOException, RecognitionException {
    String result = parseField("\"Before\"\"After\"");
    assert result.equals("Before\"After") : "Expected <<Before\"After>>, but found <<" + result + ">>";
}

...

Code Block

title	CSV.g fragment


QUOTED	: ('"' ( options {greedy=false;}: . )+ '"')+
	  {
	  	StringBuffer txt = new StringBuffer(getText());

	  	// Remove first and last double-quote
	  	txt.deleteCharAt(0);
	  	txt.deleteCharAt(txt.length()-1);
	  	// "" -> "
	  	int probe;
	  	while ((probe = txt.lastIndexOf("\"\"")) >= 0) {
	  		txt.deleteCharAt(probe);
	  	}
	  	setText(txt.toString());

	  };
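
The action's string handling can be tried outside the lexer. The sketch below copies the logic into a plain Java method (using StringBuilder in place of StringBuffer) and, for comparison, adds a single-pass variant based on String.replace. Both method names and the class UnquoteSketch are hypothetical.

```java
final class UnquoteSketch {
    // Same steps as the lexer action: strip the outer quotes, then collapse "" to ".
    static String unquote(String tokenText) {
        StringBuilder txt = new StringBuilder(tokenText);
        // Remove first and last double-quote
        txt.deleteCharAt(0);
        txt.deleteCharAt(txt.length() - 1);
        // "" -> ", scanning from the end until no doubled quote remains
        int probe;
        while ((probe = txt.lastIndexOf("\"\"")) >= 0) {
            txt.deleteCharAt(probe);
        }
        return txt.toString();
    }

    // Alternative: replace each non-overlapping "" exactly once, in a single pass.
    static String unquoteSinglePass(String tokenText) {
        String inner = tokenText.substring(1, tokenText.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"Before\"\"After\""));
        System.out.println(unquoteSinglePass("\"Before\"\"After\""));
    }
}
```

For the test input both methods give the same answer. They differ when several escaped quotes sit next to each other: the loop keeps rescanning, so a field whose content is two literal quotes comes back as one, while the single-pass variant keeps both.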

Remove spaces around commas

...

Code Block

title	CSVTests.java fragment


@Test
public void testSpaceRemoval() throws IOException, RecognitionException {
    CSVParser parser = createParser("Red  ,   Green, ,Blue\n");
    List<String> result = parser.line();
    assert result.size() == 4 : "Expected 4 items";
    assert result.get(0).equals("Red") : "Expected Red";
    assert result.get(1).equals("Green") : "Expected Green";
    assert result.get(2).equals("") : "Expected empty";
    assert result.get(3).equals("Blue") : "Expected Blue";
}

Surprise – it passes! This is unexpected, but a look at the error output shows what's happening:

No Format


line 1:3 no viable alternative at character ' '
line 1:4 no viable alternative at character ' '
line 1:6 no viable alternative at character ' '
line 1:7 no viable alternative at character ' '
line 1:8 no viable alternative at character ' '
line 1:15 no viable alternative at character ' '
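
These messages come from the lexer: it reports each character it cannot match, skips it, and carries on, so the parser never sees the spaces. The toy tokenizer below imitates that report-and-skip behavior to show how the token stream ends up looking clean. It is a simplified stand-in written for this page (SkippingLexerSketch is an invented name), not the generated CSVLexer.

```java
import java.util.ArrayList;
import java.util.List;

final class SkippingLexerSketch {
    final List<String> tokens = new ArrayList<String>();
    final List<String> errors = new ArrayList<String>();

    // Splits input into comma, newline and word tokens.
    // A space matches no rule, so it is reported and then dropped.
    void tokenize(String input) {
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ',' || c == '\n') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == ' ') {
                errors.add("line 1:" + i + " no viable alternative at character ' '");
                i++; // recover by skipping the offending character
            } else {
                int start = i;
                while (i < input.length() && ",\n ".indexOf(input.charAt(i)) < 0) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
    }

    public static void main(String[] args) {
        SkippingLexerSketch lexer = new SkippingLexerSketch();
        lexer.tokenize("Red  ,   Green, ,Blue\n");
        System.out.println(lexer.tokens.size() + " tokens, " + lexer.errors.size() + " errors");
        for (String error : lexer.errors) {
            System.out.println(error);
        }
    }
}
```

The token list contains no trace of the spaces, which is why the parser-level assertions pass even though the input was never lexed cleanly.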

...

Code Block

title	CSVTests.java fragment


@Test
public void testSpaceRemoval() throws IOException, RecognitionException {
    CSVLexer lexer = createLexer("Red  ,   Green, ,Blue\n");
    CSVParser parser = createParser(lexer);
    List<String> result = parser.line();
    assert result.size() == 4 : "Expected 4 items";
    assert result.get(0).equals("Red") : "Expected Red";
    assert result.get(1).equals("Green") : "Expected Green";
    assert result.get(2).equals("") : "Expected empty";
    assert result.get(3).equals("Blue") : "Expected Blue";
    // Now make sure we didn't have any lexing errors, which were failing silently earlier
    // The parser drives the lexer, so check for exceptions after
    // parsing.
    List<RecognitionException> lexerExceptions = lexer.getExceptions();
    assert lexerExceptions.isEmpty() : "Lexer threw exceptions -- see output";
}

// and later...

private CSVParser createParser(CSVLexer lexer) throws IOException {
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    CSVParser parser = new CSVParser(tokens);
    return parser;
}

private CSVParser createParser(String testString) throws IOException {
    // Reuse the lexer-based overload above.
    return createParser(createLexer(testString));
}

private CSVLexer createLexer(String testString) throws IOException {
    CharStream stream = new ANTLRStringStream(testString);
    CSVLexer lexer = new CSVLexer(stream);
    return lexer;
}

...

A little work with a debugger shows that the lexer calls reportError when it encounters an error:

Code Block
public void reportError(RecognitionException e)

Override this method in the lexer and add the exception to a list (all of the changes are in @lexer::members):

Code Block

title	CSV.g


grammar CSV;

@lexer::members {
	List<RecognitionException> exceptions = new ArrayList<RecognitionException>();

	public List<RecognitionException> getExceptions() {
  		return exceptions;
	}

	@Override
	public void reportError(RecognitionException e) {
  		super.reportError(e);
  		exceptions.add(e);
	}
}

line returns [List<String> result]
scope { List fields; }

@init { $line::fields = new ArrayList(); }
	: (
	    (NEWLINE) => NEWLINE
	    | field (COMMA  field)* NEWLINE
	  )
	  { $result = $line::fields; }
	;

field
	: ( f=QUOTED
	  | f=UNQUOTED
	  | // nothing
	  )
 	{ $line::fields.add(($f == null) ? "" : $f.text); }
	;
	
NEWLINE	:	'\r'? '\n';

COMMA	:	',';

QUOTED	: ('"' ( options {greedy=false;}: . )+ '"')+
	  {
	  	StringBuffer txt = new StringBuffer(getText()); 
	  	// Remove first and last double-quote
	  	txt.deleteCharAt(0);
	  	txt.deleteCharAt(txt.length()-1);
	  	// "" -> "
	  	int probe;
	  	while ((probe = txt.lastIndexOf("\"\"")) >= 0) {
	  		txt.deleteCharAt(probe);
	  	}
	  	setText(txt.toString()); 
	  };
	
// Anything except a line-breaking character, a comma, or a space is allowed.
UNQUOTED	
	:	~('\r' | '\n' | ',' | ' ')+;

Regenerate the lexer and parser, then run the tests.

...

Code Block

title	CSV.g fragment


COMMA	:	( ' '* ',' ' '*);
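
The effect of letting COMMA absorb the surrounding spaces can be checked with the equivalent regular expression in plain Java. This is a sketch for experimentation, with CommaSketch and splitFields as made-up names. It works on a line whose newline has already been removed and ignores quoting.

```java
import java.util.Arrays;
import java.util.List;

final class CommaSketch {
    // Mirrors COMMA : ( ' '* ',' ' '* ) ; the separator swallows the spaces on both sides.
    static List<String> splitFields(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(" *, *", -1));
    }

    public static void main(String[] args) {
        System.out.println(splitFields("Red  ,   Green, ,Blue"));
    }
}
```

Spaces at the very start or end of the line are not next to a comma, so neither this sketch nor the COMMA rule removes them.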

...
