ANTLR v4 lexers

ANTLR 1.0 (PCCTS) used a traditional DFA-based lexer. As I moved to ANTLR v2, I realized that we could tokenize input using the same recursive descent mechanism used for parsers. Unfortunately, with only fixed lookahead, ANTLR v2 lexers were difficult to build because you had to do a lot of your own left factoring. ANTLR v3 introduced arbitrary regular lookahead with LL(*), which makes lexer rules much easier to write. The lexers are really powerful because you can do recursive constructs like nested comments, and you can have semantic predicates and arbitrary actions. Unfortunately, it still seems more difficult to build a tokenizer with ANTLR than with a traditional lex-like approach. Using LL(*) to compute DFAs from lexer rules is also pretty expensive. Finally, generating a lot of DFA state machines in Java is a pain due to the lack of static arrays in Java class files.

For ANTLR v4, I'm contemplating a hybrid mechanism that combines the best of state machines with the best of ANTLR's flexibility. The basic idea is to create my internal NFA-like recursive transition network to represent all of the lexer rules, then generate a fixed representation of it during code generation. At runtime, the lexer would backtrack across the state machine very much like the ANTLR grammar interpreter that ANTLRWorks uses (although it has some bugs at the moment). NFAs are notoriously slower to interpret than DFAs, but after looking at a number of lexer grammars, I concluded that a little bit of fixed lookahead a la LL(k) for k=2 would make them very fast.

Update: After thinking about it more, I decided that I might as well go ahead and generate an entire DFA to match all of the tokens. The current code generation model for DFAs should be a decent way to represent the big DFA. I would only generate methods for lexer rules if they had actions somewhere other than the right edge. See more below.

The beauty of a state machine over generated code full of IF/WHILE statements is that we can jump around anywhere we want inside the state machine during execution. If we get halfway through a token and realize it's not going to match, we can trivially rewind the input and jump to the appropriate state for the next most likely token. It also lets us do "longest match" easily: we just keep track of all of the tokens we match until we stop matching characters, then pick the token for the longest sequence.
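To make that bookkeeping concrete, here is a minimal sketch (not ANTLR's actual code; all names made up) of maximal munch over a table-encoded DFA: remember the last accepting state we passed through and, on a dead end, rewind to it.

class MaxMunch {
    /** Longest-match sketch. edges[s][c] is the next state for char c (-1 = no
     *  edge; assumes edges[s] is indexed by char); accepts[s] is the token type
     *  for accept states (-1 = not accepting). Returns the token type and
     *  writes the rewind point into stop[0] (-1 = no token matched). */
    static int longestMatch(int[][] edges, int[] accepts, String input, int[] stop) {
        int s = 0, i = 0;
        int lastType = -1, lastIndex = -1;
        while (true) {
            if (accepts[s] >= 0) { lastType = accepts[s]; lastIndex = i; } // viable token ends here
            if (i >= input.length()) break;
            int next = edges[s][input.charAt(i)];
            if (next < 0) break;  // dead end; fall back to last accept
            s = next;
            i++;
        }
        stop[0] = lastIndex;      // where to rewind the input to
        return lastType;
    }
}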

Now, what about actions? How are we going to handle actions embedded inside rules? Well, that's easy enough to do by generating methods, one per action. If there are actions within a token, we rewind the input and match the token again once we are sure it will match, and this time we do it with feeling, executing the actions. This is precisely the mechanism of syntactic predicates, which is a nice symmetry. What about local variables and parameters? If we shove actions into their own methods, how can they see the local variables defined in other actions? Well, as part of another feature (to avoid compilation errors during semantic predicate hoisting), I'm going to store all parameters in my own software stack rather than relying on the method call stack of the underlying target language. As for local variables, I'm going to formalize them so that I know you are defining a local variable; I will store those in the software stack too, so that they are available to all of the actions in a rule.
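As a rough sketch of the software-stack idea (hypothetical names, not generated code): parameters and formalized locals would live in lexer-managed frames so that the separately generated action methods can all see them.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class RuleFrame {
    final Map<String, Object> slots = new HashMap<String, Object>(); // params + formal locals
}

class ActionFrames {
    final Deque<RuleFrame> frames = new ArrayDeque<RuleFrame>();

    void pushFrame() { frames.push(new RuleFrame()); } // entering a rule
    void popFrame()  { frames.pop(); }                 // leaving a rule

    // Generated action methods read and write through these, so a "local"
    // defined in one action is visible to every other action in the rule.
    void set(String name, Object v) { frames.peek().slots.put(name, v); }
    Object get(String name)         { return frames.peek().slots.get(name); }
}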

What about recursive token references such as the following rule:

B : '[' (B|.)+ ']' ;

That's no problem. We simply have a transition in the state machine that pushes the current state and jumps back to the start of B.

I believe this mechanism would provide a more satisfying experience for the lexer grammar developer. It might actually be faster than the current recursive descent lexers in v3. It's easier to generate and faster to analyze. It should also take much less time to initialize at lexer runtime because the state machine will be smaller. Currently, v3 lexers are huge. For example, the Java lexer is 4747 lines! It's just not that hard a problem to deserve that much code. (wink) Oh, it should also make it much easier to build incremental lexers, if we decide to implement those.

Also, there is another issue, as Gavin Lambert says:

While it's usually possible to frame the contents of a looping construct in a positive sense (which is what ANTLR currently requires), it'd be nice if there were a language construct that would let you do specific negative matches too (maybe a feature request for v4?).

For (a completely made up) example, consider a case where you might want to match any identifier except one starting with "foo". In current ANTLR, you'd have to do one of these:

 FOOLIST: 'foo[' NON_FOO_ID+ ']';
 FOOLIST: 'foo[' ({!next_id_starts_with_foo()}? => ID)+ ']';
 FOOLIST: 'foo[' ((~'f' | 'f' ~'o' | 'fo' ~'o') => ID)+ ']';

It'd be nice if there was some way to express a negative match via a syntactic predicate, eg:

 FOOLIST: 'foo[' (('foo') => ~ | ID)+ ']';

(where '~' in an alt basically means "break", ie. match nothing and terminate the innermost loop.)
Or, perhaps better:

 FOOLIST: 'foo[' (('foo') ~=> ID)+ ']';

(where '~=>' means "only take this path if the predicate fails")

Granted, this sort of requirement doesn't come up often, but when it does it'd be nice to have a tidier way of expressing it; and it'd be fairly simple to implement... (smile)

More stuff as I implement v4: v2 and v3 lexers drive me nuts sometimes. E.g.,

QUALIFIED_ATTR
	:	'$' x=ID '.' y=ID (WS? '=' expr=ATTR_VALUE_EXPR ';')?
	;

If there is a WS after $x.y, the lexer enters the optional subrule even if '=' does not follow. It then fails! It should rewind and claim it matched everything but the subrule.

Also, Ron Hunter-Duvar just pointed out that this loops forever when the predicate fails:

lexer grammar Test2;

@members {
 public static void main(String[] args) {
   Test2 lexer = new Test2();
   lexer.setCharStream(new ANTLRStringStream("cb"));
   System.out.println("Token="+lexer.nextToken());
 }
}

CAB:
 ( ( 'c' 'a' )
 | ( { false }? 'c' 'b' )
 )
 ;

Main idea

Perform NFA-to-DFA conversion, not just LL(*) prediction. Treat rule references as meta-characters, sort of like LR(0) machine construction. Try to inline references to other top-level tokens so we avoid some runtime nondeterminism. Given:

FLOAT : INT '.' INT ;
INT   : '0'..'9'+ ;

Otherwise, there's no merging of transitions in the DFA. Need to inline:

FLOAT : '0'..'9'+ '.' '0'..'9'+ ;
INT   : '0'..'9'+ ;

Or:

FLOAT : INT '.' INT ;
INT   : DIGIT+ ;
fragment
DIGIT : '0'..'9' ;

becomes

FLOAT : DIGIT+ '.' DIGIT+;
INT   : DIGIT+ ;
fragment
DIGIT : '0'..'9' ;

We'd still be an NFA (or an RTN) at runtime, but the left-factoring would reduce the number of nondeterministic states.

Maybe it's easier to just inline all non-recursive rules (unless there are actions).

Actions at non-right edges restrict or prevent left-factoring; i.e., DFA conversion can't merge some edges. For example, an action between INT and '.' in FLOAT would have to execute before we know which token will match, so the conversion can't merge FLOAT's initial INT edges with INT's.

Runtime implementation

NFA simulation / interpretation

Code generation

Scannerless parsing

While I'm at it, I should investigate a scannerless parsing mechanism. It just means using characters as tokens, really, but it might be pretty slow that way. Scannerless parsing makes embedded/island grammars easier and supports true grammar composition. That is, we can import one grammar into another and not worry about breaking recognition of either language. Maybe with generics, I can make a Parser object with the TokenLabel type set to Integer to parse characters. Actually, we'd need:

class IntegerToken implements Token {
    int c; // the character code; doubles as the token type
    public int getType() { return c; }
    public String getText() { return String.valueOf((char)c); } // the char itself, not its numeric code
    ...
}

Heck, that might actually work w/o generics.

Hmm...might be easier to make ScannerlessParser object extending Parser.
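For instance, a char-as-token source is tiny. This sketch assumes v3's TokenSource interface and an IntegerToken fleshed out as above (names hypothetical):

import org.antlr.runtime.Token;
import org.antlr.runtime.TokenSource;

class CharTokenSource implements TokenSource {
    final String input;
    int i = 0;
    CharTokenSource(String input) { this.input = input; }

    public Token nextToken() {
        if (i >= input.length()) return Token.EOF_TOKEN;
        IntegerToken t = new IntegerToken();
        t.c = input.charAt(i++);   // each character becomes its own token
        return t;
    }
    public String getSourceName() { return "<string>"; }
}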

Update 4/19/2010

Requirements

I believe I have settled on the following requirements:

  • lexer modes (easy enough to implement and useful in a lot of cases); means we don't need recursive rules
  • no recursive rules
  • no non-greedy loops (use states or predicates when necessary)
  • choose longest match from among the lexer rules; e.g., "beginning" matches ID not BEGIN ID.
    BEGIN : 'begin' ;
    ID    : 'a'..'z'+ ;
    
  • arbitrary actions and predicates allowed, BUT they will execute out of context from each other and hence must use "global" information such as the input stream and lexer fields or labels on lexer rule elements like id=ID. Local variable definitions in actions won't be visible to other actions. I'll probably have to implement the formal local variables and parameters mechanism so that we can handle things like ( {i<5}? LETTER {++i;} )+ (see the sketch after this list)
  • no syntactic predicates in lexers, BUT I think I may allow a new kind of predicate a la PEGs using the &x operator that says match x but don't make those characters part of the token. Then &~x would mean make sure x doesn't match at this point in the input stream. Without recursion, a DFA can recognize any lexical structure we specify (given the way I build the DFA: I effectively inline all rule references to make the grammar regular).
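Here's the sketch promised above: a guess (hypothetical names, not real generated code) at how the embedded predicate and action in ( {i<5}? LETTER {++i;} )+ would end up as separate methods sharing a lexer field.

class SketchLexer {
    int i = 0;  // lexer field: the only state the two generated methods below can share

    boolean pred0() { return i < 5; }  // from {i<5}?
    void action0()  { ++i; }           // from {++i;}
}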

Implementation

As for implementation, we need a two-pronged approach:

  • DFAs for speed.
    The default should be a DFA for speed and simplicity. Actions would be allowed in this case as long as they are at the right edge of a lexical rule (to execute after we've matched an entire rule).
  • NFAs to handle labeled lexer rule elements and actions and predicates.
    Actions in the middle of a rule, predicates, and labeled elements force us to use an NFA simulation instead of a DFA. For this, I will probably use something similar to Thompson's algorithm as explained by Russ Cox. I will add memoization to avoid exploring dead ends in the NFA and increase efficiency.

So, if you add something to your lexical rules beyond what a DFA can handle, you are stuck with an NFA, which will run a little bit slower. Filter mode, for example, will have to use an NFA because filtering grammars typically use a lot of actions and predicates.

Notes from April 20, 2010

Added lexer modes/states:

lexer grammar L2;

A : 'a' ;
AA : 'aa' ;

mode FOO;

B : 'b' ;
C : 'c' ;

Actions will flip between them (push/pop/flip).
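A minimal sketch (assumed names, not the actual runtime) of the mode machinery those actions would drive; the current mode just picks which start state the next token match begins from.

import java.util.ArrayDeque;
import java.util.Deque;

class ModeSupport {
    static final int DEFAULT_MODE = 0;
    int _mode = DEFAULT_MODE;   // selects the start state for the next token
    final Deque<Integer> modeStack = new ArrayDeque<Integer>();

    void pushMode(int m) { modeStack.push(_mode); _mode = m; } // enter and remember
    void popMode()       { _mode = modeStack.pop(); }          // return to previous
    void mode(int m)     { _mode = m; }                        // flip in place
}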

Notes from April 30, 2010

The bytecode is actually easier to read than the Java (wink)

lexer grammar L2;
A : 'ab';
B : 'a'..'z'+ ;
I : '0'..'9'+ ;

yields:

0000:	split         9, 16, 29   // says 3 paths are possible
0009:	match8        'a'
0011:	match8        'b'
0013:	accept        4
0016:	range8        'a', 'z'
0019:	split         16, 26
0026:	accept        5
0029:	range8        '0', '9'
0032:	split         29, 39 // go back or fall out of loop into accept state
0039:	accept        6

Is that what you mean? It's 1-to-1 with the grammar, taken almost verbatim from Russ Cox's description of VM-based NFA simulation.
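For the curious, here is a compact sketch (assumed, simplified instruction encoding with addresses remapped to array slots; class names made up, not the actual runtime) of the lockstep thread simulation Cox describes, which is what gives longest match with earlier-rule priority:

import java.util.ArrayList;
import java.util.List;

class Instr {
    String op;      // "split", "match8", "range8", or "accept"
    int[] targets;  // split: alternative pcs, in rule-priority order
    char lo, hi;    // match8: lo == hi; range8: lo..hi
    int tokenType;  // accept
}

class SketchVM {
    Instr[] code;

    /** Chase split edges so threads only ever sit on consuming/accept instructions. */
    void addThread(List<Integer> threads, int pc) {
        if (code[pc].op.equals("split")) {
            for (int t : code[pc].targets) addThread(threads, t);
        } else if (!threads.contains(pc)) {
            threads.add(pc);
        }
    }

    /** Longest match from in[start..]: returns {tokenType, stopIndex} or {-1, -1}. */
    int[] match(String in, int start) {
        int bestType = -1, bestStop = -1;
        List<Integer> threads = new ArrayList<Integer>();
        addThread(threads, 0);
        for (int i = start; !threads.isEmpty(); i++) {
            List<Integer> next = new ArrayList<Integer>();
            for (int pc : threads) {
                Instr I = code[pc];
                if (I.op.equals("accept")) {
                    // strictly longer wins; on a tie the earlier-rule thread already won
                    if (i > bestStop) { bestType = I.tokenType; bestStop = i; }
                } else if (i < in.length()
                           && I.lo <= in.charAt(i) && in.charAt(i) <= I.hi) {
                    addThread(next, pc + 1);   // consumed one char; advance this thread
                }
            }
            threads = next;
        }
        return new int[] { bestType, bestStop };  // rewind input to bestStop and emit
    }
}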

ANTLR v4 uses 42 bytes to encode the entire L2 grammar. ANTLR v3 generates 246 lines of Java and a 2709-byte Java .class file:

/tmp $ wc -l L2.java
    246 L2.java
/tmp $ ls -l L2.class
-rw-r--r--  1 parrt  wheel  2709 Apr 30 16:39 L2.class

Notes from May 1, 2010

Got recursive lexer rule invocation working in a few hours from the "no calls allowed" core (it mimics LL(*) analysis, but at runtime).

	@Test public void testRecursiveCall() throws Exception {
		LexerGrammar g = new LexerGrammar(
			"lexer grammar L;\n" +
			"ACTION : '{' (ACTION|.)* '}' ;\n");
		String expecting = "ACTION, EOF";
		checkMatches(g, "{hi}", expecting);
		checkMatches(g, "{{hi}}", expecting);
		checkMatches(g, "{{x}{y}}", expecting);
		checkMatches(g, "{{{{{{x}}}}}}", expecting);
	}

Note how simple the bytecodes are for the grammar:

ACTION : '{' (ACTION | .)* '}' ;

gives:

0000:	split         5
0005:	match8        '{'      // Start of ACTION
0007:	split         14, 31
0014:	split         21, 27
0021:	call          5 // call ACTION
0024:	jmp           28
0027:	wildcard        
0028:	jmp           7
0031:	match8        '}'
0033:	accept        4

v4 does what you'd expect now: longest match with priority given to earlier rules upon match of same length.

It also handles case where it must remember all possible matches and rewind if it fails further on. This was highlighted in

http://www.antlr.org/jira/browse/ANTLR-189

Now it works automatically:

	@Test public void testRewindBackToLastGoodMatch_DOT_vs_NUM() throws Exception {
		LexerGrammar g = new LexerGrammar(
			"lexer grammar L;\n" +
			"NUM: '0'..'9'+ ('.' '0'..'9'+)? ;\n"+
			"DOT : '.' ;\n"+
			"WS : ' ' ;\n");
		checkMatches(g, "3.14 .", "NUM, WS, DOT, EOF");
		checkMatches(g, "9", "NUM, EOF");
		checkMatches(g, ".1", "DOT, NUM, EOF");
		checkMatches(g, "1.", "NUM, DOT, EOF");
	}

Here, "1." starts NUM and enters ('.' '0'..'9'+)? subrule due to '.' after '1'. Ooops, no digit after '.'. Rewind to spot where we looked like an integer: "1" then next match sees '.'. cool.

The Java implementation of this more complicated VM is still only about 1200 bytes (in Java bytecodes). It can use lots more memory at runtime than the "no rule invocation" version, though.

Notes from May 26, 2010

Just passing along an example HTML-subset lexer/parser built with ANTLR v4; thanks to debugging and moral support from Oliver Zeigermann, we got the code generation and runtime support working sufficiently to use the following grammars. They generate some really nice code indeed. You will note that, except for the lexer-modes enhancement, the grammars are backward compatible with v3 (smile)

I still have a long way to go, but it's looking more & more useful (only does LL(1) code generation at this point).

lexer grammar HTMLLexer;

TAG_START : '<' {pushMode(INSIDE);} ;

COMMENT : '<!--' .* '-->' {skip();} ;

TEXT : ~'<'+ ;

mode INSIDE;

TAG_STOP : '>' {popMode();} ;

END_TAG : '/' ID '>' {popMode();} ;

ID : ('A'..'Z'|'a'..'z'|'0'..'9'|'_'|'#')+ ;

EQ : '=' ;

STRING : '"' .* '"'
      ;

WS : ' '+ {skip();} ;

parser grammar HTMLParser;

options { tokenVocab=HTMLLexer; }

file : ( TAG_START (starttag | endtag) | TEXT)+ EOF ;

starttag : ID attr* TAG_STOP ;

attr : ID (EQ (ID|STRING))? ;

endtag
	:	 END_TAG
	;
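For reference, a hypothetical driver (assuming the v3-style runtime classes, since the grammars are otherwise v3-compatible; the input string is made up) would look like:

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;

public class TestHTML {
    public static void main(String[] args) throws Exception {
        HTMLLexer lexer = new HTMLLexer(new ANTLRStringStream("<b att=\"x\">hi</b>"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        HTMLParser parser = new HTMLParser(tokens);
        parser.file();   // match the top-level rule
    }
}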

context-sensitive lexers

Scott Stanchfield's thoughts