ANTLR 1.0 (PCCTS) used a traditional DFA-based lexer. As I moved to ANTLR v2, I realized that we could tokenize input using the same recursive descent mechanism used for parsers. Unfortunately, with only fixed lookahead, ANTLR v2 lexers were rather difficult to build because you had to do a lot of your own left factoring. ANTLR v3 introduced arbitrary regular lookahead with LL(*), which makes lexer rules much easier to write. The lexers are really powerful because you can handle recursive constructs like nested comments, and you can have semantic predicates and arbitrary actions. Unfortunately, it still seems more difficult to build a tokenizer with ANTLR than with a traditional lex-like approach. Using LL(*) to compute DFAs from lexer rules is also pretty expensive. Finally, generating a lot of DFA state machines in Java is a pain due to the lack of static array initializers in Java class files.

For ANTLR v4, I'm contemplating a hybrid mechanism that combines the best of state machines with the best of ANTLR's flexibility. The basic idea is to create an internal NFA-like recursive transition network to represent all of the lexer rules and then generate a fixed representation of it during code generation. At runtime, the lexer would backtrack across the state machine, very much like the ANTLR grammar interpreter that ANTLRWorks uses (although there are some bugs in it at the moment). NFAs are notoriously slower to interpret than DFAs, but after looking at a number of lexer grammars, I concluded that a little bit of fixed lookahead a la LL(k) for k=2 would make them very fast.
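
To make this concrete, here is a rough sketch in Java, with entirely hypothetical names (this is not the actual v4 design), of what a flat, interpretable representation of the lexer rules might look like. Tables like these could also be packed into string literals to dodge the static-array-initializer problem mentioned above.

Code Block

// Hypothetical flat representation of the NFA-like transition network.
// Each state has outgoing edges; an edge either matches a character
// range, is an epsilon transition, or calls another rule's subnetwork.
final class Edge {
    final char lo, hi;      // character range to match (inclusive)
    final int target;       // state to move to after taking this edge
    final int callRule;     // >= 0: push a return state, jump to this rule
    final boolean epsilon;  // true: take this edge without consuming input

    Edge(char lo, char hi, int target, int callRule, boolean epsilon) {
        this.lo = lo; this.hi = hi; this.target = target;
        this.callRule = callRule; this.epsilon = epsilon;
    }
}

final class TransitionNetwork {
    Edge[][] edges;     // edges[s] = outgoing edges of state s
    int[] ruleStart;    // start state of each lexer rule
    int[] acceptToken;  // token type accepted in state s, or -1 if none
}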

The beauty of a state machine over generated code full of IF/WHILE statements is that we can jump around anywhere we want inside the state machine during execution. If we get halfway through a token and realize it's not going to match, we can trivially rewind the input and jump to the appropriate state for the next most likely token. It also lets us do "longest match" behavior easily: we just keep track of all of the tokens we match until we stop matching characters, then pick the token for the longest sequence.
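
To sketch that longest-match loop (building on the hypothetical TransitionNetwork above, ignoring epsilon and rule-call edges for brevity; the real interpreter would also backtrack among alternatives): consume characters while some edge matches, remember the last accepting state we passed through, and rewind to it when we get stuck.

Code Block

// Consume characters while some edge of the current state matches,
// remembering the most recent accepting state; when no edge matches,
// rewind to the end of the longest match seen and emit that token.
final class NetworkLexer {
    int nextToken(TransitionNetwork net, CharSequence input, int[] posRef) {
        int pos = posRef[0];
        int state = 0;                    // assume state 0 tries every token
        int lastToken = -1, lastPos = -1;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            int next = -1;
            for (Edge e : net.edges[state]) {
                if (!e.epsilon && e.callRule < 0 && e.lo <= c && c <= e.hi) {
                    next = e.target;
                    break;
                }
            }
            if (next < 0) break;          // stuck: stop consuming
            state = next;
            pos++;
            if (net.acceptToken[state] >= 0) {  // passed an accepting state
                lastToken = net.acceptToken[state];
                lastPos = pos;
            }
        }
        if (lastToken < 0) return -1;     // no token matched at all
        posRef[0] = lastPos;              // rewind input to the longest match
        return lastToken;
    }
}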

Now, what about actions? How are we going to handle actions embedded inside rules? Well, that's easy enough to do by generating methods, one per action. If there are actions within a token, we rewind the input and match the token again once we are sure it will match; this time we do it with feeling, executing the actions. This is precisely the mechanism of syntactic predicates, which is a nice symmetry. What about local variables and parameters? If we shove actions into their own methods, how can they see the local variables defined in other actions? Well, as part of another feature (to avoid compilation errors during semantic predicate hoisting), I'm going to store all parameters in my own software stack rather than relying on the method call stack of the underlying target language. As for local variables, I'm going to formalize them so that I know when you are defining a local variable; I will store those in the software stack as well, so that they are available to all of the actions in a rule.
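
Here is a sketch of that software-stack idea (again with hypothetical names, not a committed API): each rule invocation pushes a frame holding its parameters and formalized locals, and every generated action method reads and writes through that frame, so all of the actions in a rule share the same variables.

Code Block

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

abstract class ActionSupport {
    // One frame per rule invocation; parameters and formalized locals
    // live here instead of on the target language's method call stack.
    protected final Deque<Map<String, Object>> frames = new ArrayDeque<>();

    protected void pushFrame() { frames.push(new HashMap<>()); }
    protected void popFrame()  { frames.pop(); }
    protected void setVar(String name, Object v) { frames.peek().put(name, v); }
    protected Object getVar(String name)         { return frames.peek().get(name); }
}

// What might be generated for a rule like:
//   ID : {int n=0;} ('a'..'z' {n++;})+ {print(n);} ;
final class IDActions extends ActionSupport {
    void ID_action1() { setVar("n", 0); }                          // {int n=0;}
    void ID_action2() { setVar("n", (Integer) getVar("n") + 1); }  // {n++;}
    void ID_action3() { System.out.println(getVar("n")); }         // {print(n);}
}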

What about recursive token references, as in the following rule:

Code Block

B : '[' (B|.)+ ']' ;

That's no problem: we simply have a transition in the state machine that pushes the current state and jumps back to the start of B.
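
To sketch the push-and-jump (building on the hypothetical Edge and TransitionNetwork above): a rule-call edge pushes the state to resume in and jumps to B's start state; reaching B's stop state inside a nested invocation pops back to the caller instead of ending the token.

Code Block

import java.util.ArrayDeque;
import java.util.Deque;

final class RuleCalls {
    final Deque<Integer> returnStates = new ArrayDeque<>();

    // Taking a rule-call edge: remember where to resume, then jump to
    // the called rule's start state.
    int call(Edge e, TransitionNetwork net) {
        returnStates.push(e.target);
        return net.ruleStart[e.callRule];
    }

    // At a rule's stop state: if this is a nested invocation, return to
    // the caller; otherwise the token is complete.
    int ruleStop(int stopState) {
        return returnStates.isEmpty() ? stopState : returnStates.pop();
    }
}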

I believe this mechanism would provide a more satisfying experience for the lexer grammar developer. It might actually be faster than the current recursive descent lexers in v3. It's easier to generate and faster to analyze. It should also take much less time to initialize at lexer runtime because the state machine will be smaller. Currently, v3 lexers are huge. For example, the Java lexer is 4747 lines! It's just not that hard a problem to deserve that much code. (wink) Oh, it should also make it much easier to build incremental lexers, if we decide to implement those.

There's also another issue, as Gavin Lambert says:

While it's usually possible to frame the contents of a looping construct in a positive sense (which is what ANTLR currently requires), it'd be nice if there were a language construct that would let you do specific negative matches too (maybe a feature request for v4?).

For (a completely made up) example, consider a case where you might want to match any identifier except one starting with "foo". In current ANTLR, you'd have to do one of these:

Code Block

 FOOLIST: 'foo[' NON_FOO_ID+ ']';                             // hand-written helper rule that factors out "foo"
 FOOLIST: 'foo[' ({!next_id_starts_with_foo()}? => ID)+ ']';  // gated semantic predicate
 FOOLIST: 'foo[' ((~'f' | 'f' ~'o' | 'fo' ~'o') => ID)+ ']';  // syntactic predicate spelling out the negation

It'd be nice if there was some way to express a negative match via a syntactic predicate, eg:

Code Block

 FOOLIST: 'foo[' (('foo') => ~ | ID)+ ']';

(where '~' in an alt basically means "break", i.e. match nothing and terminate the innermost loop.)
Or, perhaps better:

Code Block

 FOOLIST: 'foo[' (('foo') ~=> ID)+ ']';

(where '~=>' means "only take this path if the predicate fails")

...

I've moved this content to the v4 ANTLR pages.