These are some of my favorite things--PDA, RTN, ATN, DFA, NFA

At long last, I'm back on the ANTLR v4 rebuild after 9 months hiatus to write an academic LL(*) paper with Kathleen Fisher and release StringTemplate v4. Woot!

Ok, so what does all that title nonsense have to do with ANTLR v4? Well, v4 will use all those things at some point, either in analysis or in the generated code. I'm proposing something a little different for v4: Along with a recursive-descent parser, I'm going to serialize the grammar down to a simple PDA (push-down automaton) represented by the bytecodes of an interpreter (Derived partially from Cox' description of Thompson's 1960s work). The interpreter will serve many useful purposes:

a pure interpreted version of a lexer/parser/tree parser; related work: A Text Pattern-Matching Tool based on Parsing Expression Grammars, but I will need to extend for semantic predicates and actions
the backtracking mechanism for parsers and tree parsers
the mechanism for handling lexical constructs that a DFA cannot handle.
improve error recovery
support "code completion / intellisense" type functionality in development environments

From the bytecodes, we can build a data structure in memory that will be more convenient for a lot of purposes. The data structure will effectively be a recursive transition network (RTN) or, if there are predicates and actions, an augmented transition network (ATN).

Pure-interpreted parsing

Sometimes you need to parse some data but going through the whole ANTLR code generation and then compilation process is overkill. It would be nice to have a good interpreter that could parse input described by a grammar without having to deal with code generation. We can short-circuit the ANTLR tool at the point it's created an internal ATN but before code generation. Then, we can simply run an interpreter over the ATN.

If you don't mind a code generation phase but don't want a huge output file typical of recursive descent parsers, we could also generate the bytecode for a PDA into a Java file. The bytecodes are like a serialized version of the ATN. Even if you don't plan on interpreting the grammar, I think I will generate the bytecodes anyway for the recovery stuff I describe below. The bytecodes even for large grammar shouldn't be more than a few k or 10s of k bytes. Hmm... this does bring up the whole issue again in Java of generating static arrays (see Implementing parsers and state machines in Java).

Interpreted backtracking

Unlike most backtracking parsers, ANTLR v3 supports arbitrary actions, even those that cannot be undone. It does that by turning off the actions while backtracking. Upon success, it rewinds the input and then re-parses the input, this time with feeling to execute the actions. To pull this off, I need to generate an "if backtracking" conditional around every action, which causes a problem if you've defined a local variable in that action. All of a sudden that variable is invisible to the other actions. Also, at least the way I generate syntactic predicates in v3, backtracking results in code bloat.

If I have a separate mechanism to do backtracking, I don't need to guard all of the actions with a conditional and there's no code bloat. Syntactic predicates will launch an interpreter (as bytecodes or walking an ATN; doesn't matter) to see if a particular grammar fragment matches. Even if this turns out to be much lower than using generated recursive descent code fragments, ANTLR works very hard to remove unnecessary syntactic predicates. I doubt even a slow implementation would affect the overall parsing time significantly.

Handling context-free fragments in the lexer

ANTLR v3 lexers annoy me and probably you too. v4 lexers will be DFA-based and, theoretically, should be much faster as a result. Further, they will also behave much more as people expect in some edge cases. (I'm adding modes since they are really useful for parsing HTML and other embedded languages). The problem is I really want to support non-greedy operators, which are really hard to implement in a conventional DFA but trivial in an NFA/ATN. I also want to allow arbitrary actions interspersed among the rule elements. Oh, and recursion. What this comes down to is using the DFA as an optimization if you don't do anything wacky. If you come up with a rule that's recursive or has actions and non-right edges or non-greedy operators, ANTLR generates PDA bytecodes for those rules and, importantly, any rules that look like they could match the same thing. Upon entry to nextToken(), the lexer uses the current character to determine whether to jump to the PDA or the DFA. Oh, I almost forgot--Unicode characters can cause trouble in terms of table size for pure DFA. Range 'a'..'\u8000' for example, makes a transition table 0x8000 elements big for each state that could reference it. It makes more sense to deal with Unicode with a separate mechanism. I think JavaCC does this too.

Better error recovery

If we have a mapping from every location in the grammar to a position in an ATN, we should be able to do a pretty good job of error recovery. For example, we can supply error alternatives without bloating the generated code at all. Here is a contrived example with two regular alternatives and two error alternatives:

stat: 'return' expr ';'
    | ID '(' args ')' ';'
    / 'return' ';' // missing expr
    / ';' // missing statement
    ;

If there's a syntax error, we can suspend the parser and launch an interpreter that tries to match one of the error alternatives. If one succeeds, an action in that alternative will properly report an error and resynchronize. If nothing matches, we can do the standard recovery boogie.

The standard error recovery is to compute the set of possible following symbols, using the parsing stack as context to provide the exact follow rather than FOLLOW(the-rule). If we have the entire grammar represented in some way, ANTLR does not have to compute and then generate code to store bits sets for error recovery. I can just walk the ATN at runtime. It also doesn't have to pass these bits it's around as it does in v3. For rule reference r, currently I need to generate:

pushFollow(FOLLOW_r_set_for_this_ref_position);
r();
state._fsp--;

Likely I will need to keep some kind of pointer and stack of pointers into the ATN representation of the grammar in order to pull off this error recovery.

Code completion

Beyond error recovery, we can provide good information to the user about what the parser could've accepted at any particular error position.