Lessons and thoughts

Code reuse

As I rebuild ANTLR, it's becoming clear that the enemies of reuse are (at least):

  • type leakage
  • data structure entanglement
  • side effects

Perfect example: I passed a Grammar into the grammar tree walker for the walker to fill; it set options and such on the Grammar. That's a side effect, and the walker had to know about Grammar (type leakage). Another tool couldn't just get a grammar's options without loading Grammar.class. Grammar and the walkers got all tangled up, too.
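Something like this shape avoids both problems: the walker reports into a tiny collector and returns data, and Grammar (or any other tool) consumes the result afterwards. A minimal sketch with made-up names, not ANTLR's actual API:

import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: the tree walker reports each option it matches into
 *  this collector instead of setting it on a Grammar passed in. No side
 *  effects on Grammar, and a tool that only wants the options never has to
 *  load Grammar.class.
 */
class OptionsCollector {
    private final Map<String,String> options = new LinkedHashMap<String,String>();

    // the walker calls this for each option node it matches
    public void optionFound(String name, String value) {
        options.put(name, value);
    }

    // Grammar, an IDE, or any other tool pulls the result out afterwards
    public Map<String,String> getOptions() { return options; }
}

The dependency then points from Grammar to the walker's output rather than from the walker to Grammar.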

Data structure entanglement

I'm noticing a lot of places in the old code where I packed everything related to DFAs or DFA states into those objects. Not only does that make the objects really big, it makes it hard to determine when certain fields get computed. I also tended to update other structures on the fly, such as the DecisionProbe, as I computed DFAs. I think it's better to avoid side effects as much as possible, and in particular to avoid mutating one data structure while building another.

Here is a part of my grammar analysis pipeline:

// build the approximate-lookahead DFA for decision state s of grammar g
NFAToApproxDFAConverter approxConv = new NFAToApproxDFAConverter(g, s);
DFA dfa = approxConv.createDFA();
// verification is a separate pass over the finished DFA, in a separate object
DFAVerifier v = new DFAVerifier(dfa, approxConv);
v.verify();
if ( v.noProblems() ) return dfa;

I've moved out of DFA all of the temporary data structures that were related to creating the DFA, and I'll try to keep all of the verification stuff in this other object. I'll try computing everything after the fact instead of updating temporary data structures as I create the DFA. Everything just seems much cleaner when I don't have a million fields that are updated by lots of different pieces of code. Otherwise it's hard to determine the exact sequence of events and what the dependencies are, particularly if you need to use the code in another tool such as a development environment.
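For instance, rather than construction code updating "problem" fields all over the place, the verifier can derive the problem states from the finished DFA in a single pass. A rough sketch with stand-in classes, not the real DFA/DFAState API:

import java.util.ArrayList;
import java.util.List;

/** Stand-in state for illustration; the real DFAState carries much more. */
class SketchDFAState {
    int stateNumber;
    boolean hasNoViableAlt; // fixed once construction finishes
    SketchDFAState(int n, boolean dead) { stateNumber=n; hasNoViableAlt=dead; }
}

class SketchDFAVerifier {
    /** Compute the problem list after the fact: one read-only pass over the
     *  finished states, no shared fields mutated during construction.
     */
    public List<SketchDFAState> getDeadStates(List<SketchDFAState> states) {
        List<SketchDFAState> problems = new ArrayList<SketchDFAState>();
        for (SketchDFAState d : states) {
            if ( d.hasNoViableAlt ) problems.add(d);
        }
        return problems;
    }
}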

Grrr... as I create the DFA, I know exactly what's wrong with a state. It's a pity to throw that away and then walk the entire DFA later just to relearn the same information. I guess I'll report issues to an object as I go after all.

State machine architecture

One of the things I'm noticing about the previous implementation is that I had, more or less, a single implementation of NFA state, DFA state, and transition. I was not using a very object-oriented approach. Instead, I had "state type" and "transition type" variables to indicate how to process each state. I think I was afraid of spreading graph algorithms across 20 different class files; I wanted each algorithm in one spot. Unfortunately, the code looks like C code: lots of "if node n is of type x" testing, the hallmark of non-object-oriented programming. This time I'm creating a whole set of node types and transition types, hoping that each bit of code will be a little simpler even if it's spread around.
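A minimal sketch of the difference, with hypothetical names (the real transition classes will carry more data):

// old style: one transition class plus a type tag, tested all over the
// graph algorithms
class TaggedTransition {
    public static final int EPSILON=1, ATOM=2, RULE=3;
    public int type;
    public int label; // only meaningful when type==ATOM
}

// new style: a class per transition type; behavior lives with the type, so
// algorithms ask t.isEpsilon() instead of testing a tag everywhere
abstract class Transition {
    public abstract boolean isEpsilon();
}

class EpsilonTransition extends Transition {
    public boolean isEpsilon() { return true; }
}

class AtomTransition extends Transition {
    public final int label;
    public AtomTransition(int label) { this.label = label; }
    public boolean isEpsilon() { return false; }
}

The "if node n is of type x" tests disappear from the algorithms; the price is that the behavior is spread across more files.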

Grammar analysis

The enemy of beautiful code is a bunch of special cases, and that's precisely what the LL(*) algorithm looks like in the previous version: lots of error checking, lots of testing what kind of state I'm looking at, lots of testing what kind of transition I'm looking at, and then the lexer analysis on top of that. Lexers are a special case in and of themselves. For v4 I'm hoping to do some kind of hybrid mechanism; either way, I should keep the lexer analysis separate. I'm hoping to get the LL(*) algorithm back to something very simple, as it was when I first started experimenting.

Separating error handling from analysis and computation

Oftentimes, when building a method, I get all of the parameter checking and other error handling out of the way first so that the rest of the method can be clean. So each method looks like:

void foo(parameters) {
    check parameters and any other error-checking
    do computation
}
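Concretely, a method in that shape might look like this (a made-up example, just to show the structure):

int sumRange(int[] values, int from, int to) {
    // all of the error checking, up front
    if ( values==null ) throw new IllegalArgumentException("values is null");
    if ( from<0 || to>values.length || from>to ) {
        throw new IllegalArgumentException("bad range "+from+".."+to);
    }
    // the computation, free of error handling
    int sum = 0;
    for (int i=from; i<to; i++) sum += values[i];
    return sum;
}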

When I look through the code in v3, I notice error-checking code strewn throughout, which makes it very messy to read because it obscures the computation with error handling. I'm now taking the approach of getting as much error handling out of the way as possible right away. Hopefully this will leave the rest of the tool much easier to write and manage because it just has to compute without worrying about errors. The only exception might be the grammar analysis section, but those are warnings about ambiguities, not errors.

My semantic-checking plan is to create tree pattern-matching grammars that trigger checks in an associated object: XXXTriggers.g pairs with XXXChecks.java. Here are some of my initial triggers:

rule:   ^(RULE r=ID .*) {BasicSemanticChecks.checkInvalidRuleDef(gtype, $r.token);}
    ;

ruleref
    :   RULE_REF {BasicSemanticChecks.checkInvalidRuleRef(gtype, $RULE_REF.token);}
    ;

tokenAlias
    :   {inContext("TOKENS")}? ^(ASSIGN TOKEN_REF STRING_LITERAL)
        {BasicSemanticChecks.checkTokenAlias(gtype, $TOKEN_REF.token);}
    ;

Then, the associated checking class has a nice set of "rules" like this:

    /** Token aliases in the tokens{...} section are illegal in a lexer
     *  grammar, so flag them there.
     */
    protected static void checkTokenAlias(int gtype, Token tokenID) {
        String fileName = tokenID.getInputStream().getSourceName();
        if ( gtype==ANTLRParser.LEXER_GRAMMAR ) {
            ErrorManager.grammarError(ErrorType.CANNOT_ALIAS_TOKENS_IN_LEXER,
                                      fileName,
                                      tokenID,
                                      tokenID.getText());
        }
    }
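To fire the triggers, I'd expect the usual ANTLR v3 filter-mode walk; assuming XXXTriggers.g is generated with filter=true (so it extends TreeFilter) and is called BasicSemanticTriggers (a name I'm making up here to match BasicSemanticChecks), the driver would be something like:

// downup() walks the grammar tree top-down then bottom-up, running whichever
// trigger rules match and thereby invoking the checks
CommonTreeNodeStream nodes = new CommonTreeNodeStream(grammarTree);
BasicSemanticTriggers triggers = new BasicSemanticTriggers(nodes);
triggers.downup(grammarTree);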