ANTLRWorks2 design thoughts

ramblings about design as stream of consciousness

I built a quick mockup with NetBeans just so that I have all of the Windows in front of me with a couple of faked images. The basic design is easy because NetBeans allows us to move windows around as we want. I happen to lay the Windows out like this:

(Sam already has the navigator and the editor Windows filled in, but I was too lazy to incorporate.)

Every window has content, publishes events, and listens for events. For example the navigator window, listens for publication events from the editor window saying "heh I have a grammar". It then displays the list of lexer and parser rules. It, in turn, publishes selection events that the other Windows can listen in on. The editor window moves the cursor position to the selected rule. The properties window might say something about that rule. The syntax diagram window shows the syntax diagram for the selected rule.

Keeping Windows in sync

Lots of the Windows published selection events: navigator window, editor window, syntax diagram window, AST window, and parse tree window. They also listen for the selection events so that all of those Windows can stay in sync. If you select a rule in the navigator, perhaps it selects all rule references in the parse tree.

There's a difference between selecting a rule in the navigator and selecting a specific rule or token reference in a tree. The tree views are representations of the tokens in the input window and how they are parsed. So the trees and the input windows need to be kept in sync and they also published selection events to highlight the appropriate elements in the grammar.

The only selection event published by the editor window is the current rule. I guess that means that as you move the cursor through the editor, the rule in the navigator should also be selected.

Debugger

There is no "debugger" per se. We simply run or interpret the grammar and get a parse tree and, if running with output=AST, we also get an AST. Once we have recorded the entire parse, we may want to walk back through it. Technically, there's no reason to actually step through anything because you can see the entire parse. That said, I think it would be useful to be able to single step by input token or even position within the syntax diagram. This information would be available via the parse log generated during interpretation or parser execution. I have no idea what that log will look like but it will clearly need to include:

  • grammar (as a reference to the filename or the actual file contents)
  • the input
  • the parse tree. this would have parameters and local variable values if it's being run not interpreted
  • the AST if this thing is being run not interpreted

By selecting elements in the AST or parse tree, the properties window can show information about that node. For example, the parse tree node property sheet which show whatever labels, parameters, etc... were computed. We could even show via reflection the elements of the AST nodes very much like a full Java debugger. At some point, we might actually use the debugger components of the NetBeans platform to make this more "live"

I do not intend to expose the underlying augmented transition network (ATN) to the user. Any kind of stepping should be done by walking the syntax diagram (which is really the dual of the ATN anyway).

Internal models

ANTLR v4 parses grammars into GrammarAST heterogeneous ASTs and then constructs ATNs from those ASTs and finally generates code from the ASTs. I can't get away from building some kind of internal data structure and we already have the GrammarAST so let's keep it. I also need the ATN so that I can generate it into the output (and also use it for the interpreter). Can't get away without that either. Okay so we are stuck with both the tree and an ATN model of the input.

What model or models do we use for the navigator tree and the syntax diagram. The Nodes API needs nodes objects for the navigation window. These will be wrappers on some other internal model object I guess.

The syntax diagram prototype that Colin is working on uses its own model of the grammar rule such as Rule, Alternative, Element, token ref, rule ref, loop, etc... That is probably fine for experimenting, but even as we moved to a prototype it seems like we should come up with an internal model needed by ANTLRWorks that we can share.

One option is to have Colin walk the AST manually or with a tree grammar to compute size and position of the syntax diagram subcomponents. I could also augment the GrammarAST class hierarchy to beogrammer friendly and include a visitor dispatch mechanism. The current hierarchy is actually almost like a simple input model like what Colin is using now:

It looks like I would have to add AST nodes for some of the other rule elements such as * and + loops, but that is no big deal. Then, instead of simple getChildren() I could add specific methods such as AltAST.getRuleElements() etc... This would make it much easier to use by other people.

What do the navigator nodes wrap? The top level nodes are just lexer and parser so they can wrap just a string and pointer to the underlying v4 Grammar object. The second level nodes are the rule names. Perhaps they should wrap the Rule objects created by v4. These Rule objects have a pointer to the AST which knows where it came from in the text of the grammar. That would allow us to navigate, moving the cursor in the editor window. Actually, perhaps the navigator should simply wrap the (not yet implemented) RuleAST nodes.

Yep, as I write this, it seems like the right answer is to beef up the AST class hierarchy so that it's more programmer friendly. If I did this again, I would go the parse tree but that would require some pretty significant changes through the code generator and all the other systems. Maybe I should simply have ANTLR created a parse tree as well as an AST using some kind of option. That parse tree would make it harder to walk, however, at least manually.

Uniquely identifying window elements

We need a way to uniquely identify the rule elements such as a token references and rule references within a rule. We also need to map that unique identifier to a node in the syntax diagram. Nodes in the parse tree and AST must map back to these unique identifiers so that we can highlight elements in the grammar. Basically we need to publish "location 46" and have at least three Windows know which element to highlight: the parse tree, the AST, and the syntax diagram. The navigator should also highlight the rule that contains location 46, whatever that element is.

We can't use the ATN state number because AW has no reason to create the ATN unless it's planning on doing interpretation of the grammar. It would probably create this on-demand. Probably would be better to use something like the unique identifier mechanism I had in ANTLR v3:

public class GrammarAST extends CommonTree {
	static int count = 0;
	public int ID = ++count; // unique identifier per AST node
	...
}

As the syntax diagram created nodes from AST nodes, it would record a reference back to the AST from which it was derived. Selecting this syntax diagram node later, would publish the unique identifier of the grammar AST node.

The interpreter feeds off of the ATN, whose states all know from which GrammarAST node they were derived (or could be made to track). As it created parse tree elements, it could track the appropriate unique identifier pulled from the GrammarAST nodes. Uh, wait. The ATN state Transition objects are the things that correspond to grammar elements. As it transitions, the interp would record the GrammarAST identifier.

If we actually execute a parser, we have a problem. At parse time, we always know where we are in the ATN, but we have no idea how that corresponds to elements in the original grammar. I guess I could create a mapping as part of the ATN I generate in the parser file. The mapping would convert ATN transitions to the unique identifiers in the GrammarAST nodes. They are like pointers that would transcend Java VM executions. As long as we have the same grammar, the GrammarAST unique identifiers would be the same. That assumes of course that we have a deterministic process for creating the unique identifiers, which we would by using the above counter.

Such a map would be extremely useful for syntax error reporting during development. Given a state within the ATNState, the parser could immediately identify where exactly in the grammar the input fail to match. We could even pop up ANTLRWorks and show you exactly where in the grammar the parse was.

As part of a parse log, we could keep the exact list of GrammarAST unique identifiers that we pass through. At the moment, I track the ATN state number, but instead I think I should track the transition objects' GrammarAST unique identifiers. This way, ANTLRWorks could single step through the syntax diagrams and grammar to show you the trace.

We also want to be able to click on an input symbol and have ANTLRWorks highlight where the grammar and syntax diagram that token was matched. I guess we need a list of GrammarAST unique identifiers that correspond to the input symbols. Four 100 input symbols, we would have a list of 100 unique identifiers. to single step through the input would then be easy. So, we would have both single step by input as well as single step through the grammar as a result of a parse. tasty.

Action items

I guess I just need to do this:

  • Beef up GrammarAST, adding more types, getters
  • Add unique ID to GrammarAST
  • Add GrammarAST ptr in Transition objects
  • Generate information in the generated code that maps ATN transitions to GrammarAST unique identifiers.
  • Alter the trace ATN mechanism so that it tracks ATN transitions not the states
  • Create a mechanism to record the unique identifier for each input symbol matched during a parse.