ANTLR v3 issues by Sam Harwell
Some things to put on a list for later. Let me know if either A) you
can't distinguish the nested bullets to associate them with the
top-level ones or B) the hyperlinks did not come through (there are a
few of them - either all or none should appear).
- The main Tool class needs to be usable from another class. I
want to use it for background parsing in the IDE so I can provide exact
error messages with line numbers, etc. This includes:
  - Removing all static mutable state so two independent instances of
the Tool are guaranteed to be mutually thread safe.
  - Having an option to redirect output (something other than a file on
disk). There is no need to create lots of temporary files when used this way.
  - Getting error messages associated with a particular instance of the
tool, including the location in the file.
  - Error messages regarding references in actions need to include
line/column numbers (they currently report either 0,0 or the { brace
that opens the action, I believe).
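The thread-safety point is the standard static-state hazard. A minimal sketch in plain Java (generic illustration, not actual ANTLR Tool code; all names here are made up) of how a static field couples otherwise independent instances:

```java
public class StaticStateDemo {
    static int sharedErrorCount = 0;   // static: shared across all instances
    int errorCount = 0;                // instance field: isolated per instance

    void reportError() {
        sharedErrorCount++;
        errorCount++;
    }

    public static void main(String[] args) {
        StaticStateDemo a = new StaticStateDemo();
        StaticStateDemo b = new StaticStateDemo();
        a.reportError();
        a.reportError();
        b.reportError();
        // instance counts stay per-tool; the static count does not,
        // so concurrent runs would see each other's errors
        System.out.println(a.errorCount + " " + b.errorCount + " " + sharedErrorCount);
    }
}
```

This is exactly why two Tool instances built on static mutable state cannot be mutually thread safe: the "shared" column reflects both runs.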
- Functions should not be generated for unreferenced empty
fragment rules. This could be extended to functions not being generated
for all empty non-public rules.
- The Java version should be updated to implement rule
visibility (it's in the grammar but not hooked up). You may or may not
want to make it work like the C# version currently does, but I believe
it's a useful feature in the end.
- The Java version should be updated to handle "sets" with
additional flexibility. Currently its analysis breaks down for certain
simple things like "rule : (set1) | (set2);" (and variations). I found
this when converting the grammars to v3, so the C# version currently handles
this. A side benefit here is that the performance impact of using fragment
rules in lexers for things like DIGIT or LETTER is reduced (but
unfortunately not eliminated).
  - Lightweight ("short") fragment rules in the lexer could be inlined
when both the caller and callee have no predicates or actions.
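For concreteness, a sketch of the kind of v3 lexer rules both points describe (illustrative names, not taken from any particular grammar): the subrule alternatives are plain sets, and the fragments are short, action-free, and thus inlining candidates.

```
// DIGIT and LETTER are set-only fragments with no actions or
// predicates; ideally references to them collapse to set membership
// tests (or are inlined) rather than generated method calls.
fragment DIGIT  : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' | '_' ;
IDENT : LETTER (LETTER | DIGIT)* ;
```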
- The Java version could be updated to remove redundant blocks
of code. This refers to patterns like "if (...) alt = 3;" followed by
"if (!(...)) throw ...;" in the alt itself. I've observed a significant
performance impact from the redundancy in the profiler for an optimized
grammar (almost no predicates and short lookahead), where fixing the
issue in the C# version of the Tool resolved it. I can get you the P4 CL
numbers for the fixes associated with this in the C# port - there are
many cases to consider including cross-rule checks in the lexer. A side
benefit is smaller generated code.
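The shape of the redundancy, sketched in plain Java (made-up names, not actual generated code): the dispatch evaluates a predicate to pick the alternative, and the chosen alternative then re-evaluates the same predicate as a guard that can never fail at that point.

```java
public class RedundantGuardDemo {
    static int evalCount = 0;

    // stand-in for a (possibly expensive) semantic predicate
    static boolean pred(int c) {
        evalCount++;
        return c >= '0' && c <= '9';
    }

    // the pattern described above: "if (...) alt = 3;" followed by the
    // alt re-checking with "if (!(...)) throw ...;"
    static int redundant(int c) {
        int alt = 1;
        if (pred(c)) alt = 3;
        if (alt == 3 && !pred(c)) throw new IllegalStateException();
        return alt;
    }

    // collapsed form: when alt == 3 the guard is provably true,
    // so the second evaluation is dropped entirely
    static int collapsed(int c) {
        int alt = 1;
        if (pred(c)) alt = 3;
        return alt;
    }

    public static void main(String[] args) {
        evalCount = 0;
        int r = redundant('7');
        int redundantEvals = evalCount;
        evalCount = 0;
        int c = collapsed('7');
        int collapsedEvals = evalCount;
        System.out.println(r + "," + c + " evals: " + redundantEvals + " vs " + collapsedEvals);
    }
}
```

With many alternatives and long predicates, dropping the re-check both shrinks the generated code and removes a measurable per-decision cost.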
- The lexer should automatically implement keyword style rules
(string literals with no actions) using an internal hash table. This
would be behind the scenes - no requirement to place an action in the
IDENTIFIER rule to perform the lookup. This would yield a significant
performance boost (shorter lookahead for identifiers in the lexer) and
fix many of the "generated code was too large to compile" problems.
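A minimal Java sketch of the behind-the-scenes mechanism (the token-type constants are assumptions for the sketch; this is not the ANTLR runtime API): match every word with the identifier pattern, then reclassify it through a hash table instead of encoding each keyword into the lexer DFA.

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordTableDemo {
    // made-up token type constants for the sketch
    static final int IDENTIFIER = 4, KW_IF = 5, KW_WHILE = 6;

    static final Map<String, Integer> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", KW_IF);
        KEYWORDS.put("while", KW_WHILE);
    }

    // after the identifier pattern matches, look the text up once;
    // the grammar author writes no action in the IDENTIFIER rule
    static int classify(String text) {
        return KEYWORDS.getOrDefault(text, IDENTIFIER);
    }

    public static void main(String[] args) {
        System.out.println(classify("while"));  // keyword token type
        System.out.println(classify("whilst")); // ordinary identifier
    }
}
```

Because keywords no longer participate in the DFA, identifier lookahead shrinks and the giant keyword-matching tables that trigger "code too large" errors disappear.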
The following are more opinionated:
- Auto-AST syntax in rules is much more efficient than rewrites.
When the only change that needs to be made is changing the type of the
root, it's unfortunate that "rule : ID^ NUMBER;" has to become "rule :
ID NUMBER -> ^(SOMETHING ID NUMBER)". Not sure what we can do about
this - maybe hijacking the hetero-AST syntax since you know I'm pushing
for its removal. Another option is detecting cases where Auto-AST
would build a correct tree shape and generating code of that form, then
embedding the appropriate tree adapter calls in its generated code. The
latter is slightly more complicated, but much more powerful and results
in a much wider class of rewrite rule optimizations without placing any
requirements on the developer to make use of it.
- Add @enter{} and @leave{} blocks for states. This is much more
reliable than initializing them in the rule @init{} blocks.
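A sketch of what the proposed syntax might look like (hypothetical; neither the blocks nor the `nestingDepth` member exist today):

```
// @enter would run whenever the rule is entered and @leave whenever
// it is left, including early exits - unlike manual @init bookkeeping
expr
@enter { nestingDepth++; }
@leave { nestingDepth--; }
    : ID ('+' ID)*
    ;
```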
- Since you still don't believe me that enclosing rule
references are not ambiguous, add a $$ operator for referencing the
current AST of the enclosing rule.
- Set the TokenStartIndex and TokenStopIndex for iteratively
generated ASTs, as are often found in rules for expressions. It's
obvious when this doesn't work because the AST Explorer
<http://www.antlr.org/pipermail/antlr-interest/attachments/20090423/97705305/attachment.png>
breaks. To resolve the original question about how
it should work, the AST Explorer demonstrates the expected use case for
those properties.
- It'd be nice to have a generalized attribute syntax
(annotations for you Java users) for rules and the grammar declaration
to replace current @options{} blocks. Example syntax from Java
<http://java.sun.com/docs/books/tutorial/java/javaOO/annotations.html> ,
C#, Visual Basic, and C++/CLI
<http://msdn.microsoft.com/en-us/library/bfz783fz.aspx> (three on one
page), and F#
<http://msdn.microsoft.com/en-us/library/dd233179(VS.100).aspx> (also
simplified documentation
<http://lorgonblog.spaces.live.com/blog/cns!701679AD17B6D310!161.entry>
for the F# syntax) are available. The current @options{} blocks would be
deprecated but remain in place until v4. Several built-in attributes
would be available, such as Target("CSharp3") for a grammar, where all
others are made available to the target for processing. I prefer the
explicit bracketing used in the .NET languages to Java's open syntax -
of those the C# one is widely known but the F# one is an option if the
plain square brackets cause ambiguities in the grammar. The F# syntax
just uses < and > as the delimiters. Unlike the languages above, we
wouldn't need to declare attribute classes - we're just borrowing the
syntax because it's clean, extensible without parsing difficulties, and
everyone knows it.
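A sketch of how the proposal could read with the C#-style brackets (hypothetical syntax; only Target("CSharp3") is named above, and SomeTargetOption is an invented placeholder for a target-processed attribute):

```
// hypothetical replacement for @options{} blocks
[Target("CSharp3")]
grammar T;

// attributes ANTLR doesn't recognize would be handed to the target
[SomeTargetOption("value")]
rule : ID ;
```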
- We should gather a group of production grammars (I'm fine with
NDA(s) for this) and examine all the target-specific content. If many of
them show common high-level functions, it raises the question of whether ANTLR should
provide general handling for it. In particular, predicates (especially
semantic predicates) incur significant runtime overhead, so we should
look at how and where they are being used and see if we can improve the
situation without too much trouble. I say production grammars because
they are the best real-world use cases we can get.