ANTLR v3 issues by Sam Harwell

Some things to put on a list for later. (smile) Let me know if either A) you
can't distinguish the nested bullets to associate them with the
top-level ones or B) the hyperlinks did not come through (there are a
few of them - either all or none should appear).


  • The main Tool class needs to be usable from another class. I
    want to use it for background parsing in the IDE so I can provide exact
    error messages with line numbers, etc. This includes:
    • Removing all static mutable state so two independent instances of
      the Tool are guaranteed to be mutually thread safe.
    • Having an option to redirect output (something other than a file on
      disk). No need to be making lots of temporaries when used for this.
    • I need the ability to get error messages associated with a
      particular instance of the tool, including the location in the file.
    • Error messages regarding references in actions need to include
      line/column numbers (I believe they currently report either 0, 0 or
      the position of the { brace that opens the action).
  • Functions should not be generated for unreferenced empty
    fragment rules. This could be extended to functions not being generated
    for all empty non-public rules.
  • The Java version should be updated to implement rule
    visibility (it's in the grammar but not hooked up). You may or may not
    want to make it work like the C# version currently does, but I believe
    it's a useful feature in the end.
  • The Java version should be updated to handle "sets" with
    additional flexibility. Currently its analysis breaks down for certain
    simple things like "rule : (set1) | (set2);" (and variations). I found
    this when converting the grammars to v3, so the C# version currently
    handles it. A side benefit is that the performance impact of using
    fragment rules in lexers for things like DIGIT or LETTER is reduced
    (but unfortunately not eliminated).
    • Lightweight ("short") fragment rules in the lexer could be inlined
      when both the caller and callee have no predicates or actions.
  • The Java version could be updated to remove redundant blocks
    of code. This refers to patterns like "if (...) alt = 3;" followed by
    "if (!(...)) throw ...;" in the alt itself. I've observed a significant
    performance impact from the redundancy in the profiler for an optimized
    grammar (almost no predicates and short lookahead), where fixing the
    issue in the C# version of the Tool resolved it. I can get you the P4 CL
    numbers for the fixes associated with this in the C# port - there are
    many cases to consider including cross-rule checks in the lexer. A side
    benefit is smaller generated code.
  • The lexer should automatically implement keyword style rules
    (string literals with no actions) using an internal hash table. This
    would be behind the scenes - no requirement to place an action in the
    IDENTIFIER rule to perform the lookup. This would give a significant
    performance boost (shorter lookahead for identifiers in the lexer) and
    fix many of the "generated code was too large to compile" problems.
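
A minimal Java sketch of the keyword-table idea in the last bullet above: the lexer matches every identifier-shaped token with one short rule, then a hash lookup decides whether the text is really a keyword. The class name KeywordTable, the token type constants, and the tokenType method are all hypothetical names chosen for illustration; this is not code the Tool generates today.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: classify identifier-shaped text as keyword or IDENTIFIER. */
final class KeywordTable {
    // Token type numbers are made up for this example.
    static final int IDENTIFIER = 4;
    static final int KW_IF = 5;
    static final int KW_WHILE = 6;

    // Built once and never modified afterwards, so lookups are safe to share.
    private static final Map<String, Integer> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", KW_IF);
        KEYWORDS.put("while", KW_WHILE);
    }

    private KeywordTable() {}

    /** Returns the keyword token type for text, or IDENTIFIER if it is not a keyword. */
    static int tokenType(String text) {
        Integer type = KEYWORDS.get(text);
        return type != null ? type : IDENTIFIER;
    }
}
```

With a table like this, the generated lexer would only need enough lookahead to recognize an identifier, rather than a DFA that tells every keyword apart character by character.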


The following are more opinionated:


  • Auto-AST syntax in rules is much more efficient than rewrites.
    When the only change that needs to be made is changing the type of the
    root, it's unfortunate that "rule : ID^ NUMBER;" has to become "rule :
    ID NUMBER -> ^(SOMETHING[$ID] NUMBER)". Not sure what we can do about
    this - maybe hijacking the hetero-AST syntax since you know I'm pushing
    for its removal. (wink) Another option is detecting cases where Auto-AST
    would build a correct tree shape and generating code of that form, then
    embed the appropriate tree adapter calls in its generated code. The
    latter is slightly more complicated, but much more powerful and results
    in a much wider class of rewrite rule optimizations without placing any
    requirements on the developer to make use of it.
  • Add @enter{} and @leave{} blocks for states. This is much more
    reliable than initializing them in the rule @init{} blocks.
  • Since you still don't believe me that enclosing rule
    references are not ambiguous, add a $$ operator for referencing the
    current AST of the enclosing rule.
  • Set the TokenStartIndex and TokenStopIndex for iteratively
    generated ASTs, as are often found in rules for expressions. It's
    obvious when this doesn't work because the AST Explorer
    <http://www.antlr.org/pipermail/antlr-interest/attachments/20090423/97705305/attachment.png>
    breaks. To resolve the original question about how
    it should work, the AST Explorer demonstrates the expected use case for
    those properties.
  • We should gather a group of production grammars (I'm fine with
    NDA(s) for this) and examine all the target-specific content. If many of
    them show common high-level functions, we should ask whether ANTLR should
    provide general handling for them. In particular, predicates (especially
    semantic predicates) incur significant runtime overhead, so we should
    look at how and where they are being used and see if we can improve the
    situation without too much trouble. I say production grammars because
    they are the best real-world use cases we can get.
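
A rough Java sketch of the behavior the TokenStartIndex/TokenStopIndex item above asks for. It assumes a left-associative expression rule where each loop iteration makes the operator the new root. Node and IterativeTreeBuilder are hypothetical stand-ins, not ANTLR runtime classes; the field names simply mirror the property names mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node that records the token range it covers. */
final class Node {
    final String text;
    final List<Node> children = new ArrayList<>();
    int tokenStartIndex = -1;
    int tokenStopIndex = -1;

    Node(String text) { this.text = text; }
}

/** Hypothetical builder for trees grown one operator at a time. */
final class IterativeTreeBuilder {
    private IterativeTreeBuilder() {}

    /** Creates a leaf that covers exactly one token. */
    static Node leaf(String text, int tokenIndex) {
        Node n = new Node(text);
        n.tokenStartIndex = tokenIndex;
        n.tokenStopIndex = tokenIndex;
        return n;
    }

    /** Makes op the root over left and right and widens its token range to span both. */
    static Node becomeRoot(Node op, Node left, Node right) {
        op.children.add(left);
        op.children.add(right);
        op.tokenStartIndex = left.tokenStartIndex;
        op.tokenStopIndex = right.tokenStopIndex;
        return op;
    }

    /** Builds a left-associative tree from tokens shaped like: operand (operator operand)*. */
    static Node buildSum(String[] tokens) {
        Node root = leaf(tokens[0], 0);
        for (int i = 1; i + 1 < tokens.length; i += 2) {
            Node op = new Node(tokens[i]);
            root = becomeRoot(op, root, leaf(tokens[i + 1], i + 1));
        }
        return root;
    }
}
```

For the tokens a + b + c at indices 0 through 4, the final root covers tokens 0 to 4 and the inner "+" covers 0 to 2, which is the kind of range a tool like the AST Explorer needs to map a node back to its source text.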