Blog from December, 2012

In null vs missing vs empty vs nonexistent in ST v4 a few years ago, I tried to resolve in my head the difference between a missing attribute, a null value, an array with no elements, and a string with no characters. I don't think I got it completely thought through and ST v4 might have some weird inconsistencies. This page is an attempt to finally write down all the cases and resolve exactly how things should work.

I think the general rule should be: no complete <...> expression evaluation ever equals null.  A lone x can have value null, but the resulting <x> evaluates to the empty string. A missing entry in a list like [a,,b] is the only way to create a null value in ST and I think this might have been a mistake on my part to allow missing elements.

Secondly, an undefined attribute x or attribute property x.y is the same as null. Note that undefined y gives a warning to the listener.

Single-valued attribute values

This table shows the result of evaluating the indicated expression given the value of x:

exprx undefinedx nullx=""x=list len=0
<x>""""""""
<x:t()>""""""

""

<x; null="y">yy""""
<x:t(); null="y">yy""""
<if(x)>y<endif>""""y""
<if(x)>y<else>z<endif>zzyz

For that, I assume that x.y where x and/or y are undefined is also considered the same as plain x is undefined or null.

I noticed that people are trying to do filtering inside ST expressions, which is not my intention as I think it violates model view separation. All filtering and computation must be done in the model.  The template is only to display the list not to compute the list. This brings us to what lists of values do.

Multi-valued attribute values

We must distinguish between a list, array, or other Iteratable passed in as an attribute and what we create using the list [...] operator in ST expressions. Let's start with attributes passed in. x is an attribute  set to the value of the Java array construction specified by a surrounding Java application; a,b Java variables have string values "a", "b". t and u are templates that repeat their single argument.

exprx={}x={a}x={a,b}x={null}x={null,b}x={a,null}x={a,null,b}
<x>""aab""baab
<x; null="y">""aabyybayayb
<x; separator=",">""aa,b""baa,b
<x; null="y", separator=",">""aa,byy,ba,ya,y,b
<if(x)>y<endif>""yyyyyy
<x:{it | <it>}>""aab""baab
<x:{it | <it>}; null="y">""yabyybayayb
<x:{it | <i>.<it>}>""1.a1.a2.b""1.b1.a1.a2.b
<x:{it | <i>.<it>}; null="y">""1.a1.a2.byy2.b1.ay1.ay3.b
<x:{it | x<if(!it)>y<endif>}; null="z">""xxxzzxxzxzx
<x:t():u(); null={y}>""aabyybayayb

Notice that a null value never gets passed as the iterated value.  Subtemplates are not applied to empty values. The null option is evaluated for null elements.

Currently, null option values are not counted in the <i> iterated index variable. This would be a breaking change.

Do we need a way to say null inside of template expressions? One thought might be as the default value for a template parameter, but null is the default value for a named parameter.  Parameter x has no value unless we set it.

List construction in expressions

Ok, now let's look at [...] list construction within ST expressions. [...] cats lists together or creates a list from single-valued elements. [a,b] is the same as creating {a,b} in Java. If we have lists A,B then [A,B] is a single list with all the elements combined from A and B. I currently allow a missing element to mean the same thing as null so [a,,b] is the same as creating {a,null,b} in Java.

[ ] is a non-null list with no elements. <[]> is "", <[]:t()> is "". 

exprvaluenotes
<[ ]>""list is empty, yields empty string
<[ ]; null="x">""list is empty; no null element => no x
<[[ ], [ ]]:{it | <if(it)>x<endif>}; separator=",">""[[],[]] collapses to [ ]
<[ ]:t()>""nothing to apply; template not evaluated
<[ ]:{it | <if(it)>x<endif>}>""nothing to apply; template not evaluated

Any missing map entry evaluates to null.

For dictionary d ::= [x:"x"], <d.(x):{it | <it>}> is x and <d.(y):{it | <it>}> is "".

So, to summarize, I think we're ok as-is. I propose that we change null values to count in <i> index expressions if there is a null option in a future version like 4.1 but let's not change it on a point release like 4.0.7.

 

Tree rewriting in ANTLR v4

Because most ANTLR users don't build compilers, I decided to focus on the other applications for ANTLR v4: parsing and extracting information and then translations. For compilers, we need to convert everything into operations and operands--that means ASTs are easier. For example, 3+4 should become a tree with + at the root and 3 and 4 as operands/children: (+ 3 4). The parse tree in contrast is probably (expr 3 + 4) where rule reference expr is the root node and the other elements are children. The parse tree is indeed less useful for compiler work.

On the other hand, parse trees are usually what we want for data extraction and translation tasks. Moreover, the parse trees can be automatically generated and ANTLR can generate listener and visitor tree walker mechanisms automatically as well. Users of ANTLR v4 don't have to learn AST construction syntax or AST tree grammar syntax and semantics.

ANTLR itself is a very complicated translator. It reads in grammar files and generates multiple Java files as output. The translation is complicated by the fact that the output looks very different from the input and the order of the output can be very different from the order the input. There can be nonlocal transformations. For example, ANTLR can decide it needs a local variable or parameter deep within a rule.  Token reference x=T defines a label x that ANTLR would translate as something like:

_localctx.x = _input.LT(1);
match(T);

That code goes somewhere inside the rule function for the surrounding rule.  At the same time, ANTLR must inject a definition of x into the rule context object for the surrounding rule, which must be generated outside of the rule function. A single element, x=T, generates code in multiple locations. The order of computation of the output is different than the order of the input elements.

The point is that ANTLR does all this without gradually transforming a grammar AST into a Java AST. That would be extremely cumbersome from lots of personal experience. Instead, ANTLR walks the tree multiple times to extract information and also to annotate the nodes. Once all of the information it needs is available, it does another walk over the tree and builds up a model of the output. The output objects are built up as the information becomes available during the tree walk of the input tree. The result of model construction is a complete tree of output model objects, which must be then rendered to text via StringTemplate templates.   The name of the object model class, such as RuleFunction, corresponds to a template with the same name that knows how to render that model object to text. I built an automatic walker that walks the model and builds up the giant output template. The final step is to ask StringTemplate to render that template the text, effectively generating code. It works great!

That does not say there is no use for tree to tree transformations. ANTLR does so itself. I believe that the time to use such transformation is when the transformation results in a tree representing the same language, such as C to C or ANTLR to ANTLR. For example, it's extremely appropriate to normalize or optimize input by rewriting the tree. x+0 can be converted to just x, which means converting AST (+ x 0) to x or parse tree (expr x + 0) to (expr x). ANTLR transforms left recursive grammars to non-left recursive grammars using tree rewriting, but it does so manually at the moment. (boo)

ANTLR does not have tree transformation support at this time and it will not look like it did in v3 when it arrives. Many powerful rewriting systems let you perform rewrites in the concrete syntax of the language you are translating. E.g., see ASF+SDF by example. I imagine we'll do something very similar and allow a series of declarative rules like:

"<x:expr> + 0" -> "<x>"
"<x:expr> * 1" -> "<x>"
"<x:expr> * 0" -> "0"

Working with parse trees make such concrete transformation rules much easier to implement.

Sam Harwell and I are also talking about an xpath-like or other query mechanism to specify paths and nodes within parse trees. That way you could say: ``go get me all variable declarations'' and so on. There are lots of possibilities, given all of the research that has been done in this area.