3. LL(*) Parsing (Excursion)

Excursion: LL(star) parsing

This short excursion into ANTLR's parsing strategy is optional. If you are not interested you can safely continue with the next section and still understand the rest of the tutorial. However, if you are interested, you can learn a little bit about parsing theory and the special strategy ANTLR 3 uses as opposed to ANTLR 2.x.

Look at these two rules for the start tag and the empty element. Both rules start with the same tokens, only the last token is different:

startTag  : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;

emptyElement : TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ;

When you tried to pass this grammar through earlier versions of ANTLR you would run into serious problems. The reason is that all parsers need to decide which rule they have to use to match an incoming token stream. Some parsing strategies solve this by delaying this decision until they have seen enough tokens. For this purpose tokens taken from the stream are often collected on a stack until there is enough information available to make the decision. In our example this would be until either the TAG_CLOSE or TAG_EMPTY_CLOSE is detected. Those strategies are called bottom up. Tools that use this strategy include famous yacc. Their drawback is a fair amount of complexity when it comes to debugging.

ANTLR 2.x can not do this as it uses a top down parsing strategy. Top down parsing strategies have to decide which rule to use before taking any tokens from the token stream. What ANTLR 2.x can do instead is to look ahead in the token stream to make its decision. However, no matter how far you look ahead there always is the horizon beyond which you can not see. Because of this top down parsers are thus usually more restrictive concerning the grammars they can handle. This is also true for ANTLR 2.x as the amount of look ahead is arbitrary, but fixed. Because the number of attributes in a tag is not fixed, ANTLR 2.x can not handle this grammar.

Now, this may sound a little bit complicated. But an example will help: This is a start tag

<tagname  attribute1="value1">

while this is an empty element

<tagname  attribute1="value1"/>

How many tokens would we need to look ahead to tell if this is a start tag or an empty element? Or, in other words, until we either see the > or the />? Let's count. Remember that whitespace has already been stripped by the lexer and this is how the attribute rule looks like:

attribute  : GENERIC_ID ATTR_EQ ATTR_VALUE ;

Tokens:

  1. < as text of token TAG_START_OPEN
  2. tagname as text of token GENERIC_ID
  3. attribute1 as text of token GENERIC_ID
  4. = as text of token ATTR_EQ
  5. "value1" as text of token ATTR_VALUE
  6. > as text of token TAG_CLOSE or /> as token TAG_EMPTY_CLOSE

OK, so we have six tokens as a required look ahead to decide which alternative to take. Setting the fixed look ahead (ANTLR and many others call it k) to 6 thus seems to be sufficient. However, it really is not, as this only works for this specific example. In general, there can be any number attributes and for any given k. I can give a tag that has too many attributes to be correctly predicted.

The prediction mechanism of ANTLR 3 is more powerful, because it can look ahead a random number of tokens that is not fixed at the beginning of a parse. Maybe you can image that ANTLR 3 has a scout sent out when decisions become tricky. He would go ahead until he has the significant information while ANTLR 2.x only had a telescope with a certain range.

This way ANTLR 3 can handle this grammar and is more tolerant to the way a grammar has to look like. Consequently, it allows for more natural looking grammars that are more intuitive to the programmer. What struck me most about ANTLR 3's new parsing strategy was that it appears to mimic what humans do. As a human we can easily decide the kind of the tag by looking at its end. We scan forward until we either see > or />. Once we made our decision based upon this information we return back to the start of the tag and collect all information inside of it. And that is what ANTLR 3 does. Both the way humans behave as well as ANTLR's parsing strategy come at a certain price, though. Firstly - and only theoretically - the parsing time may no longer be linear to the input size. Secondly, the forward scan can only handle a certain amount of structural complexity. It fails to properly detect too involved structures. This again is true both for humans as for ANTLR 3. However, when we encounter such complexities the language to be parsed may not be all too sensible.

My siblings (including me):