Parsing XML

Parsing XML

This is a tutorial about parsing XML using ANTLR 3. If you do not know what ANTLR 3 is, have a look at this introduction. XML has been chosen as an example as it should be known to wide range of developers. Additionally, it isn't all that easy to parse, but then also not too difficult. 

Caution: This is an example given for educational reasons. In the magnitude of all cases, you'll be off far better by using an XML parser such as Xerces, expat etc. or an XML binding tool such as XML beans. If you are not sure whether you should write a parser of your own, have a look at section 5.

Parsing XML using traditional parsing techniques is tricky, most of all because XML is a sick language. Depending on the context different characters are either keywords or simple text. Given an XML fragment like:

<tag attribute="value">

The '=' has a special meaning as well as the '>' while this is of course not the case in plain text. Additionally, newlines and other white spaces inside tags have to be ignored, but are part of the text outside of them.

For further investigations we restrict ourselves to a simple subset of XML. It will have the same complexity as XML seen from the perspective of parsing, but will make the examples much simpler. It is restricted to tags and parsable text (PCDATA) - no XML declaration, not doctype, no DTD or XSD, no processing instructions, no CDATA sections, no namespaces. We require that the reader has at least basic knowledge of XML. For a short introduction, you should have a look at this XML in 10 points article. And here is something more complete: W3C XML Recommendation.

 Note: Full sources are available as an attachment to this page.

Overview 

This tutorial is devided into multiple parts. Each part explains about the tasks of one of ANTLR'3 parser types. There is an optional excursion that tries to shed some light onto how ANTLR 3's parsing strategy works. If you are not interested in theory you can safely omit it.

Parser Type Overview 

Type

Task / Function

Input Structure

Output Structure

Lexer

  • condense multiple characters into tokens
  • check input for lexcial correctness
  • strip off unwanted white space

Character Stream (one dimension)

Token Stream (one dimension)

Parser

  • mimic inherent token structure in rule structure
  • check input for syntactical (structural) correctness
  • optionally execute actions
  • optionally generate an AST (stripping off unwanted tokens / adding condensing tokens)

Token Stream (one dimension)

optional AST (two dimensions)

Tree Parser

  • traverse an AST
  • optionally execute actions
  • optionally generate an AST (stripping off unwanted tokens / adding condensing tokens)

AST (two dimensions)

optional AST (two dimensions)

Sections