Generating C with ANTLR 3

Author

Jim Idle - jimi|at|temporal-wave|dott|com www.linkedin.com/in/jimidle

Status

In sync with ANTLR 3 development; however, RewriteTokenStream is not yet implemented.

The C code generation templates and C runtime for ANTLR 3 are in sync with ANTLR release 3.1.

Finding Example Grammars

If you are looking for the example projects, visit the downloads section of the main ANTLR page. Halfway down the page is the examples tar/zip. Download and expand it; it contains a subdirectory 'C' with VS2005 projects (which are also easy to build manually on UNIX).

Background

The C runtime, and therefore the code generated to use it, reflects the object model of the Java version of the runtime as closely as a language without class structures and inheritance can. Compromises have been made only where performance would otherwise suffer, such as minimizing the number of pointer-to-pointer-to-pointer-to-function structures that could ensue from trying to model inheritance too exactly. Other changes include the use of token and string factories to minimize the number of calls to system functions such as calloc().

The generated code is free threading (provided that the system calls used on any particular platform are likewise free threading). While I am not certain that this is at all useful, it seems silly to write C code that is not free threading these days unless there is some overarching reason to avoid it.

Model

As there is no such thing as an object reference in C, I chose to create a number of typedef structs that reflect the calling interface chosen by Terence in the Java version of the same. The initialization of a parser, lexer, input stream or internal structure therefore consists of allocating the memory required for an instance of the typedef struct that represents the interface, initializing any counters, pointers and buffers etc, then populating a number of pointers to functions that implement the equivalent of the methods in the Java class.
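The pattern described above can be sketched in a few lines of C. This is only an illustration of the idea, not code from the runtime: the Counter type, its members and counterNew() are hypothetical names invented for this example.

```c
#include <stdlib.h>

/* Hypothetical "interface": instance state plus pointers to the functions
 * that play the role of methods. None of these names come from the runtime. */
typedef struct Counter_struct
{
    int  count;                                      /* instance state        */
    void (*increment)(struct Counter_struct *self);  /* "method"              */
    int  (*get)(struct Counter_struct *self);        /* "method"              */
    void (*release)(struct Counter_struct *self);    /* releases the instance */
} Counter, *pCounter;

static void counterIncrement(pCounter self) { self->count++; }
static int  counterGet(pCounter self)       { return self->count; }
static void counterRelease(pCounter self)   { free(self); }

/* "Constructor": allocate the struct, initialize its state, then install
 * the function pointers that implement the interface. */
pCounter counterNew(void)
{
    pCounter c = (pCounter)calloc(1, sizeof(Counter));

    if (c == NULL)
    {
        return NULL;
    }
    c->increment = counterIncrement;
    c->get       = counterGet;
    c->release   = counterRelease;
    return c;
}
```

A caller then writes c->increment(c) where Java would write c.increment(), passing the instance explicitly as the first argument, and ends with c->release(c).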

The use and initialization of the C versions of a parser is therefore similar to the examples given for Java, but with a bent towards C of course. You should also be aware of memory allocation and freeing operations in certain environments such as Windows, where you cannot allocate memory in one DLL and free it in another.

The runtime provides a number of structures and interfaces that I have found useful when writing action and processing code within Java parsers, and that were in any case required by the C runtime code if I was not to depart too far from the logical layout of the Java model. These include the C equivalents of String, List, Hashtable, Vector and Trie, implemented by pointers to structures. These are freely available for your own programming needs.

A goal of the generated code was to minimize the tracking, allocation and freeing of memory for reasons of both performance and reliability. In essence, any memory used by a lexer, parser or tree parser is automatically tracked and freed when the instance of it is released. There are therefore factory functions for tokens and so on, such that they can be allocated in blocks and parceled out as they are required. They are all then freed in one go, minimizing the risk of memory leaks. This has only one side effect: if you wish to preserve some structure generated by the lexer, parser or tree parser, then you must make a copy of it before freeing those structures, and track it yourself after that. In practice, it is easy enough not to release the ANTLR-generated components until you are finished with their results.
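To make the block allocation idea concrete, here is a minimal sketch of a factory that hands out tokens from preallocated blocks and frees everything in one call. It is an assumption-laden illustration, not the runtime's implementation: Token, TokenFactory, POOL_SIZE and the factory functions are all hypothetical names.

```c
#include <stdlib.h>

#define POOL_SIZE 4   /* tokens per block; deliberately tiny for illustration */

typedef struct Token_struct { int type; int index; } Token;

/* Hypothetical factory: owns every block it has ever allocated. */
typedef struct TokenFactory_struct
{
    Token **pools;      /* array of allocated blocks             */
    int     poolCount;  /* number of blocks allocated so far     */
    int     nextToken;  /* next unused slot in the current block */
} TokenFactory;

TokenFactory *factoryNew(void)
{
    TokenFactory *f = (TokenFactory *)calloc(1, sizeof(TokenFactory));

    if (f != NULL)
    {
        f->nextToken = POOL_SIZE;   /* forces a block allocation on first use */
    }
    return f;
}

/* Parcel out the next token, allocating a fresh block only when the
 * current one is exhausted. Earlier blocks never move, so pointers to
 * tokens already handed out stay valid. */
Token *factoryNewToken(TokenFactory *f)
{
    if (f->nextToken >= POOL_SIZE)
    {
        Token **grown;
        Token  *block;

        grown = (Token **)realloc(f->pools,
                                  (size_t)(f->poolCount + 1) * sizeof(Token *));
        if (grown == NULL)
        {
            return NULL;
        }
        f->pools = grown;

        block = (Token *)calloc(POOL_SIZE, sizeof(Token));
        if (block == NULL)
        {
            return NULL;
        }
        f->pools[f->poolCount++] = block;
        f->nextToken = 0;
    }
    return &f->pools[f->poolCount - 1][f->nextToken++];
}

/* Free every token ever handed out, in one go. */
void factoryClose(TokenFactory *f)
{
    int i;

    for (i = 0; i < f->poolCount; i++)
    {
        free(f->pools[i]);
    }
    free(f->pools);
    free(f);
}
```

In this sketch, any Token pointer obtained from the factory becomes invalid once factoryClose() is called, which is the side effect described above: copy whatever you need to keep before closing the factory.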

Target Platforms

I have constructed the C code such that it will compile on any reasonable ANSI C compiler in either 64 or 32 bit mode, with all warnings turned on. This is true of both the runtime code and the generated code, and has been tested with Visual Studio .NET (2003, 2005 and 2008) and later versions of gcc on Red Hat and Ubuntu Linux, as well as on AIX 5.2/5.3, Solaris 9/10, HP-UX 11.xx, OS X (PowerPC and Intel), Cygwin and MinGW.

The C runtime is constructed such that the library can be integrated as an archive library, a shared library or DLL, or by integrating the source code into your own project or source code set (though this is not recommended; stick to linking with the libraries).

The C runtime is link compatible with C++ and Objective-C. The generated code can also be compiled as C++ and you can embed C++ code in your grammars: use the relevant compiler flag or option to force the C to compile as C++. The generated code can also be compiled by the Objective-C compiler.

Performance

It is C (well written and documented, I hope to claim), and basic testing of performance against the Java runtime, using the JDK 1.6 source code as input and the Java parser provided in the examples (which is a tough test as it includes backtracking and memoization), shows that the C runtime uses about half the memory and runs at between 2 and 3 times the speed. Tests of non-backtracking, non-memoizing parsers indicate results significantly better than that.

Documentation etc

Documentation on using the C Target has now moved to the generated doxygen documentation, linked from the main ANTLR Home page and at: http://www.antlr.org/api/C/index.html