Generating C with ANTLR 3

Author

Jim Idle

Status

Nearly in sync with ANTLR 3 Development

The C code generation templates and C runtime for ANTLR 3 are essentially complete, but require a little more testing and some re-jigging in the light of experience. I expect to be finished (by which I mean in sync with Ter's Java version of the runtime and code generation templates) by the end of May 2006, and if possible, much sooner than that.

Background

The C runtime, and therefore the code generated to use it, reflects the object model of the Java version of the runtime as closely as a language without class structures and inheritance can. Compromises have been made only where performance would otherwise suffer, such as minimizing the number of pointer to pointer to pointer to function type structures that could ensue from trying to model inheritance too exactly. Other changes include the use of token and string factories to minimize the number of calls to system allocation functions such as malloc().
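To illustrate the kind of compromise described above, here is a minimal sketch that contrasts a chained layout (each "class" reaches its parent's method through a super pointer) with a flattened layout (the function pointer is installed directly in the derived structure). All type and function names are hypothetical and are not taken from the runtime; the sketch only shows why the flattened form needs fewer dereferences per call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types: inheritance modelled too exactly, via super pointers. */
typedef struct chained_base {
    int (*getType)(void *self);
} CHAINED_BASE;

typedef struct chained_middle {
    CHAINED_BASE *super;
} CHAINED_MIDDLE;

typedef struct chained_derived {
    CHAINED_MIDDLE *super;
    int             type;
} CHAINED_DERIVED;

/* Hypothetical type: the method pointer lives directly in the structure. */
typedef struct flat_derived {
    int (*getType)(void *self);
    int   type;
} FLAT_DERIVED;

static int chainedGetType(void *self)
{
    assert(self != NULL);
    return ((CHAINED_DERIVED *)self)->type;
}

static int flatGetType(void *self)
{
    assert(self != NULL);
    return ((FLAT_DERIVED *)self)->type;
}

/* Wire up the chained layout: derived -> middle -> base -> function. */
void chainedInit(CHAINED_DERIVED *d, CHAINED_MIDDLE *m, CHAINED_BASE *b, int type)
{
    b->getType = chainedGetType;
    m->super   = b;
    d->super   = m;
    d->type    = type;
}

/* Wire up the flattened layout: derived -> function. */
void flatInit(FLAT_DERIVED *d, int type)
{
    d->getType = flatGetType;
    d->type    = type;
}

/* Three dereferences before the call is made. */
int chainedCall(CHAINED_DERIVED *d)
{
    return d->super->super->getType(d);
}

/* One dereference before the call is made. */
int flatCall(FLAT_DERIVED *d)
{
    return d->getType(d);
}
```

Both calls return the same value; the flattened version simply trades a little structure size for a shorter path to the function.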

The generated code is free threaded (subject to the system calls used on any particular platform being likewise free threaded), and while I am not exactly certain that this is at all useful, it seems silly these days to write C code that is not, unless there is some overarching reason to avoid it.

Model

As there is no such thing as an object reference in C, I chose to create a number of typedef structs that reflect the calling interface chosen by Terence in the Java version of the same. The initialization of a parser, lexer, input stream or internal structure therefore consists of allocating the memory required for an instance of the typedef struct that represents the interface, initializing any counters, pointers and buffers etc, then populating a number of pointers to functions that implement the equivalent of the methods in the Java class.

The use and initialization of the C version of a parser are therefore similar to the examples given for Java, but with a bent towards C of course. You should also be aware of memory allocation and freeing operations in certain environments such as Windows, where you cannot allocate memory in one DLL and free it in another.
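The model described above can be sketched roughly as follows. This is a minimal illustration and not the runtime's actual interface: EXAMPLE_STREAM, exampleStreamNew and the member names are hypothetical. It shows the three steps mentioned (allocate the structure, initialize its state, populate the function pointers), and it releases the instance through its own function pointer so that the memory is freed by the same module that allocated it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical interface structure; the real runtime's types differ. */
typedef struct example_stream_struct EXAMPLE_STREAM, *pEXAMPLE_STREAM;

struct example_stream_struct {
    /* Instance state: counters, pointers and buffers. */
    const char *data;
    size_t      length;
    size_t      position;

    /* "Methods": each takes the instance as its first argument. */
    int  (*lookAhead)(pEXAMPLE_STREAM self);  /* next character, or -1 at end */
    void (*consume)(pEXAMPLE_STREAM self);    /* advance by one character */
    void (*destroy)(pEXAMPLE_STREAM self);    /* release the instance */
};

static int exampleLookAhead(pEXAMPLE_STREAM self)
{
    assert(self != NULL);
    return self->position < self->length
        ? (int)(unsigned char)self->data[self->position]
        : -1;
}

static void exampleConsume(pEXAMPLE_STREAM self)
{
    assert(self != NULL);
    if (self->position < self->length) {
        self->position++;
    }
}

static void exampleDestroy(pEXAMPLE_STREAM self)
{
    /* Freed here, in the module that called malloc(). */
    free(self);
}

/* The "constructor": allocate, initialize state, install the methods. */
pEXAMPLE_STREAM exampleStreamNew(const char *data, size_t length)
{
    pEXAMPLE_STREAM stream = malloc(sizeof(*stream));
    if (stream == NULL) {
        return NULL;
    }
    stream->data      = data;   /* borrowed, not copied */
    stream->length    = length;
    stream->position  = 0;
    stream->lookAhead = exampleLookAhead;
    stream->consume   = exampleConsume;
    stream->destroy   = exampleDestroy;
    return stream;
}
```

A caller would then write stream->lookAhead(stream) where Java would write stream.lookAhead(), and finish with stream->destroy(stream) rather than calling free() itself.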

The runtime provides a number of structures and interfaces that I have found useful when writing action and processing code within Java parsers, and that were, furthermore, required by the C runtime code if I was not to depart too far from the logical layout of the Java model. These include the C equivalents of String, List, Hashtable, and Array, albeit in a more limited form, implemented by pointers to structures.

A goal of the generated code was to minimize the tracking, allocation and freeing of memory, for reasons of both performance and reliability. In essence, any memory used by a lexer, parser or tree parser is automatically tracked and is freed when that instance is released. There are therefore factory functions for tokens and so on, such that they can be allocated in blocks (the size of which is influenced by runtime parameters that indicate a small, average or huge lexer/parser/tree parser) and parceled out as they are required. They are all then freed in one go, minimizing the risk of memory corruption. This has one side effect: if you wish to preserve some structure generated by the lexer, parser or tree parser, then you must make a copy of it and track it yourself after that. In practice, action code usually generates a new structure that is useful outside the generated code, so this is not much of a problem.
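The block allocation scheme described above might look something like the following sketch. It is an assumption about the general technique, not the runtime's implementation: EXAMPLE_TOKEN, EXAMPLE_TOKEN_FACTORY and the function names are hypothetical, and poolSize stands in for whatever the small/average/huge runtime parameters would select.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical token; the real token structure carries much more. */
typedef struct example_token {
    int    type;
    size_t start;
    size_t stop;
} EXAMPLE_TOKEN;

/* Hypothetical factory that hands out tokens from fixed-size blocks. */
typedef struct example_token_factory {
    EXAMPLE_TOKEN **pools;       /* array of allocated blocks */
    size_t          poolCount;   /* number of blocks allocated so far */
    size_t          poolSize;    /* tokens per block */
    size_t          nextInPool;  /* next unused slot in the newest block */
} EXAMPLE_TOKEN_FACTORY;

EXAMPLE_TOKEN_FACTORY *exampleFactoryNew(size_t poolSize)
{
    EXAMPLE_TOKEN_FACTORY *factory;

    assert(poolSize > 0);
    factory = malloc(sizeof(*factory));
    if (factory == NULL) {
        return NULL;
    }
    factory->pools      = NULL;
    factory->poolCount  = 0;
    factory->poolSize   = poolSize;
    factory->nextInPool = poolSize;  /* forces a block on first request */
    return factory;
}

/* Parcel out one token, allocating a new block only when the last is full. */
EXAMPLE_TOKEN *exampleFactoryNewToken(EXAMPLE_TOKEN_FACTORY *factory)
{
    assert(factory != NULL);
    if (factory->nextInPool == factory->poolSize) {
        EXAMPLE_TOKEN  *block;
        EXAMPLE_TOKEN **grown = realloc(factory->pools,
                                        (factory->poolCount + 1) * sizeof(*grown));
        if (grown == NULL) {
            return NULL;
        }
        factory->pools = grown;
        block = calloc(factory->poolSize, sizeof(*block));
        if (block == NULL) {
            return NULL;
        }
        factory->pools[factory->poolCount++] = block;
        factory->nextInPool = 0;
    }
    return &factory->pools[factory->poolCount - 1][factory->nextInPool++];
}

/* Free every token in one go, along with the factory itself. */
void exampleFactoryFree(EXAMPLE_TOKEN_FACTORY *factory)
{
    size_t i;

    if (factory == NULL) {
        return;
    }
    for (i = 0; i < factory->poolCount; i++) {
        free(factory->pools[i]);
    }
    free(factory->pools);
    free(factory);
}
```

Only the array of block pointers is ever reallocated, so tokens already handed out stay where they are. Anything that must outlive the factory has to be copied first, for example with EXAMPLE_TOKEN kept = *token; before exampleFactoryFree() is called.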

Target Platforms

I have constructed the C code such that it will compile on any reasonable ANSI C compiler in either 64 or 32 bit mode, with all warnings turned on. This is true of both the runtime code and the generated code, and has been summarily tested with Visual Studio .NET (2003 and 2005) and later versions of gcc on Red Hat Linux.

The C runtime is constructed such that the library can be integrated as an archive library, a shared library or DLL, or by integrating the source code into your own project or source code set. As development progresses, it should be a matter of typing gmake or NMAKE and you will have everything you need on any particular platform. At least for a while, I will maintain binary versions on: Windows XP/2003, Linux in various guises, HP-UX, AIX, Solaris and OpenVMS. Others may also turn up over time. This is quite an effort though, so perhaps we can solicit volunteers to keep binary versions up to date on some platform that they have an interest in.

Performance

It is C (well written and documented, I hope to claim) and basically it kicks ass. When I have more time I will test it against the other targets, but in theory it should be faster than they are because of the low overhead of the language, and not because I claim it to be in any way superior to the code generated by other targets. I have some improvements to make to the performance of the ANTLR3_LIST hashing tables, but nothing drastic. After that I will need to profile the generated parsers to look for possible improvements.

Documentation etc

At this point, this is all I have to say about the C target, but I will update this page as and when the C target is complete and I wish to release it for testing alongside snapshots of the Java version of ANTLR 3.
