Generating C with ANTLR 3

Author

Jim Idle - jimi|at|temporal-wave|dott|com www.linkedin.com/in/jimidle

Status

In sync with ANTLR 3 development; however, RewriteTokenStream is not yet implemented.

The C code generation templates and C runtime for ANTLR 3 are in sync with ANTLR release 3.1.

Finding Example Grammars

If you are looking for the example projects, visit the downloads section of the main ANTLR page. Halfway down the page is the examples tar/zip. Download and expand it; it contains a subdirectory 'C' with VS2005 projects (which are also easy to build manually on UNIX).

Background

The C runtime, and therefore the code generated to use it, reflects the object model of the Java version of the runtime as closely as a language without class structures and inheritance can. Compromises have been made only where performance would otherwise suffer, such as minimizing the number of pointer to pointer to pointer to function type structures that could ensue from trying to model inheritance too exactly. Other changes include the use of token and string factories to minimize the number of calls to system functions such as calloc().

The generated code is free threading (provided the system calls used on any particular platform are likewise free threading). While I am not exactly certain that this is at all useful, it seems silly to write C code that is not free threading these days unless there is some overarching reason to avoid it.

Model

As there is no such thing as an object reference in C, I chose to create a number of typedef structs that reflect the calling interface chosen by Terence in the Java version of the same. The initialization of a parser, lexer, input stream or internal structure therefore consists of allocating the memory required for an instance of the typedef struct that represents the interface, initializing any counters, pointers and buffers etc, then populating a number of pointers to functions that implement the equivalent of the methods in the Java class.
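The pattern described above can be sketched roughly as follows. This is a minimal illustration only: the names SIMPLE_STREAM, simple_stream_new, lookahead, consume and release are hypothetical and are not taken from the ANTLR 3 C runtime, whose real interfaces are considerably larger.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical interface for illustration; NOT the real ANTLR 3 C runtime API. */
typedef struct simple_stream_struct SIMPLE_STREAM, *pSIMPLE_STREAM;

struct simple_stream_struct {
    /* Instance state: counters, pointers and buffers */
    const char *data;
    size_t      length;
    size_t      position;

    /* "Methods": pointers to the functions that implement the interface */
    int  (*lookahead)(pSIMPLE_STREAM self);
    void (*consume)(pSIMPLE_STREAM self);
    void (*release)(pSIMPLE_STREAM self);
};

/* Return the next character without consuming it, or -1 at end of input. */
static int simple_lookahead(pSIMPLE_STREAM self)
{
    if (self->position < self->length) {
        return (unsigned char)self->data[self->position];
    }
    return -1;
}

/* Advance by one character, never moving past the end of the input. */
static void simple_consume(pSIMPLE_STREAM self)
{
    if (self->position < self->length) {
        self->position++;
    }
}

/* Free the instance itself; the input buffer belongs to the caller. */
static void simple_release(pSIMPLE_STREAM self)
{
    free(self);
}

/* "Constructor": allocate, initialize state, then install the function pointers. */
pSIMPLE_STREAM simple_stream_new(const char *data, size_t length)
{
    pSIMPLE_STREAM s;

    assert(data != NULL);

    s = (pSIMPLE_STREAM)malloc(sizeof(*s));
    if (s == NULL) {
        return NULL;
    }

    s->data     = data;
    s->length   = length;
    s->position = 0;

    s->lookahead = simple_lookahead;
    s->consume   = simple_consume;
    s->release   = simple_release;

    return s;
}
```

A caller then writes s->consume(s), passing the instance explicitly where Java would supply this implicitly. Because every call goes through a pointer stored in the instance, an implementation can be swapped by installing a different function, which is about as close to method overriding as plain C gets.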

...

The runtime provides a number of structures and interfaces that I have found useful when writing action and processing code within Java parsers, and which the C runtime code itself required if I was not to depart too far from the logical layout of the Java model. These include the C equivalents of String, List, Hashtable, Vector and Trie, implemented by pointers to structures. These are freely available for your own programming needs.

A goal of the generated code was to minimize the tracking, allocation and freeing of memory for reasons of both performance and reliability. In essence any memory used by a lexer, parser or tree parser is automatically tracked and freed when the instance of it is released. There are therefore factory functions for tokens and so on, such that they can be allocated in blocks (the size of which is influenced by runtime parameters to indicate small, average or huge lexer/parser/tree parser) and parceled out as they are required. They are all then freed in one go, minimizing the risk of memory leaks. This has only one side effect, being that if you wish to preserve some structure generated by the lexer, parser or tree parser, then you must make a copy of it before freeing those structures, and track it yourself after that. In practice, it is easy enough not to release the ANTLR generated components until you are finished with their results.
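As a rough sketch of the block allocation idea, assuming a fixed block size rather than the runtime's size hints, a token factory might look like the following. All names here (TOKEN, TOKEN_FACTORY, factory_new_token, POOL_BLOCK_SIZE and so on) are hypothetical and are not the ANTLR 3 C runtime's own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical token layout for illustration only. */
typedef struct {
    int type;
    int start;
    int stop;
} TOKEN;

/* Assumed fixed block size; the real runtime is said to vary this by size hint. */
#define POOL_BLOCK_SIZE 256

typedef struct token_block {
    struct token_block *next;              /* previously allocated block */
    size_t              used;              /* tokens handed out from this block */
    TOKEN               tokens[POOL_BLOCK_SIZE];
} TOKEN_BLOCK;

typedef struct {
    TOKEN_BLOCK *blocks;                   /* most recently allocated block first */
    size_t       block_count;
} TOKEN_FACTORY;

void factory_init(TOKEN_FACTORY *f)
{
    f->blocks      = NULL;
    f->block_count = 0;
}

/* Hand out the next token, calling malloc() only when the current block is full. */
TOKEN *factory_new_token(TOKEN_FACTORY *f, int type, int start, int stop)
{
    TOKEN *t;

    assert(f != NULL);

    if (f->blocks == NULL || f->blocks->used == POOL_BLOCK_SIZE) {
        TOKEN_BLOCK *b = (TOKEN_BLOCK *)malloc(sizeof(*b));
        if (b == NULL) {
            return NULL;
        }
        b->next   = f->blocks;
        b->used   = 0;
        f->blocks = b;
        f->block_count++;
    }

    t = &f->blocks->tokens[f->blocks->used++];
    t->type  = type;
    t->start = start;
    t->stop  = stop;
    return t;
}

/* Free every token in one go; pointers handed out earlier become invalid. */
void factory_free(TOKEN_FACTORY *f)
{
    TOKEN_BLOCK *b = f->blocks;

    while (b != NULL) {
        TOKEN_BLOCK *next = b->next;
        free(b);
        b = next;
    }
    f->blocks      = NULL;
    f->block_count = 0;
}
```

With this layout, several hundred tokens cost only a handful of malloc() calls, and factory_free() releases them all together. It also shows the side effect mentioned above: any token you want to keep after factory_free() must be copied out first.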

Target Platforms

I have constructed the C code such that it will compile on any reasonable ANSI C compiler in either 64 or 32 bit mode, with all warnings turned on. This is true of both the runtime code and the generated code and has been summarily tested with Visual Studio .Net (2003, 2005 and 2008) and later versions of gcc on Redhat and Ubuntu Linux, as well as on AIX 5.2/5.3, Solaris 9/10, HPUX 11.xx, OSX (PowerPC and Intel), Cygwin and MinGW.

The C runtime is constructed such that the library can be integrated as an archive library, a shared library or DLL, or by integrating the source code into your own project or source code set (though this is not recommended; stick to linking with the libraries).

The C runtime is link compatible with C++ and Objective C. The generated code can also be compiled as C++, and you can embed C++ code in your grammars: use the relevant compiler option to force the C to compile as C++. The generated code can also be compiled by the Objective C compiler.

Performance

It is C (well written and documented, I hope to claim), and basic testing of performance against the Java runtime, using the JDK 1.6 source code and the Java parser provided in the examples (a tough test, as it includes backtracking and memoization), shows that the C runtime uses about half the memory and is between 2 and 3 times as fast. Tests of non-backtracking, non-memoizing parsers indicate results significantly better than that.

Documentation etc

Documentation on using the C target has now moved to the generated doxygen documentation, linked from the main ANTLR home page and available at: http://www.antlr.org/api/C/index.html
