FAQ - C Target

How can ANTLR parser actions know the file names and line numbers from a C preprocessed file?



One way is to follow this path:

  • Derive your own token from CommonToken and add a file number field to it;
  • Get the lexer to produce those tokens and the parser to accept them;
  • Build a file table in the lexer and refer to it in error messages;
  • Keep track of the current file in the lexer in case you need error messages from the lexer;
  • Keep track of the current line number in the inferred file by setting it in your PPLINE rule, then incrementing it in your NEWLINE rule;
  • Set the line and file number at the end of each rule (or override the nextToken machinery to set this automatically).

Start down that path and you will see the best way for your requirements.

In the C target there are user fields for storing such additional information, so you can achieve the same effect as deriving a token.


What ASTLabelType should be specified for a tree grammar in the C target?



Use the following options:

	options {
	    language     = C;
	    ASTLabelType = pANTLR3_BASE_TREE;
	}

How to skip a subtree with the C target of ANTLR v3.4?


Requirement: implement an if/then/else interpreter. Parse only either the "then" or the "else" statement, skipping the other one, and go to the end of the if statement after having handled it.


Sample grammar snippet, using a custom function that walks the subtree counting the UP and DOWN tokens and ignoring all other tokens:

// ^(IF expression statement (ELSE statement)?)
    @declarations { ANTLR3_MARKER thenIdx = 0, elseIdx = 0; }
  : ^(IF expression
        {
            thenIdx = INDEX();
            ignoreSubTree(ctx);         // skip over the "then" statement
            if (LA(1) == ELSE) {
                MATCHT(ELSE, NULL);
                elseIdx = INDEX();
                ignoreSubTree(ctx);     // skip over the "else" statement
            }
        }
     )
    { /* My code that rewinds to either the then or else block */ }
  ;

and the ignoreSubTree() function is implemented as follows:

static void ignoreSubTree(psshell_tree ctx)
{
        ebd_sint32_t nestedS32 = 0;

        do {
            if (HASEXCEPTION()) {
                return;
            }
            switch (LA(1)) {
                case DOWN:
                    nestedS32++;
                    MATCHANYT();
                    break;
                case UP:
                    nestedS32--;
                    if (nestedS32 > 0) {
                        MATCHANYT();
                    }
                    break;
                default:
                    MATCHANYT();
                    break;
            }
        } while (nestedS32 > 0);
        MATCHANYT(); // Eat last UP token
}

How to get rid of a bug in AST walking while implementing control flow in pre-3.2.1 ANTLR?



In an AST walker grammar following this example : http://www.antlr.org/wiki/display/CS652/Tree-based+interpreters

    :   ^('if' c=expr s=. e=.?) // ^('if' expr stat stat?)

In the generated code, the variables e and s are not defined (yet they are assigned values) and the compiler complains about that; but when they are defined by hand (under the definition of variable c), the code compiles fine.


Until the bug is fixed (in releases >= 3.2.1), follow these guidelines:

1. Use something like:

    :   ^('if' c=expr s=. e=.)

Alternatively, you would not have the s=. and e=. at all, just action code to consume what you need:

// do whatever with return values or custom nodes in e & s

2. For index and related properties, use the method calls rather than the fields directly; if you know you will never override the structure types, then you only need to worry if the names of the fields change.

You probably need:


3. Nodes are valid between tree walks and rewrites so long as you do not free the node streams until you are completely done. You can dup a node outside the factory and then it will persist, but you need to free the memory.

Create your own 'class'/structure and populate it from the information in the node. You will probably need to reset the node stream etc too.

How to navigate the AST built using the ANTLR3 C target (how to access token information stored in the tree nodes: ANTLR3_BASE_TREE or ANTLR3_COMMON_TREE)?



Use the super pointer in the ANTLR3_BASE_TREE and cast it to pANTLR3_COMMON_TREE. getToken returns the BASE_TOKEN, which has a super pointer that is cast to COMMON_TOKEN.

Check: makeDot in basetreeadaptor for some examples. Check that function in the source, but this time do not use generating options. Get the function to work for you.

nodes = antlr3CommonTreeNodeStreamNewTree(psrReturn.tree, ANTLR3_SIZE_HINT);
dotSpec = nodes->adaptor->makeDot(nodes->adaptor, psrReturn.tree);

Also, look at the code for the adaptor.

You can only get tokens for nodes that have them of course.

How to create my own custom AST instead of the pANTLR3_BASE_TREE provided in the C runtime?


(I've been looking at antlr3basetree.h. Is the idea that we use this base tree, but set the "u" pointer to our own data?)


That's by far the easiest way, which is why I did that. Mostly, that's all people need. However, you can also write your own tree adaptor that just passes around (void *). Not a task for the inexperienced C programmer though. But if you copy the adaptor code and fill in your own functions, that will work. Much easier to just add your nodes to the u pointer though and let the current stuff do all the work (smile)

I'd like to use my own custom AST and write my own tree adapter. One thing that I haven't been able to figure out is how to substitute my adapter for the default adapter when the parser code is generated.

The adapter is declared in the parser header file as:

	pANTLR3_BASE_TREE_ADAPTOR	adaptor;

and then used with the macro definition:

#define ADAPTOR ctx->adaptor

What I was wondering is, is there an option that I can specify to have antlr generate a parser that refers to my adapter?


and the associated functions like:



Once you have created the parser, just install a pointer to your own adaptor. You can also do this in @apifuncs I think. As in:

ADAPTOR = pointerToMyAdaptor;

Your adaptor structure needs to be the same structure that the common adaptor uses and provide the same function set.

To be honest, even in Java I have found it easier just to call my own functions directly from actions (i.e. to use actions instead of the rewrite rules) and not try to create an adaptor, though this is essentially because of the type of tree that is needed in the case of errors and so on. This does end up tying your grammar to a specific generated code target.

The adaptor will work, but will possibly be more work. I am not sure whether you might have to be tricky with the rewrite stream functions.

Code Generation

How to generate C and Java from a single ANTLR grammar file that contains actions; say the grammar looks something like this?


selectStatement[int initRule]
@init 	{if(initRule) sse.pushCall(sse.SELECTSTAT);}	// Java action
@init 	{if(initRule) sse->pushCall(sse->SELECTSTAT);}	// C action
	: q = queryExpression[true]
	;


Using a preprocessor for the ANTLR grammar avoids having to maintain two versions of it.
The following preprocessor for ANTLR grammar files was contributed by Andy Grove of CodeFutures Corporation:

# Preprocessor for ANTLR grammar files with multiple language targets
# Written by Andy Grove on 23-Jan-2009

def preprocess(filename, userTarget)
   f = File.open(filename)
   currentTarget = "*"
   f.each_line {|line|
     if line[0,7] == '//ifdef'
       currentTarget = line[7,line.length].strip
     elsif line[0,9] == '//elifdef'
       currentTarget = line[9,line.length].strip
     elsif line[0,7] == '//endif'
       currentTarget = "*"
     elsif currentTarget=="*" || currentTarget==userTarget
       puts line
     end
   }
end

if ARGV.length < 2
   puts "Usage: preprocess filename target"
else
   preprocess(ARGV[0], ARGV[1])
end

How to compile g3pl file to generate C code for 64 bit target?



  1. Rename it to a .g file; the extension name was only needed because older versions of Visual Studio used it to see what the output files were, but that isn't relevant any more.
  2. The generated C is the same on all platforms and there is nothing special to do to generate 64 bit, 32 bit, Linux, Win32, Solaris etc. In fact it is designed so that you can generate the C on one platform and compile it on any.

So, just like any other file, you run the ANTLR tool on it and it will give you a .c and .h file. The generated files are both 32 and 64 bit compatible, but if you read the docs for the C runtime it will point you at ./configure --help, where you will see a flag to build the libraries in 64-bit mode: set the --enable-64bit flag when you build the runtime.

How to make the generated *.c-file to be named *.cpp?



  1. Change the CTarget.java and rebuild the tool.
  2. But it is much easier to just add the "Compile as C++" flag to the compiler:
    1. MS: /TP
    2. gcc: -x c++
  3. Or you could trivially add a makefile rule to rename them after the antlr tool is run.

How to get rid of type conversion errors of form shown below while compiling generated code using ANTLR C runtime (version 3.1.1)?

MyLexer.c:1634: error: invalid conversion from 'int' to 'const ANTLR3_INT32*'



  • If trying to compile as C++ code, make sure the runtime headers are on the include path, e.g. -I/usr/local/include (or wherever ANTLR was installed when you ran 'make install')
  • Redefine some of your token names so that they do not clash with standard C/C++ macros. Specifically, rename your NULL and DELETE tokens to KNULL and KDELETE. Fix these and try simpler compiles.

As a design pattern, though, create a helper class and just call that; keep the code you embed in actions to a minimum for maintenance reasons. Try a simpler compile first, just as C:
gcc -I /usr/local/include -c src/sqlparser/DbsMySQL_CPPLexer.c

Then the same with g++

g++ -I /usr/local/include -c src/sqlparser/DbsMySQL_CPPLexer.c

If you turn on all the possible C++ warnings and errors, then it is likely you will get a lot, as the code is meant to compile with -Wall under gcc, but I make no claims for g++ (wink). You can compile your own C++ with all the C++ warnings, of course.

How to get rid of error in generated C code where a struct is referenced without being initialized?


sample with the problem:


e = expression 
 ( a1 = alias1 )? 
 { sse.addSelectItem($e.text, $a1.text); }

In generated parser, a variable "a1" is declared thus:

 DbsMySQL_CPPParser_alias1_return a1;

Under certain conditions the variable gets initialized by calling a method:


However, when parsing input, the variable is not getting initialized but it is getting referenced in the following expression, causing a seg fault:

(a1.start != NULL ? STRSTREAM->toStringTT(STRSTREAM, a1.start, a1.stop) :

Because a1 is never initialized, a1.start holds a garbage (non-NULL) value. If a1.start = NULL is assigned just after a1 is declared, the seg fault goes away. How should this be fixed properly?



| e = expression
        ( a1 = alias1
             { sse.addSelectItem($e.text, $a1.text); }
        |    { sse.addSelectItem($e.text, NULL); }
        )


How to do custom code in parser which needs to access the data structure written in C++ (basically class)?


say for eg.

s  	→ CHAR '=' e
e  	→ t y 
y 	→ '+' t y 
t 	→ p x
x 	→ '*' t 
p	→ '('e')' 

where CHAR and NUMBER are tokens.

LETTER 	: ('a'..'z' | 'A'..'Z') ;

DIGIT	    : '0'..'9' ;

NUMBER	    : (DIGIT)+ '.' (DIGIT)+ | (DIGIT)+ ;

Let say the input is i=5

Now in
s → CHAR '=' e, when this gets executed for i, depending upon the right-hand side I need to declare 'i' as an int. I will get the datatype by calling some member function of some class. Is it possible to do this in ANTLR?

J.R Karthikeyan

Download the example tar from the download page. It shows you how to do this.


Generated C code using ANTLR: Got this error in VS 2005 :

Error 1 fatal error
C1010: unexpected end of file while looking for precompiled header. Did you forget to add '#include "stdafx.h"' to your source? c:\documents and settings\kjambura\my documents\visual studio 2005\projects\wrapper\wrapper\checkforcompileparser.c 468


Problem was fixed by:

The compiler errors come in VS 2005 because the default options applied to a VS C++ project include precompiled header support.

The easiest way around this is to tell VS that your ANTLR-generated files don't use the precompiled headers. To do that, select the ANTLR source files in the Solution Explorer, then right-click and select Properties. After that, select the C++->Precompiled Headers property page and then in the 'Create/Use Precompiled Header' property, select the option that says something like "Not using precompiled headers".

But Visual Studio 2005 is no longer directly supported.

The clean solution is: Visual Studio 2008 is available in a free version (and probably 2010), so you should really use that. The issue is that the VS2005 compiler does not support a few ANSI constructs used in the runtime. This means that you must compile the runtime in 2008 and just link with it in your 2005 project. But unless you configure the include paths for your project and so on, it will not compile anyway. That's why you should download VS2008 and the example projects, then use the example projects to show you how to configure your own project.

Why does C target code generation of ANTLR 3.4 not set all the rule variables to NULL?



It is by design, to prevent trying to assign NULL to things that are not pointers.

Why am I getting error "set complement is empty"?


The following grammar does not generate target code. It says "error(139): skipper.g: set complement is empty".

grammar skipper;

options {
     language = C;
}

     @init {
	  int braceCount = 1;
     }
     : ( '('
	  { braceCount ++; }
     | ')'
	  { braceCount --;
	    if (braceCount == 0) {
		 LTOKEN = EOF_TOKEN;
	    }
	  }
     | ~('('|')')
     ) *
     ;

What's wrong with it?

Anton Bychkov

Answer 1 Thread:

Try adding the lexer rules:

LParen : '(';
RParen : ')';

ANTLRWorks reports the same error on line 28, which is:

	| ~(LParen|RParen)

There is also a strange thing in the rule view; it looks like ANTLR does not see LParen and RParen in the twiddle operator.


There are no other tokens than '(' and ')' defined, so ~(LParen|RParen) is wrong. Try adding a "fall through" DOT rule to your lexer grammar:

       @init {
               int braceCount = 1;
       }
       : ( LParen
               { braceCount ++; }
       | RParen
               { braceCount --;
                 if (braceCount == 0) {
                        LTOKEN = EOF_TOKEN;
                 }
               }
       | Other
       ) *
       ;

LParen : '(' ;
RParen : ')' ;
Other  :  .  ;

Or like this:

LParen : '(';
RParen : ')';
Other  : ~(LParen | RParen);

Answer 2: 

You cannot use set complements in parser rules. That is for lexer rules only. In the next release, ANTLR will tell you about this. But don't use 'literals' while you are learning as it is too easy to get confused as to what they mean in terms of lexer vs parser.

Debugging and Error Checking

How to debug grammars for the C-Runtime using ANTLRWorks?



  • Generate with the -debug option
  • Compile as normal
  • When you run the parser, it will appear to hang
  • Use ANTLRWorks to load the grammar file
  • 'Debug remote' to localhost

Debugging parser written for C target:



Since I cannot debug it in ANTLRWorks directly, I wrote a small test app that includes the parser and parses one line:


The parser and lexer have been compiled with the "-debug" option. I can connect to the port in ANTLRWorks, but as soon as I click "step forward" in the debugger, my program finishes and the debugger doesn't display anything.

This is the console output I get:

unknown debug event: location 796 5
unknown debug event: enterRule PLSQL3c.g select_expression
unknown debug event: exitSubRule 153
unknown debug event: location 646 5
unknown debug event: enterRule PLSQL3c.g select_statement
unknown debug event: location 640 5
unknown debug event: enterRule PLSQL3c.g select_command
unknown debug event: location 583 5
unknown debug event: enterRule PLSQL3c.g to_modify_data
unknown debug event: location 574 5
unknown debug event: enterRule PLSQL3c.g sql_command
unknown debug event: location 569 5
unknown debug event: enterRule PLSQL3c.g sql_statement
unknown debug event: location 147 5
unknown debug event: enterRule PLSQL3c.g statement
unknown debug event: location 126 14
unknown debug event: enterSubRule 20
unknown debug event: enterDecision 20
unknown debug event: LT 1 9 4 0 1 22 ";
unknown debug event: LT 2 0 -1 0 1 -1 "<EOF>
unknown debug event: exitDecision 20
unknown debug event: exitSubRule 20
unknown debug event: location 126 32
unknown debug event: enterSubRule 21
unknown debug event: enterDecision 21
unknown debug event: LT 1 9 4 0 1 22 ";
unknown debug event: exitDecision 21
unknown debug event: enterAlt 1
unknown debug event: location 126 32
unknown debug event: LT 1 9 4 0 1 22 ";
unknown debug event: consumeToken 9 4 0 1 22 ";
unknown debug event: exitSubRule 21
unknown debug event: location 126 38
unknown debug event: LT 1 0 -1 0 1 -1 "<EOF>
unknown debug event: consumeToken 0 -1 0 1 -1 "<EOF>
unknown debug event: location 127 2
unknown debug event: enterRule PLSQL3c.g parse_statements
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at org.antlr.runtime.debug.RemoteDebugEventSocketListener.eventHandler(R
at org.antlr.runtime.debug.RemoteDebugEventSocketListener.run(RemoteDebu
at java.lang.Thread.run(Thread.java:619)

Andi Clemens


You are better off using the native debugger at that release point. I think ANTLRWorks and the C runtime got out of sync. The current development branch is working if you want to use that, but of course it isn't tested very much.

How to handle HASEXCEPTION / HASFAILURE in following case:


Because the C version of ANTLR does not support @after actions, I have implemented the equivalent in my grammar by placing a block of code at the end of my rule. For example:

@init { sse.pushCall(sse.WHERECLAUSE); }
	: c = searchCondition
	  { sse.popCall(); }   // <-- @after action
	;

I noticed that this @after action was not running for some of my input data, so I added debug logging to the generated parser code and found that one of the calls to the HASEXCEPTION macro is returning true and the parser is finishing at that point.

How am I supposed to detect if an exception like this has occurred and how do I find out what is causing the exception?


The calling rule will make a call to the error message routines so check your stack there. But really you should parse, build a tree then do this in the tree walk and you won't have the issue. Read the API docs for C and displayRecognitionError.

How to write a parser that checks for correct syntax but also parses erroneous input, to support autocompletion in another program?



You have to be careful how you implement your grammar rules such that you can recover sensibly from errors. Generally you build a tree or partial tree then analyze that. You may also need to specifically code for some potential missing elements, but again you have to be careful not to introduce ambiguities that break the normal grammar.

For hints on how to code rules that recover well from errors (especially in loops), see:


Why can't the stream name be printed out when an error occurs (ANTLR C)?


I set the fileName on the ANTLR3_INPUT_STREAM, and the file name can be printed out when an error occurs. But when the tree parser is involved, the name can't always be printed (it sometimes prints unknown-source instead).

Mu Qiao

There can be multiple streams, and you should really be using the file name in the file stream, which you get from the token. Originally, we could not get back to the file stream from the token; now we can. Hence the example error code uses what was available when I wrote it, and also avoids (at the time) dealing with string streams that might not have a name (though now they always do, of course).

How to debug crash in dupNode in antlr3commontree.c in libantlr3c 3.1.3.


(dupNode is being called on a common tree node which has a null factory.

Edited stack trace from GDB:

dupNode (tree=0x97debe0)
at antlr3commontree.c:402
getMissingSymbol (recognizer=0x97ec8f8, istream=0x97ea300, e=0x97ecba8,
expectedTokenType=23, follow=0x81beca0)
at antlr3treeparser.c:227
recoverFromMismatchedToken (recognizer=0x97ec8f8, ttype=23,
at antlr3baserecognizer.c:1530
match (recognizer=0x97ec8f8, ttype=23, follow=0x81beca0)
at antlr3baserecognizer.c:478

I believe the node it's trying to duplicate is one of stream->UP/DOWN/EOF_NODE/INVALID_NODE initialised in antlr3CommonTreeNodeStreamNew (in antlr3commontreenodestream.c). It looks like those nodes never have their factory set.)


This is in the tree parser's error message routines. That means your tree parser is incorrect, or at least the input it is receiving has a mismatch. Unfortunately, I think the default error message routine doesn't handle duping the node for error display very well when it tries to duplicate from LT(-1) and LT(-1) is an UP/DOWN token; this is of course a slight bug. However, you are really expected to install your own error message display routines, so your fix for the moment is: make a copy of the tree parser's displayRecognitionError, have it check that the token being duped to create the missing symbol has a tree factory and, if it does not, use LT(-2) etc. (checking for start of stream, of course), and install that routine before invoking the tree parser.

Now, you need to debug your AST and the walker of course. The best way to do that is to produce a .png graphic of the tree that you have. From a prior post:

(First install graphviz from your distro or www.graphviz.org)

To use it from C, you just do this:

 pANTLR3_STRING dotSpec;
 dotSpec = nodes->adaptor->makeDot(nodes->adaptor, psrReturn.tree);

Where nodes is the pANTLR3_COMMON_TREE_NODE_STREAM and psrReturn is the return type of the rule you invoke on your parser.

You can then fwrite the spec to a text file:

 dotSpec = dotSpec->toUTF8(dotSpec); // Only needed if your input was not 8 bit characters
 fwrite((const void *) dotSpec->chars, 1, (size_t) dotSpec->len, dotFILE);

Then turn this into a neat png graphic like this:

sprintf(command, "dot -Tpng -o%s.png %s.dot", dotFname, dotFname);

You can then use this png to debug your AST and when it looks correct, make sure that your tree parser is expecting that input. One of the two is incorrect in your case and obviously tree parsers should not encounter incorrect input.

How to ignore parsing errors in ANTLR for C target?


(I just want to parse known statements with my grammar; all unknown statements (parsing errors) should be ignored.

Can I tell ANTLR (for the C target) to ignore those error messages and just return FALSE or something like that, so that I can decide whether to take an appropriate action?)


If you are getting errors, it is because your grammar is incorrect, e.g. your token in the parser (which you should move to the lexer anyway, rather than using 'literals' in your parser code) is CREATEE but your input is create. Did you tell the runtime to be case insensitive?

Read the API or use antlr.markmail.org to see how to override displayRecognitionError(). You cannot just ignore errors though because somehow you have to recover. You could just make them silent and when the parser returns if the error count is >0 then ignore that source or something.

( What is the difference if I add "CREATE" or similar to the lexer? Is it more reliable in detecting the right tokens? )

When putting things in the parser, you do not have enough control over the tokens, both in terms of what they are named at code generation time (hence error messages are difficult, and producing a tree parser is difficult), and you cannot see the potential ambiguities in your lexer. It just makes things more difficult for no (IMO) advantage.

If you have told the input stream to be case insensitive, then I am afraid the problem is going to be with your grammar. You will have to single-step through the code to find out why.

( Would it be better to have tokens like "CREATE USER" and "CREATE TABLE" in the lexer or doesn't this work anyway because of the whitespace?)

No - don’t make whitespace significant unless the language absolutely makes you do so.

What you have to do is left factor:

create_statement
     : CREATE
       ( cr_table
       | cr_user
       | cr_trigger
       )
     ;

cr_table
     : TABLE .....

I know how to install my own custom displayRecognitionError handler, but how do I remember the errors so that they can be printed out later?


(Sure, I could put them in a global variable, but I'd like for multiple instances of my parser to be able to be called concurrently. Is there any way to add a data structure (like a list<string>) to the ANTLR3_BASE_RECOGNIZER class?)


Yes, you can indeed do this (this comes from using the code myself - I run into the same things (smile)), but you do not add it to the base recognizer itself. For some reason the doxygen-generated docs are not including the doc pages about this; I will have to find out why.

I generally use an ANTLR3_VECTOR (because then it is ordered), but tend to collect all such things (counters, parameters etc) in a master structure like this:

        pMY_PARSE_SESSION              ps;     // MY compiler session context for parser

Anything in the @lexer/@parser ::context section is added verbatim into the context struct (referenced via ctx), and you can initialize it in @apifuncs, or externally.

The base recognizer has a void * pointer called super, which will point to the parser instance (you can see that the default error display routines pick this up). However, ANTLR3_PARSER also has this field, but it is not initialized by default, because I cannot know that the instance of your generated parser is what you want it to point at (I suppose I could assume this and let you override it, but it is probably better to explicitly initialize it for future reference).

So, in your apifuncs section:

@parser::apifuncs {

  ctx->ps = myNewContextGrooveThang;

  PARSER->super = (void *)ctx;
}

Now, in your display routine, you will get the parser pointer from base recognizer, get the ctx pointer from super, and your custom, thread safe collection will be in your master structure. A few pointer chases, but this provides maximum flexibility:

displayRecognitionError        (pANTLR3_BASE_RECOGNIZER recognizer, 
pANTLR3_UINT8 * tokenNames)
{
   pANTLR3_PARSER            parser;
   pmyParser                 generated;
   pMY_PARSE_SESSION         ps;

   parser        = (pANTLR3_PARSER) (recognizer->super);
   generated     = (pmyParser) (parser->super);
   ps            = generated->ps;

   // Bob's your uncle.....
}

I know how to install my own custom displayRecognitionError handler, but for reporting the names of tokens is there a way that I can show the matching string for simple tokens rather than the name, (possibly limited to just those declared with "tokens {...}")?



  • It is the same as the other targets, in that you need to create a local function that returns/displays/adds to the message, the name you want to use for error display. It is just a switch statement on the token type basically, or you could create a local map and initialize it the first time it is required. It is just a bit of slog really. Java provides a method to override to do this, but in C, you just call your own local function.
  • I think the information I need is in the <Name>.tokens file, though, so possibly I could generate some code from that.

How to remove errors generated by using doxygen and the code generated by ANTLR using the C target?



Well to be honest I started adding doxygen to the generated code but after looking at what you get from it I decided that it wasn't really of much help. In the next version of ANTLR doc comments of rules will be passed through to code gen and that should help.

To change the template you need to either rebuild ANTLR or set your class path up so that it finds your version of C.stg before mine. I suspect though that what you will get is not really that useful. Better to document the grammar than the generated code.

Why does the input string "abc" on the following grammar generate a NoViableAltException, yet using the C runtime to parse 'root' it passes successfully?


 grammar schema;

 options {
      language = C;
 }

 root : letter* ;
 letter : A | B ;
 other : C;

 A   :    'a';
 B   :    'b';
 C   :    'c';


root : letter* EOF ;

There are no exceptions in C, so the top rule can only set error flags; forcing a match of EOF makes the mismatch surface.


What would I want to look at if I wanted to deduce what portion of the input data the parser had consumed?


Well, the EOF forces the parse to progress to the EOF 'token', so the parser will look at everything, perform recovery resync, and go on until it sees the EOF, when it will finally breathe a sigh of relief (wink). You override the error display (or a function earlier in that stack) and collect the errors in a form that you can use after parsing. Generally, you don't print errors as they occur but collect them in a log/buffer/collection/kitchen sink, and then decide where they should go after the parse is complete (i.e. send to IDE, print to screen, email them and so on).

Understanding displayRecognitionError


Question 1. The following is what I found while assigning my own error processing function to the recognizer's reportError function pointer:

For my grammar definition like:

Query : Rule1 Rule2? Rule3? ';' EOF ;

And for my test case that has malformed syntax, which matches Rule1 but where the rest matches neither Rule2 nor Rule3: through debugging, I find that the parser then tries to match ';'; of course that doesn't match either, so it tries to report the error and recover.

In ANTLR's error reporting (I copied the content of displayRecognitionError() into my own error processing function), the "expecting" variable should point to the position of ';' as that's what it tries to match. But it shows that expecting is a very large number.

Xie, Linlin

The routines in the runtime are just examples; you are expected to make a copy of them and install your own routines. Note that the expecting member of the exception is not always a valid token number. For instance, you have to check it for being -1, which means it was expecting EOF - which is probably what it is expecting in your grammar above, right?

if    (ex->expecting == ANTLR3_TOKEN_EOF)

Also, for some exception types, expecting is not a valid field. The default routines should show you that, as not all exceptions access expecting. Finally, for some exceptions you don't get a token number but a bitset, from which you must extract the set of tokens that could have matched. This is a possibility here because your possibilities are start(Rule2), start(Rule3) and ';'. In the code that processes ANTLR3_MISMATCHED_SET_EXCEPTION, you can see how to deal with the bitset.

So, if your expecting code is not -1, then it is a bitmap in all likelihood.

Question 2. I find that the exception variable has a nextexception pointer which points to another exception variable.


The nextException isn't used by the default error reporting, I just included it in case anyone thought it useful.

Question 3: I would think start(Rule2), start(Rule3) and ; all should be the expected tokens, instead of EOF. Do you think if there is anything antlr can do to improve the error messages to make them more relevant? Or should I improve my grammar to get more appropriate error messages, and how?


You have to write your own message display routines that make sense with your grammar. The default ones do check for EOF though. Your issue is that because all the things leading up to EOF are optional, ANTLR assumes that they are just not present:

Say start(rule2) is FOO and start(rule3) is BAR.

Then after rule1 it says:

No FOO is there, so go past Rule2, it isn't present
No BAR is there so go past Rule3, it isn't present

Now, what is the start set that can come next? Only EOF, so match EOF - oh it failed, so the expecting token is -1 for EOF.

However, if you do this:

: rule1
   ( rule2
        ( rule3 EOF
        | EOF
        )
   | rule3 EOF
   | EOF
   )

Now, after rule1 has parsed, the follow set will be FOO | BAR | EOF, so you will get the error straight away. After rule2 is parsed, the follow set will be BAR | EOF, so you will get the error straight away; after rule3, only EOF is viable.

Also, I can see that when displayRecognitionError() checks the recognizer type, it only considers either parser or tree parser; why is the lexer not considered here?

  1. Lexers can only say: "Not expecting character 'y' here," and so antlr3lexer.c has its own handler. You should install your own handler, remember?
  2. If your lexer is throwing errors, then it is really broken. It should be coded to cope with anything one way or another. However, sometimes that is difficult of course. You need to make sure that your lexer rules can terminate just about anywhere, but throw your own (descriptive) error about any missing pieces. Then you have a final lexer rule:


ANY : . { SKIP(); /* log error about unknown character being ignored */ } ;

What this does is move all your error handling up to the parser, where you have better context. Similarly, you should move any errors that you can out of the parser and into the tree parser, where once again you have better context. The classic example is trying to code the number of parameters that any particular function can take. Don't do that; accept any, including 0, then check for validity in your first tree walk.

I can see that a lexer error is considered a NoViableAlt parser exception, but there is still a lexer error report from ANTLR. Where can I find the lexer error report code? Or how can I intercept the lexer error like I do with the parser error report?

Intercept the same way, install your own displayRecognitionError, but make it say "Internal compiler error - lexer rules bad (sad) all your base belong to us"

How to solve the problem of parser crashes whenever a scoped attribute is referenced outside of a scope context?


[This kind of situation arises when a rule B that is referencing a scoped
attribute defined in a rule A may match on its own without requiring a
previous match to A.]

_Luca_

It is a bug in your grammar which you can solve in one of a number of ways:

  1. The best thing to do would be to move the scope up the rule hierarchy so that you cannot call this rule without a scope in place.
  2. Protect the code yourself.
  3. Use a different rule when the same syntax is parsed

How to access the input char stream from lexer's displayRecognitionError(String[] tokenNames, RecognitionException e) to display the characters preceding and following the error?


Use e.input from the Exception, on a CommonToken you can use getInputStream() (sometimes the input stream for the token may not be the same as the exception, for instance if you handle include processing etc).

How was a segmentation fault in isNilNode() from libantlr3c.so solved for the following attached files (Eval.g, Expr.g, input.txt)?


[Test.cpp file attachment is missing; other attachments in markmail]

To see the error:

  1. Save all files in the same directory, cd to it and run "chmod +x compile"
  2. Open Expr.g in ANTLRWorks, generate code, then do the same with Eval.g
  3. Run (I'm using GNU C++ compiler): ./compile
  4. Run: ./Test input

_L. Rahyen_

The error was that the code translated Java's "new CommonTreeNodeStream" to "antlr3CommonTreeNodeStreamNew" instead of "antlr3CommonTreeNodeStreamNewTree".

After finishing processing, why is rec->state->errors_count == 0? It has detected an error already.


Юрушкин Михаил

I presume you mean errorCount? Is this from the lexer? I fixed a bug whereby the lexer would not count errors - fixed in release 3.2. However, lexers should really be written so they don't throw recognition errors, but detect anomalies and report them more directly (such as missing terminating quotes and illegal characters). Finally, don't use internals directly; use the API call: getNumberOfSyntaxErrors.

How to skip to end of line on error?


I want to parse a file that consists of bibliographic entries.  Each entry is on one line (so each record ends with \n). If a record does not match, I just want to print an error message, and skip to the end of line and start again with the next record. If I understand chapter 10 correctly, then '\n' should be in the resynchronization set, and the parser will consume tokens until it finds one.

This isn't happening.  Once I get an error, the parser never recovers. I get a bunch of NoViableAlt exceptions.  I'm hoping someone can explain what I'm doing wrong.

Here is a sample input file.  The 1st and 3rd lines are ok, the 2nd line is an error.

 Name. "Title," Periodical, 2005, v41(3,Oct), 217-240.

Name. "Title," Periodical, 2005, v41(3,Oct), Article 2.

Name. "Title," Periodical, 2005, v41(3,Oct), 217-240.

 Here is the grammar:

grammar Periodical;

articleList
    :    (article NL)* article NL? ;

article
    :    a=authors PERIOD SPACE QUOTE t=title COMMA QUOTE SPACE j=journal

authors    :    (~QUOTE)+;

title    :    (~QUOTE)+;

journal    :    (LETTER|SPACE|COMMA|DASH)+;

volume    :    (LETTER|DIGIT)+;


pages    :    DIGIT+ DASH DIGIT+;

PERIOD    :    '.';
QUOTE    :    '"';
COMMA    :    ',';
SPACE    :    ' ';
DIGIT    :    '0'..'9';
LETTER  :    ('a'..'z')|('A'..'Z');
DASH    :    '-';
SLASH    :    '/';
NL    :    '\r'? '\n';

Rick Schumeyer

Basically, you need to prevent the parsing loop from dropping all the way out of the current rule because it finds an error (in your case, within the article rule). You will also find this much easier if, rather than trying to accommodate files without a terminating NL, you just always add an NL to the incoming input; then you will not need the trailing article NL? but can have (article NL)* EOF.

 So, when an error occurs in the article rule, it will drop out of that rule, but may not resync, so you want to force the resync to the NL when the article rule returns. This is pretty simple, but requires quite a bit of 'inside' knowledge of the ANTLR behavior. What you need to do is create a rule with just the epsilon (nothing) alt, and invoke it directly before the article call but more especially directly after it:


articleList
    : reSync (article reSync NL)* EOF // Assuming that this is where EOF should be matched
    ;

Next, in your reSync rule, you want to resync to the follow set that will now be on the stack, which is actually the same as the first set of the following rule (because reSync is empty). Here we know that the followSet will only be NL, so you could hard code that, but this is a generally good technique to know, so let's use it generically. If you don't really understand this, don't worry too much; you can just copy the code and empty rule and it will work:





reSync
@init {
    syncToFirstSet(); // Consume tokens until LA(1) is in the followset at the
                      // top of the followSet stack
}
    : // Deliberately match nothing, but will be invoked anyway
    ;

Then in your superClass (best) or @members, implement the syncToFirstSet method:

    protected void syncToFirstSet() {

        // Compute the followset that is in context where ever we are in the
        // rule chain/stack
        //
        BitSet follow = state.following[state._fsp];
        syncToFirstSet(follow);
    }

    protected void syncToFirstSet(BitSet follow) {

        int mark = -1;

        try {

            mark = input.mark();

            // Consume all tokens in the stream until we find a member of the
            // set, which means the next production should be guaranteed to be
            // correct
            //
            while (! follow.member(input.LA(1)) ) {

                if  (input.LA(1) == Token.EOF) {

                    // Looks like we didn't find anything at all that can help
                    // us here, so we need to rewind to where we were and let
                    // normal error handling bail out
                    //
                    input.rewind(mark);
                    mark = -1;
                    return;
                }
                input.consume();
            }
        } catch (Exception e) {

          // Just ignore any errors here, we will just let the recognizer
          // try to resync as normal - something must be very screwed.
          //
        }
        finally {

            // Always release the mark we took
            //
            if  (mark != -1) {
                input.release(mark);
            }
        }
    }





And that's it. Every time you mention reSync in a rule, it will resync the input to a member of the current followSet, which will be the first set of the rule that follows reSync in the current production and you will therefore not drop out of the parsing loop, but reenter your article rule. The first invocation is just in case there is junk before the first article starts (depending on how this rule is invoked, you may need to resync before the articleList rule).

Question 2) I would like to make one change: when an error is encountered, I just want to print something like "problem on line 45" (and then continue with the next line).


 I did write this up at:


 If you look at the sync Java routine in the article - we are looking at making this generic and adding it to ANTLR - you will see that there is a marker there for printing errors. If you set a flag to say you have printed an error, then just print an error the first time you consume a token; you can get the line number etc. from the token that you consume. You could also add a string parameter to the reSync routine which is a template or format for your message, so you can say what type of construct you were parsing at the time, and pass this to the Java routine.

Is there a way where in if there is a lexer error, antlr reports the error and exits without creating the parser for it ?


for ex: In the calling program:

lex = LexerNew(input);
tokens = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lex));
//want the lexer to exit here only in case of a lexer error 

Meena Vinod

Invoke the lexer separately by asking for the first token. This will cause the tokens to be gathered (the lexer will run to completion). Then check if it issued any errors (you should probably override the error reporting mechanism) and stop if it did.

 However, you should code your lexer such that it does not give errors, at least not from the runtime. See lots of past posts about this on this list. Ideally, you want your process to go as far as it can without giving up such that your end user receives as much information as possible about what is wrong before having to run your parser again. See:


 for some concrete examples.

How to customize error handling for the lexer?


The Java parser has a general notion of parser exceptions. The C runtime simulates this by checking some attributes in the base-recognizer structure, thanks to the HASEXCEPTION macro. Anyway, it happens that recovered lexer errors are not propagated to the parser, so in this case a HASEXCEPTION check does not return true. Starting from a parser rule invocation, how does one detect that there was an error in the lexer? Another way to formulate the same question: how does one get a reference to the lexer starting from the invocation of a parser rule?


That isn't how it works.

First the lexer is called to create all the tokens, then the parser runs. However, you can call the lexer on its own (see C examples), then just check the error count member of the recognizer before invoking the parser.

However, you should not program a lexer that can raise errors. Always cover all the bases with at least a final rule: ANY : . {error ...}; and left-factor the other rules so that they cannot error out on their own. Then you can leave errors to the parser.

Error handling using parallel instances of a C-target parser


I am working with a C-target parser, and I have multiple instances of the parser running in parallel.

Now I would like to stop the parser from printing error messages to stderr. Instead, I would like each instance of the parser to collect the error messages in a list of strings, so that the caller can access the complete list of error messages after the parser finished and decide what to do about them.

Johannes Goller

Use antlr.markmail.org and look for displayRecognitionError. Remember that if you have parallel threads, you will want the error collections to be per-instance, not global members. Therefore you add them as context members via @apifuncs etc.





How to turn off single token insertion and deletion error recovery (C target)?


I am trying to turn off single token insertion and deletion error recovery in my parser (C target). I found the following comment in antlr3baserecognizer.c above the match() function.

/// Match current input symbol against ttype. Upon error, do one token
/// insertion or deletion if possible.
/// To turn off single token insertion or deletion error
/// recovery, override mismatchRecover() and have it call
/// plain mismatch(), which does not recover. Then any error
/// in a rule will cause an exception and immediate exit from
/// rule. Rule would recover by resynchronizing to the set of
/// symbols that can follow rule ref.
This seems fairly straightforward at first glance, but then I discovered that there is no mismatchRecover() function to override. Digging through the code, I suspect that this function was renamed to recoverFromMismatchedToken(), but I cannot simply override it with mismatch() because their prototypes do not match.

void * (*recoverFromMismatchedToken) (struct ANTLR3_BASE_RECOGNIZER_struct * recognizer,
                                      ANTLR3_UINT32 ttype,
                                      pANTLR3_BITSET_LIST follow);

void   (*mismatch)                   (struct ANTLR3_BASE_RECOGNIZER_struct * recognizer,
                                      ANTLR3_UINT32 ttype,
                                      pANTLR3_BITSET_LIST follow);

As you can see, one returns a void *, and the other returns void. What is the correct way to do this?

Justin Murray

It means: install your own version of recoverFromMismatchedToken and basically don't consume or insert, but reset any flags etc.

Why am I getting memory leaks during error recovery?


I have the rule:

myRule returns [Type1 res] :
	rule1 rule2 rule3 rule4 ... ruleN { res = f($rule1, $rule2, ..., $ruleN); }
	;

It's all OK. BUT if ruleN fires an exception, the rule1, rule2 ... rule(N-1) subtrees will be forgotten!!!

PS: I tried to use the "catch" clause, but it needs an error code. That's not good, because there may be different errors.


It is because you are trying to do things while you parse - another reason to build a tree and THEN operate on the tree.

Catch does not need a type in the C target, you can just use:

catch() { }

(assuming 3.2 of ANTLR).

The other thing you might do is break up your rule list so that exceptions in them do not drop out the whole rule, which is what happens in all targets unless you structure the rules a little. Break things down in to smaller units. The function call overhead (which may not even occur because of inlining) is very small in C.

File name, line number, column number etc.

How to get the string for ID in a grammar?



Use the CommonToken fields to find the start and end of the text that represents the token and copy/print/etc. directly from the input text.

Download the examples and look at the C parser in there.

When an error is detected during the tree parsing, how to print error with information on the original input tokens, not on the tree node which is not relevant for the end user?



You get the start and end token from the node, then ask the start token for its information and the end token for its information, and then you have the complete span of a node. Beware of -> ^(NODE1 c u ^(NODE2 x)), as NODE2 won't get the span information when the rewrite is like that.

To get the line and character number in the input file, get the token span from the node, then get the tokens and ask them for the information.

How to get line number and position of characters in input string?




How to get a pANTLR3_INPUT_STREAM from a std::string (or char* variable)?




 pANTLR3_INPUT_STREAM antlr3NewAsciiStringCopyStream    (pANTLR3_UINT8  inString, ANTLR3_UINT32 size, pANTLR3_UINT8  name);
                           Create an ASCII string stream as input to ANTLR 3, copying the input string.

 pANTLR3_INPUT_STREAM antlr3NewAsciiStringInPlaceStream (pANTLR3_UINT8  inString, ANTLR3_UINT32 size, pANTLR3_UINT8  name);
                           Create an in-place ASCII string stream as input to ANTLR 3.

 pANTLR3_INPUT_STREAM antlr3NewUCS2StringInPlaceStream  (pANTLR3_UINT16 inString, ANTLR3_UINT32 size, pANTLR3_UINT16 name);
                           Create an in-place UCS2 string stream as input to ANTLR 3.


How to write a pretty printer for input language, with C as target; using following approach:

returns [pANTLR3_STRING result]
@init {result = factory->newRaw(factory);}
: ^(SOMETOKEN anotherRule thirdRule)
  {
    $result->append($result, "Start\n");
    $result->appendS($result, $anotherRule.result);
    factory->destroy(factory, $anotherRule.result);
    $result->appendS($result, $thirdRule.result);
    factory->destroy(factory, $thirdRule.result);
    $result->append($result, "\n\n");
  }
;

Problems encountered:

  1. when factory->close(..) is called at the end of my program, I get a double free problem, and I don't see in the API where I can call remove on the string from the factory.
  2. However, more troubling is that when the return of one of the rules like anotherRule is composed of only small literal strings (like "()"), then calling destroy on the result sometimes frees too much memory, so that I cause problems for "thirdRule".
  3. Moreover, this just seems like an awkward way to build up my output string.


  1. You should not be calling factory->destroy. Just close the factory; and if you use the parser's factory, you don't need to do that. When you are done, just memcpy (or strdup) the chars pointer from the final string. All the other memory will be discarded for you. If you are just going to write out the result, then fwrite the chars pointer and close as normal - all memory will be freed for you.
  2. factory->close(factory); // Do this only if this is your own factory, which there is no need for really.
  3. With respect to point 2 in the question, you don't use destroy() like this, so the next time you try to use the string you have already corrupted the memory, and that causes problems for "thirdRule".
  4. With respect to point 3 in the question, that is because you are assuming that you have to do all the management. You don't do that, you just let the factory take care of it all. When you close, it has tracked all the memory and it frees it all for you. You just use the strings and forget about them as if they were Java objects. Use the factory in the parser (see C examples for poly for instance), and you don't even need to close your own factory.
  5. You do not need to use the string factory stuff. It is really just a convenience. You can copy the input text yourself using the token supplied offsets.

How can I access file, line and column information for tokens used by the tree parser (ANTLR 3.1)?


[ In the following example, how can I access the location of the 'for' token used in the tree grammar to pass to the AppendOp() function?

Similarly, I'd like to add debug (file+line+column) info for all generated objects.

// in lexer
   :    'for' '(' start=expression? ';' cond=expression? ';'
next=expression? ')' body=statement
     -> ^('for' $cond $next $body $start )

// in tree
    :    ^('for' cond=continuation step=continuation body=continuation
expression)    { AppendOp(Operation::ForLoop); }

Christian Schladetsch

1. It is the same as the other targets. Get a reference to the node, then the token that the node holds, or the token span that the node represents (so you can use the start position of the start token and the end position of the end token for semantic errors, say). To get the file name, ask the token what its input stream was, then ask the input stream what its input is called.

Just be careful about the imaginary tokens as you can use their start and stop spans, but they were not generated from input, so the nodes themselves are imaginary.

However, you are using 'literals' in your parser and tree parser and this will make it very difficult for you to identify tokens unless you already have a context for them. You are well advised to replace these literals with 'real' tokens defined in the lexer, after which you can switch() on the token type and so on:

FOR: 'for';
SEMI: ';';

: FOR start=expression SEMI  ....etc

With good token names, the grammar will be no less readable. One tip is to pass the tree nodes around within your own routines, rather than the tokens they contain, span and so on. Finally, you have 3 int fields (user1, user2, user3) and a void * field (userp) available in a token, into which you can place anything you like at any point, because overriding token types and so on is a bit of a pain in C if you don't already know how to do it.

2. For getting filename, in the C target especially, switching input streams in the lexer is allowed, hence we added the ability for CommonToken to tell you what its input stream was.

Line and column can be taken as attributes 'line' and 'pos' from tokens though:

^(forTok='for' cond=continuation step=continuation body=continuation expression)
      { line   = $forTok.line;
        column = $forTok.pos; }

It may not be enough though when using error messages or other processing.

How to get line and column numbers from a CommonTreeNodeStream to use it to print error messages?



Say you had:

foo: bar e=expression;

And expression should be int but returns float instead. So, pass $e.tree to your message handler and cast it to CommonTree. This reference then has access to the documented methods of CommonTree. So:

public void logMsg(MessageDescriptor m, CommonTree ct, Object... args) {

  CommonToken st;
  CommonToken et;

  st = (CommonToken)(tokens.get(ct.getTokenStartIndex()));
  et = (CommonToken)(tokens.get(ct.getTokenStopIndex()));

  // Call the standard logger, using the information in the tokens
  logMsg(m, st.getLine(), st.getStartIndex(), et.getStopIndex(), args);
}


You can then print out something like:

Warning: (6, 33) : Expression should result in an integer value but gives a
bar 84/3

How to find the offset of a token, in terms of the number of characters from the start of the stream?



The very first token gives you a -1 for the char position in line, I am afraid; I need to work around that, I think. But the indexes are pointers into memory (your input) and not 0, 1, 2 etc. Note that the token also remembers the start of the line that it is located on.

If the start of the first token is not the start of your data, then perhaps there are comments and newline tokens that are skipped before the first token that the parser sees? If this did not work, there would be a lot of broken parsers out there.

So, use the pointer to get the start, subtract it from the end pointer to get the length, and print out that many characters, which will show you what the token matched. The line start is updated when a '\n' is seen by the lexer, but you can change the character. This is useful for error messages when you want to print the text line that an error occurs in.

The offset of the token is the start point minus the input start (use the address you pass in (databuffer) and not input->data), however, the pointer is pointing directly at that anyway. The token stream does not return off channel tokens or SKIP()ed tokens.

Why am I getting incorrect values for CharPositionInLine for tokens on first line of input ?


I'm using ANTLR3 in one of my projects and it seems that I found a small bug in ANTLR C (versions 3.2, 3.1.3).

The ANTLR3 C runtime returns incorrect values of CharPositionInLine for tokens on the first line of input. This is eventually propagated to tokens. I wrote a small program to demonstrate this behavior - it is available here: http://devel-www.cyber.cz/files/tmp/antlrc3-bug-pack.zip

The bug makes token stream post-processing a little bit more complicated than it ideally can be...

Here is the expected output (produced by the Python target). The first character is from input.txt, the second is the result of getCharPositionInLine():

--- cut --- cut --- cut ---

1 0

2 2

3 4

\n 5

1 0

2 2

3 4

\n 5

--- cut --- cut --- cut ---

and here is actual output from C target:

--- cut --- cut --- cut ---

1 4294967295

2 1

3 3

\n 4

1 0

2 2

3 4

\n 5

--- cut --- cut --- cut ---

Please notice that the first token has an undefined value from getCharPositionInLine() and the rest, until the end of the line, is shifted by -1.

Workaround is to set ctx->pLexer->input->charPositionInLine to zero  after constructing lexer and before actual lexing/parsing.

Ales Teska

Yes - you have to special case it. However, you only ever come across this while developing really ;-)


How to solve memory usage issues for following generated C parser?


(memory usage climb from 770MB to 4GB in a few seconds and the parser never returns.)

init code is as follows:

input = antlr3NewAsciiStringInPlaceStream((pANTLR3_UINT8)stringCopy,
                                          stringLength, NULL);
if (input == NULL) {
    log->error("input error");
}

lexer = DbsMySQL_CPPLexerNew(input);
if (lexer == NULL) {
    log->error("lexer error");
}

tstream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lexer));
if (tstream == NULL) {
    log->error("token stream error");
}

parser = DbsMySQL_CPPParserNew(tstream);
if (parser == NULL) {
   log->error("parser error");
}


I added some printf() statements to the generated code and tracked the issue down to the "LA(1)" macro which expands to "ctx->pParser->tstream->istream->_LA(ctx->pParser->tstream->istream, 1)". It seems that this call never completes. Here is the code in question:

statement(pDbsMySQL_CPPParser ctx)
{
    /* Initialize rule variables */
    /* Empty */

    //  DbsMySQL_CPP.g:74:1: ( i= insertStatement EOF | s= selectStatement[true] EOF | u= updateStatement EOF | d= deleteStatement EOF )
    printf("** 1\n");

    ANTLR3_UINT32 alt1;

    printf("** 2\n");

    switch ( LA(1) )  // this call never completes
    {
    case INSERT:

Usually, this problem is created by an incorrect lexer rule that is matching nothing, hence the parser asks the lexer for the first token and the empty rule matches and nothing is consumed from the input stream. Hence your lexer will carry on returning empty tokens until it runs out of memory.

Look for the following errors:

TOKEN : ('a'..'z')* ; // Note that this should be + and not *

TOK : ; // Forgot to define this or meant it to be fragment

I am pretty sure that this is what you will find somewhere. If you single step into the LA routine, then you can trace through to the fillBuffer() method, which is trying to create the token stream before it gives the first token back to the parser. Keep stepping and you will find yourself in the code for a lexer rule. It is likely that the first lexer rule you enter is the one in error.

Memory management in C target:


Problem: with very big input (e.g. 530,000 lines of C code) it crashes because it hits the 2GB process memory limit.


The C target will be a lot faster than the Java target, but the objects that are created are probably bigger. It is probably better to reduce the input though. 530,000 lines of C code as input seems a bit of a tall order for anything, even if you parse it. The individual input files would be better. Read the input in place, not by memcpy and so on. You can also write a simple override for the character stream.

Write an input stream wrapper that splits the input by just returning EOF at the split point then resets to the next unit.

Also, I think you were using $text references in your parser and these will create hundreds of thousands of string objects that will not be released until you release the parser. Try making a version that does not build a tree and see how it differs.

To use the text of an object, it is better to get the pointer to the input from that object and use the length (start and end pointers are stored in the object), so that you make no copies or memory allocations. You cannot use strlen, as this will not stop until the end of the input string. The start pointer points directly to the input text and so does the end pointer. The length is the difference between the two.

The $text (in the C target) is a convenience method that is relatively slow and inefficient; it is just there when you don't really care that much about those factors.

You can also try 64bit mode, which will raise the 2GB bar.

How to avoid using $text: It is clear that for tokens in the parser, you can use getStartIndex and getStopIndex directly to avoid using $text. How can you do this for an arbitrary tree node when walking the tree? It appears in this case that you also need the token stream (to ask for the token using get()). Is there any way to get the token stream from the tree node or is there another way to get the text associated with the node?



The pointer to the token (unless it is an imaginary token produced by the parser) is in the token field of pANTLR3_COMMON_TREE. So, take the pointer to the base tree that the reference in the tree gives you and cast the super field in there to (pANTLR3_COMMON_TREE) and then the token field is pANTLR3_COMMON_TOKEN. However, the getToken() method of pANTLR3_COMMON_TREE will do that for you. Look at the methods in antlr3commontree.c.

Remember that the tree parser only deals with pointers to the lowest basic structure, which is pANTLR3_BASE_TREE, and that has a pointer 'super' to the structure that contains it (normally pANTLR3_COMMON_TREE), which also has a super pointer in case you encapsulate it further (usually too much hassle to be worth it).

For a node with children then follow the lists recursively.

The code that produces dot files for an arbitrary tree is a good place to look for hints, as this traverses pANTLR3_BASE_TREE and looks for the text that represents it. You will find that in antlr3basetreeadaptor.c

How to reduce memory usage for the C target?



  • A user hacked commontoken to remove most of the function pointers, which halved the size of the tokens.
  • It is recommended to not use $text since it could lead to increased memory usage
  • A user ran some more experiments using valgrind to profile the heap allocations and now sees about 70:1 using only his modified lexer on a 64-bit system.
    Enabling parsing with AST construction roughly doubles this.
    This user's changes are in smalltoken.tar. The setText/getText functions have many dependencies so it wasn't as easy to do a search and replace to change those. The startIndex/stopIndex functions are used by the generated code so those were left alone.
  • It was found it was possible to remove the user1, user2, user3 fields and the custom function pointer with only minimal changes in other source files. This gave approximately a 10 percent reduction in memory usage.

How to solve problem of ANTLR running out of memory while parsing huge files?


[I have written a simple grammar to parse huge data files (several gigabytes each) and antlr seems to crash by running out of memory (I am using "C" as the target language).

The data files have the general format:
<several millions of lines here>

What seems to be the problem is that antlr tries to parse the whole data file at once. Is there a way to "force" parsing line by line? (at least for the "BODY" part?)]

Nick Vlassopoulos

  • You will need to split the input into more manageable chunks yourself I am afraid. When you start the parser it asks the lexer for the first token, which causes the lexer to tokenize the entire input.
    You can feed it line by line by resetting the lexer and parser and providing a new string stream with the pointer and length set accordingly, and hence a new token stream for the chunk you wish to parse next. There is a relatively small overhead in doing this from C, and it is the same technique you would use to parse any chunk. If your input is several gigabytes, then the standard technique of reading the whole file at once and parsing it all at once would not be so useful anyway. In your position I would write a custom input stream that performs buffered reads on the file and returns EOF at strategic points, but which can be reset (or maybe auto-reset) until the real EOF is found. Your parser can retain state so you know where you are. At each EOF, you can ask the input stream if it was really the end or just a fake end, from which you can then restart. Make sure that you retain the input stream for as long as you need to materialize the text of the tokens, as the tokens just point into the input stream. However, you can set the text explicitly or build up your output on the fly and so on.
  • You don't want something like a 'line parser' that's instantiated for each line of the BODY section to do this really, you will create millions of malloc/free calls - go with the custom input stream I mentioned and you will be fine. It sounds like you can easily pick out the faked EOF points without parsing them.
    If the input is just millions of data elements, then you could parse the headers, then have the input stream traverse the data points with a little custom code, until the next header is seen.
    The input lines are of the form"var = data" so they are pretty simple!
  • Perhaps you can do this all in the lexer and not create tokens for the data but just use the input stream in your own lexer action code.

But I was thinking this:

  1. Copy my input stream code and name it for yourself;
  2. Have it respond to LA() using buffered reads until it finds the token that starts the body, say it is 'BODY', then it returns EOF;
  3. Invoke the parser/lexer/inputstream stack and it will set up the information you need for the incoming data and stop, the input stream remembers where it was;
  4. Process the data using a little custom C code that works with the input stream until you see the data has ended, tell the input stream where to restart;
  5. Tell the input stream to set up for the next header starting at the data end location. If it wasn't at real EOF, then go to 3)
  6. End

You can also do the same thing without a custom input stream, but then you would be reading the entire file and pre-scanning and so on.

If your headers are pretty simple, you might also find that an awk script or just plain C code is a better method.
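The fake-EOF idea above can be sketched independently of the ANTLR C runtime. The code below is only an illustration of the chunking logic, with all identifiers and the marker name "HDR" invented for the example: scan the large buffer and report each section as its own chunk, so that a fresh lexer/parser run can be pointed at it.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative chunker: each chunk ends just before the next occurrence of
 * a marker (a fake EOF), or at the real end of the buffer. */
typedef struct {
    const char *data;
    size_t      len;
    size_t      pos;    /* where the next chunk starts */
} chunker;

/* Return a pointer to the next chunk and its length in *outLen,
 * or NULL when the real EOF has been reached. */
static const char *next_chunk(chunker *c, const char *marker, size_t *outLen)
{
    const char *start;
    const char *hit = NULL;
    size_t mlen = strlen(marker);
    size_t i;

    if (c->pos >= c->len) return NULL;
    start = c->data + c->pos;
    for (i = c->pos + 1; i + mlen <= c->len; i++) {
        if (memcmp(c->data + i, marker, mlen) == 0) { hit = c->data + i; break; }
    }
    if (hit != NULL) { *outLen = (size_t)(hit - start); c->pos = (size_t)(hit - c->data); }
    else             { *outLen = c->len - c->pos;       c->pos = c->len;                 }
    return start;
}
```

Each chunk returned here would then be handed to an in-memory ANTLR input stream (or to input->reuse() on 3.4) together with a reset lexer and parser.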


How to label multiple attributes for referencing in actions?



decl: type id=ID ';' { print "var" + $id.text; }

With the C language target, the $id.text gets converted nicely into:


However, if you have more than one attribute:

decl: ^( TYPE ids+=ID* )

...$ids becomes a pANTLR3_VECTOR. How to access these attributes in this case?


Some pointers:

  • The += syntax is really only used for tree rewriting
  • Instead of gathering a list and then trying to process it afterwards, do the following (and note that you use + because otherwise, if there are no IDs, it is just a TYPE alt):

    : TYPE // No IDs
    | ^(TYPE
         ( i=ID { some code that does $i.whatever } )+
      )
      { action code to finish up }
  • The attribute count will be available in the tree parser, as the args will be in a vector and you can reference the count of the list.

How to create a preprocessor (by extending the "mTokens" function in the lexer)?


Fabien Antoine


  • mTokens is generated but you can install any function you like in its place by adding code in the initialization. You should see that the address of the static function is just stored in the lexer structures. Just replace the pointer with a pointer to your own function. Read through the source though to find out what you need to do in that function and the functions that it invokes. If you are trying to override this function, then you are really replacing the lexer.
  • Look at the generated code and the 'constructor' for your lexer. If you trace through that code you will see that mTokens() is just a static function, the pointer to which is installed in the lexer structures. You can install your own.
    However, pre-processors are easy enough and though you have to add extra code to your lexer rules, it is much easier (and more maintainable) to set the channel of the token to a member variable value or call SKIP. Then implement the pre-processor in the lexer. Sometimes you cannot do this though. For instance the C# pre-processor is implemented in the lexer, but the VB pre-processor is its own up front program (which I implemented as a parser grammar myself).
  • Override the nextToken and nextTokenStr functions. Copy them from antlr3lexer.c, modify as needed and perform the SKIP operation when your flag is set to 'off'. You then install the pointer to your version of nextToken() in the lexer structures after you have created your lexer and before anything calls in to it.
    (It is neater to explicitly code for this in the lexer rules.)

How to concatenate literals in parser in ANTLR 3?


In SQL we must be able to write

SELECT 'aaa' 'bbbb'

And this should be same as

SELECT 'aaabbbb'

I.e. the parser must concatenate the literals itself.

Ruslan Zasukhin

option 1: Each token contains a char * pointer into the input stream, or use the built-in pANTLR3_STRING support (which is auto-freed by its factory); then it is just:

@declarations { pANTLR3_STRING s; }
   { s = $s1.text; }
   { s->appendS(s, $s2.text); }

     { $s1->setText(s);  /* Check that, but I think it is this */ }


option 2: There is no need to do that in the parser, just use:

: s1+=STRING -> $s1+  /* Or, ->^(SLIT $s1+) */

Then just do the string manipulation in the tree walk (which means you will only use it if you have to). You still need to reference the text of course.

So, really, don't use the $x.text as it is slow, just call an external C++ method/object that takes a pointer to the token or base tree object and extracts the string (start is a void * address of first char, end is a void * address of the last char, length is the difference). You can make a neat C++ class that can accept either of these in the constructor and has an overloaded append() and a getCstr(). Don't try to do too much in the code itself and there is no need to amalgamate text and things in the parser.
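As a concrete sketch of extracting text from those start and stop pointers (plain C, not tied to the runtime's token structs; the buffer offsets in the usage are just for the example), where start addresses the first character and stop the last:

```c
#include <stdlib.h>
#include <string.h>

/* Copy out the text between a token's start and stop pointers, where
 * start points at the first character and stop at the last one.
 * The caller frees the result. */
static char *token_text(const char *start, const char *stop)
{
    size_t len = (size_t)(stop - start) + 1;
    char *s = (char *)malloc(len + 1);
    if (s != NULL) {
        memcpy(s, start, len);
        s[len] = '\0';
    }
    return s;
}
```

Concatenating two adjacent literals then becomes ordinary string handling outside the parser, done only when the text is actually needed.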

Follow-up to above question:

Can the following be used to resolve everything at the lexer level?

So we must resolve tokens as

  • STRING_LITERAL 'aa' ws* 'bb' => Token( "aabb" )
  • STRING_LITERAL 'aa\'bb' => Token( "aa'bb" )
  • STRING_LITERAL 'aa''bb' => Token( "aa'bb" )
  • STRING_LITERAL 'aa''bb''cc' => Token( "aa'bb'cc" )
  • HEX_LITERAL x'aa' => Token( "aa" )
  • HEX_LITERAL x'aa' ws* 'bb' => Token( "aabb" )


You can of course process things anywhere that it does not cause ambiguity but the best approach is to defer any processing that you can until the last point in time, so that you do not process anything that you find you don't actually need to. The second 'rule' is that you only want to process things once, so process and cache the result for later.

If you can modify the input stream, then you don't need to copy anything here, just move the start and end pointers in the token and overwrite the few bytes that you are moving. That way there is no malloc and nothing to free. If you cannot modify the input stream, then you will need to copy from the token pointers of course.

So, here you should lex the escape characters and the embedded '' into STRING_LITERAL, but not try to process the WS* there; return two or more tokens. Then the parser or tree parser can process the strings. If you are going to do multiple walks, then probably in the parser, but if just one walk (or only one walk where you care about the text represented by the tokens), then process in the tree parser when you hit the STRING_LITERAL+.
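The "overwrite in place" idea above (move the pointers and shift a few bytes, so nothing has to be malloc'd or freed) can be sketched like this for the doubled-quote case; the function is an illustration, not runtime code:

```c
#include <stddef.h>

/* Collapse each embedded '' to a single ' by shifting the tail left
 * in place. Returns the new length; also NUL-terminates the result. */
static size_t collapse_quotes(char *s, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        s[w++] = s[r];
        if (s[r] == '\'' && r + 1 < len && s[r + 1] == '\'')
            r += 2;     /* skip the second quote of the pair */
        else
            r += 1;
    }
    s[w] = '\0';
    return w;
}
```

So aa''bb''cc becomes aa'bb'cc with no allocation, matching the token examples above.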

How to manipulate (push/pop) some additional data structures along the stack of input streams?

(want to manage data structures in a parallel stack and pop on EOF in v3 in C)



Override nextToken: how to do it is explained on antlr.markmail.org. Basically you make a copy of the functions you need and install your own pointers.

How to read a stream from standard input?



Because the lexer requires the entire input to work, you need to find some point in your input stream where you can accumulate packets; then you can use the standard constructors for in-memory input streams and so on, and reset/rebuild the lexer/parser/tree parser combinations for each 'packet'. There is no continuous parsing at the moment, though I have some ideas on how this might be done (it is not trivial).

Values of uninitialized pointers


Xie, Linlin

The pointers that are not initialized are set to the value 0xcdcdcdcd by the Windows debug malloc. You won't be able to rely on that value in production mode or on other platforms. The user pointers are deliberately not initialized, as they are for you to use. So if you need them to be NULL sometimes, then you should set them in your own version of nextToken or whatever.

How to reduce the size of the generated files for the SPARQL grammar and get the files to compile; options used:

options {
     language = C;
     output = AST;
     ASTLabelType = pANTLR3_BASE_TREE;
}



The huge file size occurs because your lexer/parser is probably trying to do too much, or you are asking ANTLR to do lots of disambiguation and the complex overlaps are generating huge tables. In the case of the parser, I suspect that you need some single-token predicates to help with keyword disambiguation. Have you removed ALL the warnings that ANTLR generates on your grammar? If you do not remove all the warnings then this sort of thing happens a lot, especially with a terrible language such as the one SQL has morphed into.

The code only LOOKS small in Java because the generated Java uses run-length-encoded strings for the table values, which it must expand at runtime; the C target lays down the exact same tables, but statically, so they are set up at compile time. Java was unable to use compile-time initialized tables like this until JDK 1.7, so the Java target must jump through hoops to generate the tables. So in fact generating the C is a better indicator of how efficient your grammar is. You can probably trace the table sizes down to a few key decisions.

Your setText errors are likely because you are not using the SETTEXT macro correctly in some way. Also, I would avoid doing that at lex time and do any manipulation only if you actually use the token in question. I can't help unless I see the lexer code in question though.

Use the 3.4 beta C runtime - there is no difference in the release version except for the API documentation.

You cannot just take a lexer that was written for the Java target and change the language= option; you must use the SETTEXT macro for the C target.

The lexer rules:

BLANK_NODE_LABEL : '_:' t=PN_LOCAL { setText($t.text); };

VAR1 : QUESTION_MARK v=VARNAME { setText($v.text); };

VAR2 : '$' v=VARNAME { setText($v.text); };

are coded for Java and not C, you cannot simply change the target language when there is embedded Java code.

All the lexer rules are specified as ('E'|'e') and so on, which will generate bigger tables than the other ways to implement case insensitivity, as explained on the wiki. Also, the grammar has a lot of rules that it has just left ANTLR to sort out, which is fair enough, but it is much better to left-factor the rules and change the $type once you know what the token is; for instance, all the numeric rules.

The parser grammar will just work, but it is just naturally a big one. You might contact the authors about it. There are probably a lot of ways it could be made more efficient, but as the tables are all static, then it does not matter that much in C. Look at the size of the data segment once it is compiled as this is a better indicator than the size of the source code, which has lots of annotations.

Finally, look at the code that is output, find the decisions that are generating large decision trees, and look at the corresponding rules for any optimizations. However, fix up the SETTEXT and it will just work.

To fix the SETTEXT I would just not do what they are doing, but merely advance the start pointer in the token by 1 or 2 when/if you use it (or within the lexer code if you must). That is trivial and gives better performance. In other words, just take the setText() actions out altogether.

reuse() method in 3.4 C runtime: (06/24/11)


Because the documentation is not yet up to date, here is an example of reusing the allocated memory in input streams and token streams:

    for (i=0; i<iterations; i++) {

        // Run the parser; the start rule and the variable names here
        // (psr, lxr, tstream) are illustrative.
        psrReturn = psr->yourStartRule(psr);

        // --------------------------------------
        // Now reset everything for the next run.
        // Order of calls is important.

        // Input stream can now be reused
        input->reuse(input, sourceCode, sourceLen, sourceName);

        // Reset the common token stream so that it will reuse its resources
        tstream->reset(tstream);

        // Reset the lexer (new function generated by antlr now)
        lxr->reset(lxr);

        // Reset the parser (new function generated by antlr now)
        psr->reset(psr);
    }
Note that tree parsers cannot reuse their allocations, but this is rarely an issue. The input->reuse() call will reuse any memory the stream has allocated, but requires that you handle the reading of the input files (or otherwise supply a pointer to them). The input files are assumed to be encoded in the way the original input was created, for instance:

input = antlr3FileStreamNew(fname, ANTLR3_ENC_8BIT);

Then all reused input must be 8 bit encoded.


How to resolve conflict between ANTLR emit and QT emit


(Qt defines "emit" as a keyword. This conflicts with the definition of emit in ANTLR v3:

emit       (pANTLR3_LEXER lexer)

Unfortunately changing this requires changes in both the ANTLR generator and the C runtime. Without the changes ANTLR v3 is unusable in C++ source files that also use Qt.)


You should:

  1. Probably not include both ANTLR and Qt headers in the same compilation unit; you should not be mixing Qt code in with ANTLR code;
  2. If you must do this, and I can't see why you must, as you should call external helper methods from actions, then #undef emit before including the ANTLR headers.
    But really, the answer is to decouple your code.

How to free the ANTLR3_STRING returned from pANTLR3_TOKEN_STREAM's toStringSS()?


Chetty, Jay

The correct way is to do nothing at all. The string factory tracks the memory and releases it once you close the factory. The factory will be created by the token or node stream and will be closed when you free the token or node stream.

Note that the string factory stuff is basically convenience methods for things like toStringTree(). If you need pure performance, then you should access the token structures yourself.

You can use valgrind to check that all your memory is being freed correctly.

Also note that while making a string from the AST is a quick way of looking at your tree, you might find the dot generator and the dot program much more useful (install graphviz from your distro or www.graphviz.org).

To use it from C, you just do this:

pANTLR3_STRING    dotSpec;
dotSpec = nodes->adaptor->makeDot(nodes->adaptor, psrReturn.tree);

Where nodes is the pANTLR3_COMMON_TREE_NODE_STREAM and psrReturn is the return type of the rule you invoke on your parser.

You can then fwrite the spec to a text file:

dotSpec = dotSpec->toUTF8(dotSpec);   // Only needed if your input was not 8-bit characters
fwrite((const void *) dotSpec->chars, 1, (size_t) dotSpec->len, dotFILE);

Then turn this into a neat png graphic like this:

sprintf(command, "dot -Tpng -o%spng %sdot", dotFname, dotFname);

How to remove compiler warnings due to conflicting types in the ANTLR-generated C code: (warning C4244: '=' : conversion from 'ANTLR3_MARKER' to 'ANTLR3_UINT32', possible loss of data)

This comes from the line:

axisMask_StartIndex = INDEX();

axisMask_StartIndex is declared as type ANTLR3_UINT32, and INDEX() returns type ANTLR3_MARKER. On a 64-bit build (on Windows), ANTLR3_UINT32 is a typedef of uint32_t, and ANTLR3_MARKER is ANTLR3_INT64, which is a typedef of int64_t.

It seems to me that this is a bug in the template, and that axisMask_StartIndex should have been declared as type ANTLR3_MARKER.

grammar Test;

: 'A'..'Z';



  • It is probably because of the backtrack and memoize options. I strongly advise that you don't use these, but left-factor your grammar instead.
  • Justin Murray: I tracked down and fixed this bug in the C target template. Attached is the patched version of C.stg, which I extracted from antlr-3.4-complete-no-antlrv2.jar/org/antlr/codegen/templates/C/C.stg and modified. It looks like it is a simple one line fix on line 1699, changing from ANTLR3_UINT32 to ANTLR3_MARKER. I've confirmed that this patch makes the warnings go away in my 64-bit build.
    (Attachment available in markmail)

Problem compiling an application using LibXML2 (http://xmlsoft.org/) with a C lexer/grammar generated from ANTLR.


My includes look like:

#include    "MyParser.h"
#include    "MyLexer.h"
#include    "antlr3.h"
#include <libxml/tree.h>
#include <libxml/parser.h>

The problem is my program refuses to compile. My problems are documented here: http://stackoverflow.com/questions/6769548/antlr-c-target-and-xmllib


The header files are interfering with each other. You probably need to separate your logic from your parser .g file. Look at the lines indicated as in error and you will probably find some symbol that ANTLR defines that libXML is also defining. When you know what it is, you can possibly undef it before including the libXML headers.

I am working on converting the C++ grammar by Aurelian Melinte from a C back end to a C# back end and cannot find out how to properly convert the @declarations section to C#.



@declarations is only needed for the C target (and really only to be C89 compliant, which used to be important before gcc was everywhere). So just roll them up:

	int i;
	i = 0;

from a C target is just

	int i = 0;

in the C# and Java targets, and probably most others.

Does the ANTLR 3.1.3 generated parser work with UTF-8 input? If it does, how should I configure it in the grammar? I noticed there are two macros, ANTLR3_INLINE_INPUT_ASCII and ANTLR3_INLINE_INPUT_UTF16, but no UTF-8 one.



I have written a new universal input stream for the next version of the C runtime. It takes 8-bit, 16-bit, UTF-8, UTF-16, UCS-2, UTF-32 and EBCDIC input (code gen will change slightly to support this). It is not well tested right now, but will be available shortly as a snapshot 3.3 release on the downloads page.

In the meantime the easiest thing to do is to convert to UCS-2 using the supplied converter in the current runtime. Use ConvertUTF8toUTF16() in the file called 'antlr3convertutf.c'.

This will not work with surrogate pairs in UTF-16, though most people do not need that.

If you really need UTF-8 without conversion then it is easy enough to write, or you can just steal the code from my check-in in about 10 minutes. Note that while the streams work, I have not provided ANTLR3_STRING support for UTF-8 and so on yet, so getting $text from such a stream may or may not work.
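To show what the conversion boils down to, here is a minimal BMP-only UTF-8 to UCS-2 sketch. It deliberately mirrors the UCS-2 caveat above (no surrogate pairs, so no 4-byte sequences) and omits the validation that the runtime's supplied converter performs:

```c
#include <stddef.h>

/* BMP-only UTF-8 -> UCS-2: handles 1-, 2- and 3-byte sequences and
 * assumes well-formed input. Returns the number of code units written. */
static size_t utf8_to_ucs2(const unsigned char *in, size_t inLen,
                           unsigned short *out)
{
    size_t i = 0, o = 0;
    while (i < inLen) {
        unsigned char b = in[i];
        if (b < 0x80) {                       /* 1 byte: ASCII */
            out[o++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {      /* 2 bytes: U+0080..U+07FF */
            out[o++] = (unsigned short)(((b & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {                              /* 3 bytes: U+0800..U+FFFF */
            out[o++] = (unsigned short)(((b & 0x0F) << 12)
                        | ((in[i + 1] & 0x3F) << 6) | (in[i + 2] & 0x3F));
            i += 3;
        }
    }
    return o;
}
```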

Normally a tree is built with a nil node and the children. But if the parser recognizes just one line, the nil node is not used (probably there are no children). What is going wrong?

this is my top-rule...

      @init{ _pParser->m_bError = false; _pParser->m_ScopeDelimiter = "@"; }
      : ( pragma
        | expression_statement
        )* EOF!
      ;

Can you see a problem?



Actually the nil node should never be there, so there must be something awry with your grammar. Try making sure that your top rule looks like:

top : myrule EOF! ;

You should always rewrite the top node so that you have a single root node. That is your problem: the nil node is created to hold the children, and as you don't rewrite it, it stays there; but it is not created when there is just a single node. So just do this:


      @init{ _pParser->m_bError = false; _pParser->m_ScopeDelimiter = "@"; }
      : telement EOF
          ->^(TUNIT telement)
      ;

telement
      : ( pragma
        | expression_statement
        )
      ;
And you will be all set.

How to make the lexer thread-safe ?


My lexer has to rely on some internal status like the following:

DQUOTE  :   '"' { if(LA(-1) != '\\') double_quoted = !double_quoted; } ;
SQUOTE  :   { double_quoted }? => '\'';
SINGLE_QUOTED_STRING_TOKEN  :   { !double_quoted }? => '\'' .* '\'';

"double_quoted" is a bool variable declared in @member section. The generated code will declare it in global scope, which is not thread safe. I wonder if there is any way to make the lexer thread-safe? For example declare the variable in xxxLexer_Ctx_struct.

Mu Qiao

In the C target you can put the flag in the lexer's own context structure (so each lexer instance gets its own copy) and initialize it when the lexer is constructed, using the @lexer::context and @lexer::apifuncs sections:

    @lexer::context { ANTLR3_BOOLEAN double_quoted; }       // one flag per lexer instance

    @lexer::apifuncs { ctx->double_quoted = ANTLR3_FALSE; } // Init


Why doesn't @after action work for C target?


Ronghui Yu

@after only works from ANTLR 3.2 on. You can also use exception() with the C target on 3.2.

How to solve the problem of structures used for return values and those used for scope values not being initialized?



I want to have a scoped value which is a structure, and my structure holds some std::strings

struct MyStruct {
  std::string s1;
  std::string s2;
};

//this is part of my grammar
scope {MyStruct s;} //scoped VALUE
: rulegoeshere....;

The code generated by antlr creates a scoped wrapper structure that holds MyStruct, something like:

ctx->pMyParser_myruleTop = pMyParser_myrulePush(ctx); // this will create a wrapper for the scoped value by calling ANTLR3_MALLOC

the wrapper looks like this:

typedef struct MyParser_myrule_SCOPE_struct
{
    void     (ANTLR3_CDECL *free) (struct MyParser_myrule_SCOPE_struct * frame);
    MyStruct s;
}
    MyParser_myrule_SCOPE, * pMyParser_myrule_SCOPE;

As you can see, my struct is inside this structure. The problem is that to create the wrapper (see pMyParser_myrulePush above) ANTLR calls ANTLR3_MALLOC (which does malloc, of course), so the std::string members are never constructed; this results in a crash.]

Cristian Târşoagă

– Initialize all your values in the @init{} section.

– As the memory has already been allocated, the correct way to initialize it is to construct the object in place with placement new.

– It is also necessary to register your own free method so that the destructor gets called.

grammar MyGrammar;

options {
        language = C;
}

scope GS {
  MyStruct s;
}

@parser::includes {
#include <new>
#include "MyStruct.h"
}

myrule
scope GS;
@init {
  // Construct the strings in the malloc'd scope memory with placement new,
  // then register the free function so the destructor runs
  new (&$GS::s) MyStruct();
  ctx->pMyGrammarParser_GSTop->free = &free_MyStruct;
}
      : 'foo'
      | 'bar'
      ;

// MyStruct.h
#include <string>
#include <antlr3.h>

struct MyStruct {
  std::string s1;
  std::string s2;
};

extern "C" {
  void ANTLR3_CDECL free_MyStruct(struct MyGrammarParser_GS_SCOPE_struct *scope);
}

// MyStruct.cpp
extern "C" {
void ANTLR3_CDECL free_MyStruct(struct MyGrammarParser_GS_SCOPE_struct *scope)
{
  // Call the destructor explicitly; the runtime frees the memory itself
  scope->s.~MyStruct();
}
}

– There is a macro, SCOPE_TOP(), for accessing the top of a scope stack.

How to add custom data in ANTLR3_BASE_TREE?




There is a field "u" in the ANTLR3_BASE_TREE, which can be used to point to your own struct that contains the data you need for that node in the tree.

1) Set it to NULL yourself at an appropriate place.
2) In the ANTLR source code, you will find:

/// Default definition of ANTLR3_MALLOC. You can override this before including
/// antlr3.h if you wish to use your own implementation.
#define	ANTLR3_MALLOC(request)          malloc  ((size_t)(request))

Change that to

/// Default definition of ANTLR3_MALLOC. You can override this before including
/// antlr3.h if you wish to use your own implementation.
#define	ANTLR3_MALLOC(request)          calloc  (1, (size_t)(request))

And rebuild the runtime and you will sacrifice a little performance for nulled space.

3) Find the code that creates new nodes from a tree factory (the function newPooltree in antlr3commontree.c) and before the return statement, add:

tree->baseTree.u = NULL;

How to change "#include <antlr3.h>" in the generated .h parser file? Can I change this path in the grammar options?


Юрушкин Михаил

You don't - just use the correct -I option on the compile line.

Why do these lexer rules make the generated .c source file grow to over 40M in size?


 I found that using the same lexer rules, the size of the generated .java file is much smaller than the generated .c file. For example, the size of the generated .java file is 124K, while the size of the .c file can reach 14M!

: 'BEGIN_ENTITY' ( options {greedy=false;} : . )* 'END_ENTITY' SEMI
: 'PROCEDURE' ( options {greedy=false;} : . )* 'END_PROCEDURE' SEMI

: 'TYPE' ( options {greedy=false;} : . )* 'END_TYPE' SEMI
: 'SUBTYPE_CONSTRAINT' ( options {greedy=false;} : . )*

: 'RULE' ( options {greedy=false;} : . )* 'END_RULE' SEMI

: 'CONSTANT' ( options {greedy=false;} : . )*  'END_CONSTANT'

: 'REFERENCE'  'FROM' ( options {greedy=false;} : . )*  SEMI
:  'USE' 'FROM' ( options {greedy=false;} : . )*  SEMI

       :       '(*'
               (       ~('('|'*')
                       |       ('(' ~'*') => '('
                       |       ('*' ~')') => '*'
                       |       COMMENT
'--' ( ~('\n'|'\r') )*
{  $channel=HIDDEN; }

: ';'

: '('

: ')'

: '['

: ']'

: '{'

: '}'


: ( ' '
| '\f'
| '\t'
| ( '\r\n' // Evil Dos
| '\n\r' // Unknown
| '\n' // Unix
| '\r' // Macintosh
{ $channel=HIDDEN; }


: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*


It's an illusion. In C, the transition tables are laid down statically and you see their true size; in Java they are expanded at runtime from run-length-encoded assignments, so you only see the size at run time, and there is an associated overhead in building them that is not present in C.

 You can also often do better by looking at your lexer rules and being less precise about the characters you accept (or perhaps exclude) from a token, checking them externally to the rule. This allows better error messages anyway. The lexer will just say "Unexpected character 'x' at ...", whereas your own check can say "The identifier 'aaaaaxaaaaa' contains an illegal character 'x' ...."
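The "check externally" advice can be sketched as a post-lex validation pass: accept a deliberately loose token in the lexer, then run something like the function below over its text so the message can name the offending character (the legal character set here is just an example):

```c
/* Return the index of the first character that is not [A-Za-z0-9_],
 * or -1 if the identifier is clean. */
static int bad_ident_char(const char *ident)
{
    int i;
    for (i = 0; ident[i] != '\0'; i++) {
        char c = ident[i];
        if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
              || (c >= '0' && c <= '9') || c == '_'))
            return i;
    }
    return -1;
}
```

A hit at index i lets you report "The identifier '...' contains an illegal character '%c' at position %d" instead of the lexer's generic message.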

 The C target doesn't implement the compressed format for DFAs. This would be pointless as they would just incur a runtime overhead to expand them to the same size as well as the storage space for the compressed forms, whereas now they are laid out as static and require no runtime overhead.

Why am I getting a null pointer to setTokenBoundaries in the following line of generated code? 


ADAPTOR->setTokenBoundaries(ADAPTOR, retval.tree, retval.start, retval.stop);

The grammar works under Java but not in C. I have set in options:

options {
	backtrack 	= 	true;
	memoize		= 	true;
	language	=	C;
	output		=	AST;
}

The null is inside 'ctx' inside 'adaptor' at 'setTokenBoundaries'.

It is inside a function 

* $ANTLR start line
* /Users/acondit/source/GCCnv/LatheBranch/trunk/Parser/RS274ngc.g:184:1: line :
( ( line_number )? ( segment )+ K_NEWLINE -> ^( STMT ( segment )+ ) | (
line_number )? K_NEWLINE -> | oword_stmt -> ^( STMT oword_stmt ) );
static RS274ngcParser_line_return
line(pRS274ngcParser ctx)

which I assume, based on the comment, is generated from this rule:

 line	:	line_number? segment+ K_NEWLINE
		-> ^(STMT segment+)
	|	line_number? K_NEWLINE
	|	oword_stmt
		-> ^(STMT oword_stmt)
	;

The grammar is for parsing an existing language not one of my invention, and grammatically the newlines delineate a semantic block therefore must be known by the parser, but empty lines are discarded and therefore should not be in the tree.

Alan Condit

Answer thread :

 I think you will have to put those three productions in separate rules


I can put two of the productions in separate rules but the first two productions are really one split for simplicity of writing the rewrite rules.

 Without the rewrite rules it is this

line : line_number? segment* K_NEWLINE
 |  oword_stmt
 ;

With the rewrite rules you can get to this

line	:	line_number? ((segment+)? -> ^(STMT segment+)?) K_NEWLINE
	| 	oword_stmt
		-> ^(STMT oword_stmt)

You can split those two productions into two separate rules but they ultimately have to be combined. Like shown below:

program	:	stmt

stmt	:	line+

line	:	line_number? ((segment+)? -> ^(STMT segment+)?) K_NEWLINE
	| 	oword_stmt
		-> ^(STMT oword_stmt)

So by splitting them you would get something like this:

program	:	stmt

stmt	:	line+

line	:	sline
	| 	oline

		// a segment line can have 0 to several segments
		// but segment lines with 0 segments should not be in the AST tree
sline	:	line_number? ((segment+)? -> ^(STMT segment+)?) K_NEWLINE

	| 	oword_stmt
		-> ^(STMT oword_stmt)

The line_number and the K_NEWLINE token are never in the tree. Bottom line is you still have to deal with an empty rewrite rule.


Why have you got (segment+)? And you are discarding your line number and rewriting in the subrule. 

Try this first:

   : line_number? segment* K_NEWLINE

        ->^(STMT line_number? segment*)

   | oword_stmt

        ->^(STMT oword_stmt)

The problem is that you're telling me that the cardinality of segment is + when it is in fact *. I am pretty sure that this will work then.

Final decision on value initialization in C target



After some wrestling with the templates I have found a way to preserve the tree rewriting semantics but allow a more C-like behavior for grammars, whereby if the grammar programmer does not initialize a return value from a rule, then it is just left as it would be in C: uninitialized and therefore likely to be garbage.

 While this may break backwards compatibility (and so I will emphasize this in the release notes), it seems better to behave like C than Java in this respect. Because I am able to preserve the tree rewriting semantics, I feel that those few affected by this will agree that not initializing return values without being told to makes more sense in the long run.

 So, the ways to initialize parameters are:

r returns [enum FRED = FRED_VAL1]
 : ... ; 

And the initialization will be generated for you. Or you can place initialization in the @init section of a rule, or otherwise initialize via actions. Note all the same rules apply about @after vs @finally vs exception code (see Markmail search if you don't remember seeing that email - will document for next release).

I think that should keep just about everyone happy. Please remember though that the C rules return a struct unless there is only one return parameter and no tree nodes are being generated. This limits some of the things that you can declare as return types. Also, because there are some limitations in the generic parsing of the return element specs, it is quite often desirable to make a typedef of a complex declaration. Finally, passing things around in a parser instead of waiting for a tree parse is not generally a good idea anyway, because of the complications of freeing things if you hit a parsing error.
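The typedef advice can be illustrated outside ANTLR: once a complex declarator has a one-word name, a struct shaped like a rule's return spec stays trivial to declare and pass around. All names below are invented for the example:

```c
#include <string.h>

/* Without typedefs, writing these types inline in a returns[...] spec
 * would force the generic return-spec parsing to cope with a full
 * C declarator; a one-word alias avoids that. */
typedef enum { FRED_VAL1, FRED_VAL2 } FRED;
typedef int (*cmp_fn)(const void *a, const void *b);

/* A struct shaped like a multi-value rule return */
typedef struct {
    FRED   f;
    cmp_fn cmp;
} rule_return;

static int cmp_str(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

static rule_return make_return(void)
{
    rule_return r;
    r.f   = FRED_VAL1;   /* the declared default initialization */
    r.cmp = cmp_str;
    return r;
}
```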

Why does the following grammar generate NoViableAltException in ANTLRWorks ?


1) Consider the following grammar:

      grammar schema;

       options {
               language = C;
       }

       root : letter* ;
       letter : A | B ;
       other : C;

       A       :       'a';
       B       :       'b';
       C       :       'c';

If you run it on the input string "abc" in ANTLRWorks it generates a NoViableAltException (as I would expect), but using the C Runtime to parse a 'root' it passes successfully. 

Michael Coupland


 root: letter* EOF;

There are no exceptions in C, so the top rule can only set error flags; with EOF the parser must consume the whole input and will flag the failure.

Question 2:

The C Target generates many structs with members called "free" which, while not technically a reserved word, isn't an ideal choice for an identifier name. There are codebases where free is #defined to be something else, which can lead to problems in the generated code that uses 'free' as a normal identifier. I haven't yet looked into modifying the C target to solve this locally, which doesn't seem like a huge task, but it would be nice if the default behavior were to use some other less-overloaded identifier.

Answer thread:

Maybe, but as free is a function in every C runtime that I know of, #defining it in a system header file would break a lot more than the ANTLR runtime. Which system are you thinking of that #defines free? The trade-off is the use of an intuitive method name vs something like 'release' or 'close'.


It's definitely a rarity, and something that you have to be very careful about, but many performance-sensitive codebases do ugly/sneaky things to hijack control of memory allocations throughout the system. I know of at least one high-profile commercial game engine that #defines free to be something else, and the MySQL SDK does this as well. I understand the argument for intuitive names, but wouldn't something like freeObject be equally intuitive, and less likely to collide?


Yeah - I think that one would just need to #undef free though, to be honest. I don't think that system headers ever override free, only SDKs, so you know when and where to cope. ANTLR itself does not call free directly; it uses the ANTLR3_FREE/ANTLR3_MALLOC/ANTLR3_CALLOC macros, so that you can predefine those and build the runtime on systems that need something different. Really, SDKs should do the same thing.

Question 3:

I can't seem to find documentation on how the C Target's error handling works. Clearly the documentation at http://www.antlr.org/wiki/display/ANTLR3/Error+reporting+and+recovery isn't directly relevant. 


 It  basically does the same thing as the other targets, but without  exceptions.

Question 3.1:

Where can I find more information about this? Is there a good way to understand how the C Target emulates the Java Target's use of exceptions, apart from reading generated code? There don't seem to be any examples that deal with custom error reporting using the C Target.


Many past posts though:


The docs at: http://antlr.org/api/C/index.html document displayRecognitionError which, just like in Java, is what you must override to implement your own error display. Also, I have commented that routine to death so that you can copy it and modify it to do what you need personally. Just read through the function.
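The override is just a function-pointer assignment on the recognizer (in the real API, something like parser->pParser->rec->displayRecognitionError = myDisplay;). Here is a self-contained miniature of that pattern; the Recognizer struct and all names are invented for illustration and do not match the real pANTLR3_BASE_RECOGNIZER layout:

```c
#include <stdio.h>

/* Miniature of the recognizer's error hook: generated code reports errors
 * by calling through a function pointer, so installing your own display
 * function is a single assignment. */
typedef struct Recognizer Recognizer;

struct Recognizer {
    const char *lastMessage;                            /* last error seen */
    void (*displayError)(Recognizer *rec, const char *msg);
};

static void defaultDisplay(Recognizer *rec, const char *msg)
{
    rec->lastMessage = msg;
    fprintf(stderr, "error: %s\n", msg);                /* default output  */
}

/* A custom handler: record the message instead of printing it. */
static void quietDisplay(Recognizer *rec, const char *msg)
{
    rec->lastMessage = msg;
}

static void reportError(Recognizer *rec, const char *msg)
{
    rec->displayError(rec, msg);   /* generated code calls via the pointer */
}
```

Replacing the handler is then one assignment, exactly as with displayRecognitionError in the real runtime.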

Question 3.2:

 I had noticed displayRecognitionError, but wanted to make sure there wasn't some other error-handling mechanism that I needed to worry about as well.


No. The advantage and disadvantage of the C runtime is that it is pretty raw, of course, and you tend to have to know a bit more about the internals than with the other targets (though not tons more). The Java and C# versions are great, and in fact I always prototype with one or the other of these, but performance is an issue, so tangling with the C is worth it in the end if you need the speed.

For instance, my T-SQL parser has 1100 regression tests and the timings for parsing these and walking the tree look like this:


jimi(50_64)-clean: time tsqlc . >out

0.19user 0.15system 0:00.34elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k

8inputs+232outputs (0major+102770minor)pagefaults 0swaps


jimi(50_64)-clean: time tsql2005j . >out

2.33user 0.08system 0:02.05elapsed 117%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+456outputs (0major+29634minor)pagefaults 0swaps

Question 4:

I was running into some problems with scope variables, and saw this thread: http://www.antlr.org/pipermail/antlr-interest/2009-March/033769.html but the link to http://antlr.org/downloads doesn't seem to work. http://www.antlr.org/hudson/job/ANTLR_Tool/lastSuccessfulBuild/ seems like a good place to get the latest development build, but I can't seem to actually find a download link anywhere?

Answer thread:

It is download rather than downloads. From Hudson, just click on the project. The first thing that comes up is a list of source code artifacts that you can download.



http://antlr.org/download looks very useful! Is there a link to this page from the main ANTLR webpage? Most download links seem to point to http://antlr.org/download.html , which is different...

For the C Runtime, yes, it's pretty easy to find the download in Hudson, but I didn't know where to go in Hudson to get the specific file you mentioned in the other email (antlr-master-3.1.4-SNAPSHOT-completejar.jar )  - it's  not 100% clear which Hudson job it would belong to (probably ANTLR_Tool, though), and there don't appear to be any downloadable artifacts at http://antlr.org/hudson/job/ANTLR_Tool/


I see what you are saying. You need to take the "Module Builds" link from the  last successful build page:


How to solve the following cast problem?

Using antlr-3.1.3 with the C runtime (libantlr3c-3.1.3); compiling the generated antlr code for parser using g++; For a rule like:

                                :                           (descrs +=

I'm getting a void* conversion error in the generated code:

SVParser.c:1895: error: invalid conversion from ‘void*’ to ‘SVParser_description_return*’

Unfortunately compiling with gcc is not a solution. I've tried to change the AST.stg from the C code generation templates, and I think this should solve my problem, but I was not able to build the entire antlr-3.1.3.jar using Maven (I think this is referred to in the build instructions as the Uber Jar).

 Is it possible to build the uber jar for the 3.1.3 release based only on antlr.org 3.1.3 source distribution?


Dragos Tarcatu

Jim Idle: Actually, I think I may have fixed this just the other day with perforce change #6115. Try with the latest ANTLR snapshot from the download directory listing:


(3.1.4-SNAPSHOT) or from Hudson:


If you want the patch for 3.1.3 rather than 3.1.4-SNAPSHOT, then download the source distribution, install Maven, copy the patch at:

[1]http://fisheye2.atlassian.com/changelog/antlr/?cs=6115 , rebuild and you are done. If you don't want to build the source, then you can just unjar what you have, change the C.stg template and jar it back up again.

Why is the C runtime crashing during parsing?


I generated code for my first try at ANTLR, and it crashes during parsing. It errors during parsing, and crashes while printing the error message:

antlr3commontoken.c line 346

token->tokText.text    = token->strFactory->newStr8(token->strFactory,


The strFactory pointer is not valid in my case. I have not yet found where strFactory is assigned. I can get around this by setting up the factory manually:

pANTLR3_INPUT_STREAM            input = ...
pFactoringLexer                             lexer = ...

pANTLR3_STRING_FACTORY        stringFactory = antlr3StringFactoryNew();
lexer->pLexer->rec->state->tokSource->eofToken.strFactory = stringFactory;

pANTLR3_COMMON_TOKEN_STREAM        lexTokens =
pFactoringParser                parser = ...
parser->progStart(parser); // crash here 

Ben Ratzlaff

This may be a bug, already listed in the bug system, which can be browsed publicly at: [1]http://antlr.org/jira

There I think you will find that I fixed this issue, but if not then I know I do have it listed as something to fix for 3.1.2. If fixed already then you can get the latest C runtime dist by visiting Hudson at:  [2]http://www.antlr.org/hudson/job/ANTLR%20C%20Runtime/

You will likely need a latest jar for the ANTLR tool as I think that there are C codegen template changes to go with it. Again, Hudson is your friend: [3]http://www.antlr.org/hudson/job/ANTLR%20Tool/

There is probably something going on in your grammar that triggers this, but it could be a function of the fact that in the latest version of the runtime I changed from using calloc to malloc, and at the same time tokens don't necessarily have a strFactory (which is detected by the pointer being NULL). I have fixed some things in this area in the latest runtime, and you may be running across this.

You should not need to set up the factory manually as mentioned.

Problems with C runtime and output=template


Richard Lewis

Question 1:

I'm trying to convert my tree grammar from Java to C and am running into an issue with code generation. I'm targeting string templates as the output type and run into the following error:

1>Translating to tree parser.

1> : error 10 : internal error: no such group file ST.stg  

1>Project : error PRJ0019: A tool returned an error code from

"Translating to tree parser."

Are string template outputs supported in the C code generator? I can't find a corresponding ST.stg in codegen/templates/C.


The C runtime does not support StringTemplate, as it is an object-oriented system.

Question 2:   I found out from the WIKI that RewriteTokenStream is not implemented yet. Is this related to string template rewriting not being available? If so, when will this be implemented since it's a roadblock for me.


No, it is just that the RewriteTokenStream has been in flux somewhat and is still pending changes to reduce the overhead of using it. No one has asked for it before now, and I did not want to write and re-write it. It will be implemented in 3.2, assuming that the Java version's performance is corrected by then.

Question about option greedy


I wrote this grammar:

file
	: (property comment property)*
	;

comment
	: COMMENT { printf("Comment: \%s\n", $COMMENT.text->chars); }
	;

COMMENT
	:  '/*' ( options {greedy=false;} : . )* '*/'
	;

property
	: TOKEN { printf("Property: \%s\n", $TOKEN.text->chars); }
	;

fragment DIGIT
	: '0'..'9'
	;

fragment ALPHA
	: 'a'..'z' | 'A'..'Z' |'@'|'.'| ' '
	;

The input is:

This is a test /* with a comment */ in the middle

This is a test /* with a comment */ in the middle

This is a test /* with a comment */ in the middle

The result looks good, but some errors are printed out:

test.txt(1) : lexer error 3 :

at offset 49, near char(0XA) :

This is a test /* w

test.txt(2) : lexer error 3 :

at offset 50, near char(0XA) :

This is a test /* w

test.txt(3) : lexer error 3 :

at offset 50, near char(0XA) :

Property: This is a test 

Comment: /* with a comment */

Property:  in the middle

Property: This is a test 

Comment: /* with a comment */

Property:  in the middle

Property: This is a test 

Comment: /* with a comment */

Property:  in the middle

BTW: The line ending in this file is 0x0A.

Andreas Volz

Question 1: Could anyone explain this error and how to prevent it?


You have not specified to the lexer what it should do with those chars (I assume that this is C from your code above):

 NL : ('\r' | '\n')+ { $channel=HIDDEN; } ;
ANY : . { SKIP(); } ; // Always make this the very last lexer rule

Question 2:  How do I not include the '/*' and '*/' tags in the comment match?


From the top of my head:

COMMENT : '/*' { $start = $pos; } ( options {greedy=false;} : . )* { EMIT(); } '*/' ;

I think it is $pos, but you might need to use GETCHARPOSITIONINLINE() rather than $pos.

Question 3: To exclude the /* and */ would something like this work? Didn't that used to work in Antlr 2?  I think that would be a very useful feature to have back.

 	:  '/*'! ( options {greedy=false;} : . )* '*/'!


No. We have explained this many times in the past, but it is to do with the performance gains achieved by not associating text with tokens unless you really need it, in which case you set the text yourself.

You can also do this:

T1 : '/*' r=FRAGRULE '*/'  { setText($r.text); } ; 

So in practice it is only a minor inconvenience, for a simpler and faster lexer :-)

Question about parse keywords and variables


I have a problem with parsing keywords. In PL/SQL, "EXIT" is a keyword, but some variable names also include this keyword.

 CURSOR my_cursor IS
 OPEN my_cursor;
 exit WHEN my_cursor%NOTFOUND;
CLOSE my_cursor;

:GLOBAL.exit := 'Y'; --------- (if this statement is :GLOBAL.exit123 := 'Y', no error occurs)

If my .g file defines "exit" as a keyword, then when the grammar analyses the variable name it always goes to the "statement" rule to match the keyword and then throws an exception. How can I let the parser know that the second "exit" is a variable (go to the varName rule) and not a keyword (don't go to the "statement" rule)?


                     "EXIT"^ (expression)? (WHEN! (expression))?

varName :
COMMIT) )? ( DOT (IDENT) )?       {#varName.setType(VARIABLE_NAME); }

Renee Luo

You need an identifier rule, and to use that rather than the ID token, when identifiers can also be keywords:

 id : ID | t=EXIT { $t.setType(ID); } .... ;

It can be done for all SQL keywords:



Segfault in C target on EOF error reporting


If there's an EOF in the grammar, the C target crashes on input with certain syntax errors. I wrote this quick fix. I hope it at least helps you pinpointing the problem.

--- libantlr3c-3.2/src/antlr3baserecognizer-orig.c	2009-12-11 23:54:59.000000000
+++ libantlr3c-3.2/src/antlr3baserecognizer.c	2011-02-03 14:39:59.942609300
@@ -2216,7 +2216,7 @@
 	if	(text != NULL)
-		text->append8(text, (const char *)recognizer->state->tokenNames[expectedTokenType]);
+		text->append8(text, expectedTokenType == EOF ? (const char *)"EOF" : (const char *)recognizer->state->tokenNames[expectedTokenType]);
 		text->append8(text, (const char *)">");

Marco Trudel

Jim Idle: I have already fixed this I think. It is because the EOF token is trying to be duplicated or otherwise modified. However, the runtime error message routine is just an example - you are expected to implement your own that does something sensible ;-)

Issue with Missing header file in dist 3.1.3: antlr3config.h


I'm trying to learn ANTLR. I'm attempting to exercise the hoisted predicates code in C. The code is failing due to a missing header file.

I've downloaded the latest runtime, 3.1.3, but it does not contain the header either. Any suggestions?

Here is the error:

In /Users/jskier/Documents/libantlr3c-3.1.3/include/antlr3defs.h:217:26: error: antlr3config.h: No such file or directory

John Skier

Please read the build instructions in the API documentation.


configure, make, sudo make install

Looks to me like you did not, and are trying to include the files directly from the expanded tar you downloaded (antlr3config.h is generated when you run configure, so it does not exist until then).

How to prepare the input stream for a function call?


I'm using the C API to implement function calls. I've already recorded the index of the function body and created a new node stream for the function call. But I find that I have to call istream->size(istream) before calling istream->seek, or errors will be reported. Is it because the input buffer is not ready, or something else? What is the correct way to reset the index?

Mu Qiao

[Posted on 070411] Yes, the input stream is not ready. I think that this is 'fixed' in later releases. However, the idea is that I did not want to test a flag to see if the istream was ready on every call into the stream, so you must use a call that initializes the stream, and then everything else works.

C Target: setUcaseLA bug?



Brian Catlin

Question 1: Is setUcaseLA known to work in ANTLR3c-3.2? 


Works for me; I just tried it. There must be something else going on. Are you sure you are reading the debugger correctly? Are you tracing into the upper-case version of LA?

Question 2: At the beginning of my generated lexer routine mTokens, it switches on LA(1) trying to build a token, but LA(1) is returning the lowercase input character.  As you can see below, I call setUcaseLA immediately after the input stream was created.






Can you see your problem if I highlight the above line? ;-) You are not calling LA, as you have told the lexer to use inline input code and not call the input routines.

How to push and pop scope using antlr3 c code generator?


I'm new to ANTLR 3. I want to write a simple parser and do scope pushing and popping during parsing. How can I define this in ANTLR 3 grammar rules? Is there some way to do things like this in the grammar rules:

   enterBlock|enterFunction { pushScope();}

   exitBlock|exitFunction { popScope(); } 

Yang Yang

1. There is an example in examples-v3/C/C/C.g in the examples:


2. Just use the built in global scopes.

Using ANTLR and MySQL in C target


I am trying to build an ANTLR grammar using the C target.  I want to store some data from the grammar into a MySQL database.  However, when I include both the mysql and antlr C target runtime libraries in my project, I get errors: 

[Please refer link above for errors]

I have used the MySQL library and the ANTLR C runtime library separately with the relevant sections of the same program and have no trouble then. Please let me know if anyone has used these two libraries together, and if so, how you overcame the above redeclaration errors.

Dinesha Balasuriya Weragama

The MySQL headers do a lot of strange things, such as #define free - it probably means that you are trying to lump everything together and include all the headers everywhere. Try splitting this out into a supporting translation unit and just including what you need. For instance, just pass a pointer to the token text and its length, and don't include the ANTLR headers.
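One way to keep the two header sets apart, sketched with a hypothetical bridge function (copyTokenText is not a real ANTLR or MySQL API): the grammar action passes only a character pointer and a length, and the MySQL headers are included only in the bridge's own translation unit:

```c
#include <stdlib.h>
#include <string.h>

/* bridge.c (hypothetical): in the real project this file would #include
 * the MySQL headers, while the file holding the grammar actions would
 * #include only the ANTLR headers plus this function's prototype, so the
 * two header sets never meet in one translation unit. */
char *copyTokenText(const char *text, size_t len)
{
    char *copy = (char *)malloc(len + 1);
    if (copy != NULL)
    {
        memcpy(copy, text, len);
        copy[len] = '\0';
        /* ... hand `copy` to the MySQL client library here ... */
    }
    return copy;
}
```

The action code then reduces to a single call with a pointer and a length, which is exactly the decoupling suggested above.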

Backtrack + rule arguments + C target


I have some problems with backtrack option and rule arguments.

Example grammar:

grammar testbt;

options {
	language = C;
	backtrack = true;
}

	@init {
		int i = 1;
	}
	:	num[i] | id[i]
	;

num[int i]	:	'0'..'9'+;
id[int i]	:	'a'..'z'+;

The generated files do not compile, with error "error C2065: 'i' : undeclared identifier".

(The problem is that C target does not insert rule arguments into argument list of autogenerated 'synpred' functions.)

How can I fix this?

Anton Bychkov

Please read the documentation:

@declarations {}
@init {}

Then search the list at antlr.markmail.org for hoisted predicates and local variables. It isn't the C target; it is just that the local variable or parameter is out of scope for the predicate, so you must use scopes if you have to use semantic predicates (with parameters).

Saving, duplicating and substituting tree nodes in the C API


Question 1: I'm working on a language that allows assignment of functions to variables. Something like this.

fun = execute(params);

Later in the same scope if I want to do.

another_fun = fun

The most natural way to do this seems to me to save the tree from the first execute(params) in a symbol table and then, while generating the AST, when I see that "fun" is defined, to substitute a duplicate of its tree for the assignment to another_fun.

David Minor

That doesn't sound like the correct way to do things, to be honest; the AST is just the parsed elements, and you probably need to do a lot more with these than just store them. Linguistically, that looks very confusing; you should probably use a different operator than = :-)

Question 2: The trouble is the symbol table is in C world not Antlr.  It looks like I could just duplicate the node save the pointer and then re-use it, but I don't see any examples of how to do this.

Does anyone have an idea?  Even a Java example would be helpful.


What are you trying to do/build? An interpreter? Basically, you should do as little as possible in the parser other than the basics, such as building the symbol table if that is a possibility (you know what all the types are while you are parsing, etc.). Generally, you would not want to duplicate the sub-tree for a symbol table, but you can just reuse the pointer anyway, as it is returned by the rule that parses your function. $rule.tree should do it, I think.
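A sketch of the C-side symbol table that just stores the pointer a rule returns: the node type is left opaque here (it would be pANTLR3_BASE_TREE in real code), and all names and the fixed capacity are illustrative, not part of any API:

```c
#include <string.h>

#define MAX_SYMS 64   /* illustrative fixed capacity */

typedef struct {
    const char *name;
    void       *tree;   /* pANTLR3_BASE_TREE in real code */
} Symbol;

static Symbol symtab[MAX_SYMS];
static int    symCount = 0;

/* Called from an action, e.g. with the rule's returned tree pointer. */
static int defineSymbol(const char *name, void *tree)
{
    if (symCount >= MAX_SYMS)
        return 0;
    symtab[symCount].name = name;
    symtab[symCount].tree = tree;
    symCount++;
    return 1;
}

/* Later, "another_fun = fun" just reuses the stored pointer. */
static void *lookupSymbol(const char *name)
{
    for (int i = 0; i < symCount; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].tree;
    return NULL;
}
```

Because the table only holds pointers, no duplication of the sub-tree is needed; the tree stays owned by the parser.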

Antlr2-C++ target patching?


I'm forced to use ANTLR v2 because ANTLR 3 does not support C++! The C target encloses the generated lexer/parser code in 'extern "C"', which prevents using C++ constructs (like templates or constructs from other C++ libraries) inside the parser.

I found that the last version of antlr2 (2.7.7) requires a patch because of missing headers. Where do I have to post this patch?

Oliver Kowalke

You should not be using C++ inside the actions. Create a helper class and call into it. The headers are extern "C"; just compile the generated C as C++ and you are fine. I doubt that anyone will be patching the 2.7.7 C++ target.
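The usual shape of such a helper, sketched with a hypothetical function name: the shared header is wrapped in extern "C" guards so the generated C parser and a C++ implementation file agree on the linkage. The trivial body below stands in for the real C++ implementation so the sketch is self-contained:

```c
/* helper.h (inlined here): usable from both the generated C parser and a
 * C++ translation unit. helperCountSeparators is a hypothetical helper,
 * not part of any ANTLR API. */
#ifdef __cplusplus
extern "C" {
#endif

int helperCountSeparators(const char *tokenText, int len);

#ifdef __cplusplus
}
#endif

/* In the real project this body would live in helper.cpp and be free to
 * use templates, the STL, and other C++ libraries, none of which leak
 * into the generated parser code. */
int helperCountSeparators(const char *tokenText, int len)
{
    int count = 0;
    for (int i = 0; i < len; i++)
        if (tokenText[i] == ',')
            count++;
    return count;
}
```

The grammar action then calls helperCountSeparators() like any C function, while the C++ lives entirely in the helper's own translation unit.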

Uninitialised global scope struct instance in C code


I am using Antlr Version 3.2 and C libantlr3c-3.2. I have a grammar "Foo". In the grammar I have a global scope declared:

scope ParserGlobals {
  int x;
}

and a rule that initialises it before descending further into the parse tree

ruleA returns [int result]
@init  {
  $ParserGlobals::x = 0;
}
  : ruleB { $result = $ParserGlobals::x; }
  ;

The code that is generated by the C target is comprised (with other code omitted for brevity) of the following

static String ruleA(pFooParser ctx)  {
  (SCOPE_TOP(ParserGlobals))->x= 0;

in it's post processed form (by gcc -e) 

static String ruleA(pFooParser ctx)  {
  (ctx->pFooParser_ParserGlobalsTop)->x= 0;

However for some reason ctx->pFooParser_ParserGlobalsTop is null. I changed the generated code to the following

if ((SCOPE_TOP(ParserGlobals)) == NULL)  {
  printf("ParserGlobals stack is not initialised");
}

And the 'if' is evaluated to true.

In the "constructor" for the parser, I noticed

ctx->pFooParser_ParserGlobalsTop      = NULL;

However as far as I can see there is no code in the generated source that changes the value of pFooParser_ParserGlobalsTop before my @init code accesses it (thus causing a seg fault).

I noticed that someone else had a similar problem with local scopes (http://www.antlr.org/pipermail/antlr-interest/2008-April/027524.html) However I'm not sure why a global scope in a grammar is generating code that is seg faulting due to the scope not being initialised.

Kieran Simpson

If the scope is null, it means (as per the comments and antlr.markmail.org) that you have a path into your rule that does not go through the rule where the scope lives. Put a breakpoint in the if condition and it will tell you which path it is.

Kieran Simpson: I see where I went wrong. I needed to include the scope in the rule, e.g.:

ruleA returns [int result]
scope ParserGlobals;
@init  {
  $ParserGlobals::x = 0;
}
  : ruleB { $result = $ParserGlobals::x; }
  ;

 Thanks for the pointer.

Is there string template support for C runtime?


Jeffrey Newman

StringTemplate is an object oriented system and cannot be reproduced in C. 

IDs and keywords


I have just started a project to convert our current way of processing the file generated by our program into a more elegant one by using a parser generator. ANTLR has so far proven to be quite powerful, but I think I have hit a bit of a wall. Here is an extract of my grammar:

grammar MFL;

options {
	language = C;
}

model
	: 'MODEL' ID 'ASSOC' cpa_vars id1=ID? id2=ID? ';'
	;

cpa_vars returns [long var]
	:   'CPA'   {$var = JVCPA;}
	|   'PSAT'  {$var = JVSVP;}
	;


The rule model should match things like the following, but there are situations where it does not work as I expected:

example: this matches OK


but says that <missing ID> where cpa is.


Debugging the code, I understand that the lexer has assigned the literal's token type from the cpa_vars rule instead of the more generic ID token type. My question is: how do I make sure I match ID instead of the 'CPA' of another rule in this case?

The configuration file I am trying to parse follows a structure where, depending on the place the tokens appear, they are considered actual keywords or else just general identifiers.

Nuno Pedrosa

This is well covered in this forum, so use the search engine. But:

1) Do not use 'LITERALS', create real tokens;

2) Use a rule id instead of the token ID

3) The id rule has ID and all the keywords as alts, if producing AST, then change type to ID;

4) Where this introduces ambiguity, use a one token predicate, or explicit k=1;

Assuming AST...

id:  ID
  |  CPA    -> ID[$CPA]
  |  ASSOC  -> ID[$ASSOC]
  // ... etc.
  ;

model: MODEL id ASSOC cpa_vars id? id? SEMI ;

CPA  : 'CPA';


#defines in C target


I am building a parser for a C target, and am running into issues with the generated #defines from the lexer tokens. Essentially, the lexer names that were chosen in some cases conflict with other #defines in the code (either from our own code, or from standard libraries like Windows.h). While I realize that I could just rename the lexer rules to remove the conflict, this makes the grammar file less readable, and at this point is a considerable amount of work (our language has hundreds of keywords). Is it possible to tell the lexer generator to prefix/postfix all of these defines to make them more unique, while preserving the nice, readable grammar definition file?

Justin Murray

I had wanted to do that, but a limitation in the code generation templates (which I no longer remember) meant that I could not. You can suffix your token names with _ as that is least intrusive. You need only do that for the tokens that clash, and you will find it isn't as much work as you think using Visual Studio; you will not notice it. The other way I do it is to prefix with K, such as KNULL, KTRUE and so on.

antlr-c and llvm


In my main.cpp if I include <llvm-c/Core.h> at the very top, everything compiles nicely.

However, if I include it in the @includes section of my tree grammar, or for that matter anywhere after I include the header for my tree walker, then I get a ton of errors.  Note: including <antlr3treeparser.h> does not appear to cause these issues.

Here are some things it's choking on in the LLVM code (stripped down for readability):

namespace llvm {

template<typename ValueTy> class StringMapEntry;

typedef StringMapEntry<Value*> ValueName;

class Twine;

class Value {
ValueName *Name; //!Expected unqualified-id before numeric constant
void setName(const Twine &Name); //!Expected ',' or '...' before numeric

Aaron Leiby

You are doing this the wrong way around. You want as little code as possible in your tree walker; you just want a set of API calls that your tree walker can make, where it passes pointers to trees and/or tokens. The API code is then in a separate C file, and it is that C file that makes the LLVM calls and has the includes you need for that. That file also has the ANTLR headers so that it knows about tree structures, etc.

Golden rule is to keep code out of the .g files. 

Recursive Tree Walking C Target


Just wondering if anyone has any tips for recursively walking a pANTLR3_BASE_TREE produced by a parser. I seem to be getting some memory issues. A snippet of my evaluate method is:

std::string ConditionTree::evaluate(pANTLR3_BASE_TREE root) {

 //Variable declarations
 std::string value1 = "";
 std::string value2 = "";

 //Firstly get the textual description from the node
 std::string nodeText = (const char*) root->getText(root)->chars;
 qDebug() << "Node Text: " << nodeText.c_str() << "\n";

 //Get the nodes children and check to make sure there are some
 pANTLR3_VECTOR children = root->children;
 qDebug() << "Got the children";

 if(children != NULL) {
  int count = children->count;
  qDebug() << "Number of Children: " << count;

   pANTLR3_BASE_TREE c1 = (pANTLR3_BASE_TREE) children->get(children,0);
   pANTLR3_BASE_TREE c2 = (pANTLR3_BASE_TREE) children->get(children,1);
   value1 = evaluate(c1);
   value2 = evaluate(c2);


Thomas Davis

I would not use the getText() method if I were you, as that is just a convenience I created for when you are hacking something up. Look at what that method does and copy it to produce your c_str directly.

Next, you are assuming that there are always 2 children, but ANTLR produces a tree which has 0..n nodes.

So, look at the function makeDot() in antlr3basetreeadaptor.c and copy it. This code does in fact use getText(), but it does so because I don't know how long you will want to keep the strings around so they are tracked by the string factory and released when you free the tree.
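The general shape of such a walk, on a self-contained stand-in for pANTLR3_BASE_TREE (the Node struct and its fields are invented for illustration): loop over however many children a node actually has, instead of assuming a pair, as the snippet above does:

```c
#define MAX_CHILDREN 8   /* illustrative; ANTLR trees use a growable vector */

typedef struct Node Node;

struct Node {
    int   value;                    /* payload; a token in real code    */
    int   childCount;               /* cf. tree->getChildCount(tree)    */
    Node *children[MAX_CHILDREN];   /* cf. tree->getChild(tree, i)      */
};

/* Recursively fold the tree, visiting 0..n children per node, so leaves
 * (childCount == 0) and wide nodes are both handled safely. */
static int sumTree(const Node *root)
{
    int total = root->value;
    for (int i = 0; i < root->childCount; i++)
        total += sumTree(root->children[i]);
    return total;
}
```

The fix for the snippet above is the same loop: check the child count before indexing, rather than fetching children 0 and 1 unconditionally.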

2nd thread for this question:

FYI, I transform the Antlr tree into my own C++ data structure for tree  walking.

Kenneth Domino

Answer: Not sure why you would need to do this; you are just adding an extra layer, and I don't see that you are getting much for your added complexity.


With this conversion, I can now do things more easily, because I don't use the Antlr C runtime data structures, which are hard for me to understand and debug.  (I still cannot understand why the target isn't just C++.)  


For a start, a C++ target will generally have more overhead as it isn't quite as close to the metal. Secondly, C++ compilers are not universally available, whereas almost every platform has a C compiler. Thirdly, many professional software companies do not allow C++ because not enough people understand it properly, and they end up with unfathomable, uncommented C++. Hence C is the basis of everything, and in this case deliberately so.


I can now add an iterator for tree walking, or change the behavior of getText(), which allocates a new copy of the string every time it is called.


Unless you install your own function, which is why all the structures use pointers to functions. But as I have said many times, getText() is really not meant for hard-core work. I also explained to you that I can't know what you want to do with the text, so if you call getText, you will get another copy. If I didn't do that, then you would manipulate what I give you, and it would become the text for the token as a byproduct of using it, which is not what you want (generally). If you want to preserve the text, don't call getText - cache it. There is even a pointer that you can use in the token structure. If I made the default be what you want, then it would be incorrect for most purposes. You also misunderstood the code, as you were looking at the code that decides whether the lexer has overridden the default text or not, but don't let that stop you commenting. Finally, if you are not changing the text, then don't copy it at all; just use the pointer to the input, which is stored in the token.

The C code is completely flexible, but it is raw C, aimed at being as fast as it can be, and does not come for free. I can't help thinking that you have done a lot more work here than you would have done if you had read through the docs or asked a few more questions.


In addition, in my tree walker I need to associate some data with each node. I could create a std::map&lt;pANTLR3_BASE_TREE, DATA *&gt;, but this was slow because of all the thousands of nodes.


Yes, because you are performing thousands of new() calls - another reason I did not write this in C++. You say you don't understand why it isn't C++, but in the next breath you immediately run across one of the problems of doing that in C++, or of trying to make the runtime all-encompassing for all purposes; it deliberately isn't. Reading the comments, you would have seen that I thought of all that, and that is why there is a void * that you can use for anything you like.


Alternatively, I could have tried to modify the default node type in tree construction, but I could not find an example to make my life easier, and I am not motivated enough to read and understand "newPoolTree (pANTLR3_ARBORETUM factory)" in antlr3commontree.c.


If you are not going to read the code and comments and doxygen, you won't see that there are fields in  the default node that are specifically reserved for holding data. They are also documented in the doxygen docs.


void *  u 

Generic void pointer allows the grammar programmer to attach any structure they like to a tree node,  in many  cases saving the need to create their own tree and tree adaptors. 


Vectors in C generated code


Question 1: I'm trying to implement a SQL parser, working from the SQLite grammar. I need to return a column type (e.g. "var char").

The problem is that there are no vectors in the ctx structure. I have not been able to find an example of using the vectors.

If I use name = ID+, then name is just stomped with "char"; the "var" is gone.

Please direct me to an example of using the C vectors: how to include them, how to access them, and how to make the generated code compile.

Jeffrey Newman

Here is an example of creating and using a vector with the C API:

// Initialize the string and vector factories
pANTLR3_STRING_FACTORY str_factory = antlr3StringFactoryNew();
pANTLR3_VECTOR_FACTORY vec_factory = antlr3VectorFactoryNew(0);

// Create the source vector; we must free this memory at some point
pANTLR3_VECTOR vector = vec_factory->newVector(vec_factory);

pANTLR3_STRING temp1 = str_factory->newStr8(str_factory, "one ");
pANTLR3_STRING temp2 = str_factory->newStr8(str_factory, "two ");

vector->add(vector, (void *)temp1, NULL);
vector->add(vector, (void *)temp2, NULL);

int i;
for (i = 0; i < vector->size(vector); i++)
{
    pANTLR3_STRING temp = (pANTLR3_STRING) vector->get(vector, i);
    ANTLR3_PRINTF(temp->chars);
    ANTLR3_PRINTF("\n");
}

Question 2:

I need to figure out where to put this code sample, and how to call it.

On page 132 of Ter's book (pdf version) 

When a rule matches elements repeatedly, translators commonly need to build a list of these elements. As a convenience, ANTLR provides the += label operator (new in v3) that automatically adds all associated elements to an ArrayList, whereas the = label operator always refers to the last element matched. The following variation of rule decl captures all identifiers into a list called ids for use by actions:


decl: type ids+=ID (',' ids+=ID)* ';' ; // ids is list of ID tokens

My rule is a little different (and simplified):

    : name+=ID+ (LPAREN size1=signed_number (COMMA size2=signed_number)?


The key point here being name += ID+

The relevant generated code looks like:

/home/jeffn/Development/antlr/trunk/edu/sqliteGui/SQLite.g:207:11: name+= ID
            	            name = (pANTLR3_COMMON_TOKEN) MATCHT(ID,
            	            if  (HASEXCEPTION())
            	                goto ruletype_nameEx;

            	            if (list_name == NULL)
            	                list_name = ctx->vectors->newVector(ctx->vectors);   <<<<<<------- Key point.
            	            list_name->add(list_name, name, NULL);


The key point here, in the line below (and marked above). is the ctx->vectors.


There is no "vectors" element in the ctx structure.



HOW DO I INIT THE ctx->vectors TO POINT TO MY NEWLY MINTED vectors FUNCTION?


(Or, more precisely in my case, since I can simply build a string and pass it back to the calling rule: how do I access the vector's elements (e.g. the individual strings of the compound type name, i.e. "var char") to build a composite (concatenated) string?)


One of my questions is why did the code generator generate code for functions it did not create?


I created a rule:

vectors: ;

This indeed put a vectors element in the ctx structure (I thought I was home free, all I would have to do was initialize with the newly minted vectors function that Stanley gave me.)

So I added an

@members {
	myVectors() {}
}

@init {
	ctx->vectors = myVectors;
}

And I found that the generated code did indeed contain the myVectors routine, but nothing was generated to update the ctx->vectors element. Clearly my idea of an @init outside of a rule being executed was wrong.


My options are:

options {
	language = C;
	k = 4;
}


You are trying to use the += operator but you are not producing an AST (the output=AST; option). These operators are ONLY for producing ASTs (which you should really be doing anyway if you are doing anything other than something very trivial). So, if you do not want to produce an AST and want to accumulate things like that, then you need to write your own code (though you could use the vectors and hashtables, of course). The general rule is to take all such code out of actions so that you call an API that builds lists - this means your action code is just the C/C++ code that is required to pass the structures of interest to your library routines.

You could add the output=AST option and ignore the resulting AST too, but this would be pretty wasteful of course.

Also, if you are still learning ANTLR, then unless you don't know Java or C# at all, I would start with one of those targets as the C target has the steepest learning curve. This is not to cast aspersions on your C programming skills, it is just that it is better in general to learn about the ANTLR 'stuff' then move to the C target.


You should not specify k unless you have good reason to and you need output=AST if you want to use the vectors, but then you should really use an AST and not try to perform things in the parser as the next problem you will have is tracking things that you malloc when a syntax error is found.
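Putting the answer together, a minimal sketch of a grammar header that makes += labels legal in the C target (the decl rule is the book example quoted above; the option set is an assumption based on this answer, not the OP's grammar):

```
options {
    language     = C;
    output       = AST;
    ASTLabelType = pANTLR3_BASE_TREE;
}

decl : type ids+=ID (',' ids+=ID)* ';' ;   // ids is now a list you can walk
```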


Missing MATCHRANGE macro

I'm using the ANTLR v3 C runtime and found this macro is missing from the generated *Parser.c and *Parser.h. This macro can be found in the *Lexer.c file. Is this a bug?

My tool version is 3.4 and C rt is libantlr3c-3.4-beta4 , following is the errored code fragment of PLSQLParser.c:

switch (alt17)
{
    case 1:
        // PLSQL.g:91:4: '0' .. '9' ( '0' .. '9' )*

This is my grammar file

          root_0 = (pANTLR3_BASE_TREE)(ADAPTOR->nilNode(ADAPTOR));

          MATCHRANGE('0', '9');                  // <= This is the missed macro


I think I've found the reason. I've written a rule as follows:

    :    '0'..'9' ('0'..'9')*
    |    ('0'..'9')* '.' '0'..'9' ('0'..'9')*
    ;

which will be translated to MATCHRANGE in the Parser; if I change it to this:

    :    INT
    |    FLOAT
    ;

INT   :    '0'..'9' ('0'..'9')* ;
FLOAT :    ('0'..'9')* '.' '0'..'9' ('0'..'9')* ;

Answer 2: 

Better to do this:

fragment FLOAT;
INT : '0'..'9'+ ( '.' '0'..'9'+ { $type = FLOAT; } | ) ;

Answer 3: 

Answer 2 REQUIRES at least one digit to the left of the decimal place on FLOAT, which is not what the OP had, but it is easily fixed, I believe, as:

FLOAT : '.' '0'..'9'+ ;
INT : '0'..'9'+ ( '.' '0'..'9'+ { $type = FLOAT; } )? ;

(Note that I also replaced the empty alternative with the `?` meta-operator. I think the meta-operator is stylistically clearer, but maybe there is some other reason not to use it?)

Problem with splitting grammars


I have one big grammar that needs to be split to reduce compile time (C source). For ANTLR 3.1.3 there are two problems:

1. The generated source code may change every time I run ANTLR even if I don't change the grammar

http://paste.pocoo.org/show/428505/ shows the difference.

2. The generated comment probably shouldn't include the date. I use sed to remove the comment, but I don't think that should be necessary: ANTLR shouldn't generate different source code if I don't change the grammar. The date information should belong to the properties of the generated files, not the content of those files.

If your grammar has not changed, then your build process should not generate it. However, I think that I may have decided to take out the date sometime after 3.1.3.

If you need to generate smaller C files, then split the grammar up and use the import x.g, y.g, a.g; functionality.

You might also try 3.4-beta4.
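A sketch of the import mechanism the answer refers to (grammar names here are made up; in v3 composite grammars the import statement names grammars and ANTLR locates the matching .g files):

```
parser grammar BigParser;

options {
    language = C;
}

import ExprRules, StmtRules;   // rules from ExprRules.g and StmtRules.g
```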

SKIP() vs skip() in 'C' runtime


Where is the code for SKIP() found in the 'C' runtime? I had SKIP() in my C code version of the parser then I had to move to Java to find some bugs in my grammar. There I had to change SKIP() to skip(). Now I am going back to 'C' but I would like to change the 'C' runtime so that it will accept the lowercase skip().

Alan Condit



It is a macro defined in the generated code; all you need do is:

#define skip() SKIP()

in an @section that follows the macro definition of SKIP.
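For instance, something like the following could work in a combined grammar (a sketch only; it assumes the @lexer::members section is emitted after the generated SKIP definition, as the answer requires - verify against your generated lexer):

```
@lexer::members {
#define skip() SKIP()
}

WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
```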


Where to get the source from? ("When building, I've obtained most of the libraries from the FishEye source website as mentioned in the downloads page of antlr.org, except for "antlrconfig.h".")



Download the source tar from the antlr web site. The tarball is a 'standard' ./configure based build and you run that to create the Makefile. Please read the API docs linked on the front page.

Best practices advice:



Avoid backtracking like the plague if you need performance. But if you are careful in the order of your alts and use it on just a few decisions/rules, then it might not be so bad (but remember that your error messages will be weak).

If k=1 on a decision then ANTLR will work that out, so you don't need to specify it; but if you want to stop ANTLR following every possible alt then you can use k=1 on a particular rule or sub-rule to avoid ambiguity errors. Basically, if you know that a decision will be correct at k=1 even though ANTLR can see ambiguities, then tell it so. Before 3.4 this would still give a warning unless you added a 1-token predicate, but I believe this is changed for 3.4.
e.g. In order to turn off a warning, you would have to do this:

fred: FRED ((WAS)=>WAS)? var;

Now you can use k=1 to turn that warning off (but experiment on small grammars to get it right).
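For example, a sketch of the rule-level option doing that (rule and token names are hypothetical):

```
fred
options { k = 1; }
    : FRED (WAS)? var
    ;
```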

Tips for getting started with C target:


1. What are the following for:

Suffix Grammar should contain...
.g3l A lexer grammar specification only.
.g3p A parser grammar specification only.
.g3pl A combined lexer and parser specification.
.g3t A tree grammar specification.


It is for cases where you need to know what is in the grammar file
because your build tool isn't able to work it out. For instance GNU make
can ask antlr what the output files are, others cannot. Hence it accepts
those suffixes for build rules.

2. Download 3.1.3 or 3.1.4 snapshot from the ANTLR download page.

3. If using Linux, learn to use valgrind (takes about 20 minutes) and kcachegrind.

More tips on getting started with C target:



Starting at the C API documentation linked from the ANTLR front page, select ANTLR3 C Usage Guide then Using the ANTLR3 C Target, then implementing Customized Methods, where you find the following text:

Implementing Customized Methods
Unless you wish to create your own tree structures using the built in ANTLR AST rewriting notation, you will rarely need to override the default implementation of runtime methods. The exception to this will be the syntax error reporting method, which is essentially a stub function that you will usually want to provide your own implementation for. You should consider the built in function displayRecognitionError() as an example of where to start, as there can be no really useful generic error message display.

Selecting that shows you the documentation for that function. This is what you should override. Start by copying the example version in the runtime, then adapt it to your own needs. It is pretty easy.

In general all the names of methods and routines in the C runtime reflect the Java names so that you can find Java examples and then see the differences with C by looking for the C examples or searching the List or reading the code. You will find it very helpful to single step your parser in the C debugger for instance.

Also, searching from the Support page of antlr.org:


More guidelines for getting started :



1. If you get a problem as shown below in your parser.h file:

typedef struct TPQParser_statements_return_struct
{
   /** Generic return elements for ANTLR3 rules that are not in tree
    *  parsers or returning trees
    */
   pANTLR3_COMMON_TOKEN    start;
   pANTLR3_COMMON_TOKEN    stop;


You need to set the option:

    ASTLabelType = pANTLR3_BASE_TREE;

in your grammar files. It should default via the string templates, but does not seem to in some places.

2. Probably just OS X related, but when I run autoconf to build the runtime I get this error on a vanilla OS X Leopard:

$ autoconf
configure.ac:49: error: possibly undefined macro: AM_INIT_AUTOMAKE
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.
configure.ac:52: error: possibly undefined macro: AM_MAINTAINER_MODE
configure.ac:53: error: possibly undefined macro: AM_PROG_LIBTOOL


These are known issues with autoconf, but you don't need to run autoconf; that is used to produce the configure and makefile stubs. You need to unpack the tar.gz in the dist subdirectory (or now, the official C runtime download) and just run:

./configure


Besides, when using a different version of autoconf you first have to use autoreconf -i --force

3. Please read the API docs where it tells you how to build the runtime libraries. The link "Building from source" will show you how.

How to solve missing header file 'antlr3config.h' while compiling parser


(Well, it seems that I forgot to build the antlr runtime library.

The antlr-3.3.tar.gz download doesn't have a configure script in the runtime/C directory. When I try to generate one using autoconf or automake, it complains about missing m4 macros. Does anyone have a fix for this?

This is the gzipped tar file I am using: http://antlr.org/download/antlr-3.3.tar.gz)


You should be using the 3.4 runtime distribution; the pure source is for developers only. There is no 3.3 C runtime - it was skipped. You do not run autoconf or automake. Then read the build instructions in the API docs:
./configure --help



Please download the C runtime distribution, not the complete source tar. That tar is for people working on ANTLR, not building the runtime.

How to avoid conflict caused by config files (generated by autotools):


(The antlr3 C runtime is built with autotools. The autotools generate a header antlr3config.h; this header defines: #define PACKAGE "libantlr3c"

My project is also built with autotools; there is a header config.h, which defines: #define PACKAGE "polaris"

When I compile my code, there is an error:

In file included from /opt/antlr3c/include/antlr3defs.h:246,
                 from /opt/antlr3c/include/antlr3.h:33,
                 from PolarisAdmin.cc:13:
/opt/antlr3c/include/antlr3config.h:96:1: error: "PACKAGE" redefined

The error indicates that "PACKAGE" has been redefined.)


It's a bit of an issue mixing autotools, but after you have included the one header, #undef PACKAGE before including the next. If possible though, you should try and keep the includes separate.

I think that we can probably do that as the C runtime does not reference those defines. Ideally, you should not encounter a need to include config.h from two separate non-sub-packages.

Error while upgrading a grammar from ANTLRWorks v1.1.7 to v1.3.1.


[I have an antlr grammar that works perfectly with antlrworks v1.1.7 using the C runtime.

But since upgrading to v1.3.1, I get the following error:

[17:08:29] error(10): internal error: C:\umajin2\Umajin1.g :
java.util.NoSuchElementException: no such attribute: ASLabelType in template
context [outputFile parser genericParser(...) rule ruleBlockSingleAlt alt
rewriteCode rewriteAlt else_subtemplate rewriteElementList rewriteElement
rewriteTree rewriteElement rewriteTree rewriteElement rewriteNodeActionRoot]


I think this was a bug in the output template that was checked in when 1.3.1 was built and has since been corrected. Either expand ANTLRWorks, replace the C.stg, and repack, or build directly from source. However, using language = C in ANTLRWorks tends not to be useful (wink). I have always commented out the language= and just used remote debug. Anyway, the problem is not your grammar.

Getting started with ANTLR 3, IDEs and Maven



Prompted by a request earlier this week for a sample of how to use Antlr  grammars within IDEs and builds, I thought I would create a Maven Archetype that can create a sample Antlr project seemingly out of thin air. If you are willing to use Maven, or already do so, then this saves you a lot of work trying to configure the project from scratch.

There is now a Wiki page on getting started with Maven builds at:


and this shows you how to create a template Maven project that you can then open with NetBeans, Eclipse or any other IDE that supports Maven projects. The template project is a real lexer, parser (with imports) and tree parser, with a driver program that scans directories for files it can parse. It will also produce a dot specification for the AST it produces and turn that spec into a .png file, assuming you have installed the Graphviz application. Basically it is designed to answer a lot of the FAQ-type stuff that happens with the Maven plugin and ANTLR/Java in general.

C Target AST Debug Compile Error


Why am I getting the following compile errors:

SimpleCWalker.c: In function ‘SimpleCWalker_Ctx_struct* ...’:

SimpleCWalker.c:358: error: ‘struct ANTLR3_COMMON_TREE_NODE_STREAM_struct’ has no member named ‘tokenSource’

SimpleCWalker.c:360: error: ‘struct ANTLR3_TREE_PARSER_struct’ has no member named ‘setDebugListener’

I modified the C Target Makefile in order to get the samples to build a binary under Mac OS X.   

William H. Schultz

There are no changes you need to make to the Makefile for MAC. Perhaps you did not configure/build correctly?

 Have you:

a) Downloaded the latest ANTLR and C runtime?

b) Downloaded the latest samples?

c) Run ./configure in the C runtime directory that is created when you untar the C runtime?

d) Run make all after ./configure?

e) Run sudo make install?

f) Compiled as:

gcc -o sample *.c -I. -I/usr/local/include

(assuming that you did a make install to /usr/local)

Please check the online documentation on building the runtime and so on. Take the Runtime API link from the home page.

Changes to C runtime for 3.4


Jim Idle

posted on 062411:

Please note that the documentation for the C runtime in 3.4 is yet to be updated. In the meantime, if you wish to try it, then there is one change that you need to be aware of:

1) The distinction between ASCII and UCS2 input streams is now removed and there is a single function, antlr3FileStreamNew(), to replace the file related input streams, and a function antlr3StringStreamNew() to replace the memory related input streams. Prototypes and usage:

antlr3FileStreamNew(pANTLR3_UINT8 fileName, ANTLR3_UINT32 encoding)

antlr3StringStreamNew(pANTLR3_UINT8 data, ANTLR3_UINT32 encoding,
                      ANTLR3_UINT32 size, pANTLR3_UINT8 name)

fileName – path to the input file in 8 bit characters; used to call fopen()

data – pointer to the input data in any encoded form (note that I will change this to void * in the next beta/release)

size – the size of the input data (always bytes, regardless of encoding)

name – the name to use for the string stream (passed to error handlers, for instance); may be NULL

Then the encoding values are:

ANTLR3_ENC_8BIT    – 8 bit encoding (ASCII/Latin-1/etc.) (replaces the existing ASCII stream)

ANTLR3_ENC_UTF8    – UTF8 encoding (eats any BOM that may be present)

ANTLR3_ENC_UTF16   – UTF16 encoding (also handles UCS2) – byte order determined from the BOM, or machine natural without a BOM

ANTLR3_ENC_UTF16BE – UTF16 encoding (also handles UCS2), big endian, no BOM

ANTLR3_ENC_UTF16LE – UTF16 encoding (also handles UCS2), little endian, no BOM

ANTLR3_ENC_UTF32   – UTF32 encoding – byte order determined from the BOM, or machine natural without a BOM

ANTLR3_ENC_UTF32BE – UTF32 encoding – big endian, no BOM

ANTLR3_ENC_UTF32LE – UTF32 encoding – little endian, no BOM

ANTLR3_ENC_EBCDIC  – EBCDIC encoding (8 bit)

Note that EBCDIC encoding means that the input is in EBCDIC and it is not changed. The LA() method for EBCDIC encoding converts a character to ASCII before matching. Therefore the pointers to the first character of the token in the input stream remain pointing at EBCDIC and you are responsible for any conversion of the token strings if you need to convert them. 

Encoding is as per the Unicode standards and supports the full Unicode character range, and all surrogate pairs are decoded in UTF16. Note however that for performance reasons, errors in the encoding are usually ignored (for instance a valid hi surrogate that does not have a lo surrogate), but invalid sequences that are not ignored may screw up your input. You can of course override any of the LA methods and report such things as errors, should you need to do so. The purpose of LA() is to return the 32 bit integer Unicode code point for the specified character – how it does that is irrelevant to the lexer, which is just matching 32 bit numbers. This means you should not code your lexer to match surrogates, just the code points.
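A sketch of creating the new streams with these prototypes (not compiled here; the file name and buffer contents are placeholders):

```c
#include <string.h>
#include <antlr3.h>

pANTLR3_INPUT_STREAM input;

/* Read a UTF-8 encoded file */
input = antlr3FileStreamNew((pANTLR3_UINT8)"input.txt", ANTLR3_ENC_UTF8);

/* Or lex an in-memory 8 bit buffer */
const char *src = "select 1;";
input = antlr3StringStreamNew((pANTLR3_UINT8)src, ANTLR3_ENC_8BIT,
                              (ANTLR3_UINT32)strlen(src),
                              (pANTLR3_UINT8)"in-memory");
if (input == NULL) { /* stream creation failed */ }
```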

Status of libantlr3c-3.1.1.tar.gz in Hudson


Does the libantlr3c-3.1.1 tarball contain all patches to date, or is it frozen as of Oct 1, 2008?

Richard Lewis

[posted on 021209]

That contains just what it says, the released version of 3.1.1, so you want libantlr3c-3.1.2b2 if you want the latest development patches. To use that runtime though you will also need the latest ANTLR tool.

You can get all of these from the Hudson continuous build system:


Choose the project, select the latest successful build and download the build artifacts you need.

Optimizations of tree rewriting in C target


Jim Idle

[posted on 022809]

I have just checked in some changes to the C target code generation templates and the C runtime itself in order to greatly improve AST construction and rewrites. Depending on quite what you do with your grammar, this will have a  huge effect on performance.

If you want the full nitty-gritty on the changes, look at the code changes for perforce change list #s 5558 and 5559.

The changelist comment for 5559 is below. I will continue to look for optimizations and ways to reduce the system time, but getting reasonably close to the hand crafted gcc seems good enough for the moment, while I fix the remaining few runtime bugs ready for 3.1.2.

I would appreciate it if those of you that use the C target would download the latest ANTLR tool (codegen templates have changed, so you need the latest build) and runtime distribution and give this a try. Testing your own code and calling it good is never a good idea ;-) Please make sure that the build you download includes these changes, which should be build #20 for the C runtime and build #49 for the ANTLR Tool.

C target Tree rewrite optimization

There is only one optimization in this change, but it is a huge one.

The code generation templates were set up so that at the start of a rule, any rewrite streams mentioned in the rule were pre-created. However, this is a massive overhead for rules where only one or two of the streams are actually used, as we create them then free them without ever using them. This was copied from the Java templates basically. This caused literally millions of extra calls and vector allocations in the case of the GNU C parser given to me for testing with a 20,000 line program.

After this change, the following comparison is available against the gcc compiler:

Before (different machines here so use the relative difference for comparison):


real    0m0.425s

user    0m0.384s

sys     0m0.036s


real    0m1.958s

user    0m1.284s

sys     0m0.656s

After the previous optimizations for vector pooling via a factory, plus this huge win in removing redundant code, we have the following (different machine to the one above):


0.21user 0.01system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+328outputs (0major+9922minor)pagefaults 0swaps


0.37user 0.26system 0:00.64elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+0outputs (0major+130944minor)pagefaults 0swaps

The extra system time comes from the fact that although the tree rewriting is now optimal in terms of not allocating things it does not need, there is still a lot more overhead in a parser that is generated for generic use, including much more use of structures for tokens and extra copying and so on. I will continue to work on improving things where I can, but the next big improvement will come from Ter's optimization of the actual code structures we generate, including not doing things with rewrite streams that we do not need to do at all.

The second machine I used is about twice as fast CPU wise as the system that was used originally by the user that asked about this performance.

Difficulty Building C runtime on Solaris with Configure


I can't get the system specific include built with configure on a  Solaris Unix system.

I downloaded the 3.1.3 source distribution. There wasn't a configure script in the C runtime area, but there was a configure.ac, so I ran autoconf on it, which generated a configure script. Running ./configure gives me this error:

configure: error: cannot find install-sh, install.sh, or shtool in "." "./.." "./../.."

Michael Boyer

Please take some time to read through the online documentation again - specifically the instructions on building the source code. Perhaps that is what you are trying to follow, but if there is no configure script then you have downloaded the wrong version from somewhere. I don't think that putting source and binary versions of ANTLR on distributions is very useful, it seems to throw people off.

Reading those instructions all the way through once will be of enormous benefit.

The versions on the download page (linked from main download page on the home page) are the ones you want:


I just checked these distributions and the configure script is where it should be. Otherwise you need to have the autoconf authoring tools loaded and run autoreconf -ir - but that is only for maintainers. Configure works out that you don't have install and uses a pre-supplied script. By the way, you should probably install those kinds of tools anyway. I thought that Ubuntu had some "Development" package that loaded all that stuff, however I had to give up on Ubuntu as it was too unstable and out of date.

The instructions you should be reading if you are not doing so already are linked from the ANTLR home page as Runtime API Doc. Select the C link and read ANTLR3 C Runtime API and Usage Guide.


How to get access to the $channel=HIDDEN tokens from the parser, but only inside one rule.



The FAQ and advice on the mailing list talk about subclassing the ANTLR3_COMMON_TOKEN_STREAM class and rewriting the skipOffTokenChannels() function to send back every token in the stream.

When you want to do this in one rule, just get it directly starting from the current index. Say you want to look back from a particular token in a rule to see if there was a comment on channel 2 (please note that I have not compiled this, just typed it in from memory). Basically, though, you can do anything you can do in Java (more, in fact), just by looking at a Java example and realizing that the C methods have all the same names (just about), but instead of x.y(z), you use x->y(x, z).

: f=FUNCTION x y z
  {
      int sIndex;
      pANTLR3_COMMON_TOKEN tok;

      sIndex = $f->getTokenIndex($f) - 1;   // Index for first token

      // Now look back up the stream looking for tokens
      tok = INPUT->get(INPUT, sIndex);

      if (tok->getChannel(tok) == whatever) { .... }

      // and so on
  }


[Note: get returns a pointer to a TOKEN, which has a super pointer to the common token (off the top of my head), if it is returning a pANTLR3_COMMON_TOKEN already (it might be) then you don't need the super.
Also, the macros stop you having to learn too much about the internals, which can change between releases ]

Use the C runtime for cribbing the functions and also the Doxygen API docs (see API docs on antlr home page). You are well advised to read through all of this. Also use the search at: http://antlr.markmail.org/ - it is very good.

If you really want to override a function, then copy the function from the runtime as a start point and put it in your own source. Then in your grammar use:

 // Install custom error message display
    RECOGNIZER->displayRecognitionError = produceError;

This piece of code will run after all the standard methods are set up and you can install your own version of any function whatsoever. No need to alter the standard runtime.

How to convert the token positions from (line, column) pair to absolute indexes in the input stream (indexStart, indexEnd)



For CommonToken itself you can use getStartIndex() and getStopIndex().

However, in the lexer rules you are still forming some of this information, so some references are only good for fragment calls and so on. Look at the values of $start, $stop, $pos, etc.

How to avoid problems because of tokens being #defined?



The #defines are only used within the context of the include file, and in practice all you need do is stick a K in front of any TOKEN name that clashes with the system, such as FILE etc. So, make that KFILE and all is good.

Basically, all the targets do not attempt to protect you from the target itself, so for instance you can't use a parser rule called package in the Java target and so on. The problem with doing so is that it is never 100% correct anyway. Also, when I experimented with this, there was one part of the code gen that did not ask the target templates for the token name and so it all fell over. That could be fixed I am sure, but in the end, I decided that it is better to see the token names without obfuscation when debugging the generated C code.

Is getTokenType(string) function present in java, but not in C? How to achieve same functionality in C?



  • The lexer tokens are #define statements that can be found in both the generated lexer and parser .h files. Look for this heading in xxxParser.h:
    /** Symbolic definitions of all the tokens that the parser will work with.
  • The token constants are in the .h file. If you are looking in a tree, you get the tree node, then get its payload token and get the type from there. You can include the .h file and write code to create a string map if you want to use "dddd" but I think that you mean you want DDDD, which is the #define in the generated .h file.
  • I always advise to use:
    p: A ;
    A : 'a' ;

    And not

    p: 'a' ;

    There is no way for you to create a cross correlation. In the former you will get:

    #define A 5

    And the latter:

    #define T__5 5

    With the former, you can build a static table yourself:

    A, "An", "a token";

    But with the latter you can do nothing because you do not know what token number is defined. In error messages, you will have the token number and can look in your table for the string that represents it and so on.

    The C target does not tell you about the parser rules you are in, as this is only kept by the Java runtime, so you have to create a stack yourself and push the rule in actions. However, the parser rule is rarely that useful except in some error reporting situations.
    Use a tree and real tokens and it will all make sense to you.

How to achieve token rewrite in C runtime?



The token rewrite stream is not yet implemented in C.

Andy Grove: I went ahead and wrote my own TokenRewriteStream (happy to contribute it if you want it; it is very simple) and it is working fine, other than the fact that I am not getting whitespace tokens from the original token stream. (I am iterating over the 'tokens' vector in pANTLR3_COMMON_TOKEN_STREAM.)

To get a version of the token stream including whitespace, use the 3.1.3 runtime.

Question about naming tokens:


Is it possible to write the following:

int_literal: CHAR_STRING "_" kind ...

or I must specify the token UNDERSCORE in lexer grammar and write

int_literal: CHAR_STRING UNDERSCORE kind ...

Юрушкин Михаил

  • Literals are single quoted.
  • Option 2 is much better, as then the lexer rules will be explicit and you won't get confused over ambiguities.

How to get tokens to come out as (char *) types?


Sunil Sawkar

When you want to use the text of the token, use the pointers in the token (start and end, not start and length) and the knowledge of the input encoding, and create the C string directly. The $text is just a convenience method in the C target - you should use your own methods when doing something non-trivial.

Also, remember to only call external helper methods from your parsers/tree walkers. Do not embed any code other than the calling code, and pass the whole tree or token pointer. This means your calls won't care what gets done by the helper API, and the helper API will not care how the parsers decided to call it. Anything else is an unmaintainable mess.

Literals, Predicates and Actions


I am working on translating a long ACCENT grammar to ANTLR under the C target, and I have a few questions about the use of literals to define tokens in combined lexer-parser grammars. I understand that literals in parser rules create implicit lexer rules, and I find this to be a very useful feature for naturalistic languages that have a set of keywords of notable size which can increase frequently, and include frequent alternatives.

Can I somehow apply global predicates and actions to the implicit rules generated for literals. For example, I know I can write:


LCURLY : '{' { theCompiler->BraceLevel++; } ;

block : LCURLY (statement)* RCURLY ;

But I'd really like to be able to write:

@literals {
    if (strcmp(GETTEXT()->chars, "{") == 0)
        theCompiler->BraceLevel++;
}

block : '{' (statement)* '}' ;

Julian Mensch

No, you cannot do this. I strongly advise that you do not use literals in your grammar. While at first it seems more intuitive and is perfectly fine for simple grammars, as soon as you want to provide good error  messages, or walk a tree, you will find that they get in the way. You won't know what T42 actually is, and it will even change names when you add and change literals.

Question 2:   Predicates for literals would also be really useful, in the case, for example, where you have a limited set of keywords that are universal to the language, but your ever-expanding larger set is only valid in some lexical circumstances. For example:


  { isUniversalKeyword(GETTEXT()->chars) || inFullKeywordMode }?


Don't use these macros directly, use $text otherwise you will be subject to the vagaries of me changing my mind ;-)

Question 3:  I know there's no such thing as the "@literals" construct I'm showing here, but I'm wondering if there's any way to duplicate the effect I'm going for with it.


Well, it is just re-inventing the wheel really. I understand where you are coming from, but if you go with what the tool does now, you will soon find it all second nature.

Question 4: Currently I'm matching all keywords as IDENT and using string tables, setType() and tokens with 'fragment' to handle keywords.

Answer thread:

Use $type where you can of course. I tend not to use the ident method, I just use an identifier rule that allows the keywords. Which approach is best depends on preference and circumstance of course.

Question: I don't understand what you're talking about here. You mean you just use a single IDENT token and check for keywords with syntactic predicates that test token attributes? That sounds like it would make the grammar really hard to read -- unless, I guess, you make a /parser/ rule for every keyword, which could then be parse-context sensitive. I never thought of that till now.

Yes, this is what I mean, like this:

IF   : 'if';
THEN : 'then';

if : IF ident THEN ident;

ident
   : ID
   | IF         { /* you could change the token type to ID here */ }
   | THEN
   ;

You have to be a little careful in some places of course, and so I can end up with some composites like:

ident
 : ID
 | keywords
 ;

keywords
 : exprKeywords
 | statementKeywords
 ;
Though the division above is mostly for effect :-)

3.1.4 + C runtime: Imaginary tokens lost their names?


My imaginary tokens aren't getting printable names now unless I explicitly set them via a rule like this:

	: [...whatever...]
	| lc='struct' IDENTIFIER -> ^(STRUCT[$lc, "STRUCT"] IDENTIFIER)

If in my calling program I do this:


printf("%s", (const char *)TheAST.tree->toStringTree(TheAST.tree)->chars);

I get the following:

( ( int) (DIRECT_DECLARATOR abc (PARAM_TYPE_LIST ( ( ( int) (DIRECT_DECLARATOR nothing))))) (COMPOUND_STATEMENT (RETURN 1))) ( ( int) (DIRECT_DECLARATOR main (PARAM_TYPE_LIST ( ( ( int) (DIRECT_DECLARATOR dummy))))) (COMPOUND_STATEMENT ( ( int) ( (DIRECT_DECLARATOR var))) (FUNCTION_CALL call1 1.2 0) (FUNCTION_CALL CreateThread thread1) (FUNCTION_CALL CreateThread thread2) (FUNCTION_CALL StartMultitasking) (RETURN 0)))

Every one of the "( "'s up there is an imaginary token that I can't do the "$lc" thing to because they derive from nonterminals. This is using "ANTLR Parser Generator Version 3.1.4-SNAPSHOT Mar 31, 2009 24:26:37" and the latest 3.1.3 C runtime.

Any idea what's going wrong?

Gary R. Van Sickle

Basically, this is about the only case where setting the text of an imaginary token to its name by default is useful. I did not want the overhead of initializing tokens when they are not used, and I changed the way that text is stored for tokens. I can probably make such tokens use a fixed pointer into the token names array; the fact that I did not is probably an oversight.

Is there any way to get the char index of a token in the parser?

It seems the charPosition() function from the token struct returns the char position in the line, not the offset from the start of the input stream. So I am wondering: is there any way to get the char index of a token, measured from the start of the input stream, in the parser?



As you mention a struct, I presume you mean from C. In the struct, the field start gives a pointer to the start of the token's text in the input stream and the field stop gives a pointer to its last character. You can get these using getStartIndex() and getStopIndex(). You will also see that it keeps track of the start of the line in which the token resides (using '\n' by default), so you can use this in conjunction with getCharPositionInLine() for error messages etc. You can override the character used to detect end of line (input stream) and can of course install your own functions to provide such information.

How can I emit multiple tokens in a lexer rule, using the C target?


There only seems to be documentation on how to do this using targets with object oriented languages, which merely subclass the `emit` method and wrap a buffer around it. This becomes a problem for the C target of course because A. there are no virtual methods, and B. there is no subclassing, and C. there is no (at least that I can find) `emit` method.

Is there a way of accomplishing this with the C target?

Billy O'Neal

A) Yes there are, you just install your own pointer;

B) Yes there is, you just wrap the existing structure in your own, but there are user-definable fields which make things easier than trying to subclass;

C) Yes there is:

lexer->emit(lexer);					/* Assemble the token and emit it to the stream */

How to make a case-insensitive token checker for the lexical analysis part of ANTLR in a C program?


Question 1: I want to convert the token stream to upper case at the time the lexical rules are checked. How do I do this in a C program?


See the wiki FAQ article:


However, the built-in conversion is only ASCII-friendly, so you would have to write your own UTF8 version. Then you install the pointer to your function in place of the default one (look at the source code for setUcaseLA()) in your input stream.

Question 2: Are there any open source C projects using antlr3?


Most of the C-based projects are not open source, but there are a lot of SQL parsers using the C engine and a lot of other proprietary products.