How can Antlr Parser actions know the file names and line numbers from a C preprocessed file?

...

The ANTLR3 C runtime returns incorrect values of CharPositionInLine for tokens on the first line of input. This is eventually propagated to tokens. I wrote a small program to demonstrate this behavior; it is available here: http://devel-www.cyber.cz/files/tmp/antlrc3-bug-pack.zip

The bug makes token stream post-processing a little more complicated than it ideally should be ...

...

Why have you got (segment+)? And you are discarding your line number and rewriting in the subrule. 

Try this first:

Code Block
 line
   : line_number? segment* K_NEWLINE

        ->^(STMT line_number? segment*)

   | oword_stmt

        ->^(STMT oword_stmt)
;

The problem is that you're telling me that the cardinality of segment is + but it is in fact *. I am pretty sure that this will work then.

Final decision on value initialization in C target

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:61+mid:etopx7invqgr4kfu+state:results

JimIdle

After some wrestling with the templates I have found a way to preserve the tree rewriting semantics but allow a more C like behavior for grammars whereby if the grammar programmer does not initialize a return value from a rule, then it is just left as it would be in C, being uninitialized and therefore likely to be garbage.

 While this may break backwards compatibility (and so I will emphasize this in the release notes), it seems better to behave like C than Java in this respect. Because I am able to preserve the tree rewriting semantics, I feel that those few affected by this will agree that not initializing return values without being told to makes more sense in the long run.

 So, the ways to initialize parameters are:

Code Block
r returns [enum FRED = FRED_VAL1]
 : ... ; 

And the initialization will be generated for you. Or you can place initialization in the @init section of a rule, or otherwise initialize via actions. Note all the same rules apply about @after vs @finally vs exception code (see Markmail search if you don't remember seeing that email - will document for next release).

I think that should keep just about everyone happy. Please remember though that the C rules return a struct unless there is only one return parameter and no tree nodes are being generated. This limits some of the things that you can declare as return types. Also, because there are some limitations in the generic parsing of the return element specs, it is quite often desirable to make a typedef of a complex declaration. Finally, passing things around in a parser instead of waiting for a tree parse is not generally a good idea anyway, because of the complications of freeing things if you hit a parsing error.

Why does the following grammar generate NoViableAltException in ANTLRWorks ?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:61+mid:hei4jogrrkzzuu2n+state:results

1) Consider the following grammar:

Code Block
      grammar schema;

       options
       {
               language = C;
       }

       root : letter* ;
       letter : A | B ;
       other : C;

       A       :       'a';
       B       :       'b';
       C       :       'c';

If you run it on the input string "abc" in ANTLRWorks it generates a NoViableAltException (as I would expect), but using the C Runtime to parse a 'root' it passes successfully. 

Michael Coupland

Answer:

Code Block
 root: letter* EOF;

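There are no exceptions in C, so the top rule can only set flags. Without an explicit EOF, root is free to match "ab" and return without complaining about the trailing "c".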
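The effect of the trailing EOF can be sketched with a tiny hand-written recognizer. This is only an illustration under stated assumptions: MiniParser, match_letter, root_without_eof, root_with_eof and count_errors are hypothetical names, not generated code or runtime API, and a NUL byte stands in for EOF.

```c
#include <stddef.h>

/* Hypothetical hand-written recognizer state (not the ANTLR3 runtime). */
typedef struct {
    const char *input;   /* NUL-terminated input; NUL plays the role of EOF */
    size_t      pos;     /* index of the next unread character */
    int         errors;  /* number of syntax errors flagged so far */
} MiniParser;

/* letter : A | B ;  returns 1 when a letter was consumed */
int match_letter(MiniParser *p)
{
    char c = p->input[p->pos];
    if (c == 'a' || c == 'b') {
        p->pos++;
        return 1;
    }
    return 0;
}

/* root : letter* ;  stops quietly at the first non-letter */
void root_without_eof(MiniParser *p)
{
    while (match_letter(p)) {
        /* keep consuming letters */
    }
}

/* root : letter* EOF ;  flags an error if any input remains */
void root_with_eof(MiniParser *p)
{
    root_without_eof(p);
    if (p->input[p->pos] != '\0') {
        p->errors++;     /* no exception: the failure is only recorded */
    }
}

/* Convenience wrapper: number of errors flagged for one input string */
int count_errors(const char *text, int require_eof)
{
    MiniParser p = { text, 0, 0 };
    if (require_eof) {
        root_with_eof(&p);
    } else {
        root_without_eof(&p);
    }
    return p.errors;
}
```

With this sketch, count_errors("abc", 0) reports no errors because the loop simply stops at 'c', while count_errors("abc", 1) flags one. The caller has to inspect the error count after the top rule returns, since nothing is thrown.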

Question 2:

The C Target generates many structs with members called "free" which, while not technically a reserved word, isn't an ideal choice for an identifier name. There are codebases where free is #defined to be something else, which can lead to problems in the generated code that uses 'free' as a normal identifier. I haven't yet looked into modifying the C target to solve this locally, which doesn't seem like a huge task, but it would be nice if the default behavior were to use some other less-overloaded identifier.

Answer thread:

Maybe, but as free is a function in every C runtime that I know of,  #defining it in a system header file would break a lot more than the  ANTLR runtime. Which system are you thinking of that #defines free? The trade off is the use of an intuitive method name vs something like 'release' or 'close'.

Michael:

It's definitely a rarity, and something that you have to be very careful about, but many performance-sensitive codebases do ugly/sneaky things to hijack control of memory allocations throughout the system. I know of at least one high-profile commercial game engine that #defines free to be something else, and the MySQL SDK does this as well. I understand the argument for intuitive names, but wouldn't something like freeObject be equally intuitive, and less likely to collide?

Answer:

Yeah - I think that one would just need to #undef free though, to be honest. I don't think that system headers ever override free, only SDKs, so you know when and where to cope. ANTLR does not use free directly; it uses ANTLR3_FREE, ANTLR3_MALLOC and ANTLR3_CALLOC, so that you can predefine those and build the runtime on systems that need something different. Really, SDKs should do the same thing.
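The macro indirection described in this answer can be sketched as follows. All names here (MYLIB_MALLOC, MYLIB_FREE, my_malloc, my_free, mylib_new_counter) are hypothetical, and the #ifndef fallback is an assumption about how such a library could be laid out, not a copy of the runtime's headers.

```c
#include <stdlib.h>

/* Hypothetical counters so the redirection can be observed. */
size_t my_alloc_calls = 0;
size_t my_free_calls  = 0;

/* Project-specific allocator that wraps the C runtime. */
void *my_malloc(size_t n) { my_alloc_calls++; return malloc(n); }
void  my_free(void *p)    { my_free_calls++;  free(p); }

/* The project predefines the macros before the library's defaults are seen. */
#define MYLIB_MALLOC(n) my_malloc(n)
#define MYLIB_FREE(p)   my_free(p)

/* Library side: fall back to the C runtime only if nothing was predefined. */
#ifndef MYLIB_MALLOC
#define MYLIB_MALLOC(n) malloc(n)
#endif
#ifndef MYLIB_FREE
#define MYLIB_FREE(p)   free(p)
#endif

/* Library code never names malloc or free directly. */
int *mylib_new_counter(void)
{
    int *c = (int *)MYLIB_MALLOC(sizeof *c);
    if (c != NULL) {
        *c = 0;
    }
    return c;
}

void mylib_delete_counter(int *c)
{
    MYLIB_FREE(c);
}
```

Because the library only ever spells the macros, a codebase that redefines free is unaffected, and the allocator can be swapped at build time.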

Question 3:

I can't seem to find documentation on how the C Target's error handling works. Clearly the documentation at http://www.antlr.org/wiki/display/ANTLR3/Error+reporting+and+recovery isn't directly relevant. 

 Answer:

 It  basically does the same thing as the other targets, but without  exceptions.

Question 3.1:

Where can I find more information about this? Is there a good way to understand how the C Target emulates the Java Target's use of exceptions, apart from reading generated code? There don't seem to be any examples that deal with custom error reporting using the C Target.

Answer:

Many past posts though:

http://markmail.org/search/list:antlr?q=C+displayRecognitionError

The docs at: http://antlr.org/api/C/index.html document displayRecognitionError which, just like in Java, is what you must override to implement your own error display. Also, I have commented that routine to death so that you can copy it and modify it to do what you need personally. Just read through the function.
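Overriding an error display routine in C usually means replacing a function pointer on the recognizer. The sketch below shows that general pattern only. MiniRecognizer, display_error and report_error are hypothetical stand-ins, and the real displayRecognitionError has a different signature that should be taken from the API docs.

```c
#include <stdio.h>

/* Hypothetical, heavily simplified stand-in for a recognizer object. */
typedef struct MiniRecognizer {
    int  error_count;
    char last_message[128];
    /* Replaceable reporting hook, similar in spirit to displayRecognitionError. */
    void (*display_error)(struct MiniRecognizer *rec, int line, const char *what);
} MiniRecognizer;

/* Default hook: print to stderr. */
void default_display_error(MiniRecognizer *rec, int line, const char *what)
{
    (void)rec;
    fprintf(stderr, "line %d: %s\n", line, what);
}

/* Custom hook: capture the message instead of printing it. */
void capture_display_error(MiniRecognizer *rec, int line, const char *what)
{
    snprintf(rec->last_message, sizeof rec->last_message, "%d:%s", line, what);
}

/* Install the default hook and clear the state. */
void mini_recognizer_init(MiniRecognizer *rec)
{
    rec->error_count = 0;
    rec->last_message[0] = '\0';
    rec->display_error = default_display_error;
}

/* Called by the parser when something does not match. */
void report_error(MiniRecognizer *rec, int line, const char *what)
{
    rec->error_count++;
    rec->display_error(rec, line, what);
}
```

Assigning rec.display_error = capture_display_error after initialization reroutes every later report through the custom routine, which is the same idea as installing your own copy of the commented default routine.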

Question 3.2:

 I had noticed displayRecognitionError, but wanted to make sure there wasn't some other error-handling mechanism that I needed to worry about as well.

Answer:

No. The advantage and disadvantage of the C runtime is that it is pretty raw, of course, and you tend to have to know a bit more about the internals than with the other targets (though not tons more). The Java and C# versions are great and in fact I always prototype with one or the other of these, but performance is an issue, so tangling with the C is worth it in the end if you need the speed.

For instance, my T-SQL parser has 1100 regression tests and the timings for parsing these and walking the tree look like this:

C

jimi(50_64)-clean: time tsqlc . >out

0.19user 0.15system 0:00.34elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k

8inputs+232outputs (0major+102770minor)pagefaults 0swaps

Java

jimi(50_64)-clean: time tsql2005j . >out

2.33user 0.08system 0:02.05elapsed 117%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+456outputs (0major+29634minor)pagefaults 0swaps

Question 4:

I was running into some problems with scope variables, and saw this thread: http://www.antlr.org/pipermail/antlr-interest/2009-March/033769.html but the link to http://antlr.org/downloads doesn't seem to work. http://www.antlr.org/hudson/job/ANTLR_Tool/lastSuccessfulBuild/ seems like a good place to get the latest development build, but I can't seem to actually find a download link anywhere?

Answer thread:

It is download rather than downloads. From hudson just click on the project. The first thing that comes up is a list of source code artifacts that you can download???

http://antlr.org/hudson/job/ANTLR%20C%20Runtime/

Michael:

http://antlr.org/download looks very useful! Is there a link to this page from the main ANTLR webpage? Most download links seem to point to http://antlr.org/download.html , which is different...

For the C Runtime, yes, it's pretty easy to find the download in Hudson, but I didn't know where to go in Hudson to get the specific file you mentioned in the other email (antlr-master-3.1.4-SNAPSHOT-completejar.jar )  - it's  not 100% clear which Hudson job it would belong to (probably ANTLR_Tool, though), and there don't appear to be any downloadable artifacts at http://antlr.org/hudson/job/ANTLR_Tool/

Answer:

I see what you are saying. You need to take the "Module Builds" link from the  last successful build page:

[7]http://antlr.org/hudson/job/ANTLR_Tool/lastSuccessfulBuild/

How to solve the following cast problem?

I am using antlr-3.1.3 with the C runtime (libantlr3c-3.1.3) and compiling the generated parser code with g++. For a rule like:

Code Block
description_list
    : (descrs += description)*
    ;

I'm getting a void* conversion error in the generated code:

SVParser.c:1895: error: invalid conversion from ‘void*’ to ‘SVParser_description_return*’

Unfortunately, compiling with gcc is not a solution. I've tried to change AST.stg in the C code generation templates, and I think this should solve my problem, but I was not able to build the entire antlr-3.1.3.jar using Maven (I think this is referred to in the build instructions as the Uber Jar).

 Is it possible to build the uber jar for the 3.1.3 release based only on antlr.org 3.1.3 source distribution?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:63+mid:ujcp265wewbxyy33+state:results

Dragos Tarcatu

Jim Idle: Actually, I think I may have fixed this just the other day with perforce change #6115. Try with the latest ANTLR snapshot from the download directory listing:

[1]http://www.antlr.org/download 

(3.1.4-SNAPSHOT) or from Hudson:

[2]http://www.antlr.org/hudson/job/ANTLR_Tool_Daily/lastSuccessfulBuild/org.antlr$antlr-master/

If you want the patch for 3.1.3 rather than 3.1.4-SNAPSHOT, then download the source distribution, install Maven, copy the patch at:

[1]http://fisheye2.atlassian.com/changelog/antlr/?cs=6115 , rebuild and you are done. If you don't want to build the source, then you can just unjar what you have, change the C.stg template and jar it back up again.
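The g++ error in the question comes from C++ rejecting the implicit void* to typed-pointer conversion that C allows. A minimal sketch of that difference follows, using hypothetical stand-in types: description_return and list_get are illustrative, not the generated SVParser code or the runtime's vector API.

```c
/* Hypothetical stand-in for a generated rule return struct. */
typedef struct {
    int start;
    int stop;
} description_return;

/* Hypothetical untyped container accessor that hands back void*. */
void *list_get(void **items, int i)
{
    return items[i];
}

description_return *get_description(void **items, int i)
{
    /* Without the explicit cast this line is valid C but an error in C++,
     * which is the kind of conversion g++ complains about. */
    return (description_return *)list_get(items, i);
}
```

Code written with the explicit cast compiles under both gcc and g++, which is presumably what the template change achieves for the generated code.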

Why is the C runtime crashing during parsing?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:63+mid:acdnnnyutifwmkcm+state:results

I generated code for my first try at antlr, and it crashes during parsing. It's erroring during parsing, and crashing while printing the error message:

antlr3commontoken.c line 346

token->tokText.text    = token->strFactory->newStr8(token->strFactory, (pANTLR3_UINT8)"<EOF>");

The strFactory pointer is not valid in my case. I have not yet found where strFactory is assigned. I can get around this by setting up the factory manually:

Code Block
pANTLR3_INPUT_STREAM input = ...
pFactoringLexer lexer = ...

pANTLR3_STRING_FACTORY stringFactory = antlr3StringFactoryNew();
lexer->pLexer->rec->state->tokSource->eofToken.strFactory = stringFactory;

pANTLR3_COMMON_TOKEN_STREAM lexTokens = antlr3CommonTokenStreamSourceNew(...);
pFactoringParser parser = ...
parser->progStart(parser); // crash here

Ben Ratzlaff

This may be a bug, already listed in the bug system, which can be browsed publicly at: [1]http://antlr.org/jira

There I think you will find that I fixed this issue, but if not then I know I do have it listed as something to fix for 3.1.2. If fixed already then you can get the latest C runtime dist by visiting Hudson at:  [2]http://www.antlr.org/hudson/job/ANTLR%20C%20Runtime/

You will likely need a latest jar for the ANTLR tool as I think that there are C codegen template changes to go with it. Again, Hudson is your friend: [3]http://www.antlr.org/hudson/job/ANTLR%20Tool/

There is probably something going on in your grammar that triggers this, but it could be a function of the fact that in the latest version of the runtime I changed from using calloc to malloc, and at the same time tokens don't necessarily have a strFactory (which is detected by the pointer being NULL). I have fixed some things in this area in the latest runtime, and you may be running across this.

You should not need to set up the factory manually as mentioned.

Problems with C runtime and output=template

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:67+mid:luwlkmjlfuqgvqu5+state:results

Richard Lewis

Question 1:

I'm trying to convert my tree grammar from Java to C and am running into an issue with code generation. I'm targeting string templates as the output type and run into the following error:

1>Translating to tree parser.

1> : error 10 : internal error: no such group file ST.stg  

1>Project : error PRJ0019: A tool returned an error code from

"Translating to tree parser."

Are string template outputs supported in the C code generator? I can't find a corresponding ST.stg in the codegen/templates/C

Answer:

The C runtime does not support stringtemplates as they are an object oriented system.

Question 2: I found out from the wiki that RewriteTokenStream is not implemented yet. Is this related to string template rewriting not being available? If so, when will this be implemented? It's a roadblock for me.

Answer:

No, it is just that the RewriteTokenStream has been in flux somewhat and is still pending changes to reduce the overhead of using it. No one has asked for it before now and I did not want to write and re-write it. It will be  implemented in 3.2 assuming that the Java version performance is corrected by then.

Question about option greedy

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:67+mid:5ted5ozdkshwpxsv+state:results

I wrote this grammar:

Code Block
startrule
	: (property comment property)*
	;
comment
	: COMMENT { printf("Comment: \%s\n", $COMMENT.text->chars); }
	;

COMMENT
 	:  '/*' ( options {greedy=false;} : . )* '*/'
 	;

property
	: TOKEN { printf("Property: \%s\n", $TOKEN.text->chars);}
	;

TOKEN
	: (ALPHA | DIGIT)+
	;

fragment DIGIT  	
	: '0'..'9'
	;

fragment ALPHA
	: 'a'..'z' | 'A'..'Z' |'@'|'.'| ' ' 
	; 

The input is:

This is a test /* with a comment */ in the middle

This is a test /* with a comment */ in the middle

This is a test /* with a comment */ in the middle

The result looks good, but some errors are printed out:

test.txt(1) : lexer error 3 :

at offset 49, near char(0XA) :

This is a test /* w

test.txt(2) : lexer error 3 :

at offset 50, near char(0XA) :

This is a test /* w

test.txt(3) : lexer error 3 :

at offset 50, near char(0XA) :

Property: This is a test 

Comment: /* with a comment */

Property:  in the middle

Property: This is a test 

Comment: /* with a comment */

Property:  in the middle

Property: This is a test 

Comment: /* with a comment */

Property:  in the middle

BTW: The line ending in this file is 0x0A.

Andreas Volz

Question 1: Could anyone explain this error and how to prevent it?

Answer:

You have not specified to the lexer what it should do with those chars (I assume that this is C from your code above):

Code Block
 NL : ('\r' | '\n')+ { $channel=HIDDEN; } ;
ANY : . { SKIP(); } ; // Always make this the very last lexer rule

Question 2:  How do I not include the '/*' and '*/' tags in the comment match?

Answer:

Off the top of my head:

Code Block
COMMENT : '/*' { $start = $pos; } ( options {greedy=false;} : .)* { EMIT(); } '*/' ;

I think it is $pos, but you might need to use GETCHARPOSITIONINLINE() rather than $pos.

Question 3: To exclude the /* and */ would something like this work? Didn't that used to work in Antlr 2?  I think that would be a very useful feature to have back.

Code Block
 COMMENT
 	:  '/*'! ( options {greedy=false;} : . )* '*/'!
 	;

Answer:

No. We have explained this many times in the past: it has to do with the performance gains achieved by not associating text with tokens unless you really need it, in which case you do so yourself.

You can also do this:

Code Block
T1 : '/*' r=FRAGRULE '*/'  { setText($r.text); } ; 

So in practice it is only a minor inconvenience, for a simpler and faster lexer :-)

Question about parse keywords and variables

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:69+mid:x4nwgbgnwvgbl5ht+state:results

I have a problem parsing keywords. In PL/SQL, for example, "EXIT" is a keyword, but some variable names also contain this keyword.

Code Block
 DECLARE
 CURSOR my_cursor IS
 ........
BEGIN
 OPEN my_cursor;
 LOOP
 ......
 exit WHEN my_cursor%NOTFOUND;
 .......
END LOOP;
CLOSE my_cursor;


:GLOBAL.exit := 'Y'; --------- (if this statement is :GLOBAL.exit123 := 'Y', no problem)
 EXECUTE_TRIGGER('WHEN-WINDOW-CLOSED');

If my .g file defines "exit" as a keyword, then when the parser reaches the variable name it always goes to the "statement" rule to match the keyword and then throws an exception. How can I let the parser know that the second "exit" is a variable (go to the varName rule), not a keyword (don't go to the "statement" rule)?

 

Code Block
statement
         :
         .......
         |
                     "EXIT"^ (expression)? (WHEN! (expression))? SEMI!
         |
                     ...........
         ;
......


varName :
        (COLON)? IDENT^ (DOLLAR IDENT)? ( DOT IDENT )?  ( DOT (IDENT | COMMIT) )? ( DOT (IDENT) )?
                {#varName.setType(VARIABLE_NAME); }
                ;

Renee Luo

When identifiers can also be keywords, you need an identifier rule, and you use that rule rather than the ID token:

Code Block
 id : ID | t=EXIT { $t.setType(ID); } .... ;
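What the id rule does to a token can be sketched in plain C. This is only a model of the idea: MiniToken, the TOK_ constants and accept_as_id are hypothetical and do not correspond to the generated parser or the runtime's token API.

```c
/* Hypothetical token types; real values come from the generated lexer. */
enum { TOK_ID = 4, TOK_EXIT = 5, TOK_SEMI = 6 };

/* Hypothetical minimal token. */
typedef struct {
    int         type;
    const char *text;
} MiniToken;

/* Models  id : ID | t=EXIT { retype t as ID } ;
 * Returns 1 if the token is usable as an identifier, 0 otherwise. */
int accept_as_id(MiniToken *t)
{
    switch (t->type) {
    case TOK_ID:
        return 1;
    case TOK_EXIT:
        t->type = TOK_ID;   /* keyword used where a name is expected */
        return 1;
    default:
        return 0;           /* anything else is not a name */
    }
}
```

The lexer still produces EXIT everywhere. Only the parser position decides whether it stays a keyword (in statement) or is retyped to ID (in id), and every further keyword that may serve as a name gets its own case.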

It can be done for all SQL keywords:

http://www.temporal-wave.com/index.php?option=com_psrrun&view=psrrun&Itemid=56

Segfault in C target on EOF error reporting

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:72+mid:4gzzcwytpeff4rqb+state:results

If there's an EOF in the grammar, the C target crashes on input with certain syntax errors. I wrote this quick fix. I hope it at least helps you pinpoint the problem.

--- libantlr3c-3.2/src/antlr3baserecognizer-orig.c 2009-12-11 23:54:59.000000000 +0100

+++ libantlr3c-3.2/src/antlr3baserecognizer.c 2011-02-03 14:39:59.942609300 +0100

@@ -2216,7 +2216,7 @@

Code Block
 	if	(text != NULL)
 	{
-		text->append8(text, (const char *)recognizer->state->tokenNames[expectedTokenType]);
+		text->append8(text, expectedTokenType == EOF ? (const char *)"EOF" : (const char *)recognizer->state->tokenNames[expectedTokenType]);
 		text->append8(text, (const char *)">");
 	}

Marco Trudel

Jim Idle: I have already fixed this I think. It is because the EOF token is trying to be duplicated or otherwise modified. However, the runtime error message routine is just an example - you are expected to implement your own that does something sensible ;-)
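If you do write your own error routine, the guard from the patch is worth keeping: an EOF token type is a sentinel and must not be used as an index into the token name table. A minimal sketch, assuming a hypothetical sentinel value and name table (MINI_TOKEN_EOF, mini_token_names and expected_token_name are illustrative, not runtime symbols):

```c
#include <stddef.h>

/* Hypothetical sentinel, deliberately far outside the name table. */
#define MINI_TOKEN_EOF 0xFFFFFFFFu

/* Hypothetical token name table, indexed by token type. */
const char *mini_token_names[] = {
    "<invalid>", "<EOR>", "<DOWN>", "<UP>", "ID", "SEMI"
};
#define MINI_TOKEN_COUNT (sizeof mini_token_names / sizeof mini_token_names[0])

/* Returns a printable name without ever indexing past the table. */
const char *expected_token_name(unsigned int type)
{
    if (type == MINI_TOKEN_EOF) {
        return "EOF";            /* sentinel: never used as an index */
    }
    if (type >= MINI_TOKEN_COUNT) {
        return "<unknown>";      /* defensive: out-of-range type */
    }
    return mini_token_names[type];
}
```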

Issue with Missing header file in dist 3.1.3: antlr3config.h

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:72+mid:52up2jxwdun6vrsb+state:results

I'm trying to learn Antlr. I'm attempting to exercise the hoisted predicates code in C . The code is failing due to a missing header file.

I've downloaded the latest runtime 3.1.3, but it does not contain the header either. Any suggestions?

Here is the error:

In /Users/jskier/Documents/libantlr3c-3.1.3/include/antlr3defs.h:217:26: error: antlr3config.h: No such file or directory

John Skier

Please read the build instructions in the API documentation.

Run:

configure, make, sudo make install

Looks to me like you did not and are trying to include the files directly from the expanded tar you downloaded. 

How to prepare the intstream for function call?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:78+mid:2b6bnu36acrp3czk+state:results

I'm using the C API to implement function calls. I've already recorded the index of the function body and created a new node stream for the function call. But I find that I have to call istream->size(istream) before calling istream->seek, or errors will be reported. Is it because the input buffer is not ready, or something else? What is the correct way to reset the index?

Mu Qiao

[Posted on 070411] Yes, the input stream is not ready. I think that this is 'fixed' in later releases. However, the idea is that I did not want to test a flag to see if the istream was ready on every call in to the stream, so you must use a call that initializes the stream, and then everything else works.
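The design choice in this answer (no readiness test on every call, so some call must fill the stream first) can be modelled with a small hypothetical stream. MiniStream, mini_size and mini_seek are illustrative names only, and the clamping in mini_seek is an assumption made to keep the sketch well defined.

```c
#include <stddef.h>

/* Hypothetical lazily filled stream (not the runtime's node stream). */
typedef struct {
    const int *source;    /* where elements come from */
    size_t     source_n;
    const int *buffer;    /* NULL until the stream has been filled */
    size_t     count;     /* number of buffered elements, 0 until filled */
    size_t     index;     /* current position */
} MiniStream;

void mini_stream_init(MiniStream *s, const int *source, size_t n)
{
    s->source   = source;
    s->source_n = n;
    s->buffer   = NULL;
    s->count    = 0;
    s->index    = 0;
}

/* One of the calls that fills the stream as a side effect. */
size_t mini_size(MiniStream *s)
{
    if (s->buffer == NULL) {
        s->buffer = s->source;
        s->count  = s->source_n;
    }
    return s->count;
}

/* seek assumes the buffer is already filled; it only clamps to count. */
void mini_seek(MiniStream *s, size_t i)
{
    s->index = (i < s->count) ? i : s->count;
}
```

In this model a seek issued before mini_size lands at position 0 because nothing is buffered yet, while the same seek after mini_size lands where requested.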

C Target: setUcaseLA bug?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:78+mid:ienywjp4bh7qljcz+state:results

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:78+mid:qzunaewgld463v5d+state:results

Brian Catlin

Question 1: Is setUcaseLA known to work in ANTLR3c-3.2? 

Answer:

Works for me, I just tried it. There must be something else going on. Are you sure you are looking at the debugger correctly? Are you tracing in to the upper case version of LA?

Question 2: At the beginning of my generated lexer routine mTokens, it switches on LA(1) trying to build a token, but LA(1) is returning the lowercase input character.  As you can see below, I call setUcaseLA immediately after the input stream was created.

Code Block
...
 
@lexer::header

{

#define     ANTLR3_INLINE_INPUT_ASCII

}
...

Answer:

Can you see your problem if I highlight the above line? ;-) You are not calling LA, as you have told the lexer to use inline input code and not call the input routines.
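Why inlined input bypasses the upper-casing can be sketched with a function pointer and two macros. This assumes, as the answer implies, that the upper-case option works by swapping the lookahead function. MiniInput, set_ucase_la, LA_CALLED and LA_INLINED are hypothetical names, not the generated macros.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical input stream with a replaceable lookahead function. */
typedef struct MiniInput {
    const char *data;
    size_t      pos;
    int (*LA)(struct MiniInput *in);
} MiniInput;

/* Lookahead that returns the character as stored. */
int la_plain(MiniInput *in)
{
    return (unsigned char)in->data[in->pos];
}

/* Lookahead that returns the upper-case version of the character. */
int la_upper(MiniInput *in)
{
    return toupper((unsigned char)in->data[in->pos]);
}

/* Similar in spirit to setUcaseLA: swaps the installed function. */
void set_ucase_la(MiniInput *in, int flag)
{
    in->LA = flag ? la_upper : la_plain;
}

/* Lookahead through the installed function: honors the swap. */
#define LA_CALLED(in)  ((in)->LA(in))

/* Inlined lookahead: reads the buffer directly and never sees the swap. */
#define LA_INLINED(in) ((int)(unsigned char)(in)->data[(in)->pos])
```

After set_ucase_la(&in, 1) on the input "select", LA_CALLED yields 'S' while LA_INLINED still yields 's', which matches the symptom described in the question.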

How to push and pop scope using antlr3 c code generator?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:80+mid:tyjc34hmlkpsp3xe+state:results

I'm new to antlr3. I want to write a simple parser and do scope pushing and popping during parsing. How can I define this in antlr3 grammar rules? Is there some way to do things like this in ANTLR grammar rules:

Code Block
topdown:
   enterBlock|enterFunction { pushScope();}

bottomup:
   exitBlock|exitFunction { popScope(); } 

Yang Yang

1. There is an example in examples-v3/C/C/C.g in the examples:

http://www.antlr.org/download/examples-v3.tar.gz

2. Just use the built in global scopes.

Using ANTLR and MySQL in C target

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:80+mid:iirdphgcadzz6y5m+state:results

I am trying to build an ANTLR grammar using the C target.  I want to store some data from the grammar into a MySQL database.  However, when I include both the mysql and antlr C target runtime libraries in my project, I get errors: 

[Please refer link above for errors]

I have used the MySQL library and the ANTLR C runtime library separately with the relevant sections of the same program and have no trouble then.  Please let me know if anyone has used these two libraries together and if so, how you can overcome the above redeclaration errors.

Dinesha Balasuriya Weragama

The MySQL headers do a lot of strange things, such as #define free. It probably means that you are trying to lump everything together and include all the headers everywhere. Try splitting this out into a supporting translation unit and just including what you need. For instance, just pass a pointer to the token text and its length, and don't include the ANTLR headers.
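A supporting translation unit of this kind could expose an interface made only of plain C types. The sketch below is an assumption about how that might look: db_bridge_store and db_bridge_last are hypothetical names, and the database call itself is left as a comment rather than guessed.

```c
#include <stddef.h>
#include <string.h>

/* --- db_bridge.h (hypothetical): plain C types only, no parser or
 *     database headers, so it is safe to include from grammar actions --- */
int         db_bridge_store(const char *text, size_t len);
const char *db_bridge_last(void);

/* --- db_bridge.c (hypothetical): the only file that would include the
 *     database client headers --- */
static char last_value[64];

/* Copies len bytes of token text; returns 0 on success, -1 on bad input. */
int db_bridge_store(const char *text, size_t len)
{
    if (text == NULL || len >= sizeof last_value) {
        return -1;
    }
    memcpy(last_value, text, len);
    last_value[len] = '\0';
    /* A real implementation would pass last_value to the database here. */
    return 0;
}

/* Returns the most recently stored value, for inspection. */
const char *db_bridge_last(void)
{
    return last_value;
}
```

The grammar actions then call db_bridge_store with the token's text pointer and length, and the two sets of conflicting headers never meet in one file.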

Backtrack + rule arguments + C target

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:81+mid:qegg5yvrq3y4iny4+state:results

I have some problems with backtrack option and rule arguments.

Example grammar:

Code Block
grammar testbt;

options
{
	language = C;
	backtrack = true;
}

expr	
	@init {
		int i = 1;
	}
	:
	num[i] | id[i]
	;

num[int i]	:	'0'..'9'+;
id[int i]	:	'a'..'z'+;

The generated files do not compile; the error is "error C2065: 'i' : undeclared identifier".

(The problem is that C target does not insert rule arguments into argument list of autogenerated 'synpred' functions.)

How can I fix this?

Anton Bychkov

Please read the documentation:

Code Block
@declarations {}
@init {}

Then search the list using antlr.markmail.org for hoisted predicates and local variables. It isn't the C target; it is just that the local variable or parameter is out of scope for the predicate, so you must use scopes if you have to use semantic predicates (with parameters).

Saving, duplicating and substituting tree nodes in the C API

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:82+mid:lx2j6z3oypetr53i+state:results

Question 1: I'm working on a language that allows assignment of functions to variables. Something like this.

Code Block
fun = execute(params);

Later in the same scope if I want to do.

Code Block
another_fun = fun

The most natural way to do this seems to me to save the tree from the first execute(params) in a symbol table and then, while generating the AST, when I see that "fun" is defined, to substitute a duplicate of its tree for the assignment to another_fun.

David Minor

That doesn't sound like the correct way to do things to be honest, the AST is just the parsed elements and you probably need to do a lot more with these than just store them? Linguistically, that looks very confusing, you should probably use a different operator than = :-)

Question 2: The trouble is the symbol table is in C world not Antlr.  It looks like I could just duplicate the node save the pointer and then re-use it, but I don't see any examples of how to do this.

Does anyone have an idea?  Even a Java example would be helpful.

Answer:

What are you trying to do/build? An interpreter? Basically, you should do as little as possible in the parser other than the basics, such as building the symbol table if that is a possibility (you know what all the types are while you are parsing, etc.). Generally, you would not want to duplicate the sub-tree for a symbol table, but you can just reuse the pointer anyway, as it is returned by the rule that parses your function. $rule.tree should do it, I think.
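Reusing the pointer rather than duplicating the sub-tree can be sketched with a very small symbol table on the C side. SymbolTable, symtab_define and symtab_lookup are hypothetical, and the tree is kept as an opaque void* so the sketch does not depend on any runtime type.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 16

/* Hypothetical symbol: a name bound to an opaque tree pointer. */
typedef struct {
    const char *name;
    void       *tree;   /* e.g. the pointer returned by the rule */
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    int    count;
} SymbolTable;

/* Binds name to tree; returns 0 on success, -1 if the table is full. */
int symtab_define(SymbolTable *t, const char *name, void *tree)
{
    if (t->count >= MAX_SYMBOLS) {
        return -1;
    }
    t->entries[t->count].name = name;
    t->entries[t->count].tree = tree;
    t->count++;
    return 0;
}

/* Returns the most recent binding for name, or NULL if undefined. */
void *symtab_lookup(const SymbolTable *t, const char *name)
{
    for (int i = t->count - 1; i >= 0; i--) {
        if (strcmp(t->entries[i].name, name) == 0) {
            return t->entries[i].tree;
        }
    }
    return NULL;
}
```

Handling another_fun = fun then amounts to symtab_define(&table, "another_fun", symtab_lookup(&table, "fun")): both names share one sub-tree, so whoever owns the tree must outlive the table.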

Antlr2-C++ target patching?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:82+mid:3762bje72frxsgoi+state:results

I'm forced to use antlrv2 because antlr3 does not support C++! The C target encloses the generated lexer/parser code in 'extern "C"'  which prevents using C++ constructs (like templates or constructs from other C++ libs) inside the parser.

I found that the last version of antlr2 (2.7.7) requires a patch because of missing headers. Where do I have to post this patch?

Oliver Kowalke

You should not be using C++ inside the actions. Create a helper class and call into it from the actions. The headers are extern "C"; then compile the generated C as C++ and you are fine. I doubt that anyone will be patching the 2.7.7 C++ target.
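One way to keep C++ out of the actions is a helper header with C linkage that the actions call. The sketch below assumes a hypothetical helper, helper_count_symbol. In a real project its implementation file would be C++ and could use classes or templates, but a plain C body stands in here so the example is self-contained.

```c
/* --- action_helpers.h (hypothetical) --- */
#ifdef __cplusplus
extern "C" {
#endif

/* Plain C entry point that grammar actions can call. */
int helper_count_symbol(const char *name);

#ifdef __cplusplus
}
#endif

/* --- action_helpers implementation (stand-in for a C++ file) --- */
static int symbols_seen = 0;

/* Counts non-empty names and returns the running total. */
int helper_count_symbol(const char *name)
{
    if (name != 0 && name[0] != '\0') {
        symbols_seen++;
    }
    return symbols_seen;
}
```

An action would then contain only a call such as helper_count_symbol(text), and everything C++-specific stays behind the C interface.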

Uninitialised global scope struct instance in C code

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:85+mid:cqwra33a73v67wlx+state:results

I am using Antlr Version 3.2 and C libantlr3c-3.2. I have a grammar "Foo". In the grammar I have a global scope declared:

Code Block
scope ParserGlobals {
  int x;
}

and a rule that initialises it before descending further into the parse tree

Code Block
ruleA returns [int result]
@init  {
  $ParserGlobals::x = 0;
}
  : ruleB { $result = $ParserGlobals::x; }
 ;

The code that is generated by the C target is comprised (with other code omitted for brevity) of the following

Code Block
static String ruleA(pFooParser ctx)  {
  (SCOPE_TOP(ParserGlobals))->x= 0;
}

in its post-processed form (by gcc -E)

Code Block
static String ruleA(pFooParser ctx)  {
  (ctx->pFooParser_ParserGlobalsTop)->x= 0;
}

However for some reason ctx->pFooParser_ParserGlobalsTop is null. I changed the generated code to the following

Code Block
if ((SCOPE_TOP(ParserGlobals)) == NULL)  {
  printf("ParserGlobals stack is not initialised");
}

And the 'if' is evaluated to true.

In the "constructor" for the parser, I noticed

Code Block
ctx->pFooParser_ParserGlobalsTop      = NULL;

However as far as I can see there is no code in the generated source that changes the value of pFooParser_ParserGlobalsTop before my @init code accesses it (thus causing a seg fault).

I noticed that someone else had a similar problem with local scopes (http://www.antlr.org/pipermail/antlr-interest/2008-April/027524.html) However I'm not sure why a global scope in a grammar is generating code that is seg faulting due to the scope not being initialised.

Kieran Simpson

If the scope is null it means (as per the comments and antlr.markmail.org) that you have a path into your rule that does not go through the rule where the scope lives. Put a breakpoint in the if condition and it will tell you which path it is.

Kieran Simpson:  I see where I went wrong.  I needed to include the scope in the rule  eg:

Code Block
ruleA returns [int result]
scope ParserGlobals;
@init  {
  $ParserGlobals::x = 0;
}
  : ruleB { $result = $ParserGlobals::x; }
 ;

 Thanks for the pointer.
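Why the top-of-scope pointer stays NULL until some rule declares the scope can be modelled with a small stack. This is a hand-written sketch, not the generated code: ScopeFrame, MiniContext, scope_push, scope_pop and scope_set_x are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical scope frame holding the scope's single attribute. */
typedef struct ScopeFrame {
    int                x;
    struct ScopeFrame *below;
} ScopeFrame;

/* Hypothetical parser context; top is NULL until a frame is pushed. */
typedef struct {
    ScopeFrame *top;
} MiniContext;

/* Models entry to a rule that declares "scope ParserGlobals;". */
int scope_push(MiniContext *ctx)
{
    ScopeFrame *f = (ScopeFrame *)malloc(sizeof *f);
    if (f == NULL) {
        return -1;
    }
    f->x     = 0;
    f->below = ctx->top;
    ctx->top = f;
    return 0;
}

/* Models exit from that rule. */
void scope_pop(MiniContext *ctx)
{
    ScopeFrame *f = ctx->top;
    if (f != NULL) {
        ctx->top = f->below;
        free(f);
    }
}

/* Guarded access: returns -1 instead of dereferencing a NULL top. */
int scope_set_x(MiniContext *ctx, int v)
{
    if (ctx->top == NULL) {
        return -1;
    }
    ctx->top->x = v;
    return 0;
}
```

A rule that merely uses $ParserGlobals::x corresponds to calling scope_set_x without a prior scope_push, which is the NULL top seen in the question. Declaring the scope on the rule corresponds to the push on entry and the pop on exit.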

Is there string template support for C runtime?

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:85+mid:lccpta4agtbqbbac+state:results

Jeffrey Newman

StringTemplate is an object oriented system and cannot be reproduced in C. 

IDs and keywords

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:87+mid:pe6hvtcy3qftvj26+state:results

I have just started a project to convert our current way of processing the file generated by our program into a more elegant way by using a parser generator. ANTLR so far has proven to be quite powerful, but I think I have hit a bit of a wall. Here is an extract of my grammar:

Code Block
grammar MFL;

options{
 language = C;
}

model:
'MODEL'  ID  'ASSOC' cpa_vars id1=ID? id2=ID? ';'
;

cpa_vars returns [long var]:
    'CPA'     {$var = JVCPA;}
    | 'PSAT'  {$var = JVSVP;}
;

ID
    :('a'..'z'|'A'..'Z'|'0'..'9'|'_')
('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|','|'+'|'.')*
;

The model rule should match input like the examples below, but there are situations where it does not work as I expected:

example: this matches OK

Code Block
MODEL mymodel ASSOC CPA BIP1 ;

but this one reports <missing ID> where cpa is:

Code Block
MODEL cpa ASSOC CPA BIP1 ;

Debugging the code, I understand that the lexer has assigned the token type of the literal in the cpa_vars rule instead of the more generic ID token type. My question is: how do I make sure I match ID instead of the 'CPA' of another rule in this case?

The configuration file I am trying to parse follows a structure in which, depending on where the tokens appear, they are either actual keywords or just general identifiers.

Nuno Pedrosa

This is well covered in this forum, so use the search engine. But:

1) Do not use 'LITERALS', create real tokens;

2) Use a rule id instead of the token ID

3) The id rule has ID and all the keywords as alts, if producing AST, then change type to ID;

4) Where this introduces ambiguity, use a one token predicate, or explicit k=1;

Assuming AST...

Code Block
id:  ID
  |  CPA    -> ID[$CPA]
  |  ASSOC  -> ID[$ASSOC]
... etc

model: MODEL id ASSOC cpa_vars id? id? SEMI ;

CPA  : 'CPA';
PSAT : 'PSAT';

etc...

#defines in C target

http://antlr.markmail.org/search/?q=%23defines+in+C+target#query:%23defines%20in%20C%20target+page:1+mid:dlx7zntuwngf6qp6+state:results

I am building a parser for a C target, and am running into issues with the generated #defines from the lexer tokens. Essentially, the lexer names that were chosen in some cases conflict with other #defines in the code (either from our own code, or from standard libraries like Windows.h). While I realize that I could just rename the lexer rules to remove the conflict, this makes the grammar file less readable, and at this point is a considerable amount of work (our language has hundreds of keywords). Is it possible to tell the lexer generator to prefix/postfix all of these defines to make them more unique, while preserving the nice, readable grammar definition file?

Justin Murray

I had wanted to do that but a limitation in the code generation templates (which I no longer remember) meant that I could not. You need to suffix your token names with _ as that is least intrusive. You need only do that for the tokens that clash and you will find it isn't as much work as you think using Visual Studio. You will not notice. The other way I do it is to prefix with K, such as KNULL, KTRUE and so on.

antlr-c and llvm

http://markmail.org/message/4dq3wfngsv2nklvd#query:+page:1+mid:lekqtywarmc2baeh+state:results

In my main.cpp if I include <llvm-c/Core.h> at the very top, everything compiles nicely.

However, if I include it in the @includes section of my tree grammar, or for that matter anywhere after I include the header for my tree walker, then I get a ton of errors.  Note: including <antlr3treeparser.h> does not appear to cause these issues.

Here's some things it's choking on in the llvm code (stripped down for readability):

Code Block
namespace llvm {

template<typename ValueTy> class StringMapEntry;

typedef StringMapEntry<Value*> ValueName;

class Twine;

class Value {
...
ValueName *Name; //!Expected unqualified-id before numeric constant
...
void setName(const Twine &Name); //!Expected ',' or '...' before numeric constant

Aaron Leiby

You are doing this the wrong way around. You want as little code as possible in your tree walker; you just want a set of API calls that your tree walker can make, where it passes pointers to trees and/or tokens. The API code is then in a separate C file, and it is that C file that makes the LLVM calls and has the includes you need for that. That file also has the antlr headers so that it knows about tree structures etc.

Golden rule is to keep code out of the .g files. 

Recursive Tree Walking C Target

http://markmail.org/message/yikh52nnnadmu7me#query:+page:1+mid:yikh52nnnadmu7me+state:results

Just wondering if anyone had any tips for recursively walking an ANTLR3_BASE_TREE produced from a parser. I seem to be getting some memory issues. A snippet of my evaluate method is:

Code Block
std::string ConditionTree::evaluate(pANTLR3_BASE_TREE root) {

 //Variable declarations
 std::string value1 = "";
 std::string value2 = "";

 //Firstly get the textual description from the node
 std::string nodeText = (const char*) root->getText(root)->chars;
 qDebug() << "Node Text: " << nodeText.c_str() << "\n";

 //Get the nodes children and check to make sure there are some
 pANTLR3_VECTOR children = root->children;
 qDebug() << "Got the chidlren";

 if(children != NULL) {
  int count = children->count;
  qDebug() << "Number of Children: " << count;

   pANTLR3_BASE_TREE c1 = (pANTLR3_BASE_TREE) children->get(children,0);
   pANTLR3_BASE_TREE c2 = (pANTLR3_BASE_TREE) children->get(children,1);
   value1 = evaluate(c1);
   value2 = evaluate(c2);

}
...

Thomas Davis

I would not use the getText() method if I were you as that is just a convenience I created if you are just hacking something up. Look at what that method does and copy it to produce your c_str directly.

Next, you are assuming that there are always 2 children, but ANTLR produces a tree which has 0..n nodes.

So, look at the function makeDot() in antlr3basetreeadaptor.c and copy it. This code does in fact use getText(), but it does so because I don't know how long you will want to keep the strings around so they are tracked by the string factory and released when you free the tree.
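The point about 0..n children can be sketched in plain C. This is only an illustration under stated assumptions: `Node`, `append` and `render` are hypothetical stand-ins invented here so the example compiles on its own; they are not the ANTLR3 runtime types, and makeDot() in the runtime remains the real reference. The walk loops over however many children a node has instead of assuming exactly two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tree node; NOT the ANTLR3 runtime type. */
typedef struct Node {
    const char   *text;      /* label of this node */
    struct Node **children;  /* NULL when the node has no children */
    size_t        count;     /* number of children, anywhere from 0 to n */
} Node;

/* Append s to the NUL-terminated string in out without exceeding cap bytes. */
void append(char *out, size_t cap, const char *s)
{
    size_t used = strlen(out);
    if (used + 1 < cap) {
        strncat(out, s, cap - used - 1);
    }
}

/* Render the tree as "(parent child child ...)"; a leaf is just its text.
 * out must already hold a NUL-terminated string (for example ""). */
void render(const Node *n, char *out, size_t cap)
{
    size_t i;

    assert(out != NULL && cap > 0);
    if (n == NULL) {
        return;
    }
    if (n->count == 0) {
        append(out, cap, n->text);
        return;
    }
    append(out, cap, "(");
    append(out, cap, n->text);
    for (i = 0; i < n->count; i++) {   /* 0..n children, not a fixed 2 */
        append(out, cap, " ");
        render(n->children[i], out, cap);
    }
    append(out, cap, ")");
}
```

A caller would start with an empty buffer, e.g. `char buf[64] = ""; render(root, buf, sizeof buf);`.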

2nd thread for this question:

FYI, I transform the Antlr tree into my own C++ data structure for tree  walking.

Kenneth Domino

Answer: Not sure why you would need to do this, you are just adding an extra layer and I don't see that you are getting much for your added complexity.

Question:

With this conversion, I can now do things more easily, because I don't use the Antlr C runtime data structures, which are hard for me to understand and debug.  (I still cannot understand why the target isn't just C++.)  

Answer:

For a start, a C++ target will generally have more overhead as it isn't quite as close to the metal. Secondly  though, C++ compilers are not universally available, whereas almost every platform has a C compiler. Thirdly,  many professional software companies do not allow C++ because not enough people understand it properly  and they end up with unfathomable, uncommented, C++. Hence C is the basis of everything and in this case  deliberately so.

Question:

I can now add an iterator for tree walking, or change the behavior of getText(), which allocates a new copy of the string every time it is called.

Answer:

Unless you install your own function, which is why all the structures use pointers to functions. But as I have said many times, getText() is really not meant for hard core work. I also explained to you that I can't know what you want to do with the text, so if you call getText, you will get another copy. If I don't do that, then you  would manipulate what I give you and it would become the text for the token as a byproduct of using it, which  is not what you want (generally). If you want to preserve the text, don't call getText - cache it. There is even a  pointer that you can use in the token structure. If I made the default be what you want, then it would be incorrect for most purposes. You also misunderstood the code as you were looking at the code that decides if  the lexer has overridden the default text or not, but don't let that stop you commenting. Finally, if you are not  changing the text, then don't copy it at all, just use the pointer to the input, which is stored in the token.

The C code is completely flexible, but it is raw C, aimed at being as fast as it can be and does not come for free. I can't help thinking that you have done a lot more work here than you would have done if you had read  through the docs or asked a few more questions even. 
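The advice about using the pointer into the input rather than copying can be sketched as follows. This is a minimal illustration, not the runtime's code: `Token`, `token_length`, `token_equals` and `token_copy` are hypothetical names, and the sketch assumes a token that records the first and last character (inclusive) of its text inside the original input buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a token; NOT the ANTLR3 runtime type.
 * Assumption: start/stop point into the original input, stop inclusive. */
typedef struct Token {
    const char *start;  /* first character of the token text */
    const char *stop;   /* last character of the token text */
} Token;

/* Length of the token text, computed from the two pointers. */
size_t token_length(const Token *t)
{
    assert(t != NULL && t->stop >= t->start);
    return (size_t)(t->stop - t->start) + 1;
}

/* Compare the token text with a C string without making any copy. */
int token_equals(const Token *t, const char *s)
{
    size_t len = token_length(t);
    return strlen(s) == len && memcmp(t->start, s, len) == 0;
}

/* Copy only when a NUL-terminated string is really needed.
 * Truncates to fit cap and returns the number of characters copied. */
size_t token_copy(const Token *t, char *dst, size_t cap)
{
    size_t len = token_length(t);
    if (cap == 0) {
        return 0;
    }
    if (len >= cap) {
        len = cap - 1;
    }
    memcpy(dst, t->start, len);
    dst[len] = '\0';
    return len;
}
```

Comparisons and lookups then work directly on the input buffer, and a copy is made once, at the point where the caller decides how long the string must live.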

Question: 

In addition, in my tree walker I need to associate some data with each node. I could create a std::map<pANTLR3_BASE_TREE, DATA *> but this was slow because of all the thousands of nodes.

Answer:

Yes, because you are performing thousands of new(), another reason I did not write this in C++. You say you  don't understand why it isn't C++ but  in the next breath, you immediately run across one of the problems of doing that in C++ or trying to make the runtime be all encompassing for all purposes; it deliberately isn't.  Reading the comments, you would have seen that I thought of all that and that is why there is a void * that you  can use for anything you like. 

Question:

Alternatively, I could have tried to modify the default node type in tree construction, but I could not find an example to make my life easier, and I am not motivated enough to read and understand "newPoolTree (pANTLR3_ARBORETUM factory)" in antlr3commontree.c.

Answer:

If you are not going to read the code and comments and doxygen, you won't see that there are fields in  the default node that are specifically reserved for holding data. They are also documented in the doxygen docs.

----

void *  u 

Generic void pointer allows the grammar programmer to attach any structure they like to a tree node,  in many  cases saving the need to create their own tree and tree adaptors. 

---
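A rough sketch of how such a per-node pointer replaces an external map is shown here. It is illustrative only: `Node`, `NodeInfo`, `node_info` and `node_info_free` are hypothetical names, and the stand-in node carries just a text field and a `void *u` slot modelled on the field described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tree node with a user-data slot;
 * NOT the ANTLR3 runtime type. */
typedef struct Node {
    const char *text;
    void       *u;     /* generic pointer owned by the grammar programmer */
} Node;

/* Hypothetical per-node data that a tree walker might want to keep. */
typedef struct NodeInfo {
    int depth;
    int visits;
} NodeInfo;

/* Return the data attached to n, allocating it (zeroed) on first use.
 * Returns NULL only if the allocation fails. */
NodeInfo *node_info(Node *n)
{
    assert(n != NULL);
    if (n->u == NULL) {
        n->u = calloc(1, sizeof(NodeInfo));
    }
    return (NodeInfo *)n->u;
}

/* Release the attached data; the node itself is left alone. */
void node_info_free(Node *n)
{
    assert(n != NULL);
    free(n->u);
    n->u = NULL;
}
```

Lookup is a single pointer dereference per node, so no map keyed on node addresses is needed; the trade-off is that the caller must free the attached data before the tree is released.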

Vectors in C generated code

http://markmail.org/message/6p2wwlrgtan2evgq#query:+page:1+mid:vbzl5saj6qskuenh+state:results

Question 1: I'm trying to implement a SQL parser.  I'm working from the SQLite grammar. I need to return a column type. (e.g. var char)

The problem is that there is no vectors element in the ctx structure. I have not been able to find an example of using the vectors.

If I use name = ID+,  then the name is just stomped with char, the var is gone.

Please direct me where I see an example of using the C vectors. How to include them, how to access them, how to make the generated code compile.

Jeffrey Newman

Here is an example on creating and using a vector with the C API:

Code Block
// Initialize vector factory
pANTLR3_STRING_FACTORY str_factory = antlr3StringFactoryNew();
pANTLR3_VECTOR_FACTORY vec_factory = antlr3VectorFactoryNew(1);
// create source vectors
pANTLR3_VECTOR vector = vec_factory->newVector(vec_factory);
// we must free this memory at some point
pANTLR3_STRING temp1 = str_factory->newStr8(str_factory, "one ");
pANTLR3_STRING temp2 = str_factory->newStr8(str_factory, "two ");
vector->add(vector, (void *)temp1, NULL);
vector->add(vector, (void *)temp2, NULL);
int i;
for(i = 0; i < vector->size(vector); i++){
  pANTLR3_STRING temp = (pANTLR3_STRING) vector->get(vector, i);
  ANTLR3_PRINTF(temp->chars);
  ANTLR3_PRINTF("\n");
}

Question 2:

I need to figure out where to put in this code sample, and how to call.

On page 132 of Ter's book (pdf version) 

When a rule matches elements repeatedly, translators commonly need to build a list of these elements. As a convenience, ANTLR provides the += label operator (new in v3) that automatically adds all associated elements to an ArrayList, whereas the = label operator always refers to the last element matched. The following variation of rule decl captures all identifiers into a list called ids for use by actions:

 

Code Block
decl: type ids+=ID (',' ids+=ID)* ';' ; // ids is list of ID tokens

My rule is a little different (and simplified):

Code Block
type_name 
    : name+=ID+ (LPAREN size1=signed_number (COMMA size2=signed_number)? RPAREN)?
      {

      }
    ;

The key point here being name += ID+

The relevant generated code looks like:

Code Block
// /home/jeffn/Development/antlr/trunk/edu/sqliteGui/SQLite.g:207:11: name+= ID
{
    name = (pANTLR3_COMMON_TOKEN) MATCHT(ID, &FOLLOW_ID_in_type_name1275);
    if  (HASEXCEPTION())
    {
        goto ruletype_nameEx;
    }

    if (list_name == NULL)
    {
        list_name=ctx->vectors->newVector(ctx->vectors); // <<<<<<------- Key point.
    }
    list_name->add(list_name, name, NULL);
}

The key point here, in the line below (and marked above), is the ctx->vectors.

list_name=ctx->vectors->newVector(ctx->vectors)

There is no "vectors" element in the ctx structure.

SO, MY FOLLOW-UP QUESTIONS ARE:

HOW DO I PROPERLY ADD THE vectors ELEMENT TO THE ctx STRUCTURE?

HOW DO I INIT ctx->vectors TO POINT TO MY NEWLY MINTED vectors function?

AND HOW DO I USE IT IN SUBSEQUENT RULES?

(Or, more precisely in my case, since I can simply build a string and pass it back to the calling rule: how do I access the vector's elements, e.g. the individual strings of the compound type name (i.e. var char), to build a composite (concatenated) string?)

--------

One of my questions is why did the code generator generate code for functions it did not create?

--------

I created a rule;

vectors: ;

This indeed put a vectors element in the ctx structure (I thought I was home free, all I would have to do was initialize with the newly minted vectors function that Stanley gave me.)

So I added an

Code Block
@members {
	myVectors() {}
}

@init {
 ctx->vectors = myVectors;
}

And I found that the generated code did indeed contain the myVectors routine, but nothing was generated to update the ctx->vectors element. Clearly my idea of an @init outside of a rule being executed was wrong.

------------

My options are:

Code Block
options {
	language = C;
	k = 4;
}

Answer:

You are trying to use the += operator but you are not producing an AST (output=AST; option). These operators are ONLY for producing ASTs (which you should really be doing anyway if you are doing anything other than something very trivial). So, if you do not want to produce an AST and want to accumulate things like that, then you need to write your own code (though you could use the vectors and hashtables, of course). The general rule is to take all such code out of actions so that you call an API that builds lists. This means your action code is just the C/C++ code that is required to pass the structures of interest to your library routines.

You could add the output=AST option and ignore the resulting AST too, but this would be pretty wasteful of course.

Also, if you are still learning ANTLR, then unless you don't know Java or C# at all, I would start with one of those targets as the C target has the steepest learning curve. This is not to cast aspersions on your C programming skills, it is just that it is better in general to learn about the ANTLR 'stuff' then move to the C target.
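The "call an API that builds lists" advice could look roughly like the sketch below. Everything in it is hypothetical: `NameList`, `name_list_init`, `name_list_add` and `name_list_join` are names made up for this illustration, not part of the ANTLR3 C runtime. The idea is that a grammar action only hands each matched piece of text to the library, which later joins the pieces (for example "var" and "char") into one type name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NAME_LIST_MAX 8   /* arbitrary small limit for this sketch */

/* Hypothetical list of borrowed strings; the list does not own them. */
typedef struct NameList {
    const char *items[NAME_LIST_MAX];
    size_t      count;
} NameList;

void name_list_init(NameList *list)
{
    assert(list != NULL);
    list->count = 0;
}

/* Add one name; returns 1 on success, 0 when the list is full. */
int name_list_add(NameList *list, const char *name)
{
    assert(list != NULL && name != NULL);
    if (list->count >= NAME_LIST_MAX) {
        return 0;
    }
    list->items[list->count++] = name;
    return 1;
}

/* Join all names with single spaces into out (at most cap bytes,
 * always NUL-terminated). Returns the length of the resulting string. */
size_t name_list_join(const NameList *list, char *out, size_t cap)
{
    size_t used = 0;
    size_t i;

    assert(list != NULL && out != NULL);
    if (cap == 0) {
        return 0;
    }
    out[0] = '\0';
    for (i = 0; i < list->count; i++) {
        int n = snprintf(out + used, cap - used, "%s%s",
                         i ? " " : "", list->items[i]);
        if (n < 0 || (size_t)n >= cap - used) {
            return strlen(out);   /* output was truncated to fit */
        }
        used += (size_t)n;
    }
    return used;
}
```

Because the list only borrows the strings, the caller has to keep them alive until the join is done, which is the kind of ownership decision the answer says belongs in the library rather than in the grammar actions.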

-----------------

You should not specify k unless you have good reason to and you need output=AST if you want to use the vectors, but then you should really use an AST and not try to perform things in the parser as the next problem you will have is tracking things that you malloc when a syntax error is found.

-----------------

Missing MATCHRANGE macro

I'm using the ANTLR v3 C runtime and found that this macro is missing from the generated *Parser.c and *Parser.h. The macro can be found in the *Lexer.c file. Is this a bug?

My tool version is 3.4 and the C runtime is libantlr3c-3.4-beta4. The following is the failing code fragment of PLSQLParser.c:

Code Block
...
switch (alt17)
        {
       case 1:
         // PLSQL.g:91:4: '0' .. '9' ( '0' .. '9' )*      // <= This is my grammar file
         {
          root_0 = (pANTLR3_BASE_TREE)(ADAPTOR->nilNode(ADAPTOR));

          MATCHRANGE('0', '9');                  // <= This is the missed macro
...

yushang

I think I've found the reason. I had written a rule as follows:

Code Block
numeric_literal
    :    '0'..'9' ('0'..'9')*
    |    ('0'..'9')* '.' '0'..'9' ('0'..'9')*
    ;

which is translated to MATCHRANGE in the parser. Changing it to lexer rules like this avoids that:

Code Block
numeric_literal
    :    INT
    |    FLOAT
    ;
INT
    :    '0'..'9' ('0'..'9')*
    ;
FLOAT
    :    ('0'..'9')* '.' '0'..'9' ('0'..'9')*
    ;

Answer 2: 

Better to do this:

Code Block
fragment FLOAT;
INT : '0'..'9'+ ( '.' '0'..'9'+ { $type = FLOAT; } | ) ;

Answer 3: 

Answer 2 REQUIRES at least one digit to the left of the decimal point in FLOAT, which is not what the OP had, but that is easily fixed, I believe, as:

Code Block
FLOAT : '.' '0'..'9'+ ;
INT : '0'..'9'+ ( '.' '0'..'9'+ { $type = FLOAT; } )? ;

(Note that I also replaced the empty alternative with the `?` meta-operator. I think the meta-operator is stylistically clearer, but maybe there is some other reason not to use it?)

Problem with splitting grammars

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:98+mid:ei3oagdtloebioix+state:results

I have one big grammar that needs to be split to reduce compile time (C source). For ANTLR 3.1.3 there are two problems:

1. The generated source code may change every time I run ANTLR even if I don't change the grammar

http://paste.pocoo.org/show/428505/ shows the difference.

2. The comment probably shouldn't include the date. I use sed to remove the comment but I don't think it's a good idea. ANTLR shouldn't generate different source code if I don't change the grammar. The date information should belong to the properties of the generated files, not the content of these files.

If your grammar has not changed, then your build process should not generate it. However, I think that I may have decided to take out the date sometime after 3.1.3.

If you need to generate smaller C files, then split the grammar up and use the import x.g, y.g, a.g; functionality.

You might also try 3.4-beta4.

SKIP() vs skip() in 'C' runtime

http://antlr.markmail.org/search/?q=Jim%20Idle#query:Jim%20Idle%20from%20list%3Aorg.antlr.antlr-interest%20from%3A%22Jim%20Idle%22+page:100+mid:tyjlnha6lmts4hhu+state:results

Where is the code for SKIP() found in the 'C' runtime? I had SKIP() in my C code version of the parser then I had to move to Java to find some bugs in my grammar. There I had to change SKIP() to skip(). Now I am going back to 'C' but I would like to change the 'C' runtime so that it will accept the lowercase skip().

Alan Condit

Why?

:%s/skip()/SKIP()/g

However it is a macro defined in the generated code, all you need do is:

Code Block
#define skip() SKIP()

Put this in an @section that follows the macro definition of SKIP.

Startup

...