Antlr3PythonTarget

The Python code generation target

New features currently in development

Features marked with [DEV] will appear with the next official release of ANTLR. Feel free to try them out by getting the current development version from FishEye.

Please note that the Python target is (compared to most other targets) rather young. I would consider it to be in a beta state: most parts are working (the big exception being template output), but bugs and problems are to be expected and the documentation is pretty poor. It still has to prove itself in a real-world application (which is currently being done).

Both the runtime module and the code generation templates should now be feature complete and in sync with the Java target, except for the features listed below. But large parts of the runtime are still untested.

WARNING: Currently the runtime library for V3.1 is not compatible with recognizers generated by ANTLR V3.0.x. If you are an application developer, then the suggested way to solve this is to package the correct runtime with your application. Installing the runtime in the global site-packages directory may not be a good idea.

It is still undetermined whether a future release of the V3.1 runtime will be compatible with V3.0.x recognizers, or whether future runtimes (V3.2+) will be compatible with V3.1 recognizers. Sorry for the inconvenience.

See this Example for a working tree-walking grammar.

Please send bug reports, feedback and patches to me or to the antlr-interest mailing list.

Credits go to Clinton Roy for the code to support output=AST.

Requirements

The following Python versions are supported: 2.4 and 2.5.
(The runtime package and the generated code are probably compatible with Python 2.3, but I started to use decorators in the testsuite and a recent apt-get dist-upgrade purged python2.3 from my system, so I cannot test this at the moment.)

To use generated code, you'll need the Python runtime package antlr3 in your import path. There are no other dependencies beyond the Python standard library.
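
If you package the runtime with your application (as suggested in the compatibility note above), a simple way to make it importable is to extend sys.path before the import. This is just a sketch; the directory name third_party is an arbitrary example:

import sys, os

# look for the bundled antlr3 package next to this script
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), 'third_party'))

import antlr3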

Usage

Selecting Python output

Just add language=Python; to the options section of your grammar:

grammar T;
options {
    language=Python;
    [other options]
}

...

For a grammar T.g ANTLR3 will then create the files TLexer.py and TParser.py which contain the classes TLexer and TParser (or just one of those, if you have a pure lexer/parser). For tree parsers, ANTLR3 creates T.py containing the class T.

Using the generated classes

To use a grammar T.g:

import antlr3
from TLexer import TLexer
from TParser import TParser

input = '...what you want to feed into the parser...'
char_stream = antlr3.ANTLRStringStream(input)
# or to parse a file:
# char_stream = antlr3.ANTLRFileStream(path_to_input)
# or to parse an opened file or any other file-like object:
# char_stream = antlr3.ANTLRInputStream(file)

lexer = TLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TParser(tokens)
parser.entry_rule()

If you want to access the token types in your code, you'll have to import them from the lexer or parser module (in Java these are members of the lexer/parser classes, in Python they are defined at module level):

from TLexer import EOF, INTEGER, FLOAT, IDENTIFIER
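
For example (this assumes your grammar really defines an INTEGER token; the input string is only an illustration):

lexer = TLexer(antlr3.ANTLRStringStream('42'))
token = lexer.nextToken()
if token.getType() == INTEGER:
    print "got an integer:", token.getText()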

Using tree parsers

For grammars T.g (parser and lexer) and TWalker.g (the tree parser):

import antlr3
import antlr3.tree
from TLexer import TLexer
from TParser import TParser
from TWalker import TWalker

char_stream = antlr3.ANTLRStringStream(...)
lexer = TLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TParser(tokens)
r = parser.entry_rule()

# this is the root of the AST
root = r.tree

nodes = antlr3.tree.CommonTreeNodeStream(root)
nodes.setTokenStream(tokens)
walker = TWalker(nodes)
walker.entry_rule()

output=template

The Python target now has support for grammars with the option output set to template. This relies on the Python port of StringTemplate V3.1, which has not been released officially yet. You can grab the current development version from FishEye (tgz, zip). Unpack it and run setup.py. StringTemplate itself depends on the ANTLR2 runtime, so you need to install that one too: download antlr-2.7.7.tar.gz, unpack it and run setup.py from antlr-2.7.7/lib/python.

Once you have the runtime in place, you can use templates in your grammar. For example:

grammar T;

options {
  language=Python;
  output=template;
}

a : ID INT
    -> template(id={$ID.text}, int={$INT.text})
       "id=<id>, int=<int>"
  ;

ID : 'a'..'z'+;
INT : '0'..'9'+;
WS : (' '|'\n') {$channel=HIDDEN;} ;

The rule a will then return an instance with an st member, which is a StringTemplate instance that you can expand using its toString() method.
For a detailed explanation of the syntax of template rewrite rules please look at Template Construction.

You'll usually want ANTLR to use a specific StringTemplateGroup which contains your template definition. Simply set the templateLib attribute of your parser to your StringTemplateGroup.
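
Putting it together, a run of the grammar above could look like this (the input string is arbitrary, and my_group stands for a StringTemplateGroup that you have loaded yourself, e.g. from a .stg file via the stringtemplate3 module):

import antlr3
from TLexer import TLexer
from TParser import TParser

char_stream = antlr3.ANTLRStringStream('abc 42')
lexer = TLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TParser(tokens)

# optional: use your own template group instead of the default one
# parser.templateLib = my_group

r = parser.a()
print r.st.toString()   # prints something like: id=abc, int=42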

For a bit more detailed example, look at the cminus example.

Using the ANTLRWorks debugger [added in V3.1.3]

This feature is still in early development

I'd appreciate it if you try it out and send me feedback, but don't be surprised by bugs or strange behaviour.
Tree parsers are currently not supported, but token parsers should work (with and without AST generation).
I suggest running ANTLRWorks from a console (java -jar antlrworks.jar) and keeping an eye on the output. If the parser sends garbage over the socket, ANTLRWorks stops the debugger, but doesn't show an explicit message to the user that something went wrong.
Also watch the console that runs the parser, as a bug may cause an exception that likewise doesn't trigger a user notification in ANTLRWorks.

When debugging grammars in ANTLRWorks that contain semantic predicates, or when you want the actions to be executed during debugging, the grammar has to be generated with the -debug option and executed with some input. It will then communicate with ANTLRWorks over a socket. ANTLRWorks is not (yet) able to directly launch a Python parser, so you have to do that manually.

Here is a basic step-by-step description. It assumes that the antlr jar is in your CLASSPATH and the Python runtime in PYTHONPATH.

  • Open the grammar - let's call it T.g - in ANTLRWorks
  • Generate the parser using java org.antlr.Tool -debug T.g
  • Create a file that holds the input that you want to parse (input.txt)
  • Launch the parser using the built-in test harness: python TParser.py --rule=start-rule --port=49100
  • This command should block, seemingly doing nothing (it waits for a connection on port 49100)
  • In ANTLRWorks select Debugger > Debug Remote... from the menu, address should be localhost and port 49100
  • Now you should be able to step through the grammar in ANTLRWorks and watch how it applies the rules or how the AST is constructed (if output=AST).

It is planned to add better support for Python debugging to ANTLRWorks (as it already has for Java).

API documentation

Reference documentation for the runtime package can be found at http://www.antlr.org/api/Python/.

Actions

This target currently supports the action scopes @lexer, @parser and @treeparser for global actions. The following action names are known:

  • header - Will be inserted right after ANTLR's own imports at the top of the generated file. Use it for import statements or any other functions/classes that you need at module scope.
  • footer - Will be inserted at the bottom of the generated files, right before the final if __name__ == '__main__' block. [New in 3.1]
  • init - Will be inserted at the end of the __init__ method of the lexer/parser. Here you can setup your own instance attributes.
  • members - Will be inserted in the class body of the lexer/parser right after __init__. This is the right place for custom methods and class attributes.

For rules the additional action @decorate is recognized. The contents are placed right before the rule method and can be used for decorators.

r
@decorate { @logThis }
: s ;

will create something like

@logThis
def r(self):
    ...
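
For this decorator to work, logThis has to exist at module level by the time the parser class body is executed. One way to arrange that - the logThis decorator itself is purely hypothetical - is to define it in a @header action:

@header {
def logThis(func):
    # hypothetical decorator: print the rule name on every invocation
    def wrapper(self, *args, **kwargs):
        print "entering rule", func.__name__
        return func(self, *args, **kwargs)
    return wrapper
}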

Caveats

Don't use TABs

Make sure that your editor uses spaces for indentation when you are editing your grammar files. The generated code uses spaces, and when your actions are copied into the output, TABs will only cause confusion. A warning should be generated when ANTLR stumbles upon TABs.

% in actions

This is not a Python-specific issue, but the % operator is probably used more often in Python than in other languages. See Special symbols in actions for the usage of %, but note that support for these is not yet implemented for the Python target.

In an ANTLR3 grammar % is a special character for StringTemplate and must be escaped in order to pass a plain % into the generated code. So you'll have to write stuff like

...
{ print "hello \%s" \% $t.text }
...

Member functions instead of property names

You may have to replace references like $TOKEN.text with $TOKEN.getText(). I found this was necessary to get the Example working. This is a bug in the CommonTree class in the antlr3 Python library. It will be fixed in a future release. - BCS
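
For example, in a tree grammar action (the rule and token names are only for illustration):

printId
    : ID { print "matched:", $ID.getText() }   // rather than $ID.text
    ;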

Must supply 'self' for member function calls

When an object calls one of its own methods, Python requires that the object be explicitly specified. So, while in Java you can just say skip(), in Python you must say self.skip().
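
For example, a whitespace rule that discards its tokens would be written like this in a Python-target grammar (a Java grammar would use a plain skip() here):

WS : (' '|'\t'|'\n')+ { self.skip() } ;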

Semicolons after property assignments

If you are assigning a value to a property in an action, you may need to add a semicolon after the statement.

...
{ $text = "Hello world!"; }  // set text for rule
...
{ $someScope::someMember = value; } // set a scope member
...

ANTLR currently scans the code for a semicolon to detect property assignments. This semicolon is omitted in the generated code. This may lead to some strange code corruption if ANTLR finds a semicolon in an unexpected place. But I don't know if this is more than a theoretical problem - so far I have not run into any issues.

For the curious...

Technically ANTLR does not need to treat assignments differently from expressions, because in Python stuff like $text always translates to some_internal_name.text, whereas Java needs setText(...) if it's on the LHS and getText() on the RHS.
So this issue may well be fixed in a later version of ANTLR. In that case, the semicolons that you may now be using will make it into the generated code, which is fortunately not a syntax error.

Empty alternatives (V3.0.x)

In rules with empty alternatives, ANTLR may generate invalid Python code:

r: ( s | t | ) u;

This will result in an else: without an indented block. Until this issue is resolved, just stick a no-op action into the empty alternative:

r: ( s | t | {pass} ) u;

This is not required in ANTLR V3.1 anymore.

Comments (for ANTLR2 users)

The Python target for ANTLR2 forced you to use C++ style comments inside of action blocks in your grammar. This is no longer true; use plain Python comments.

Invoke parser using built-in test harness

The generated code contains a simple main() function, so many grammars can be tested directly without writing any extra code, just by executing the generated parser as a script.

Usage

The scripts can be invoked with options as listed below and a single optional path to a file. When the path is omitted or explicitly set to -, the script will read its input from standard input (unless the --input option is present).

Command line options
  • --interactive: Start the parser in interactive mode. You can enter one line of input at a time, which will be fed into the lexer, parser or tree parser. Press CTRL-C or CTRL-D to exit.
  • --encoding: Set the encoding of the input file (e.g. utf-8). Default is us-ascii.
  • --input: Use argument as input, don't read from a file or stdin.
  • --lexer: (parsers and tree parsers only) Name of the lexer class (e.g. TLexer). For parsers the default should be correct, but for tree parsers you must set this option.
  • --parser: (tree parsers only) Name of the parser class (e.g. TParser).
  • --rule: (parsers and tree parsers only) Start rule of the parser or tree parser.
  • --parser-rule: (tree parsers only) Start rule of the parser.
  • --port: The port to use when running a grammar in debug mode.
  • --debug-socket: Dump socket communication to STDERR when running in debug mode.
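
For example, to feed input directly to a parser without creating a file (the rule name a and the input are placeholders for your own grammar; the last line starts interactive mode):

$ python TParser.py --rule=a --input="1 + 2"
$ echo "1 + 2" | python TParser.py --rule=a -
$ python TParser.py --rule=a --interactive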

Examples

The output will of course vary depending on your grammar and the given input.

Lexer

You have a grammar T.g which generated a lexer TLexer.py.

$ python TLexer.py input.txt
[@-1,0:2='foo',<4>,1:0]
[@-1,3:3=' ',<5>,channel=99,1:3]
[@-1,4:6='bar',<4>,1:4]

Parser

You have a grammar T.g which generated a lexer TLexer.py and a parser TParser.py. The parser has a rule a, which should be used as the start rule.

$ python TParser.py --rule=a input.txt
(+ 1 2)

The output depends on the return value of the start rule. If the parser generates an AST (output=AST), then the resulting tree will be printed. Otherwise a simple repr() of the return value will be printed (which does not help much if the rule has multiple return values, because you will then get something like <__main__.r_return object at 0xb7c4602c> - this should be fixed somehow).

Tree parser

You have a grammar T.g which generated a lexer TLexer.py and a parser TParser.py; the start rule for the parser is a. A grammar TWalker.g generated the tree parser TWalker.py with start rule b.

$ python TWalker.py --lexer=TLexer --parser=TParser --parser-rule=a --rule=b input.txt
(+ 1 2)

The output varies depending on the return values of the tree parser rule, just as for parsers.

Custom main function

If you want to create a custom main() function, you can supply one using a @main section. You have to define a function main with one required argument, which will be the list of command line arguments (sys.argv).

@main {
def main(argv, otherArg=None):
    print "Hello world!"

}
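
A slightly more useful sketch, which reads a file given on the command line (or stdin) and invokes an entry rule - the rule name a and the companion lexer name are placeholders for your own grammar:

@main {
def main(argv):
    import sys
    import antlr3
    from TLexer import TLexer

    # read from the file given on the command line, or from stdin
    if len(argv) > 1 and argv[1] != '-':
        char_stream = antlr3.ANTLRFileStream(argv[1])
    else:
        char_stream = antlr3.ANTLRInputStream(sys.stdin)

    lexer = TLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = TParser(tokens)   # TParser is defined in this generated module
    parser.a()                 # replace 'a' with your entry rule
}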

Unsupported features

  • -debug option: mostly useful for integration with ANTLRWorks. This is currently being worked on; see above.