Language Translation Using ANTLR and StringTemplate

From 2004 article

Language Translation Using ANTLR and StringTemplate
Terence Parr, parrt at cs.usfca.edu
University of San Francisco

Introduction
My previous article for Code Generation Network, Generating Java and XML Using StringTemplate, illustrated using StringTemplate with a single model and multiple views to have a Java program dump itself out in multiple forms such as Java and XML. A more common use of template engines, however, involves language or data translation where you need to parse as well as "unparse".

This article demonstrates how to translate a very simple C dialect, which I will call "C-" in honor of the grades I received in college, to three different languages, again using a single model and controller with multiple views. The targets are Java, Python, and bytecodes similar to the Jasmin Java bytecode assembler format. Merely swapping in a new template file generates completely different output without even recompiling the translator! This limited dialect of C is not particularly functional, but is complicated enough to be interesting from a translation point of view.

The goal is to translate C- input such as:

char c;
int x;
int foo(int y, char d) {
  int i;
  for (i=0; i<3; i=i+1) {
    x=3;
    y=5;
  }
}

to (the mostly identical) Java:

class Wrapper {
    char c;
    int x;
    int foo(int y, char d) {
        int i;
        for (i = 0; i < 3; i = i + 1) {
            x = 3;
            y = 5;
        }
    }
}

to Python:

def foo(y, d):
    i = 0
    while ( i < 3 ):
        x = 3
        y = 5
        i = i + 1

and to the radically different pseudo bytecodes (as if we had compiled the Java program):

.class public Wrapper
.super java/lang/Object
.field c C
.field x I

.method foo(IC)I
    .var is i I
    iconst 0
    store i
    start:
    push i
    iconst 3
    lt
    bf exit
    iconst 3
    store x
    iconst 5
    store y
    push i
    iconst 1
    add
    store i
    goto start
    exit:
    return
.end method

The Java target needs some additional output (the class wrapper), the Python target removes/adjusts some constructs (the declarations/FOR loops), and the bytcode target generates multiple instructors per input construct. Just to be clear, the bytecodes described herein are not exactly Java bytecodes (for example, lt doesn't exist)--I have simplified the target for our purposes here.

Translator Architecture

The basic structure of the translator is an ANTLR grammar file (cminus.g), a main program (Main.java) that launches the parser, and three template group files used by the parser: Java.stg, Python.stg, and Bytecode.stg. The translator will read from stdin and write to stdout like a filter.

In object-oriented MVC parlance, the translator's model is the input token stream, the controller is the parser itself, and the view is the template engine + template files. The controller's job is to extract data in a meaningful way from the model and provide it to the view without worrying about the details of the output structure. In a sense, the controller is mapping abstract concepts from the input language to abstract concepts in the output language such as assignment to assignment. These abstract concepts are represented by one or more rules in the parser grammar and one or more templates in the template file. The view's job is simply to say what the abstract concept looks like in the target language.

Anything short of this fully-disentangled approach will not scale to more than a single target. The opposite strategy, that of print statements embedded in the grammar, is clearly impossible to retarget without replacing all of the prints. Further, you must imagine the emergent behavior whereas, with a template-based approach, you can clearly identify what the output structure looks like. Templates are an output grammar; the engine is an "unparser."

The ANTLR grammar is presented without much comment as most readers will be able to get the basic idea and can check out the simple Introduction to ANTLR, which describes ANTLR sufficiently to understand this article.

Parsing C-
The recognition of C- is relatively simple as it is just variable and function declarations. Here is a chunk of the grammar without translation actions:

class CMinusParser extends Parser;
...
program
    :   (declaration)+
    ;

declaration
    :   variable
    |   function
    ;

variable
    :   type declarator SEMI
    ;

type:   "int"
    |   "char"
    |   ID
    ;

declarator
    :   ID
    ;
...

This grammar references tokens such as SEMI and ID that must be defined. An ANTLR lexical grammar handles this easily. Here is a chunk:

class CMinusLexer extends Lexer;
...
SEMI:   ';' ;
COMMA:  ',' ;

ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

INT :   ('0'..'9')+
    ;
...

Here is relevant portion of the main program that launches a C- parser on standard input:

CMinusLexer lexer = new CMinusLexer(System.in); // create tokenizer
CMinusParser parser = new CMinusParser(lexer);  // create/attach parser
parser.program(); // invoke the start rule

Translation Using StringTemplate

The simplest possible translator has a bunch of print statements embedded in the recognizer (parser) and does what is called a syntax-directed translation because a single in-order walk of the input syntax is sufficient to direct the translation. Naturally, print statements are pretty inflexible in the sense that you cannot easily retarget the translator without carving out the print statements and putting in different ones. Certainly the translator cannot generate more than one target at once. The model, controller, and the view are entangled. Here, the model is the input C- program stored as a token stream and the controller is the parser.

To get a retargetable code generator, you must detach the view from the model and controller. In other words, you have to get any output string or computed output string out of the model-controller and into a separate specification. The StringTemplate template engine provides a framework for accessing the extracted strings in a formal way.

StringTemplate lets you name pieces of the translation just like an input grammar lets you name pieces of the input such as declaration. The analog of an input grammar rule is a template. A rule or rules will feed data to a template or templates that specify how to format that data and what text to surround it with. The overall look of the Java output is a bunch of variables and functions wrapped in a class called Wrapper. To specify this using StringTemplate, define a template called program:

program(globals,functions) ::= <<
class Wrapper {
    <globals; separator="\n">
    <functions; separator="\n">
}
>>

Note the indentation preceding the <globals> and <functions> references ensures that all text generated from those attribute references will be indented.

Assignment Translation

Usually there will be both a rule and template associated with each abstract language construct like program, declaration, function, etc... For example, to perform a translation, the assignStat rule must fill the lhs ("left-hand-side") and rhs attribute values for use by the assign template. The following grammar rule, which is one of the overall expression matching rules, matches assignments (as well as other expressions):

assignStat
    :   ID ASSIGN expr
    ;	

To make the rule translate the input, it collects the left and right operands of the ASSIGN operator and then sets the appropriate attributes of the output template called assign.

assignStat returns [StringTemplate code=null]
{StringTemplate a=null;}
    :   id:ID ASSIGN a=expr
        {
            code=template("assign");
            code.setAttribute("lhs", id.getText());
            code.setAttribute("rhs", a);
        }
    ;

The assign output template looks the same for both Java and Python except that the Java template has a semicolon on the end:

assign(lhs,rhs) ::= "<lhs> = <rhs>;"

For bytecodes, it is a little more low-level:

assign(lhs,rhs) ::= <<
<rhs>
store <lhs>
>>

Note that rule assignExpr makes use of a helper function that gets an instance of a template with a particular name:

StringTemplate template(String name) {
    return Main.templates.getInstanceOf(name);
}

The general strategy then for the translator is to have every rule return a StringTemplate representing the appropriately translated text. The return value of the start rule will be the entire translation. The main program will call the start rule, program, and print out the return value:

StringTemplate code = parser.program();
System.out.println(code.toString());

Variable Declaration Translation
Python does not have static typing and, hence, does not have variable declarations. How can you set up a single parser rule that both translates declarations and omits declarations? Recall that the parser's job is simply to feed data to the appropriate template. In Java, the output template looks like:

variable(type,name) ::= "<type> <name>;"

but in Python, just make an empty template:

variable(type,name) ::= " "

For bytecodes, the locals and globals look different:

variable(type,name) ::= ".var is <name> <type>"

globalVariable(type,name) ::= ".field <name> <type><\n>"

Here is what the variable/declarator rule duo looks like with translation actions:

variable returns [StringTemplate code=null]
{StringTemplate t=null,d=null;}
    :   t=type d=declarator SEMI
        {
        if ( currentFunctionName==null ) {
            code = template("globalVariable");
        }
        else {
            code = template("variable");
        }
        code.setAttribute("type", t);
        code.setAttribute("name", d);
        }
    ;

declarator returns [StringTemplate code=null]
    :   id:ID {code=text(id.getText());}
    ;

A globalVariable template is used when the parser is not matching a function (currentFunctionName is null), meaning the variable is not a local.

The declarator rule creates a StringTemplate via helper function text:

StringTemplate text(String t) {
    return new StringTemplate(Main.templates,t);
}

which makes an anonymous (i.e., unnamed) template containing the actual text of the variable name. At the lowest level of the translation, identifiers are translated to identifiers (themselves).

Method Argument Translation

The reader may be asking how using StringTemplate is different than just having all rules return String or filling up a StringBuffer. Aside from the model-view separation issues and the nice iteration operations available in StringTemplate, there is a more subtle advantage--the resulting StringTemplate is actually a tree of nested StringTemplates (in a sense, the translator builds a StringTemplate from the parse tree implicitly walked via the call activation stack). In other words, StringTemplate is "live" data not dumb text and, hence, one template may refer to data stuffed into another template.

The translation of C- functions provides a simple case to illustrate how a StringTemplate may treat another StringTemplate like just an attribute to access its embedded attributes. First, consider how functions are translated per the source provided with this article.

Here is the rule that matches a function parameter declaration:

formalParameter returns [StringTemplate code=template("parameter")]
{StringTemplate t=null,d=null;}
    :   t=type d=declarator
        {
        code.setAttribute("type", t);
        code.setAttribute("name", d);
        }
    ;

It creates a parameter template for every declaration it sees, storing the type and name as attributes. For Java, a suitable output template for a parameter would be:

parameter(type,name) ::= "<type> <name>"
The function rule uses formalParameter while matching the whole function:

function returns [StringTemplate code=template("function")]
{StringTemplate t=null, f=null;}
    :   t=type id:ID {currentFunctionName=id.getText();} LPAREN
        (   f=formalParameter
            {code.setAttribute("args", f);}
            (   COMMA f=formalParameter
                {code.setAttribute("args", f);}
            )*
        )?
        RPAREN
        block[code]
        {
        code.setAttribute("type", t);
        code.setAttribute("name", id.getText());
        currentFunctionName=null;
        }
    ;

It stores the result of each formalParameter in a multi-valued attribute called args in a function template. The Java function output template dumps the list of (comma-separated) arguments via the expression:

<args; separator=", ">

The full function template is:

function(type,name,args,locals,stats) ::= <<
<type> <name>(<args; separator=", ">) {
    <locals; separator="\n">
    <stats; separator="\n">
}
>>

The Python version would look similar:

function(type,name,args,locals,stats) ::= <<
def <name>(<args; separator=", ">):
    <stats>
>>

where output template parameter is defined without a static type as:

parameter(type,name) ::= "<name>"

Now, to demonstrate that StringTemplate is more than just a bunch of flat strings, consider a new version of function translation that pulls data out of the parameter StringTemplate object directly:

function(type,name,args,locals,stats) ::= <<
def <name>(<args:{<it.name>}; separator=", ">):
    <stats>
>>

In this case, the parameter template is not used for output per se, it merely holds the type and name attributes for use by the function template. The toString() method (the method used to evaluate and convert a template to text) of template parameter is never actually invoked to spit out text.

The syntax for dumping the list of parameter names:

<args:{<it.name>}; separator=", ">

may look strange so it is worth explaining. Attribute args is one or more StringTemplate objects as created by the formalParameter rule. As such, each StringTemplate object's contained attributes are visible using the "dot" field operator.

A quick aside to explain the StringTemplate iteration notation may prove useful to many readers. Imagine a list of User objects with properties name and phone. A template may access these properties with <user.name> and <user.phone> assuming user is a StringTemplate attribute. If you store multiple values in an attribute, say users, then you can dump them all out simply by referring to the attribute: <users> (assuming User.toString()) is defined. To get a comma separator, use <users; separator=",">. What if you to change how a User is dumped? Apply an anonymous template to each element of users. This template:

<users:{Name: <it.name> Phone: <it.phone>}; separator=",">

dumps text like:

Name: Terence Phone: x5707, Name: Tom Phone: x3221

where <it> is the iterated value that walks through each of the User objects in users.

If you want just a list of comma-separated names, use the following:

<users:{<it>.name}; separator=",">

Summary

When building translators for complicated languages, we often focus on the parsing component such as building the ANTLR grammar, but using a tool for the "unparsing" / code generation component is equally important. This article demonstrates a combined ANTLR+StringTemplate approach that has consistently provided satisfying results. For example, the new ANTLR 3.0 parser generator relies on ANTLR 2.7.5 and StringTemplate for its code generator, which has proven remarkably easy to retarget to different languages.

By separating the output strings (view) from the model (input text) and controller (parser), a translator may be retargeted by swapping in a new view (set of templates)--all without recompiling the translator. The controller acts to map data in the model to the view; in language parlance, the parser maps input language subphrases matched by rules to output language subphrases generated with templates.

While a real translator is much more complicated, it is only in the details. The strategy described in this article will scale easily.

The Software
The software is comprised of the following files:

cminus.g
Main.java
Java.stg
Python.stg
Bytecode.stg

which are in the cminus.tar.gz compressed tarball.

You also need to download ANTLR and StringTemplate or you can use these jar files directly (.class files only):

antlr-2.7.5.jar
stringtemplate-2.1.jar

To build the software, do this:

$ java  -classpath .:antlr-2.7.5.jar:stringtemplate-2.1.jar antlr.Tool cminus.g 
$ javac -classpath .:antlr-2.7.5.jar:stringtemplate-2.1.jar *.java

To run the various targets (Java is the default), do this:

$ java  -classpath .:antlr-2.7.5.jar:stringtemplate-2.1.jar Main < input
$ java  -classpath .:antlr-2.7.5.jar:stringtemplate-2.1.jar Python.stg Main < input
$ java  -classpath .:antlr-2.7.5.jar:stringtemplate-2.1.jar Bytecode.stg Main < input