What makes a language problem hard?

Given a source to target mapping. How can you characterize the difficulty of the translation?

  • Is the set of all input fixed? If you have a fixed set of files to convert, your job is much easier because the set of language construct combinations is fixed. For example, building a general Pascal to Java translator is much harder than building a translator for a set of 50 existing Pascal files.
  • Forward or external references? I.e., multiple passes needed? Pascal has a "forward" reference to handle intra-file procedure references, but references to procedures in other files via the USES clauses etc... require special handling.
  • Is input order of sentences close to output order? Are there multiple files to generate from a single input file or vice versa?
  • Context sensitive lexer? You can't decide what vocabulay symbol to match unless you know what kind of sentence you are parsing.
  • Are delimiters non-fixed for things like strings and comments? That makes it tough to build an efficient lexer.
  • Is language big; like lots of statements?
  • Are the source statements really similar; declarations vs expressions in C++?
  • Column sensitive input? E.g., are newlines significant like lines in a log file and does the position of an item change its meaning?
  • Case sensitivity problems like fortran?
  • Do you need good error recovery? Good reporting?
  • Well defined language or no manual; hacked for ages like gnucc by non-language designers? Is your language VisualBasic-like?
  • How fast does your translator have to be? It is often the case that building lots of translator phases simplifies your problem, but it can slow down the translation.
  • Does your input have comments as you do in programming languages that can occur anywhere in the input and need to go into the output in a sane location?
  • How much semantic information do you need to do the translation? For example, do you need to simply know that something is a type name or do you need to know that it is, say, an array whose indices are a set like (day,week,month) and contains records? Sometimes syntax alone is enough to do translation.
  • Equivalent syntaxes?  In C there are many different ways to dereference pointers.  You can normalize the language to a standard representation, but you might loose the original representation. The choice usually hinges on whether the output will be human-edited or not.  Designing the right tree structure has to incorporate decisions like this.
  • Jurgen Pfundt points out: The considered language might be small, but mapping is targeted for the conversion of huge files and this is really a challenge. An input file with a size of several megabytes restricts the usage of tree parsers or any other kind of memory consuming features. The transformation should be done in one single pass due to performance requirements and an extremly good and comfortable error reporting and error recovery is a must.