Every language displays a structure called its grammar or syntax. For example, a correct sentence always
consists of a subject followed by a predicate, correct here meaning well formed. This fact can be
described by the following formula:
sentence = subject predicate.
If we add to this formula the two further formulas
subject = "John" | "Mary".
predicate = "eats" | "talks".
then we define herewith exactly four possible sentences, namely
John eats
Mary eats
John talks
Mary talks
where the symbol | is to be pronounced as or. We call these formulas syntax rules, productions, or simply
syntactic equations. Subject and predicate are syntactic classes. A shorter notation for the above omits
meaningful identifiers:
S = A B.
A = "a" | "b".
B = "c" | "d".
L = {ac, ad, bc, bd}
We will use this shorthand notation in the subsequent, short examples. The set L of sentences which can
be generated in this way, that is, by repeated substitution of the left-hand sides by the right-hand sides of
the equations, is called the language.
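The four sentences of L can be enumerated mechanically by substituting each alternative in turn. A minimal sketch in Python, where the names A, B, and L mirror the syntactic classes above:

```python
# Each syntactic class is a set of alternatives; the sentence S = A B
# is every choice from A followed by every choice from B.
A = ["a", "b"]
B = ["c", "d"]
L = [a + b for a in A for b in B]
print(L)  # → ['ac', 'ad', 'bc', 'bd']
```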
The example above evidently defines a language consisting of only four sentences. Typically, however, a
language contains infinitely many sentences. The following example shows that an infinite set may very
well be defined with a finite number of equations. The symbol ∅ stands for the empty sequence.
S = A.
A = "a" A | ∅.
L = {∅, a, aa, aaa, aaaa, ... }
The means to do so is recursion, which allows a substitution (here of A by "a" A) to be repeated arbitrarily
often.
Our third example is again based on the use of recursion. But it generates not only sentences consisting
of an arbitrary sequence of the same symbol, but also nested sentences:
S = A.
A = "a" A "c" | "b".
L = {b, abc, aabcc, aaabccc, ... }
It is clear that arbitrarily deep nestings (here of As) can be expressed, a property particularly important in
the definition of structured languages.
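A recognizer for this nested language must pair each leading "a" with a trailing "c", which the recursion does naturally. A small Python sketch along the same lines (again with an illustrative name):

```python
def accepts(s: str) -> bool:
    # A = "a" A "c" | "b" : either the single symbol "b",
    # or an "a", then a sentence of A, then a matching "c".
    if s == "b":
        return True
    return (len(s) >= 3 and s[0] == "a" and s[-1] == "c"
            and accepts(s[1:-1]))

print(accepts("aabcc"))  # → True
print(accepts("aabc"))   # → False
```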
Our fourth and last example exhibits the structure of expressions. The symbols E, T, F, and V stand for
expression, term, factor, and variable.
E = T | E "+" T.
T = F | T "*" F.
F = V | "(" E ")".
V = "a" | "b" | "c" | "d".
From this example it is evident that a syntax does not only define the set of sentences of a language, but
also provides them with a structure.
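The structure imposed by this syntax can be checked by a recursive-descent recognizer. Note that the rules for E and T are left-recursive and cannot be transcribed directly into recursive procedures; the sketch below uses the standard equivalent iterative forms E = T { "+" T } and T = F { "*" F }. All names are illustrative:

```python
def parse(s: str) -> bool:
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def expr():  # E = T { "+" T }
        nonlocal pos
        if not term():
            return False
        while peek() == "+":
            pos += 1
            if not term():
                return False
        return True

    def term():  # T = F { "*" F }
        nonlocal pos
        if not factor():
            return False
        while peek() == "*":
            pos += 1
            if not factor():
                return False
        return True

    def factor():  # F = V | "(" E ")"
        nonlocal pos
        if peek() == "(":
            pos += 1
            if not expr() or peek() != ")":
                return False
            pos += 1
            return True
        if peek() in ("a", "b", "c", "d"):  # V = "a" | "b" | "c" | "d"
            pos += 1
            return True
        return False

    return expr() and pos == len(s)

print(parse("a+b*(c+d)"))  # → True
print(parse("a+*b"))       # → False
```

Because the call to factor inside term happens before "*" is consumed, the parse tree it implies gives "*" precedence over "+", exactly the structure the syntax above prescribes.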
Monday, June 21, 2010
Introduction
Computer programs are formulated in a programming language and specify classes of computing
processes. Computers, however, interpret sequences of particular instructions, but not program texts.
Therefore, the program text must be translated into a suitable instruction sequence before it can be
processed by a computer. This translation can be automated, which implies that it can be formulated as a
program itself. The translation program is called a compiler, and the text to be translated is called source
text (or sometimes source code).
It is not difficult to see that this translation process from source text to instruction sequence requires
considerable effort and follows complex rules. The construction of the first compiler for the language
Fortran (formula translator) around 1956 was a daring enterprise, whose success was not at all assured. It
involved about 18 man-years of effort, and therefore figured among the largest programming projects of
the time.
The intricacy and complexity of the translation process could be reduced only by choosing a clearly
defined, well structured source language. This occurred for the first time in 1960 with the advent of the
language Algol 60, which established the technical foundations of compiler design that still are valid
today. For the first time, a formal notation was also used for the definition of the language's structure
(Naur, 1960).
The translation process is now guided by the structure of the analysed text. The text is decomposed,
parsed into its components according to the given syntax. For the most elementary components, their
semantics is recognized, and the meaning (semantics) of the composite parts is the result of the semantics
of their components. Naturally, the meaning of the source text must be preserved by the translation.
The translation process essentially consists of the following parts:
1. The sequence of characters of a source text is translated into a corresponding sequence of symbols of
the vocabulary of the language. For instance, identifiers consisting of letters and digits, numbers
consisting of digits, delimiters and operators consisting of special characters are recognized in this
phase, which is called lexical analysis.
2. The sequence of symbols is transformed into a representation that directly mirrors the syntactic
structure of the source text and lets this structure easily be recognized. This phase is called syntax
analysis (parsing).
3. High-level languages are characterized by the fact that objects of programs, for example variables and
functions, are classified according to their type. Therefore, in addition to syntactic rules, compatibility
rules among types of operators and operands define the language. Hence, verification of whether these
compatibility rules are observed by a program is an additional duty of a compiler. This verification is
called type checking.
4. On the basis of the representation resulting from step 2, a sequence of instructions taken from the
instruction set of the target computer is generated. This phase is called code generation. In general it is
the most involved part, not least because the instruction sets of many computers lack the desirable
regularity. Often, the code generation part is therefore subdivided further.
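As an illustration of step 1, lexical analysis can be sketched with regular expressions. The token classes below (identifiers, numbers, operators, delimiters) follow the description above, but the particular expressions are assumptions, not taken from any specific language definition:

```python
import re

# Each token class is a named regular expression; the scanner groups
# the character sequence into symbols of the vocabulary.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),           # numbers consist of digits
    ("IDENTIFIER", r"[A-Za-z]\w*"),   # identifiers: letters, then letters/digits
    ("OPERATOR",   r"[+\-*/=]"),      # operators consist of special characters
    ("DELIMITER",  r"[();]"),         # delimiters
    ("SKIP",       r"\s+"),           # whitespace merely separates symbols
]

def tokenize(text):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for m in re.finditer(pattern, text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x1 = 42 + y;")))
# → [('IDENTIFIER', 'x1'), ('OPERATOR', '='), ('NUMBER', '42'),
#    ('OPERATOR', '+'), ('IDENTIFIER', 'y'), ('DELIMITER', ';')]
```

The resulting symbol sequence, not the raw characters, is what the parser of step 2 consumes.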
A partitioning of the compilation process into as many parts as possible was the predominant technique
until about 1980, because until then the available store was too small to accommodate the entire
compiler. Only individual compiler parts would fit, and they could be loaded one after the other in
sequence. The parts were called passes, and the whole was called a multipass compiler. The number of
passes was typically 4 - 6, but reached 70 in a particular case (for PL/I) known to the author. Typically,
the output of pass k served as input of pass k+1, and the disk served as intermediate storage.
The very frequent access to disk storage resulted in long compilation times.
Thursday, June 17, 2010
What is a compiler?
In order to reduce the complexity of designing and building computers, nearly
all of these are made to execute relatively simple commands (but do so very
quickly). A program for a computer must be built by combining these very
simple commands into a program in what is called machine language. Since
this is a tedious and error-prone process most programming is, instead, done
using a high-level programming language. This language can be very different
from the machine language that the computer can execute, so some means of
bridging the gap is required. This is where the compiler comes in.
A compiler translates (or compiles) a program written in a high-level programming
language that is suitable for human programmers into the low-level
machine language that is required by computers. During this process, the compiler
will also attempt to spot and report obvious programmer mistakes.
Using a high-level language for programming has a large impact on how
fast programs can be developed. The main reasons for this are:
• Compared to machine language, the notation used by programming languages
is closer to the way humans think about problems.
• The compiler can spot some obvious programming mistakes.
• Programs written in a high-level language tend to be shorter than equivalent
programs written in machine language.
Another advantage of using a high-level language is that the same program
can be compiled to many different machine languages and, hence, be
brought to run on many different machines.