Introduction
Computer programs are formulated in a programming language and specify classes of computing
processes. Computers, however, interpret sequences of particular instructions, not program texts.
Therefore, the program text must be translated into a suitable instruction sequence before it can be
processed by a computer. This translation can be automated, which implies that it can be formulated as a
program itself. The translation program is called a compiler, and the text to be translated is called source
text (or sometimes source code).
It is not difficult to see that this translation process from source text to instruction sequence requires
considerable effort and follows complex rules. The construction of the first compiler for the language
Fortran (formula translator) around 1956 was a daring enterprise whose success was not at all assured. It
involved about 18 man-years of effort and therefore figured among the largest programming projects of
the time.
The intricacy and complexity of the translation process could be reduced only by choosing a clearly
defined, well-structured source language. This occurred for the first time in 1960 with the advent of the
language Algol 60, which established the technical foundations of compiler design that still are valid
today. For the first time, a formal notation was also used for the definition of the language's structure
(Naur, 1960).
The translation process is now guided by the structure of the analysed text. The text is decomposed, that is, parsed into its components according to the given syntax. The semantics of the most elementary components is recognized directly, and the meaning (semantics) of each composite part then results from the semantics of its components. Naturally, the meaning of the source text must be preserved by the translation.
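This stepwise composition of meaning can be illustrated with a minimal sketch. Python is used here purely as a compact notation, and every name in it is invented for the example; no particular compiler is implied. The value of a composite expression results from the values of its components:

```python
# Minimal sketch of compositional semantics: the meaning (here simply
# the numeric value) of an expression tree is derived bottom-up from
# the meanings of its components. All names are illustrative.

def meaning(node):
    # Elementary component: its semantics is recognized directly.
    if isinstance(node, int):
        return node
    # Composite component: its semantics results from the semantics
    # of its components.
    op, left, right = node
    if op == "+":
        return meaning(left) + meaning(right)
    if op == "*":
        return meaning(left) * meaning(right)
    raise ValueError(f"unknown operator: {op}")

# (2 + 3) * 4, represented as a nested tuple
print(meaning(("*", ("+", 2, 3), 4)))   # prints 20
```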
The translation process essentially consists of the following parts:
1. The sequence of characters of a source text is translated into a corresponding sequence of symbols of
the vocabulary of the language. For instance, identifiers consisting of letters and digits, numbers
consisting of digits, and delimiters and operators consisting of special characters are recognized in this
phase, which is called lexical analysis (see the first sketch after this list).
2. The sequence of symbols is transformed into a representation that directly mirrors the syntactic
structure of the source text and makes this structure easy to recognize. This phase is called syntax
analysis or parsing (see the second sketch below).
3. High-level languages are characterized by the fact that objects of programs, for example variables and
functions, are classified according to their type. Therefore, in addition to syntactic rules, compatibility
rules among the types of operators and operands define the language. Hence, verification of whether these
compatibility rules are observed by a program is an additional duty of a compiler. This verification is
called type checking (see the third sketch below).
4. On the basis of the representation resulting from step 2, a sequence of instructions taken from the
instruction set of the target computer is generated. This phase is called code generation (see the fourth
sketch below). In general it is the most involved part, not least because the instruction sets of many
computers lack the desirable regularity. Often, the code generation part is therefore subdivided further.
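The four phases can be made concrete with small sketches on a tiny expression language. These are simplified illustrations, not the design of any actual compiler: the token names, tree shapes, types, and target instructions below are all invented for the example, and Python again serves merely as notation. First, lexical analysis groups characters into the symbols of the vocabulary:

```python
# Sketch of lexical analysis: characters are grouped into symbols
# (identifiers, numbers, delimiters and operators). The token
# representation is invented for this example.
def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                   # blanks separate symbols
            i += 1
        elif ch.isdigit():                 # a number: a run of digits
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("number", int(text[i:j])))
            i = j
        elif ch.isalpha():                 # an identifier: letters and digits
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(("ident", text[i:j]))
            i = j
        elif ch in "+*()":                 # one-character operators, delimiters
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character: {ch!r}")
    return tokens

print(tokenize("x + 42 * y"))
# [('ident', 'x'), ('+', '+'), ('number', 42), ('*', '*'), ('ident', 'y')]
```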
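Second, syntax analysis. Recursive descent is used here because it follows the grammar most directly; it is only one of several parsing techniques, and the grammar itself is invented for the example:

```python
# Sketch of syntax analysis: a recursive-descent parser transforms the
# symbol sequence into a tree mirroring the syntactic structure.
# Illustrative grammar:  expr   = term {"+" term}.
#                        term   = factor {"*" factor}.
#                        factor = number | ident | "(" expr ")".
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        kind, value = advance()
        if kind in ("number", "ident"):
            return (kind, value)
        if kind == "(":
            node = expr()
            if advance()[0] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected symbol: {kind}")

    def term():
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    return expr()

# The symbol sequence produced by the lexical-analysis sketch above:
tokens = [("ident", "x"), ("+", "+"), ("number", 42), ("*", "*"), ("ident", "y")]
print(parse(tokens))
# ('+', ('ident', 'x'), ('*', ('number', 42), ('ident', 'y')))
```

Note that the tree already reflects that * binds more strongly than +, because term is parsed below expr.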
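Third, type checking. The two types and the compatibility rule (arithmetic operators require integer operands) are assumptions made for this sketch:

```python
# Sketch of type checking: a walk over the tree verifies that operand
# types are compatible with each operator. Only the invented types
# "int" and "bool" exist in this example.
def check(node, env):
    kind = node[0]
    if kind == "number":
        return "int"
    if kind == "ident":
        return env[node[1]]                # the declared type of the variable
    if kind in ("+", "*"):                 # arithmetic requires int operands
        left = check(node[1], env)
        right = check(node[2], env)
        if left != "int" or right != "int":
            raise TypeError(f"operator {kind} applied to {left} and {right}")
        return "int"
    raise ValueError(f"unknown node: {kind}")

tree = ("+", ("ident", "x"), ("*", ("number", 42), ("ident", "y")))
print(check(tree, {"x": "int", "y": "int"}))    # int
try:
    check(tree, {"x": "bool", "y": "int"})
except TypeError as e:
    print(e)                     # operator + applied to bool and int
```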
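Fourth, code generation. A hypothetical stack machine is assumed as target, with an instruction set (PUSH, LOAD, ADD, MUL) invented for the example; a stack machine is chosen here precisely because it exhibits the regularity that, as noted above, real instruction sets often lack:

```python
# Sketch of code generation: a bottom-up walk over the tree emits
# instructions for an invented stack machine. Operands are pushed
# first; each operator then combines the two topmost stack values.
def generate(node, code):
    kind = node[0]
    if kind == "number":
        code.append(("PUSH", node[1]))     # push a constant
    elif kind == "ident":
        code.append(("LOAD", node[1]))     # push a variable's value
    elif kind in ("+", "*"):
        generate(node[1], code)            # code for the left operand
        generate(node[2], code)            # code for the right operand
        code.append(("ADD",) if kind == "+" else ("MUL",))
    else:
        raise ValueError(f"unknown node: {kind}")
    return code

tree = ("+", ("ident", "x"), ("*", ("number", 42), ("ident", "y")))
for instruction in generate(tree, []):
    print(instruction)
# ('LOAD', 'x') ('PUSH', 42) ('LOAD', 'y') ('MUL',) ('ADD',)
```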
A partitioning of the compilation process into as many parts as possible was the predominant technique
until about 1980, because until then the available store was too small to accommodate the entire
compiler. Only individual compiler parts would fit, and they could be loaded one after the other in
sequence. The parts were called passes, and the whole was called a multipass compiler. The number of
passes was typically 4 to 6, but reached 70 in one particular case (a compiler for PL/I) known to the author. Typically,
the output of pass k served as input to pass k+1, and the disk served as intermediate storage.
The very frequent access to disk storage resulted in long compilation times.