A compiler is a translator which accepts programs in a source language and converts them into programs in another language which is most often the assembly code of a real (or virtual) microprocessor. Designing real compilers is a complex undertaking and requires formal training in Computer Science and Mathematics.
The compilation process can be divided into a number of subtasks called phases. The different phases involved are
Symbol tables and error handlers are involved in all the above phases. Let us look at these stages.
The lexical analyzer reads the source program and emits tokens. Tokens are atomic units, which represent a sequence of characters. They can be treated as single logical entitities. Identifiers, keywords, constants, operators, punctuation symbols etc. are examples of tokens. Consider the C statement,
return 5;
It has 3 tokens in it. The keyword 'return', the constant 5 and the punctuation semicolon.
Tokens from the lexical analyzer are the input to this phase. A set of rules (also known as productions) will be provided for any programming language. This defines the grammar of the language. The syntax analyzer checks whether the given input is a valid one. ie. Whether it is permitted by the given grammar. We can write a small set of productions.
Sentence -> Noun Verb Noun -> boy | girl | bird Verb -> eats | runs | flies
This is a grammar of no practical importance (and also no meaning). But it gives a rough idea about a grammar.
A parser for a grammar takes a string as input. It can produce a parse tree as output. Two types of parsing are possible. Top down parsing and bottom up parsing. The meaning is clear from the names. A bottom up parser starts from the leaves and traverse to the root of the tree.
In the above grammar if we are given "bird flies" as input, it is possible to trace back to the root for a valid 'Sentence'. One type of bottom-up parsing is a "shift-reduce" parsing. The general method used here is to take the input symbols and push it in a stack until the right side of a production appears on top of the stack. In that case it is reduced with the left side. Thus, it consists of shifting the input and reducing it when possible. An LR (left to right) bottom up parser can be used.
Once the syntactic constructs are determined, the compiler can generate object code for each construct. But the compiler creates an intermediate form. It helps in code optimization and also to make a clear-cut separation between machine independent phases (lexical, syntax) and machine dependent phases (optimization, code generation).
One form of intermediate code is a parse tree. A parse tree may contain variables as the terminal nodes. A binary operator will be having a left and right branch for operand1 and operand2.
Another form of intermediate code is a three-address code. It has got a general structure of A = B op C, where A, B and C can be names, constants, temporary names etc. op can be any operator. Postfix notation is yet another form of intermediate code.
Optimization involves the technique of improving the object code created from the source program. A large number of object codes can be produced from the source program. Some of the object codes may be comparatively better. Optimization is a search for a better one (may not be the best, but better).
A number of techniques are used for the optimization. Arithmetic simplification, Constant folding are a few among them. Loops are a major target of this phase. It is mainly because of the large amount of time spent by the program in inner loops. Invariants are removed from the loop.
The code generation phase converts the intermediate code generated into a sequence of machine instructions. If we are using simple routines for code generation, it may lead to a number of redundant loads and stores. Such inefficient resource utilization should be avoided. A good code generator uses its registers efficiently.
A large number of names (such as variable names) will be appearing in the source program. A compiler needs to collect information about these names and use them properly. A data structure used for this purpose is known as a symbol table. All the phases of the compiler use the symbol table in one way or other.
Symbol table can be implemented in many ways. It ranges from the simple arrays to the complex hashing methods. We have to insert new names and information into the symbol table and also recover them as and when required.
A good compiler should be capable of detecting and reporting errors to the user in a most efficient manner. The error messages should be highly understandable and flexible. Errors can be caused because of a number of reasons ranging from simple typing mistakes to complex errors included by the compiler (which should be avoided at any cost).