A compiler is a special program that processes statements written in a particular programming language and turns them into machine language or “code” that a computer’s processor uses. As a programmer you can write language statements in many languages, but in this case we will analyze what happens to a .c file written in C. The file that is created contains what are called the source statements. The programmer then runs the appropriate language compiler, specifying the name of the file that contains the source statements.
Say wha — ?
Let’s start at the beginning and go through the compilation process step by step, to make it friendlier and easier for us to understand (kind of what the compiler does, but inverted: first clue of what a compiler does, ha!).
As stated above, the compiler will process a file written in a readable language and turn it into machine language. We, as humans (assuming you, reader, are human), write in languages we can comprehend. Machines, on the other hand, cannot understand, interpret and run our languages directly; they have one of their own. This language is called machine code, and it is stored in binary, which you’ve probably heard about. Binary is made entirely of 0’s and 1’s, and that is all the computer needs to understand what we expressed, in order to run it or process it.
So, how will a program turn my code, full of words, characters, numbers, comments and spaces, into 0’s and 1’s?
When executing (let’s start using the technical language — this one means running), the compiler first parses (or analyzes) all of the language statements (lines in your code) one after the other and then, in one or more successive stages or “passes”, builds the output code, making sure that statements that refer to other statements are referred to correctly in the final code.
The compiler makes this transformation in four steps. Each step brings the code closer to a format that the machine understands.
The steps the compiler goes through are: Preprocessor, Compiler, Assembler and Linker.
In the first stage, preprocessing, lines starting with a # character are interpreted by the preprocessor as preprocessor commands. These commands form a simple language of their own, with its own syntax and semantics. This language is used to reduce repetition in source code by providing functionality to include other files, define macros, and conditionally omit code.
Before interpreting commands, the preprocessor does some initial processing. This includes joining continued lines (lines ending with a \) and stripping comments.
Therefore, the Preprocessor will:
- Remove comments
- Expand macros
- Expand the included files
- Perform conditional compilation
The second stage of compilation is confusingly enough called compilation. In this stage, the preprocessed code is translated to assembly instructions specific to the target processor architecture. These form an intermediate human readable language — we will see examples below.
The existence of this step allows for C code to contain inline assembly instructions and for different assemblers to be used.
The output of this stage is the assembly code, which is then passed on to the assembler.
During the third stage, assembly, an assembler is used to translate the assembly instructions to object code. Object code does not refer to an object as in a “thing”, the way we normally use that word; it refers to machine-readable language. The output consists of actual instructions to be run by the target processor. This step finishes the process of turning the code into binary.
The object code generated in the assembly stage is composed of machine instructions that the processor understands but some pieces of the program are out of order or missing. To produce an executable program, the existing pieces must be rearranged and the missing ones filled in. This process is called linking.
The linker will arrange the pieces of object code so that functions in some pieces can successfully call functions in other ones. It will also add pieces containing the instructions for library functions used by the program.
Now that we have covered the theory, let’s get into the how-to.
The command gcc (which stands for GNU Compiler Collection) is the one we will use to compile a file. In this walkthrough, the file we pass to the command will always be a “.c” file.
When run by itself, “gcc main.c” will run the whole four-step process on my “main.c” file and write a default output file called “a.out”. When the output file is not specified, the compiled file will always get this name, so if the gcc command is run again with another “.c” file, this output file will be overwritten.
To avoid this, we can specify the name of the output file by adding the option “-o” followed by the output file name.
gcc main.c -o main
If we look at the contents of my original main.c file, we will see normal, commented code:
Now, let’s look at the contents of the output after running the whole gcc compiler:
Uhm…. Unreadable, right?
Yes! Because we, as humans, are not intended to comprehend this file, but the machine is.
The gcc command does not have to run the whole process at once; we can also run it step by step and see the result of the compilation after each step.
To only preprocess, without compiling, assembling or linking, we run: gcc -E main.c (attention! gcc options are case sensitive, so -E is not the same as -e). By default the result is printed to the terminal; to save it to a file instead, we can add the -o option, for example: gcc -E main.c -o main.i
After running the command with the -E option, we can look at the result:
Our code has undoubtedly changed, but it is still sort of readable. We might not understand it as code, but we can still identify the contents, libraries, names and more.
Now, to stop right after the Compiler step, we will run gcc -S main.c and look inside the output file, which by default is named main.s:
Once again, the format has changed. This time, it is a bit less readable as code, but we recognize columns and some names. Other parts of this output file are already unreadable: a combination of characters that does not make logical sense to us. This is the assembly code. When it is run through the assembler, it will be translated to binary. Let’s do it!
To stop the compilation right before the Linker, we must run: gcc -c main.c (with a lowercase c!), and look inside the resulting main.o file:
As we can see, the output is now almost entirely unreadable, but shorter than the first example of the full compilation. The machine understands this as binary, even though the text editor I’m opening it with does not show exclusively 0’s and 1’s (it tries to display each byte as a character instead).
And that is it!
To wrap up the whole compilation process and conclude this post: we now know, step by step, what happens to our files once they are run through the compiler.
In conclusion, gcc performs the compilation step to build a program, and then it calls other programs to assemble the program and to link the program’s component parts into an executable program that you can run. If needed, this process can be stopped between any two steps.
And now, you know a bit more about C and the compilation process.