Everybody in the programming world has heard of compilers. But how many developers actually know how they work? Under the skin of every compiler lies a parsing mechanism of dizzying complexity.
But let’s start with the basics first:
What is a compiler?
A compiler is a program that converts high-level code into lower-level code. Depending on the compiler in question, the lower-level code is usually assembly or bytecode.
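You can see this for yourself on the JVM: compile a trivial class with javac and disassemble it with javap. The class below is just an illustration, and the exact bytecode may vary slightly between compiler versions.

```java
// Answer.java – a trivial method to compile and inspect
public class Answer {
    int answer() {
        return 42;
    }
}
```

Running javac Answer.java followed by javap -c Answer prints the bytecode for the method, roughly:

```
int answer();
  Code:
     0: bipush        42
     2: ireturn
```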
What is a transpiler?
A transpiler differs from a compiler in that it converts one language into another on the same level of abstraction (for example TypeScript -> JavaScript) rather than into machine code.
What is a language?
A language is a set of strings of characters that obey rules imposed on them. The rules in question are called a grammar.
Parsing source code
The simple structure of a compiler is presented below. Lexical analysis and syntax analysis together are called the front end and impose a structure upon the source code. The back end takes the parsed representation and generates machine code for the target instruction set and architecture.
Let’s expand a bit on the front-end part of the compiler, which deals with parsing the source code into something meaningful.
The lexer is the first stage and checks that the source code is lexically valid. It checks that every character is supported (some older compilers did not support UTF-8 and would reject unusual symbols). It then converts the sequence of characters into a sequence of lexical tokens (strings with an assigned and thus identified meaning).
Below you can see the process of tokenization, where the lexer tries to determine the meaning of each word (these words are called lexemes). The lexer does not care whether the language makes sense. Noam Chomsky famously offered the sentence “Colorless green ideas sleep furiously”, which is grammatically correct but does not mean anything. Checking whether the tokens form valid constructs is the parser’s job (deeper semantic checks, such as type checking, come later). The lexing step itself provides little to no error checking.
The parsed lexemes are gathered in a symbol table, a data structure that holds information about the different names appearing in the code, such as their types and where they were declared.
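Conceptually, a symbol table can be as simple as a map from names to the facts the compiler has collected about them. The sketch below (Java 16+) is only illustrative; the Symbol fields are assumptions, not any real compiler’s layout.

```java
import java.util.HashMap;
import java.util.Map;

// A toy symbol table: maps an identifier to what we know about it.
class SymbolTable {
    // Hypothetical record of facts about one identifier.
    record Symbol(String name, String type, int declaredAtLine) {}

    private final Map<String, Symbol> symbols = new HashMap<>();

    void declare(String name, String type, int line) {
        symbols.put(name, new Symbol(name, type, line));
    }

    Symbol lookup(String name) {
        return symbols.get(name); // null if the name was never declared
    }
}
```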
The parser takes the tokens and this symbol information and checks whether they fit correctly with one another. This step typically takes the longest, because the parser needs to recognize the structure of the language and verify that it is allowed.
The example below shows a simple assignment which gives the sp element (what sp is depends on the context) a value of 100. The tokens are simple (a minimal lexer sketch follows the list):
- sp – which, as mentioned above, is potentially a variable (context specific)
- = – the operator that signifies assignment
- 100 – the literal value
- ; – usually the end of a statement
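To make the tokenization concrete, here is a minimal hand-rolled lexer sketch for exactly this kind of statement. It is illustrative only: the token names and single-statement scope are assumptions, and a real lexer handles many more token types.

```java
import java.util.ArrayList;
import java.util.List;

// A toy lexer that turns "sp = 100;" into a list of tokens.
class TinyLexer {
    enum Kind { IDENT, ASSIGN, NUMBER, SEMICOLON }
    record Token(Kind kind, String text) {}

    static List<Token> lex(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c)) {   // identifier, e.g. sp
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(new Token(Kind.IDENT, src.substring(start, i)));
            } else if (Character.isDigit(c)) {    // literal, e.g. 100
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token(Kind.NUMBER, src.substring(start, i)));
            } else if (c == '=') {
                tokens.add(new Token(Kind.ASSIGN, "=")); i++;
            } else if (c == ';') {
                tokens.add(new Token(Kind.SEMICOLON, ";")); i++;
            } else {
                throw new IllegalArgumentException("Unsupported character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the four tokens: IDENT "sp", ASSIGN "=", NUMBER "100", SEMICOLON ";"
        System.out.println(lex("sp = 100;"));
    }
}
```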
Parsing and the Abstract Syntax Tree (AST)
The parser takes this and creates something called an AST (abstract syntax tree), which contains all the information needed for code generation. We have a rule for assignment, the tokens fit its criteria, and the parsing succeeds. This step also does some error checking; for example, we can check whether the sp variable allows integer values. The tree structure is useful because the code itself is hierarchical in nature (class -> method -> statements -> lexemes) and there are several algorithms that can quickly traverse the generated tree and generate the machine code.
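As a sketch, the AST for sp = 100; could be modeled with a couple of node classes, as below. The node names are hypothetical (real compilers use far richer structures), but they show how the parser’s output is a tree rather than a flat token list.

```java
// Hypothetical AST nodes for the statement "sp = 100;".
sealed interface Node permits Assignment, Identifier, IntLiteral {}

record Identifier(String name) implements Node {}                    // sp
record IntLiteral(int value) implements Node {}                      // 100
record Assignment(Identifier target, Node value) implements Node {}  // sp = 100

class AstDemo {
    public static void main(String[] args) {
        // The parser would build this tree from the token stream.
        Node tree = new Assignment(new Identifier("sp"), new IntLiteral(100));
        // Prints: Assignment[target=Identifier[name=sp], value=IntLiteral[value=100]]
        System.out.println(tree);
    }
}
```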
As we mentioned, the language rules are called a grammar (you can think of it as regular expressions on steroids) and are often expressed in something called Backus–Naur form (BNF).
While it seems scary at first, it is not that difficult to understand. We write the allowed structure and the contents of that structure. The example below shows an if statement that has a condition and a body describing what happens if that condition is satisfied.
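A sketch of what such a rule might look like in BNF (the nonterminal names here are illustrative, not taken from any particular language specification):

```
<if-statement> ::= "if" "(" <condition> ")" <block>
<condition>    ::= <expression>
<block>        ::= "{" <statement-list> "}"
```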
Of course, this is a simple example, and when dealing with large programs (10k+ lines) things can sometimes get tricky. Parsing non-trivial text or data is usually done with grammars, since regular expressions are limited: they lack context and cannot, for instance, match arbitrarily nested structures.
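To see that limitation, consider nested parentheses. A grammar handles the nesting with a single recursive rule, something a classic regular expression cannot express:

```
<expr> ::= <number> | "(" <expr> ")"
```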
The best-known business case for grammars is programming languages, but they are not limited to that. ANTLR (a parser generator) is used by Hibernate to parse HQL and generate its SQL queries. Static analysis of code is powered by grammars. I personally deal with transpiling old COBOL code to Java, and this could not happen without grammars. And of course, it is always useful to know what your language does under the hood, even just for curiosity’s sake.