How to Clang Your Dragon 🐉
Building a Compiler with LLVM - by Harlan Haskins
Building a compiler
Alright. Hi, everyone, I'm Harlan Haskins and I am incredibly excited to introduce all of you to the wonderful world of compilers.
What are we going to be doing here today? Well, we're going to be building a small compiler for a programming language called Kaleidoscope. Kaleidoscope is a pretty simple language: it's just got numbers, variables, binary operators, function calls, and if statements. With all of that, you can build an actually pretty powerful little language.
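To give a feel for it, here's a sketch of what Kaleidoscope code can look like, based on the factorial example we'll compile at the end of the talk (the exact surface syntax on the slides may differ slightly):

```
def fact(n)
  if n == 0 then 1
  else n * fact(n - 1);

fact(10);
```

A function definition, an if-else expression, recursion, and a top-level expression that gets evaluated and printed.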
Overview of talk:
We're going to start by going over the basic structure of a compiler. Then we're going to build a lexer and a parser for Kaleidoscope. Then we're going to take that parse data and we're going to compile it to LLVM Intermediate Representation.
What does a compiler do?
At its core, a compiler is just a translator. First, you take the whole program as input from the user and feed it into the lexer.
The lexer's job is then to read the characters in the input and turn them into a token stream, which it then passes to the parser. Tokens are the smallest little bits of syntax in the language, like def or if or a paren. Then, once we pass them to the parser, the parser recognizes and pulls out specific patterns of tokens. It'll look for things like the “def” keyword, a name, and then a paren, and keep going to parse a function. It'll create what's called an Abstract Syntax Tree, which is a structured representation of the whole program.
Now, LLVM itself is a set of libraries for compiler developers, and it's centered around its own programming language called LLVM IR. It's got tons of optimizations and code generation for different target architectures, which is a real problem when you're dealing with compilers.
I know that if I were to write a compiler, I wouldn't want to have to dig through the ISA for the ARM CPU architecture and the x86 CPU architecture and so on. LLVM handles all of that so you don't have to think about CPU-specific details.
The first step for the Kaleidoscope compiler is going to be the lexer. What does the lexer do? The lexer's job is to read the raw input the programmer wrote and find specific patterns in that string. Those patterns are called tokens. The lexer will just keep eating characters and emitting tokens until we hit the end of the input. It sort of looks something like this (**refers to animation in slide**). You can see here that keywords get their own token, like def, if, and else, and there are cases there for numbers, identifiers, and operators.
Compilers usually take this opportunity to record the original location in the source file. That's for things like diagnostics so you can print out the pretty error messages with highlights.
How do we represent these tokens? Well, there's a fixed set of tokens the compiler is going to recognize, and they're all different. That's a pretty good case for an enum. In fact, enums are really useful for lexers and parsers because they capture a closed set of possibilities.
We’ll need cases for all the little bits of punctuation and all the keywords.
We'll want identifiers, we'll want numbers and that's going to hold a Double because everything in Kaleidoscope is a double.
We're going to have binary operators, Kaleidoscope has six operators and I'll leave the rest of mathematics as an exercise for you all after the fact.
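A minimal sketch of that Token enum might look like this. The case names are my own guesses, not necessarily the ones on the slides, and I've picked six arithmetic-and-comparison operators to match the count mentioned above:

```swift
// The six binary operators (my guess at the exact set).
enum BinaryOperator: Character {
    case plus = "+", minus = "-"
    case times = "*", divide = "/"
    case mod = "%", equals = "="
}

enum Token: Equatable {
    // Punctuation
    case leftParen, rightParen, comma, semicolon
    // Keywords
    case def, extern, `if`, then, `else`
    // Everything in Kaleidoscope is a double
    case number(Double)
    case identifier(String)
    case `operator`(BinaryOperator)
}
```

Swift synthesizes `Equatable` for us even with the associated values, which will come in handy when the parser wants to compare tokens.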
Let's get to the lexer itself.
We’re going to make a class called Lexer, and it'll take the input program as a string and keep track of an index as we're going along.
Then we'll have an initializer that takes and saves the input and grabs a startIndex of the string.
We'll have a couple of helper functions. Since we're going to be eating characters one by one, we want a variable that will store the current character, or be nil if we've eaten past the end of the input. We also want a function that will just advance the index. I'm sure you're all keenly aware of the indexing APIs after Soroush's talk.
Now, we can get to the actual meat of the lexer. This is going to be one function that, given where we are, consumes all the characters up to the end of the next token in the input and returns that token. Once we've eaten all the characters in the input, we just return the result.
First, we'll skip all the space characters, and if after we've done that we're at the end of the input, then great, we can bail out. For simplicity's sake, I'm just going to make a dictionary that holds all of the tokens that are representable with a single character. This is probably just the easiest way to match all of these things. If the current character is in this dictionary, then we can just return its token and skip the character.
Otherwise, if we're dealing with an alphanumeric character, then we'll read the next run of alphanumeric characters into a string and try to convert it to a Double using Swift's built-in double parser. If that works, then cool, we have a number token.
If it doesn't work, then we'll switch over that identifier to see if we can find any of those keywords that we defined earlier. If all else fails it's just a regular identifier like a variable or a function reference. If we didn't recognize anything at all then just bail out.
Now, to lex a whole program, we eat all the tokens until we hit the end of the input. That's it for the Lexer. We can now take a user's input and break it into tokens.
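The whole lexer, as just described, can be sketched out like this. This is my own reconstruction, not the exact code from the slides, and the Token type is simplified so the example is self-contained:

```swift
// A self-contained sketch of the Lexer described above.
enum Token: Equatable {
    case leftParen, rightParen, comma, semicolon
    case def, extern, `if`, then, `else`
    case number(Double)
    case identifier(String)
    case `operator`(Character)
}

class Lexer {
    let input: String
    var index: String.Index

    init(input: String) {
        self.input = input
        self.index = input.startIndex
    }

    // The current character, or nil once we've eaten past the end.
    var currentChar: Character? {
        return index < input.endIndex ? input[index] : nil
    }

    func advance() {
        index = input.index(after: index)
    }

    // All tokens representable with a single character.
    let singleCharTokens: [Character: Token] = [
        "(": .leftParen, ")": .rightParen, ",": .comma, ";": .semicolon,
        "+": .operator("+"), "-": .operator("-"), "*": .operator("*"),
        "/": .operator("/"), "%": .operator("%"), "=": .operator("="),
    ]

    func nextToken() -> Token? {
        // Skip whitespace; bail out at the end of the input.
        while let c = currentChar, c.isWhitespace { advance() }
        guard let c = currentChar else { return nil }

        // Single-character tokens: return and skip the character.
        if let token = singleCharTokens[c] {
            advance()
            return token
        }

        // Alphanumeric run: a number, a keyword, or an identifier.
        if c.isLetter || c.isNumber {
            var text = ""
            while let c = currentChar, c.isLetter || c.isNumber || c == "." {
                text.append(c)
                advance()
            }
            if let value = Double(text) { return .number(value) }
            switch text {
            case "def": return .def
            case "extern": return .extern
            case "if": return .if
            case "then": return .then
            case "else": return .else
            default: return .identifier(text)
            }
        }
        return nil // unrecognized character: bail out
    }

    // Lex a whole program: eat tokens until the end of the input.
    func lex() -> [Token] {
        var tokens = [Token]()
        while let token = nextToken() { tokens.append(token) }
        return tokens
    }
}
```

Feeding it `def foo(n) n + 1.5;` yields the def keyword, an identifier, parens around a parameter, and then the body's tokens.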
So next we build the parser. I know I'm moving a little fast, but I want to get to the more important part, which is code generation. Now that we have those tokens, we can parse them into that structured representation of our program, the Abstract Syntax Tree. In this representation, there are three things that can be declared in a file.
You get extern definitions, which declare functions that will be linked into the object later. These are things like sine or absolute value in the C Standard Library. This behaves like extern in C.
We also have function definitions, which define a function that takes a set of arguments and a single expression that it'll resolve to.
Then expressions. When an expression appears at the top level of a file, we just evaluate it and then print its result.
To represent external definitions, we'll use a struct that we're going to call Prototype. A prototype is comprised of a name, which is a string and a list of arguments, which are themselves strings.
Function definitions are represented by another struct called Definition.
Functions have a prototype, so we're going to reuse the Prototype. They also have an expression that references those arguments by name. For example, this function takes a parameter called n and then returns n + 1.
Then we'll represent expressions with another enum. There are five different kinds of expressions, and they're mutually recursive, which means that some expressions have other expressions inside, so we declare the enum as indirect. There are numbers and variables; there are binary operators that have a left-hand-side expression, an operator, and a right-hand-side expression; function calls that have a name and then an array of arguments; and if-else expressions that behave like the ternary operator in Swift.
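Those AST types might be sketched like this (my own reconstruction; the names on the slides may differ):

```swift
// A prototype is a name and a list of argument names.
struct Prototype {
    let name: String
    let params: [String]
}

// A function definition is a prototype plus a body expression.
struct Definition {
    let prototype: Prototype
    let expr: Expr
}

// Expressions are mutually recursive, so the enum is `indirect`.
indirect enum Expr {
    case number(Double)
    case variable(String)
    case binary(Expr, Character, Expr)   // lhs, operator, rhs
    case call(String, [Expr])            // function name, arguments
    case ifelse(Expr, Expr, Expr)        // condition, then, else
}

// For example, a function `def addOne(n) n + 1;` would parse to:
let addOne = Definition(
    prototype: Prototype(name: "addOne", params: ["n"]),
    expr: .binary(.variable("n"), "+", .number(1))
)
```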
The parser is going to receive a list of tokens from the lexer. It's going to be what's called a “Recursive Descent” parser, which means that parsing functions will call into each other, and small parsers will build up larger parsers. For example, the parser for definitions is going to need to use the parser for prototypes and the parser for expressions, because they're sub-objects.
The end goal is to have created the Abstract Syntax Tree by the time we've eaten all the tokens.
We'll also want a structure that will hold all the things that are going to be in a File.
We’ll have an array for the extern definitions, we'll have an array for the functions, and an array for the expressions.
We're also going to maintain a lookup table of function declarations that we can look up by name. We'll see a reason for this when we get to code generation.
We'll make some methods for adding externs and functions and then registering them with that lookup table.
And then a function for adding an expression.
The way that we'll use this, is we'll create an instance of the file and call addExtern, addDefinition, addExpression, as we see them in the input.
Building the parser
Now, we can start building our parser.
First, because parsing can fail we'll make a small enum that's got some error cases. The parser will be structured very similarly to the lexer. It keeps an array of tokens and then eats them one by one. We'll have a var for the current token, and a function for consuming the current token.
Then next, I'll make a quick method that consumes a specific token: you pass it a token that you expect, like if or def, and if the current token is different, it will throw an error.
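The parser's skeleton, plus one of the small parse functions we're about to write, could look like this. Again a sketch of my own, self-contained with a simplified Token type:

```swift
// Simplified token type so the example stands alone.
enum Token: Equatable {
    case leftParen, rightParen, comma, semicolon
    case def, extern, `if`, then, `else`
    case number(Double)
    case identifier(String)
}

// Parsing can fail, so we have a small enum of error cases.
enum ParseError: Error {
    case unexpectedToken(Token)
    case unexpectedEOF
}

class Parser {
    let tokens: [Token]
    var index = 0

    init(tokens: [Token]) { self.tokens = tokens }

    // The current token, or nil once we've eaten them all.
    var currentToken: Token? {
        return index < tokens.count ? tokens[index] : nil
    }

    func advance() { index += 1 }

    // Consume a specific expected token, or throw.
    func parse(_ token: Token) throws {
        guard let current = currentToken else { throw ParseError.unexpectedEOF }
        guard current == token else { throw ParseError.unexpectedToken(current) }
        advance()
    }

    // One of those self-explanatory helpers: eat an identifier.
    func parseIdentifier() throws -> String {
        guard let current = currentToken else { throw ParseError.unexpectedEOF }
        guard case .identifier(let name) = current else {
            throw ParseError.unexpectedToken(current)
        }
        advance()
        return name
    }

    // A prototype: an identifier, then a comma-separated
    // list of identifiers in parens, e.g. foo(a, b).
    func parsePrototype() throws -> (name: String, params: [String]) {
        let name = try parseIdentifier()
        try parse(.leftParen)
        var params = [String]()
        while let current = currentToken, current != .rightParen {
            params.append(try parseIdentifier())
            if currentToken == .comma { advance() }
        }
        try parse(.rightParen)
        return (name, params)
    }
}
```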
Now, we're going to build a little parse function for each bit of syntax in the language. Let's start with function prototypes. I'll be using a couple of helper functions that should be self-explanatory. Prototypes are just an identifier and then a comma-separated list of identifiers in parens. For example, you have def foo, then a parenthesis, and then n.
For externs, we can now use that prototype parser: we eat the extern token, grab the prototype, and then eat a semicolon. This is the structure that the rest of the parsers are going to follow: we'll explicitly consume a token, keyword, or bit of punctuation, and then we'll call other parse functions for patterns we already know how to parse.
Now, we're going to do the most complex parsing, that's Expressions.
Well, first, let's make sure we're not at the end of the token stream; once we know we're not past the end and we have a token, we can switch over it. This is where we're going to start actually matching patterns in the input. First, if we see a number token, we eat it and return a number expression.
If we see a left paren, then we're parsing an expression wrapped in parens: we eat that left paren token, parse the inner expression, and then eat the right paren token. Next, if we see an identifier, there are two possibilities: either we're dealing with a function call, or it's just a regular variable.
The way we know is by looking ahead at the next token: if the next token is a leftParen, then we know we're dealing with a function call, and we can parse the call's arguments, separated by commas. Otherwise, we just return a variable expression.
Then finally, if we see if, it's an if-else expression. In order, we'll parse the condition part of it, we'll parse the then token, we'll parse the true expression, we'll parse the else token and then we'll parse the false expression.
Now, if we see any other token, we'll throw an error. We've got the basic expressions down, but there's one thing left over: binary operators.
We've got this expression that we just parsed, and if we see an operator after it, then we grab that operator token and parse the right-hand side. Otherwise, we just return the expression that we parsed.
Now, we can parse Prototypes and Expressions.
Well, remember, a function definition is just a prototype and an expression, so we can write this too: just parse a prototype, parse an expression, and then eat a semicolon.
Then, to bring it all together: while there are still tokens in the input, switch on the current token. If you see the extern keyword, parse an extern and add it to the file. If you see def, parse a definition and add it to the file. Otherwise, try to parse it as an expression; if that works, grab a semicolon at the end and add the expression to the file.
All right, that's our parser.
Code generation to LLVM IR
You can breathe for a bit. I know that we moved really fast through that, but I want to talk now about the real meat of this compiler; Code generation to LLVM IR. Well, what is LLVM IR really?
Well, LLVM IR is a high-level assembly language; that means it has the semantics of an assembly language.
There are no looping constructs like for-loops or while-loops; there are just gotos, and there are explicit types everywhere. You can see double is just all over, repeated everywhere through that. The reason is that, most of the time, you don't need to think about raw bits and bytes with LLVM. That's been abstracted away; you can talk in these primitive types.
It's structured in what's called a Static Single Assignment form.
In Static Single Assignment or SSA form, each operation’s value is stored in one of an infinite number of registers, unlike a CPU where there's a finite number of registers and you have the possibility to mutate them and clobber existing values and all sorts of garbage that you don't have to think about with LLVM.
These registers are referenced with that percent sign notation. In this example, we're going to perform the fadd or floating point addition instruction with two arguments. The value in the register n and the constant 1, and we're going to store that in the numbered register 0.
Now, we'll take the result from the numbered register 0, perform an fmul or floating point multiplication by the constant 2, and store it in a new numbered register, 1. Same thing: we take the value from 1, perform a floating point division by the constant pi, and store it in register 2. Then we can return the value in register 2.
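That slide's example works out to IR along these lines. This is a sketch of my own; exact register numbering and formatting depend on the compiler (the hex constant is the bit pattern of pi as a double):

```llvm
define double @f(double %n) {
entry:
  %0 = fadd double %n, 1.0
  %1 = fmul double %0, 2.0
  %2 = fdiv double %1, 0x400921FB54442D18   ; pi
  ret double %2
}
```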
Notice that we are not allowed to mutate these. Because of that, LLVM is much better able to analyze the dependencies between the values in registers. That helps it significantly when it's optimizing for CPUs that, for some reason, have a finite number of registers.
How do we use LLVM from Swift?
Now, we need to know: how do we use LLVM from Swift? Right, because LLVM is a C++ library, and C++ and Swift don't talk much. Well, fortunately, the LLVM maintainers foresaw this and wrote a C API. Swift likes C; we all like C, I think. But as we know, C APIs are the slightest bit unsightly from Swift.
Fortunately, I along with my good friend, Robert Widmann, who you might know on Twitter as @CodaFi have been working on an open-source library that makes LLVM significantly easier to generate from Swift. In fact, it's so easy to generate from Swift that we're going to make a class that generates LLVM IR right now. LLVM uses what are called modules to represent the functions and global variables that can exist inside an object file.
In essence, this module will hold the LLVM version of whatever you declared in Kaleidoscope. You use what's called an IRBuilder to build instructions and functions into that module. It's responsible for actually writing out the LLVM code itself, and this generator is going to generate all the declarations that exist in one of those files that we parsed.
We're also going to have a dictionary where we can look up function parameters by name. We're not going to use this at the top level; we're going to use this while we're generating expressions, to know whether a variable has a definition. This lookup table will let us get the IRValue of a named variable. An IRValue is anything in LLVM that can be used in an instruction.
Functions can be IRValues; so can registers, global variables, and other instructions. We'll create an initializer that creates an LLVM module named “kaleidoscope” that's going to parallel our program, and we'll build an IRBuilder into that module. We'll use this builder to build functions and instructions.
Now, we can begin emitting code. By emitting code, by the way, I mean translating a declaration from Kaleidoscope to its LLVM equivalent.
First, let's talk about emitting function prototypes. LLVM supports declaring functions that will be linked in later from another object. This is the same thing as extern in C.
If we emit the prototype for a function without giving it a body, then LLVM just assumes, "Okay, that's going to exist later, and we'll let the linker take care of it." That's how we're going to declare C library functions like sin and abs. First, we're going to check to see if we've emitted this prototype before. This lets us cache functions and avoid duplicate definitions. If we have, then we know that LLVM has a definition for the function, and we can just return it.
Otherwise, we'll add a function to the module that matches that prototype. First, we have to declare the types of the arguments. In Kaleidoscope, luckily, all values are doubles, so we can just create an array of double types that matches how many arguments there are. Now, we'll create the type of the new function. Function types in LLVM consist of a set of argument types and a return type.
The return type of all functions in Kaleidoscope is, you guessed it, double. Now, we'll ask the builder to add the function to the module with the prototype's name. In other words, we ask the builder to write the LLVM code for this function into the module.
Just to make the emitted code a little bit more readable, we'll assign the names that the user typed in to the LLVM representation of those parameters, and then we return the function. At the end, it's going to look a little bit like this:
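For a hypothetical two-argument prototype `foo(a, b)`, the emitted declaration would look something like this (a sketch; the slide's example may differ):

```llvm
declare double @foo(double %a, double %b)
```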
Now that we figured out how to declare functions, let's work on filling those functions in.
First, we'll emit the prototype of the definition and get back a function object that we can add a body to. Next, we'll go through all the parameters of the function object and put them in our lookup table that we mentioned earlier.
This will allow us to get the value associated with a parameter name that we can use in the function body.
Let's talk about the contents of a function. LLVM functions are made up of one or more basic blocks. A basic block is a sequence of instructions that begins with a label.
In this example, there are two basic blocks: one called entry at the beginning of the function, and one called next. Unlike regular assembly, LLVM doesn't fall through to the next basic block when it reaches the end of one; basic blocks are required to end with a terminator instruction, either a return or a branch.
The entry block has an unconditional branch to the next block, which itself then returns from the function. In this case, we're going to make an entry block and then position the builder to start inserting instructions at the end of this entry block.
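The two-block example just described would look roughly like this in IR (my own sketch of the slide):

```llvm
define double @f() {
entry:
  br label %next        ; no fall-through: every block needs a terminator

next:
  ret double 0.0
}
```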
Now, we're going to emit the code for the expression in the function's body.
We call emitExpr, which we'll define later. Then, to finish off the function, we build a return instruction, passing the expression we just emitted. Now that we've emitted the body, our function is complete, and we can reset that lookup table, because we no longer need to reference that function's parameters.
Now, emitting expressions. Remember this is the most complicated part. This is where the fun really comes in. We'll start with numbers. Numbers are easy. Everything in Kaleidoscope is a double, so if we see a number we ask the double type to give us a constant with that value.
So, variables: this is where that lookup table comes in. We'll use it for our first bit of error checking: if a variable referenced in the expression is not in the lookup table, we have to throw an error, because that variable has no definition. Otherwise, if it does exist, we can just return the value of that variable.
Next, function calls. First, we ensure that a function matching the call exists, using that lookup table of prototypes in the File that we made in our parser, and then we ensure that the number of arguments is correct. These are the kinds of correctness checks a compiler has to deal with.
Swift has significantly more of these checks. Now, we'll emit the code for each of the arguments and collect their values in an array.
Swift gets a win here, by the way, because map is fantastic.
Now, that we've ensured the function exists, ensured that the expression is valid and emitted the code for the argument values, we just need to build a call instruction to that function with those argument values.
Alright, so binary operators. Well, remember they've got a left hand side, an operator and a right hand side.
We'll emit the code for the left- and right-hand sides, so we have the data to pass into the combining instruction, and then we'll just switch over the operator. If we see the plus operator, we'll build an add instruction; if we see minus, a sub instruction; for divide, a div; and so on. You get it.
The last one is a little trickier: equals, equality. We're going to build a floating point comparison using the orderedEqual predicate, which is basically just: are these things equal? This is going to return a value that is a one-bit integer. True or false, one bit.
It will be 0 if they're not equal and 1 if they are. But because expressions need to be floating point, we're going to convert that one-bit integer to a floating point number using the IntToFP instruction. That's it for binary operators.
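In the emitted IR, an equality like `a == b` would come out roughly like this (a sketch; `oeq` is the ordered-equal predicate):

```llvm
%0 = fcmp oeq double %a, %b      ; i1: 1 if equal, 0 if not
%1 = uitofp i1 %0 to double      ; back to Kaleidoscope's double
```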
The last piece of the puzzle is those if-else expressions. These are special because one of the branches will possibly not be executed. The way we create that kind of control flow is by adding more basic blocks.
So, like in C, the only value in Kaleidoscope that is false is 0. With that in mind, let's consider this simple Kaleidoscope code: if n is not 0, then return n + 1; otherwise, return 0.
This will result in the following simple LLVM code:
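Reconstructed from the walkthrough below, the slide's IR would look something like this (`one` is the ordered not-equal predicate; register numbering is a sketch):

```llvm
define double @foo(double %n) {
entry:
  %0 = fcmp one double %n, 0.0
  br i1 %0, label %then, label %else

then:
  %1 = fadd double %n, 1.0
  br label %merge

else:
  br label %merge

merge:
  %2 = phi double [ %1, %then ], [ 0.0, %else ]
  ret double %2
}
```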
We perform a not-equal comparison with 0. If that's true, we branch to a block called then; otherwise, we branch to a block called else. In the then block, we perform an add instruction with n and 1 and keep the result around in register 1.
In the else block, we don't need to do anything special, because all it does is declare the constant 0. LLVM will propagate the constant down to where it's actually used.
Both blocks will end in a branch to this final merge block. Once we get to the merge block, we'll construct a phi node that will contain the result of the if statement.
A phi node is an instruction whose value depends on where we just came from. It's kind of complicated. If we look at this phi node, it's got two arguments, each of which looks like an array with two objects. This is just saying: if you came from the then block, you'll have the value in register 1; if you came from the else block, you'll have the value 0.
Alright, so let's write the Swift that will create those if statements.
The first thing we'll do is emit code for the condition expression of the if statement. That means we'll emit the code for the expression, then compare it with a constant 0 with an FCmp instruction, using the NotEqual predicate, to determine if the expression is actually true, as in non-zero. Then, we'll add those basic blocks up front: then, else, and merge. We'll build a conditional branch: if the condition is true, branch to the then block; otherwise, branch to the else block.
We'll position the builder so it's at the end of the then block and then emit whatever was inside the then case of the if statement. Then we'll branch to the merge block and we'll do the same thing for the else block. This is just creating exactly that structure that we saw before.
Then we'll move now to the bottom of the merge block, make that phi node and give it an incoming pair with the then value coming from the then block and the else value coming from the else block. Then we return the phi node and we are done with expressions.
The last bit of special behavior we'll need is a way to handle expressions that are at the file level. In Kaleidoscope, all expressions at file level are executed one by one, and then their values are printed. To achieve that, we'll need a main function that will just call printf with each expression's value.
This parallels the main function that you write in languages like C. We'll declare main as a function with no arguments that returns void and add it to the module. Then we'll ask the builder to make us a global string: the format string that you would normally pass into printf. We'll also emit a declaration for printf that we can call.
Then we just go through the expressions one by one, emit the code for each expression, build a call to printf with the format string and the expression's value as the two arguments, then return void, and we're done with main.
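For a file with a single top-level expression like `foo(1);`, the emitted main would look roughly like this. This is a sketch of my own; the format string's contents and the pointer syntax (shown here in older typed-pointer style) vary by LLVM version:

```llvm
@fmt = private constant [4 x i8] c"%f\0A\00"

declare i32 @printf(i8*, ...)
declare double @foo(double)

define void @main() {
entry:
  %0 = call double @foo(double 1.0)
  %fmtPtr = getelementptr [4 x i8], [4 x i8]* @fmt, i32 0, i32 0
  %1 = call i32 (i8*, ...) @printf(i8* %fmtPtr, double %0)
  ret void
}
```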
Now, the last bit, the absolute last bit is putting all of this together.
So we'll emit all the externs in the file, we'll emit all the functions in the file, and then we'll emit that main.
We have just built a full compiler in 26 minutes.
Now, I'd love to give you a demo. Alright, so I have a pre-baked Kaleidoscope compiler in the oven, and we are going to execute it. First, we'll look at the Kaleidoscope code that we're writing.
Here we have a function definition for factorial.
It's a standard recursive factorial: if the n you pass in is 0, then return 1; otherwise, return n times the factorial of n - 1. Then, we're just going to print the factorials from 1 to 10. All right, so we're going to run ksc and pass in fact.ks.
This is going to emit both the LLVM IR that we're supposed to print and it's going to just go ahead and emit an object file that we can link later.
Let me look at this fact.ll file.
Well, this is the LLVM IR that we just generated. You have the then block, the else block, and the merge block. If we look at it next to the original Kaleidoscope code, it has a very similar structure.
Really, you can just look at this LLVM IR, and if you get good enough you can figure out what it's doing without having to look back at the original code.
Now, we have this fact.o object file. We're going to use clang to link that object file with the standard library, and then we'll have a binary called fact. If we execute this binary, which is what our brand new compiler has just created for us, we get all the factorial numbers from 1 to 10.
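The demo's steps amount to something like this at the shell (a sketch; ksc is the compiler binary from the demo, and the exact flags it takes are not shown in the talk):

```
ksc fact.ks           # emits fact.ll and fact.o
clang fact.o -o fact  # link against the C standard library
./fact                # prints the factorials from 1 to 10
```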
If you'd like to, you can find the full project we built today on the trill-lang GitHub page, Trill is a compiler that I'm working on and we also have LLVMSwift, the LLVM module and some modules for working with Clang up there. That's it. Thank you very much.