1. Numbers

Simplicity is a prerequisite of reliability.
— Edsger Wybe Dijkstra

This chapter builds up the foundation for our compiler. First, it revisits what compilers actually are and proceeds to give some preliminary background on both OCaml^{1 1 A functional programming language that isn’t daunting.} and Assembly^{2 2 This is - intuitively - what a processor would run, with some heavy handwaving. Assembly is just a textual representation, but processors read binary, Ones and Zeros. That is why you’d also need an Assembler, which turns the text into a representation a particular processor would be able to process.}. Second, the chapter demonstrates an implementation for a query-based compiler of an oversimplified programming lanugage consisting only of numbers. Third, it is shown how to implement a language server on top of the compiler to get basic tooling working.

Without further ado, let us start the journey.

Programming in Ocaml

This section will set up Ocaml, install relevant packages, and set up a bare-bones project.

Installing Ocaml

First and foremost, this book uses the Ocaml programming language. I more or less assume that you use some kind of Linux. More often than not, packages are quite outdated in the system-wide package manager, so it is best to install Ocaml using the opam package manager, which you can obtain from the package manager of your Linux distribution. For example, you could just run sudo apt install opam or sudo pacman -Syyu opam. After installing opam, install ocaml, some goodies, and the dune build system:

opam install ocaml ounit2 utop batteries dune

ocaml - the Ocaml compiler.
ounit2 - library for running tests.
utop - very nice REPL for Ocaml.
batteries - an add-on to the standard library.
dune - opinionated build system, so that you don’t have to write makefiles.

Setting up an Ocaml Project

Create a new project named toad via the command $ dune init proj toad. This creates a new folder titled toad with the following structure:

.
├── _build
│   └── log
├── bin
│   ├── dune
│   └── main.ml
├── dune-project
├── lib
│   └── dune
├── test
│   ├── dune
│   └── test_toad.ml
└── toad.opam

Snippet 1. The folder structure after project initialization.

Open the file dune-project and delete all but the first line, since the others won’t be needed for this project. The whole file toad.opam can be deleted, it is only relevant if you want to publish a package. The _build directory will contain the binary files after compilation with the Ocaml compiler. bin is the folder containing source files for the main executable. If you don’t like the name and would prefer something like src, you can just rename it without hassle. Likewise, lib contains source files for some library that the executable defined in bin will link to. The split between executable part and library part can make sense to enhance modularity: There can be several different executables that rely on the same core functionality provided by the library. This is useful when developing compilers to also add tools like formatters, language servers, linters, and so on, which all use the same core code. Lastly, test (will) contain tests.^{3 3 That is, if you add some…} Oh, and you may want to grab a proper .gitignore in case you use git for version control, which you should.^{4 4 If you dislike git as much as I do, be invited to use a different tool like pijul or jj.}

dune files instruct the build system what to do and apply relative to the directory they are positioned in. For example:

(executable
 (public_name toad)
 (name main)
 (libraries toad))

Snippet 2. Contents of toad/bin/dune.

This defines an executable at the level of the bin folder, which will pull all *.ml files within it for compilation into an executable. The name of the executable is main and after running dune build, you will find it as toad/_build/default/bin/main.exe. It links with the toad library defined in toad/lib/dune and the public_name is the name of the executable after installation. You can find that renamed version in toad/_build/install/bin/toad. Go ahead and run the binary:

λ ~/toad/ dune exec toad 
Hello, World!

If you forgot to dune build prior to dune exec, you should nevertheless see the same output, since dune will just detect that a (re-)build is necessary. For more details on the dune build system and the Ocaml build process, please refer to the dune documentation. A good start is here: dune.readthedocs.io/en/latest/explanation/ocaml-ecosystem.html.

Open toad/bin/main.ml to see the following content:

let () = print_endline "Hello, World!"

The code is almost self-explanatory. The one thing that may be entirely confusing is the let () = , as print_endline could very well be pseudo-code to print a line. Anyways, Ocaml does not have a main function as entry point, but instead just whatever is defined/used in the global context.^{5 5 Similar to Python.} In this case, there is an implicit pattern match on the return value of print_endline, which returns a value of type unit. You could also replace () in the code with either x or _, where _ is a placeholder for “whatever, I don’t care”, i.e., it ignores the value. If you use x instead, you will encounter an unused value declaration warning, which you can avoid by prefixing x with _. () is used instead, since it is the only available instance of type unit.

Exercise

Try to implement and ultimately run FizzBuzz on the number 1729 and print the result to the standard output.

Programming in Assembly

This section will look at RISC-V assembly and how to build your first RISC-V program.

What is Assembly?

In the real world, physics dictates how objects interact with each other. For a computer, it all comes down to tiny electrical connections, where a lower voltage level represents a Zero and a higher voltage level represents a One. This representation is purely arbitrary, human beings thought it may be useful to interpret it this way. By means of interaction between these little wires, complex physical systems can be built, i.e., computer processors. Computer processors are themselves composed of several other systems, such as an arithmetic logic unit (ALU). Without going too deep into this^{6 6 Go play Turing Complete or do nand2tetris.}, the key take-away to carry on here is that there is some way to tell a processor what to do. Telling someone what to do means to give them instructions and at this physical level, this means to have some sequence of Zeros and Ones. Depending on how the physical processor is set up, the encoding as Zeros and Ones may look different. For the sake of giving an example, here is a RISC-V instruction at the machine level, where a 1 represents high voltage and a 0 represents low voltage:

00000010101000110000001010010011

Needless to say, no one wants to write this for anything practical unless they’ve got a good scotch on their desk^{7 7
I do not endorse the consumption of alcohol.}. There was a time where people did program processors directly this way. But, out of desperation to understand the list of instructions more easily, the need for language was born, which could be easier to understand than literal encodings of electrical signals. This kind of language is referred to as Assembly and is architecture-dependent.^{8 8 It is the language of the processor, so it depends on how the processor is physically constructed.} For example, an Intel processor won’t understand instructions for an ARM processor, so the Assembly also looks different.

Sometimes you may see “instruction” being referred to as “mnemonic”.^{9 9 If you are into remembering decimal places of $\pi$ , you already know what a mnemonic is.
It is merely a more efficient encoding of the same information.
For example, when remembering digits of $\pi$ , you could think about a story and associate the digits with certain objects or rooms that you visit in your story.} A mnemonic represents the concrete sequence of bits that indicate what fundamental operation a processor should do, e.g., addition. Most RISC-V mnemonics have exacty three operands. Here, usually, the first operand describes the destination where the result should be stored, while the remaining arguments describe operands.

We will unlock more and more instructions as we move along our journey.

For the machine code example given earlier, the easier to read representation using the mnemonic is as follows:

addi x5, x6, 42

Translating these mnemonics by hand is error prone, tedious, boring.^{10 10 If you want to investigate the encodings, check out this website: https://luplab.gitlab.io/rvcodecjs} Luckily, it has been established in a preceding paragraph that processors are great at running a list of instructions! So, using an Assembler, the process of translating this list of instructions into sequences of bits can be automated. Of course, at some point in history, someone had to write an assembler by hand using machine code. We will rely on existing assembler implementations in this book, too.

Let us take a step back and look at the instruction addi x5, x6, 42 again.

        ┌─┌─┌─┌─┐┌─┌─┌─┌─┐┌─┌─┌─┌─┐
        │0│0│0│0││0│0│1│0││1│0│1│0│
        ├─└─└─└─┘└─└─└─└─┘└─└─└─└─┤
        └───────────┬─────────────┘
               arg: 42   dest: x5
                          ┌────┴────┐           
        ┌─┌─┌─┌─┌─┐┌─┌─┌─┐├─┌─┌─┌─┌─┤┌─┌─┌─┐┌─┌─┌─┌─┐
        │0│0│1│1│0││0│0│0││0│0│1│1│0││0│0│0││0│1│0│1│
        ├─└─└─└─└─┤├─└─└─┤└─└─└─└─└─┘├─└─└─┘└─└─└─└─┤
        └────┬────┘└──┬──┘           └───────┬──────┘
        arg: x6       └───────────┬──────────┘
                      mnemonic: "addi"

Its two arguments are x6 and 42. 42 is a so-called immediate.

So that is why it’s called addi, huh.

Exactly, though there is a limit to its size. You notice that the instruction has 32 bits and there are only 12 bits to use for the integer constant to be added. The range goes from $-2048$ to $2047$ , since it uses two’s complement.

Ah, what was that again?

The goal is to be able to represent both negative and nonnegative numbers^{11 11 Zero is neither negative nor positive, so we cannot say “negative and positive numbers”.} in binary with a limited amount of bits. For two’s complement, you take the left-most bit and use this as highest possible negative number, the remaining stuff is just the positive numbers added onto it. The negative number should be two to the power of (number of bits - 1). So, if you have 12 bits and your number is 1 0001 0000 000, then you have $-2^{11} + 2^{7}=-2048 + 128=-1920$ . Larger numbers are, for now, not necessary.

As for argument x6 (and destination x5), baselise RISC-V has 32^{12 12 Happens to be $2^5$ .} registers, labeled x0, x1, .., x31. A register is a special name for a kind of memory that is extremely small, but also extremely fast! It is very close (physically) to where actual computation happens, so data needs less time to travel from A to B through the wires. Some registers are special and have alternative names, which will be used later on as well, but can safely be ignored for now.

A RISC-V assembly program is mostly just a straightline list of instructions. However, there are some additions. For our purposes, the entry-point of a RISC-V assembly program uses a reserved name for a label: _start^{13 13 The acquainted reader may feel the urge to point out that you may define the entry-point yourself when building an ELF binary or that there are other entry-points depending on the operating system. Luckily, we agreed that we use something Linux-based.}. Labels don’t have any special effect, they are like tags^{14 14 Labels are very similar to git tags, if you are more familiar with those. They just mark a location with a name.}, giving a name to an address. The entry point needs to be visible for the operating system, i.e., it has to be “exposed” so that external code can refer to it. This can be done with the .global directive.

There are also sections. For now, our sole interest is with the .text section. Anything that is inside of .text is essentially a straightline list of instructions. The data inside this .text section, i.e., the list of instructions, is immutable.

With these bare fundamentals, let us now try to write our first RISC-V assembly program.

Setting up the Assembler

Suppose we want to write a RISC-V program that just exits with value 5. Since we are on a Linux-based system, this means that we want to perform the https://www.man7.org/linux/man-pages/man3/exit.3.html system call. A system call constitutes a change in control flow from our application to the operating system. In our case, we want to request the operating system to terminate our process.

Syscalls work by writing a number to a register that corresponds to the syscall. You need to find a table online that contains these numbers, they are different from one architecture to another. Here is one such table: https://gpages.juszkiewicz.com.pl/syscalls-table/syscalls.html From here, we can identify that the value for both riscv32 and riscv64 is 93. To perform the call, we merely have to write this value to the a7/x17 register and then use the ecall instruction.

Previously, we have seen the addi instruction, which can be used to add an immediate value onto the value of a register and store the result in some register^{15 15 Could be the same, could be another register.}. In order to overwrite the contents of a register with just an immediate value, we can simply use the li instruction.

So, to write 93 to a7, we can just li a7, 93. Where do we put the return value 5?

Looking back at the https://www.man7.org/linux/man-pages/man3/exit.3.html syscall documentation, we see that it has one integer argument. How arguments are passed is defined in the application binary interface (ABI). Now, it suffices to know that we can provide exit the argument by simply writing to the a0 register.

Let me try to write the program now. I think it should be:

.global _start
.section .text 
_start:
  li a0, 5 
  li a7, 93
  ecall

Now, how do I run this?

Looks good so far! In order to run this, there are several options, my preferred one would be to run it natively. Not everyone has the hardware, especially RISC-V hardware, so my second preferred option would be to cross-compile and run it inside an emulator that pretends that you have the hardware. I’ll explain how to set this up now, but if you don’t want to set this up on your personal machine, there are also other options. You could use a web-based emulator, of which I can recommend https://cpulator.01xz.net/?sys=rv32.

For setup on your personal device that does not natively support RISC-V, you need to install a bit of software.^{16 16 Again, for sake of my sanity, I assume you have a Linux-based system, but this should probably work with the Linux substystem on Windows as well..? (maybe someone can contact me and give a confirmation or something)} Install gcc-riscv64-unknown-elf^{17 17
clang is natively a cross compiler, but you need to explicitly point to locations of libraries, whereas with gcc, they are packaged with it.}, which should be available via apt or as riscv64-elf-gcc on distros using pacman. After installation, make a text file called main.as with the above assembly code as content. Now run $ riscv64-elf-as main.as -o main.o in order to assemble the assembly file into an object file. To further process this object file into an executable file, link it using $ riscv64-elf-ld main.o -o main.out. That’s it for the compilation process, main.out is an executable:

λ ~/ file main.out
main.out: ELF 64-bit LSB executable, UCB RISC-V, double-float ABI, version 1 (SYSV), statically linked, not stripped

Running this necessitates additional packages to install, assuming you are not on an ARM system. An easy way to run it is by using qemu, an emulator for a large variety of different systems. Install qemu, qemu-user, qemu-system-riscv. To run the program, simply prefix it with qemu-riscv64:

λ ~/ qemu-riscv64 ./main.out
λ ~/ echo $?
5

Looks like everything worked, nice!

Exercise

Adapt the assembly to force an exit with exit code 13.

Baby’s first Compiler

Finally! After these preliminaries, implementing your first compiler will feel easy, I hope. Sure, it is a little annoying if you don’t know much Ocaml (yet?), since you have to lookup how to read a line from a file or whatever. But, the logic of this compiler is so simple, it will almost feel trivial and you already know all the assembly necessary. But for what, actually? What will the compiler compile? Numbers. Or rather, a number. It simply reads a number from standard input, ensures it is within bounds, and generates the above assembly for a RISC-V program returning that number. Ocaml provides the read_line function in order to read a single line from STDIN. It can throw an End_of_file exception if it cannot read a line. Out of my personal opinion about exceptions and maybe out of practice with Ocaml, let us wrap this behavior into a type that represents a result. The result type has two values: Ok, which carries the actual result, and Error, which carries a string representing some information about what went wrong. With that, simply catch the exception and return Error instead:

(** If possible, reads a single line from standard input. *)
let maybe_read_line () =
  try Ok(read_line())
  with End_of_file -> Error "Empty file"

Snippet 3. Reading from standard input in OCaml.

The try..with.. block is basically try and catch from other languages, if you are more familiar with those. You may also wonder why there are no types in the code. That is because Ocaml makes use of type inference, a process we will implement in a later chapter ourselves. Essentially, it deduces the types from code. For example, () has type unit, no questions asked. Likewise, it knows that read_line has type unit -> string, a function taking a unit and returning a string. Combining the bodies of try and with, the final type of this function is (string, string) result, where the first component indicates the type for results and the second component indicates the types for errors. You can think of result as a “function” that, given two types, gives you another type.

The next step in the compilation pipeline after reading input is to make sure the input has the correct shape. Traditionally, this is done in two steps: Lexing (aka scanning aka tokenizing) and parsing. For this simple language, parsing is trivial in the sense that there is nothing to talk about, because it does nothing, the language is too simple. So, the only work necessary in this step is to lex the input string into a “token”^{18 18 Also known as terminal or lexeme}. Input validation will become more complex as this work progresses, but for now, let us both check and convert the input to an integer. Luckily, Ocaml provides a function named int_of_string_opt which attempts to convert a string into a whole number that fits into an int datatype. Needless to say, not all strings are convertible into an integer and the function will return None in those cases. In our setting, a number should neither be negative nor above 255, so there needs to be appropriate checks for these cases as well:

(** Given a string, make sure it is a valid nonnegative number not larger than 255. *)
let number_of_input (x : string) =
  match int_of_string_opt x with
  | Some x -> begin
    if x < 0 then 
      Error "Input is not nonnegative."
    else if x > 255 then
      Error "Input is too large."
    else 
      Ok x
  end
  | None -> Error (Printf.sprintf "Input \"%s\" is not an integer." x)

Snippet 4. Logic

Due to technical reasons, such as operator precedence in Ocaml, and I suppose philosophical reasons, the Ocaml programming language does not have string interpolation. That is, I cannot easily construct a string as done in Python and other languages where I just "I'm %d years old." years or something, so this is why the above function uses the Printf module. ^{19 19 For just a single format specifier, it is possible to define a custom operator: let (%) = Printf.sprintf in "I'm %d years old." % years.}

The Printf module contains a bunch of predefined functions for printing whose API mimicks that of the old C-style printf interface. In turn, it is effectively possible to get string interpolation that way, a pattern that pops up often.

With these two functions at hand, the only work left to do is to emit respective assembly code that, when executed, terminates with the given number as an exit code. Most the code can be hardcoded by our compiler, since the only thing that really changes is the number. The assembly should look familiar.^{20 20 {| and |} delimit a multi-line string in Ocaml.}

(* Main entry *)
let compile read =
  match process_input read with
  | Ok num -> Printf.sprintf {| 
.global _start 
.section .text
_start:
  li a0, %d 
  li a7, 93
  ecall|} num
  | Error msg -> raise (Invalid_argument (Printf.sprintf "Error: %s\n" msg))

Snippet 5. The actual compiler

We can test our compiler^{21 21 This is an interactive demo, unless you deactivated JavaScript. I recommend that you turn it on to play around. As a trust assurance and out of respect, I don’t use analytics. So, maybe just let me know if you read this.}:

Exercise

Compile, assemble, link, and run a program that returns the number 17. Test it and verify that the returned exit code is, in fact, 17.

Something more Practical

This section will extend the compiler to implement a first very basic language server.

…

1. Numbers

Programming in Ocaml

Installing Ocaml

Setting up an Ocaml Project

Exercise

Programming in Assembly

What is Assembly?

Setting up the Assembler

Exercise

Baby’s first Compiler

Exercise

Something more Practical

Footnotes