How to make a compiler for your own programming language – Luca Guagliardo

Introduction

Compilers are the magical back boxes that turn your code into run-able executables. But have you ever wondered what happens under the hood?

Motivation

When working on story focused games, like visual novels, the story can take many twists and turns. To assist in creating these kinds of games there are several tools available, such as Twine and Ink. They allow story writers to write stories that can flow in several directions while mixing in code to add complexity through variables that can change the path a player can take.

In my free time I like contributing to an open source project focused on creating a 2d game engine. One of it's major target groups are visual novel developers. Initially it was planned to have integration with Ink, but was decided against, leaving a gap for a narrative scripting language.

Creating a language

For the language I wanted to target people who did not have too much experience with programming, to keep it simple I made use of Ink's structure as a base for my current iteration. I created a few examples of grammar structures used in the language for special behavior and used them in a questionnaire that I handed out to visual novel writers.

For the scope of this research I focused on 3 parts that together form the fiber of my language: Choices, Jumps and Variables.

I also want to find out what tool(s) they currently use for writing their stories and how experienced they are with programming.

In the questionnaire I used a mixture of open and closed questions, In order to determine preferences for syntax styles I gave three examples where they can rank them to their preferences.

The questionnaire can be viewed at: https://forms.gle/o3KUM4KkuoErE8hP8

After handing out the questionnaire to my target group I found out that:

  • Twine is a tool that is used often in my target group.
  • The range of programming skills varies.
  • There was no clear favorite for choice syntax. When asked for why they chose their favorites it was clear however that being able to separate the choice from the response was more desirable due to being easier to read.
  • It should be possible to hide or show the choice the reader has to make in a story, However there was no common consensus if it should be included by default or not.
  • After the story jumps to a different part, the expected behavior is that it runs to completion.
  • Jumps should be able to be given arguments that can be used to make it more dynamic.
  • Ink style jumps (->target for making the jump, == target == for where the jump leads to) seem to be more favorable on average due to it's ease of understanding given as a common reason.
  • It is quite unanimous that type safety should be enforced.
  • Pascal style type declaration (variable : type) seems to be the favorite.

Brute forced prototype

For my first proof of concept I made use of "Let's Build a Compiler"(1), an old series of blogs showing the ropes of how to create a compiler step by step. I was able to create a simple compiler able to compile simple math equations into a set of instructions by reading characters sequentially and directly forming a syntax tree.

Restructuring

The old compiler had several flaws in its design, it was able to construct code for a simple grammar. But as the need for more rules grew, it quickly becomes unmanagable. Therefore I had to start from scratch, but this time with a better grasp of the steps in between.

The full compilation process can be divided into several steps, by separating them out it becomes easier to work on expanding the language in steps, as the coupling has become looser.

Lexing

By carefully crafting a lexer grammar you are able to cleanly weave contexts in a manner that allows the script to be read and structured in a simple and lean way.

The lexer looks through a stream of characters to structure them into tokens. These tokens can range from numbers to symbols of characters and the resulting token can even change based on the current reading context. These tokens are then passed to the parser to create a parse tree, a structure of tokens and grammars representing the script.

A simple example of reading a token can be a number. Whenever you encounter a number, you continue to read the characters until you hit a white space. You then check if all characters read form a valid number.

Parsing into a parse tree

Once the lexer has given tokens to the parser, it is able to construct a syntax tree based on grammars. These grammars consist of a sequence of expected tokens and even other grammers. An example would be the import grammar

It consists of several tokens to mark the statement, followed by a string literal, characters wrapped by quotation marks. By scanning the tokens it is able to determine if the script was using an import.

With the example line of:

@import "my_import.nvl"

The parser is able to create the following parse tree:

With a more expanded grammar I was able to transform the following script:

@import "external_script.nvl" // imports functions and variables

@characterHappiness : number // declares a variable

// create a code block to mutate variables
@code {
    characterHappiness = 6
}

// raw text will be displayed into the story when called for
Hello you, this is an example of embedding variables into text
@characterHappiness

// stories can have branching with choices
what do you want?
@choice {
    food:
        I want a hamburger.
    money:
        I want a small loan of a million dollars.
}

Into a large parse tree: (click to enlarge)

Although the tree might look imposing, it is important to know that by expanding on the grammar and adding content into the script will result in bigger trees.

Abstract syntax tree

Parse trees are complex structures and are not the easiest to transform freely. As an extra step compilers strip parse trees into a more digestible abstract syntax tree, a simplified representation consisting of statements and values.

For example the following set of statements:

x=1
y=2
3*(x+y)

Can be represented as the following parse tree:

(2)

This can then be expressed as a simplified abstract syntax tree:

(2)

Now that the tree has been simplified it becomes easier to apply transformations to it based on its leaf nodes.

With a simple tree it becomes easier to find places where you can apply several optimizations, suchs as constant folding of leaf nodes in an expression branch. This is done by looking at the leaf nodes to so see if both nodes contain a constant value, it can then replace the expression node with the result. This allows for a smaller code size and a quicker runtime, as there are less computations to execute.

An example of constant folding can be seen with the following syntax tree:

By analyzing the tree, it is possible to see that the 5 and 6 nodes are added together without any variables. By calculating the result and replacing the + node the tree will now look like:

Emitting instructions

In order to emit code, you take the abstract syntax tree and recursively traverse it from top to bottom. Every time you come across a node you generate a specific set of instructions. Depending on why you want to make the compiler, you can either emit direct machine code or instructions for a runtime to execute.

For example the following statement could be emitted into assembly:

function AddSix(x : int) : int 
return x + 6 

A possible representation in an assembly language could look like:

lea eax, [edx+6]
return

This would load the x argument and add 6 and store it into the eax register and returns to the calling location. In reality, these instructions would be serialized into the output file as codes rather than plain text. A runtime can then read these instructions and quickly know exactly know what to do next.

Runtime behavior

Using the following script as an example:

@import "external_script.nvl" // imports functions and variables

@characterHappiness : number // declares a variable

// create a code block to mutate variables
@code {
    characterHappiness = 6
}

// raw text will be displayed into the story when called for
Hello you, this is an example of embedding variables into text
@characterHappiness

// stories can have branching with choices
what do you want?
@choice {
    food:
        I want a hamburger.
    money:
        I want a small loan of a million dollars.
}

The flow of the story can be go as follows:

Conclusion

I was able to finish a first prototype of my compiler, but was unsatisfied with how rough it was to expand on the grammar. Rewriting it proved to be a bigger task than I first imagined, as the grammar grew more complex it became more difficult to parse. I was able to finish the lexing and the parsing steps of my rewritten compiler and do plan on continuing to finish the compiler to complete the whole process. I also want to do more research into ANTLR, a parser generator, to get a more robust and expandable grammar for my language.

References

  1. Crenshaw J. (1988). Let’s Build a Compiler. https://compilers.iecc.com/crenshaw/
  2. Singh, P. (2017, June 6). PowerShell: Tokenization and Abstract Syntax Tree. Geekeefy. https://geekeefy.wordpress.com/2017/06/07/powershell-tokenization-and-abstract-syntax-tree/
  3. Compiler Design - Syntax Analysis. (n.d.). https://www.tutorialspoint.com/compiler_design/compiler_design_syntax_analysis.htm
  4. Compiler Design - Phases of Compiler. (n.d.). https://www.tutorialspoint.com/compiler_design/compiler_design_phases_of_compiler.htm
  5. Compiler Design - Code Generation. (n.d.). https://www.tutorialspoint.com/compiler_design/compiler_design_code_generation.htm

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Posts