Building a programming language on GraalVM (Part 1)
If you’ve followed recent developments in the Java ecosystem, you must know GraalVM. Many blog posts cover how you can build native executables from Java source code. This is indeed exciting: your program starts faster, and consumes less memory. Other popular topics include building Java applications that run code in other languages. You can run JavaScript, Python, Ruby or R code in the JVM. But there’s more: in fact you can run any language inside the JVM! Are you curious how? Continue reading…
Introduction
Out of the box, GraalVM ships with the capability to run JavaScript, Ruby, Python, R, and even C/C++. Many of them (excluding C/C++) run faster inside GraalVM than their native implementations. See Java Developer’s Introduction to GraalVM by Oleg Šelajev at JavaZone 2018 for some numbers.
These implementations build upon a new language implementation framework called Truffle. Truffle lets you build language interpreters that are both simple and highly performant. It does so by leveraging partial evaluation. Regular interpreted languages only produce optimised machine code after your code has ran a few times. By then, the runtime knows which paths will be taken through the code. Partial evaluation, instead, lets the interpreter make assumptions about types and data structures in your program. It will then produce optimised machine code for that. If those assumptions fail, it will go back to interpreted mode, and update its assumption based on what happens.
Do-It-Yourself
GraalVM already supports quite some languages, but what if you want to run some language that is not included? You can implement that yourself!
Let’s take a very small language as an example: Brainfuck. Adding the numbers 5 and 2, and printing the result requires the following code:
++>+++++[<+>-]++++++++[<++++++>-]<.
If you wonder how this works, check out the annotated version.
This language is hard to interpret or write, but easy to implement. After all, it has only 8 commands. Also its machine model is simple: it has a 30,000 byte array, a movable data pointer and one input and one output stream. This makes it an ideal candidate to learn how to implement other programming languages on GraalVM!
Step 1: Lexing
The first step in transforming source code to machine executable code is lexing.
Here, we read the supplied source code and convert its content into a sequence of tokens.
Think of tokens here as characters with an identified meaning.
For example, the three letters for have no meaning in themselves, but a lexer might recognise this as the for
keyword.
In the same sense do we need to recognise the eight characters that have a meaning in Brainfuck: >
, <
, +
, -
, .
, ,
, [
and ]
.
All other characters in a Brainfuck program should be ignored, so that makes our job easy.
First, we define an enumeration of all tokens that the language consists of:
public enum TokenType {
DECREMENT_BYTE('-'),
DECREMENT_DATA_POINTER('<'),
INCREMENT_BYTE('+'),
INCREMENT_DATA_POINTER('>'),
INPUT_BYTE(','),
JUMP_BACKWARD(']'),
JUMP_FORWARD('['),
OUTPUT_BYTE('.');
private char command;
TokenType(final char command) {
this.command = command;
}
public static Optional<TokenType> findType(final char command) {
return Arrays.stream(TokenType.values())
.filter(token -> token.command == command)
.findAny();
}
}
Next, we iterate through the source code.
We can discard all characters that yield an empty Optional
.
The result is a sequence of tokens that we know are valid Brainfuck commands:
final String source = "++>+++++[<+>-]++++++++[<++++++>-]<.";
final Stream<BrainfuckToken> tokens = IntStream.range(0, source.length())
.mapToObj(index -> TokenType.findType(source.charAt(index)).map(BrainfuckToken::new))
.filter(Optional::isPresent)
.map(Optional::get);
Now the bytes in our input file became meaningful: they form a computer program.
Step 2: Parsing
Now that we know the statements that our program is going to run, it’s time to move to the next phase: parsing. In this phase, we take the sequence of tokens and produce a data structure. The data structure that we need to build is an abstract syntax tree (AST).
This is a bit trickier than our first step, for two reasons:
- we need to perform syntax analysis.
- we need a design for our AST.
1. Syntax analysis
Let’s take a look at the [
and ]
commands in Brainfuck.
If the program reaches [
, it might need to jump to the first command after the corresponding ]
.
Likewise, if the program reaches ]
, it might need to jump back to the first command after the corresponding [
.
But what if there is no corresponding square bracket?
Our program would be invalid and it would be impossible to ever execute it.
One of the responsibilities of a parser is that it must check if all rules of the grammar are obeyed. Here, grammar is the same as in natural languages: the rules that describe how a language can be used. In case of a Brainfuck parser, we can implement it as follows:
// We keep a stack of jumps that our program performs.
final Deque<BFJumpNode> jumps = new ArrayDeque<>();
for (final BrainfuckToken token : tokens) {
if (token.token == JUMP_BACKWARD) {
if (jumps.isEmpty()) {
throw parseError("Found ] without matching [");
} else {
jumps.pop();
continue;
}
}
// Parsing continues
final BFCommandNode command = ...
if (command instanceof BFJumpNode) {
jumps.push((BFJumpNode) command);
}
}
if (!jumps.isEmpty()) {
throw parseError("Found [ without matching ]");
}
This code makes sure that you can only have a ]
if there was at least one [
preceding it.
Also, at the end, it checks whether there have been an exact equal amount of [
and ]
.
If this doesn’t yield any errors, we are sure that the input can be executed.
Note that this doesn’t mean the program is correct: there can still be bugs in the code.
Also, it doesn’t mean that the program will terminate: there might be an endless loop in the code.
2. Designing an AST
As you might’ve seen, the lexical tokens distinguish between jumping back and forth, but our AST nodes do not. This brings us to the second challenge: thinking of the right structure for our AST. In my experience, having a bad design will lead to very strange bugs when running your program.
For example, in earlier versions of my Brainfuck project there were different nodes for jumping back and forth. Those nodes would point to each other. Depending on the jump condition, they would invoke each other:
BFNode correspondingNode = ...
if (/* jump condition */) {
correspondingNode.execute();
} else {
for (BFNode child : children) {
child.execute();
}
}
For simple programs this seemed to work, but more complex programs failed miserably.
In this archive you’ll find all kinds of programs.
They calculate Pi, print the 99 Bottles of Beer song or draw Sierpinski triangles.
Such programs quickly led to an StackOverflowError
, which typically means that your program recurses too deeply.
In hindsight, this was no surprose, since every time the loop would re-execute, it would put a new frame on the stack.
I ended up with a BFJumpNode
class.
This class models the pair of conditional jumps including all statements between them.
So for the simple program ++[-+-]+.
, the AST would look like this.
To my knowledge, there is no well-defined process for this step. I think its a matter of trial and error to get the design right.
Integrating with Truffle
With GraalVM comes Truffle, a library for building programming language implementations. Now the next thing we need to do is make this all work with Truffle. That is enough work for another blog post, so stay tuned for more!
Some practical notes
GraalVM comes in two flavours: a Community Edition (CE) and an Enterprise Edition (EE). The EE claims to have even better performance, but should be functionally equal to the CE. As of now, GraalVM CE is only available for Linux and macOS. It extends the Java 8 platform, although work is underway to support Java 11 as well. Also, it only runs in 64-bit environments. GraalVM on Windows is under development, and you can find a preview at Oracle Labs.