Notes on implementing a Lox tree walking interpreter in Rust

2021-04-07

Overview

I've programmed a lot in interpreted languages and have been curious to learn more about how they work. Sure, I knew the basics: scan, parse, then do what the AST says to do, but I didn't have a firm grasp of that last part. I came across the Crafting Interpreters book and thought it would be a good opportunity to learn more. This document collects my thoughts from following along with the first half of the book, where you implement a tree walking interpreter for the language Lox. The book uses Java, but I chose to implement my interpreter in Rust.

Things I learned

A lot of what I learned is about the internal workings of an interpreter. I won't repeat those here, since the Crafting Interpreters book explains them better than I can. Instead, I will describe some small issues I encountered that I thought were notable and that the book doesn't go into, mostly details of how my implementation differs from the book's Java implementation.

I was already familiar with scanning and parsing, since I've written compiler front ends in the past, but I really enjoyed using Rust to do it. I chose not to use any external crates and hand wrote a recursive descent parser following the book's example. I especially enjoyed using Rust's algebraic data types to represent tokens and AST nodes. Because the book implements the interpreter in Java, it uses the visitor pattern to walk the AST. Originally I started doing something similar by creating a Visitor trait, but I ended up ditching it and just using match. After this experience, I love algebraic data types and match expressions: they make it so easy to encode your logic into the type system. Now whenever I use another language, I really miss those Rust features.
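To give a rough idea of what that looks like, here is a minimal sketch (not my actual AST, which has many more node types):

// A minimal sketch, not the actual AST from my interpreter.
enum Expr {
	Number(f64),
	Negate(Box<Expr>),
	Add(Box<Expr>, Box<Expr>),
}

// One match expression instead of a Visitor trait with a method per
// node type. If a new variant is added to Expr and isn't handled
// here, the compiler refuses to build.
fn evaluate(expr: &Expr) -> f64 {
	match expr {
		Expr::Number(n) => *n,
		Expr::Negate(inner) => -evaluate(inner),
		Expr::Add(left, right) => evaluate(left) + evaluate(right),
	}
}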

At other times I tried to encode business logic into the type system and failed. Consider the following pseudo-Lox code:

var x = 5;
{
	var x = 2;
	print x;
}

After parsing, but before interpreting, there is a resolver phase. It's a semantic pass over the AST that resolves each use of a variable to the exact declaration it refers to; in the example above, 2 should be printed, not 5. I was familiar with the basic approach of walking up a linked list of scopes, but Lox also has closures and other trickier features, so the resolver is a bit more involved. It was fun and interesting to learn how to implement those features with a semantic pass.
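The core of that scope walk looks roughly like this (a sketch with made-up field names; my real resolver also has to track function and class context and the other trickier cases):

use std::collections::HashMap;

// Sketch of scope resolution. The innermost scope is the last element
// of the Vec; resolving a name means walking outward until we find it.
struct Resolver {
	scopes: Vec<HashMap<String, bool>>, // name -> finished initializing? (as in the book)
}

impl Resolver {
	fn begin_scope(&mut self) {
		self.scopes.push(HashMap::new());
	}

	fn end_scope(&mut self) {
		self.scopes.pop();
	}

	fn declare(&mut self, name: &str) {
		if let Some(scope) = self.scopes.last_mut() {
			scope.insert(name.to_string(), true);
		}
	}

	// Returns how many scopes we had to hop outward to find the
	// declaration, which is what the interpreter uses later.
	fn resolve_depth(&self, name: &str) -> Option<usize> {
		self.scopes.iter().rev().position(|s| s.contains_key(name))
	}
}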

After the resolver runs, the interpreter consults the map the resolver created whenever it comes across a variable usage. The resolver builds a map from tokens to their resolutions; in Rust it's a HashMap, and using it led to a lot of lines like map.get(...).unwrap(). One thing I wanted to do was use Rust's type system to express that the get call would always return a resolved value, but I didn't end up doing that; maybe it isn't expressible, or something like that. Since I know I wrote the resolver correctly, I know those get calls will return Some, not None, so ideally the Rust type system would be able to enforce that.
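The pattern in question looked roughly like this (illustrative key and value types, not my exact ones):

use std::collections::HashMap;

// Sketch of the lookup, with illustrative key and value types.
struct Interpreter {
	// Filled in by the resolver before interpretation starts.
	locals: HashMap<usize, usize>, // variable-usage id -> scope depth
}

impl Interpreter {
	fn lookup_depth(&self, usage_id: usize) -> usize {
		// The resolver guarantees this entry exists, but the type
		// system can't see that guarantee, so every call site ends
		// up with an unwrap()/expect().
		*self.locals
			.get(&usage_id)
			.expect("resolver should have resolved this variable")
	}
}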

Issues of ownership

In Rust memory is managed with ownership. Every variable in Rust has an owner.

Who owns the lexeme? Input can come from a file or from standard in if the user is running the REPL mode. We need to save lexemes: identifiers, for example, need their lexeme to figure out what they refer to. Originally I had tokens store the lexeme as a reference to a slice of the input (file or standard in), but this became an issue when making the REPL work.

Consider the case where a user types one line of code into the REPL. The interpreter needs to keep track of what that line did, because maybe it declared a variable that later lines of input will use. However, my implementation was roughly:

let mut interpreter = Interpreter::new();
loop {
	// `input` only lives for this iteration of the loop...
	let input = read_input();
	// ...but tokens borrowing from it would need to outlive it.
	interpreter.interpret(&input);
}

This meant that input was dropped after one read-evaluate-print cycle, so any token borrowing from it couldn't outlive that cycle. I ended up having the tokens store a full owned String for the lexeme. Originally I was trying to keep it a reference because references are cheaper to copy around, but then again, this was just a toy tree walking interpreter. I was trying to learn how interpreters work, not make a performant one; if performance were the goal, I should have been following the second half of the book, where you implement a bytecode interpreter. And even then, it was only guesswork on my part that an owned String would be slower than a &str. As they say: premature optimization without profiling is bad.
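Concretely, the change was roughly the difference between these two definitions (illustrative structs, not my exact Token type):

// Original attempt: the token borrows from the source buffer, so it
// cannot outlive the String holding that REPL line or file.
struct BorrowedToken<'src> {
	lexeme: &'src str,
}

// What I ended up with: the token owns its lexeme, so it can live as
// long as the interpreter needs it to.
struct OwnedToken {
	lexeme: String,
}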

clone() everywhere?

The tree walking interpreter described in the first half of the Crafting Interpreters book is written in Java, and the author relies on the Java runtime to garbage collect the interpreter's objects. My Rust implementation instead does a lot of copying and clone()-ing. I would need to rearchitect it to avoid these unnecessary copies, but I chose not to, since this is a toy interpreter and I was following along with the book. The goal of this project was to learn more about interpreters, not to make a fast interpreter.

For example, the way I implemented functions ended up requiring unnecessary clone() calls everywhere. Since this is a tree walking interpreter, when a function call takes place, the interpreter visits the AST from inside the function's declaration.

e.g., in pseudocode:

fn visit_function_call(f: LoxFunction) {
	visit(f.declaration_ast);
}

When a function declaration is visited, it creates that LoxFunction, and since the LoxFunction needs the AST of the declaration, I ended up cloning that portion of the AST and storing it as a member of the LoxFunction struct. This can involve a lot of copying if the declared function is large: each AST node can reference a bunch of other AST nodes, so you can end up copying a huge subtree of the AST. Since I was following along with a Java program, this was the straightforward approach to take, but looking back at it, the Java implementation was just copying Object references. I should have made the LoxFunction hold a reference to the AST node instead. Of course, then I would have had to make sure all the lifetimes lined up correctly, since LoxFunction itself is clone-able (in Lox, functions are first class values).
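One way to get that kind of sharing in Rust without juggling lifetimes would be a reference-counted pointer. Here is a sketch under that assumption, with made-up type names, not a drop-in fix for my code:

use std::rc::Rc;

// Assumed AST node type for this sketch.
#[derive(Clone)]
struct Stmt { /* body of a function declaration, etc. */ }

// What I did: each LoxFunction owns a full copy of its declaration,
// so cloning the function value (and Lox functions are first class
// values, so they get cloned) deep-copies an AST subtree.
#[derive(Clone)]
struct LoxFunctionOwned {
	declaration: Stmt,
}

// Closer to what the Java version effectively does: share one
// declaration. Cloning this only bumps a reference count.
#[derive(Clone)]
struct LoxFunctionShared {
	declaration: Rc<Stmt>,
}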

Another interesting case of translating from Java was how the interpreter dealt with certain values. For example, I knew that the super variable would always refer to a class value, but I had to do something like this:

// `super_value` holds the Value looked up for Lox's `super`
// (`super` itself is a reserved keyword in Rust).
let superclass = match super_value {
	Value::ClassValue(c) => c,
	_ => {
		// should never occur
		return Err(LoxError {});
	}
};

I knew that super would always be a Value::ClassValue and never any of the other possibilities of Value, but since I was storing all variables as Values, I couldn't encode that special knowledge about super. In the book, the Java implementation was something like

LoxClass klass = (LoxClass)v;

which would also be a runtime error if the cast ever failed.

Limitations

Instances are reference counted, so it's easy for two instances to reference each other and form a cycle; they will not get cleaned up until the program finishes execution.

class X{}

var x1 = X();
var x2 = X();
x1.x = x2;
x2.x = x1;

// x2 and x1 will not be dropped until the program ends
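In Rust terms this is the classic Rc cycle. Here is a sketch of the same situation; the exact types are illustrative, but the counting behaviour is the same as in my implementation:

use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Sketch of why the Lox program above leaks.
#[derive(Default)]
struct Instance {
	fields: HashMap<String, Rc<RefCell<Instance>>>,
}

fn main() {
	let x1 = Rc::new(RefCell::new(Instance::default()));
	let x2 = Rc::new(RefCell::new(Instance::default()));

	// x1.x = x2; x2.x = x1;
	x1.borrow_mut().fields.insert("x".to_string(), Rc::clone(&x2));
	x2.borrow_mut().fields.insert("x".to_string(), Rc::clone(&x1));

	// Each instance keeps the other's strong count above zero, so
	// dropping our handles never frees them. A tracing garbage
	// collector (or Weak references) would not have this problem.
	drop(x1);
	drop(x2);
}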

Future work

In the second half of the book, the author implements a bytecode interpreter written in C and, I believe, builds a garbage collector/runtime for it. A garbage collector would avoid the reference-counting cycle issue in this interpreter. If I ever follow along with that part of the book, I will try to implement the bytecode interpreter in Rust with a garbage collector and hopefully avoid the many copies this interpreter makes. I wonder whether Rust makes writing a garbage collector hard. Can I do it in safe Rust? Can I do it with the built-in allocator Rust already uses (the one behind Box::new), or do I have to drop into something more low level? I'm curious to find out.