By the end of this guide we'll have a minimal, working implementation
of a small part of Lua from scratch. It will be able to run the
following program (among others):
function fib(n)
if n < 2 then
return n;
end
local n1 = fib(n-1);
local n2 = fib(n-2);
return n1 + n2;
end
print(fib(30));
This is my second project in Rust and only the third time I've
invented an instruction set, so don't take my style as gospel. However,
I have found some Rust parsing tutorials overly complex, so I'm hoping
you'll find this one simpler.
All source code is available on Github.
Entrypoint
Running cargo init will give us the necessary boilerplate. In
src/main.rs we'll accept a file name from the command line, perform
lexical analysis to turn the file into a list of tokens, perform
grammar analysis on the tokens to build a tree structure, compile the
tree to a linear set of virtual machine instructions, and finally
interpret those instructions.
mod eval;
mod lex;
mod parse;
use std::env;
use std::fs;
fn main() {
let args: Vec<String> = env::args().collect();
let contents = fs::read_to_string(&args[1]).expect("Could not read file");
let raw: Vec<char> = contents.chars().collect();
let tokens = match lex::lex(&raw) {
Ok(tokens) => tokens,
Err(msg) => panic!("{}", msg),
};
let ast = match parse::parse(&raw, tokens) {
Ok(ast) => ast,
Err(msg) => panic!("{}", msg),
};
let pgrm = eval::compile(&raw, ast);
eval::eval(pgrm);
}
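If you save the fib example above as test.lua, then cargo run -- test.lua should print 832040 once the lexer, parser, compiler, and virtual machine are all filled in.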
Easy peasy. Now let's implement lex.
Lexical analysis
Lexical analysis drops whitespace (Lua is not whitespace
sensitive) and chunks all source code characters into their
smallest possible meaningful pieces like commas, numbers, identifiers,
keywords, etc.
In order to have useful error messages, we'll keep track of state in
the file with a Location struct that implements increment and
debug.
This goes in src/lex.rs.
#[derive(Copy, Clone, Debug)]
pub struct Location {
col: i32,
line: i32,
index: usize,
}
The increment function will update line and column numbers as well
as the current index in the file.
impl Location {
fn increment(&self, newline: bool) -> Location {
if newline {
Location {
index: self.index + 1,
col: 0,
line: self.line + 1,
}
} else {
Location {
index: self.index + 1,
col: self.col + 1,
line: self.line,
}
}
}
And the debug function will dump the current line of source along
with a message and a textual pointer to the current column.
pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
let mut line = 0;
let mut line_str = String::new();
// Find the whole line of original source
for c in raw {
if *c == '\n' {
line += 1;
// Done discovering line in question
if !line_str.is_empty() {
break;
}
continue;
}
if self.line == line {
line_str.push_str(&c.to_string());
}
}
let space = " ".repeat(self.col as usize);
format!("{}\n\n{}\n{}^ Near here", msg.into(), line_str, space)
}
}
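As a rough sanity check (these tests aren't part of the program we're building, just an illustration), increment and debug should behave like this:
#[cfg(test)]
mod location_tests {
    use super::*;

    #[test]
    fn increment_tracks_index_line_and_column() {
        let start = Location { col: 0, line: 0, index: 0 };
        // A normal character moves the column and index forward.
        let after_char = start.increment(false);
        assert_eq!((after_char.index, after_char.line, after_char.col), (1, 0, 1));
        // A newline moves to the next line and resets the column.
        let after_newline = after_char.increment(true);
        assert_eq!((after_newline.index, after_newline.line, after_newline.col), (2, 1, 0));
    }

    #[test]
    fn debug_points_at_the_offending_column() {
        let raw: Vec<char> = "local a = 1;\nlocal b = $;\n".chars().collect();
        // The '$' sits on the second line (line 1, zero-indexed) at column 10.
        let loc = Location { col: 10, line: 1, index: 23 };
        let msg = loc.debug(&raw, "Unrecognized character while lexing:");
        // The message repeats the offending line and points at the column.
        assert!(msg.contains("local b = $;"));
        assert!(msg.ends_with("^ Near here"));
    }
}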
The smallest individual unit after lexical analysis is a token which
is either a keyword, number, identifier, operator, or syntax. (This
implementation is clearly skipping lots of real Lua syntax like
strings.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
Identifier,
Syntax,
Keyword,
Number,
Operator,
}
#[derive(Debug, Clone)]
pub struct Token {
pub value: String,
pub kind: TokenKind,
pub loc: Location,
}
The top-level lex function will iterate over the file and call a lex
helper for each kind of token, returning an array of all tokens on
success. In between lexing it will "eat whitespace".
pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
let mut loc = Location {
col: 0,
index: 0,
line: 0,
};
let size = s.len();
let mut tokens: Vec<Token> = vec![];
let lexers = [
lex_keyword,
lex_identifier,
lex_number,
lex_syntax,
lex_operator,
];
'outer: while loc.index < size {
loc = eat_whitespace(s, loc);
if loc.index == size {
break;
}
for lexer in lexers {
let res = lexer(s, loc);
if let Some((t, next_loc)) = res {
loc = next_loc;
tokens.push(t);
continue 'outer;
}
}
return Err(loc.debug(s, "Unrecognized character while lexing:"));
}
Ok(tokens)
}
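To make that concrete, here's a hypothetical test (not part of the program itself) of what lex returns for a one-line declaration, assuming the individual lexers defined below:
#[cfg(test)]
mod lex_tests {
    use super::*;

    #[test]
    fn lexes_a_local_declaration() {
        let raw: Vec<char> = "local a = 1;".chars().collect();
        let tokens = lex(&raw).unwrap();
        // Expect, in order: Keyword("local"), Identifier("a"),
        // Syntax("="), Number("1"), Syntax(";").
        assert_eq!(tokens.len(), 5);
        assert_eq!(tokens[0].kind, TokenKind::Keyword);
        assert_eq!(tokens[0].value, "local");
        assert_eq!(tokens[1].kind, TokenKind::Identifier);
        assert_eq!(tokens[3].kind, TokenKind::Number);
        assert_eq!(tokens[4].value, ";");
    }
}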
Whitespace
Eating whitespace is just incrementing the location while we see a
space, tab, newline, etc.
fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
let mut c = raw[initial_loc.index];
let mut next_loc = initial_loc;
while [' ', '\n', '\r', '\t'].contains(&c) {
next_loc = next_loc.increment(c == '\n');
if next_loc.index == raw.len() {
break;
}
c = raw[next_loc.index];
}
next_loc
}
Numbers
Lexing numbers iterates through the source starting at a position
until it stops seeing decimal digits (this implementation only
supports integers).
fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let mut ident = String::new();
let mut next_loc = initial_loc;
let mut c = raw[initial_loc.index];
while c.is_digit(10) {
ident.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
}
If there are no digits in the string then this is not a number.
if !ident.is_empty() {
Some((
Token {
value: ident,
loc: initial_loc,
kind: TokenKind::Number,
},
next_loc,
))
} else {
None
}
}
Identifiers
Identifiers are any collection of alphabet characters, numbers, and
underscores.
fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let mut ident = String::new();
let mut next_loc = initial_loc;
let mut c = raw[initial_loc.index];
while c.is_alphanumeric() || c == '_' {
ident.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
}
But they cannot start with a number.
// First character must not be a digit
if ident.len() > 0 && !ident.chars().next().unwrap().is_digit(10) {
Some((
Token {
value: ident,
loc: initial_loc,
kind: TokenKind::Identifier,
},
next_loc,
))
} else {
None
}
}
Keywords
Keywords are alphabetical like identifiers are but they cannot be
reused as variables by the user.
fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let syntax = ["function", "end", "if", "then", "local", "return"];
let mut next_loc = initial_loc;
let mut value = String::new();
'outer: for possible_syntax in syntax {
let mut c = raw[initial_loc.index];
next_loc = initial_loc;
while c.is_alphanumeric() || c == '_' {
value.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
let n = next_loc.index - initial_loc.index;
            // Bail out if we've run past the end of this keyword or stopped matching it
            if n > possible_syntax.len() || value != possible_syntax[..n] {
value = String::new();
continue 'outer;
}
}
// Not a complete match
if value.len() < possible_syntax.len() {
value = String::new();
continue;
}
// If it got to this point it found a match, so exit early.
// We don't need a longest match.
break;
}
if value.is_empty() {
return None;
}
Aside from matching against the list of keyword strings, we have to
make sure the match is complete. For example, function1 is not a
keyword; it's a valid identifier. On the other hand, function 1 is a
valid pair of tokens (the keyword function and the number 1), even if
it isn't valid Lua grammar.
// If the next character would be part of a valid identifier, then
// this is not a keyword.
if next_loc.index < raw.len() - 1 {
let next_c = raw[next_loc.index];
if next_c.is_alphanumeric() || next_c == '_' {
return None;
}
}
Some((
Token {
value: value,
loc: initial_loc,
kind: TokenKind::Keyword,
},
next_loc,
))
}
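Here's a hypothetical test (again, just for illustration) of that complete-match rule:
#[cfg(test)]
mod keyword_tests {
    use super::*;

    #[test]
    fn keywords_require_a_complete_match() {
        let start = Location { col: 0, line: 0, index: 0 };

        // `function1` is a valid identifier, not the keyword `function`.
        let raw: Vec<char> = "function1;".chars().collect();
        assert!(lex_keyword(&raw, start).is_none());
        let (token, _) = lex_identifier(&raw, start).unwrap();
        assert_eq!(token.kind, TokenKind::Identifier);
        assert_eq!(token.value, "function1");

        // `function foo` starts with the keyword `function`.
        let raw: Vec<char> = "function foo".chars().collect();
        let (token, _) = lex_keyword(&raw, start).unwrap();
        assert_eq!(token.kind, TokenKind::Keyword);
        assert_eq!(token.value, "function");
    }
}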
Syntax
Syntax (in this context) is just language junk that isn't operators:
things like commas, parentheses, and semicolons.
fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let syntax = [";", "=", "(", ")", ","];
for possible_syntax in syntax {
let c = raw[initial_loc.index];
let next_loc = initial_loc.increment(false);
// TODO: this won't work with multiple-character syntax bits like >= or ==
if possible_syntax == c.to_string() {
return Some((
Token {
value: possible_syntax.to_string(),
loc: initial_loc,
kind: TokenKind::Syntax,
},
next_loc,
));
}
}
None
}
Operators
Operators are symbols like plus, minus, and less-than. Operators are
syntax too, but it helps us later on to break them out into a separate
kind of token.
fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let operators = ["+", "-", "<"];
for possible_syntax in operators {
let c = raw[initial_loc.index];
let next_loc = initial_loc.increment(false);
// TODO: this won't work with multiple-character operators like >= or ==
if possible_syntax == c.to_string() {
return Some((
Token {
value: possible_syntax.to_string(),
loc: initial_loc,
kind: TokenKind::Operator,
},
next_loc,
));
}
}
None
}
And now we're all done lexing!
Grammar analysis
Parsing finds grammatical (tree) patterns in a flat list of
tokens. This is called a syntax tree or abstract syntax tree (AST).
The boring part is defining the tree. Generally speaking (and
specifically for this project), the syntax tree is a list of
statements, where a statement can be a function declaration, an if
statement, a return statement, a local declaration, or an expression
statement.
This goes in src/parse.rs.
#[derive(Debug)]
pub enum Statement {
Expression(Expression),
If(If),
FunctionDeclaration(FunctionDeclaration),
Return(Return),
Local(Local),
}
pub type Ast = Vec<Statement>;
There's almost nothing special at all about the rest of the tree
definitions.
#[derive(Debug)]
pub enum Literal {
Identifier(Token),
Number(Token),
}
#[derive(Debug)]
pub struct FunctionCall {
pub name: Token,
pub arguments: Vec<Expression>,
}
#[derive(Debug)]
pub struct BinaryOperation {
pub operator: Token,
pub left: Box<Expression>,
pub right: Box<Expression>,
}
#[derive(Debug)]
pub enum Expression {
FunctionCall(FunctionCall),
BinaryOperation(BinaryOperation),
Literal(Literal),
}
#[derive(Debug)]
pub struct FunctionDeclaration {
pub name: Token,
pub parameters: Vec<Token>,
pub body: Vec<Statement>,
}
#[derive(Debug)]
pub struct If {
pub test: Expression,
pub body: Vec<Statement>,
}
#[derive(Debug)]
pub struct Local {
pub name: Token,
pub expression: Expression,
}
#[derive(Debug)]
pub struct Return {
pub expression: Expression,
}
And that's it for the AST!
Some helpers
Lastly before the fun part, we'll define a few helpers for validating
each kind of token.
fn expect_keyword(tokens: &[Token], index: usize, value: &str) -> bool {
if index >= tokens.len() {
return false;
}
let t = tokens[index].clone();
t.kind == TokenKind::Keyword && t.value == value
}
fn expect_syntax(tokens: &[Token], index: usize, value: &str) -> bool {
if index >= tokens.len() {
return false;
}
let t = tokens[index].clone();
t.kind == TokenKind::Syntax && t.value == value
}
fn expect_identifier(tokens: &[Token], index: usize) -> bool {
if index >= tokens.len() {
return false;
}
let t = tokens[index].clone();
t.kind == TokenKind::Identifier
}
Now on to the fun part, actually detecting these trees!
Top-level parse
The top-level parse function and its major helper,
parse_statement, dispatch very similarly to the top-level lex
function. For each statement in the file we look for function
declarations, if statements, return statements, local declarations,
and expression statements.
fn parse_statement(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
let parsers = [
parse_if,
parse_expression_statement,
parse_return,
parse_function,
parse_local,
];
for parser in parsers {
let res = parser(raw, tokens, index);
if res.is_some() {
return res;
}
}
None
}
pub fn parse(raw: &[char], tokens: Vec<Token>) -> Result<Ast, String> {
let mut ast = vec![];
let mut index = 0;
let ntokens = tokens.len();
while index < ntokens {
let res = parse_statement(raw, &tokens, index);
if let Some((stmt, next_index)) = res {
index = next_index;
ast.push(stmt);
continue;
}
return Err(tokens[index].loc.debug(raw, "Invalid token while parsing:"));
}
Ok(ast)
}
Expression statements
Expression statements are just a wrapper for the Rust type
system. They call parse_expression (which we'll define shortly),
expect a semicolon afterward, and wrap the expression in a statement.
fn parse_expression_statement(
raw: &[char],
tokens: &[Token],
index: usize,
) -> Option<(Statement, usize)> {
let mut next_index = index;
let res = parse_expression(raw, tokens, next_index)?;
let (expr, next_next_index) = res;
next_index = next_next_index;
if !expect_syntax(tokens, next_index, ";") {
println!(
"{}",
tokens[next_index]
.loc
.debug(raw, "Expected semicolon after expression:")
);
return None;
}
next_index += 1; // Skip past semicolon
Some((Statement::Expression(expr), next_index))
}
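To see the whole flow in one place, here's a hypothetical end-to-end test. It leans on parse_expression, which we'll define next, and assumes the remaining parse_* helpers defined later in this guide return None when they don't match:
#[cfg(test)]
mod parse_tests {
    use super::*;
    use crate::lex;

    #[test]
    fn parses_a_single_expression_statement() {
        let raw: Vec<char> = "fib(1);".chars().collect();
        let tokens = lex::lex(&raw).unwrap();
        let ast = parse(&raw, tokens).unwrap();
        // `fib(1);` should become one expression statement wrapping a call.
        assert_eq!(ast.len(), 1);
        match &ast[0] {
            Statement::Expression(Expression::FunctionCall(call)) => {
                assert_eq!(call.name.value, "fib");
                assert_eq!(call.arguments.len(), 1);
            }
            other => panic!("expected a function call statement, got {:?}", other),
        }
    }
}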
Expressions
Expressions in this minimal Lua are only one of function calls,
literals (numbers, identifiers), or binary operations. To keep things
very simple, binary operations cannot be combined. So instead of 1 +
2 + 3 we'd need to do local tmp1 = 1 + 2; local tmp2 = tmp1 + 3;
and so on.
fn parse_expression(raw: &[char], tokens: &[Token], index: usize) -> Option<(Expression, usize)> {
if index >= tokens.len() {
return None;
}
let t = tokens[index].clone();
let left = match t.kind {
TokenKind::Number => Expression::Literal(Literal::Number(t)),
TokenKind::Identifier => Expression::Literal(Literal::Identifier(t)),
_ => {
return None;
}
};
If what follows the first literal is an open parenthesis then we try
to parse a function call.
let mut next_index = index + 1;
if expect_syntax(tokens, next_index, "(") {
next_index += 1; // Skip past open paren
// Function call
let mut arguments: Vec<Expression> = vec![];
We need to call parse_expression recursively for every possible
argument passed to the function.
while !expect_syntax(tokens, next_index, ")") {
            // Arguments after the first one must be preceded by a comma
            if !arguments.is_empty() {
if !expect_syntax(tokens, next_index, ",") {
println!(
"{}",
tokens[next_index]
.loc
.debug(raw, "Expected comma between function call arguments:")
);
return None;
}
next_index += 1; // Skip past comma
}
let res = parse_expression(raw, tokens, next_index);
if let Some((arg, next_next_index)) = res {
next_index = next_next_index;
arguments.push(arg);
} else {
println!(
"{}",
tokens[next_index]
.loc
.debug(raw, "Expected valid expression in function call arguments:")
);
return None;
}
}
next_index += 1; // Skip past closing paren
return Some((
Expression::FunctionCall(FunctionCall {
name: tokens[index].clone(),
arguments,
}),
next_index,
));
}
Otherwise if there isn't an opening parenthesis then we could be
parsing either a literal expression or a binary operation. If the
token that follows is an operator token then we know it's a binary
operation.
// Might be a literal expression
    if next_index >= tokens.len() || tokens[next_index].kind != TokenKind::Operator {
return Some((left, next_index));
}
// Otherwise is a binary operation
let op = tokens[next_index].clone();
next_index += 1; // Skip past op
    if next_index >= tokens.len() {
        // There is no token after the operator, so point the error at the operator itself.
        println!(
            "{}",
            op.loc
                .debug(raw, "Expected valid right hand side binary operand:")
        );
        return None;
    }
let rtoken = tokens[next_index].clone();
It is at this point that we could (but won't) call
parse_expression recursively. I don't want to deal with operator
precedence right now, so we'll just require that the right-hand side is
another literal.
let right = match rtoken.kind {
TokenKind::Number => Expression::Literal(Literal::Number(rtoken)),
TokenKind::Identifier => Expression::Literal(Literal::Identifier(rtoken)),
_ => {
println!(
"{}",
rtoken
.loc
.debug(raw, "Expected valid right hand side binary operand:")
);
return None;
}
};
next_index += 1; // Skip past right hand operand
    Some((
        Expression::BinaryOperation(BinaryOperation {
            operator: op,
            left: Box::new(left),
            right: Box::new(right),
        }),
        next_index,
    ))
}
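And a final hypothetical check of the shape parse_expression produces, assuming the lexer from earlier: fib(n-1) becomes a FunctionCall whose single argument is a BinaryOperation.
#[cfg(test)]
mod expression_tests {
    use super::*;
    use crate::lex;

    #[test]
    fn call_with_a_binary_operation_argument() {
        let raw: Vec<char> = "fib(n-1);".chars().collect();
        let tokens = lex::lex(&raw).unwrap();
        let (expression, next_index) = parse_expression(&raw, &tokens, 0).unwrap();
        match expression {
            Expression::FunctionCall(call) => {
                assert_eq!(call.name.value, "fib");
                assert_eq!(call.arguments.len(), 1);
                assert!(matches!(call.arguments[0], Expression::BinaryOperation(_)));
            }
            _ => panic!("expected a function call"),
        }
        // Token index 6 is the trailing semicolon, left for the caller to consume.
        assert_eq!(next_index, 6);
    }
}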