
Creating a Small Interpreter
The goal is to sketch out a language that’s just barely powerful enough to write some classic programs, notably one that calculates the Fibonacci sequence recursively. You know, the one that you always screw up at job interviews. To this end we’re going to need the usual print and variable assignment statements, arithmetic and boolean expressions, loop and conditional constructs and, most importantly, functions. This language is actually part of Dhaka’s test suite – look at the tests prefixed with ‘chittagong’. Realize that this is not a tutorial on designing a good programming language – it’s just a demonstration of Dhaka’s features.
An Overview of the Grammar
The grammar is too long to describe in a detailed fashion (see test/chittagong/chittagong_grammar.rb), but this is
how the overall structure of a program is declared:
for_symbol(Dhaka::START_SYMBOL_NAME) do program %w| opt_terms main_body opt_terms | end for_symbol('main_body') do single_main_body_statement %w| main_body_statement | multiple_main_body_statements %w| main_body terms main_body_statement | end for_symbol('main_body_statement') do main_body_simple_statement %w| simple_statement | function_definition %w| def function_name ( arg_declarations ) terms function_body terms end | main_body_if_statement %w| if expression terms main_body terms end | main_body_if_else_statement %w| if expression terms main_body terms else terms main_body terms end | main_body_while_statement %w| while expression terms main_body terms end | end
A program is simply a series of main body statements, separated by terms, which are newline characters. Return statements aren’t allowed in the main body. Function body statements are very similar, except they don’t allow function definitions, but do allow return statements.
Further along the same file, we specify the basic expressions allowed in the language. The operators are a subset of the Ruby operators and work more or less like they do in Ruby.
for_symbol('expression') do negation %w| ! expression | equality_comparison %w| expression == expression | greater_than_comparison %w| expression > expression | less_than_comparison %w| expression < expression | addition %w| expression + expression | subtraction %w| expression - expression | multiplication %w| expression * expression | division %w| expression / expression | power %w| expression ^ expression | literal %w| numeric_literal | function_call_expression %w| function_name ( arg_list ) | variable_reference %w| var_name | parenthetized_expression %w| ( expression ) | negated_expression %w| - expression |, :prec => '*' end
When creating a parser for this grammar, the generator reports a whole bunch of shift-reduce conflicts (80 to be exact). This is due to the manner in which we specify the expansions for expression, which is very ambiguous, but has the benefit of being compact. The conflicts that
arise are all resolved by the precedences specified in the grammar.
Creating a Lexer Specification
The lexical structure of this language is complicated enough that writing a tokenizer by hand would be tedious. Instead, we will generate the lexer by specifying the regular expression patterns that correspond to the various tokens:
class ChittagongLexerSpecification < Dhaka::LexerSpecification KEYWORDS = %w| print if else end while def return | for_pattern('==') do create_token('==') end %w| = - ! > < , + * ( ) / ^ |.each do |char| for_symbol(char) do create_token(char) end end for_pattern("\n") do create_token('newline') end for_pattern(' ') do # ignore whitespace end for_pattern('\w+') do if KEYWORDS.include? current_lexeme.value create_token current_lexeme.value else create_token 'word_literal' end end for_pattern('\d*(\.\d+)?') do create_token('numeric_literal') end end
The interesting bit here is the part that handles keywords. When the lexer sees a word token, it transfers control to the block that we defined here:
for_pattern('\w+') do if KEYWORDS.include? current_lexeme.value create_token current_lexeme.value else create_token 'word_literal' end end
The block simply checks if the value of the currently recognized lexeme is a member of the KEYWORDS array – if so, then it creates a keyword token, otherwise a word_literal token with the value of the lexeme.
The Evaluator
To store variables we need a table of some sort, called a frame. Since we have functions in our language, each function scope must have its own frame. In other words, the runtime environment must have a stack of frames. This functionality can be achieved readily with an array of hashes. For storing function names and definitions, a global table suffices, since functions will not be subject to scoping, achieved, once again, with a hash. The last piece is an output stream that the print statement and exceptions can print to. These are either injected or created in the evaluator’s initialize method:
def initialize(stack, output_stream) @stack = stack @function_table = {} @output_stream = output_stream end
Probably the two most interesting evaluation rules are the ones for defining and evaluating functions. In the dice roller and bracket packing examples, the result of evaluating a node depended on the results of evaluating its child nodes. For example, when evaluating addition, we wanted to know the evaluation results of the subtrees of each operand. When processing a function definition, however, we cannot evaluate the body of the function. Instead we must store it, so that it can be evaluated later. This is how function definition statements are processed:
for_function_definition do function_name = evaluate(child_nodes[1]) arg_declarations = evaluate(child_nodes[3]) body = child_nodes[6] @function_table[function_name] = Function.new(arg_declarations, body) ChittagongSuccessResult.new(nil) end
The only trick here is in the third line of the block where we assign the child node directly to the body without wrapping it in a call to evaluate. We’re actually storing the parse tree of the function body in a hash keyed by the function name so we can evaluate it at a later time. Not much can go wrong when you’re processing a function definition – the grammar takes care of enforcing all the rules, so as long as the code parses, it’s okay. The logic for evaluating function calls, however, is more complex, since we need to check for exceptions:
def checked_function_call function_name = evaluate(child_nodes[0]) return ChittagongExceptionResult.new("Undefined function #{function_name}", child_nodes[0]) unless @function_table.has_key?(function_name) arg_values = evaluate(child_nodes[2]) return arg_values if arg_values.exception function = @function_table[function_name] return ChittagongExceptionResult.new( "Wrong number of arguments", child_nodes[1] ) unless function.args.size == arg_values.result.size new_frame = {} function.args.zip(arg_values.result).each do |arg_name, arg_value| new_frame[arg_name] = arg_value end @stack << new_frame result = evaluate(function.body) @stack.pop result end
First we check to see if the function is defined, and if not, raise an exception (note that an exception in our mini-language is not a Ruby exception, just an instance of ChittagongExceptionResult which is returned to the caller). Next, we evaluate the arguments, and if the result is an exception, pass it up to the caller. If not, we check to see if the number of arguments passed to the function call matches the declaration of the function and create a new exception if that is not the case. If everything is okay so far, we create a new stack frame and bind the argument values to the function’s argument declarations on this frame. We push the new frame onto the stack, evaluate the function body, store the result, pop the new frame off the stack and return the result to the caller.
Tying it together
The last component is something that wires the Lexer, Parser and Evaluator together. This component, called the Driver for lack of a better name, checks the result of each step to see if an error occurred, in which case it pretty-prints the exception and the source, indicating where the error was caused. The most important method is run:
def run(program) parse_result = ChittagongParser.parse(ChittagongLexer.lex(program)) case parse_result when Dhaka::TokenizerErrorResult return tokenize_error_message(parse_result.unexpected_char_index, program) when Dhaka::ParseErrorResult return parse_error_message(parse_result.unexpected_token, program) end evaluation_result = ChittagongEvaluator.new([{}], output_stream = []). evaluate(parse_result) if evaluation_result.exception return (output_stream << evaluation_error_message(evaluation_result, program)). join("\n") end return output_stream.join("\n") end
The result of each step, if it contains an error, also contains the information about the portion of the source that the error came from. In the event of an error while lexing the input, a TokenizerErrorResult can be queried for the location in the source of the unexpected character. Similarly, a ParseErrorResult can be queried for the unexpected token. The evaluator we defined returns the parse tree node that it was attempting to evaluate when it encounters an error.
Using this driver, a test for variable scoping looks like this:
def test_variable_scope program = " x = 1 def bar(n) print 999 end def foo(n) bar(x) end foo(2) " assert_equal( "Undefined variable x: x = 1 def bar(n) print 999 end def foo(n) bar(>>>x) end foo(2) ", @driver.run(program)) end
Finally, we can write our doubly-recursive Fibonacci function:
def test_recursive_fibonacci program = " def fib(n) if n == 0 return 1 end if n == 1 return 1 end return fib(n-1) + fib(n-2) end x = 0 while x < 9 print fib(x) x = x + 1 end " assert_equal(["1.0", "1.0", "2.0", "3.0", "5.0", "8.0", "13.0", "21.0", "34.0"].join("\n"), @driver.run(program)) end