Class Dhaka::Tokenizer
In: ../trunk/lib/dhaka/tokenizer/tokenizer.rb
Parent: Object

This abstract class contains a DSL for hand-coding tokenizers. Subclass it to implement tokenizers for specific grammars.

Tokenizers are state machines. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).

The following is a tokenizer for arithmetic expressions with integer terms. It starts in the idle state, where it creates a single-character token for every character except digits and whitespace. When it encounters a digit, it pushes a new token with an empty value onto the stack and switches to :get_integer_literal, where it appends each successive digit to that token's value. When it next encounters a non-digit character, it switches back to the idle state. Whitespace is treated as a delimiter and is not shifted as a token.

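 class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer

   digits = ('0'..'9').to_a
   parenths = ['(', ')']
   operators = ['-', '+', '/', '*', '^']
   functions = ['h', 'l']
   arg_separator = [',']
   whitespace = [' ']

   all_characters = digits + parenths + operators + functions + arg_separator + whitespace

   for_state Dhaka::TOKENIZER_IDLE_STATE do
     # Every character other than a digit or whitespace is its own token.
     for_characters(all_characters - (digits + whitespace)) do
       create_token(curr_char, nil)
       advance
     end
     # A digit starts an integer literal. Do not advance here, so that the
     # digit is consumed in the :get_integer_literal state.
     for_characters digits do
       create_token('n', '')
       switch_to :get_integer_literal
     end
     # Whitespace is skipped without creating a token.
     for_character whitespace do
       advance
     end
   end

   for_state :get_integer_literal do
     # A non-digit ends the literal. The idle state handles the character.
     for_characters all_characters - digits do
       switch_to Dhaka::TOKENIZER_IDLE_STATE
     end
     for_characters digits do
       curr_token.value << curr_char
       advance
     end
   end

 end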
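
To make the state transitions concrete, here is a minimal stand-alone sketch of the same two-state machine in plain Ruby. It does not use Dhaka and is not how Dhaka implements its tokenizer; the names SketchToken and sketch_tokenize are hypothetical and exist only for this illustration. The point to notice is that a state change does not consume the current character, so the new state gets to examine it.

```ruby
# Hypothetical stand-alone sketch. None of these names come from Dhaka.
SketchToken = Struct.new(:symbol_name, :value)

def sketch_tokenize(input)
  digits = ('0'..'9').to_a
  single_chars = ['(', ')', '-', '+', '/', '*', '^', 'h', 'l', ',']
  tokens = []
  state = :idle
  index = 0

  while index < input.length
    char = input[index]
    case state
    when :idle
      if digits.include?(char)
        # Start an integer literal. The index is not advanced, so the
        # digit is re-examined in the :get_integer_literal state.
        tokens << SketchToken.new('n', +'')
        state = :get_integer_literal
      elsif single_chars.include?(char)
        tokens << SketchToken.new(char, nil)
        index += 1
      elsif char == ' '
        index += 1 # whitespace delimits tokens but produces none
      else
        raise ArgumentError, "unexpected character #{char.inspect} at index #{index}"
      end
    when :get_integer_literal
      if digits.include?(char)
        tokens.last.value << char
        index += 1
      else
        # Any non-digit ends the literal. The idle state handles it.
        state = :idle
      end
    end
  end
  tokens
end

tokens = sketch_tokenize('12 + (3*h(4,5))')
puts tokens.map { |t| t.value ? "#{t.symbol_name}:#{t.value}" : t.symbol_name }.join(' ')
```

Under these assumptions, integer literals come out as tokens named 'n' carrying the digit string as their value, and every other token carries a nil value, which mirrors what the DSL example does with create_token.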


For languages with a very complicated lexical structure, implementing a Tokenizer by hand may be too tedious. In such cases, it's much easier to write a LexerSpecification using regular expressions and create a Lexer from that.
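
As a rough illustration of why regular expressions shorten the work, the sketch below lexes the same arithmetic language with Ruby's standard StringScanner and a small rule table. This is not Dhaka's LexerSpecification API; SKETCH_RULES and sketch_lex are hypothetical names used only here, and the rule format is an assumption made for the example.

```ruby
require 'strscan'

# Hypothetical rule table: each entry pairs a pattern with a token name.
# A nil name means "skip the match"; :self means "use the matched text".
SKETCH_RULES = [
  [/\s+/, nil],
  [/\d+/, 'n'],
  [/[-+\/*^(),hl]/, :self]
].freeze

def sketch_lex(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # Try each rule in order; scan advances the scanner only on a match.
    rule = SKETCH_RULES.find { |pattern, _name| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at index #{scanner.pos}" unless rule

    name = rule[1]
    next if name.nil?

    lexeme = scanner.matched
    tokens << (name == :self ? [lexeme, nil] : [name, lexeme])
  end
  tokens
end

p sketch_lex('12 + (3*h(4,5))')
```

The whole integer-literal state collapses into the single pattern \d+, which is the kind of saving the regular-expression approach offers as the lexical structure grows.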



tokens  [R]  The tokens shifted so far.

Public Class methods

Define the action for the state named state_name.

Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.

Public Instance methods

Advance to the next character.

Push a new token onto the stack, with the symbol named symbol_name and the value value.

The character currently being processed.

The token currently on top of the stack.

Change the active state of the tokenizer to the state identified by the symbol state_name.