Class Dhaka::LexerSpecification
In: ../trunk/lib/dhaka/lexer/specification.rb
Parent: Object

Abstract base class for lexer specifications.

Use this to specify the transformations that will be performed when the lexer recognizes a given pattern. Actions are listed in descending order of priority. For example in the following lexer specification:

  class LexerSpec < Dhaka::LexerSpecification
    for_pattern 'zz' do
      "recognized two zs"

    for_pattern '\w(\w|\d)*' do
      "recognized word token #{current_lexeme.value}"

    for_pattern '(\d)+(\.\d+)?' do
      "recognized number #{current_lexeme.value}"

    for_pattern ' +' do
      #ignores whitespace

    for_pattern "\n+" do
      "recognized newline"

the pattern ‘zz’ takes precedence over the pattern immediately below it, so the lexer will announce that it has recognized two ‘z‘s instead of a word token.

The patterns are not Ruby regular expressions - a lot of operators featured in Ruby‘s regular expression engine are not yet supported. See for the current syntax. Patterns may be specified using Ruby regular expression literals as well as string literals.

There are a few things to keep in mind with regard to the regular expression implementation:

  • The greediest matching expression always wins. Precedences are only used when the same set of characters matches multiple expressions.
  • All quantifiers are greedy. There is as yet no support for non-greedy modifiers.
  • The lookahead operator "/" can behave in counter-intuitive ways in situations where the pre-lookahead-operator expression and the post-lookahead-operator expression have characters in common. For example the expression "(ab)+/abcd", when applied to the input "abababcd" will yield "ababab" as the match instead of "abab". A good thumb rule is that the pre-lookahead expression is greedy.
  • There is no support for characters beyond those specified in the grammar above. This means that there is no support for extended ASCII or unicode characters.


Public Class methods

Associates blk as the action to be performed when a lexer recognizes pattern. When Lexer#lex is invoked, it creates a LexerRun object that provides the context for blk to be evaluated in. Methods available in this block are LexerRun#current_lexeme and LexerRun#create_token.

Use this to automatically handle escaping for regular expression metacharacters. For example,

  for_symbol('+') { ... }

translates to:

  for_pattern('\+') { ... }