Behavior

A lexer’s behavior is controlled by its mode. In that sense, it is similar to a human’s mood [1], it only differs in that a lexer’s mode is distinct for a given point in time. A mode determines the interpreted language in terms of a mappings from patterns to actions. Changes in interpreting language are reflected by changes of modes. For example, in the TeX typesetting system the math mode is entered and exited from text mode by $ operators. In a fragment like:

Einstein stated that $E = m c^2$ describes the mass energy equivalence.

the floating text is interrupted by $ delimiters. A $ triggers the math mode instruction parsing until the closing $ appears. The parser, then, returns into text parsing mode.

In a Quex source file, the behavior of a lexer is defined by means of the sections mode and define. The initial mode is specified by the assignment

start = NAME;

where start is a top-level keyword and NAME is the name of the intended initial mode with which the interpretation shall start. This statement is only required, if there is more than one mode.

Mode

The top-level keyword mode signalizes the beginning of a mode definition. The following scheme pinpoints a basic mode definition.

mode NAME {
     PATTERN_1    ACTION_1
     PATTERN_2    ACTION_2
     ...
     on_EVENT_1   EVENT_HANDLER_1
     on_EVENT_2   EVENT_HANDLER_2
}

The identifier NAME specifies the mode’s name. The content of a mode definition consists of a list of pairs of patterns and their related actions to be performed upon match. Also events such as ‘end-of-file’ may be associated with event handlers. The behavior of a mode, however, can be customized with some more detail, if the mode’s NAME is followed by a colon, as shown below.

mode NAME : BASE_A, BASE_B, BASE_C
            <tag-1> <tag-2> {
      ...
}

The names following the : express inheritance relationships by means of the mode’s base modes. Mode inheritance provides a means of behavioral composition. For example, a mode COMMON may be defined to handle the events of ‘failure’ and ‘end-of-stream’. For all other modes to behave uniformly with this respect, is is sufficient to mention COMMON as base mode.

Tags are bracketed in < and > brackets. Tags are used to control mode transitions and line and column numbering behavior among other things. The code fragment below provides an impression of a full-fledged lexer mode named STRING.

mode STRING : END_OF_FILE <entry: PROGRAM> <exit: PROGRAM>
{
    on_entry   { string_clear(); }
    on_exit    { self.send_string(QUEX_TKN_STRING, string_get()); }

    "\\\\""    { string_add(Lexeme); }
    "\\\\"     { string_add(Lexeme); }
    "\""       { self.enter_mode(&self, PROGRAM); RETURN; }

    on_failure {
        string_add(Lexeme);
        self.error_code_clear_this(&self, E_Error_OnFailure);
    }
}

Mode STRING is derived from mode END_OF_FILE which, supposedly, has the task to handle the end-of-file event. STRING shall ‘eat’ the content of sections in double quotes. The tags entry and exit restrict the transitions from and to this mode. Mode STRING, as defined here, is only accessible from mode PROGRAM. Supposedly, STRING is entered when PROGRAM detects an opening quote and it is left, when it detects a closing double quote. The remaining pattern-action pairs and event handlers implement a string accumulation and deal with escaped double quotes and backslashes.

The definition of large patterns can add a significant amount of visual noise. A mode proliferated with visual noise can hardly make its behavior transparent. Consider, for example, the following definition for Latin, Thai, Arabic, and Devanagari numbers.

([0-9]+".")?[0-9]+|([:intersection(\P{Block=Arabic},[X660-X6D0]):]+".")?[:intersection(\P{Block=Arabic},[X660-X6D0]):]+|([:intersection(\P{Script=Thai},\G{Nd}):]+".")?[:intersection(\P{Script=Thai},\G{Nd}):]+|([:intersection(\P{Script=Devanagari},\G{Nd}):]+".")?[:intersection(\P{Script=Devanagari},\G{Nd}):]+ => QUEX_TKN_NUMBER(Lexeme);

It is clearly impossible for an unaltered human being to comprise the meaning of this expression [4]. Moreover, errors can hardly be detected. The next section explains how to build those expressions hierarchically and finally come up with a shorthand NUMBER, so that the above rule can be expressed as

{NUMBER} => QUEX_TKN_NUMBER(Lexeme);

A detailed discussion of modes and their syntax is presented in chapter sec:modes.

Definitions

define

Section to define shorthands, macros, and trigger plots and lexeme set displays for regular expressions.

define {
    ...
}

The define keyword is followed by a section in curly brackets. The purpose of this section is to define handy shorthands for patterns to be used in mode definitions. To achieve this, the instructions below are provided.

  • NAME: shorthand definition.

  • \macro: macro definition.

  • \run: running sample data on a DFA.

  • \plot: plotting a DFA.

  • \lexemes: determine lexemes matched by a DFA.

The following paragraphs discuss their use. The most essential part of a define section are shorthand definitions. Those are the output to be used later. An NAME can be anything starting with letters and continuing with letters or numbers plus or the underscore. Formally, a shorthand is defined by:

NAME    RE

where RE is a regular expression. Whenever the NAME appears in curly brackets, it is replaced by the regular expression RE. This can be used to define hierarchies of shorthands, for example:

define {
   DIGIT  [0-9]
   FLOAT  {DIGIT}+(.{DIGIT}*)
}

defines first the shorthand DIGIT and based on that the shorthand for a floating point number FLOAT. It consists of some digits possibly followed by a dot and more digits. A macro is similar to a shorthand, only that it receives arguments. It can be defined by the \macro instruction:

\macro NAME(TYPE0 ARG0[, TYPEn ARGn]+): RE

where NAME specifies the macro’s name. A comma-separated list of arguments is passed in brackets [3]. Each argument is specified with its TYPE and its name. Possible types are

  • dfa for a deterministic finite automaton, i.e. a state machine, or a regular expression, This may also be used to pass a string.

  • set for a character set, and

  • integer for an integer [2].

The RE is the regular expression in which the arguments may be expanded. Arguments act like a shorthand inside the macro’s RE. For example, the macro:

\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)?

expands what is passed as first argument named DIGIT inside the regular expression {DIGIT}+("."{DIGIT}*). Macros are expanded the like shorthands, namely by surrounding curly brackets. When arguments are passed, they are separated by whitespace, i.e. formally:

{NAME ARG0 [ARGn]+}

A macros call implements construction rule for regular expression very concisely. Instead of defining a float number regular expression explicitly for different scripts, the aforementioned macro FLOAT may be used. For convinience, let the SEPERATOR argument customize the separator between the whole number part and the fractional part. The FLOAT macro can then be used conveniently to define English, German, Arabic and Thai floating point numbers, as shown below.

define {
   \macro FLOAT(set DIGIT, dfa SEPERATOR): {DIGIT}+({SEPERATOR}{DIGIT}*)?

   //                   DIGIT:          SEPARATOR:
   ENGLISH_FLOAT {FLOAT [0-9]           "."  /* Decimal dot   */}
   GERMAN_FLOAT  {FLOAT [0-9]           ","  /* German comma  */}
   ARABIC_FLOAT  {FLOAT [۰-۹]           "،"  /* Arabic comma  */}
   THAI_FLOAT    {FLOAT [\X0E50-\X0E59] "."  /* Decimal dot   */}
}

Note

Shorthands and macros do not enter the name space in generated code. They are only available inside the top-level sections mode and define.

Using macros and shorthands in mode definitions makes pattern-action pairs more transparent. Moreover, the hierarchical construction of patterns based on understandable sub-patterns is essential for a robust design–following the paradigm of ‘divide and conquer’ [5]! The remaining commands \plot and \run support the development and verification of regular expressions.

The command \plot may be used to visualize results of regular expressions. It comes in two flavors:

\plot ['(' [e] [FLAGS]* ')'] SHORTHAND

When no flag or the e flag is specified, the \plot command expands the SHORTHAND and writes the according state machine graph into a file named SHORTHAND.dot.

\plot '(' f [FLAGS]* ')' FILE-NAME RE

If the f flag is specified, the \plot command writes a state machine graph of the regular expression RE to a file FILE-NAME in Graphviz format [].

This allows to see quickly the state machine graph of a regular expression’s DFA. For example the definition of the pattern FRUIT can be plotted as shown below.

define {
   FRUIT  apple|pineapple|guapple|pitaya
   \plot  FRUIT
}

plots the state machine implemting the match of abc into the file MINE.dot. If the f flag is given, then the \plot command expects a file name followed by an expression to plot. For example,

define {
   FRUIT    apple|pineapple|guapple|pitaya
   \plot(f) FRUIT.dot {FRUIT}
}

does the exact same thing as the aforementioned example. The resulting graph can be produces with the dot tool of the Graphviz package. That is, given the content above in a file test.qx, the following commands produce what is shown in figure Fig. 9

> quex -i test.qx
> dot FRUIT.dot -Tsvg
../_images/section-define-FRUIT.svg

Fig. 9 Example plot of a regular expression’s DFA.

The flags of the \plot given in table Flags passed to \plot..

Table 3 Flags passed to \plot.

Flag

Meaning

h (default)

annotate edges with hexadecimal numbers.

d

annotate edges with decimal numbers.

c

annotate edges with unicode characters.

t

plot graph to terminal

s (default)

plot shorthand.

f

plot to file.

q

suppress print of location (file, line number)

For investigation of a patterns matching behavior, the \run command may be used.

\run RE TEST-RE

This commands considers all lexemes that match the regular expression TEST-RE as input sequences to a DFA derived from the regular expression RE. For example,:

\run [a-z]+   h[ae]llo

The expression h[ae]llo matches hello and hallo. So the \run command feeds the string "hello" in one run to the DFA derived from [a-z]+. For the second run it feeds the string "hallo". The output of the run command is printed to the standard output. Below, is the example to check the parsing of float numbers as defined above.

define {
    // NOTE: wrong definition where fractional part is not optional
   \macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)

   NUMBER {FLOAT [0-9]}

   \run {NUMBER} "123 "|"123.456;"
}

The \run command takes an expansion of NAMESPACE as the the DFA on which the test sequences will be applied. The test sequences are 123 `` and ``123.456;. Storing the above fragment in a file test.qx and applying:

> quex -i test.qx

the result of the three test sequences is printed to the standard output as shown below.:

test.qx:7: run on '{NUMBER}'
test.qx:7:
test.qx:7:   |1|2|3| |
test.qx:7:           ^
test.qx:7:   |1|2|3|.|4|5|6|;|
test.qx:7:                 A ^
test.qx:7:
test.qx:7: run <terminated>

An A represents the position where acceptance occurred. The ^ denotes the furthest position that the analyzer went. The second sequences 123.456 accepts the floating point number. After the lexer has found the ; it goes back to the last digit and notifies a match. However, the first sequences 123 `` is not accepted, even though it is a valid number. The reason is that the definition of ``FLOAT is wrong. There, the fractional part was not made optional, i.e. the ? was missing. With the correct definition:

\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)?

the output becomes:

test.qx:7: run on '{NUMBER}'
test.qx:7:
test.qx:7:   |1|2|3| |
test.qx:7:         A ^
test.qx:7:   |1|2|3|.|4|5|6|;|
test.qx:7:                 A ^
test.qx:7:
test.qx:7: run <terminated>

which meets the expectations. As shown in this example, the \run command is a convinient tool to find errors in regular expression definitions quickly.

\lexemes '(' [FLAGS]* ')' RE

The \lexemes command determines the set of lexemes possibly matched by a DFA. This is another way to see whether a regular expression does what it should. For example.:

define {
   \lexemes for|forest
}

delivers:

test.qx:2: lexemes for 'for|forest' (max loops=0, sets=False)
test.qx:2:
test.qx:2: f, o, r
test.qx:2: f, o, r, e, s, t
test.qx:2:
test.qx:2: lexemes <terminated>

when a regular expression contains repetition, it makes sense to restrict the maximum number of loop iterations. This is done by a decimal number flag between 1 and 5. For example:

define {
   IDENTIFIER [a-z][0-9a-z]*
   \lexemes (3) {IDENTIFIER}
}

iterates at maximum three times through a loop and delivers:

test.qx:3: lexemes for '{IDENTIFIER}' (max loops=2, sets=False)
test.qx:3:
test.qx:3: a
...
test.qx:3: z
test.qx:3: a, 0
...
test.qx:3: a, 9
...
test.qx:3: z, z, z
test.qx:3:
test.qx:3: lexemes <terminated>

which is, indeed, a very lengthy expression. Instead of printing each possible number, number sets can be printed using the s flag. This shortens the list of printed lines and the result of \lexemes (3s) {IDENTIFIER} becomes:

test.qx:3: lexemes for '{IDENTIFIER}' (max loops=2, sets=True)
test.qx:3:
test.qx:3: {[a, z]}
test.qx:3: {[a, z]}, {[0, 9] [a, z]}
test.qx:3: {[a, z]}, {[0, 9] [a, z]}, {[0, 9] [a, z]}
test.qx:3:
test.qx:3: lexemes <terminated>

The complete list of flags to be passed to the \lexemes command is shown in Table 4.

Table 4 Flags passed to \plot.

Flag

Meaning

1 - 5

maximum number of loop iterations.

h

display hexadecimal numbers.

d

display decimal numbers.

c

display unicode characters.

s

display sets instead of single lexatoms.

q

suppress print of location (file, line number)

Footnotes