Behavior¶
A lexer’s behavior is controlled by its mode. In that sense, it is similar
to a human’s mood [1], it only differs in that a lexer’s mode is distinct
for a given point in time. A mode determines the interpreted language in terms
of a mappings from patterns to actions. Changes in interpreting language are
reflected by changes of modes. For example, in the TeX typesetting system
the math mode is entered and exited from text mode by $
operators. In a
fragment like:
Einstein stated that $E = m c^2$ describes the mass energy equivalence.
the floating text is interrupted by $
delimiters. A $
triggers the
math mode instruction parsing until the closing $
appears. The parser,
then, returns into text parsing mode.
In a Quex source file, the behavior of a lexer is defined by means of the
sections mode
and define
. The initial mode is specified by the
assignment
start = NAME;
where start
is a top-level keyword and NAME
is the name of the intended
initial mode with which the interpretation shall start. This statement is only
required, if there is more than one mode.
Mode¶
The top-level keyword mode
signalizes the beginning of a mode definition.
The following scheme pinpoints a basic mode definition.
mode NAME {
PATTERN_1 ACTION_1
PATTERN_2 ACTION_2
...
on_EVENT_1 EVENT_HANDLER_1
on_EVENT_2 EVENT_HANDLER_2
}
The identifier NAME
specifies the mode’s name. The content
of a mode definition consists of a list of pairs of patterns and their related
actions to be performed upon match. Also events such as ‘end-of-file’ may be
associated with event handlers. The behavior of a mode, however, can be
customized with some more detail, if the mode’s NAME
is followed by
a colon, as shown below.
mode NAME : BASE_A, BASE_B, BASE_C
<tag-1> <tag-2> {
...
}
The names following the :
express inheritance relationships by means of the
mode’s base modes. Mode inheritance provides a means of behavioral composition. For
example, a mode COMMON
may be defined to handle the events of ‘failure’ and
‘end-of-stream’. For all other modes to behave uniformly with this respect, is
is sufficient to mention COMMON
as base mode.
Tags are bracketed in <
and >
brackets. Tags are used to control mode
transitions and line and column numbering behavior among other things. The
code fragment below provides an impression of a full-fledged lexer mode named
STRING
.
mode STRING : END_OF_FILE <entry: PROGRAM> <exit: PROGRAM>
{
on_entry { string_clear(); }
on_exit { self.send_string(QUEX_TKN_STRING, string_get()); }
"\\\\"" { string_add(Lexeme); }
"\\\\" { string_add(Lexeme); }
"\"" { self.enter_mode(&self, PROGRAM); RETURN; }
on_failure {
string_add(Lexeme);
self.error_code_clear_this(&self, E_Error_OnFailure);
}
}
Mode STRING
is derived from mode END_OF_FILE
which, supposedly, has the
task to handle the end-of-file event. STRING
shall ‘eat’ the content of
sections in double quotes. The tags entry
and exit
restrict the
transitions from and to this mode. Mode STRING
, as defined here, is only
accessible from mode PROGRAM
. Supposedly, STRING
is entered when
PROGRAM
detects an opening quote and it is left, when it detects a closing
double quote. The remaining pattern-action pairs and event handlers implement a
string accumulation and deal with escaped double quotes and backslashes.
The definition of large patterns can add a significant amount of visual noise. A mode proliferated with visual noise can hardly make its behavior transparent. Consider, for example, the following definition for Latin, Thai, Arabic, and Devanagari numbers.
([0-9]+".")?[0-9]+|([:intersection(\P{Block=Arabic},[X660-X6D0]):]+".")?[:intersection(\P{Block=Arabic},[X660-X6D0]):]+|([:intersection(\P{Script=Thai},\G{Nd}):]+".")?[:intersection(\P{Script=Thai},\G{Nd}):]+|([:intersection(\P{Script=Devanagari},\G{Nd}):]+".")?[:intersection(\P{Script=Devanagari},\G{Nd}):]+ => QUEX_TKN_NUMBER(Lexeme);
It is clearly impossible for an unaltered human being to comprise the meaning of
this expression [4]. Moreover, errors can hardly be detected. The next section
explains how to build those expressions hierarchically and finally come
up with a shorthand NUMBER
, so that the above rule can be expressed as
{NUMBER} => QUEX_TKN_NUMBER(Lexeme);
A detailed discussion of modes and their syntax is presented in chapter sec:modes.
Definitions¶
- define¶
Section to define shorthands, macros, and trigger plots and lexeme set displays for regular expressions.
define { ... }
The define
keyword is followed by a section in curly brackets. The purpose
of this section is to define handy shorthands for patterns to be used in mode
definitions. To achieve this, the instructions below are provided.
NAME
: shorthand definition.\macro
: macro definition.\run
: running sample data on a DFA.\plot
: plotting a DFA.\lexemes
: determine lexemes matched by a DFA.
The following paragraphs discuss their use. The most essential part of a define
section are shorthand definitions. Those are the output to be used later. An
NAME
can be anything starting with letters and continuing with
letters or numbers plus or the underscore. Formally, a shorthand
is defined by:
NAME RE
where RE
is a regular expression. Whenever the NAME
appears in
curly brackets, it is replaced by the regular expression RE
. This can be
used to define hierarchies of shorthands, for example:
define {
DIGIT [0-9]
FLOAT {DIGIT}+(.{DIGIT}*)
}
defines first the shorthand DIGIT
and based on that the shorthand for a
floating point number FLOAT
. It consists of some digits possibly followed
by a dot and more digits. A macro is similar to a shorthand, only that it
receives arguments. It can be defined by the \macro
instruction:
\macro NAME(TYPE0 ARG0[, TYPEn ARGn]+): RE
where NAME
specifies the macro’s name. A comma-separated list
of arguments is passed in brackets [3]. Each argument is specified with
its TYPE
and its name. Possible types are
dfa
for a deterministic finite automaton, i.e. a state machine, or a regular expression, This may also be used to pass a string.
set
for a character set, and
integer
for an integer [2].
The RE
is the regular expression in which the arguments may
be expanded. Arguments act like a shorthand inside the macro’s RE
.
For example, the macro:
\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)?
expands what is passed as first argument named DIGIT
inside the regular
expression {DIGIT}+("."{DIGIT}*)
. Macros are expanded the like shorthands,
namely by surrounding curly brackets. When arguments are passed,
they are separated by whitespace, i.e. formally:
{NAME ARG0 [ARGn]+}
A macros call implements construction rule for regular expression very
concisely. Instead of defining a float number regular expression explicitly for
different scripts, the aforementioned macro FLOAT
may be used. For
convinience, let the SEPERATOR
argument customize the separator between the
whole number part and the fractional part. The FLOAT
macro can then be used
conveniently to define English, German, Arabic and Thai floating point numbers,
as shown below.
define {
\macro FLOAT(set DIGIT, dfa SEPERATOR): {DIGIT}+({SEPERATOR}{DIGIT}*)?
// DIGIT: SEPARATOR:
ENGLISH_FLOAT {FLOAT [0-9] "." /* Decimal dot */}
GERMAN_FLOAT {FLOAT [0-9] "," /* German comma */}
ARABIC_FLOAT {FLOAT [۰-۹] "،" /* Arabic comma */}
THAI_FLOAT {FLOAT [\X0E50-\X0E59] "." /* Decimal dot */}
}
Note
Shorthands and macros do not enter the name space in generated code. They
are only available inside the top-level sections mode
and
define
.
Using macros and shorthands in mode definitions makes pattern-action pairs more
transparent. Moreover, the hierarchical construction of patterns based on
understandable sub-patterns is essential for a robust design–following the
paradigm of ‘divide and conquer’ [5]! The remaining commands \plot
and
\run
support the development and verification of regular expressions.
The command \plot
may be used to visualize results of regular expressions.
It comes in two flavors:
- \plot ['(' [e] [FLAGS]* ')'] SHORTHAND
When no flag or the
e
flag is specified, the\plot
command expands theSHORTHAND
and writes the according state machine graph into a file namedSHORTHAND.dot
.
- \plot '(' f [FLAGS]* ')' FILE-NAME RE
If the
f
flag is specified, the\plot
command writes a state machine graph of the regular expressionRE
to a fileFILE-NAME
in Graphviz format [].
This allows to see quickly the state machine graph of a regular expression’s DFA.
For example the definition of the pattern FRUIT
can be plotted as shown below.
define {
FRUIT apple|pineapple|guapple|pitaya
\plot FRUIT
}
plots the state machine implemting the match of abc
into the file
MINE.dot
. If the f
flag is given, then the \plot
command
expects a file name followed by an expression to plot. For example,
define {
FRUIT apple|pineapple|guapple|pitaya
\plot(f) FRUIT.dot {FRUIT}
}
does the exact same thing as the aforementioned example. The resulting
graph can be produces with the dot
tool of the Graphviz package. That is,
given the content above in a file test.qx
, the following commands
produce what is shown in figure Fig. 9
> quex -i test.qx
> dot FRUIT.dot -Tsvg
The flags of the \plot
given in table Flags passed to \plot..
Flag |
Meaning |
---|---|
|
annotate edges with |
|
annotate edges with |
|
annotate edges with unicode |
|
plot graph to terminal |
|
plot |
|
plot to |
|
suppress print of location (file, line number) |
For investigation of a patterns matching behavior, the \run
command
may be used.
- \run RE TEST-RE
This commands considers all lexemes that match the regular expression TEST-RE
as input sequences to a DFA derived from the regular expression RE
. For
example,:
\run [a-z]+ h[ae]llo
The expression h[ae]llo
matches hello
and hallo
. So the \run
command feeds the string "hello"
in one run to the DFA derived from
[a-z]+
. For the second run it feeds the string "hallo"
. The output
of the run command is printed to the standard output. Below, is the example
to check the parsing of float numbers as defined above.
define {
// NOTE: wrong definition where fractional part is not optional
\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)
NUMBER {FLOAT [0-9]}
\run {NUMBER} "123 "|"123.456;"
}
The \run
command takes an expansion of NAMESPACE
as the the DFA on
which the test sequences will be applied. The test sequences are 123 `` and
``123.456;
. Storing the above fragment in a file test.qx
and applying:
> quex -i test.qx
the result of the three test sequences is printed to the standard output as shown below.:
test.qx:7: run on '{NUMBER}'
test.qx:7:
test.qx:7: |1|2|3| |
test.qx:7: ^
test.qx:7: |1|2|3|.|4|5|6|;|
test.qx:7: A ^
test.qx:7:
test.qx:7: run <terminated>
An A
represents the position where acceptance occurred. The ^
denotes
the furthest position that the analyzer went. The second sequences 123.456
accepts the floating point number. After the lexer has found the ;
it goes
back to the last digit and notifies a match. However, the first sequences
123 `` is not accepted, even though it is a valid number. The reason is that
the definition of ``FLOAT
is wrong. There, the fractional part was not made
optional, i.e. the ?
was missing. With the correct definition:
\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)?
the output becomes:
test.qx:7: run on '{NUMBER}'
test.qx:7:
test.qx:7: |1|2|3| |
test.qx:7: A ^
test.qx:7: |1|2|3|.|4|5|6|;|
test.qx:7: A ^
test.qx:7:
test.qx:7: run <terminated>
which meets the expectations. As shown in this example, the \run
command is a
convinient tool to find errors in regular expression definitions quickly.
- \lexemes '(' [FLAGS]* ')' RE
The \lexemes
command determines the set of lexemes possibly matched by a
DFA. This is another way to see whether a regular expression does what it should.
For example.:
define {
\lexemes for|forest
}
delivers:
test.qx:2: lexemes for 'for|forest' (max loops=0, sets=False)
test.qx:2:
test.qx:2: f, o, r
test.qx:2: f, o, r, e, s, t
test.qx:2:
test.qx:2: lexemes <terminated>
when a regular expression contains repetition, it makes sense to restrict the maximum number of loop iterations. This is done by a decimal number flag between 1 and 5. For example:
define {
IDENTIFIER [a-z][0-9a-z]*
\lexemes (3) {IDENTIFIER}
}
iterates at maximum three times through a loop and delivers:
test.qx:3: lexemes for '{IDENTIFIER}' (max loops=2, sets=False)
test.qx:3:
test.qx:3: a
...
test.qx:3: z
test.qx:3: a, 0
...
test.qx:3: a, 9
...
test.qx:3: z, z, z
test.qx:3:
test.qx:3: lexemes <terminated>
which is, indeed, a very lengthy expression. Instead of printing each possible
number, number sets can be printed using the s
flag. This shortens the list
of printed lines and the result of \lexemes (3s) {IDENTIFIER}
becomes:
test.qx:3: lexemes for '{IDENTIFIER}' (max loops=2, sets=True)
test.qx:3:
test.qx:3: {[a, z]}
test.qx:3: {[a, z]}, {[0, 9] [a, z]}
test.qx:3: {[a, z]}, {[0, 9] [a, z]}, {[0, 9] [a, z]}
test.qx:3:
test.qx:3: lexemes <terminated>
The complete list of flags to be passed to the \lexemes
command is shown
in Table 4.
Flag |
Meaning |
---|---|
|
maximum number of loop iterations. |
|
display hexadecimal numbers. |
|
display decimal numbers. |
|
display unicode |
|
display sets instead of single lexatoms. |
|
suppress print of location (file, line number) |
Footnotes