Behavior¶
A lexer’s behavior is controlled by its mode. In that sense, it is similar
to a human’s mood [1], it only differs in that a lexer’s mode is distinct
for a given point in time. A mode determines the interpreted language in terms
of a mappings from patterns to actions. Changes in interpreting language are
reflected by changes of modes. For example, in the TeX typesetting system
the math mode is entered and exited from text mode by $ operators. In a
fragment like:
Einstein stated that $E = m c^2$ describes the mass energy equivalence.
the floating text is interrupted by $ delimiters. A $ triggers the
math mode instruction parsing until the closing $ appears. The parser,
then, returns into text parsing mode.
In a Quex source file, the behavior of a lexer is defined by means of the
sections mode and define. The initial mode is specified by the
assignment
start = NAME;
where start is a top-level keyword and NAME is the name of the intended
initial mode with which the interpretation shall start. This statement is only
required, if there is more than one mode.
Mode¶
The top-level keyword mode signalizes the beginning of a mode definition.
The following scheme pinpoints a basic mode definition.
mode NAME {
PATTERN_1 ACTION_1
PATTERN_2 ACTION_2
...
on_EVENT_1 EVENT_HANDLER_1
on_EVENT_2 EVENT_HANDLER_2
}
The identifier NAME specifies the mode’s name. The content
of a mode definition consists of a list of pairs of patterns and their related
actions to be performed upon match. Also events such as ‘end-of-file’ may be
associated with event handlers. The behavior of a mode, however, can be
customized with some more detail, if the mode’s NAME is followed by
a colon, as shown below.
mode NAME : BASE_A, BASE_B, BASE_C
<tag-1> <tag-2> {
...
}
The names following the : express inheritance relationships by means of the
mode’s base modes. Mode inheritance provides a means of behavioral composition. For
example, a mode COMMON may be defined to handle the events of ‘failure’ and
‘end-of-stream’. For all other modes to behave uniformly with this respect, is
is sufficient to mention COMMON as base mode.
Tags are bracketed in < and > brackets. Tags are used to control mode
transitions and line and column numbering behavior among other things. The
code fragment below provides an impression of a full-fledged lexer mode named
STRING.
mode STRING : END_OF_FILE <entry: PROGRAM> <exit: PROGRAM>
{
on_entry { string_clear(); }
on_exit { self.send_string(QUEX_TKN_STRING, string_get()); }
"\\\\"" { string_add(Lexeme); }
"\\\\" { string_add(Lexeme); }
"\"" { self.enter_mode(&self, PROGRAM); RETURN; }
on_failure {
string_add(Lexeme);
self.error_code_clear_this(&self, E_Error_OnFailure);
}
}
Mode STRING is derived from mode END_OF_FILE which, supposedly, has the
task to handle the end-of-file event. STRING shall ‘eat’ the content of
sections in double quotes. The tags entry and exit restrict the
transitions from and to this mode. Mode STRING, as defined here, is only
accessible from mode PROGRAM. Supposedly, STRING is entered when
PROGRAM detects an opening quote and it is left, when it detects a closing
double quote. The remaining pattern-action pairs and event handlers implement a
string accumulation and deal with escaped double quotes and backslashes.
The definition of large patterns can add a significant amount of visual noise. A mode proliferated with visual noise can hardly make its behavior transparent. Consider, for example, the following definition for Latin, Thai, Arabic, and Devanagari numbers.
([0-9]+".")?[0-9]+|([:intersection(\P{Block=Arabic},[X660-X6D0]):]+".")?[:intersection(\P{Block=Arabic},[X660-X6D0]):]+|([:intersection(\P{Script=Thai},\G{Nd}):]+".")?[:intersection(\P{Script=Thai},\G{Nd}):]+|([:intersection(\P{Script=Devanagari},\G{Nd}):]+".")?[:intersection(\P{Script=Devanagari},\G{Nd}):]+ => QUEX_TKN_NUMBER(Lexeme);
It is clearly impossible for an unaltered human being to comprise the meaning of
this expression [4]. Moreover, errors can hardly be detected. The next section
explains how to build those expressions hierarchically and finally come
up with a shorthand NUMBER, so that the above rule can be expressed as
{NUMBER} => QUEX_TKN_NUMBER(Lexeme);
A detailed discussion of modes and their syntax is presented in chapter sec:modes.
Definitions¶
- define¶
Section to define shorthands, macros, and trigger plots and lexeme set displays for regular expressions.
define { ... }
The define keyword is followed by a section in curly brackets. The purpose
of this section is to define handy shorthands for patterns to be used in mode
definitions. To achieve this, the instructions below are provided.
NAME: shorthand definition.\macro: macro definition.\run: running sample data on a DFA.\plot: plotting a DFA.\lexemes: determine lexemes matched by a DFA.
The following paragraphs discuss their use. The most essential part of a define
section are shorthand definitions. Those are the output to be used later. An
NAME can be anything starting with letters and continuing with
letters or numbers plus or the underscore. Formally, a shorthand
is defined by:
NAME RE
where RE is a regular expression. Whenever the NAME appears in
curly brackets, it is replaced by the regular expression RE. This can be
used to define hierarchies of shorthands, for example:
define {
DIGIT [0-9]
FLOAT {DIGIT}+(.{DIGIT}*)
}
defines first the shorthand DIGIT and based on that the shorthand for a
floating point number FLOAT. It consists of some digits possibly followed
by a dot and more digits. A macro is similar to a shorthand, only that it
receives arguments. It can be defined by the \macro instruction:
\macro NAME(TYPE0 ARG0[, TYPEn ARGn]+): RE
where NAME specifies the macro’s name. A comma-separated list
of arguments is passed in brackets [3]. Each argument is specified with
its TYPE and its name. Possible types are
dfafor a deterministic finite automaton, i.e. a state machine, or a regular expression, This may also be used to pass a string.
setfor a character set, and
integerfor an integer [2].
The RE is the regular expression in which the arguments may
be expanded. Arguments act like a shorthand inside the macro’s RE.
For example, the macro:
\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)?
expands what is passed as first argument named DIGIT inside the regular
expression {DIGIT}+("."{DIGIT}*). Macros are expanded the like shorthands,
namely by surrounding curly brackets. When arguments are passed,
they are separated by whitespace, i.e. formally:
{NAME ARG0 [ARGn]+}
A macros call implements construction rule for regular expression very
concisely. Instead of defining a float number regular expression explicitly for
different scripts, the aforementioned macro FLOAT may be used. For
convinience, let the SEPERATOR argument customize the separator between the
whole number part and the fractional part. The FLOAT macro can then be used
conveniently to define English, German, Arabic and Thai floating point numbers,
as shown below.
define {
\macro FLOAT(set DIGIT, dfa SEPERATOR): {DIGIT}+({SEPERATOR}{DIGIT}*)?
// DIGIT: SEPARATOR:
ENGLISH_FLOAT {FLOAT [0-9] "." /* Decimal dot */}
GERMAN_FLOAT {FLOAT [0-9] "," /* German comma */}
ARABIC_FLOAT {FLOAT [۰-۹] "،" /* Arabic comma */}
THAI_FLOAT {FLOAT [\X0E50-\X0E59] "." /* Decimal dot */}
}
Note
Shorthands and macros do not enter the name space in generated code. They
are only available inside the top-level sections mode and
define.
Using macros and shorthands in mode definitions makes pattern-action pairs more
transparent. Moreover, the hierarchical construction of patterns based on
understandable sub-patterns is essential for a robust design–following the
paradigm of ‘divide and conquer’ [5]! The remaining commands \plot and
\run support the development and verification of regular expressions.
The command \plot may be used to visualize results of regular expressions.
It comes in two flavors:
- \plot ['(' [e] [FLAGS]* ')'] SHORTHAND
When no flag or the
eflag is specified, the\plotcommand expands theSHORTHANDand writes the according state machine graph into a file namedSHORTHAND.dot.
- \plot '(' f [FLAGS]* ')' FILE-NAME RE
If the
fflag is specified, the\plotcommand writes a state machine graph of the regular expressionREto a fileFILE-NAMEin Graphviz format [].
This allows to see quickly the state machine graph of a regular expression’s DFA.
For example the definition of the pattern FRUIT can be plotted as shown below.
define {
FRUIT apple|pineapple|guapple|pitaya
\plot FRUIT
}
plots the state machine implemting the match of abc into the file
MINE.dot. If the f flag is given, then the \plot command
expects a file name followed by an expression to plot. For example,
define {
FRUIT apple|pineapple|guapple|pitaya
\plot(f) FRUIT.dot {FRUIT}
}
does the exact same thing as the aforementioned example. The resulting
graph can be produces with the dot tool of the Graphviz package. That is,
given the content above in a file test.qx, the following commands
produce what is shown in figure Fig. 9
> quex -i test.qx
> dot FRUIT.dot -Tsvg
Fig. 9 Example plot of a regular expression’s DFA.¶
The flags of the \plot given in table Flags passed to \plot..
Flag |
Meaning |
|---|---|
|
annotate edges with |
|
annotate edges with |
|
annotate edges with unicode |
|
plot graph to terminal |
|
plot |
|
plot to |
|
suppress print of location (file, line number) |
For investigation of a patterns matching behavior, the \run command
may be used.
- \run RE TEST-RE
This commands considers all lexemes that match the regular expression TEST-RE
as input sequences to a DFA derived from the regular expression RE. For
example,:
\run [a-z]+ h[ae]llo
The expression h[ae]llo matches hello and hallo. So the \run
command feeds the string "hello" in one run to the DFA derived from
[a-z]+. For the second run it feeds the string "hallo". The output
of the run command is printed to the standard output. Below, is the example
to check the parsing of float numbers as defined above.
define {
// NOTE: wrong definition where fractional part is not optional
\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)
NUMBER {FLOAT [0-9]}
\run {NUMBER} "123 "|"123.456;"
}
The \run command takes an expansion of NAMESPACE as the the DFA on
which the test sequences will be applied. The test sequences are 123 `` and
``123.456;. Storing the above fragment in a file test.qx and applying:
> quex -i test.qx
the result of the three test sequences is printed to the standard output as shown below.:
test.qx:7: run on '{NUMBER}'
test.qx:7:
test.qx:7: |1|2|3| |
test.qx:7: ^
test.qx:7: |1|2|3|.|4|5|6|;|
test.qx:7: A ^
test.qx:7:
test.qx:7: run <terminated>
An A represents the position where acceptance occurred. The ^ denotes
the furthest position that the analyzer went. The second sequences 123.456
accepts the floating point number. After the lexer has found the ; it goes
back to the last digit and notifies a match. However, the first sequences
123 `` is not accepted, even though it is a valid number. The reason is that
the definition of ``FLOAT is wrong. There, the fractional part was not made
optional, i.e. the ? was missing. With the correct definition:
\macro FLOAT(set DIGIT): {DIGIT}+("."{DIGIT}*)?
the output becomes:
test.qx:7: run on '{NUMBER}'
test.qx:7:
test.qx:7: |1|2|3| |
test.qx:7: A ^
test.qx:7: |1|2|3|.|4|5|6|;|
test.qx:7: A ^
test.qx:7:
test.qx:7: run <terminated>
which meets the expectations. As shown in this example, the \run command is a
convinient tool to find errors in regular expression definitions quickly.
- \lexemes '(' [FLAGS]* ')' RE
The \lexemes command determines the set of lexemes possibly matched by a
DFA. This is another way to see whether a regular expression does what it should.
For example.:
define {
\lexemes for|forest
}
delivers:
test.qx:2: lexemes for 'for|forest' (max loops=0, sets=False)
test.qx:2:
test.qx:2: f, o, r
test.qx:2: f, o, r, e, s, t
test.qx:2:
test.qx:2: lexemes <terminated>
when a regular expression contains repetition, it makes sense to restrict the maximum number of loop iterations. This is done by a decimal number flag between 1 and 5. For example:
define {
IDENTIFIER [a-z][0-9a-z]*
\lexemes (3) {IDENTIFIER}
}
iterates at maximum three times through a loop and delivers:
test.qx:3: lexemes for '{IDENTIFIER}' (max loops=2, sets=False)
test.qx:3:
test.qx:3: a
...
test.qx:3: z
test.qx:3: a, 0
...
test.qx:3: a, 9
...
test.qx:3: z, z, z
test.qx:3:
test.qx:3: lexemes <terminated>
which is, indeed, a very lengthy expression. Instead of printing each possible
number, number sets can be printed using the s flag. This shortens the list
of printed lines and the result of \lexemes (3s) {IDENTIFIER} becomes:
test.qx:3: lexemes for '{IDENTIFIER}' (max loops=2, sets=True)
test.qx:3:
test.qx:3: {[a, z]}
test.qx:3: {[a, z]}, {[0, 9] [a, z]}
test.qx:3: {[a, z]}, {[0, 9] [a, z]}, {[0, 9] [a, z]}
test.qx:3:
test.qx:3: lexemes <terminated>
The complete list of flags to be passed to the \lexemes command is shown
in Table 4.
Flag |
Meaning |
|---|---|
|
maximum number of loop iterations. |
|
display hexadecimal numbers. |
|
display decimal numbers. |
|
display unicode |
|
display sets instead of single lexatoms. |
|
suppress print of location (file, line number) |
Footnotes