Pattern Matching¶
The following paragraphs discuss three possible causes of confusion with respect to pattern matching, namely:
- Wrong assumptions about matching lexemes.
- Confusion of matching length and progress of input position.
- Usage of buffer sentinels.
Interferring Patterns¶
In the examples mentioned here, matching runs are displayed in the same format
as produced by the \run
command from the define
section
Definitions . An A
indicates the position of
acceptance, and therefore the position of the subsequent analysis step. ^
indicates the furthersest position of treated input. The symbol ␣
is
occasionally used to indicate ‘space’ when it supports clarity. Let the code
below be an attempt to define a lexer, that produces a NEGATION
token when
Not
occurs and an IDENTIFIER
when other words occur.
mode EXAMPLE : <skip: [ ]> {
"Not" => QUEX_TKN_NEGATION();
[A-Za-z]+ => QUEX_TKN_IDENTIFIER(Lexeme);
}
A potential problem might be unawareness that the analysis step does not
necessarily stop after Not
. Given the input NotSomething␣
the pattern
[A-Za-z]+
runs:
|N|o|t|S|o|m|e|t|h|i|n|g|␣|
A ^
but "Not"
runs only:
|N|o|t|S|o|m|e|t|h|i|n|g|␣|
A ^
Thus, the former wins. Instead of:
QUEX_TKN_NEGATION
QUEX_TKN_IDENTIFIER("Something")
the lexer produces the single token:
IDENTIFIER("NotSomething")
In an inadvert design, a pattern may ‘eat’ into a lexeme which is the concern
of another pattern. One way to deal with that is by incorporating delimiters
in a language design, such as space and punctuation in human scriptures. If
this is not an option, \NotBegin{}
and \NotEnd{}
may be applied. The
following defines a FILTER
macro for that purpose.
The macro FILTER
generates a DFA matching all lexemes which are matched by
X
but do not start or end with what is matched by OUT
. On that
account, {IDENTIFIER}
will never eat something at the beginning or the
end that is matched by one of the keywords in {KEYWORDS}
.
Delimters in Pattern¶
It is particularly a bad idea to incorporate delimiters inside a pattern. The
function of a delimiter is to delimit, which is already a concept on the
level of meaning, not simply pattern detection. To treat meaning is the
task of syntax or even semantic analysis. A lexer should be restricted to
identify atomic chunks of meaning without interpreting them. Below, the
pattern for QUEX_TKN_COMPOSITE
matches any number of words separated by ::
as internal delimiter.
mode EXAMPLE <skip: [ \t]+> {
"fish" => QUEX_TKN_FISH();
[a-z]+(::[a-z]+)* => QUEX_TKN_COMPOSITE(Lexeme);
}
The isolated sequence fish
would be detected as QUEX_TKN_FISH
. However,
in the context of ::
such as in fisch::trout
it is absorbed into
QUEX_TKN_COMPOSITE("fisch::trout")
.
In some cases the limits might be fuzzy. A lexer that interpretes trivialized
XML code containing only plain tags without attributes (e.g. <block> ... </block>
),
it may be tempting to define a pattern for matching XML tags as shown below.
define { P_XML_TAG <\/?[A-Za-z!][^>]*> }
That is, it matches anything that starts with <
and ends with >
and
contains some letters. This case, de facto, was reported on the Quex project page
as an issue, when a lexer choked on a text that contained <
as text content,
namely
… interi non negativi e “p” una probabilità (0<p<1). La funzione …
The occurence of <
in (0<p<1)
was considered as opening of an XML tag
from where the lexer searched for the closing >
. Anything between the
lesser sign and the next following greater sign was erroneously interpreted as
XML tag. To cope with this situations such plain text regions must be
protected from a lurking XML tags. The obvious solution is to interpreting them
in a dedicated mode. A ‘workaround’ might look like the following.
define { P_XML_TAG ^[ \t]*<\/?[A-Za-z!][^\n>]*> }
This makes two assumptions. First, it assumes that all documents are well
formed XML documents [], where tags always appear
at the beginning of a line. Further, it is assumed that a <
never
appears in normal text at the beginning of a line. While this approach
avoids the definition of a dedicated text mode, it’s robustness stands and
falls with the two aforementioned assumptions.
Patterns conditioned by begin-of-line or end-of-line appear to be absolute. A
^
(begin-of-line) or $
(end-of-line) in a pattern seems to pretend to
match always when the pattern occurs at the beginning or at the end of a
line. The appearance is deceiving, though, if other patterns eat over the
line borders.
mode EXAMPLE {
^[ ]*hello => QUEX_TKN_HELLO_AT_BEGIN();
hello => QUEX_TKN_HELLO();
[ \n]+ {}
}
Here, the pattern [ \n]+
eats newline and all spaces at the beginning of
the subsequent line. Thus, it does not stop immediately after newline and
the next analysis step does not comply to ‘begin-of-line’. Consequently, in a
line that starts with whitespace, the QUEX_TKN_HELLO
token is produced and not
QUEX_TKN_HELLO_AT_BEGIN
. Even though, the QUEX_TKN_HELLO_AT_BEGIN
pattern starts with ^[
]*
which should consume whitespace at the beginning of a line, it is
never in a position to do so.
Matching Length vs. Input Position Progress¶
In the context of post-contexts, surprises may occur due to the confusion of
matching length and input position progress. For example, let the pattern
"sun"/[a-z]+
match anytime where sun
appears in sun-related words such
as sunset
, sunburn
, or sunray
. Words unrelated to sun, such as
sunken
, shall be exempted.
mode EXAMPLE {
"sun"/[a-z]+ => QUEX_TKN_PREFIX_SUN;
"sunken" => QUEX_TKN_VERB(Lexeme);
[a-z]+ => QUEX_TKN_WORD(Lexeme);
}
When the post context matches, the input position returns to the position
right after where the core context matched. Thus, the pattern "sun"/[a-z]+
runs its match on sunken␣
as follows [#f1]_:
|s|u|n|k|e|n|␣|
A ^
That is, it walks until it reaches whitespace (where ^
points), but then
returns to the position right after the n
of sun
(to where the A
points). When pattern "sunken"
matches, it progresses the input pointer
further, namely after the last n
of sunken
, as can be seen here.:
|s|u|n|k|e|n|␣|
A ^
Since sunken
progresses the input position (A
) further than
sun/[a-z]+
, one might expect that sunken
wins. However, the
post-context of sun/[a-z]+
is included into the matching length (stretch
from begin to ^
). It is the same in both cases. The pattern
"sun"/[a-z]
wins, because it has higher precedence. Instead of the
probably expected single token:
QUEX_TKN_VERB("sunken")
the lexer sends:
QUEX_TKN_PREFIX_SUN
QUEX_TKN_WORD("ken")
Sentinels¶
Another pitfall is related to sentinels, i.e. character codes that the lexer applies to indicate the buffer-limit. The default values for those codes are 0x0 for buffer-limit and 0x1 for path termination. They are chosen to be out of the range for sound regular expressions parsing human written text. If these values are supposed to occur inside patterns, then alternative values need to be specified, as shown in sec:command-line.