Pattern Matching

The following paragraphs discuss three possible causes of confusion with respect to pattern matching, namely:

Wrong assumptions about matching lexemes.
Confusion of matching length and progress of input position.
Usage of buffer sentinels.

Interferring Patterns

In the examples mentioned here, matching runs are displayed in the same format as produced by the \run command from the define section Definitions . An A indicates the position of acceptance, and therefore the position of the subsequent analysis step. ^ indicates the furthersest position of treated input. The symbol is occasionally used to indicate ‘space’ when it supports clarity. Let the code below be an attempt to define a lexer, that produces a NEGATION token when Not occurs and an IDENTIFIER when other words occur.

mode EXAMPLE : <skip: [ ]> {
    "Not"     => QUEX_TKN_NEGATION();
    [A-Za-z]+ => QUEX_TKN_IDENTIFIER(Lexeme);
}

A potential problem might be unawareness that the analysis step does not necessarily stop after Not. Given the input NotSomething␣ the pattern [A-Za-z]+ runs:

|N|o|t|S|o|m|e|t|h|i|n|g|␣|
                        A ^

but "Not" runs only:

|N|o|t|S|o|m|e|t|h|i|n|g|␣|
      A ^

Thus, the former wins. Instead of:

QUEX_TKN_NEGATION
QUEX_TKN_IDENTIFIER("Something")

the lexer produces the single token:

IDENTIFIER("NotSomething")

In an inadvert design, a pattern may ‘eat’ into a lexeme which is the concern of another pattern. One way to deal with that is by incorporating delimiters in a language design, such as space and punctuation in human scriptures. If this is not an option, \NotBegin{} and \NotEnd{} may be applied. The following defines a FILTER macro for that purpose.

The macro FILTER generates a DFA matching all lexemes which are matched by X but do not start or end with what is matched by OUT. On that account, {IDENTIFIER} will never eat something at the beginning or the end that is matched by one of the keywords in {KEYWORDS}.

Delimters in Pattern

It is particularly a bad idea to incorporate delimiters inside a pattern. The function of a delimiter is to delimit, which is already a concept on the level of meaning, not simply pattern detection. To treat meaning is the task of syntax or even semantic analysis. A lexer should be restricted to identify atomic chunks of meaning without interpreting them. Below, the pattern for QUEX_TKN_COMPOSITE matches any number of words separated by :: as internal delimiter.

mode EXAMPLE <skip: [ \t]+> {
    "fish"                => QUEX_TKN_FISH();
    [a-z]+(::[a-z]+)* => QUEX_TKN_COMPOSITE(Lexeme);
}

The isolated sequence fish would be detected as QUEX_TKN_FISH. However, in the context of :: such as in fisch::trout it is absorbed into QUEX_TKN_COMPOSITE("fisch::trout").

In some cases the limits might be fuzzy. A lexer that interpretes trivialized XML code containing only plain tags without attributes (e.g. <block> ...  </block>), it may be tempting to define a pattern for matching XML tags as shown below.

define { P_XML_TAG   <\/?[A-Za-z!][^>]*> }

That is, it matches anything that starts with < and ends with > and contains some letters. This case, de facto, was reported on the Quex project page as an issue, when a lexer choked on a text that contained < as text content, namely

… interi non negativi e “p” una probabilità (0<p<1). La funzione …

The occurence of < in (0<p<1) was considered as opening of an XML tag from where the lexer searched for the closing >. Anything between the lesser sign and the next following greater sign was erroneously interpreted as XML tag. To cope with this situations such plain text regions must be protected from a lurking XML tags. The obvious solution is to interpreting them in a dedicated mode. A ‘workaround’ might look like the following.

define { P_XML_TAG   ^[ \t]*<\/?[A-Za-z!][^\n>]*> }

This makes two assumptions. First, it assumes that all documents are well formed XML documents [], where tags always appear at the beginning of a line. Further, it is assumed that a < never appears in normal text at the beginning of a line. While this approach avoids the definition of a dedicated text mode, it’s robustness stands and falls with the two aforementioned assumptions.

Patterns conditioned by begin-of-line or end-of-line appear to be absolute. A ^ (begin-of-line) or $ (end-of-line) in a pattern seems to pretend to match always when the pattern occurs at the beginning or at the end of a line. The appearance is deceiving, though, if other patterns eat over the line borders.

mode EXAMPLE {
    ^[ ]*hello => QUEX_TKN_HELLO_AT_BEGIN();
    hello      => QUEX_TKN_HELLO();
    [ \n]+     {}
}

Here, the pattern [ \n]+ eats newline and all spaces at the beginning of the subsequent line. Thus, it does not stop immediately after newline and the next analysis step does not comply to ‘begin-of-line’. Consequently, in a line that starts with whitespace, the QUEX_TKN_HELLO token is produced and not QUEX_TKN_HELLO_AT_BEGIN. Even though, the QUEX_TKN_HELLO_AT_BEGIN pattern starts with ^[ ]* which should consume whitespace at the beginning of a line, it is never in a position to do so.

Matching Length vs. Input Position Progress

In the context of post-contexts, surprises may occur due to the confusion of matching length and input position progress. For example, let the pattern "sun"/[a-z]+ match anytime where sun appears in sun-related words such as sunset, sunburn, or sunray. Words unrelated to sun, such as sunken, shall be exempted.

mode EXAMPLE {
    "sun"/[a-z]+ => QUEX_TKN_PREFIX_SUN;
    "sunken"     => QUEX_TKN_VERB(Lexeme);
    [a-z]+       => QUEX_TKN_WORD(Lexeme);
}

When the post context matches, the input position returns to the position right after where the core context matched. Thus, the pattern "sun"/[a-z]+ runs its match on sunken␣ as follows [#f1]_:

|s|u|n|k|e|n|␣|
      A       ^

That is, it walks until it reaches whitespace (where ^ points), but then returns to the position right after the n of sun (to where the A points). When pattern "sunken" matches, it progresses the input pointer further, namely after the last n of sunken, as can be seen here.:

|s|u|n|k|e|n|␣|
            A ^

Since sunken progresses the input position (A) further than sun/[a-z]+, one might expect that sunken wins. However, the post-context of sun/[a-z]+ is included into the matching length (stretch from begin to ^). It is the same in both cases. The pattern "sun"/[a-z] wins, because it has higher precedence. Instead of the probably expected single token:

QUEX_TKN_VERB("sunken")

the lexer sends:

QUEX_TKN_PREFIX_SUN
QUEX_TKN_WORD("ken")

Sentinels

Another pitfall is related to sentinels, i.e. character codes that the lexer applies to indicate the buffer-limit. The default values for those codes are 0x0 for buffer-limit and 0x1 for path termination. They are chosen to be out of the range for sound regular expressions parsing human written text. If these values are supposed to occur inside patterns, then alternative values need to be specified, as shown in sec:command-line.