Pitfalls

The most dangerous pitfall is related to precedence and match length. Note that a pattern defined before another pattern has the higher precedence. However, if a pattern can match a longer chain of characters, it wins regardless of precedence. Thus, if there are, for example, two patterns

[A-Z]+     => TKN_IDENTIFIER(Lexeme);
"PRINT"    => TKN_KEYWORD_PRINT;

then the keyword PRINT will never be matched. This is so because [A-Z]+ also matches the character chain PRINT and, being defined first, has the higher precedence; since both patterns match the same five characters, length does not decide and precedence does.
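
A corrected version simply reverses the order of definition: for the lexeme PRINT both patterns match the same five characters, so the keyword, now being defined first, wins by precedence, while longer lexemes such as PRINTER are still matched by [A-Z]+ as identifiers.

"PRINT"    => TKN_KEYWORD_PRINT;
[A-Z]+     => TKN_IDENTIFIER(Lexeme);

To illustrate the danger of 'greedy matching', i.e. the fact that length matters, let two patterns be defined as: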

key    => TKN_KEYWORD;
[a-z]+ => TKN_IDENTIFIER;

then the input 'keyword' would produce a single token:

IDENTIFIER   'keyword'

because IDENTIFIER matches a longer lexeme than 'key' alone does. If, however, the desired token sequence is:

KEYWORD      'key'
IDENTIFIER   'word'

then the analyzer must be considered inadequate. This example highlights a bigger problem related to 'greedy matching':

Warning

When specifying a pattern-action pair with a high priority (such as in a base mode), one may not be aware of a pattern of lower priority (e.g. in a derived mode). Lower-priority patterns may very well win against high-priority patterns if they match longer lexemes.

Assumptions such as ‘high-priority patterns always win’ or ‘base mode patterns always win’ are not correct for the aforementioned reason.
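
To make this concrete, consider a minimal sketch with a base mode and a derived mode (the mode names and the inheritance layout are invented for this illustration; the token names follow the examples above):

mode BASE {
    key    => TKN_KEYWORD;      // defined in the base mode, high priority
}

mode DERIVED : BASE {
    [a-z]+ => TKN_IDENTIFIER;   // lower priority, but may match longer lexemes
}

Whoever writes the base mode may never see the identifier rule, yet in DERIVED the input 'keyword' is matched by [a-z]+ and produces IDENTIFIER, not KEYWORD.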

An unexpected outrun by a lower-priority pattern manifests itself in a lexer that seems to work sometimes and sometimes not. The behavior then often seems to depend on padding characters such as whitespace. Quex can help to detect unexpected outruns with the command line option '--warning-on-outrun', or '--woo' for short. When it is specified for a file containing the aforementioned example, quex prints the following notification:

mine.qx:5:warning: The pattern '[a-z]+' has lower priority but
mine.qx:4:warning: may outrun pattern 'key' as defined here.
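
The notification above could stem from an invocation along the following lines (a sketch; the engine name EasyLexer is only a placeholder, the essential part is the added option):

quex -i mine.qx -o EasyLexer --warning-on-outrun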

The potential outrunning of high-priority patterns by low-priority patterns can be very subtle. Consider:

'alb|albertikus' => TKN_ALB;
'albert'         => TKN_ALBERT;

When a lexeme 'alb' is not followed by 'ert', then a token ALB is produced. If the lexeme 'albert' is not followed by 'ikus', then ALBERT matches and outruns ALB. If, however, 'albertikus' appears in the input stream, then ALB matches again and is not outrun. With complex patterns this is an obvious potential source of confusion. The '--woo' command line option mentioned above helps to understand such behavior.
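
In summary, assuming no other patterns interfere, the three cases behave as follows:

alb         (not followed by 'ert')    ALB         'alb'
albert      (not followed by 'ikus')   ALBERT      'albert'
albertikus                             ALB         'albertikus'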

Greedy matching can be outwitted by the use of anti-patterns. For this, the IDENTIFIER pattern is expressed in terms of an anti-pattern, i.e.:

key           => TKN_KEYWORD;
\A{key}[a-z]+ => TKN_IDENTIFIER;
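
With these two definitions the input 'keyword' should now produce the token sequence desired above:

KEYWORD      'key'
IDENTIFIER   'word'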

If more than one keyword is involved, e.g. 'key', 'password', and 'secret', the IDENTIFIER pattern would have to look like:

\A{key|password|secret}[a-z]+ => TKN_IDENTIFIER;

In the vast majority of cases, greedy matching is a convenient blessing. Imagine the problem with identifiers, i.e. any chain of alphabetic characters, and a keyword 'for'. If there were no greedy matching (longest match), then any lexeme starting with 'for', such as 'forest', could not be detected properly, since the first three letters would already produce the for-keyword token.
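
Sketched in the style of the earlier examples (the token names are invented), the two rules would read:

for    => TKN_KEYWORD_FOR;
[a-z]+ => TKN_IDENTIFIER(Lexeme);

Thanks to the longest match, the input 'forest' yields IDENTIFIER 'forest' in one piece, while a standalone 'for' still yields KEYWORD_FOR, because for lexemes of equal length the earlier definition wins.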

Another pitfall is related to the character codes that the lexical analyzer uses to indicate the buffer limit. The values for those codes are chosen to lie outside the range of sound regular expressions for parsing human-written text (0x0 for the buffer limit). If binary files are to be parsed and such a value is supposed to occur in patterns, then its code needs to be changed. Section <<sec-formal-command-line-options>> mentions how to specify the buffer limit code on the command line.

One more pitfall to be mentioned is actually a plain user error. But, since it once resulted in a bug report [1], it is mentioned here nonetheless. The issue comes into play when patterns span regions, such as HTML tags.

define {
    ...
    P_XML   <\/?[A-Za-z!][^>]*>
    ...
}

Now, when parsing larger files or databases, a puzzling buffer overflow might occur. The reason for this may be a '<' character in a place where it is not intended as a tag opener, as in the following text:

La funzione di probabilità è data da ove "k" e "r" sono interi non
negativi e "p" una probabilità (0<p<1) La funzione generatrice dei
momenti è: A confronto con le due ...

The occurrence of '<' in '(0<p<1)' opens the P_XML pattern and lets the analyzer search for the closing '>', which may never appear. In any case, it is inappropriate to treat this as an XML tag. Thus, patterns that span regions must be protected against unintentional region openers. One might use '^', i.e. begin of line, to restrict the possible matches; e.g. the definition

define {
    ...
    P_XML   ^[ \t]*<\/?[A-Za-z!][^>]*>
    ...
}

might restrict the set of possible matches reasonably.

Footnotes

[1] See bug report 2272677. Thanks to Prof. G. Attardi for pinpointing this issue.