Pre- and Post- Conditions

Additionally to the specification of the pattern to be matched quex allows to define conditions on the boundary of the pattern. This happens through pre- and post-conditions. First, the trivial pre- and post-conditions for begin of line and end of line are discussed. Then it is shown how to specify whole regular expressions to express conditions on the surroundings of the pattern to be matched. The traditional characters to condition begin and end of line are:

^R

a regular expression R, but only at the beginning of a line. This condition holds whenever the scan starts at the beginning of the character stream or right after a newline character. This shortcut scans only for a single newline character (hexadecimal 0A) backwards, independent on how the particular operating system codes the newline. In this case, there is no harm coming from different conventions of newline.

R$

a regular expression R, but only at the end of a line and _not_ at the end of the file. Note, that the meaning of this shortcut can be adapted according to the target operating system. Some operating systems, such as DOS and Windows, code a newline as a sequence ‘rn’ (hexadecimal 0D, 0A), i.e. as two characters. If you want to use this feature on those systems, you need to specify the --DOS option on the command line (or in your makefile). Otherwise, $ will scan only for the newline character (hexadecimal 0A).

Note, that for the trivial end-of-line post condition the newline coding convention is essential. If newline is coded as 0D, 0A then the first 0D would discard a pattern that was supposed to be followed by 0A only.

Note, that if the $ shall trigger at the end of the file, it might be advantageous to add a newline at the end of the file by default.

For more sophisticated case ‘real’ regular expressions can be defined to handle pre- and post-conditions. Note, that pre- and post-conditions can only appear at the front and rear of the core pattern. Let R be the core regular expression, Q the regular expression of the pre-condition, and S the regular expression for the post-condition.

R/S

matches an R, but only if it is followed by an S. If the pattern matches the input is set to the place where R. The part that is matched by S is available for the next pattern to be matched. R is post-conditioned. Note, that the case where the end of R matches the beginning of S cannot be treated by Version 0.9.0 [1].

Q/R/

matches R from the current position, but only if it is preceded by a Q. Practically, this means that quex goes backwards in order to determine if the pre-condition Q has matched and then forward to see if R has matched. R is pre-conditioned. Note, with pre-conditions there is no trailing context problem as with post-conditions above.

Q/R/S

matches R from the current position, but only if the preceding stream matches a Q and the following stream matches an S. R is pre- and post-conditioned.

There is an important note to make about post contexts. A post context should never contain an empty path. Because, this would mean that if there is nothing specific following it is acceptable. Thus the remaining definition of the post context is redundant. However, an issue comes with End of File, which is not a character in the stream. Then a post context is required of the type:

X is either followed by nothing, but never followed by Y.

To achieve this, the problem needs to be translated into to pattern matches:

X       => QUEX_TKN_SOME();
X/\A{Y} => QUEX_TKN_SOME();

That is both matches are associated with the same actions. The first matches the plain pattern X. The second matches an X which is followed by something that cannot match Y, i.e. the anti-pattern of Y.

Also, post context cannot be or-ed. Consider the match rule:

A pattern X has to end with Y or is followed by Z.

To achieve this, again two separate rules have to be written that trigger the same action:

XY => QUEX_TKN_SOME(); X/Z => QUEX_TKN_SOME();

The <<EOF>> is not available as post context and $ does not catch an end of file post context either.

Footnotes

[1]The reason for this lies in the nature of state machines. Flex has the exact same problem. To avoid this some type of ‘step back from the end of the post-condition’ must be implemented.