Pre- and Post-Contexts

Previous sections only discussed context-independent matching. To this point, a change of matching behavior can only be achieved by mode transitions triggered by a pattern match. However, a mode transition changes the complete matching behavior of a lexer, i.e. it changes the language. This section elaborates on a more concise way of context-sensitive matching based on what directly precedes or follows a pattern, namely pre- and post-contexts.

The context conditions ‘begin of stream’ and ‘end of stream’ can be specified as follows [#f1].

<<BOS>> R

Defines the pre-context ‘begin-of-stream’. The pattern R only matches at the beginning of the input stream.

R <<EOS>>

Defines the post-context ‘end-of-stream’. The pattern R only matches at the end of the input stream. This post context must be considered with care in situations where there might be no explicit end-of-stream condition, such as byte loaders based on socket connections.
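The two stream conditions can be illustrated outside of Quex. The following is a minimal sketch that assumes Python's re module as a stand-in: the anchors \A and \Z play the roles of <<BOS>> and <<EOS>>. The pattern choices and variable names are hypothetical and serve illustration only; this is not how Quex implements the conditions.

```python
import re

# Assumption: Python's re anchors stand in for the Quex stream conditions.
#   \A  ~ <<BOS>>  (very beginning of the input)
#   \Z  ~ <<EOS>>  (very end of the input)
BOS_HEADER = re.compile(r"\A#!.*")   # R = "#!" plus the rest of the line
EOS_WORD = re.compile(r"[a-z]+\Z")   # R = a lower-case word

text = "#!lexer\nfoo #!bar\nend"

# Only the "#!" at offset 0 satisfies the begin-of-stream condition;
# the second "#!" in the middle of the input does not.
bos_hits = [m.start() for m in BOS_HEADER.finditer(text)]

# Only the word that touches the end of the input satisfies end-of-stream.
eos_hits = [m.group() for m in EOS_WORD.finditer(text)]

print(bos_hits, eos_hits)
```

Note that such a check presupposes that the end of the input is known, which is exactly what may be missing for socket-based byte loaders.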

Notably, there must be whitespace between <<BOS>> and the pattern, as well as between a pattern and a following <<EOS>>. The following constructs express the context conditions begin-of-line and end-of-line.

^R

Matches a regular expression R, but only at the beginning of a line. This condition holds whenever the scan starts right after a newline character or at the beginning of the character stream (i.e. <<BOS>> is implied). It scans backwards only for a single newline character 0x0A ‘\n’, independent of how the particular operating system encodes the newline.

R$

Matches a regular expression R, but only at the end of a line or at the end of the input stream (i.e. <<EOS>> is implied). Traditionally, a newline can be coded in two ways: the Unix way with a plain 0x0A ‘\n’ or the DOS way with the sequence 0x0D 0x0A ‘\r\n’. By default, both are considered as post-context. The command line option --no-DOS allows one to waive the consideration of DOS newlines.

Note

Quex tolerates the shorthand $ at the end of an explicit post-context. It is then translated into a newline. Thus, the pattern core/pc$ is equivalent to core/pc(\n|\r\n). In that case, however, end of input stream is not implied.
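The begin-of-line and end-of-line conditions described for ^R and R$ can be sketched as two small predicates over a byte buffer. This is a hedged illustration, not Quex's generated code: the function names and the dos_newline parameter are hypothetical, the latter mimicking the effect of the --no-DOS option.

```python
def at_begin_of_line(data: bytes, pos: int) -> bool:
    """True at the start of the stream or right after a single 0x0A byte."""
    # Only 0x0A is checked backwards, regardless of the platform's newline.
    return pos == 0 or data[pos - 1] == 0x0A


def at_end_of_line(data: bytes, pos: int, dos_newline: bool = True) -> bool:
    """True at end of stream, before 0x0A, or (optionally) before 0x0D 0x0A."""
    if pos == len(data):
        return True                      # end of stream is implied by R$
    if data[pos] == 0x0A:
        return True                      # Unix newline
    # DOS newline; dos_newline=False is the hypothetical analogue of --no-DOS.
    return dos_newline and data[pos:pos + 2] == b"\r\n"


data = b"ab\r\ncd\nef"
print(at_begin_of_line(data, 4), at_end_of_line(data, 2))
```

With dos_newline=False, position 2 in the buffer above no longer counts as end of line, because the byte that follows is 0x0D rather than 0x0A.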

General context-dependencies consider lexatom patterns before and after the focus pattern. The syntactic means to express them are slashes ‘/’. If an expression contains a single slash, the slash separates the core pattern from its post-context. If it contains two slashes, then it is subject to pre- and post-context. A pattern which has only a pre-context is specified by two slashes where the post-context is left empty. The list below clarifies these cases.

R/S

matches an R, but only if it is followed by an S. Upon a match, the input position is set right after the end of R. S is the post-context of R. For example:

[a-z]+/[ \t\n]

matches a sequence of lower-case letters, but only if it is followed by space, tabulator or newline. A matching lexeme consists only of letters, not of the whitespace. The next analysis step will start on the first whitespace character.

There is a special circumstance where post-contexts are problematic: the ‘dangerous trailing context’ problem [1]. The DFA Cut/Concatenate arithmetic introduced in sec-cut-contatenate-arithmetic enables a precise definition of this problem and a rational solution: the ‘philosophical cut’.

Q/R/

matches R from the current position, but only if it is preceded by a Q. Practically, this means the analyzer goes backwards in order to determine the condition. Q is the pre-context of R. For example:

[ \t\n]/[a-z]+/

matches a sequence of lower-case letters, but only if it is preceded by space, tabulator, or newline. A matching lexeme consists only of letters, not of the whitespace that preceded it.

Q/R/S

matches R from the current position, but only if the preceding input matches Q and the following input matches S. Q is the pre-context of R and S is its post-context. For example:

[ \t\n]/[a-z]+/[ \t\n]

matches a sequence of letters, but only if before and after it there is whitespace. The matching lexeme consists only of letters.
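The three forms listed above have close analogues in ordinary regular expression engines. The sketch below assumes Python's re module, where a lookahead (?=...) approximates a post-context and a lookbehind (?<=...) approximates a pre-context. It is an analogy only: Python uses a backtracking engine, whereas Quex generates state machines, so the two do not agree on every pattern.

```python
import re

# Assumed correspondence (an approximation, not Quex syntax):
POST_ONLY = re.compile(r"[a-z]+(?=[ \t\n])")                  # ~ [a-z]+/[ \t\n]
PRE_ONLY = re.compile(r"(?<=[ \t\n])[a-z]+")                  # ~ [ \t\n]/[a-z]+/
PRE_AND_POST = re.compile(r"(?<=[ \t\n])[a-z]+(?=[ \t\n])")   # ~ [ \t\n]/[a-z]+/[ \t\n]

text = "one two three"

# Pattern.match(text, pos) tries to match exactly at pos, like one analysis step.
post = POST_ONLY.match(text, 0)      # "one": followed by a space
both = PRE_AND_POST.match(text, 4)   # "two": whitespace on both sides
pre = PRE_ONLY.match(text, 8)        # "three": preceded by a space

# The lexeme never contains the context; the next step starts on the whitespace.
print(post.group(), post.end(), both.group(), pre.group())
```

In each case the reported lexeme consists of letters only, and the end position points at the character where the next analysis step would begin.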

The mechanics of a pre-context rely on walking the input stream backwards with a reversed state machine of the pre-context (figure fig_mechanics_pre_context). A match of this reversed machine indicates that the pre-context is fulfilled. The mechanics of the post-context rely on walking further along the post-context after the core pattern has matched (figure fig_mechanics_post_context). When an acceptance state of the post-context is reached, a match is signaled and the input position is set back to where the core pattern ended and the post-context started.

../figures/mechanics-pre-context.png

Fig. 24 Mechanics of pre-context matching.

../figures/mechanics-post-context.png

Fig. 25 Mechanics of post-context matching.
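The mechanics shown in the two figures can be sketched by hand for the pattern with whitespace as pre- and post-context and lower-case letters as core. This is a simplified illustration under stated assumptions: the function names are hypothetical, the 'reversed state machine' degenerates to a single backward character test because the pre-context is one character long, and real generated analyzers work on state tables rather than Python loops.

```python
WHITESPACE = " \t\n"
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def pre_context_ok(text, pos):
    """Walk backwards from pos: the reversed machine accepts one whitespace."""
    return pos > 0 and text[pos - 1] in WHITESPACE


def match_word(text, pos):
    """Return (lexeme, next_pos) for whitespace/letters/whitespace, or None."""
    if not pre_context_ok(text, pos):
        return None                     # pre-context not fulfilled
    core_end = pos
    while core_end < len(text) and text[core_end] in LETTERS:
        core_end += 1                   # walk forward along the core pattern
    if core_end == pos:
        return None                     # core pattern did not match
    # Walk further along the post-context (one whitespace character here).
    if core_end >= len(text) or text[core_end] not in WHITESPACE:
        return None                     # post-context not fulfilled
    # Acceptance: the input position is set back to where the core ended.
    return text[pos:core_end], core_end


print(match_word("a bc d", 2))
```

The returned position is the end of the core pattern, not the end of the post-context, so the whitespace remains available to the next analysis step.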

Pre- and post-contexts form the outermost syntactic unit of a regular expression. This means that they cannot be logically or-ed. The following specification is dysfunctional:

(A/B)|(C/D) => QUEX_TKN_SOME();   // WRONG!

However, the same functionality can be achieved by splitting the or-ed condition into separate patterns and associating each with the same action, as follows:

A/B  => QUEX_TKN_SOME();          // OK!
C/D  => QUEX_TKN_SOME();          // OK!
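The idea of the split can be pictured as a rule table in which two separate context-dependent patterns map to the same action. The sketch below is a hypothetical Python illustration: the concrete patterns chosen for A, B, C and D and the token name are assumptions, and the first-match loop is a simplification that does not reproduce a lexer's longest-match rule.

```python
import re

# Hypothetical choices: A = "if", B = "(", C = "while", D = "{".
# Each context-dependent pattern is its own rule; both share one action.
RULES = [
    (re.compile(r"if(?=\()"), "SOME"),      # ~ A/B => QUEX_TKN_SOME();
    (re.compile(r"while(?=\{)"), "SOME"),   # ~ C/D => QUEX_TKN_SOME();
]


def first_token(text, pos=0):
    """Return (token, lexeme) of the first rule matching at pos, or None."""
    for pattern, token in RULES:
        m = pattern.match(text, pos)
        if m:
            return token, m.group()
    return None


print(first_token("if("), first_token("while{"), first_token("if{"))
```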

Footnotes