Pre- and Post-Contexts
Previous sections only discussed context-independent matching. Up to this point, matching behavior could only be changed by mode transitions triggered by a pattern match. However, a mode transition changes the complete matching behavior of a lexer, i.e. it changes the language. This section elaborates on a more concise way of context-sensitive matching based on what directly precedes or follows a pattern, namely pre- and post-contexts.
The context conditions ‘begin of stream’ and ‘end of stream’ can be specified as follows [#f1].
- <<BOS>> R
Defines the pre-context ‘begin-of-stream’. The pattern R only matches at the beginning of the input stream.
- R <<EOS>>
Defines the post-context ‘end-of-stream’. The pattern R only matches at the end of the input stream. This post-context must be considered with care in situations where there might be no explicit end-of-stream condition, such as byte loaders based on socket connections.
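The two stream conditions can be emulated outside of Quex to see what they accept. The following Python sketch is only an analogy and assumes that the whole input is available as one string. The helper names `match_bos` and `match_eos` are hypothetical and not part of Quex.

```python
import re

WORD = re.compile(r"[a-z]+")  # stands in for the pattern R

def match_bos(core, text, pos):
    # Analogue of '<<BOS>> R': R may only match at the very start of the input.
    if pos != 0:
        return None
    m = core.match(text, pos)
    return m.group() if m else None

def match_eos(core, text, pos):
    # Analogue of 'R <<EOS>>': R must reach exactly to the end of the input.
    m = core.fullmatch(text, pos)
    return m.group() if m else None

text = "abc def"
print(match_bos(WORD, text, 0))  # lexeme at the beginning of the stream
print(match_bos(WORD, text, 4))  # None: position 4 is not the beginning
print(match_eos(WORD, text, 4))  # lexeme that ends where the stream ends
```

With a socket-like source, the end of the input is not known in advance, which is why the `match_eos` condition cannot be decided until the stream is closed.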
Notably, there must be white space between <<BOS>> and the pattern, as well as between a pattern and the <<EOS>> that follows it. The following constructs express the context conditions for the beginning and the end of a line.
- ^R
Matches a regular expression R, but only at the beginning of a line. This condition holds whenever the scan starts right after a newline character or at the beginning of the character stream (i.e. <<BOS>> is implied). It scans backwards only for a single newline character 0x0A ‘\n’, independent of how the particular operating system encodes the newline.
- R$
Matches a regular expression R, but only at the end of a line or at the end of the input stream (i.e. <<EOS>> is implied). Traditionally, a newline can be coded in two ways: the Unix way with a plain 0x0A ‘\n’, or the DOS way with the sequence 0x0D 0x0A ‘\r\n’. By default, both are considered as post-context. The command line option --no-DOS disables the consideration of DOS newlines.
Note
Quex tolerates the shorthand $ at the end of an explicit post-context. It is then translated into a newline. Thus, the pattern core/pc$ is equivalent to core/pc(\n|\r\n). In that case, however, the end of the input stream is not implied.
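The begin-of-line and end-of-line conditions described above can be written down as two small predicates. This Python sketch assumes the input is held in a single string. The function names and the `dos_newlines` parameter are hypothetical, chosen only to mirror the effect described for --no-DOS.

```python
def at_begin_of_line(text, pos):
    # '^' condition: begin of stream, or the previous character is 0x0A.
    # Only a single '\n' is inspected, regardless of the platform's newline.
    return pos == 0 or text[pos - 1] == "\n"

def at_end_of_line(text, pos, dos_newlines=True):
    # '$' condition: end of stream, or a Unix newline follows ...
    if pos == len(text) or text.startswith("\n", pos):
        return True
    # ... or, unless disabled, a DOS newline follows.
    return dos_newlines and text.startswith("\r\n", pos)

text = "ab\r\ncd\nef"
print(at_begin_of_line(text, 4))                     # 'c' follows a '\n'
print(at_end_of_line(text, 2))                       # '\r\n' follows 'ab'
print(at_end_of_line(text, 2, dos_newlines=False))   # DOS newline ignored
```

A lexer would evaluate `at_begin_of_line` before trying R and `at_end_of_line` at the position where R ended.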
General context dependencies consider lexatom patterns before and after the focus pattern. The syntactic means to express them are slashes ‘/’. If an expression contains a single slash, the slash separates the core pattern from its post-context. If it contains two slashes, the pattern is subject to both a pre- and a post-context. A pattern which has only a pre-context is specified by two slashes where the post-context is left empty. The item list below clarifies these cases.
- R/S
matches an R, but only if it is followed by an S. Upon a match, the input position is set right after where R matched. S is the post-context of R. For example,
[a-z]+/[ \t\n]
matches a sequence of lower-case letters, but only if it is followed by a space, tabulator, or newline. A matching lexeme consists only of letters, not of the whitespace. The next analysis step will start on the first whitespace character.
There is a special circumstance where post-contexts are problematic: the ‘dangerous trailing context’ problem [1]. The DFA Cut/Concatenate arithmetic introduced in sec-cut-contatenate-arithmetic enables a precise definition of this problem and a rational solution: the ‘philosophical cut’.
- Q/R/
matches R from the current position, but only if it is preceded by a Q. Practically, this means the analyzer goes backwards in order to determine the condition. Q is the pre-context of R. For example,
[ \t\n]/[a-z]+/
matches a sequence of lower-case letters, but only if it is preceded by a space, tabulator, or newline. A matching lexeme consists only of letters, not of the whitespace that preceded it.
- Q/R/S
matches R from the current position, but only if what precedes it matches a Q and what follows it matches an S. Q is the pre-context of R and S is its post-context. For example,
[ \t\n]/[a-z]+/[ \t\n]
matches a sequence of letters, but only if there is whitespace before and after it. The matching lexeme consists only of letters.
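The three forms in the list above behave much like lookbehind and lookahead assertions in backtracking regular expression engines. The following Python sketch uses that analogy for the example patterns. It is not how Quex implements contexts, and the constant names are hypothetical.

```python
import re

# [a-z]+/[ \t\n]         : core followed by whitespace (post-context only)
POST_ONLY = re.compile(r"[a-z]+(?=[ \t\n])")
# [ \t\n]/[a-z]+/        : core preceded by whitespace (pre-context only)
PRE_ONLY = re.compile(r"(?<=[ \t\n])[a-z]+")
# [ \t\n]/[a-z]+/[ \t\n] : whitespace on both sides
PRE_POST = re.compile(r"(?<=[ \t\n])[a-z]+(?=[ \t\n])")

text = "ab cd ef"
print(POST_ONLY.findall(text))  # words that are followed by whitespace
print(PRE_ONLY.findall(text))   # words that are preceded by whitespace
print(PRE_POST.findall(text))   # words with whitespace on both sides

# The lexeme excludes the context: the match for 'ab' ends at index 2,
# so the next analysis step would start on the space character.
print(POST_ONLY.match(text).end())
```

In each case the context only decides whether the core pattern is accepted. It never becomes part of the lexeme.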
The mechanics of a pre-context rely on walking the input stream backwards with a reversed state machine of the pre-context (figure fig_mechanics_pre_context). A match of this reversed machine indicates that the corresponding pre-context is fulfilled. The mechanics of the post-context rely on walking further along the post-context after the core pattern has matched (figure fig_mechanics_post_context). When an acceptance state of the post-context is reached, a match is signaled and the input position is set back to where the core pattern ended and the post-context started.
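The backward and forward walks can be imitated with a short Python sketch. It is a simplification under several assumptions: Python's `re` engine stands in for the state machines, the caller supplies the pre-context already reversed (for a single character class such as `[ \t\n]` the reversed pattern is identical), and the core pattern is matched greedily before the post-context is checked. The function `match_with_contexts` is hypothetical and not part of Quex.

```python
import re

def match_with_contexts(text, pos, core, pre_reversed=None, post=None):
    """Return (lexeme, next_input_position), or None if nothing matches."""
    # Pre-context: walk backwards from pos using the reversed pre-context.
    if pre_reversed is not None:
        backwards = text[:pos][::-1]
        if re.match(pre_reversed, backwards) is None:
            return None  # pre-context not fulfilled

    # Core pattern: walk forwards from the current position.
    core_match = re.compile(core).match(text, pos)
    if core_match is None:
        return None
    core_end = core_match.end()

    # Post-context: keep walking forwards from where the core ended.
    if post is not None:
        if re.compile(post).match(text, core_end) is None:
            return None  # post-context not fulfilled

    # Set the input position back to where the core pattern ended.
    return text[pos:core_end], core_end

# Q/R/S with Q = S = [ \t\n] and R = [a-z]+, starting at the 'c' in "ab cd ".
print(match_with_contexts("ab cd ", 3, r"[a-z]+",
                          pre_reversed=r"[ \t\n]", post=r"[ \t\n]"))
```

The returned position points at the whitespace after the lexeme, so the post-context is available again to the next analysis step.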
Pre- and post-contexts are the outermost syntactical unit of a regular expression. This means that they cannot be logically or-ed. The following specification is dysfunctional:
(A/B)|(C/D) => QUEX_TKN_SOME(); // WRONG!
However, the same functionality can be achieved by splitting the or-ed condition and associating each part with the same action, as follows:
A/B => QUEX_TKN_SOME(); // OK!
C/D => QUEX_TKN_SOME(); // OK!
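Splitting the alternatives amounts to listing two independent rules that share one action. A minimal Python sketch of that structure follows. The concrete patterns standing in for A/B and C/D, the token name, and the rule table are hypothetical and serve only as an illustration.

```python
import re

# Each rule carries its own post-context (written as a lookahead);
# both rules are associated with the same token.
RULES = [
    (re.compile(r"[0-9]+(?=%)"), "SOME"),  # stands in for A/B
    (re.compile(r"[a-z]+(?=:)"), "SOME"),  # stands in for C/D
]

def next_token(text, pos):
    # Try each rule at pos and return (token, lexeme) of the first match.
    for pattern, token in RULES:
        m = pattern.match(text, pos)
        if m:
            return token, m.group()
    return None

print(next_token("42%", 0))
print(next_token("key:", 0))
```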
Footnotes