Unary Graph Operations¶
Elementary compositions only operated by plugging ε-transitions in the original DFAs and did not process their graphs. This section is dedicated to operations on a single DFAs graph.
Reversible Operations¶
A modification is reversible, if for all possible operands A
it is
possible to find a procedure that re-generates the original A
from its
modification. There are two reversible graph operations, namely reversion
and complement.
- \\R{ RE }
Reversion: Matches any lexeme that can be described as a reversed lexeme matched by
RE
.For example,
\R{World}
is equivalent todlroW
. This feature is useful for definitions of patterns of right-to-left writing systems such as Arabic, Binti and Hebrew. While in left-to-right mode, the words can be written in their original sequence. The reversion operator reorders the characters so that they match in reverse direction. Twofold reversion is equivalent to identity, for any DFAP
it holds:\R{\R{P}} == P
- \\Not{ RE }
Complement: Matches any lexeme not matched by
RE
. The complement of the complement is equivalent to identity, namely for any DFAP
, it holds:\Not{\Not{P}} == P
In both cases, the modifications to the DFAs are non-trivial.
fig:re-modifications-ieee-float
shows a DFA that matches IEEE
floating point numbers. The reversion operation consists:
#1 Swapping the direction of each transition. #2 Connecting all acceptance states via ε-transitions. #3 Transforming the ε-closure of acceptance states into an single initial state. #4 Transforming the initial state into an acceptance state. #5 Power set construction to obtain a DFA.
The result of the reverted IEEE floating point number DFA is displayed in
fig:re-modifications-ieee-float-reverse
. The complement operation is,
actually, implemented as difference between \Universe
and the given
DFA. For that, the intersection operation is required, which is defined
in a later section TODO
. The resulting DFA as seen in
fig:re-modifications-ieee-float-complement
, also, is clearly not
trivially derived from its source.
Irreversible Operations¶
- \\AccFlush{R}
Makes all states in
R
acceptance states. This operation is irreversible, because once all states are acceptance states, all information is lost about what they were before.
- \\AccInv{R}
Makes all non-acceptance states in
R
acceptance states and all acceptance states non-acceptance states. This operation is irreversible because an acceptance state transformed into a non-acceptance state may leave a branch is eliminated, because there is no acceptance at the end. Once a branch is eliminated, all information is lost about what state there was before.
- \\Stem{R}
Generates a DFA with a graph consisting of paths which stop at the first reached acceptance state of
R
. This operation cuts any state beyond the first acceptance, and therefore, is irreversible.
- \\Crown{R}
Generates a DFA with a graph consisting of paths starting from first acceptance states of
R
. This operation cuts any state before the first acceptance, and therefore, is irreversible.
- \\InfDev{R}
Matches the infimum deviation of
R
. This expression accepts when the first character appears that wards offR
from matching. An infimum deviation to not match lexemes which start with lexemes matched byR
. As shown in figure Fig. 22 the infimum deviation of an IEEE floating point number only contains paths right before the first acceptance state. Any sequence beyond would match an IEEE floating point number, even if the tail does not fit completely. As can be seen in Fig. 19, a.
or+
/-
followed by a.
does not constitute a floating point number. However, with the appearance of the first digit[0-9]
, a sequence matches a floating point number. Then, the sequence cannot be part of the infimum deviation.
- \\First{R}
Generates a DFA that consists solely of the first transitions of
R
. For example,\First{123|abc}
is equivalent to[1a]
.
- \\NotFirst{R}
Generates a DFA that consists solely of the complement of the first transitions of
R
. For example,\First{123|abc}
is equivalent to[^1a]
.
Sanity¶
Regular expressions may, actually, describe dysfunctional patterns. A dysfunctional pattern has at least one of the following properties.
It matches the zero-length lexeme. This would make the lexer accept without consuming any further lexatom from the input stream, if no other pattern matches. The lexer would stall in the loop of accepting nothing and not progressing.
There exists a state from where it matches on arbitrary repetitions of any lexatom. If this state is reached the complete input stream would be consumed. Since the matching pattern is longer than any other, all other patterns were outrun and dominated.
The singular DFAs \Nothing
and \Universe
from
sec:re-singular
are the prime example for dysfunctional patterns.
\Nothing
directly implements the first condition. \Universe
implements
the second condition. Notably, \Empty
is not dysfunctional. It does not
accept the empty lexeme and does not consume an arbitrary number of arbitrary
lexatoms. It does what it is expected to do: it never ever matches.
Dysfunctionality can be healed by the \Sanitize
command.
- \\Sanitize{P}
Sanitizes a pattern with regards to two issues. It removes a possible acceptance of the zero-length lexeme. It removes any possible acceptance of of arbitrary repetitions of arbirtrary lexatoms.
The complement of any functional DFA is by definition dysfunctional. To acquire a complement for practical use, it must be sanitized. For this, there exists the following shorthand:
- \\A{ R }
Matches anything not matching
R
but not the zero-length lexeme and not an arbitrary number of repetitions at the end.
Fig. 23 shows the sanitized
complement of an IEEE floating point number. It differs from the complement,
as shown in Fig. 21 only, in
that the initial state does not accept and state 8 does not iterate
on \Any
.
Note
When sanitization is applied, instead of a propper definition, it is
advisable to review the resulting DFA. This can be done within the
define
section with the \plot
command. For example,:
define {
...
RAW_IDENTIFIER {LETTER}{MORE}*
IDENTIFIER \Sanitize{RAW_IDENTIFIER}
\plot (fh) raw.dot {RAW_IDENTIFIER}
\plot (fh) sanitized.dot {IDENTIFIER}
}
creates state machine graphs for the raw and the sanitized implementation
of IDENTIFIER
in the files raw.dot
and sanitized.dot
. With the
graphviz package, the commands:
> dot -Tpdf raw.dot -o raw.pdf
> dot -Tpdf sanitized.dot -o sanitized.pdf