Elementary Compositions¶
This section explains the syntax of the three elementary compositions, as the
are concatenation, union, and arbitrary repetition. The rationale for
these compositions being elementary is given in
sec:math-elementary-compositions
. For the following, let R and S
be two DFAs, respectively the regular expression by which they are described.
Concatenation¶
The predominant way of composing regular expressions is concatenation. Consequently, concatenations are expressed in the most intuitive way, namely as a sequence.
- RS
Concatenation of R and S.
Two regular expressions are concatenated by writing them in sequence. That is
H
matches H
and i
matches i
so that the concatenated regular
expression Hi
matches Hi
. Concatenation is the bases for matching
words. In order to avoid interferences with syntactic operators, concatenated
characters can be listed in double quotes. The following shows three example of
concatenated characters.:
\r\n // matching DOS newline
define // matching keyword 'define'
"is in" // matches the sequence 'is in'
"key\tvalue" // matching keyword 'key', tabulator, keyword 'value'
Note
Syntax operators loose their meaning inside quoted string. Instead
they represent their Unicode code point. For example, a [
stands
for code point 91 (hex. 5B), matches against a [
. They do not
open or close character set expressions.
The ANSI-C escape characters are available via \a
, \b
, \f
,
\n
, \r
, \t
, \v
. Further, there is \\
which represents a
‘real’ backslash and \"
which represents a double quote.
The Unicode property \N{...}
is also available since it results in a
single character. However, other operators such as \P{....}
result
in character sets. They cannot be used inside quoted strings.
- Concatenation
The concatenation of R and S matches any lexeme that can be described as a concatenation of a lexeme that matches R and a lexeme that matches S.
Concatenation may be applied to any kind of composition. The example in
Fig. 16 shows the concatenation of car
and pet
. The
resulting DFA, therefore matches carpet
.
Union¶
Unions express alternative matches. A union is specified by means of a |
(i.e. the logical or-operator in many programming languages)
between two or more regular expressions.
- R|S
Union of
R
andS
.
- Union
The union of two DFAs matches the superset of lexemes matched by each of them.
The resulting state machine is actually an NFA, that is it may be in more than
one state at the same time. To obtain a DFA this operation must be followed by
a powerset construction Rabin:1959:FAD. re-union
shows
the steps of construction of a DFA that matches the union of car
and
cat
.
Repetition¶
The repeated regular expression R match as long as the input can be described as a concatenation of lexemes of R. The arbitrary repetition matches and arbitrary number of concatenations include nothing at all.
- Arbitrary Repetition
The arbitrary repetition of a DFA (Kleene Closure) matches an arbitrary number of concatenation of lexemes matched by the original DFA.
As with the union operation, the general result is an NFA and is followed by a
powerset constuction to obtain a DFA. re-kleene
how a DFA matching
abc
is transformed into a DFA matching the empty lexeme, abc
,
abcabc
, abcabcabc
etc. Arbitrary repetition is expressed with the
*
following the expression to be repeated.
- R*
Arbitrary repetition of R.
Grouping¶
Precedence can be overwritten by explicit grouping of operations using
(
and )
brackets.
- (R)
Grouping ensuring that what is inside the brackets is evaluated before what surrounds it.