Elementary Compositions

This section explains the syntax of the three elementary compositions, as the are concatenation, union, and arbitrary repetition. The rationale for these compositions being elementary is given in sec:math-elementary-compositions. For the following, let R and S be two DFAs, respectively the regular expression by which they are described.

Concatenation

The predominant way of composing regular expressions is concatenation. Consequently, concatenations are expressed in the most intuitive way, namely as a sequence.

RS

Concatenation of R and S.

Two regular expressions are concatenated by writing them in sequence. That is H matches H and i matches i so that the concatenated regular expression Hi matches Hi. Concatenation is the bases for matching words. In order to avoid interferences with syntactic operators, concatenated characters can be listed in double quotes. The following shows three example of concatenated characters.:

\r\n             // matching DOS newline
define           // matching keyword 'define'
"is in"          // matches the sequence 'is in'
"key\tvalue"     // matching keyword 'key', tabulator, keyword 'value'

Note

Syntax operators loose their meaning inside quoted string. Instead they represent their Unicode code point. For example, a [ stands for code point 91 (hex. 5B), matches against a [. They do not open or close character set expressions.

The ANSI-C escape characters are available via \a, \b, \f, \n, \r, \t, \v. Further, there is \\ which represents a ‘real’ backslash and \" which represents a double quote.

The Unicode property \N{...} is also available since it results in a single character. However, other operators such as \P{....} result in character sets. They cannot be used inside quoted strings.

Concatenation

The concatenation of R and S matches any lexeme that can be described as a concatenation of a lexeme that matches R and a lexeme that matches S.

Concatenation may be applied to any kind of composition. The example in Fig. 16 shows the concatenation of car and pet. The resulting DFA, therefore matches carpet.

../_images/state-machine-concatenation.svg

Fig. 16 Union composition on the DFAs for car and cat.

Union

Unions express alternative matches. A union is specified by means of a | (i.e. the logical or-operator in many programming languages) between two or more regular expressions.

R|S

Union of R and S.

Union

The union of two DFAs matches the superset of lexemes matched by each of them.

The resulting state machine is actually an NFA, that is it may be in more than one state at the same time. To obtain a DFA this operation must be followed by a powerset construction Rabin:1959:FAD. re-union shows the steps of construction of a DFA that matches the union of car and cat.

../_images/state-machine-union.svg

Fig. 17 Union composition on the DFAs for car and cat.

Repetition

The repeated regular expression R match as long as the input can be described as a concatenation of lexemes of R. The arbitrary repetition matches and arbitrary number of concatenations include nothing at all.

Arbitrary Repetition

The arbitrary repetition of a DFA (Kleene Closure) matches an arbitrary number of concatenation of lexemes matched by the original DFA.

../_images/state-machine-kleene.svg

Fig. 18 Arbitrary repetition of a DFA matching abc.

As with the union operation, the general result is an NFA and is followed by a powerset constuction to obtain a DFA. re-kleene how a DFA matching abc is transformed into a DFA matching the empty lexeme, abc, abcabc, abcabcabc etc. Arbitrary repetition is expressed with the * following the expression to be repeated.

R*

Arbitrary repetition of R.

Grouping

Precedence can be overwritten by explicit grouping of operations using ( and ) brackets.

(R)

Grouping ensuring that what is inside the brackets is evaluated before what surrounds it.