_sec:syntax-re-elementary-regular-expressions:
Elementary Regular Expressions¶
An elementary regular expression describes an elementary DFA. An elementary DFA consisting of an initial state, an acceptance state and a condition for which the state transits from the former to the latter. Defining a transition condition as lexatom equals value results in a DFA such as on the left side of figure Fig. 13. A condition of the form lexatom is element of set permits one to combine the transitions of the right DFA of Fig. 13 into a single transition [#f1b]. The first subsection introduces syntactic means to express conditions of lexatom equals value. The second part shows how to express conditions of lexatoms is element of set.
Regular expressions describe DFAs in terms of characters and not in terms of lexatoms. The last subsection discusses how elemetary regular expressions may result in non-trivial DFAs when a character cannot be mapped one-to-one into a lexatom.
Single Character Transition DFAs¶
The syntax elements to specify a DFA consisting of a single transition triggered by a single character are the following.
- x (a character in its UTF8 encoding)
matches the character ‘x’, i.e. the code point that it represents. Characters which are syntactic operators–such as
.
,[
, and]
cannot be specified that way.The input encoding of lexical analyzer description files is UTF8. Something referred here as a character must appear in the raw file as the according UTF8 byte sequence.
- \\a \\b \\f \\n \\r \\t \\v \\\\ \\"
ANSI-C escape characters as show in table syntax-re-ansi-escape-characters.
Escape Sequence |
Value (hex) |
Meaning |
---|---|---|
|
0x07 |
Bell |
|
0x08 |
Backspace |
|
0x0C |
Formfeed Page Break |
|
0x0A |
Line Feed |
|
0x0D |
Carriage Return |
|
0x09 |
Horizontal Tab |
|
0x0B |
Vertical Tab |
- \\+ \\* \\? \\/ \\: \\| \\$ \\^ \\- \\. \\[ \\] \\( \\) \\{ \\}
The characters
+
,*
,?
,/
,:
,|
,$
,^
,-
,.
,[
,]
,(
,)
,{
, and}
act as syntactic operators. To represent their original meaning, they must be preceeded by a backslash. For example\\+
represents a+
and does not trigger a syntax operation.
Note
A space character (code point 32) can only be specified in quote or range
boundaries (e.g. as " "
or [ ]
). Outside these boundaries it
is the delimiter to terminates the regular expression.
The backslash does not suppress newline. A pattern must be completely
specified in a single line. Large expression can be avoided by a hierarchic
composition of macros defined in the define
section
Definitions.
- \\U11A0FF
A
\U
followed by six or less hexadecimal digits specify the according numeric value. The expression above specifies 0x11A0FF. A maximum of six hexadecimal digits can be specified. Hexadecimal numbers with less than six digits must either be followed by a non-hex-digit or specified with leading zeroes (i.e. use \U00071F, for hexadecimal 71F). Hexadecimal numbers may can contain be uppercase or lowercase letters from A to F.
- \\X7A27
A
\X
followed by four or less hexadecimal digits specifies the according numeric value. The number format is analogous to U.
- \\x27
A
\x
followed by two or less hexadecimal digits specifies the according numeric value. The number format is analogous to U.
- \\123
the character with octal value 123, a maximum of three digits (from 0 to 7) can follow the backslash. The delimiting rules are analogous to those of U.
- \\0
a NULL character (ASCII/Unicode code point 0). This is to be used with caution. The NULL character is also used a buffer delimiter. Section sec:formal-command-line-options shows how to specify a different value for the buffer limit code.
- \\N{ unicode character name }
the code point of the character with the given Unicode character name. For possible settings of this character see [] on the
Name
property. For example,:\N{MUSICAL SYMBOL G CLEF}
represents the value 0x1D11E representing the musical symbol 𝄞.
Character Set Transition DFAs¶
To this point, only those elementary DFAs have been describe with a transition condition of the form lexatom equals value. This section describes regular expressions matching any lexatom of a set. That is, the according elementary DFA has transition conditions of the form lexatom is element of set. Character sets can be specified in the following ways.
Syntactic operator, namely
.
and\Any
.Definitions bracketed in
[
and]
.Queries such as in
\G{...}
.Operations bracketed in
[:
and:]
.
The .
defines a character set consisting of all characters except newline
(code point 0x0A). This operator is rooted in tradition of lex/flex. Since its
appearance does not suggest what it actually does, it is probably better
avoided [#f0]. The later discussed intersection
operation allows for an
explicit definition of this operation. \Any
matches the complete set of
lexatoms (not lexemes).
Sets can be described in terms of a list of their elements inside a range in square brackets.
- [xyz]
a set of characters consisting of the listed characters. In this example, the a lexatom euqal to
x
,y
, orz
triggers a match. Any elementary single character expression may be used inside the brackets.
- [j-o]
matches a character set given by a range between the code points of two characters left and right to the
-
sign. The character set is equivalent to the list of (integer) code points between the Unicode code points of the first and the last character–both borders included. Sorting orders are therefore irrelevant.j-o
is equivalent tojklmno
.
Lists of characters and ranges may occur together and multiple times inside one
square bracket pair. For example, the expression [A-Za-z0-9_]
matches all
lower and uppercase letters, all numbers and the underscore _
.
- [^ ...]
matches a negated character set. That is, it consists of all characters not mentioned following
^
. For example[^ \t\n]
matches anything but space, tabulator, and newline.
Similar to .
the ^
operator is rooted in the lex/flex tradition. The
later explained complement
operation is a more expressive means to achieve
the same.
Character sets can also be defined as the result of queries. During the
discussion of single character DFAs, the \N{...}
has been introduced
which provided the single character of a given name from the Unicode
database. Further, properties and encodings may be queried by the following
means.
- \\P{ Unicode Property Expression }
the set of characters for which the Unicode Property Expression holds.
Property expressions have the form property-name=value
. For example,:
\P{Name=MUSICAL SYMBOL G CLEF}
delivers a set of characters containing the single element 𝄞. The operation
\N{}
, as mentioned before, is actually a shorthand for quering the property
Name
,
- \\G{ X }
the set of characters withing a specific Unicode General Category. This is a shortcut for
\P{General_Category=X}
.
Possible settings for \P{}
and \G{}
are provided in section
sec-formal-unicode-properties.
- \\E{ Encoding Name }
the subset of Unicode characters which is covered by the given encoding.
Using this is particularly helpful to filter characters from a specific
encoding. \E{iso8859_9}
, for example consists of all characters provided
by ISO8859-9 (8bit encoding for Turkish). None of the query expressions
can be used in quotes, since they result in character sets and not in characters.
Queries are a special kind of character set operation. General character set
operations are bracketted by [:
and :]
.
- [: OP :]
matches a set of characters that result from a character set operation
OP
.
Character set operations query, select, filter or combine character sets.
The character set expression [:alpha:]
, for example, matches
all Latin letters, i.e. all Unicode code points from from a (0x61) to z
(0x7a) and A (0x41) to Z (0x5A). It is part of the set of POSIX bracket
expressions Burns2001real which are listed in
tab:syntax-re-bracket-expressions
.
Expression |
Meaning |
Related Regular Expression |
---|---|---|
|
Alphanumeric characters |
|
|
Alphabetic characters |
|
|
Space and tab |
|
|
Control characters |
|
|
Digits |
|
|
Visible characters |
|
|
Lowercase letters |
|
|
Visible characters and spaces |
|
|
Punctuation characters |
|
|
White space characters |
|
|
Uppercase letters |
|
|
Hexadecimal digits |
|
POSIX bracket expressions are only concerned with the ASCII subset of Unicode
code points. This makes them impractical for a usage beyond the context of
English or Latin. The query expressions \P{}
and \G{}
, in contrast,
provide solutions fit for an international context [3].
Bracket operations can also contain set algebraic expressions
[] in terms of the operations union
, intersection
,
and complement
, as well as the derived operation difference
.
Together with \Empty
as the empty set and \Any
as the universe set
they constitute a set agebraic structure maintaining all fundamental laws
of algebra: Commutativity, Associativity, Distributivity. For intersections
and unions with the empty set and the universe set the rules of Identity
and Complement apply. The set algebraic syntax is shown table
tab:syntax-re-character-set-operations.
Syntax |
Example |
---|---|
|
|
|
|
|
|
|
|
A union
expression generates the union of all sets mentioned inside the
brackets. An intersection
expression results in the intersection of all
sets mentioned. The complement
builds the complementary set of the union of
all mentioned sets. That is, the result is a set of characters which do not
occur in any the given set. The difference between one set and another can be
computed via the difference
function. The difference
determines the
difference operation results in all elements of the first argument minus
the elements of the following arguments. The arguments of those functions
must all be sets, not DFAs. The result of those operations are, of course,
character sets.
At first glance, it seems unnatural to allow argument lists of arbitrary length
as arguments to complement
and difference
. This choice, though, has
been made for the sake of convenience, to spare the user lengthy nested
sequences of two-argument calls to handle more than two sets.
Note
The difference
and intersection
operation can be used conveniently
for filtering. For example
[: difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter}) :]
results in the set of Greek characters except the digits and except the
lowercase letters. To allow only the numbers from the Arabic code block
intersection
can be used as follows:
[: intersection(\P{Block=Arabic}, \G{Nd}) :]
The result of character set expressions is not always easy to foresee. Using Quex’s command line functionality to display the results of regular expressions may provide more insight into the result of an operation. For example, the following command line displays what characters remain if the numbers and lowercase letters are taken out of the set of Greek letters.
> quex --set-by-expression 'difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter})'
The command line query feature is discussed in chapter sec:command-line-queries.
Encoding¶
Elementary regular expressions describe trivial DFAs consisting of an initial state and an acceptance state where a character triggers the transition. Actual generated DFAs, though, trigger on lexatoms. Characters and lexatoms are equivalent as long as the data in the buffer is coded in Unicode (or ASCII). In other words, it holds as long as the input stream is coded in Unicode, or if a lexatom loader is applied that converts the input stream into Unicode.
Quex, however, can generate analyzers that run directly, without conversion, on
a specific encoding. For example, applying the command line argument
--encoding utf8
lets Quex generate a DFA that triggers on byte-wide
lexatoms of UTF8 code units (Option --encoding-list
delivers all available
encodings). The DFA in Fig. 14
shows a DFA corresponding to \G{Zs}
, i.e. the set of whitespace characters without
newline and tabulator. The according character-based DFA consist still of two
states. Applying the encoding UTF8, the resulting DFA contains many more states
and paths of different lengths.
Note
The diagram can be reproduced as a PDF file plot.pdf
applying:
> quex -i plot.qx --encoding utf8
> dot A.dot -Tpdf > plot.pdf
with the file plot.qx
given as
define {
A \G{Zs}
\plot (fh) A.dot {A}
}
So, an elementary regular expression may be translated into a DFA of many states and transitions when the encoding implies that characters are not equivalent to lexatoms. They are still elementary, because, their structure is determined by a single character, or a single character set. DFA structures which are built by means of modification or composition are discussed in the following sections.
Footnotes