_sec:syntax-re-elementary-regular-expressions:

Elementary Regular Expressions¶

An elementary regular expression describes an elementary DFA. An elementary DFA consisting of an initial state, an acceptance state and a condition for which the state transits from the former to the latter. Defining a transition condition as lexatom equals value results in a DFA such as on the left side of figure Fig. 13. A condition of the form lexatom is element of set permits one to combine the transitions of the right DFA of Fig. 13 into a single transition [#f1b]. The first subsection introduces syntactic means to express conditions of lexatom equals value. The second part shows how to express conditions of lexatoms is element of set.

../_images/state-machine-two-states.svg — Fig. 13 Elementary DFAs consisting of an initial state and an acceptance states.¶

Regular expressions describe DFAs in terms of characters and not in terms of lexatoms. The last subsection discusses how elemetary regular expressions may result in non-trivial DFAs when a character cannot be mapped one-to-one into a lexatom.

Single Character Transition DFAs¶

The syntax elements to specify a DFA consisting of a single transition triggered by a single character are the following.

x (a character in its UTF8 encoding)

matches the character ‘x’, i.e. the code point that it represents. Characters which are syntactic operators–such as ., [, and ] cannot be specified that way.

The input encoding of lexical analyzer description files is UTF8. Something referred here as a character must appear in the raw file as the according UTF8 byte sequence.

\\a \\b \\f \\n \\r \\t \\v \\\\ \\": ANSI-C escape characters as show in table syntax-re-ansi-escape-characters.

Table 6 ANSI-C Escape Characters¶
Escape Sequence	Value (hex)	Meaning
`\a`	0x07	Bell
`\b`	0x08	Backspace
`\f`	0x0C	Formfeed Page Break
`\n`	0x0A	Line Feed
`\r`	0x0D	Carriage Return
`\t`	0x09	Horizontal Tab
`\v`	0x0B	Vertical Tab

\\+ \\* \\? \\/ \\: \\| \\$ \\^ \\- \\. \\[ \\] \$ \$ \\{ \\}: The characters +, *, ?, /, :, |, $, ^, -, ., [, ], (, ), {, and } act as syntactic operators. To represent their original meaning, they must be preceeded by a backslash. For example \\+ represents a + and does not trigger a syntax operation.

Note

A space character (code point 32) can only be specified in quote or range boundaries (e.g. as " " or [ ]). Outside these boundaries it is the delimiter to terminates the regular expression.

The backslash does not suppress newline. A pattern must be completely specified in a single line. Large expression can be avoided by a hierarchic composition of macros defined in the define section Definitions.

\\U11A0FF: A \U followed by six or less hexadecimal digits specify the according numeric value. The expression above specifies 0x11A0FF. A maximum of six hexadecimal digits can be specified. Hexadecimal numbers with less than six digits must either be followed by a non-hex-digit or specified with leading zeroes (i.e. use \U00071F, for hexadecimal 71F). Hexadecimal numbers may can contain be uppercase or lowercase letters from A to F.

\\X7A27: A \X followed by four or less hexadecimal digits specifies the according numeric value. The number format is analogous to U.

\\x27: A \x followed by two or less hexadecimal digits specifies the according numeric value. The number format is analogous to U.

\\123: the character with octal value 123, a maximum of three digits (from 0 to 7) can follow the backslash. The delimiting rules are analogous to those of U.

\\0: a NULL character (ASCII/Unicode code point 0). This is to be used with caution. The NULL character is also used a buffer delimiter. Section sec:formal-command-line-options shows how to specify a different value for the buffer limit code.

\\N{ unicode character name }

the code point of the character with the given Unicode character name. For possible settings of this character see [] on the Name property. For example,:

\N{MUSICAL SYMBOL G CLEF}

represents the value 0x1D11E representing the musical symbol 𝄞.

Character Set Transition DFAs¶

To this point, only those elementary DFAs have been describe with a transition condition of the form lexatom equals value. This section describes regular expressions matching any lexatom of a set. That is, the according elementary DFA has transition conditions of the form lexatom is element of set. Character sets can be specified in the following ways.

Syntactic operator, namely . and \Any.

Definitions bracketed in [ and ].

Queries such as in \G{...}.

Operations bracketed in [: and :].

The . defines a character set consisting of all characters except newline (code point 0x0A). This operator is rooted in tradition of lex/flex. Since its appearance does not suggest what it actually does, it is probably better avoided [#f0]. The later discussed intersection operation allows for an explicit definition of this operation. \Any matches the complete set of lexatoms (not lexemes).

Sets can be described in terms of a list of their elements inside a range in square brackets.

[xyz]: a set of characters consisting of the listed characters. In this example, the a lexatom euqal to x, y, or z triggers a match. Any elementary single character expression may be used inside the brackets.

[j-o]: matches a character set given by a range between the code points of two characters left and right to the - sign. The character set is equivalent to the list of (integer) code points between the Unicode code points of the first and the last character–both borders included. Sorting orders are therefore irrelevant. j-o is equivalent to jklmno.

Lists of characters and ranges may occur together and multiple times inside one square bracket pair. For example, the expression [A-Za-z0-9_] matches all lower and uppercase letters, all numbers and the underscore _.

[^ ...]: matches a negated character set. That is, it consists of all characters not mentioned following ^. For example [^ \t\n] matches anything but space, tabulator, and newline.

Similar to . the ^ operator is rooted in the lex/flex tradition. The later explained complement operation is a more expressive means to achieve the same.

Character sets can also be defined as the result of queries. During the discussion of single character DFAs, the \N{...} has been introduced which provided the single character of a given name from the Unicode database. Further, properties and encodings may be queried by the following means.

\\P{ Unicode Property Expression }: the set of characters for which the Unicode Property Expression holds.

Property expressions have the form property-name=value. For example,:

\P{Name=MUSICAL SYMBOL G CLEF}

delivers a set of characters containing the single element 𝄞. The operation \N{}, as mentioned before, is actually a shorthand for quering the property Name,

\\G{ X }: the set of characters withing a specific Unicode General Category. This is a shortcut for \P{General_Category=X}.

Possible settings for \P{} and \G{} are provided in section sec-formal-unicode-properties.

\\E{ Encoding Name }: the subset of Unicode characters which is covered by the given encoding.

Using this is particularly helpful to filter characters from a specific encoding. \E{iso8859_9}, for example consists of all characters provided by ISO8859-9 (8bit encoding for Turkish). None of the query expressions can be used in quotes, since they result in character sets and not in characters.

Queries are a special kind of character set operation. General character set operations are bracketted by [: and :].

[: OP :]: matches a set of characters that result from a character set operation OP.

Character set operations query, select, filter or combine character sets. The character set expression [:alpha:], for example, matches all Latin letters, i.e. all Unicode code points from from a (0x61) to z (0x7a) and A (0x41) to Z (0x5A). It is part of the set of POSIX bracket expressions Burns2001real which are listed in tab:syntax-re-bracket-expressions.

Expression	Meaning	Related Regular Expression
`[:alnum:]`	Alphanumeric characters	`[A-Za-z0-9]`
`[:alpha:]`	Alphabetic characters	`[A-Za-z]`
`[:blank:]`	Space and tab	`[ \t]`
`[:cntrl:]`	Control characters	`[\x00-\x1F\x7F]`
`[:digit:]`	Digits	`[0-9]`
`[:graph:]`	Visible characters	`[\x21-\x7E]`
`[:lower:]`	Lowercase letters	`[a-z]`
`[:print:]`	Visible characters and spaces	`[\x20-\x7E]`
`[:punct:]`	Punctuation characters	[!"#$%&'()*+,-./:;?@[\\\]_`{\|}~]
`[:space:]`	White space characters	`[ \t\r\n\v\f]`
`[:upper:]`	Uppercase letters	`[A-Z]`
`[:xdigit:]`	Hexadecimal digits	`[A-Fa-f0-9]`

POSIX bracket expressions are only concerned with the ASCII subset of Unicode code points. This makes them impractical for a usage beyond the context of English or Latin. The query expressions \P{} and \G{}, in contrast, provide solutions fit for an international context [3].

Bracket operations can also contain set algebraic expressions [] in terms of the operations union, intersection, and complement, as well as the derived operation difference. Together with \Empty as the empty set and \Any as the universe set they constitute a set agebraic structure maintaining all fundamental laws of algebra: Commutativity, Associativity, Distributivity. For intersections and unions with the empty set and the universe set the rules of Identity and Complement apply. The set algebraic syntax is shown table tab:syntax-re-character-set-operations.

Syntax	Example
`union(A0, A1, ...)`	`union([a-z], [A-Z]) = [a-zA-Z]`
`intersection(A0, A1, ...)`	`intersection([0-9], [4-5]) = [4-5]`
`complement(A0, A1, ...)`	`complement([\x40-\5A]) = [\x00-\x3F\x5B-\U12FFFF]`
`difference(A, B0, B1, ...)`	`difference([0-9], [4-5]) = [0-36-9]`

A union expression generates the union of all sets mentioned inside the brackets. An intersection expression results in the intersection of all sets mentioned. The complement builds the complementary set of the union of all mentioned sets. That is, the result is a set of characters which do not occur in any the given set. The difference between one set and another can be computed via the difference function. The difference determines the difference operation results in all elements of the first argument minus the elements of the following arguments. The arguments of those functions must all be sets, not DFAs. The result of those operations are, of course, character sets.

At first glance, it seems unnatural to allow argument lists of arbitrary length as arguments to complement and difference. This choice, though, has been made for the sake of convenience, to spare the user lengthy nested sequences of two-argument calls to handle more than two sets.

Note

The difference and intersection operation can be used conveniently for filtering. For example

[: difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter}) :]

results in the set of Greek characters except the digits and except the lowercase letters. To allow only the numbers from the Arabic code block intersection can be used as follows:

[: intersection(\P{Block=Arabic}, \G{Nd}) :]

The result of character set expressions is not always easy to foresee. Using Quex’s command line functionality to display the results of regular expressions may provide more insight into the result of an operation. For example, the following command line displays what characters remain if the numbers and lowercase letters are taken out of the set of Greek letters.

> quex --set-by-expression 'difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter})'

The command line query feature is discussed in chapter sec:command-line-queries.

Encoding¶

Elementary regular expressions describe trivial DFAs consisting of an initial state and an acceptance state where a character triggers the transition. Actual generated DFAs, though, trigger on lexatoms. Characters and lexatoms are equivalent as long as the data in the buffer is coded in Unicode (or ASCII). In other words, it holds as long as the input stream is coded in Unicode, or if a lexatom loader is applied that converts the input stream into Unicode.

Quex, however, can generate analyzers that run directly, without conversion, on a specific encoding. For example, applying the command line argument --encoding utf8 lets Quex generate a DFA that triggers on byte-wide lexatoms of UTF8 code units (Option --encoding-list delivers all available encodings). The DFA in Fig. 14 shows a DFA corresponding to \G{Zs}, i.e. the set of whitespace characters without newline and tabulator. The according character-based DFA consist still of two states. Applying the encoding UTF8, the resulting DFA contains many more states and paths of different lengths.

Note

The diagram can be reproduced as a PDF file plot.pdf applying:

> quex -i plot.qx --encoding utf8
> dot A.dot -Tpdf > plot.pdf

with the file plot.qx given as

define {
   A  \G{Zs}
   \plot (fh) A.dot {A}
}

../figures/state-machine-general-category-separator-space.svg — Fig. 14 Elementary DFA of general category ‘Separator Space’ encoded in UTF8.¶

So, an elementary regular expression may be translated into a DFA of many states and transitions when the encoding implies that characters are not equivalent to lexatoms. They are still elementary, because, their structure is determined by a single character, or a single character set. DFA structures which are built by means of modification or composition are discussed in the following sections.

Footnotes