Character Set Expressions

Character set expressions are a tool to combine, filter, or select character ranges conveniently. The result of a character set expression is a set of characters, which can then be used to express that any of its members may occur at a given position in the input stream. The character set expression [:alpha:], for example, matches all characters that are letters, i.e. anything from a to z and A to Z. It belongs to the POSIX bracket expressions, which are explained below. This section also explains how sets can be generated from other sets via the operations union, intersection, difference, and inverse.

POSIX bracket expressions are shortcuts for regular expressions that would otherwise be clumsy to write out. Quex provides these expressions enclosed in [: and :] brackets. They are listed in the table below.

Expression   Meaning                        Related Regular Expression
[:alnum:]    Alphanumeric characters        [A-Za-z0-9]
[:alpha:]    Alphabetic characters          [A-Za-z]
[:blank:]    Space and tab                  [ \t]
[:cntrl:]    Control characters             [\x00-\x1F\x7F]
[:digit:]    Digits                         [0-9]
[:graph:]    Visible characters             [\x21-\x7E]
[:lower:]    Lowercase letters              [a-z]
[:print:]    Visible characters and spaces  [\x20-\x7E]
[:punct:]    Punctuation characters         [!"#$%&'()*+,-./:;?@[\\\]_`{|}~]
[:space:]    White space characters         [ \t\r\n\v\f]
[:upper:]    Uppercase letters              [A-Z]
[:xdigit:]   Hexadecimal digits             [A-Fa-f0-9]

Caution is required if these expressions are used for non-English languages. They cover only the ASCII character set. For more sophisticated property processing, it is advisable to use Unicode property expressions as explained in section <<formal/ucs-properties>>. In particular, it is advisable to use \P{ID_Start}, \P{ID_Continue}, \P{Hex_Digit}, \P{White_Space}, and \G{Nd}.
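The ASCII-only nature of the bracket expressions can be illustrated outside quex. The following Python sketch is not part of quex. It models a few rows of the table as plain sets (the names POSIX_CLASSES and in_class are made up for this illustration) and shows that a non-ASCII letter such as 'é' is not a member of the ASCII model of [:alpha:], even though it is a letter in the Unicode sense.

```python
import string

# Hypothetical ASCII-only models of some POSIX bracket expressions,
# following the "Related Regular Expression" column of the table.
POSIX_CLASSES = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "blank": set(" \t"),
    "digit": set(string.digits),
    "lower": set(string.ascii_lowercase),
    "space": set(" \t\r\n\v\f"),
    "upper": set(string.ascii_uppercase),
    "xdigit": set(string.hexdigits),
}


def in_class(name, ch):
    """Return True if the single character ch belongs to the modelled class."""
    return ch in POSIX_CLASSES[name]


print(in_class("alpha", "q"))   # True: plain ASCII letter
print(in_class("alpha", "é"))   # False: not in the ASCII-only set
print("é".isalpha())            # True: Python's Unicode-aware letter test
```

The last two lines show the gap that the Unicode property expressions are meant to close.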

Note

If encodings other than ASCII are to be used, e.g. UTF-8 or other Unicode character encodings, then the ‘--iconv’ flag or the ‘--icu’ flag must be specified to enable the appropriate converter. See section Character Encodings.

In the same way as patterns, character sets can be defined in a define section and expanded inside the [: ... :] brackets, provided that they are character sets and not complete state machines.

The use of Unicode character sets potentially implies handling many different properties and character sets. For convenience, quex provides operations on character sets to combine and filter them and to create new, adapted ones. The basic operations that quex allows are shown in the following table:

Syntax                       Example
union(A0, A1, ...)           union([a-z], [A-Z]) = [a-zA-Z]
intersection(A0, A1, ...)    intersection([0-9], [4-5]) = [4-5]
difference(A, B0, B1, ...)   difference([0-9], [4-5]) = [0-36-9]
inverse(A0, A1, ...)         inverse([\x40-\x5A]) = [\x00-\x3F\x5B-\U10FFFF]

A union expression creates the union of all sets mentioned inside the brackets. The intersection expression results in the intersection of all sets mentioned. The difference between one set and another can be computed via the difference function. Note that difference(A, B) is not equal to difference(B, A). This function accepts more than one set to be subtracted. In fact, it subtracts the union of all sets mentioned after the first one. This is for the sake of convenience, so that one does not have to build the union first and then subtract it. The inverse function builds the complementary set. That is, the result is the set of characters which are not in the given set but are in the set of the currently considered codec. This function also takes more than one set, so one does not have to build the union first.
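The semantics described above can be modelled with ordinary sets of code points. The following Python sketch is only an illustration and says nothing about how quex implements these operations. The helper char_range and the ASCII universe used by inverse are assumptions made for the example. In quex, the universe is the character set of the codec in use.

```python
# Assumption for this sketch: the "codec" is 7-bit ASCII.
ASCII = set(range(0x80))


def char_range(first, last):
    """Code points of the inclusive range first..last, like [first-last]."""
    return set(range(ord(first), ord(last) + 1))


def union(*sets):
    result = set()
    for s in sets:
        result |= s
    return result


def intersection(first, *rest):
    result = set(first)
    for s in rest:
        result &= s
    return result


def difference(first, *subtracted):
    # Subtracts the union of all sets after the first one.
    return set(first) - union(*subtracted)


def inverse(*sets, universe=ASCII):
    # Complement of the union of all given sets within the universe.
    return set(universe) - union(*sets)


digits = char_range("0", "9")
remaining = difference(digits, char_range("4", "5"))
print("".join(sorted(chr(cp) for cp in remaining)))  # prints 01236789
```

Because difference subtracts the union of its trailing arguments, difference(A, B, C) in this model equals difference(A, union(B, C)), which mirrors the convenience described in the text.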

Note that the difference and intersection operations can be used conveniently to filter different sets. For example,

[: difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter}) :]

results in the set of Greek characters except the digits and the lowercase letters. To allow only the numbers from the Arabic code block, intersection can be used as follows:

[: intersection(\P{Block=Arabic}, \G{Nd}) :]
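To get a feeling for what such an intersection contains, the same filter can be approximated outside quex. The following Python sketch uses the standard unicodedata module. Since that module does not expose the Block property, the range U+0600..U+06FF of the Arabic block is written out by hand. The variable names are made up for this illustration, and the sketch is not a statement about quex's own output.

```python
import unicodedata

# The Unicode block "Arabic" covers U+0600..U+06FF.
ARABIC_BLOCK = range(0x0600, 0x0700)

# Keep only code points whose general category is Nd (decimal digit),
# which mirrors intersection(\P{Block=Arabic}, \G{Nd}).
arabic_digits = [
    chr(cp) for cp in ARABIC_BLOCK
    if unicodedata.category(chr(cp)) == "Nd"
]

print(len(arabic_digits))                      # prints 20
print([f"U+{ord(ch):04X}" for ch in arabic_digits])
```

The result consists of the Arabic-Indic digits U+0660..U+0669 and the Extended Arabic-Indic digits U+06F0..U+06F9.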

The subsequent section elaborates on the concept of Unicode properties. At this point, it is worth mentioning that quex provides a sophisticated query feature, which allows one to determine the result of such set operations and to view the resulting sets. For example, to see the result of the difference operation above, quex can be called the following way:

quex --set-by-expression 'difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter})'

In order to take full advantage of this set arithmetic, the user should become familiar with Unicode properties and quex’s query mode.