# Character Set Expressions

Character set expressions are a tool to combine, filter, and select character
ranges conveniently. The result of a character set expression is a set of
characters. Such a set can then be used to express that any of its members may
occur at a given position of the input stream. The character set expression
`[:alpha:]`, for example, matches all characters that are letters, i.e. anything
from a to z and A to Z. It belongs to the POSIX bracket expressions, which are
explained below. Further, this section explains how sets can be generated from
other sets via the operations *union*, *intersection*, *difference*, and
*inverse*.

POSIX bracket expressions are shortcuts for regular expressions that would
otherwise look somewhat clumsy when written out. Quex provides those
expressions enclosed in `[:` and `:]` brackets. They are specified in the
table below.

Expression | Meaning | Related Regular Expression |
---|---|---|
`[:alnum:]` | Alphanumeric characters | `[A-Za-z0-9]` |
`[:alpha:]` | Alphabetic characters | `[A-Za-z]` |
`[:blank:]` | Space and tab | `[ \t]` |
`[:cntrl:]` | Control characters | `[\x00-\x1F\x7F]` |
`[:digit:]` | Digits | `[0-9]` |
`[:graph:]` | Visible characters | `[\x21-\x7E]` |
`[:lower:]` | Lowercase letters | `[a-z]` |
`[:print:]` | Visible characters and spaces | `[\x20-\x7E]` |
`[:punct:]` | Punctuation characters | `` [!"#$%&'()*+,-./:;?@[\\\]_`{\|}~] `` |
`[:space:]` | White space characters | `[ \t\r\n\v\f]` |
`[:upper:]` | Uppercase letters | `[A-Z]` |
`[:xdigit:]` | Hexadecimal digits | `[A-Fa-f0-9]` |
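
To make the relationships between these classes concrete, the following Python sketch rebuilds a few of them from the ranges in the 'Related Regular Expression' column. It is only an illustration of the table, not of how quex represents character sets internally; the helper `char_range` and the constant names are hypothetical.

```python
def char_range(first, last):
    """Return the set of characters from `first` to `last`, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}


# Hypothetical names; each set mirrors one row of the table above.
DIGIT = char_range("0", "9")            # [:digit:]
LOWER = char_range("a", "z")            # [:lower:]
UPPER = char_range("A", "Z")            # [:upper:]
ALPHA = LOWER | UPPER                   # [:alpha:]
ALNUM = ALPHA | DIGIT                   # [:alnum:]
XDIGIT = DIGIT | char_range("a", "f") | char_range("A", "F")  # [:xdigit:]
GRAPH = char_range("\x21", "\x7e")      # [:graph:]
PRINT = char_range("\x20", "\x7e")      # [:print:]

# [:print:] differs from [:graph:] only by the space character.
print(PRINT - GRAPH == {" "})
# All ranges are ASCII only, so an accented letter is not in [:alpha:].
print("é" in ALPHA)
```

The last check shows why these classes are a poor fit for text outside the ASCII range: a letter such as `é` is not a member of the `[A-Za-z]` set.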

Caution has to be taken if these expressions are used for non-English
languages. They are *solely* concerned with the ASCII character set. For more
sophisticated property processing it is advisable to use Unicode property
expressions as explained in section <<formal/ucs-properties>>. In particular,
it is advisable to use `\P{ID_Start}`, `\P{ID_Continue}`, `\P{Hex_Digit}`,
`\P{White_Space}`, and `\G{Nd}`.

Note

If an encoding other than ASCII is to be used, e.g. UTF-8 or another Unicode character encoding, then the `--iconv` or `--icu` flag must be specified to enable the appropriate converter. See section Character Encodings.

In the same way as patterns, character sets can be defined in a `define`
section and expanded inside the `[:` ... `:]` brackets, provided
that they are character sets and not complete state machines.

The use of Unicode character sets potentially implies the handling of many
different properties and character sets. For convenience, quex provides
*operations on character sets* to combine and filter different character sets
and create new, adapted ones. The basic operations that quex allows are
displayed in the following table:

Syntax | Example |
---|---|
`union(A0, A1, ...)` | `union([a-z], [A-Z]) = [a-zA-Z]` |
`intersection(A0, A1, ...)` | `intersection([0-9], [4-5]) = [4-5]` |
`difference(A, B0, B1, ...)` | `difference([0-9], [4-5]) = [0-36-9]` |
`inverse(A0, A1, ...)` | `inverse([\x40-\x5A]) = [\x00-\x3F\x5B-\U12FFFF]` |

A `union` expression creates the union of all sets mentioned inside
the brackets. The `intersection` expression results in the intersection of
all sets mentioned. The difference between one set and another can be computed
via the `difference` function. Note that `difference(A, B)` is not equal
to `difference(B, A)`. This function accepts more than one set to be
subtracted. In fact, it subtracts the union of all sets mentioned after the
first one. This is for the sake of convenience, so that one does not have to
build the union first and then subtract it. The `inverse` function builds the
complementary set. That is, the result is the set of characters which are not
in the given set but are in the set of the currently considered codec. This
function also takes more than one set, so one does not have to build the union
first.
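
The semantics described above can be sketched with ordinary Python sets. This is a model of the four operations only, not quex code. As an assumption made for brevity, the "currently considered codec" is modeled as the 8-bit range 0x00 to 0xFF, and the helper `cs` is hypothetical.

```python
# Assumption: the codec's full character set is modeled as the 8-bit range
# 0x00-0xFF to keep the example small; a Unicode codec would be far larger.
UNIVERSE = frozenset(range(0x100))


def cs(first, last):
    """Set of code points from character `first` to `last`, inclusive."""
    return frozenset(range(ord(first), ord(last) + 1))


def union(*sets):
    return frozenset().union(*sets)


def intersection(first, *rest):
    return first.intersection(*rest)


def difference(first, *rest):
    # Subtracts the union of every set after the first one.
    return first - union(*rest)


def inverse(*sets):
    # Complement of the union, relative to the modeled codec.
    return UNIVERSE - union(*sets)


digits = cs("0", "9")
# difference([0-9], [4-5]) leaves the digits 0-3 and 6-9.
print(sorted(chr(c) for c in difference(digits, cs("4", "5"))))
# Swapping the arguments gives a different (here: empty) result.
print(len(difference(cs("4", "5"), digits)))
```

Because `difference` and `inverse` form the union of their trailing arguments themselves, `difference(A, B0, B1)` yields the same set as `difference(A, union(B0, B1))`.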

Note that the `difference` and `intersection` operations can be
used conveniently to filter different sets. For example

```
[: difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter}) :]
```

results in the set of Greek characters except the digits and except the
lowercase letters. To allow only the numbers from the Arabic code block,
`intersection` can be used as follows:

```
[: intersection(\P{Block=Arabic}, \G{Nd}) :]
```
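
As a rough cross-check of what such an intersection contains, the same filter can be approximated in Python with the standard `unicodedata` module. This is a sketch under assumptions, not a substitute for quex's own query mode: the Arabic block is hard-coded as U+0600 to U+06FF because `unicodedata` carries no block information, and `\G{Nd}` is taken to mean the general category `Nd`.

```python
import unicodedata

# Assumption: the Arabic block is hard-coded as U+0600..U+06FF, since the
# unicodedata module does not expose block membership.
ARABIC_BLOCK = {chr(cp) for cp in range(0x0600, 0x0700)}


def is_decimal_digit(ch):
    """True if the character's general category is Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"


# Intersection of the block with the Nd category, sorted by code point.
arabic_digits = sorted(ch for ch in ARABIC_BLOCK if is_decimal_digit(ch))

for ch in arabic_digits:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The listing should contain the Arabic-Indic digits and the Extended Arabic-Indic digits, which are the only decimal digits in that block.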

The subsequent section elaborates on the concept of Unicode properties. At this point, it is worth mentioning that quex provides a query feature that makes it possible to evaluate such set operations and view the resulting sets. For example, to see the result of the first set operation above, quex can be called in the following way:

```
quex --set-by-expression 'difference(\P{Script=Greek}, \G{Nd}, \G{Lowercase_Letter})'
```

To take full advantage of this set arithmetic, the user should become familiar with Unicode properties and quex’s query mode.