Auxiliary Operations

\DNA{ RE }

Verifies that the content is a nucleotide sequence and expands shorthands.

The \DNA{} modifier checks that the regular expression given in braces contains only the nucleotide symbols G, A, T, C and the shorthands R, Y, M, K, S, W, H, B, V, D, N, following the nomenclature for incompletely specified bases []. The shorthands are expanded to alternative nucleotides as shown in Table 7.

\RNA{ ... }

Works analogously to \DNA, except that thymine (T) is replaced by uracil (U).

Table 7 Symbols for incompletely specified sequences of DNA/RNA.

======  ================  ==============================
Symbol  Represented Set   Mnemonic
======  ================  ==============================
R       G, A              puRine
Y       T (≙U), C         pYrimidine
W       A, T              Weak hydrogen bond
S       G, C              Strong hydrogen bond
M       A, C              aMino group (NH2 group)
K       G, T (≙U)         Ketone
H       A, C, T (≙U)      All but G (G precedes H)
B       G, T (≙U), C      All but A (A precedes B)
V       G, C, A           All but U (≙T) (U precedes V)
D       G, T (≙U), A      All but C (C precedes D)
N       G, A, T (≙U), C   aNy
======  ================  ==============================
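The expansion of shorthands can be illustrated with a minimal Python sketch. It is not Quex's implementation: the names `IUPAC_DNA` and `expand_shorthands` are hypothetical, and the sketch only handles a plain symbol sequence rather than a full regular expression. It assumes that each shorthand becomes a character class of the nucleotides listed in Table 7.

```python
import re

# Shorthands and their represented sets, taken from Table 7.
# The dictionary name is hypothetical.
IUPAC_DNA = {
    "R": "GA", "Y": "TC", "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GTA", "N": "GATC",
}

def expand_shorthands(sequence, rna=False):
    """Expand shorthands in a plain symbol sequence into character classes.

    With rna=True, thymine (T) is replaced by uracil (U), mirroring the
    difference between the DNA and RNA modifiers.
    """
    thymine = "U" if rna else "T"
    out = []
    for symbol in sequence.upper():
        if symbol in IUPAC_DNA:
            bases = IUPAC_DNA[symbol].replace("T", thymine)
            out.append("[" + bases + "]")
        elif symbol in "GAC" + thymine:
            out.append(symbol)
        else:
            # Verification step: anything else is not a nucleotide symbol.
            raise ValueError(f"not a nucleotide symbol: {symbol!r}")
    return "".join(out)

pattern = expand_shorthands("GAYN")
print(pattern)
print(bool(re.fullmatch(pattern, "GACT")))
```

A sequence containing any other symbol, such as "GAX", raises a ValueError in this sketch, which corresponds to the verification step described above.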

\C{ RE }
\C(flags){ RE }

Case folding for the given regular expression RE.

The \C{} operator generates multiple representations of the same characters. In particular, it supports the definition of patterns that match both upper and lower case. For example:

\C{select}

matches:

"SELECT", "select", "sElEcT", ...

The case folding operation produces a result of the same type as its argument. If the input is a DFA, then the output is a DFA. If the case folding is applied in a character set expression, then its input must be a character set expression, i.e.:

[:\C{[a-z]}:]   // correct!

[a-z] is a character set, and the result is a character set that can be used in a character set expression in [: :] brackets. However, \C{[a-z]+} results in a DFA and cannot be used in character set expressions, i.e.:

[:\C{[a-z]+}:]  // error!

The algorithm for case folding follows Unicode Standard Annex #21 “CASE MAPPINGS”, Section 1.3 []. For example, the character ‘k’ is not only folded to ‘k’ (0x6B) and ‘K’ (0x4B) but also to ‘K’ (0x212A, the Kelvin sign). Additionally, Unicode defines case foldings to multi-character sequences, such as:

ΐ   (0390) --> ι(03B9)̈(0308)́(0301)
ʼn   (0149) --> ʼ(02BC)n(006E)
I   (0049) --> i(0069), İ(0130), ı(0131), i(0069)̇(0307)
ﬀ   (FB00) --> f(0066)f(0066)
ﬃ   (FB03) --> f(0066)f(0066)i(0069)
ﬗ   (FB17) --> մ(0574)խ(056D)
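The multi-character foldings listed above can be inspected with Python's built-in `str.casefold()`, which applies Unicode full case folding. This is only an illustration of the Unicode data, not of how Quex computes its foldings. The helper name `describe_fold` is hypothetical.

```python
# Code points taken from the listing above.
samples = ["\u0390", "\u0149", "\uFB00", "\uFB03", "\uFB17"]

def describe_fold(ch):
    """Return the code points of the full case folding of ch as hex strings."""
    return [f"{ord(c):04X}" for c in ch.casefold()]

# Print each sample in the same "source --> folded sequence" layout.
for ch in samples:
    print(f"{ord(ch):04X} --> {' '.join(describe_fold(ch))}")
```

A folding that yields more than one code point cannot be represented inside a character set, which is why multi-character sequence generation only makes sense for patterns.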

Turkish is a special case: the ‘i’ with and without the dot are distinct letters. That is, a dotless lowercase ‘ı’ is folded to a dotless uppercase ‘I’, and a dotted lowercase ‘i’ is mapped to a dotted uppercase ‘İ’. This mapping, though, is mutually exclusive with the ‘normal’ case folding and is not active by default. Table 8 describes the flags that control the detailed case folding behavior:

Table 8 Flags passed to \C(flags){} case folding.

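====  ==========================================
Flag  Enables
====  ==========================================
s     simple case folding (no multi-characters).
m     multi-character sequence generation.
t     Turkish case folding.
====  ==========================================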

By default, the flags s and m are set for patterns (i.e. \C{R} is equivalent to \C(sm){R}), and only s is set for character sets (i.e. \C{R} is equivalent to \C(s){R}). Characters that lie beyond the scope of the current encoding or the input character byte width are cut out.

Some case mappings may be surprising and trigger unexpected notifications. For example, the case mapping for ‘\C{s}’ consists not only of the letters ‘s’ (0x73) and ‘S’ (0x53) but also of ‘ſ’ (0x17F, Unicode ‘LATIN SMALL LETTER LONG S’). So if ‘\C{s}’ is used while the lexatom size is set to one byte, Quex might warn about the violation of numerical limits.
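To see which case-folded variants fall outside a given lexatom size, the following Python sketch enumerates every single code point whose case folding matches that of a given character. It then filters those that do not fit into the given number of bytes. It relies on Python's `str.casefold()` rather than on Quex itself. The function names and the `lexatom_bytes` parameter are hypothetical.

```python
import sys

def fold_equivalents(ch):
    """All single code points whose case folding equals that of ch."""
    target = ch.casefold()
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if chr(cp).casefold() == target]

def beyond_lexatom(chars, lexatom_bytes=1):
    """Characters whose code point does not fit into lexatom_bytes bytes."""
    limit = 1 << (8 * lexatom_bytes)
    return [c for c in chars if ord(c) >= limit]

variants = fold_equivalents("s")
print([f"0x{ord(c):X}" for c in variants])
print([f"0x{ord(c):X}" for c in beyond_lexatom(variants, lexatom_bytes=1)])
```

For ‘s’, the second list contains only the long s (0x17F), the character that would exceed a one-byte lexatom.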