Analyzer Engine Codec

The internal codec of the generated lexical analyzer engine is adapted by means of the command line flag --codec. When a codec is specified, the internal engine no longer runs on Unicode, but on the specified codec. Consider the example of a lexical analyzer definition for some Greek text to be encoded in ISO-8859-7:

define {
    CAPITAL   [ΆΈΉΊΌΎ-Ϋ]   // Greek capital letters
    LOWERCASE [ά-ώ]        // Greek lowercase letters
    NUMBER    [0-9][0-9.,]+
}

mode X :
<skip: [ \t\n]>
{
    {CAPITAL}{LOWERCASE}+  => QUEX_TKN_WORD(Lexeme);
    {NUMBER}               => QUEX_TKN_NUMBER(Lexeme);
    km2|"%"|{LOWERCASE}+   => QUEX_TKN_UNIT(Lexeme);
}

The resulting state machine in Unicode is shown in Figure Unicode Engine. This engine could either be fed by converted file content using a converting buffer filler, or it can be converted so that it directly parses ISO-8859-7 encoded characters. Specifying on the command line:

> quex (...) --codec iso8859_7 (...)

does the job. Note that no character encoding name needs to be passed to the constructor, because the generated engine itself incorporates the codec. If a character encoding name is specified anyway, it must be ‘0x0’.

quex::MyLexer    qlex(&my_stream);  // character encoding name = 0x0, by default

Sample lexical analyzer with an internal engine codec ‘Unicode’.

The result is a converted state machine as shown in figure ISO8859-7 Engine. The state machine basically remains the same; only the transition rules have been adapted. For instance, a transition on the Unicode range [ά-ώ] (U+03AC to U+03CE) becomes a transition on the ISO-8859-7 byte range 0xDC to 0xFE.

Sample lexical analyzer with an internal engine codec ‘ISO8859-7’ (Greek).

It is worth mentioning that the command line arguments --codec-info, which provides information about the list of codecs, and --codec-for-language are useful for making decisions about what codec to choose. Note that in case the engine is implemented for a specific codec, there is no ‘coding name’ to be passed to the constructor.
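For example, the list of supported codecs, respectively the codecs appropriate for a given human language, may be queried as sketched below. Both the bare invocation and the argument ‘Greek’ are illustrative; the exact argument forms are listed in quex’s command line help:

> quex --codec-info
> quex --codec-for-language Greek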

Note

When --codec is used, the command line flag --buffer-element-size (respectively -b) does not stand for the character’s width. When quex generates an engine that actually understands the codec, this flag specifies the size of the codec’s code elements.

For example, UTF8 covers the complete UCS code plane, so its code points would require at least three bytes (0x0 - 0x10FFFF). However, the code elements of UTF8 are bytes and the internal engine triggers on bytes. Thus, for codec utf8, -b 1 must be specified. In the same way, UTF16 covers the whole plane, but its code elements consist of two bytes, so here -b 2 must be specified.
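The corresponding invocations look as follows, with all other arguments elided as before:

> quex (...) --codec utf8  -b 1 (...)
> quex (...) --codec utf16 -b 2 (...)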

The UTF8 Codec

The UTF8 codec is different from all previously mentioned codecs in the sense that it encodes Unicode characters in byte sequences of differing lengths. That means that the translation of a lexical analyzer to this codec cannot rely on an adaptation of transition maps; instead, it must reconstruct a new state machine that triggers on the byte sequences.
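To illustrate what such a reconstruction has to accomplish, consider the Greek letter ‘ά’ (U+03AC): UTF8 encodes it as the two bytes 0xCE 0xAC, so a single Unicode transition on ‘ά’ becomes a chain of two byte transitions. A minimal standalone sketch of the encoding for code points in the two byte range (0x80 to 0x7FF):

#include <cstdio>

int main() {
    unsigned int  code_point = 0x03AC;       // Greek 'ά'
    unsigned char byte[2];
    byte[0] = 0xC0 | (code_point >> 6);      // lead byte:  110xxxxx
    byte[1] = 0x80 | (code_point & 0x3F);    // trail byte: 10xxxxxx
    printf("U+%04X -> 0x%02X 0x%02X\n", code_point,
           (unsigned)byte[0], (unsigned)byte[1]);
    return 0;
}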

Figure UTF8 State Machine shows the state machine that results from a utf8 state split transformation of the state machine displayed in figure Sample lexical analyzer with an internal engine codec ‘Unicode’. While for the codecs discussed previously the adaptations happen on the transition level and the main structure of the state machine remains in place, here the state machine undergoes a complete metamorphosis.

Sample lexical analyzer with an internal engine codec ‘UTF8’.

Note

The skipper tag <skip: ...> can only be used cautiously with the utf8 codec. Only those ranges can be skipped that lie below the Unicode value 0x7F. This is so since any higher value requires a multi-byte sequence and the skipper is optimized for single trigger values.

Skipping traditional whitespace, i.e. [ \t\n], is still no problem. Skipping Unicode whitespace [:\P{White_Space}:] is a problem, since the Unicode property is carried by characters beyond 0x7F. In general, ranges above 0x7F need to be skipped by means of a ‘null pattern action pair’:

... {MyIgnoredRange} { } ...
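Expanded into a complete mode, such a setup might look as follows; the name {MyIgnoredRange} and its definition are purely illustrative:

define {
    MyIgnoredRange   [\X0100-\X2000]    // hypothetical range above 0x7F
}

mode X :
<skip: [ \t\n]>                         // below 0x7F: skipper still applicable
{
    {MyIgnoredRange}  { }               // null pattern action pair: match, do nothing
    ...
}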

The UTF16 Codec

Similar to the UTF8 codec, some elements of the Unicode set of code points are encoded by two bytes, others by four. To handle this type of codec, quex transforms the Unicode state machine into a state machine that runs on triggers of a maximum range of 65536. The same notes and remarks made about UTF8 remain valid. However, they are less critical, since only those code points beyond 0xFFFF are split into four bytes.
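A code point beyond 0xFFFF is split into a lead and a trail surrogate, so the engine sees a chain of two 16-bit trigger values. A minimal standalone sketch of this split (the sample code point is arbitrary):

#include <cstdio>

int main() {
    unsigned int code_point = 0x10330;            // some point beyond 0xFFFF
    unsigned int v     = code_point - 0x10000;
    unsigned int lead  = 0xD800 | (v >> 10);      // lead surrogate:  0xD800-0xDBFF
    unsigned int trail = 0xDC00 | (v & 0x3FF);    // trail surrogate: 0xDC00-0xDFFF
    printf("U+%06X -> 0x%04X 0x%04X\n", code_point, lead, trail);
    return 0;
}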

There is one important point about UTF16 which is not to be neglected: byte order, i.e. little endian or big endian. In order to work properly, the analyzer engine requires the buffer to be filled in the byte order which is understood by the CPU. UTF16 has three variants:

  • UTF16-BE for big endian encoded UTF16 streams.
  • UTF16-LE for little endian encoded UTF16 streams.
  • UTF16 which does not specify the byte order. Instead, a so-called ‘Byte Order Mark’ (BOM) must be prepended to the stream. It consists of two bytes indicating the byte order:
    • 0xFE 0xFF precedes a big endian stream, and
    • 0xFF 0xFE precedes a little endian stream.

The analyzer generated by quex does not know about byte orders; it only knows the codec utf16. The stream needs to be provided in the byte order appropriate for the particular CPU. This may mean that the byte order needs to be reversed during loading. Such a reversion can be activated by passing the information to the constructor.

quex::MyLexer   qlex(fh, 0x0, /* ReverseByteOrderF */true);

Such a usage is appropriate if the stream’s byte order is contrary to the machine’s byte order. If, for example, one tries to analyze a UTF16-BE (big endian) stream on an Intel Pentium (tm) machine, which is little endian, then the reverse byte order flag can be passed to the constructor. If a UTF16 stream is expected which specifies the byte order via a byte order mark (BOM), then the first bytes are to be read before the constructor is called, or before a new stream is passed to the analyzer. In any case, the byte order reversion can be observed and adapted with the following member functions.

bool     byte_order_reversion();
void     byte_order_reversion_set(bool Value);

An engine created for codec utf16 can be used for both little endian and big endian data streams. The aforementioned functions allow the byte order of the data stream to be synchronized with the byte order of the CPU by means of reversion, if necessary.
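The following sketch combines this with a BOM check. It assumes a little endian CPU, the FILE* based constructor shown above, and an analyzer class ‘MyLexer’; the header name is hypothetical and error handling is mostly omitted:

#include <cstdio>
#include "MyLexer"    // hypothetical name of the header generated by quex

int main() {
    FILE* fh = fopen("greek.txt", "rb");
    unsigned char bom[2];
    if (fread(bom, 1, 2, fh) != 2) return 1;   // consume the two BOM bytes

    // 0xFE 0xFF announces a big endian stream; on a little endian CPU
    // such a stream must be reversed during loading.
    bool reverse_f = (bom[0] == 0xFE && bom[1] == 0xFF);

    quex::MyLexer qlex(fh, 0x0);
    qlex.byte_order_reversion_set(reverse_f);
    // ... run the analysis ...
    return 0;
}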

Note

In the Unicode Standard the code points from 0xD800 to 0xDFFF cannot be assigned to any characters. In general, Quex is forgiving if regular expressions do not exclude them. However, when a UTF16-based engine is specified, then Quex deletes these code points automatically from any pattern. This is necessary, because UTF16 requires this numeric range for lead and trail surrogates.

Since the mentioned code points are not assigned to characters, text-oriented applications should not notice a difference. However, for non-textual applications, such as DNA analysis or pattern recognition, this might become an issue. In such cases, the removal of this range must be taken into consideration, or UTF16 had better not be used as the codec.

Summary

The command line flag --codec allows the internal coding of the generated lexical analyzer to be specified. This enables lexical analyzers that run fast on codecs different from Unicode or ASCII. However, there are two drawbacks. First of all, not all possible codecs are supported. Second, once an engine has been created for a particular codec, the codec is fixed and the engine can only run on this codec. Thus, subsequent sections focus on the ‘converter approach’, where the internal engine keeps running on Unicode, but the buffer filler performs the conversion. It is not as run time efficient as an internal engine codec, but it is more flexible in case the generated analyzer has to deal with a wide range of codecs.
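In terms of the constructor call, the difference between the two approaches might look as sketched below; the encoding name "UTF8" is illustrative, and the converter setup is the topic of the subsequent sections:

quex::MyLexer   qlex_codec(fh, 0x0);       // codec engine: no encoding name
quex::MyLexer   qlex_convert(fh, "UTF8");  // converter: engine runs on Unicode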

Warning

At the time of this writing, the line and column counting for codec-based engines may not work properly for patterns whose length can only be determined at run time. This is due to the fact that not all characters are necessarily represented by the same number of code elements, and the dynamic line and column counter counts code elements rather than interpreted characters; that is, it does not know about UTF8, UTF16, etc. Future versions may very well incorporate an advanced line and column counter for codec engines.

In this respect, it is advantageous to use a converter with a Unicode based buffer, rather than the more compact and possibly faster codec based approach.