Adapting the internal codec of the generated lexical analyzer engine is done by means of the command line flag --codec. When a codec is specified, the internal engine no longer runs on Unicode, but on the specified codec. Consider the following lexical analyzer definition for Greek text encoded in ISO-8859-7:
define {
CAPITAL [ΆΈΉΊΌΎ-Ϋ]
LOWERCASE [ά-ώ]
NUMBER [0-9][0-9.,]+
}
mode X :
<skip: [ \t\n]>
{
{CAPITAL}{LOWERCASE}+ => QUEX_TKN_WORD(Lexeme);
{NUMBER} => QUEX_TKN_NUMBER(Lexeme);
km2|"%"|{LOWERCASE}+ => QUEX_TKN_UNIT(Lexeme);
}
The resulting state machine in Unicode is shown in Figure Unicode Engine. This engine could be fed with converted file content by means of a converting buffer filler. Alternatively, the engine itself can be converted so that it directly parses ISO8859-7 encoded characters. Specifying on the command line:
> quex (...) --codec iso8859_7 (...)
does the job. Note that no character encoding name needs to be passed to the constructor, because the generated engine itself incorporates the codec. If a character encoding name is specified anyway, it must be ‘0x0’.
quex::MyLexer qlex(&my_stream); // character encoding name = 0x0, by default
Sample lexical analyzer with an internal engine codec ‘Unicode’.
The result is a converted state machine as shown in figure ISO8859-7 Engine. Note that the state machine basically remains the same; only the transition rules have been adapted.
Sample lexical analyzer with an internal engine codec ‘ISO8859-7’ (greek).
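The adaptation of transition rules described above can be pictured as a remapping of code points. The following is only a minimal sketch, not quex's actual implementation: the function name `unicode_to_iso8859_7` is hypothetical, and it covers just ASCII and the Greek letters used in the example definition. It relies on the fact that these Greek letters sit at a constant offset of 0x2D0 between Unicode and ISO8859-7.

```cpp
#include <cstdint>

// Hypothetical helper (not part of quex): map a Unicode code point to its
// ISO8859-7 byte value. Returns -1 for anything this sketch does not handle.
int unicode_to_iso8859_7(std::uint32_t cp) {
    // ASCII is identical in both codecs.
    if (cp <= 0x7F) return static_cast<int>(cp);

    // Greek letters U+0386..U+03CE map to 0xB6..0xFE (offset 0x2D0).
    // U+0387, U+038B, U+038D and U+03A2 have no counterpart at that offset.
    if (cp >= 0x0386 && cp <= 0x03CE &&
        cp != 0x0387 && cp != 0x038B && cp != 0x038D && cp != 0x03A2) {
        return static_cast<int>(cp - 0x02D0);
    }
    return -1;  // not covered by this sketch
}
```

With this mapping, the LOWERCASE range [ά-ώ], which is U+03AC to U+03CE in Unicode, becomes the single byte range 0xDC to 0xFE. The transition still consists of one interval, which is why the structure of the state machine is preserved.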
It is worth mentioning that the command line arguments --codec-info, which provides information about the list of codecs, and --codec-for-language are useful when deciding which codec to choose. Note that when the engine is implemented for a specific codec, there is no ‘coding name’ to be passed to the constructor.
Note
When --codec is used, the command line flag --buffer-element-size (respectively -b) does not stand for the character’s width. Since quex generates an engine that actually understands the codec, this flag specifies the size of the code elements.
For example, UTF8 covers the complete UCS code space (0x0 - 0x10FFFF), so its code points would require at least three bytes. However, the code elements of UTF8 are bytes, and the internal engine triggers on bytes. Thus, for codec utf8, -b 1 must be specified. In the same way, UTF16 covers the whole code space, but its elements consist of two bytes; thus, -b 2 must be specified.
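To illustrate the difference between a code point and a code element, the sketch below counts how many code elements each codec needs for a given code point. The function names are made up for this illustration and are not part of quex; the thresholds follow the standard UTF8 and UTF16 encoding rules.

```cpp
#include <cstdint>

// Number of one-byte code elements that UTF8 needs for a code point.
// The engine sees each of these bytes as a separate trigger (-b 1).
int utf8_code_elements(std::uint32_t cp) {
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of two-byte code elements that UTF16 needs for a code point.
// The engine sees each 16 bit element as a separate trigger (-b 2).
int utf16_code_elements(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

The Greek letter ά (U+03AC), for instance, occupies two code elements in UTF8 but only one in UTF16, while the buffer element size stays at one byte and two bytes respectively.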
The UTF8 codec is different from all previously mentioned codecs in the sense that it encodes Unicode characters in byte sequences of differing lengths. That means that the translation of a lexical analyzer to this codec cannot rely on an adaptation of transition maps; instead, a new state machine must be constructed that triggers on the byte sequences.
Figure UTF8 State Machine shows the state machine that results from a utf8 state split transformation of the state machine displayed in figure Sample lexical analyzer with an internal engine codec ‘Unicode’. Where the previous codec adaptations happened at the transition level, the main structure of the state machine remained in place. Now, however, the state machine undergoes a complete metamorphosis.
Sample lexical analyzer with an internal engine codec ‘UTF8’.
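Why a single Unicode transition turns into several byte-level states can be seen from a small example. The sketch below is not quex's state split algorithm; it only handles code points that UTF8 encodes in two bytes (U+0080 to U+07FF), and the names `ByteSeqRange` and `split_two_byte_range` are invented for this illustration. It splits a code point interval into pieces that share the same lead byte.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One lead byte followed by an interval of trail bytes.
struct ByteSeqRange {
    std::uint8_t lead;
    std::uint8_t trail_first;
    std::uint8_t trail_last;
};

// Split the code point interval [first, last] into byte sequence ranges.
// Assumption of this sketch: both ends lie in the two-byte UTF8 domain.
std::vector<ByteSeqRange> split_two_byte_range(std::uint32_t first,
                                               std::uint32_t last) {
    assert(first >= 0x80 && last <= 0x7FF && first <= last);
    std::vector<ByteSeqRange> result;
    std::uint32_t cp = first;
    while (cp <= last) {
        // Last code point that still shares the lead byte of 'cp'.
        std::uint32_t block_end = cp | 0x3F;
        if (block_end > last) block_end = last;

        ByteSeqRange r;
        r.lead        = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
        r.trail_first = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        r.trail_last  = static_cast<std::uint8_t>(0x80 | (block_end & 0x3F));
        result.push_back(r);

        cp = block_end + 1;
    }
    return result;
}
```

For the LOWERCASE range [ά-ώ] (U+03AC to U+03CE), this yields two pieces: lead byte 0xCE followed by 0xAC to 0xBF, and lead byte 0xCF followed by 0x80 to 0x8E. A single transition in the Unicode engine therefore requires intermediate states in a byte-triggered engine.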
Note
The skipper tag <skip: ...> must be used cautiously with the utf8 codec. Only those ranges can be skipped that do not exceed the Unicode value 0x7F. This is so since any higher value requires a multi-byte sequence and the skipper is optimized for single trigger values.
Skipping traditional whitespace, i.e. [ \t\n], is still no problem. Skipping Unicode whitespace [:\P{White_Space}:] is a problem, since the Unicode property is also carried by characters beyond 0x7F. In general, ranges above 0x7F need to be skipped by means of the ‘null pattern action pair’:
...
{MyIgnoredRange} { /* Do nothing */ }
...
Similar to the UTF8 codec, the UTF16 codec encodes some elements of the Unicode set of code points by two bytes and others by four bytes. To handle this type of codec, quex transforms the Unicode state machine into a state machine that runs on triggers of a maximum range of 65536. The same notes and remarks made about UTF8 remain valid. However, they are less critical, since only those code points beyond 0xFFFF are split into four bytes.
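The split into two or four bytes follows the standard UTF16 surrogate scheme. The sketch below shows it in isolation; the function name `utf16_encode` is chosen for this illustration and is not a quex function.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point (up to 0x10FFFF) as UTF16 code elements.
// Code points up to 0xFFFF need one 16 bit element, all others need two.
std::vector<std::uint16_t> utf16_encode(std::uint32_t cp) {
    std::vector<std::uint16_t> elements;
    if (cp < 0x10000) {
        elements.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 bit value
        // High surrogate carries the upper 10 bits, low surrogate the lower 10.
        elements.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        elements.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    return elements;
}
```

All Greek letters of the example fit into a single element, so their transitions stay untouched. Only a code point such as U+10348, which becomes the pair 0xD800 0xDF48, requires a state split.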
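There is one important point about UTF16 which is not to be neglected: byte order, i.e. little endian or big endian. In order to work properly, the analyzer engine requires the buffer to be filled in the byte order which is understood by the CPU. UTF16 has three variants: UTF16 with a leading byte order mark (BOM) from which the byte order can be detected, UTF16-LE (little endian), and UTF16-BE (big endian).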
The analyzer generated by quex does not know about byte orders. It only knows the codec utf16. The stream needs to be provided in the byte order appropriate for the particular CPU. This may mean that the byte order needs to be reversed during loading. Such a reversal can be requested by passing the information to the constructor.
quex::MyLexer qlex(fh, 0x0, /* ReverseByteOrderF */true);
Such a usage is appropriate if the byte order of the stream is the inverse of the machine's byte order. If, for example, one tries to analyze a UTF16-BE (big endian) stream on an Intel Pentium (tm) machine, which is little endian, then the reverse byte order flag can be passed to the constructor. If a UTF16 stream is expected which specifies the byte order via a byte order mark (BOM), then the first bytes are to be read before the constructor is called, or before a new stream is passed to the analyzer. In any case, the byte order reversion can be observed and adapted with the following member functions.
bool byte_order_reversion();
void byte_order_reversion_set(bool Value);
An engine created for codec utf16 can be used for both little endian and big endian data streams. The above-mentioned flags allow the byte order of the data stream to be synchronized with the byte order of the CPU by means of reversion, if necessary.
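Deciding whether reversion is needed for a stream with a byte order mark can be sketched as follows. This is an illustration under stated assumptions, not code from quex: the names `Utf16Order`, `order_from_bom`, `host_is_little_endian`, `needs_byte_order_reversion` and `reverse_bytes` are all hypothetical, and the caller is assumed to have read the first two bytes of the stream itself.

```cpp
#include <cstdint>

enum class Utf16Order { Unknown, LittleEndian, BigEndian };

// Interpret the first two bytes of a stream as a UTF16 byte order mark.
Utf16Order order_from_bom(unsigned char b0, unsigned char b1) {
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian;
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;
    return Utf16Order::Unknown;  // no BOM present
}

// True if the CPU stores the least significant byte first.
bool host_is_little_endian() {
    const std::uint16_t probe = 0x0001;
    return *reinterpret_cast<const unsigned char*>(&probe) == 0x01;
}

// Reversion is needed when stream and CPU disagree on the byte order.
// Without a BOM this sketch assumes that the stream matches the CPU.
bool needs_byte_order_reversion(Utf16Order stream_order) {
    if (stream_order == Utf16Order::Unknown) return false;
    return (stream_order == Utf16Order::LittleEndian) != host_is_little_endian();
}

// What reversion does to a single 16 bit code element.
std::uint16_t reverse_bytes(std::uint16_t element) {
    return static_cast<std::uint16_t>((element << 8) | (element >> 8));
}
```

The result of `needs_byte_order_reversion(...)` is the kind of value that would be handed to the constructor's reverse byte order flag or to byte_order_reversion_set(...) shown above.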
The command line flag --codec specifies the internal coding of the generated lexical analyzer. This enables lexical analyzers that run fast on codecs different from Unicode or ASCII. However, there are two drawbacks. First, not all possible codecs are supported[#f1]_. Second, once an engine has been created for a particular codec, the codec is fixed and the engine can only run on this codec. Thus, subsequent sections focus on the ‘converter approach’, where the internal engine remains running on Unicode but the buffer filler performs the conversion. It is not as run-time efficient as the internal engine codec, but it is more flexible in case the generated analyzer has to deal with a wide range of codecs.