Code Generation¶
This section lists the command line options to control code generation.
- -i [file name]+¶
The names following -i designate the files containing quex source code to be used as input.
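For example, an invocation along these lines (the input file name 'simple.qx' is purely illustrative) passes a single source file to quex:
> quex -i simple.qx -o Lexer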
Default: empty list
- -o, --analyzer-class [name ::]* name¶
This option defines the name (and possibly name space) of the lexical analyzer class that is to be created. The name space can be specified by means of a sequence separated by '::'. At the same time, this name also determines the file stem of the output files generated by quex. For example, the invocation
> quex ... -o MySpace::MySubSpace::MySubSubSpace::Lexer
specifies that the lexical analyzer class is Lexer and that it is located in the name space MySubSubSpace, which in turn is located in MySubSpace, which is located in MySpace. If no name space is specified, the analyzer is placed in name space quex for C++ and in the root name space for C. If the analyzer shall be placed in the root name space, a '::' must precede the class name. For example, the invocation
> quex ... -o ::Lexer
sets up the lexical analyzer in the root name space, and
> quex ... -o Lexer
generates a lexical analyzer class Lexer in the default name space quex.
Default: Lexer
- --output-directory, --odir directory¶
directory = name of the output directory where generated files are to be written. This does more than merely copying the sources to another place in the file system. It also changes the include file references inside the code to refer to the specified directory as a base.
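For example, a sketch of an invocation (file and directory names are illustrative) that writes the generated sources to a separate directory:
> quex -i simple.qx -o Lexer --odir generated/lexer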
- --file-extension-scheme, --fes scheme¶
Specifies the file stem and extensions of the output files. The provided argument identifies the naming scheme. The possible values for scheme and their results are listed below.
- C++
No extension for header files that contain only declarations, .i for header files containing inline function implementation, .cpp for source files.
- C
.h for header files, .c for source files.
- ++
.h++ for header files, .c++ for source files.
- pp
.hpp for header files, .cpp for source files.
- cc
.hh for header files, .cc for source files.
- xx
.hxx for header files, .cxx for source files.
If the option is not provided, then the naming scheme depends on the --language command line option. For C there is currently no alternative naming scheme supported.
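For example, with the file stem Lexer determined by -o, an invocation such as the following (input file name is illustrative) yields header and source files carrying the .hpp and .cpp extensions of the pp scheme:
> quex -i simple.qx -o Lexer --fes pp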
- --language, -l name¶
Defines the programming language of the output. name can be
- C for plain C code.
- C++ for C++ code.
- dot for plotting information in graphviz format.
Default: C++
- --character-display hex|utf8¶
Specifies how the characters of the state transitions are to be displayed when --language dot is used. hex displays the Unicode code point in hexadecimal notation. If utf8 is specified, the character will be displayed 'as is' in UTF8 notation.
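For example, a sketch of an invocation (input file name is illustrative) that plots the state machine with hexadecimal transition labels:
> quex -i simple.qx --language dot --character-display hex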
Default: utf8
- --normalize¶
If this option is set, the output of '--language dot' will be a normalized state machine. That is, the state numbers will start from zero. If this flag is not set, the state indices are the same as in the generated code.
Default: false (disabled)
- --buffer-based, --bb¶
Generates an analyzer that does not read from an input stream, but runs instead only on a buffer.
Default: false (disabled)
- --version-id string¶
string = arbitrary name of the version that was generated. This string is reported by the version() member function of the lexical analyzer.
Default: 0.0.0-pre-release
- --no-mode-transition-check¶
Turns off the mode transition check and makes the engine a little faster. During development this option should not be used. But the final lexical analyzer should be created with this option set.
Default: true (not disabled)
- --single-mode-analyzer, --sma¶
In case that there is only one mode, this flag can be used to inform quex that it is not intended to refer to the mode at all. In that case no instance of the mode is going to be implemented. This reduces memory consumption a little and may possibly increase performance slightly.
Default: false (disabled)
- --no-string-accumulator, --nsacc¶
Turns the string accumulator option off. This disables the use of the string accumulator to accumulate lexemes. See ‘Accumulator’.
Default: true (not disabled)
- --no-include-stack, --nois¶
Disables the support of include stacks where the state of the lexical analyzer can be saved and restored before diving into included files. Setting this flag may speed up compile time a bit.
Default: true (not disabled)
- --post-categorizer¶
Turns the post categorizer option on. This allows a ‘secondary’ mapping from lexemes to token ids based on their name. See ‘PostCategorizer’.
Default: false (disabled)
- --no-count-columns¶
Lets quex generate an analyzer without internal column counting.
Default: true (not disabled)
- --no-count-lines¶
Lets quex generate an analyzer without internal line counting.
Default: true (not disabled)
If an independent source package is required that can be compiled without an installation of quex, the following option may be used.
- --source-package, --sp¶
Creates all source code that is required to compile the produced lexical analyzer. Only those packages are included which are actually required. Thus, when creating a source package, the same command line 'as usual' must be used, with the added --source-package option.
The string that follows is the directory where the source package is to be located.
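For example, a sketch of an invocation (file and directory names are illustrative):
> quex -i simple.qx -o Lexer --source-package pkg/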
For the support of derivation from the generated lexical analyzer class the following command line options can be used.
- --derived-class, --dc name¶
name = If specified, the name of the derived class that the user intends to provide (see section <<sec-formal-derivation>>). Note, specifying this option signals that the user wants to derive from the generated class. If this is not desired, this option, and the following, have to be left out. The name space of the derived analyzer class is specified analogously to the specification for --analyzer-class, as mentioned above.
- --derived-class-file file name¶
file-name = If specified, the name of the file where the derived class is defined. This option only makes sense in the context of option --derived-class.
- --token-id-prefix prefix¶
prefix = Name prefix to prepend to the name given in the token-id files. For example, if a token section contains the name COMPLEX and the token prefix is TOKEN_PRE_, then the token-id inside the code will be TOKEN_PRE_COMPLEX. The token prefix can contain name space delimiters, i.e. '::'. In the brief token senders the name space specifier can be left out.
Default: QUEX_TKN_
- --token-queue-size number¶
In conjunction with token passing policy 'queue', number specifies the number of tokens in the token queue. This determines the maximum number of tokens that can be sent without returning from the analyzer function.
Default: 64
- --token-queue-safety-border number¶
Specifies the number of tokens that can be sent at maximum as reaction to one single pattern match. More precisely, it determines the number of token slots that are left empty when the token queue is detected to be full.
Default: 16
- --token-id-offset number¶
number = Number where the numeric values for the token ids start to count. Note that this does not include the standard token ids for termination, uninitialized, and indentation error.
Default: 10000
Certain token ids are standard, in the sense that they are required for a functioning lexical analyzer; namely, TERMINATION, UNINITIALIZED, and the indentation error token id. The default values of those do not follow the token id offset, but are 0, 1, and 2. If they need to be different, they must be defined in the token { ... } section, e.g.
token {
TERMINATION = 10001;
UNINITIALIZED = 10002;
...
}
A file with token ids can be provided by the option
- --foreign-token-id-file file name [[begin-str] end-str]¶
file-name = Name of the file that contains an alternative definition of the numerical values for the token-ids. Note that quex does not reflect on actual program code; it extracts the token ids by heuristic. The optional second and third arguments allow restricting the region in the file in which to search for token ids. The search starts from a line that contains begin-str and stops at the first line containing end-str. For example, the invocation
> quex ... --foreign-token-id-file my_token_ids.hpp \
           yytokentype '};' \
           --token-id-prefix Bisonic::token::
reads only the token ids from the yytokentype enum in that code fragment.
Default: empty list
- --foreign-token-id-file-show¶
If this option is specified, then Quex prints out the token ids which have been found in a foreign token id file.
Default: false (disabled)
The following options support the definition of an independently customized token class:
- --token-class-file file name¶
file name = Name of the file that contains the definition of the token class. Note that the setting provided here is possibly overwritten if the token_type section defines a file name explicitly.
- --token-class, --tc [name ::]+ name¶
name is the name of the token class. Using '::' separators it is possible to define the exact name space, as mentioned for the --analyzer-class command line option.
Default: Token
- --token-id-type type name¶
type-name defines the type of the token id. This defines internally the macro QUEX_TYPE_TOKEN_ID. This macro is to be used when a customized token class is defined. The types of Standard C99 'stdint.h' are encouraged.
Default: uint32_t
- --token-class-only, --tco¶
When specified, quex only creates a token class. This token class differs from the normally generated token classes in that it may be shared between multiple lexical analyzers (see Shared Token Classes).
Note
When this option is specified, then the LexemeNull is implemented along with the token class. In this case, all analyzers that use the token class shall define --lexeme-null-object according to the token name space.
Default: false (disabled)
- --lexeme-null-object, --lno variable¶
variable is the name of the LexemeNull object. If the option is not specified, then this object is created along with the analyzer automatically. When using a shared token class, this object must have been created along with the token class. Announcing the name of the lexeme null object prevents quex from generating a lexeme null inside the engine itself.
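As an illustrative sketch (all file and name space names are hypothetical), a shared token class might first be generated stand-alone and then be referenced when generating an analyzer:
> quex -i token.qx --token-class-only --token-class Common::Token
> quex -i lexer.qx --token-class Common::Token --lexeme-null-object Common::LexemeNull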
There may be cases where the character used to indicate the buffer limit needs to be redefined, because the default value appears in a pattern. (For 'normal' ASCII or Unicode based lexical analyzers, this would most probably not be a good design decision. But when other, alien, non-Unicode codings are to be used, this case is conceivable.) The following option allows modification of the buffer limit code:
- --buffer-limit number¶
Numeric value used to mark buffer borders. This should be a number that does not occur as an input character.
Default: 0
On several occasions quex produces code related to 'newline'. The coding of newline has two traditions: the Unix tradition, which codes it plainly as 0x0A, and the DOS tradition, which codes it as 0x0D followed by 0x0A. To be on the safe side, by default quex codes newline as an alternative of both. In case the DOS tradition is not relevant, some performance improvement might be achieved if the '0x0D, 0x0A' sequence is disabled. This can be done by the following flag.
- --no-DOS¶
If specified, the DOS newline (0x0D, 0x0A) is not considered whenever newline is required.
Default: true (not disabled)
For Unicode support it is essential to allow character conversion. Currently quex can interact with GNU IConv and IBM's ICU library. For this, the corresponding library must be installed on your system. On Unix systems, the iconv library is usually present. Relying on IConv or ICU is a flexible solution: the generated analyzer runs on converted content, and the converter can be adapted dynamically.
- --iconv¶
Enable the use of the iconv library for character stream decoding. This is equivalent to defining ‘-DQUEX_OPTION_CONVERTER_ICONV_EXT’ as a compiler flag. Depending on your compiler setup, you might have to set the ‘-liconv’ flag explicitly in order to link against the IConv library.
Default: false (disabled)
- --icu¶
Enable the use of IBM's ICU library for character stream decoding. This is equivalent to defining '-DQUEX_OPTION_CONVERTER_ICU_EXT' as a compiler flag. There are a couple of libraries that are required for ICU. You can query those using the ICU tool 'icu-config'. A command line call to this tool with '--ldflags' delivers all libraries that need to be linked. A typical list is '-lpthread -lm -L/usr/lib -licui18n -licuuc -licudata'.
Default: false (disabled)
Alternatively, the engine can run directly on a specific encoding, i.e. without a conversion to Unicode. This approach is less flexible, but may be faster.
- --encoding encoding name¶
Specifies an encoding for the generated engine. By default, the internal engine runs on Unicode code points. With this option, the analyzer engine is transformed according to the given encoding before code is generated.
Note
When --encoding is specified, the command line flag -b or --buffer-element-size does not represent the number of bytes per character, but the number of bytes per code element. The encoding UTF8, for example, is of dynamic length and its code elements are bytes, thus only -b 1 makes sense. UTF16 triggers on elements of two bytes, while the length of an encoding for a character varies. For UTF16, only -b 2 makes sense.
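For example, a sketch of an invocation for a UTF8 engine (the input file name and the exact spelling of the encoding name are assumptions):
> quex -i simple.qx -o Lexer --encoding utf8 -b 1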
Default: unicode
- --encoding-file file name¶
By means of this option a freely customized encoding can be defined. The file name determines at the same time the file where the encoding mapping is described and the encoding's name. The encoding's name is the directory-stripped and extension-less part of the given argument. Each line of such a file must consist of three numbers that specify 'source interval begin', 'source interval length', and 'target interval begin'. Such a line specifies how a cohesive Unicode character range is mapped to the number range of the customized encoding. For example, the mapping for encoding iso8859-6 looks like the following.
0x000 0xA1 0x00
0x0A4 0x1  0xA4
0x0AD 0x1  0xAD
0x60C 0x1  0xAC
0x61B 0x1  0xBB
0x61F 0x1  0xBF
0x621 0x1A 0xC1
0x640 0x13 0xE0
Here, the Unicode range from 0 to 0xA1 is mapped one to one from Unicode to the encoding. 0xA4 and 0xAD are also the same as in Unicode. The remaining lines describe how Unicode characters from the 0x600-er page are mapped inside the range somewhere from 0xAC to 0xFF.
Note
This option is only to be used if quex does not support the encoding directly. The options --encoding-info and --encoding-for-language help to find out whether quex directly supports a specific encoding. If an --encoding-file is required, it is advisable to use --encoding-info-file file-name.dat to see whether the mapping is in fact as desired.
The buffer on which a generated analyzer runs is characterized by its size (macro QUEX_SETTING_BUFFER_SIZE), by the size of its elements, and by their type. The latter two can be specified on the command line.
In general, a buffer element contains what causes a state transition in the analyzer. In ASCII code, a state transition happens on one byte which contains a character. If converters are used, the internal buffer runs on plain Unicode. Here also, a character occupies a fixed number of bytes. The check mark sign in 4 byte Unicode is coded as 0x00002713. It is treated as one chunk and causes a single state transition.
If the internal engine runs on a specific encoding (--encoding) which is dynamic, e.g. UTF8, then state transitions happen on parts of a character. The check mark sign is coded in the three bytes 0xE2, 0x9C, and 0x93. Each byte is read separately and causes a separate state transition.
- --buffer-element-size, -b, --bes 1|2|4¶
This option specifies the number of bytes that a buffer element occupies.
The size of a buffer element should be large enough so that it can carry the Unicode value of any character of the desired input coding space. When using Unicode, '-b 4' should be used to be safe, unless it is inconceivable that any code point beyond 0xFFFF ever appears; in that case '-b 2' is enough.
When using dynamically sized encodings, this option is better not used. The encodings define their chunks themselves. For example, UTF8 is built upon one byte chunks and UTF16 is built upon chunks of two bytes.
Note
If a character size different from one byte is used, the .get_text() member of the token class contains an array of that particular type. This means that .text().c_str() does not result in a nicely printable UTF8 string. Use the member .utf8_text() instead.
Default: -1
- --buffer-element-type, --bet type name¶
A flexible approach to specify the buffer element size and type is by specifying the name of the buffer element's type, which is the purpose of this option. Note that there are some 'well-known' types such as uint*_t (C99 Standard), u* (Linux Kernel), and unsigned* (OSAL), where the * stands for 8, 16, or 32. Quex can derive their size automatically.
Quex tries to determine the size of the buffer element type. This size is important to determine the target encoding when converters are used. That is, if the size is 4 byte, a different Unicode encoding is used than if it were 2 byte. If quex fails to determine the size of a buffer element from the given name of the buffer element type, then the Unicode encoding must be specified explicitly by '--converter-ucs-coding-name'.
By default, the buffer element type is determined by the buffer element size.
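For instance, a sketch of an invocation (input file name is illustrative) that uses a C99 type whose size quex can derive:
> quex -i simple.qx -o Lexer --buffer-element-type uint16_t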
- --endian little|big|<system>¶
There are two types of byte ordering of integer numbers for different CPUs. For creating a lexical analyzer engine on the same CPU type that quex runs on, this option is not required, since quex finds this out on its own. If you create an engine for a different platform, you must know its byte ordering scheme, i.e. little endian or big endian, and specify it after --endian.
According to the setting of this option, one of the three macros is defined in the header files:
QUEX_OPTION_ENDIAN_SYSTEM
QUEX_OPTION_ENDIAN_LITTLE
QUEX_OPTION_ENDIAN_BIG
Those macros are of primary use for character code converters. The converters need to know what the analyzer engine's number representation is. However, the user might want to use them for his own special purposes (using #ifdef QUEX_OPTION_ENDIAN_BIG ... #endif).
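For example, a sketch of an invocation (input file name is illustrative) that generates an engine for a big endian target platform:
> quex -i simple.qx -o Lexer --endian big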
Default: <system>
The implementation of customized converters is supported by the following options.
- --converter-new, --cn function name¶
With this command line option the user may specify his own converter. The string that follows the option is the name of the converter's _New function. When this option is set, customized user conversion is automatically turned on.
- --converter-ucs-coding-name, --cucn name¶
Determines what string is passed to the converter so that it converts an encoding into Unicode. In general, this is not necessary. But if an unknown user defined type is specified via '--buffer-element-type', then this option must be specified.
By default it is defined based on the buffer element type.
Template and Path Compression can be controlled with the following command line options:
- --template-compression¶
If this option is set, then template compression is activated.
Default: false (disabled)
- --template-compression-min-gain number¶
The number following this option specifies the template compression coefficient. It indicates the relative cost of routing to a target state compared to a simple ‘goto’ statement. The optimal value may vary from processor platform to processor platform, and from compiler to compiler.
Default: 0
- --path-compression¶
This flag activates path compression. By default, it compresses any sequence of states that can be lined up as a 'path'. This includes states with different acceptance values, states that store the input position, etc.
Default: false (disabled)
- --path-compression-uniform¶
This flag also enables path compression. In contrast to the previous flag, it only compresses those states into a path which are uniform. This simplifies the structure of the corresponding pathwalkers. In some cases this might result in smaller code size and faster execution speed.
Default: false (disabled)
- --path-termination number¶
Path compression requires a ‘pathwalker’ to determine quickly the end of a path. For this, each path internally ends with a signal character, the ‘path termination code’. It must be different from the buffer limit code in order to avoid ambiguities.
Modification of the 'path termination code' only makes sense if the input stream to be analyzed contains the default value.
Default: 1
The following options control the comments which are added to the generated code:
- --comment-state-machine¶
With this option set, a comment is generated that shows all state transitions of the analyzer at the beginning of the analyzer function. The format follows the scheme presented in the following example.
/* BEGIN: STATE MACHINE
 ...
 * 02353(A, S) <~ (117, 398, A, S)
 * <no epsilon>
 * 02369(A, S) <~ (394, 1354, A, S), (384, 1329)
 * == '=' ==> 02400
 * <no epsilon>
 ...
 * END: STATE MACHINE */
It means that state 2369 is an acceptance state (flag 'A') and that it should store the input position ('S') if no backtrack elimination is applied. It originates from pattern '394', which is also an acceptance state, and from '384'. It transits to state 2400 on the incidence of a '=' character.
Default: false (disabled)
- --comment-transitions¶
Adds to each transition in a transition map information about the characters which trigger the transition, e.g. in a transition segment implemented in a C switch-case construct:
...
case 0x67: case 0x68: goto _2292;  /* ['g', 'h'] */
case 0x69: goto _2295;             /* 'i' */
case 0x6A: case 0x6B: goto _2292;  /* ['j', 'k'] */
case 0x6C: goto _2302;             /* 'l' */
case 0x6D:
...
The output of the characters happens in UTF8 format.
Default: false (disabled)
- --comment-mode-patterns¶
If this option is set, a comment is printed that shows what patterns are present in a mode and from what mode they are inherited. The comment follows this scheme:
/* BEGIN: MODE PATTERNS
 ...
 * MODE: PROGRAM
 *
 * PATTERN-ACTION PAIRS:
 * (117) ALL: [ ]
 * (119) CALC_OP: "+"|"-"|"*"|"/"
 * (121) PROGRAM: "//"
 ...
 * END: MODE PATTERNS */
This means that there is a mode PROGRAM. The first three patterns are related to the terminal states '117', '119', and '121'. The whitespace pattern of 117 was inherited from mode ALL. The math operator pattern was inherited from mode CALC_OP, and the comment start pattern "//" was implemented in PROGRAM itself.
Default: false (disabled)
The comment output is framed by BEGIN: and END: markers. This facilitates the extraction of this information for further processing. For example, the Unix command 'awk' can be used:
awk 'BEGIN {w=0} /BEGIN:/ {w=1;} // {if(w) print;} /END:/ {w=0;}' MyLexer.c
When using multiple lexical analyzers it can be helpful to get precise information about all related name spaces. Such short reports on the standard output are triggered by the following option.
- --show-name-spaces, --sns¶
If specified, short information about the name spaces of the analyzer and the token is printed on the console.
Default: false (disabled)
Errors and Warnings¶
When the analyzer behaves unexpectedly, it may make sense to ponder over low-priority patterns outrunning high-priority patterns. The following flag supports these considerations.
- --warning-on-outrun, --woo¶
When specified, each mode is investigated as to whether there are patterns of lower priority that potentially outrun patterns of higher priority. This may happen due to the greater length of the lexeme matched by the lower priority pattern.
Default: false (disabled)
Some warnings, notes, or error messages might not be interesting or even be disturbing for the user. For such cases, quex provides an interface to avoid prints on the standard output.
- --suppress, -s [integer]+¶
By this option, errors, warnings, and notes may be suppressed. The option is followed by a list of integers; each integer represents a suppressed message.
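For example, a sketch of an invocation (input file name is illustrative) that silences checks 4 and 5 from the list below:
> quex -i simple.qx -o Lexer --suppress 4 5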
Default: empty list
The following enumerates suppress codes together with their associated messages.
- 0
Warning if quex cannot find an included file while diving into a ‘foreign token id file’.
- 1
A token class file (--token-class-file) may contain a section with extra command line arguments which are reported in a note.
- 2
Error check on dominated patterns, i.e. patterns that may never match due to higher precedence patterns which cover a superset of lexemes.
- 3
Error check on special patterns (skipper, indentation, etc.) whether they are the same.
- 4
Warning or error on ‘outrun’ of special patterns due to lexeme length. Attention: To allow this opens the door to very confusing situations. For example, a comment skipper on “/” may not trigger because a lower precedence pattern matches on “/*” which is longer and therefore wins.
- 5
Detect whether higher precedence patterns match on a subset of lexemes that a special pattern (skipper, indentation, etc.) matches. Attention: Allowing such behavior may cause confusing situations. If this is allowed a pattern may win against a skipper, for example. It is the expectation, though, that a skipper shall skip –which it cannot if such scenarios are allowed.
- 6
Warning if no token queue is used while some functionality might not work properly.
- 7
Warning if token ids are used without being explicitly defined.
- 8
Warning if a token id is mentioned as a ‘repeated token’ but has not been defined.
- 9
Warning if a prefix-less token name starts with the token prefix.
- 10
Warning if there is no 'on_bad_lexatom' handler while an encoding different from Unicode is used.
- 11
Warning if a counter setup is defined without specifying a newline behavior.
- 12
Warning if a counter setup is defined without an \else section.
- 13
If there is a counter setup without newline defined, quex tries to implement a default newline as hexadecimal 0A or 0D.0A.
- 14
Same as 13, except with hexadecimal ‘0D’.
- 15
Warning if a token type has no 'take_text' member function. It means that the token type has no interface to automatically accept a lexeme or an accumulated string.
- 16
Warning if there is a string accumulator while '--suppress 15' has been used.
Queries¶
The preceding command line options influence the procedure of code generation. The options to solely query quex are listed in this section. First of all, the two traditional options for help and version information are:
- --help, -h¶
Reports some help about the usage of quex on the console.
Default: false (disabled)
- --version, -v¶
Prints information on the version of quex.
Default: false (disabled)
The following options allow querying character sets and the results of regular expressions.
- --encoding-info, --ci name¶
Displays the characters that are covered by the encoding of the given name. If the name is omitted, a list of all supported encodings is printed. Engine internal character encoding is discussed in section sec-engine-internal-coding.
- --encoding-info-file, --cif file name¶
Displays the characters that are covered by the encoding provided in the given file. This makes sense in conjunction with --encoding-file, where customized encodings can be defined.
- --encoding-for-language, --cil language¶
Displays the encodings that quex supports for the given human language. If the language argument is omitted, all available languages are listed.
- --property, --pr property¶
Displays information about the specified Unicode property. The property can also be a property alias. If property is not specified, then brief information about all available Unicode properties is displayed.
Default: empty string
- --set-by-property, --sbpr setting¶
Displays the set of characters for the specified Unicode property setting. For queries on binary properties, only the name is required. All other properties require a term of the form name=value.
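For example, a sketch of a query for a non-binary property (the concrete name=value setting is an assumption about supported property names):
> quex --set-by-property Script=Greek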
- --property-match, --prm wildcard-expression¶
Displays property settings that match the given wildcard expression. This helps to find correct identifiers in the large list of Unicode settings. For example, the wildcard expression Name=*LATIN* gives all settings of property Name that contain the string LATIN.
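On the command line, the example from above might be issued as follows (the quotes keep the shell from expanding the wildcards):
> quex --property-match 'Name=*LATIN*'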
- --set-by-expression, --sbe regular expression¶
Displays the resulting character set for the given regular expression. Character set expressions that are usually specified in [: ... :] brackets can be specified as expression. To display state machines, it may be best to use the '--language dot' option mentioned in the previous section.
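For example, a minimal sketch of a query (the expression itself is illustrative):
> quex --set-by-expression '[a-z]'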
- --numeric, --num¶
If this option is specified, the numeric character codes are displayed rather than the UTF8 characters.
Default: false (disabled)
- --intervals, --itv¶
If this option is set, adjacent characters are displayed as intervals. This provides a somewhat more abbreviated display.
Default: false (disabled)
- --names¶
If this option is given, resulting characters are displayed by their (lengthy) Unicode name.
Default: false (disabled)