Command Line Options

This chapter summarizes all command line options that can be passed to quex (Version 0.65.6), together with their meaning. Most of the options are already explained in separate sections; the present enumeration serves as a quick reference. There are command line options for code generation and for queries. Each family of options is described in a separate section.

Code Generation

This section lists the command line options to control code generation.

-i [file name]+

The names following -i designate the files containing quex source code to be used as input.

Default: empty list

-o, --analyzer-class [name ::]* name

This option defines the name (and possibly name space) of the lexical analyser class that is to be created. The name space can be specified by means of a sequence where names are separated by ::. At the same time, this name also determines the file stem of the output files generated by quex. For example, the invocation

> quex ... -o MySpace::MySubSpace::MySubSubSpace::Lexer

specifies that the lexical analyzer class is Lexer and that it is located in the name space MySubSubSpace, which in turn is located in MySubSpace, which is located in MySpace.

If no name space is specified, the analyzer is placed in name space quex for C++ and the root name space for C. If the analyzer shall be placed in the root name space, a :: must precede the class name. For example, the invocation

> quex ... -o ::Lexer

sets up the lexical analyzer in the root name space and

> quex ... -o Lexer

generates a lexical analyzer class Lexer in default name space quex.

Default: Lexer
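
The name space rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior for C++ output, not quex's own parsing code; the function name split_analyzer_class and the shape of its return value are assumptions made for this example.

```python
def split_analyzer_class(spec, default_name_space=("quex",)):
    """Split an '-o' argument into (name space list, class name).

    Hypothetical helper for illustration; it mirrors the rules described
    in the text for C++ output, not quex's actual implementation.
    """
    parts = spec.split("::")
    class_name = parts[-1]
    if len(parts) == 1:
        # No name space given: the C++ default name space 'quex' applies.
        return list(default_name_space), class_name
    # A leading '::' yields an empty first element, i.e. the root name space.
    name_space = [p for p in parts[:-1] if p]
    return name_space, class_name
```

An empty name space list stands for the root name space, so "::Lexer" and "Lexer" give different results.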

--output-directory, --odir directory

directory = name of the output directory where generated files are to be written. This does more than merely copying the sources to another place in the file system. It also changes the include file references inside the code to refer to the specified directory as a base.

--file-extension-scheme, --fes scheme

Specifies the file stem and extensions of the output files. The provided argument identifies the naming scheme. The possible values for scheme and their results are listed below.

C++
  • No extension for header files that contain only declarations.
  • .i for header files containing inline function implementation.
  • .cpp for source files.
C
  • .h for header files.
  • .c for source files.
++
  • .h++ for header files.
  • .c++ for source files.
pp
  • .hpp for header files.
  • .cpp for source files.
cc
  • .hh for header files.
  • .cc for source files.
xx
  • .hxx for header files.
  • .cxx for source files.

If the option is not provided, then the naming scheme depends on the --language command line option. For C there is currently no different naming scheme supported.
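
For quick lookup, the schemes listed above can be written down as a table. The following Python sketch only transcribes that list; the dictionary layout and the helper source_file_name are assumptions of this example and say nothing about how quex builds file names internally.

```python
# Header and source extensions per scheme, transcribed from the list above.
# For scheme 'C++', plain headers carry no extension and headers with
# inline implementations use '.i'.
EXTENSION_SCHEMES = {
    "C++": {"header": "",     "header_inline": ".i", "source": ".cpp"},
    "C":   {"header": ".h",   "source": ".c"},
    "++":  {"header": ".h++", "source": ".c++"},
    "pp":  {"header": ".hpp", "source": ".cpp"},
    "cc":  {"header": ".hh",  "source": ".cc"},
    "xx":  {"header": ".hxx", "source": ".cxx"},
}


def source_file_name(stem, scheme):
    """Hypothetical helper: append the scheme's source extension to a stem."""
    return stem + EXTENSION_SCHEMES[scheme]["source"]
```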

--language, -l name

Defines the programming language of the output. name can be

  • C for plain C code.
  • C++ for C++ code.
  • dot for plotting information in graphviz format.

Default: C++

--character-display hex|utf8

Specifies how the characters of the state transitions are to be displayed when --language dot is used.

  • hex displays the Unicode code point in hexadecimal notation.
  • utf8 displays the character ‘as is’ in UTF8 notation.

Default: utf8

--normalize

If this option is set, the output of ‘--language dot’ will be a normalized state machine. That is, the state numbers will start from zero. If this flag is not set, the state indices are the same as in the generated code.

Default: false (disabled)

--buffer-based, --bb

Generates an analyzer that does not read from an input stream, but runs instead only on a buffer.

Default: false (disabled)

--version-id string

string = arbitrary name of the version that was generated. This string is reported by the version() member function of the lexical analyser.

Default: 0.0.0-pre-release

--no-mode-transition-check

Turns off the mode transition check and makes the engine a little faster. During development this option should not be used. But the final lexical analyzer should be created with this option set.

Default: true (not disabled)

--single-mode-analyzer, --sma

In case that there is only one mode, this flag can be used to inform quex that it is not intended to refer to the mode at all. In that case no instance of the mode is going to be implemented. This reduces memory consumption a little and may possibly increase performance slightly.

Default: false (disabled)

--no-string-accumulator, --nsacc

Turns the string accumulator option off. This disables the use of the string accumulator to accumulate lexemes.

Default: true (not disabled)

--no-include-stack, --nois

Disables the support of include stacks where the state of the lexical analyzer can be saved and restored before diving into included files. Setting this flag may speed up compile time a bit.

Default: true (not disabled)

--post-categorizer

Turns the post categorizer option on. This allows a ‘secondary’ mapping from lexemes to token ids based on their name. See ‘PostCategorizer’.

Default: false (disabled)

--no-count-columns

Lets quex generate an analyzer without internal column counting.

Default: true (not disabled)

--no-count-lines

Lets quex generate an analyzer without internal line counting.

Default: true (not disabled)

If an independent source package is required that can be compiled without an installation of quex, the following option may be used:

--source-package, --sp directory

Creates all source code that is required to compile the produced lexical analyzer. Only those packages are included which are actually required. Thus, when creating a source package the same command line ‘as usual’ must be used with the added --source-package option.

The directory name following the option specifies the place where the source package is to be located.

For the support of derivation from the generated lexical analyzer class the following command line options can be used.

--derived-class, --dc name

name = If specified, the name of the derived class that the user intends to provide (see section <<sec-formal-derivation>>). Note that specifying this option signals that the user wants to derive from the generated class. If this is not desired, this option and the following one have to be left out. The name space of the derived analyzer class is specified analogously to the specification for --analyzer-class, as mentioned above.

--derived-class-file file name

file-name = If specified, the name of the file where the derived class is defined. This option only makes sense in the context of option --derived-class.

--token-id-prefix prefix

prefix = Name prefix to prepend to the name given in the token-id files. For example, if a token section contains the name COMPLEX and the token-prefix is TOKEN_PRE_ then the token-id inside the code will be TOKEN_PRE_COMPLEX.

The token prefix can contain name space delimiters, i.e. ::. In the brief token senders the name space specifier can be left out.

Default: QUEX_TKN_

--token-policy, --tp single|queue

Determines the policy for passing tokens from the analyzer to the user. It can be either ‘single’ or ‘queue’.

Default: queue

--token-memory-management-by-user, --tmmbu

Enables the token memory management by the user. This command line option is equivalent to the compile option

QUEX_OPTION_USER_MANAGED_TOKEN_MEMORY

It provides the functions token_queue_memory_switch(...) for token policy ‘queue’ and token_p_swap(...) for token policy ‘single’.

Default: false (disabled)

--token-queue-size number

In conjunction with token passing policy ‘queue’, number specifies the number of tokens in the token queue. This determines the maximum number of tokens that can be sent without returning from the analyzer function.

Default: 64

--token-queue-safety-border number

Specifies the number of tokens that can be sent at maximum as reaction to one single pattern match. More precisely, it determines the number of token slots that are left empty when the token queue is detected to be full.

Default: 16
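
The interplay of queue size and safety border can be pictured with a toy model. The Python sketch below is based only on the description above and is not quex's generated code; the class TokenQueueSketch and its methods are hypothetical names.

```python
class TokenQueueSketch:
    """Toy model of a token queue with a safety border (not quex's code)."""

    def __init__(self, size=64, safety_border=16):
        if not 0 <= safety_border < size:
            raise ValueError("safety border must be smaller than the queue size")
        self.size = size
        self.safety_border = safety_border
        self.tokens = []

    def is_full(self):
        # 'Full' is reported while safety_border slots are still free, so a
        # single pattern action can still send that many tokens.
        return len(self.tokens) >= self.size - self.safety_border

    def push(self, token):
        if len(self.tokens) >= self.size:
            raise OverflowError("token queue overflow")
        self.tokens.append(token)


queue = TokenQueueSketch()          # defaults: 64 slots, 16 kept in reserve
while not queue.is_full():
    queue.push("TOKEN")
slots_before_full = len(queue.tokens)   # 48 with the default settings
```

With the defaults, the queue in this model counts as full after 48 tokens, which leaves 16 free slots for the tokens of one further pattern match.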

--token-id-offset number

number = Number where the numeric values for the token ids start to count. Note, that this does not include the standard token ids for termination, uninitialized, and indentation error.

Default: 10000

Certain token ids are standard in the sense that they are required for a functioning lexical analyzer, namely TERMINATION and UNINITIALIZED. Their default values do not follow the token id offset, but are 0 and 1. If they need to be different, they must be defined in the token { ... } section, e.g.

token {
    TERMINATION   = 10001;
    UNINITIALIZED = 10002;
    ...
}
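
How prefix, offset, and the standard ids relate can be sketched as follows. This is a rough Python illustration under stated assumptions: the function assign_token_ids is hypothetical, and whether quex assigns exactly the offset value to the first user-defined token id is assumed here, not taken from the source.

```python
def assign_token_ids(names, prefix="QUEX_TKN_", offset=10000):
    """Hypothetical sketch: number token ids sequentially from 'offset'.

    TERMINATION and UNINITIALIZED keep their standard values 0 and 1.
    Whether quex hands out exactly 'offset' to the first name is an
    assumption of this example.
    """
    ids = {prefix + "TERMINATION": 0, prefix + "UNINITIALIZED": 1}
    next_id = offset
    for name in names:
        ids[prefix + name] = next_id
        next_id += 1
    return ids


ids = assign_token_ids(["COMPLEX", "NUMBER"], prefix="TOKEN_PRE_")
```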

A file with token ids can be provided by the option

--foreign-token-id-file file name [[begin-str] end-str]

file-name = Name of the file that contains an alternative definition of the numerical values for the token-ids.

Note that quex does not parse actual program code. It extracts the token ids heuristically. The optional second and third arguments restrict the region of the file that is searched for token ids. The search starts at a line that contains begin-str and stops at the first line containing end-str. For example

> quex ... --foreign-token-id-file my_token_ids.hpp   \
                                   yytokentype   '};' \
           --token-id-prefix       Bisonic::token::

reads only the token ids found between the line containing yytokentype and the first following line containing };, i.e. the ids of the enum yytokentype.

Default: empty list
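
The region restriction can be pictured with the following Python sketch. It is a simplified stand-in for the heuristic, not quex's implementation: the functions token_id_region and find_token_ids, the sample header text, and the prefix TKN_ are all made up for this example.

```python
import re


def token_id_region(lines, begin_str=None, end_str=None):
    """Return the lines from the begin_str line up to the end_str line."""
    region, inside = [], begin_str is None
    for line in lines:
        if not inside:
            if begin_str not in line:
                continue
            inside = True           # the search starts at this line
        elif end_str is not None and end_str in line:
            break                   # stop at the first line with end_str
        region.append(line)
    return region


def find_token_ids(lines, prefix):
    """Heuristic: collect identifiers that start with the given prefix."""
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w+")
    return [m.group(0) for line in lines for m in pattern.finditer(line)]


header = """\
#define TKN_IGNORED 7
enum yytokentype {
    TKN_NUMBER = 258,
    TKN_IDENTIFIER = 259
};
enum other { TKN_UNRELATED = 1 };
""".splitlines()

found = find_token_ids(token_id_region(header, "yytokentype", "};"), "TKN_")
```

Identifiers outside the region, such as TKN_IGNORED and TKN_UNRELATED in the sample, are not reported.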

--foreign-token-id-file-show

If this option is specified, then Quex prints out the token ids which have been found in a foreign token id file.

Default: false (disabled)

The following options support the definition of an independently customized token class:

--token-class-file file name

file name = Name of file that contains the definition of the token class. The setting provided here is possibly overwritten if the token_type section defines a file name explicitly.

--token-class, --tc [name ::]+ name

name is the name of the token class. Using ‘::’-separators it is possible to define the exact name space as mentioned for the --analyzer-class command line option.

Default: Token

--token-id-type type name

type-name defines the type of the token id. This defines internally the macro QUEX_TYPE_TOKEN_ID. This macro is to be used when a customized token class is defined. The types of Standard C99 ‘stdint.h’ are encouraged.

Default: uint32_t

--token-class-only, --tco

When specified, quex only creates a token class. This token class differs from the normally generated token classes in that it may be shared between multiple lexical analyzers.

Note

When this option is specified, the LexemeNull is implemented along with the token class. In this case all analyzers that use the token class shall define --lexeme-null-object according to the token name space.

Default: false (disabled)

--lexeme-null-object, --lno name [:: name]+

This option specifies the name and name space of the LexemeNull object. If the option is not specified, then this object is created along with the analyzer automatically. When using a shared token class, then this object must have been created along with the token class. Announcing the name of the lexeme null object prevents quex from generating a lexeme null inside the engine itself.

There may be cases where the character used to indicate the buffer limit needs to be redefined, because the default value appears in a pattern. For most codecs, such as ASCII and Unicode, the buffer limit codes do not intersect with code points of valid characters. Theoretically, however, the user may define buffer codecs that require a different definition of the limiting codes. The following option allows modification of the buffer limit code:

--buffer-limit number

Defines the value used to mark buffer borders. This should be a number that does not occur as an input character.

Default: 0

On several occasions quex produces code related to ‘newline’. The coding of newline has two traditions: the Unix tradition, which codes it plainly as 0x0A, and the DOS tradition, which codes it as 0x0D followed by 0x0A. To be on the safe side, quex by default codes newline as an alternative of both. If the DOS tradition is not relevant, some performance improvement might be achieved by disabling the ‘0x0D, 0x0A’ sequence. This can be done with the following flag.

--no-DOS

If specified, the DOS newline (0x0D, 0x0A) is not considered whenever newline is required.

Default: true (not disabled)
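
The difference between the two newline definitions can be demonstrated with ordinary regular expressions. The Python sketch below is only an analogy: quex builds state machines rather than using Python's re module, and the variable names are chosen for this example.

```python
import re

# Default: newline is either the Unix form (0x0A) or the DOS form (0x0D 0x0A).
NEWLINE_DEFAULT = re.compile(r"\r\n|\n")
# With --no-DOS only the Unix form is considered.
NEWLINE_NO_DOS = re.compile(r"\n")

text = "first\r\nsecond\nthird"
default_lines = NEWLINE_DEFAULT.split(text)
no_dos_lines = NEWLINE_NO_DOS.split(text)
```

In this analogy the Unix-only definition leaves the 0x0D at the end of the first line, which is why --no-DOS is only appropriate when DOS line endings do not occur in the input.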

Input codecs other than ASCII or UTF32 (which map 1:1 to Unicode code points) can be used in two ways. Either one uses a converter that converts the file content into Unicode while the engine still runs on Unicode, or the engine itself is adapted to the required codec.

Currently quex-generated lexers can interact with GNU IConv and IBM’s ICU library as input converters. Using one of those requires, of course, that the corresponding library is installed and available. On Unix systems, the iconv library is usually present. ICU most likely has to be installed separately, but is also freely available. Using input converters such as IConv or ICU is a flexible solution. The converter can be adapted dynamically while the internal engine remains running on Unicode.

--iconv

Enable the use of the IConv library for character stream decoding. This is equivalent to defining ‘-DQUEX_OPTION_CONVERTER_ICONV’ as a compiler flag. Depending on the compiler setup the ‘-liconv’ flag must be set explicitly in order to link against the IConv library.

Default: false (disabled)

--icu

Enable the use of IBM’s ICU library for character stream decoding. This is equivalent to defining ‘-DQUEX_OPTION_CONVERTER_ICU’ as a compiler flag. There are a couple of libraries that are required for ICU. You can query those using the ICU tool ‘icu-config’. A command line call to this tool with ‘--ldflags’ delivers all libraries that need to be linked. A typical list is ‘-lpthread -lm -L/usr/lib -licui18n -licuuc -licudata’.

Default: false (disabled)

Alternatively, the engine can run directly on a specific codec, i.e. without a conversion to Unicode. This approach is less flexible, but may be faster.

--codec codec name

Specifies a codec for the generated engine. The codec name specifies the codec of the internal analyzer engine. An engine generated for a specific codec can only analyze input of this particular codec.

Note

When --codec is specified the command line flag -b or --buffer-element-size does not represent the number of bytes per character, but the number of bytes per code element. The codec UTF8, for example, is of dynamic length and its code elements are bytes, thus only -b 1 makes sense. UTF16 triggers on elements of two bytes, while the length of an encoding for a character varies. For UTF16, only -b 2 makes sense.

Default: unicode

--codec-file file name

By means of this option a freely customized codec can be defined. The file name determines both the file where the codec mapping is described and the codec’s name. The codec’s name is the directory-stripped and extension-less part of the given file name. Each line of such a file must consist of three numbers that specify ‘source interval begin’, ‘source interval length’, and ‘target interval begin’. Such a line specifies how a contiguous Unicode character range is mapped to the number range of the customized codec. For example, the mapping for codec iso8859-6 looks like the following.

0x000 0xA1 0x00
0x0A4 0x1  0xA4
0x0AD 0x1  0xAD
0x60C 0x1  0xAC
0x61B 0x1  0xBB
0x61F 0x1  0xBF
0x621 0x1A 0xC1
0x640 0x13 0xE0

Here, the 0xA1 Unicode code points starting at 0 (that is, 0x00 to 0xA0) are mapped one to one from Unicode to the codec. 0xA4 and 0xAD are also the same as in Unicode. The remaining lines describe how Unicode characters from the 0x600 page are mapped into the range from 0xAC to 0xFF.
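
To make the three-column format concrete, the following Python sketch parses the iso8859-6 table and maps single code points. It reads the third column as the first value of the target interval, which is what the data in the table suggests. The function names parse_codec_mapping and to_codec are hypothetical, and the code is not taken from quex.

```python
ISO8859_6 = """\
0x000 0xA1 0x00
0x0A4 0x1  0xA4
0x0AD 0x1  0xAD
0x60C 0x1  0xAC
0x61B 0x1  0xBB
0x61F 0x1  0xBF
0x621 0x1A 0xC1
0x640 0x13 0xE0
"""


def parse_codec_mapping(text):
    """Parse lines of 'source begin, source length, target begin'."""
    table = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        begin, length, target = (int(f, 16) for f in fields)
        table.append((begin, length, target))
    return table


def to_codec(code_point, table):
    """Map a Unicode code point to the codec value, or None if unmapped."""
    for begin, length, target in table:
        if begin <= code_point < begin + length:
            return target + (code_point - begin)
    return None


table = parse_codec_mapping(ISO8859_6)
```

For instance, to_codec(0x621, table) gives 0xC1, while a code point outside all intervals, such as 0x2713, gives None.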

Note

This option is only to be used, if quex does not support the codec directly. The options --codec-info and --codec-for-language help to find out whether Quex directly supports a specific codec. If a --codec-file is required, it is advisable to use --codec-file-info file-name.dat to see if the mapping is in fact as desired.

The buffer on which a generated analyzer runs is characterized by its size (macro QUEX_SETTING_BUFFER_SIZE), by its element’s size, and their type. The latter two can be specified on the command line.

In general, a buffer element contains what causes a state transition in the analyzer. In ASCII code, a state transition happens on one byte which contains a character. If converters are used, the internal buffer runs on plain Unicode. Here also, a character occupies a fixed number of bytes. The check mark (U+2713) in 4 byte Unicode is coded as 0x00002713. It is treated as one chunk and causes a single state transition.

If the internal engine runs on a specific codec (--codec) which is dynamic, e.g. UTF8, then state transitions happen on parts of a character. The check mark sign is coded in three bytes 0xE2, 0x9C, and 0x93. Each byte is read separately and causes a separate state transition.
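
The two representations of the check mark sign (Unicode code point U+2713) can be checked with Python's built-in codecs. This snippet only shows the byte sequences; it does not involve quex, and the big-endian byte order for the 4-byte form is chosen here for readability.

```python
check_mark = "\u2713"   # the check mark sign

# Fixed-size element: one 4-byte unit, shown here in big-endian byte order.
utf32_bytes = check_mark.encode("utf-32-be")
# Dynamic-size codec: three 1-byte units, one state transition per byte.
utf8_bytes = check_mark.encode("utf-8")

print(utf32_bytes.hex())  # 00002713
print(utf8_bytes.hex())   # e29c93
```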

--buffer-element-size, -b, --bes 1|2|4

With this option the number of bytes is specified that a buffer element occupies.

The size of a buffer element should be large enough to carry the Unicode value of any character of the desired input coding space. When using Unicode, ‘-b 4’ should be used to be safe, unless it is inconceivable that any code point beyond 0xFFFF ever appears. In that case ‘-b 2’ is enough.

When using dynamic sized codecs, this option is better not used. The codecs define their chunks themselves. For example, UTF8 is built upon one byte chunks and UTF16 is built upon chunks of two bytes.

Note

If a character size different from one byte is used, the .get_text() member of the token class contains an array of that particular type. This means that .text().c_str() does not result in a nicely printable UTF8 string. Use the member .utf8_text() instead.

Default: -1

--buffer-element-type, --bet type name

A flexible approach to specify the buffer element size and type is by specifying the name of the buffer element’s type, which is the purpose of this option. Note that there are some ‘well-known’ types such as uint*_t (C99 Standard), u* (Linux Kernel), unsigned* (OSAL) where the * stands for 8, 16, or 32. Quex can derive their size automatically.

Quex tries to determine the size of the buffer element type. This size is important to determine the target codec when converters are used. That is, if the size is 4 byte a different Unicode codec is used than if it was 2 byte. If quex fails to determine the size of a buffer element from the given name of the buffer element type, then the Unicode codec must be specified explicitly by ‘--converter-ucs-coding-name’.

By default, the buffer element type is determined by the buffer element size.

--endian little|big|<system>

There are two types of byte ordering for integer numbers, depending on the CPU. When creating a lexical analyzer engine for the same CPU type that quex runs on, this option is not required, since quex determines the byte order on its own. If you create an engine for a different platform, you must know its byte ordering scheme, i.e. little endian or big endian, and specify it after --endian.

According to the setting of this option one of the three macros is defined in the header files:

  • __QUEX_OPTION_SYSTEM_ENDIAN
  • __QUEX_OPTION_LITTLE_ENDIAN
  • __QUEX_OPTION_BIG_ENDIAN

Those macros are of primary use for character code converters. The converters need to know what the analyzer engine’s number representation is. However, users might want to use them for their own special purposes (using #ifdef __QUEX_OPTION_BIG_ENDIAN ... #endif ).

Default: <system>
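
What the two byte orders mean for a 4-byte buffer element can be seen with Python's struct module. The snippet is independent of quex; the value 0x2713 is just a sample code point.

```python
import struct
import sys

value = 0x2713  # a 4-byte buffer element holding one code point

little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first
native = struct.pack("=I", value)   # byte order of the machine running this

print(sys.byteorder, little.hex(), big.hex())
```

On the machine that runs the snippet, native equals either little or big, which corresponds to the <system> setting.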

The implementation of customized converters is supported by the following options.

--converter-new, --cn function name

With the command line option above the user may specify his own converter. The string that follows the option is the name of the converter’s _New function. When this option is set, automatically customized user conversion is turned on.

--converter-ucs-coding-name, --cucn name

Determines what string is passed to the converter so that it converts a codec into Unicode. In general, this is not necessary. But if an unknown user-defined type is specified via ‘--buffer-element-type’, then this option must be specified.

By default it is defined based on the buffer element type.

Template and Path Compression are methods to combine multiple states into one ‘mega state’. The mega state combines in itself the common actions of the states that it represents. The result is a massive reduction in code size. The compression can be controlled with the following command line options:

--template-compression

If this option is set, then template compression is activated.

Default: false (disabled)

--template-compression-uniform

This flag enables template compression. In contrast to the previous flag, it only combines states into a template state if they are uniform. Uniform means that the states do not differ with respect to the actions performed at their entry. In some cases this might result in smaller code size and faster execution speed.

Default: false (disabled)

--template-compression-min-gain number

The number following this option specifies the template compression coefficient. It indicates the relative cost of routing to a target state compared to a simple ‘goto’ statement. The optimal value, with respect to code size and speed, may vary from processor platform to processor platform, and from compiler to compiler.

Default: 0

--path-compression

This flag activates path compression. By default, it compresses any sequence of states that can be lined up as a ‘path’.

Default: false (disabled)

--path-compression-uniform

Same as uniform template compression, only for path compression.

Default: false (disabled)

--path-termination number

Path compression requires a ‘pathwalker’ to determine quickly the end of a path. For this, each path internally ends with a signal character, the ‘path termination code’. It must be different from the buffer limit code in order to avoid ambiguities.

Modification of the ‘path termination code’ only makes sense if the input stream to be analyzed contains the default value.

Default: 1

The following options control the output of comment which is added to the generated code:

--comment-state-machine

With this option set, a comment is generated at the beginning of the analyzer function that shows all state transitions of the analyzer. The format follows the scheme presented in the following example:

/* BEGIN: STATE MACHINE
 ...
 * 02353(A, S) <- (117, 398, A, S)
 *       <no epsilon>
 * 02369(A, S) <- (394, 1354, A, S), (384, 1329)
 *       == '=' ==> 02400
 *       <no epsilon>
 ...
 * END: STATE MACHINE
 */

It means that state 2369 is an acceptance state (flag ‘A’) and it should store the input position (‘S’), if no backtrack elimination is applied. It originates from pattern ‘394’ which is also an acceptance state and ‘384’. It transits to state 2400 on the incidence of a ‘=’ character.

Default: false (disabled)

--comment-transitions

Adds to each transition in a transition map information about the characters which trigger the transition, e.g. in a transition segment implemented in a C-switch case construct

...
case 0x67:
case 0x68: goto _2292;/* ['g', 'h'] */
case 0x69: goto _2295;/* 'i' */
case 0x6A:
case 0x6B: goto _2292;/* ['j', 'k'] */
case 0x6C: goto _2302;/* 'l' */
case 0x6D:
...

The output of the characters happens in UTF8 format.

Default: false (disabled)

--comment-mode-patterns

If this option is set, a comment is printed that shows which patterns are present in a mode and from which mode each is inherited. The comment has the following format:

/* BEGIN: MODE PATTERNS
 ...
 * MODE: PROGRAM
 *
 *     PATTERN-ACTION PAIRS:
 *       (117) ALL:     [\n]
 *       (119) CALC_OP: "+"|"-"|"*"|"/"
 *       (121) PROGRAM: "//"
 ...
 * END: MODE PATTERNS
 */

This means that there is a mode PROGRAM. The first three patterns are related to the terminal states ‘117’, ‘119’, and ‘121’. The white space pattern of 117 was inherited from mode ALL. The math operator pattern was inherited from mode CALC_OP and the comment start pattern “//” was implemented in PROGRAM itself.

Default: false (disabled)

The comment output is framed by BEGIN: and END: markers. These markers facilitate the extraction of the comment information for further processing. For example, the Unix command ‘awk’ can be used to extract what appears in between BEGIN: and END: the following way:

awk 'BEGIN {w=0} /BEGIN:/ {w=1;} // {if(w) print;} /END:/ {w=0;}' MyLexer.c
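
Where awk is not available, the same extraction can be done with a few lines of Python. The sketch below mirrors the awk command: it passes through every line from a BEGIN: marker up to and including the matching END: marker. The function name extract_marked_blocks and the sample lines are made up for this example.

```python
def extract_marked_blocks(lines):
    """Yield the lines from each 'BEGIN:' line through its 'END:' line."""
    inside = False
    for line in lines:
        if "BEGIN:" in line:
            inside = True
        if inside:
            yield line
        if "END:" in line:
            inside = False


source = [
    "int x;",
    "/* BEGIN: STATE MACHINE",
    " * 02353(A, S) <- (117, 398, A, S)",
    " * END: STATE MACHINE",
    " */",
    "int y;",
]
extracted = list(extract_marked_blocks(source))
```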

When using multiple lexical analyzers it can be helpful to get precise information about all related name spaces. Such short reports on the standard output are triggered by the following option.

--show-name-spaces, --sns

If specified, short information about the name spaces of the analyzer and the token is printed on the console.

Default: false (disabled)

Errors and Warnings

When the analyzer behaves unexpectedly, it may make sense to ponder over low-priority patterns outrunning high-priority patterns. The following flag supports these considerations.

--warning-on-outrun, --woo

When specified, each mode is checked for patterns of lower priority that potentially outrun patterns of higher priority. This may happen when the lower priority pattern matches a longer lexeme.

Default: false (disabled)
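
What ‘outrun’ means can be shown with a small longest-match model. The Python sketch below is an illustration using the re module, not quex's analysis; the pattern names KEYWORD_IF and IDENTIFIER and the function longest_match are invented for this example.

```python
import re

# Patterns in priority order (highest first); names are made up for the example.
PATTERNS = [("KEYWORD_IF", re.compile(r"if")),
            ("IDENTIFIER", re.compile(r"[a-z]+"))]


def longest_match(text):
    """Return (name, lexeme) of the longest match; priority breaks ties."""
    best = None
    for name, pattern in PATTERNS:
        m = pattern.match(text)
        # Strict '>' keeps the earlier (higher priority) pattern on equal length.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best
```

On the input "if" the keyword pattern wins by priority. On the input "iffy" the lower priority IDENTIFIER pattern matches four characters and thereby outruns KEYWORD_IF.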

Some warnings, notes, or error messages might not be interesting or even be disturbing. For such cases, quex provides an interface to prevent messages on the standard output.

--suppress, -s [integer]+

By this option, errors, warnings, and notes may be suppressed. The option is followed by a list of integers. Each integer represents a suppressed message.

Default: empty list

The following enumerates suppress codes together with their associated messages.

0

Warning if quex cannot find an included file while diving into a ‘foreign token id file’.

1

A token class file (--token-class-file ) may contain a section with extra command line arguments which are reported in a note.

2

Error check on dominated patterns, i.e. patterns that may never match due to higher precedence patterns which cover a super set of lexemes.

3

Error check on special patterns (skipper, indentation, etc.) whether they are the same.

4

Warning or error on ‘outrun’ of special patterns due to lexeme length. Attention: To allow this opens the door to very confusing situations. For example, a comment skipper on “/” may not trigger because a lower precedence pattern matches on “/*” which is longer and therefore wins.

5

Detect whether higher precedence patterns match on a subset of lexemes that a special pattern (skipper, indentation, etc.) matches. Attention: Allowing such behavior may cause confusing situations. If this is allowed, a pattern may win against a skipper, for example. The expectation, though, is that a skipper shall skip, which it cannot do if such scenarios are allowed.

6

Warning if no token queue is used while some functionality might not work properly.

7

Warning if token ids are used without being explicitly defined.

8

Warning if a token id is mentioned as a ‘repeated token’ but has not been defined.

9

Warning if a prefix-less token name starts with the token prefix.

10

Warning if there is no ‘on_codec_error’ handler while a codec different from Unicode is used.

11

Warning if a counter setup is defined without specifying a newline behavior.

12

Warning if a counter setup is defined without an \else section.

13

Warning if a default newline is used upon missing newline definition in a counter definition section.

14

Same as 13, except with hexadecimal ‘0D’.

15

Warning if a token type has no ‘take_text’ member function. It means, that the token type has no interface to automatically accept a lexeme or an accumulated string.

16

Warning if there is a string accumulator while ‘--suppress 15’ has been used.

Queries

The preceding command line options influence the procedure of code generation. The options that solely query quex are listed in this section. First of all, the two traditional options for help and version information are:

--help, -h

Reports some help about the usage of quex on the console.

Default: false (disabled)

--version, -v

Prints information on the version of quex.

Default: false (disabled)

The following options allow querying character sets and the results of regular expressions.

--codec-info, --ci name

Displays the characters that are covered by the given codec’s name. If the name is omitted, a list of all supported codecs is printed.

--codec-info-file, --cif file name

Displays the characters that are covered by the codec provided in the given file. This makes sense in conjunction with --codec-file where customized codecs can be defined.

--codec-for-language, --cil language

Displays the codecs that quex supports for the given human language. If the language argument is omitted, all available languages are listed.

--property, --pr property

Displays information about the specified Unicode property. The property can also be a property alias. If property is not specified, then brief information about all available Unicode properties is displayed.

Default: empty string

--set-by-property, --sbpr setting

Displays the set of characters for the specified Unicode property setting. For query on binary properties only the name is required. All other properties require a term of the form name=value.

--property-match, --prm wildcard-expression

Displays property settings that match the given wildcard expression. This helps to find correct identifiers in the large list of Unicode settings. For example, the wildcard-expression Name=*LATIN* gives all settings of property Name that contain the string LATIN.

--set-by-expression, --sbe regular expression

Displays the resulting character set for the given regular expression. Larger character set expressions can be specified in [: ... :] brackets.

--numeric, --num

If this option is specified, the numeric character codes are displayed rather than the characters.

Default: false (disabled)

--intervals, --itv

If this option is set, adjacent characters are displayed as intervals, i.e. in terms of begin and end of domains of adjacent character codes. This provides a concise display.

Default: false (disabled)
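
The idea behind the interval display can be sketched in Python as follows. The half-open [begin, end) convention and the function name to_intervals are assumptions of this example; the exact output format of quex may differ.

```python
def to_intervals(code_points):
    """Collapse code points into [begin, end) intervals of adjacent values."""
    intervals = []
    for cp in sorted(set(code_points)):
        if intervals and intervals[-1][1] == cp:
            intervals[-1][1] = cp + 1       # extend the current interval
        else:
            intervals.append([cp, cp + 1])  # start a new interval
    return [tuple(i) for i in intervals]
```

For example, the code points of ‘a’, ‘b’, ‘c’ collapse into the single interval (0x61, 0x64).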

--names

If this option is given, resulting characters are displayed by their (lengthy) Unicode name.

Default: false (disabled)