Command Line Options

This section lists the command line options that control the behavior of the generated lexical analyzer. Strings following these options must either contain no whitespace or be enclosed in quotes. Numbers are specified in C-like format as described in sec-number-format.

-i, --mode-files file-list

file-list = list of files containing mode definitions (see sections <<sec-practical-modes>>, <<sec-practical-pattern-action-pairs>>, and <<sec-formal-generated-class-mode-handling>>).

DEFAULT = <empty>

-o, --engine, --analyzer-class name

name = Name of the lexical analyzer class that is to be created inside the namespace quex. This name also determines the filestem of the output files generated by quex. The namespace of the analyzer class can be specified at the same time by prefixing the class name with a sequence of namespace names separated by ‘::’, e.g.:

> quex ... --analyzer-class MySpace::MySubSpace::MySubSubSpace::Lexer

specifies that the lexical analyzer class is Lexer and that it is located in the namespace MySubSubSpace, which in turn is located in MySubSpace, which is located in MySpace.

DEFAULT = lexer
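The way such a specification breaks down into namespaces and a class name can be sketched as follows. This is only an illustration of the naming convention described above; the helper name split_analyzer_class is hypothetical and not part of quex.

```python
def split_analyzer_class(spec):
    """Split 'A::B::Lexer' into (['A', 'B'], 'Lexer').

    Hypothetical helper: it mirrors the '::' convention described above,
    it is not quex's own implementation.
    """
    parts = spec.split("::")
    # Everything before the last '::' is the namespace path,
    # the last element is the class name.
    return parts[:-1], parts[-1]


namespaces, class_name = split_analyzer_class(
    "MySpace::MySubSpace::MySubSubSpace::Lexer"
)
print(namespaces)  # ['MySpace', 'MySubSpace', 'MySubSubSpace']
print(class_name)  # Lexer
```

A name without any ‘::’ yields an empty namespace list, so the class name alone is returned.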

--output-dir, --odir directory

directory = name of the output directory where generated files are to be written. This does more than merely copying the sources to another place in the file system. It also changes the include file references inside the code to refer to the specified directory as a base.

--file-extension-scheme, --fes ext

Specifies the filestem and extensions of the output files. The provided argument identifies the naming scheme.

DEFAULT

If the option is not provided, then the naming scheme depends on the --language command line option. That is:

C++
  • No extension for header files that contain only declarations.
  • .i for header files containing inline function implementations.
  • .cpp for source files.
C
  • .h for header files.
  • .c for source files.
++
  • .h++ for header files.
  • .c++ for source files.
pp
  • .hpp for header files.
  • .cpp for source files.
cc
  • .hh for header files.
  • .cc for source files.
xx
  • .hxx for header files.
  • .cxx for source files.

For C, no alternative naming scheme is currently supported.
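The named schemes above amount to a small lookup table. The following sketch transcribes that table and derives file names from a filestem; the function name output_file_names is hypothetical, and the real set of files generated by quex is larger than one header and one source file.

```python
# Header and source extensions per naming scheme, transcribed from the
# list above. The DEFAULT scheme is left out because it depends on the
# --language option.
EXTENSION_SCHEMES = {
    "++": (".h++", ".c++"),
    "pp": (".hpp", ".cpp"),
    "cc": (".hh", ".cc"),
    "xx": (".hxx", ".cxx"),
}


def output_file_names(stem, scheme):
    """Return (header, source) file names for a filestem (hypothetical helper)."""
    header_ext, source_ext = EXTENSION_SCHEMES[scheme]
    return stem + header_ext, stem + source_ext


print(output_file_names("Lexer", "pp"))  # ('Lexer.hpp', 'Lexer.cpp')
```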

--language name

Defines the programming language of the output. name can be

  • C for plain C code.
  • C++ for C++ code.
  • dot for plotting information in graphviz format.

DEFAULT = C++

--character-display ['hex', 'utf8']

Specifies how the characters of the state transitions are to be displayed when --language dot is used. If hex is specified, then characters are displayed as Unicode code points in hexadecimal notation. If utf8 is specified, characters are displayed ‘as is’ in UTF8 notation.

DEFAULT = utf8

--debug

If provided, then code fragments are created to activate the output of every pattern match. Defining the macro QUEX_OPTION_DEBUG_QUEX_PATTERN_MATCHES then activates those printouts on the standard error output. Note that this option produces considerable code overhead.

DEFAULT = <disabled>

--buffer-based, --bb

Turns on buffer based analysis. If this option is not set, buffer based analysis can still be activated with the compile option QUEX_OPTION_BUFFER_BASED_ANALYZIS.

--version-id name

name = arbitrary name of the version that was generated. This string is reported by the version() member function of the lexical analyzer.

DEFAULT = “0.0.0-pre-release”

--no-mode-transition-check

Turns off the mode transition check and makes the engine a little faster. This option should not be used during development, but the final lexical analyzer should be created with this option set.

By default, the mode transition check is enabled.

--single-mode-analyzer, --sma

If there is only one mode, this flag can be used to inform quex that the mode is never going to be referred to. In that case no instance of the mode is implemented. This reduces memory consumption a little and may slightly increase performance.

--no-string-accumulator, --nsacc

Turns the string accumulator option off. This disables the use of the string accumulator to accumulate lexemes. See class ‘quex::Accumulator’.

By default, the string-accumulator is implemented.

--no-include-stack, --nois

Disables the support of include stacks where the state of the lexical analyzer can be saved and restored before diving into included files. Setting this flag may slightly reduce compile time.

By default, the include stack handler is implemented.

--post-categorizer

Turns the post categorizer option on. This allows a ‘secondary’ mapping from lexemes to token ids based on their name. See class ‘quex::PostCategorizer’.

--no-count-lines

Lets quex generate an analyzer without internal line counting.

--no-count-columns

Lets quex generate an analyzer without internal column counting.

If an independent source package is required that can be compiled without an installation of quex, the following option may be used:

--source-package directory

Creates all source code that is required to compile the produced lexical analyzer. Only those packages are included which are actually required. Thus, when creating a source package, the same command line ‘as usual’ must be used with the added --source-package option.

The string that follows is the directory where the source package is to be located.

For the support of derivation from the generated lexical analyzer class the following command line options can be used.

--derived-class, --dc name

name = If specified, the name of the derived class that the user intends to provide (see section <<sec-formal-derivation>>). Note that specifying this option signals that the user wants to derive from the generated class. If this is not desired, this option and the following one have to be left out. The namespace of the derived analyzer class is specified analogously to the specification for --analyzer-class, as mentioned above.

DEFAULT = <empty>

--derived-class-file filename

filename = If specified, the name of the file where the derived class is defined. This option only makes sense in the context of option --derived-class.

DEFAULT = <empty>

--token-prefix name

name = Name prefix to prepend to the name given in the token-id files. For example, if a token section contains the name COMPLEX and the token-prefix is TOKEN_PRE_ then the token-id inside the code will be TOKEN_PRE_COMPLEX.

The token prefix can contain namespace delimiters, i.e. ::. In the brief token senders the namespace specifier can be left out.

DEFAULT = QUEX_TKN_

--token-policy, --tp [queue, single]

Determines the policy for passing tokens from the analyzer to the user.

DEFAULT = queue

--token-memory-management-by-user, --tmmbu

Enables the token memory management by the user. This command line option is equivalent to the compile option:

QUEX_OPTION_USER_MANAGED_TOKEN_MEMORY

It provides the functions token_queue_memory_switch(...) for token policy ‘queue’ and token_p_switch(...) for token policy ‘single’ (section Token Passing Policies).

--token-queue-size number

In conjunction with token passing policy ‘queue’, number specifies the number of tokens in the token queue. This determines the maximum number of tokens that can be sent without returning from the analyzer function.

DEFAULT = 64.

--token-queue-safety-border number

Specifies the number of tokens that can be sent at maximum as reaction to one single pattern match. More precisely, it determines the number of token slots that are left empty when the token queue is detected to be full.

DEFAULT = 16
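One way to read the interplay of --token-queue-size and --token-queue-safety-border is sketched below. It assumes that the queue counts as full as soon as only the safety border remains free; this is an interpretation of the description above, not quex's actual code, and the function name is hypothetical.

```python
def is_queue_full(token_count, queue_size=64, safety_border=16):
    """Hypothetical model of the 'queue full' check.

    Assumption: the queue is treated as full once the number of free
    slots has shrunk to the safety border, so that one more pattern
    match can still send up to 'safety_border' tokens.
    """
    return token_count >= queue_size - safety_border


# With the default values, 48 slots are usable before the analyzer
# function has to return.
print(is_queue_full(47))  # False
print(is_queue_full(48))  # True
```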

--no-warning-on-no-token-queue

If set, this option disables the warning that is issued when an event handler is specified without having set --token-policy queue. Without a token queue, no tokens can be properly sent from inside event handlers. There might be other things the user decides to do without receiving any warning.

--token-id-offset number

number = Number at which the numeric values for the token ids start to count. Note that this does not include the standard token ids for termination, uninitialized, and indentation error.

DEFAULT = 10000

Certain token ids are standard in the sense that they are required for a functioning lexical analyzer, namely TERMINATION and UNINITIALIZED. The default values of the standard token ids do not follow the token id offset, but are 0, 1, and 2. If they need to be different, they must be defined in the token { ... } section, e.g.:

token {
    TERMINATION   = 10001;
    UNINITIALIZED = 10002;
    ...
}

A file with token ids can be provided by the option

--foreign-token-id-file filename

filename = Name of the file that contains an alternative definition of the numerical values for the token-ids (see also section <<sec-formal-macro>>).

DEFAULT = <empty>

The following options support the definition of an independently customized token class:

--token-class-file filename

filename = Name of the file that contains the definition of the token class. Note that the setting provided here is possibly overwritten if the token_type section defines a file name explicitly (see Customized Token Classes).

DEFAULT = $(QUEX_PATH)/code_base/Token

--token-class, --tc name

name is the name of the token class. Using ‘::’-separators it is possible to define the exact namespace as mentioned for the --analyzer-class command line option.

--token-id-type type-name

type-name defines the type of the token id. This defines internally the macro QUEX_TYPE_TOKEN_ID. This macro is to be used when a customized token class is defined. The types of Standard C99 ‘stdint.h’ are encouraged.

DEFAULT = uint32_t

--token-type-no-stringless-check, --ttnsc

Disable the ‘stringless check’ for customized token types. If the user defines a token type that cannot take a QUEX_TYPE_CHARACTER* then quex posts a warning. By means of this flag the warning is disabled.

There may be cases where the character used to indicate the buffer limit needs to be redefined because the default value appears in a pattern. (For ‘normal’ ASCII or Unicode based lexical analyzers, this would most probably not be a good design decision. But when other, alien, non-Unicode codings are to be used, this case is conceivable.) The following option allows modification of the buffer limit code:

--buffer-limit number

DEFAULT = 0x0

If the trivial end-of-line pre-condition (i.e. the ‘$’ at the end of a regular expression) is used, by default quex produces code that runs on both Unix and DOS-like systems. Practically, this means that it matches against ‘newline’ 0x0A and ‘carriage return/newline’ 0x0D 0x0A. For the case that the resulting analyzer only runs on a Unix machine some tiny performance improvements might be achieved by disabling the 0x0D 0x0A sequence and only triggering on 0x0A. In this case, the following flag may be specified:

--no-DOS
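The difference between the two behaviors can be illustrated with Python's re module as a stand-in for the generated engine. The pattern ‘end’ and the lookahead formulation are only illustrative assumptions; quex generates its own state machine and does not use these expressions.

```python
import re

# Default behavior: the end of line is recognized before "\n" (0x0A)
# as well as before "\r\n" (0x0D 0x0A).
EOL_DEFAULT = re.compile(r"end(?=\r?\n)")

# Behavior with --no-DOS: only "\n" (0x0A) terminates a line.
EOL_UNIX_ONLY = re.compile(r"end(?=\n)")

unix_line = "the end\n"
dos_line = "the end\r\n"

print(bool(EOL_DEFAULT.search(unix_line)))    # True
print(bool(EOL_DEFAULT.search(dos_line)))     # True
print(bool(EOL_UNIX_ONLY.search(unix_line)))  # True
print(bool(EOL_UNIX_ONLY.search(dos_line)))   # False
```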

For Unicode support it is essential to allow character conversion. Currently quex can interact with GNU IConv and IBM’s ICU library. For this, the corresponding library must be installed on your system. On Unix systems, the iconv library is usually present. If a coding other than ASCII is required, specify one of the following options:

--iconv

Enable the use of the iconv library for character stream decoding. This is equivalent to defining ‘-DQUEX_OPTION_CONVERTER_ICONV’ as a compiler flag. Depending on your compiler setup, you might have to set the ‘-liconv’ flag explicitly in order to link against the IConv library.

DEFAULT = <disabled>

--icu

Enable the use of IBM’s ICU library for character stream decoding. This is equivalent to defining ‘-DQUEX_OPTION_CONVERTER_ICU’ as a compiler flag. There are a couple of libraries that are required for ICU. You can query those using the ICU tool ‘icu-config’. A command line call to this tool with ‘--ldflags’ delivers all libraries that need to be linked. A typical list is ‘-lpthread -lm -L/usr/lib -licui18n -licuuc -licudata’.

DEFAULT = <disabled>

-b, --bes, --buffer-element-size [1, 2, 4]

This option specifies the number of bytes that a buffer element occupies. The buffer element is the ‘trigger’ on which the analyzer’s state machine triggers. Usually, a buffer element carries a character. This is true for fixed size character encodings, or when converters are used (see options --icu or --iconv). In these cases all characters are internally processed in Unicode. Thus, the size of a buffer element should be large enough to carry the Unicode value of any character of the desired input coding space. When using Unicode, ‘-b 4’ should be used to be safe, unless it is inconceivable that any code point beyond 0xFFFF ever appears. In that case ‘-b 2’ is enough.

Dynamic size character encodings are available with the ‘--codec’ option, for example ‘utf8’ and ‘utf16’. In this case, the buffer element carries a ‘character chunk’. UTF8 runs on chunks of one byte. UTF16 runs on two byte chunks. Accordingly, UTF8 requires ‘-b 1’ and UTF16 requires ‘-b 2’.

Only 1, 2, or 4 bytes can be specified.

DEFAULT = 1
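The rule of thumb for fixed size encodings can be written down as a small calculation. The helper below is hypothetical; it merely picks the smallest admissible size for the largest code point that is expected to occur. The last lines show why the dynamic size codecs work on chunks rather than on whole characters.

```python
def min_buffer_element_size(max_code_point):
    """Smallest of 1, 2, or 4 bytes that can hold max_code_point.

    Hypothetical helper illustrating the rule above, not part of quex.
    """
    for size in (1, 2, 4):
        if max_code_point < 1 << (8 * size):
            return size
    raise ValueError("code point does not fit into 4 bytes")


print(min_buffer_element_size(0x7F))      # 1
print(min_buffer_element_size(0xFFFF))    # 2
print(min_buffer_element_size(0x10FFFF))  # 4

# Dynamic size encodings: one character may occupy several chunks.
print(len("\u20ac".encode("utf-8")))               # 3 chunks of one byte
print(len("\U0001d11e".encode("utf-16-le")) // 2)  # 2 chunks of two bytes
```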

Warning

If a character size different from one byte is used, the .get_text() member of the token class contains an array of that particular type. This means that .text().c_str() does not result in a nicely printable UTF8 string. Use the member .utf8_text() instead.

--buffer-element-size-irrelevant

If this flag is specified, the regular expressions are not constructed and not checked to fit a particular character range. Note that this option is implicitly set when a ‘--codec’ is specified.

DEFAULT: not set.

--bet, --buffer-element-type name

A more flexible way to specify the buffer element size and type is to specify the name of the buffer element’s type, which is the purpose of this option. Note that there are some ‘well-known’ types such as uint*_t (C99 Standard), u* (Linux Kernel), and unsigned* (OSAL), where the * stands for 8, 16, or 32. Quex can derive their size automatically.

Note that if a converter is specified (--iconv, --icu, or --converter-new) and the buffer element type does not allow quex to deduce the Unicode coding name that the converter requires, then it must be explicitly specified by ‘--converter-ucs-coding-name’.

DEFAULT: Determined by buffer element size.

--endian [little, big, <system>]

CPUs differ in the byte ordering they use for integer numbers. If the lexical analyzer engine is created for the same CPU type that quex runs on, this option is not required, since quex determines the byte ordering on its own. If you create an engine for a different platform, you must know its byte ordering scheme, i.e. little endian or big endian, and specify it after --endian.

According to the setting of this option one of the three macros is defined in the header files:

  • __QUEX_OPTION_SYSTEM_ENDIAN
  • __QUEX_OPTION_LITTLE_ENDIAN
  • __QUEX_OPTION_BIG_ENDIAN

Those macros are of primary use for character code converters. The converters need to know what the analyzer engine’s number representation is. However, the user might want to use them for his own special purposes (using #ifdef __QUEX_OPTION_BIG_ENDIAN ... #endif).

DEFAULT = <system>
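What the two byte orderings mean for a two byte buffer element can be shown with Python's standard struct module. This only illustrates the concept; it does not reproduce how quex detects the ordering of the system it runs on.

```python
import struct
import sys

value = 0x0041  # the code point of 'A' in a two byte buffer element

# '<' forces little endian, '>' forces big endian byte order.
little = struct.pack("<H", value)
big = struct.pack(">H", value)

print(little.hex())  # 4100
print(big.hex())     # 0041

# Byte order of the machine running this script, 'little' or 'big'.
print(sys.byteorder)
```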

--converter-new String

Section Customized Converters explains how to implement customized converters. With the command line option above the user may specify his own converter. The string that follows the option is the name of the converter’s _New function. When this option is set, customized user conversion is turned on automatically.

--converter-ucs-coding-name, --cucn

Determines what string is passed to the converter so that it converts a codec into Unicode. In general, this is not necessary. But if an unknown user defined type is specified via ‘--buffer-element-type’, then this option must be specified.

DEFAULT: Determined by buffer element type.

--codec String

Specifies a codec for the generated engine. By default the internal engine runs on Unicode code points, i.e. ASCII for characters up to 0x7F. When --codec specifies a codec ‘X’, the internal engine triggers on code elements of ‘X’ instead. It then does not need character conversion (neither --iconv nor --icu). Codec based analyzers are explained in section Analyzer Engine Codec.

When --codec is specified, the command line flag -b or --buffer-element-size does not represent the number of bytes per character, but the number of bytes per code element. The codec UTF8, for example, is of dynamic length and its code elements are bytes, thus only -b 1 makes sense. UTF16 triggers on elements of two bytes, while the length of the encoding of a character varies. For UTF16, only -b 2 makes sense.

When --codec is specified, the range check for characters is disabled. That means, the option --buffer-element-size-irrelevant is set automatically.

--codec-file filename.dat

By means of this option a freely customized codec can be defined. The follower filename.dat determines at the same time the file where the codec mapping is described and the codec’s name. The codec’s name is the directory-stripped and extension-less part of the given follower. Each line of such a file must consist of three numbers that specify ‘source interval begin’, ‘source interval length’, and ‘target interval begin’. Such a line specifies how a cohesive Unicode character range is mapped to the number range of the customized codec. For example, the mapping for codec iso8859-6 looks like the following:

0x000 0xA1 0x00
0x0A4 0x1  0xA4
0x0AD 0x1  0xAD
0x60C 0x1  0xAC
0x61B 0x1  0xBB
0x61F 0x1  0xBF
0x621 0x1A 0xC1
0x640 0x13 0xE0

Here, the first 0xA1 Unicode code points (0x00 to 0xA0) are mapped one to one from Unicode to the codec. 0xA4 and 0xAD are also the same as in Unicode. The remaining lines describe how Unicode characters from the 0x600 page are mapped into the range from 0xAC to 0xFF.

Note that this option is only to be used if quex does not support the codec directly. The options --codec-info and --codec-for-language help to find out whether quex directly supports a specific codec. If a --codec-file is required, it is advisable to use --codec-file-info filename.dat to check that the mapping is in fact as desired.
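How such a mapping file translates Unicode code points can be sketched as follows. The sketch reads the third column as the start of the target interval, which is what the example rows suggest, and it assumes that all numbers are written in hexadecimal as in the example. The function names are hypothetical and not part of quex.

```python
# The iso8859-6 mapping from the example above.
ISO8859_6_DAT = """\
0x000 0xA1 0x00
0x0A4 0x1  0xA4
0x0AD 0x1  0xAD
0x60C 0x1  0xAC
0x61B 0x1  0xBB
0x61F 0x1  0xBF
0x621 0x1A 0xC1
0x640 0x13 0xE0
"""


def parse_codec_mapping(text):
    """Return a list of (source_begin, length, target_begin) tuples."""
    mapping = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        # Assumption: hexadecimal notation, as in the example file.
        mapping.append(tuple(int(field, 16) for field in fields))
    return mapping


def to_codec(code_point, mapping):
    """Map a Unicode code point to the codec, or None if not covered."""
    for begin, length, target in mapping:
        if begin <= code_point < begin + length:
            return target + (code_point - begin)
    return None


mapping = parse_codec_mapping(ISO8859_6_DAT)
print(hex(to_codec(0x621, mapping)))  # 0xc1
print(to_codec(0xA1, mapping))        # None
```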

Template and Path Compression can be controlled with the following command line options:

--template-compression

If this option is set, then template compression is activated.

--template-compression-min-gain 'number'

The number following this option specifies the template compression coefficient. It indicates the relative cost of routing to a target state compared to a simple ‘goto’ statement. The optimal value may vary from processor platform to processor platform, and from compiler to compiler.

DEFAULT = 1

--path-compression

This flag activates path compression. By default, it compresses any sequence of states that can be lined up as a ‘path’. This includes states of different acceptance values, store input positions, etc.

--path-compression-uniform

This flag enables path compression. In contrast to the previous flag, it only compresses states into a path if they are uniform. This simplifies the structure of the corresponding pathwalkers. In some cases this might result in smaller code size and faster execution speed.

--path-termination 'number'

Path compression requires a ‘pathwalker’ to determine quickly the end of a path. For this, each path internally ends with a signal character, the ‘path termination code’. It must be different from the buffer limit code in order to avoid ambiguities.

Modification of the ‘path termination code’ only makes sense if the input stream to be analyzed contains the default value.

DEFAULT = 0x1.

For version information pass the option --version or -v. The options --help and -h are reserved for requesting a help text.

Those are the options for using quex in the ‘normal’ mode where it creates lexical analyzers. However, quex also provides some services to query and test character sets. If one of the following options is given, quex does not create a lexical analyzer but responds with the information requested by the user.

--codec-info [name]

Displays the characters that are covered by the given codec’s name. If the name is omitted, a list of all supported codecs is printed. Engine internal character encoding is discussed in section sec-engine-internal-coding.

--codec-file-info filename.dat

Displays the characters that are covered by the codec provided in the given file. This makes sense in conjunction with --codec-file where customized codecs can be defined.

--codec-for-language [language]

Displays the codecs that quex supports for the given human language. If the language argument is omitted, all available languages are listed.

--property name

If name is specified, then information about the property with the given name is displayed. Note, that name can also be a property alias. If name is not specified, then brief information about all available unicode properties is displayed.

--set-by-property setting

For binary properties only the property name has to be specified. All other properties require a term of the form property-name = value. Quex then displays the set of characters that have this particular property.

--set-by-expression expression

Character set expressions that are usually specified in [: ... :] brackets can be specified as expression. Quex then displays the set of characters that results from it.

--property-match wildcard-expression

Quex allows the use of wildcards in property values. Using this option allows display of the list of values to which the given wildcard expression expands. Example: The wildcard-expression Name=*LATIN* gives all settings of property Name that contain the string LATIN.
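The effect of such a wildcard can be approximated with shell-style matching from Python's standard fnmatch module. Whether quex's wildcards follow exactly the same rules is an assumption, and the list of names is only a tiny hand-picked sample of values of the property Name.

```python
from fnmatch import fnmatchcase

# Small sample of values of the Unicode property 'Name'.
NAMES = [
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
    "FULLWIDTH LATIN SMALL LETTER A",
    "DIGIT ONE",
]


def expand_wildcard(pattern, values):
    """Return all values matched by a shell-style wildcard (illustration only)."""
    return [value for value in values if fnmatchcase(value, pattern)]


print(expand_wildcard("*LATIN*", NAMES))
# ['LATIN CAPITAL LETTER A', 'FULLWIDTH LATIN SMALL LETTER A']
```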

--numeric

If this option is specified, the numeric character codes are displayed rather than the utf8 characters.

--intervals

This option disables the display of single character or single character codes. In this case sets of adjacent characters are displayed as intervals. This provides a somewhat more abbreviated display.
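The idea of collapsing adjacent characters into intervals can be sketched as below. The function is a hypothetical illustration of the principle and does not reproduce the output format that quex prints.

```python
def to_intervals(code_points):
    """Collapse code points into a list of inclusive (first, last) intervals."""
    intervals = []
    for code_point in sorted(set(code_points)):
        if intervals and code_point == intervals[-1][1] + 1:
            # Adjacent to the previous interval: extend it.
            intervals[-1][1] = code_point
        else:
            # Gap: start a new interval.
            intervals.append([code_point, code_point])
    return [tuple(interval) for interval in intervals]


# 'a', 'b', 'c' are adjacent, 'z' stands alone.
print(to_intervals([0x61, 0x62, 0x63, 0x7A]))  # [(97, 99), (122, 122)]
```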

The following options control the comment which is added to the generated code:

--comment-state-machine

With this option set, a comment is generated at the beginning of the analyzer function that shows all state transitions of the analyzer. The format follows the scheme presented in the following example:

/* BEGIN: STATE MACHINE
 ...
 * 02353(A, S) <~ (117, 398, A, S)
 *       <no epsilon>
 * 02369(A, S) <~ (394, 1354, A, S), (384, 1329)
 *       == '=' ==> 02400
 *       <no epsilon>
 ...
 * END: STATE MACHINE
 */

It means that state 2369 is an acceptance state (flag ‘A’) and that it should store the input position (flag ‘S’) if no backtrack elimination is applied. It originates from pattern ‘394’, which is also an acceptance state, and from pattern ‘384’. It transits to state 2400 on the event of a ‘=’ character.

--comment-transitions

Adds to each transition in a transition map information about the characters which trigger the transition, e.g. in a transition segment implemented as a C switch-case construct:

...
case 0x67:
case 0x68: goto _2292;/* ['g', 'h'] */
case 0x69: goto _2295;/* 'i' */
case 0x6A:
case 0x6B: goto _2292;/* ['j', 'k'] */
case 0x6C: goto _2302;/* 'l' */
case 0x6D:
...

The output of the characters happens in UTF8 format.

--comment-mode-patterns

If this option is set a comment is printed that shows what pattern is present in a mode and from what mode it is inherited. The comment follows the following scheme:

/* BEGIN: MODE PATTERNS
 ...
 * MODE: PROGRAM
 *
 *     PATTERN-ACTION PAIRS:
 *       (117) ALL:     [ \r\n\t]
 *       (119) CALC_OP: "+"|"-"|"*"|"/"
 *       (121) PROGRAM: "//"
 ...
 * END: MODE PATTERNS
 */

This means that there is a mode PROGRAM. The first three patterns are related to the terminal states ‘117’, ‘119’, and ‘121’. The whitespace pattern of 117 was inherited from mode ALL. The math operator pattern was inherited from mode CALC_OP, and the comment start pattern “//” was implemented in PROGRAM itself.

The comment output is framed by BEGIN: and END: markers. This facilitates the extraction of this information for further processing. For example, the Unix command ‘awk’ can be used:

awk 'BEGIN {w=0} /BEGIN:/ {w=1;} // {if(w) print;} /END:/ {w=0;}' MyLexer.c
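The same extraction can be sketched in Python for environments without awk. The function below follows the logic of the awk command above: it starts copying at a line containing BEGIN:, copies every line including the one containing END:, and then stops. The function name and the sample input are hypothetical.

```python
def extract_comment_blocks(lines):
    """Return all lines between BEGIN: and END: markers, markers included."""
    inside = False
    extracted = []
    for line in lines:
        if "BEGIN:" in line:
            inside = True
        if inside:
            extracted.append(line)
        if "END:" in line:
            inside = False
    return extracted


sample = [
    "int x;",
    "/* BEGIN: MODE PATTERNS",
    " * MODE: PROGRAM",
    " * END: MODE PATTERNS",
    " */",
    "int y;",
]
for line in extract_comment_blocks(sample):
    print(line)
```

To process a generated file such as MyLexer.c, its lines can be passed in after reading them with open(...).read().splitlines().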

The following option influences the analysis process on the very lowest level.

--state-entry-analysis-complexity-limit N

Note

Never use this option until quex proposes in a warning message that you may use it to control the speed of code generation. The warning message proposing the usage of this option should only appear in engines with thousands of very similar patterns including some repetitions.

For state entry analysis an algorithm is applied whose computation time is quadratic in the number of different cases to be considered. In extremely unusual setups, this may blow up the computation time beyond what is acceptable. When more than ‘N’ different cases are detected, quex only considers the ‘N’ best candidates in the search for an optimal solution. This includes a certain risk of not finding the absolute optimum.

DEFAULT = 1000