SourceForge.net Logo

Introduction

The queχ program generates a lexical analyser that scans text and identifies patterns. The result of this lexical analysis is a list of tokens. A token is a piece of atomic information directly relating to a pattern, or an event. It consists of a type-identifier, i.e. the token type, and content which is extracted from the text fragment that matched the pattern.

Figure (here) shows the principle of lexical analysis. The lexical analyser receives a stream of characters "if( x> 3.1 ) { …" and produces a list of tokens that tells what the stream signifies. A first token tells that there was an if statement, the second token tells that there was an opening bracket, the third one tells that there was an identifier with the content x, and so on.

In compilers for serious programming languages the token stream is received by a parser that interprets the given input according to a specific grammar. However, for simple scripting languages this token stream might be treated immediately. Using a lexical analyser generator for handcrafted ad-hoc scripting languages has the advantage that it can be developed faster and it is much easier and safer to provide flexibility and power. This is demonstrated in the following chapters.

figures/lexical-analysis-process.png
Figure: The process of lexical analysis.

The following features distinguish queχ from the traditional lexical analysers such as lex or flex:

This text briefly explains the basic concepts of lexical analysis in queχ. Here, a short review is given on lexical analysis, but then it concentrates on the introduction of the features mentioned above. The subsequent chapter dicusses a simple example of a complete application for lexical analysis. The final chapter elaborates on the formal usage of all features of queχ.

Installation

Before beginning the installation of queχ, make sure that Python (http://www.python.org) is installed. Most linux distributions provide handy rpm packages—so this should not be an issue. Then, follow these steps:

  1. Extract the file queχ-x.x.x.tar.gz into the desired directory.

  2. Setthe environment variable QUEX_PATH in your system environment to the place where you installed queχ. If you are using a Unix system and the bash-shell, add the following line to your .bashrc-file: export QUEX_PATH=the/directory/where/queχ/was/installed/ if you installed queχ in the directory given on the right hand side of the assignment.

  3. Make a link from the file $QUEX_PATH/queχ-exe.py to $EXECUTABLE_PATH/queχ where $EXECUTABLE_PATH is a path where executables can be found by your system. If you work on a unix system, you can type ln -s $QUEX_PATH/queχ-exe.py /usr/local/bin/queχ You can ensure executable rights with chmod a+rx $QUEX_PATH/queχ-exe.py, chmod a+rx /usr/local/bin/queχ.

  4. Supply your C++ compiler with the include path $QUEX_PATH. For the vast majority of compilers this means you must add -I$(QUEX_PATH) to the list of compiler flags. An example of how this is done can be observed in the test applications which come with the queχ distribution.

That is all. Now, you should either copy the directories ./demo/* to a place where you want to work on it, or simply change directory to there. These directories contain sample applications 000, 001, $\ldots$. Change to the directory of the sample applications and type make. If everything is setup properly, you will get your first queχ-made lexical analyser executable in a matter of seconds.

The example applications depict easy ways to specify traditional lexical analysers, they show some special features of queχ such as mode transitions, and more. Each demo-application deals with a particular feature of queχ:

The author of this document suggests that the user look at these sample applications first before continuing with the remainder of this text. With the background of easy-to-use examples to serve as starting point for their own efforts, it should be natural to get a feeling for the ease of queχ.

Licensing Information

The Software is distributed under LPGL, provided the following restriction:

Note
There is no license for products targeting military purpose.

Note, that if any part of the LPGL contradicts with the former restrictions, the former restrictions have precedence. The no warranty clause holds in any case. Here is a brief summary of the LPGL:

Queχ is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

The Name of the Game

The name queχ obviously has a lexo-graphical relation to lex and flex—the most well known tools for generating lexical analysers at the time of this writing. The Qu at the beginning of the word stands for quick, so hopefully the new features of queχ and the ability to create directly coded engines will generate much faster lexical analysers. Also, the process of programming shall be much quicker by means of the handy shortcuts and elegant solutions that queχ provides.

The last letter χ is a the greek lowercase letter chi. It is intended to remind the author of this text that he actually wanted to create a better \TeX
[As Donald Knuth explains it\cite{}, the last letter in \TeX is an uppercase chi so the name is to be pronounced like the German word ach (it's actually more a sound Germans utter, rather than a word with a dedicated meaning). Analogously, the χ at the end of queχ should not be pronounced like the x of the German word nix.]
system with optimized source code appearance. When he realized that traditional lexical analysers did not provide the functionality to express solutions easily he started working on queχ—and dove much deeper into the subject than he ever thought he would.

Basics

Lexical analysis is all about patterns. The following section recalls some basics of the mechanicss of pattern matching. For a detailed description see \cite{}, \cite{}, and \cite{}. It also explains how tokens are communicated to the caller of the lexical analyzer. Further, it introduces queχ's concept of lexical analyzer modes. Finally, a section explains the control and handling of mode transitions.

Pattern Matching

Pattern matching can be described as a finite state automaton where incoming characters play the role of triggers. Each incoming character (trigger), determines whether the state machine transits into a new state or if it remains in the same state. Consider figure Example that depicts a state machine to match a floating point number. In its initial state it waits for a digit (the START state). If no digit arrives, the state machine goes into the FAILED state. It can be said for sure, that there was no number. If a digit arrives, though, it transits into DIGITS-A. As long as digits arrive, it remains in this state. If a non-digit comes in it transits into the ACCEPTANCE state, since already enough digits have arrived as to judge that it is a number. However, if a dot arrived, then the state machine enters into the DOT state. If after the dot a new digit arrives, it enters into the state DIGITS-B. Note, that in the states DIGITS-A, DOT, or DIGITS-B a character from the else set (non-expected characters) lets the machine still transit into an ACCEPTANCE state, since enough characters have arrived to judge that it is a number.

figures/pattern-state-machine.png
Figure: A state machine to detect a floating point number pattern.

In real world lexical analysis a whole bank of state machines are active at the same time. Starting from a position $x_0$ in the character stream, a_ step in lexical analysis_ consists of the following procedures:

  1. All state machines (one for each pattern) are set into their START state.

  2. Starting at position $x_0$, characters are eaten and the state machines transit their internal states. Some of them enter the FAIL state and drop out, others enter the ACCEPTANCE state and drop out.

  3. At some point in time $x_1$, each of state machines either enters the FAIL state or the ACCEPTANCE state
    [Let's assume that the end-of-stream character be included into the pattern definitions, so that this statement holds.]
    . At this point in time a decision is to be made about who is the winner.

    1. If all state machines reach the FAIL state, then it can be set that from position $x_0$ the character stream is not sound. The lexical analyser could only send an error-token. One looks at the next character after $x_0$, i.e. $x_0 = x_0 \,+\,1$ and goes back to step 1.

    2. If one or more state machines reache the ACCEPTANCE state, then the pattern that has eaten the most characters, i.e. lived the longest is considered the winner (the matcher). The show continues.

  4. The matcher pattern has an associated pattern match action to be executed when the pattern matches. Usually, this interprets the character stream from position $x_0$ to position $x_1$ and puts the information into a token, i.e. some signal as output of the lexical analyser.

In brief, a step in lexical analysis starts at a position $x_0$ in the character stream. After eating some characters, at some position $x_1$, it can be decided what pattern matches the best, and eventually the action related to the winning pattern is executed. Then lexical analysis starts at position $x_1$ all over again until the end of the character stream.

In principle, the functioning of a lexical analyser is determined by a 1) a set of patterns and 2) by their related actions. For most practical applications though one requires additionally a definition of token identifiers because the signals need to be understood by another unit (e.g. the parser).

Traditionally, lexical analysers send a single token for an identified pattern in the character stream. Practically, this means that, after a pattern match has happened the token-getter-function returns the token that was identified. Queχ, though, implements a more flexible approach which is better suited for implicit tokens and mode transitions. The following section shows how tokens are communicated by a lexical analyser created with the queχ program.

The Token Queue

Queχ promotes the idea of implicit tokens, i.e. tokens that are produced without any pattern match. This concept raised from the fact that some tokens are, in fact, redundant and can be derived from context. From the perspective of the source code appearance of a programming language, a redundancy free language is much clearer than one that forces the programmer to specify information over and over again. Consider the example of the \verb|\|begin-\verb|\|end-regions in {LaTeX} as shown in the subsequent listing.

\begin{itemize}
\item this is the first item.
\item this is the second one.
...
\item this is the last item.
\end{itemize}

Logically, the begin and end tokens could have been derived from the fact that an \verb|\|item token pops up in normal text mode. Implicit tokens means that the lexical analyser can read between the lines. It sees things that are meant to be there from the context. Implementing the concept of implicit token generation inside lexical analysis rather than in a parsers allows the parser to constitute a unit that reasons logically, i.e. it considers distinct signals as input, it has a clear set of grammatical rules, and produces a clear understanding of the information (e.g. a syntax tree of a program). In Queχ's philosophy words that are implicitly thought to be inserted into the context of a sentence, are treated by lexical analysis. Multiple-Meanings (TODO: check this wording), i.e. the interpretation of a signal/token due to its context is expressed by grammar.

Traditionally, the parser calls a function of the form get_token(); the lexical analyser does one single pattern match, and then returns the token that represents the matched pattern---or an error, if no pattern matched. This mechanism is not sufficient when implicit tokens are involved, because implicit tokens can pop up out of nowhere in the character stream. When a pattern matches a whole set of such implicit tokens may be ready to be sent.

Sequence diagram displaying the token-passing from the pattern-match action to the caller of get_token().

image::figures/token-queue.png

Figure style=ref shows the information flow when a lexical analyser generated by queχ parses a character stream. The user begins and requests a new token using member function get_token(). The engine's object start then the lexical analysis. During the analysis, the pattern matching actions may send a whole set of tokens into the _token-queue_. When the pattern matching actions decide to come back, the get_token() function returns the first token that was entered into the queue.

The subsequent call to get_token(), though, does not trigger a lexical analysis since there are still tokens in the queue. Until the queue is empty, the function only returns tokens from the queue. When the queue is empty, the lexical analysis starts again in the same manner as described above. From the caller's side the process appears to be a reading from a continuous token stream. From inside the pattern match action the process appears to be a sending process, where tokens are sent to some recipient.

An important discussion evolves around what type of variables should be supported by the token that is sent. The author of queχ, prefers the idea of separating the interpretation of the lexeme from lexical analysis. This is for the following reasons:

Clarity of Concept
  • The operation, such as convert to double needs to be imported somehow into the lexical analysis framework. The lexical analyser by itself has no knowledge how to do that.

  • The token type becomes polymorph. That means that it has to support multiple data types. This increases complexity of handling this simple data structure.

  • A separate layer of interpretation increases clarity.

Computational Issues
  • An interpretation creates an object, i.e. the result of the interpretation of the lexeme. This object is possibly not used by the parser. Thus the creation, initialization and destruction of this object is wasted effort. If the parser controls the interpretation these redundant operations can be avoided.

  • It is particularly hard to define a data structure that can carry multiple types of objects, possibly of different sizes, that is memory and computationally efficient.

  • If the interpreted objects posses a some more complex structure, e.g. pointers to related objects, then the data locality increases. That means that the probability of a cache-miss increases. This can have a dramatic impact on the computational speed.

On the other hand, there might be hints that the lexer gives about a token. The author did not come up with the perfect token class, but instead leaves it open to be redefined by the user for his particular needs.

Lexical Analysis Modes

In many cases the behavior of lexical analysers can be described effectively by modes for lexical analysis. A standard example is format strings. In format strings, there is no need for the lexical patterns that are necessary when parsing an algorithm. They are actually better turned off. Thus, a mechanism is needed that is able to control the activation and deactivation of groups of patterns to be matched.

A lexical analyser mode is a state of the lexical analyser where only a particular set of patterns is actively waiting to be matched. A lexical analyser needs to be in exactly one mode at a particular point in time. Thus, a lexical analyser mode corresponds to the way that the input stream is currently transformed into signals, i.e. tokens. To handle the detection and influence of mode transitions, Queχ provides event handling mechanisms as described in the following sections.

A potential disadvantage of modes is confusion. With traditional lexical analyser generators. In flex, for example, the end-user does not see more than a token stream. He has no insight into the current lexical analyser mode. He cannot sense or control the mode transitions that are currently being made. The mode transitions are hidden somewhere in the pattern match actions. GNU Flex's start conditions are similar to modes, but the only way two modes, A and B, can be related in flex is via inclusion, i.e. by letting a pattern be active in A and B. There is no convienient mechanism to say: let B override all patterns of A. This is where the mode inheritance relationships of Queχ provide clear, convenient advantages as described in the section after next.

Mode Inheritance

Object oriented programming introduced the concept of_ inheritance_\cite{}, in the sense that two classes can have an is-a relationship. The class of cars may for example be related to the class of vehicles by an is-a relationship. This means that cars inherit all attributes of the broader class vehicles and may add some specific attributes to it. Such inheritance relationships are directional. In our example this means that every object that belongs to the class of cars also belongs to the class of vehicles (every car is a vehicle). However, not every object that belongs to the class of vehicles (e.g. a submarine) belongs to the the class of cars. Mode inheritance in queχ works similarly and effects the following three mechanisms:

Pattern Dispatching

If mode B is derived\footnote{The statements B is derived from A, A is base class of B, and B inherits from A are equivalent in meaning in the context of object oriented programming. In the same way, the term derivation, base class, and inheritance shall express mode relations in mode oriented lexical analysis.} from mode A, then it overtakes all of its patterns and the corresponding pattern-match actions. This is practical if one wants to ensure that a certain mode reacts on certain patterns. In this case it has to be derived from the mode that shows the desired behavior.
{bf Rule}: If mode B is derived from A and both have a common pattern {sf XYZ} where mode B relates it to a pattern action $XYZ_B$ and mode A relates it to a pattern action $XYZ_A$, then the pattern action $XYZ_B$ is cancelled. A base pattern-action pair overrides any derived pattern action pair of the same pattern.

Event Handlers

If mode B is derived from mode A, then B's event handler will be executed after A's event handlers. For example, A may have an event handler on_entry for entering mode A. Mode B itself might have an event handler for on_entry. If the lexical analyser now enters mode B, the event handler of its base mode A is first executed and then the event handler of class B itself.
{bf Rule:} _Event handlers are executed in sequence from base to derived classes._

Mode Access

If mode B is derived from mode A, then any permission of entry or exit to A is also a permission for B to enter or exit. The philosophy behind this is that that due to pattern inheritance mode B will behave like mode A, so it can be trusted the same way that one trusts A.
{bf Rule:} Entrys and exits that are allowed for a base class are also allowed for all its derived classes.

Figure style=ref shows an example where mode A acts as a base for mode B. Internally, queχ creates a new mode B* that contains the patterns that were explicitly givent by the definition of B and the patterns that are inherited from A. The pattern match action of {sf XYZ} from mode B is omitted, since mode B has to behave like mode A for patterns specified in mode A. The on_entry and on_exit functions of mode A are pasted into B's own event handlers, so that they are executed before them. What is not displayed in the figure is access inheritance. When Queχ checks the entry and exits of a mode it consideres automatically also the base classes. If there is no base mode that acts like a door opener to a particular mode, then transits will cause a run-time error.

figures/mode-inheritance.png
Figure: Mode Inheritance.

In conclusion, it can be said that inheritance is a very useful tool to implement some standard behavior in a base mode. Derived modes can then comply to this standard simply by being derived from this mode. Also, for the control of mode transitions the is-a relationships of inheritance provides a lot of flexibility. Mode inheritance promotes thinking about mode relationships naturally and helps to avoid code duplication.

Mode Transitions

In Queχ, modes can be changed by two basic mechanisms. First, modes can simple be changed. This makes sense if the new mode knows exactly to which mode it has to change, and this does not depend on the previous mode. Another mechanism is that of pushing and popping. This makes sense if the new mode shall return to the previous mode. It is therefore similar to a call-subroutine in procedural programming languages.

Figure style=ref shows a typical mode transition. Mode X has some patterns and one of them matched. Inside the pattern match action a transition to mode Y is requested. This triggers the event handler on_exit() of mode X before any transition is made. Here the user can send tokens, change some internal variables or write some debug output. Then the event handler on_entry() of entered mode Y is called. Again, the user can provide some actions related to the entrance of that mode.

figures/mode-transition.png
Figure: Event handling during lexical mode transitions.

The ability to define event handlers for entering and exiting modes not only supports implicit token production, but also supports transparency. It can now exactly be traced how modes are transitted. In order to restrict the transition from and to lexical modes, queχ provides transition control mechanisms. In the header of a definition of a mode X it can be specified from which modes mode X can be entered and to which modes mode X can exit. If a mode transition happens that does not conform to these restrictions, a run-time error is triggered. This helps to prevent the lexical analyser from sliding from one mode to another due to some thoughtlessly designed patterns.

Figure style=ref shows an example, where mode transition control is very handy. There is a lexical mode for parsing the content of a {sf FUNCTION}, a mode for {sf COMMENT} and one for {sf DOCU}-mentation parsing, a mode for parsing a character {sf STRING} and one for {sf MATH}. The design here says that the comment mode can only be activated from the {sf FUNCTION} mode and can only leave to {sf FUNCTION} mode. Similarly for the {sf STRING} and the {sf MATH} mode, Queχ provides the feature that for {sf COMMENT} mode, for example, one can specify that it can only be entered from {sf FUNCTION} mode, thus preventing the {sf DOCU}, {sf STRING} and the {sf MATH} mode from ever transiting into {sf COMMENT} mode. Also, it can be specified that the {sf COMMENT} mode can only exit to {sf FUNCTION} or {sf DOCU} mode, thus preventing it from dropping into {sf STRING} or {sf MATH} mode.

figures/mode-transition-control.png
Figure: Restricted mode transitions.

Extensive mode transitions, without a control mechanism as implemented is queχ, is highly error prone. The advantage is that the restrictions on the mode transitions are explicitly specified in the description of the lexical analyser. The alternative would be to create an external document and review the behavior of the lexical analyser with respect to mode transitions whenever an error is assumed from being caused by straying mode transitions. Practically, solid handling of extensive mode transitions in lexical analysis are not possible except with a control mechanism as provided by queχ.

Indentation

With the rise of the Python programming language, the use of indentation as the scope delimiter has become popular. Indeed, it may be the most efficient method to delimit scope with the least amount of additional characters provides a convenient mechanism to handle indentations which runs quasi-parallel to the pattern matching: indentation events.

Figure style=ref displays an example of the principle of indentation events. Whenever the lexical analyser reaches the first non-whitespace in a line, an indentation event (indicated as a little star in the figure) is triggered. The lexical analyser engine then calls a user defined indentation handler. The numbers at the indentation events indicate the number of characters that the indentation spans. They are passed to the user's indentation handler as arguments. The indentation handler can then keep track of indentation blocks or whatever its little heart desires.

figures/indentation.png
Figure: Principle of Indentation Events.

Note that it is not trivial to express indentation in terms of pattern action pairs based solely on regular expressions. It is not enough to define a pattern such as

          P_INDENTATION_CATCHER    "\n"[ ]*

That is a newline followed by whitespace. Imagine, one introduces a comment sign such as the traditional # for comment until newline. The comment eating pattern would be at first glance:

          P_COMMENT_EATER    "#"[^\n]*\n

That is a # followed by anything but newline and then one newline. The action related to this pattern would have to put pack the last newline. Otherwise the indentation catcher which starts with a newline can never trigger. In this particular case, this problem can be solved by deleting the last newline from the comment eater pattern, knowing that after as many not-newline as possible there must be a newline, i.e.

          P_COMMENT_EATER    "#"[^\n]*

The last newline is then eaten by the indentation catcher. However, the main problem remains: with such a design patterns need to know about each other to work propperly! In an environment of many different modes which are additionally related by inheritance, it is difficult to guarantee that all patterns respect the quirks of each other. They better work independently. The indentation events of queχ prevent inter-pattern dependencies.

Similarly, catching indentation with pre-condition newline plus whitespace, i.e. \verb|^[ \t]*| is fragile, in the sense that another pattern that contains newline plus whitespace might hinder this pattern from triggering. In a lexical analyzer with dozens or hundreds of patterns this becomes quickly unmanageable. Errors that arise from patterns defined somewhere else are very hard to find and require a lot of insight into the actual process of lexical analysis. Using the on_indentation event handler ends up in a much clearer and safer design. For more information about the pre-condition newline pitfall see section // \ref{sec-formal/patterns/context-dependent-pitfalls}, page // \pageref{formal/patterns/context-dependent-pitfalls}. [formal-patterns-context-dependent-pitfalls].

Caveat:

Note that if a pattern contains more than one newline then only the indentation event concerning the last newline is triggered! Imagine a pattern

      STRANGER   " "*"hello"[\n]+" "*"world"[\n]+" "*"how are you?"

then the following pattern would match

     hello
   world
        how are you?

If this matches, then the lines of hello and world do not trigger an indentation event. So, when dealing with indentation based scoping such strange things are best avoided\footnote{Probably, it is anyway hard to think of a case where such pattern definitions are not to be avoided.}.

Line and Column Number Counting

Any compiler or lexical analyzer imposes rules on the text that it has to apply. Most of those texts are written by humans, and humans occasionally make errors and disrespect rules. A gentle compiler tells its user about his errors and also tells it about the place where the error occured. Here line numbering becomes crucial. Additionally, columns can be counted. This provides even more detailed information.

There is a general default algorithm to count lines and columns that is always applied in case that there is no better alternative. Queχ analyzes the patterns and adapts the way of counting lines and columns according to special characteristics\footnote{Actually, even the indentation count algorithm is adapted to profit from knowledge about the patterns internal structure.}. If the pattern contains, for example, a fixed number of newlines, then only a fixed number is added and no newlines are counted at runtime. The mechanisms for line and column counting are optimized for the most reasonable pattern characteristics. There are strange cases\footnote{Example: A pattern that contains a newline which is followed by a fixed number of characters. The determination of this in the context of post-conditions is complicated. On the other hand, such patterns are considered strange and occur rarely,, so the expected gain with an optimized algorithm was considered negligible by the author.} for which a slightly better counting mechanism might be found. For reasonable applications as known from popular programming languages the mechanisms of counting should be optimal.

Note that line and column counting can be turned off individually by pre-processor switches. By these switches the generated analyzer is naked of any counting mechanism and it is likely that it runs a little faster. For serious applications, though, at least line number counting should be in place for error reporting.

Summary

This chapter discussed the major features of queχ. To the knowledge of the author of this text, at the time of this writing, there is no lexical analyser generator that matches these features. Moreover, once one gets used to thinking in terms of sending tokens and mode-transitions, and once one gets used to defined modes that implement standard behavior for other modes using inheritance, then it becomes hard to get back to the cumbersome thinking of plain lex or flex. The little overhead required to learn new concepts comes with the great advantage of being able to design the lexical analysis process in a very natural way
[Well, at least natural to the author of this text.]
.

Practical Usage

This practical example guides the user through the phases of creating a lexical analyser using queχ. The files used in this example can be found in the directory demo/001/ of the queχ distribution. The file simple.qx contains the description of the lexical analyzer, lexer.cpp contains a lexical analyzer application, and a Makefile allows for the convenient creation of the whole lexical analyzer application. Copying and pasting this example may be a good starting point for working with queχ. A good approach is to divide the development of a lexical analyzer into the two steps of design and implementation. The first step consists of three minor activities:

Definition of Patterns

Patterns can be defined directly on the pattern action pair line, but it is much more intuitive to define patterns in terms of regular expressions \cite{} in a separate data section. Those patterns act like constants that allow a quick understanding of what a pattern/action pair stands for. Using a name {NOT_NEWLINE}+ is simply visually more appealing than writing [^\n]+}. These patterns can be defined very easily in terms of regular expressions.

Definition of Token-IDs

Finally, we want to send tokens, and those tokens must have identifiers. For this reason, the user needs to specify names of tokens in a token section so that queχ can create ids for them.

Definition of Modes

This is best done with a drawing program or by hand on paper. One should be clear about how lexer modes are named, how they relate to each other, and which transitions are possible. This design will be the basis for the later coding of modes.

Once these simple steps are completed, the implementation of the modes in terms of pattern-action pairs can be started. The example discussed in this section develops the following files, which are required for generating a lexical analyser:

simple.qx

which contains all information queχ needs to create a lexical analyzer.

lexer.cpp

which uses the created lexical analyser in a mini example application.

To follow the explanations of this chapter the directory demo/001 is best copied into a user domain for further examination and modifications.

Patterns

In this simple example, only very basic patterns have to be defined. There is a pattern for whitespace, consisting of space, tabulator, and newline. There are two bracket types for opening and closing brackets: an assignment operator =, the two keywords if and struct, identifiers consisting of letters from A-Z in capital and lower-case, floating point numbers and string delimeters. The listing in figure [fig-pattern-code] shows the content of the define section in simple.qx describing all required patterns in terms of regular expressions.

Example: Definition of pattern shorthands in a define-section.
define {
    // patterns of interest for all modes
    P_WHITESPACE          [ \t\n]
    P_IDENTIFIER          [_a-zA-Z]+
    P_NUMBER              [0-9]+
    // string mode related patterns
    P_STRING_DELIMITER                "\""
    P_BACKSLASHED_STRING_DELIMITER    "\\\""
    P_BACKSLASHED_BACKSLASH           "\\\\"
}

The filename including the extension can, of course, be chosen freely. Note that all pattern names are global. They are not local to a specific mode. That means, that all modes can access the pattern definitions. Note also that inside the define-range only //-comments are allowed. The P_-prefix was chosen to indicate to the reader, that this is a pattern definition. There is no danger of interfering with the program namespace. These shorthand names are only considered for the definition of regular expressions. The later generated engine does not contain them as variables.

Token-ID

Token-ids can be specified as simply as patterns. A token-section containing token-ids only has to contain a newline separated list of token names. Queχ will create a set of constants with unique numerical values. Figure [fig-token-ids-code] shows the definition of token-ids for the example.

Example: Definition of token-ids in a token-section.
token {
    BRACKET_O            // appears as TKN_BRACKET_O
    BRACKET_C            //            TKN_BRACKET_C
    CURLY_BRACKET_O      // ...
    CURLY_BRACKET_C
    OP_ASSIGNMENT
    IF
    STRUCT
    SEMICOLON
    IDENTIFIER
    NUMBER
    STRING
}

Note, that queχ pastes a prefix in front of the numerical constants for the tokens. A token-id of STRUCT will be called TOK_STRUCT, if TOK is specified as token-prefix. This is discussed briefly in the section [sec-formal-command-line-options]. It is also important to reinforce that the token-ids are relevant to the namespace. They name constant variables inside the namespace queχ. It is not advisable to set the token-id prefix to an empty string and name a token-id i or x because those names are likely to interfere with the names of other variables.

Modes

The design of modes, their relations, and their transitions is something to be done informally with a drawing program or on a sheet of paper. It is basically an aid to help user develop a clear design of the lexical analysis process. In our simple example, there are only three modes:

END_OF_FILE

This mode describes the reaction to end-of-file. All modes in the simple example share this behavior, so they are derived from this mode. On the other hand, this is more an abstract mode\footnote{In the C++ sense of the word, there is no real existing mode of that type. Only derived modes can exists.}. The lexical analyser shall never be in an END_OF_FILE mode, but in a derived mode of it.

PROGRAM

In this mode, we detect program related tokens. It is derived from END_OF_FILE. Thus, if an end-of-file arrives in this mode it behaves properly. When a string delimiter arrives, this mode shall transition into the STRING_READER mode.

STRING_READER

This mode shall also treat end-of-file in the standard way, so it is also derived from END_OF_FILE. However, all patterns for number detection, keywords, and brackets are totally meaningless inside the string. A string-delimiter shall trigger the return to the PROGRAM mode.

The inheritance relationships in this example are displayed in figure style=ref. Both real modes PROGRAM and` STRING_READER` are derived from END_OF_FILE so that they implement the standard behavior for the end-of-file event.

figures/inheritance-modes.png
Figure: Inheritance relationships of the modes in the example lexical analyser.

In this example, the lexical analyser can only be in one of two modes. The posible transitions are shown in figure style=ref. PROGRAM mode can transit into STRING_READER mode and vice versa. The mode inheritances and transitions are only pinpointed in non-machine readable form. This information can be used to write the first stubs of the modes in which one will later fill in the pattern-action pairs.

figures/mode-transition-example.png
Figure: Scratch of mode transitions in the simple example.
Example: Example code stubs for lexical analyser modes.
mode END_OF_FILE :
<inheritable: only>
{
...
}

mode PROGRAM :
     END_OF_FILE
<exit:  STRING_READER>
<entry: STRING_READER>
{
...
}

mode STRING_READER :
     END_OF_FILE
<exit:  PROGRAM>
<entry: PROGRAM>
{
...
}

The three code-stubs for the three modes are shown in figure style=ref. Until now, there are no pattern action pairs defined inside the modes. The dots inside the curly braces will later be replaced by the pattern-action pairs. First of all, mode END_OF_FILE is defined. It does not inherit any mode, but can only be used as a base mode. Thus the option

    <inheritable: only>

is specified in edgy brackets. Second, the PROGRAM mode is derived from END_OF_FILE which is indicated by its name following the colon. The options

    <exit:  STRING_READER>
    <entry: STRING_READER>

tell that this mode can be entered from STRING_READER mode, and that it can exit to this mode. Any transition to another mode (that is not derived from STRING_READER) is detected as an error\footnote{Note, that inside queχ assert() commands are used. According to ISO9899 (ANSI-C) the command only takes effect, if NDEBUG is not defined. If it is, the mode transitions are consequently not checked by the queχ engine. The general rule applies here also: only define NDEBUG if it is 100\% safe to say that no use-case will violate the design.}. Analogously, the STRING_READER mode is defined specifying its base class and the allowed transitions to PROGRAM.

Note, that all options given here are indeed optional. They are not necessary to define pattern-action pairs and they do not contribute to the functioning of the generated lexical analyser. However, they allow verification of whether or not mode definitions and mode transitions are according to the design. The detection of usage against design specifications is a great help to avoiding errors or finding their causes.

Pattern-Action Pairs

The previous sections explained the definitions of patterns, token-ids, and mode characteristics. In this section the core of the lexical analysis is discussed: pattern-action pairs. It is a basic concept in lexical analysis that a pattern triggers an action. This action has the following goals:

The most simple pattern-action pair consists of sending a specific token whenever a particular pattern occurs. Consider for example figure style=ref. It shows the definition of a mode that can handle the end of a file. The only pattern of this mode is <<`EOF`>>. The action to be executed if this pattern occurs is embraced by curly brackets. Inside the brackets a special token is sent, a token with the id:

          quex::TKN_TERMINATION

telling that lexical analysis is over. Note, that in queχ the keyword self and not a this-pointer is used to refer to the lexical analyser. Recall that queχ is a code generator, not a compiler. Internally, the code fragments that are specified as pattern-action pairs or event handlers may be pasted by queχ into objects other than the lexical analyser (e.g. the lexical-analyser-mode-information structures). Also the user might derive his own class from the lexical analyser class, so there would be a type-issue. To unify the access to the lexical analyser, all places were code is pasted allow to use self as a variable that represents the lexical analyser. Use self. not this→!

Example: The END_OF_FILE mode handling the end-of-file pattern.
mode END_OF_FILE :
<inheritable: only>
{
    <<EOF>> {
        self.send(quex::TKN_TERMINATION);
        RETURN;
    }
}

A final RETURN statement in the pattern-action pair says that the flow of control can be turned back to the caller of function` get_token(). Note again, that `return is not used. The RETURN macro ensures that the token queue is in a sound state before returning from the lexical analysis. The rule here is: say RETURN not return.

The pattern-action pairs of the PROGRAM mode displayed in figure style=ref are a little more diverse. Note also, that for simple patterns the operator =$>$ is used to express the sending of a token of a specific id. The whitespace pattern does not cause any token to be sent and one does not return to the user, thus it is followed by an empty pair of curly brackets. The pattern P_IDENTIFIER creates a token that contains the string that was matched. P_NUMBER also stores a number that corresponds to the lexeme that was matched. The pattern` P_STRING_DELIMITER` initiates a mode transition to the mode STRING_READER. Since we are working with a token queue it is necessary to send a dummy token TKN_EVENT_MODE_CHANGE and return. The return is necessary, so that the analyser function pointer can be reset by the analyser.

Example: The PROGRAM mode.
mode PROGRAM :
     END_OF_FILE
<exit:  STRING_READER>
<entry: STRING_READER>
{
    "{"         => TKN_CURLY_BRACKET_O;
    "}"         => TKN_CURLY_BRACKET_C;
    "="         => TKN_OP_ASSIGNMENT;
    "struct"    => TKN_STRUCT;
    "if"        => TKN_IF;
    ";"         => TKN_SEMICOLON;
    {P_NUMBER}  => TKN_NUMBER(atoi(yytext));
    {P_IDENTIFIER} { self.send(TKN_IDENTIFIER, yytext); RETURN; }

    {P_WHITESPACE} {
    }
    {P_STRING_DELIMITER} {
        self << STRING_READER;
        self.send(TKN_EVENT_MODE_CHANGE);
        return;
    }
}

The STRING_READER mode's pattern-action pairs are shown in figure style=ref. This mode makes use of a special member of the lexical analyser object: _the accumulator_. This object allows accumulation of text from incoming patterns and flushes them when desired. When this mode is entered, the accumulator is emptied (see on_entry). When this mode is left, the accumulator is flushed (see on_exit). This means, that if the accumulated text is not empty, it is sent as a token with a user defined token-id. The STRING_READER mode sends the accumulated text with a token-id TKN_STRING.

The arrival of a string delimiter causes a mode transition back to the mode PROGRAM. A backslashed string delimiter allows a string delimiter inside the string so that it does not cause a return to PROGRAM mode. Instead, it adds a quote to the accumulator. The default action . specifies that any other incoming text is simply accumulated. Even in this simple example it becomes clear that the specification of entry and exit event handlers facilitates the specification of mode transitions. It can be said that wherever in the code a command triggers an exit to another mode, the accumulator is flushed.

Example: The STRING_READER mode.
mode STRING_READER :
     END_OF_FILE
<exit:  PROGRAM>
<entry: PROGRAM>
{
    on_entry {
        self.accumulator.clear();
    }

    on_exit {
        self.accumulator.flush(TKN_STRING);
    }

    {P_BACKSLASHED_STRING_DELIMITER} {
        self.accumulator.add('\"');
    }

    {P_STRING_DELIMITER} {
        /*.........................................*/ self << PROGRAM;
        self.send(TKN_EVENT_MODE_CHANGE);
        return;
    }

    . {
        self.accumulator.add(Lexeme);
    }
}

This section now finished the specification of file simple.qx as an input file for the queχ lexical analyser generator. The next section tells how to call queχ to generate a full fledged engine that does the lexical analysis as required.

Skippers

In order to skip portions of the input without considering its content, queχ provides the ability to define so called skippers. Skippers can be used, for example, to skip whitespaces and comments. Of course, most of the skippers can also be defined as patterns that have no action related to it. However, there are great advantages to using skippers rather than their implementation as a skipping pattern.

First of all, skippers offer faster performance since they only compare against the closing delimiter while swallowing the input file. Second, they may reduce the required buffer size. Normally, the buffer size has to be in the range of the size of the largest possible token. This would include skipping patterns, such as patterns eating the / - / type of comments. With skippers the buffer size could be left at 64K, even though, comments may be much larger. Finally, the nested skipper type, allowing nested comments, for example, cannot be implemented as a regular expression.

Calling Queχ

With the files from directory \$(QUEX_PATH)/DEMO/001/ the creation of the example lexical analyser application is as simple as can be. Type make in the command line and the application is built. The Makefile that comes with the sample application is written in a way that makes it very easy to adapt and extend its contents for other user's needs. The core of this make file explains how to call queχ:

# (*) definitions
MODE_FILES  = ./in/simple.qx
#
ENGINE_NAME = tiny_lexer
#
ENGINE_SOURCES = $(ENGINE_NAME)             \
                 $(ENGINE_NAME).cpp         \
                 $(ENGINE_NAME)-internal.h  \
                 $(ENGINE_NAME)-token_ids   \
                 $(ENGINE_NAME)-core-engine.cpp
...

# (*) build rule
$(ENGINE_SOURCES): $(PATTERN_FILE) $(MODE_FILES) $(TOKEN_DB)
        quex --mode-files $(MODE_FILES)      \
             --engine     $(ENGINE_NAME)

The build rule says that whenever one of the files containing mode descriptions changes the source code for the lexical analyser has to be rebuild. In our example the only mode file is simple.qx. Queχ receives the names of the input file names as no-minus followers of the command line option —mode-files.

The output of queχ is determined by the option —engine. The name that follows it specifies the name of the class that implements the engine. But, it also acts as a filestem for all output files. In our example, the engine's name is tiny_lexer. Therefore, there will be four files created:

tiny_lexer

containing the header file that contains the class definition of class queχ::tiny_lexer. This is _the only output file that is of direct interest to the user_. It has to be included in each file that interacts with the produced lexical analyser.

tiny_lexer.cpp

containing the created lexical analyser engine. The user does not need to touch it, but it has to be compiled sometime and linked to the application. The example Makefile does this already. The user is not directly concerned with this file.

tiny_lexer-token-ids

containing definitions of token-ids. That means it provides variables with the names of token-ids that carry unique numerical values. It also defines the member function string token::map_id_to_name(const id_type TokenID) which maps a token-id to a token-id name. This becomes practical in many stages of the development. However, this file is included inside the header file for the lexer class definition and the user does not worry about this file at all.

tiny-lexer-core-engine.cpp

is a generated C++ code file containing the generated lexical analyzer engines for all modes.

Once the engine sources are built by queχ, an example application can be built that uses the lexical analyser. The source code for the example is specified in file lexer.cpp. The content of this file is shown in figure [fig-sample-application].

Example: A sample application using the generated lexical analyser` tiny_lexer`.
#include<fstream>
#include<iostream>

// (*) include lexical analyser header
#include <./tiny_lexer>

using namespace std;

int
main(int argc, char** argv)
{
    // (*) create token
    quex::token        Token;
    // (*) create the lexical analyser
    //     if no command line argument is specified user file 'example.txt'
    quex::tiny_lexer*  qlex = new quex::tiny_lexer("example.txt");

    // (*) print the version
    cout << qlex->version() << endl;

    cout << ",------------------------------------------------------\n";
    cout << "| [START]\n";

    int number_of_tokens = 0;
    // (*) loop until the 'termination' token arrives
    do {
        // (*) get next token from the token stream
        qlex->get_token(&Token);

        // (*) print out token information
        //     -- line number and column number
        cout << "(" << qlex->line_number()   << ", ";
        cout <<        qlex->column_number() << ")  \t";
        cout << Token.type_id_name() << endl;

        ++number_of_tokens;

        // (*) check against 'termination'
    } while( Token.type_id() != quex::TKN_TERMINATION );

    cout << "| [END] number of token = " << number_of_tokens << "\n";
    cout << ",------------------------------------------------------\n";

    return 0;
}

Initially, the lexical analyser needs to be created as an object of type` queχ::tiny_lexer`. The filename of the file to be analysed is directly passed to the constructor of the class. Then, set the initial lexical analysis mode. In our example, PROGRAM shall be the initial mode. Since it shall not trigger any mode transitions or protection algorithms, it is set using the function set_mode_brutally().

After printing out the version information, the loop over the token stream starts. In the sample application it does not more than getting the a token using get_token() and printing it out. In a more realistic application this token is to be passed to a parser that does its syntactic analysis on it. When a token arrives with an identifier equal to` queχ::TKN_TERMINATION`, then the loop exits and the program terminates.

At this point in time everything is ready for compilation. Using the Makefile that came with the example, typing make shall produce a ready-to-run application called lexer in the current directory. If one types now

     > ./lexer example.txt

on the command line essentially the following output is produced:

tiny_lexer: Version 0.0.0-pre-release. Date Fri Sep 7 7:51:20 2007
Generated by Quex 0.15.3
,-----------------------------------------------------------------
| [START]
(0, 0)          STRUCT
(0, 7)          IDENTIFIER
(0, 15)         CURLY_BRACKET_O
(1, 2)          IDENTIFIER
(1, 9)          IDENTIFIER
(1, 10)         SEMICOLON
(2, 2)          IDENTIFIER
(2, 9)          IDENTIFIER
(2, 10)         SEMICOLON
(3, 0)          CURLY_BRACKET_C
(3, 1)          SEMICOLON
(5, 0)          IF
(5, 3)          IDENTIFIER
(5, 12)         CURLY_BRACKET_O
(6, 2)          IDENTIFIER
(6, 7)          IDENTIFIER
(6, 14)         NUMBER
(6, 34)         STRING
(6, 35)         SEMICOLON
(7, 0)          CURLY_BRACKET_C
(7, 1)          <TERMINATION>
| [END] number of token = 21
`-----------------------------------------------------------------

Summary

This section has shown how to build a lexical analyser with queχ and provided a ready-to-run sample application. It was discussed how to specify patterns in a pattern file, how to announce names for token-identifiers and how to code mode-files that contain information about modes and pattern-action pairs. After an explanation of how to invoke queχ, a sample application was discussed that uses the created lexical analyser engine. The Makefile that comes with the example can be easily adapted to particular applications and reduces the required effort to the typing of make on the command line. In general, the example files simple.qx and lexer.cpp build a good basis for a new project. In the following chapters various features and options of queχ are discussed. Additionally some discussion will be made about the generated lexical analyser. The current chapter provided enough information to start coding a lexical analyzer. Following chapters become interesting once you have tasted the ease of queχ.

Formal Usage

The last chapter provided a functioning example for the usage of queχ. At this point the user should have a clear idea about the basic concepts and how they are applied. This chapter discusses the details of usage and minor features that may not have been mentioned in the previous chapters. The first sections discuss the specification of patterns, mode characteristics, pattern-action pairs, and mode transitions. Section [sec-formal-generated-class] discusses the lexical analyser class that queχ creates with all members and member functions that the user might want to use.

In some cases, the provided functionality of the generated lexical analyser class might not be sufficient, so section [sec-formal-derivation] discusses the formalities to implement a derived class and its usage. The generated code contains some macro switches that allow to control the behavior of the analyser at compile time. Section [sec-formal-macros] discusses the usage of those macros. Finally, section [sec-formal-command-line-options] explains all command line arguments that can be passed to queχ.

Regular Expression Syntax

Section [sec-practical-patterns] already discussed the format of the pattern file. In this section, it is described how to specify patterns by means of regular expressions. The regular expression syntax follows the scheme of what is usually called traditional UNIX syntax \cite{}, includes elements of extended POSIX regular expressions \cite{IEEE POSIX Standard 1003.2} and POSIX bracket expressions \cite{}. This facilitates the migration from and to other lexical analyzer generators and test environments. Additionally, queχ provides support for Unicode Propperties. A compliance to Unicode Regular Expressions \cite{Unicode 5.0 Technical Report Standard #18} is currently not targeted, though, because this expressive power is usually not required for compiler generation.

Nevertheless, queχ provides features that, for example, flex does not. If it is intended to maintain compatibility of regular expressions with flex, then please refer to the flex manual \cite{}, section Patterns and do not use queχ-specific constructs. This section discusses pure queχ syntax. The explanation is divided into the consideration of context-free expressions and context-dependent expressions.

Context Free Regular Expressions

Context free regular expressions match against an input independent on what come before or after it. For example the regular expression for will match against the letters f, o, and r independent if there was a whitespace or whatsoever before it or after it. This is the usual way to define patterns. More sophisticated techniques are explained in the subsequent section. This sections explains how to define simple chains of characters and operations to combine them into powerful patterns.

Chains of Characters

x

matches the character x. That means, lonestanding characters match simply the character that they represent. This is true, as long as those characters are not operators by which regular expressions describe some fancy mechanisms---see below.

.

matches any character (byte) except newline and EOF. Note, that on systems where newline is coded as $0D,\,0A$ this matches also the $0D$ character whenever a newline occurs (subject to possible change).

[xyz]

a "character class" or "character set"; in this case, the pattern matches either an `x', a `y', or a `z'. The brackets $[$ and $]$ are examples for lonestanding characters that are operators. If they are to be matched quotes or backslashes have to be used as shown below. Character sets are a form of alternative expressions— for one single character. For more sophisticated alternative expressions see the paragraphs below.

[: expression :] matches a set of characters that result from a character set expression expression. Section [formal/patterns/character-set-expressions] discusses this feature in detail. In particual [:alnum:], [:alpha:] and the like are the character sets as defined as POSIX bracket expressions.

[abj-oZ]

a "character class" with a range in it; matches an `a', a `b', any letter from `j' through `o', or a `Z'. The minus - determines the range specification. Its left hand side is the start of the range. Its right hand sinde the end of the range (here j-o means from j to o). Note, that - standards for range from to where the character code of the right hand side needs to be greater than the character code of the left hand side.

[^A-Z\n]

a "negated character class", i.e., any character but those in the class. The ^ character indicates negation at this point. This expression matches any character except an uppercase letter or newline.

"[xyz]\"foo"

the literal string: `$[$xyz$]$"foo'. That is, inside quotes the characters which are used as operators for regular expressions can be applied in their original sense. A $[$ stands for code point 91 (hex. 5B), matches against a $[$ and does not mean open character set. Note, than inside strings one can still use the ANSI-C backslashed characters \n, \t, etc. as well as the Unicode name property \N. However, general Unicode property expression \P that result in character sets are not dealt with inside strings.

\0

a NULL character (ASCII/Unicode code point 0). This is to be used with extreme caution! The NULL character is also used aa buffer limitting character! Same as with the code for begin-of-file and end-of-file, if these codes are used inside regular expressions, their values have to be re-assigned. See section [sec-formal-command-line-options] for specifying a different value for the buffer limit code.

\U11A0FF

the character with hexadecimal value 11A0FF. A maximum of six hexadecimal digits can be specified. Hexadecimal numbers with less than six digits must either be followed by a non-hex-digit, a delimiter such as ", [, or (, or specified with leading zeroes (i.e. use \U00071F, for hexadecimal 71F). The latter choice is probably the best candidate for an established habit. Hexadecimal digits can contain be uppercase or lowercase letters (from A to F).

\X7A27

the character with hexadecimal value 7A27. A maximum of four hexadecimal digits can be specified. The delimiting rules are are ananlogous to the rules for \U.

\x27

the character with hexadecimal value 27. A maximum of two hexadecimal digits can be specified. The delimiting rules are are ananlogous to the rules for \U.

\123

the character with octal value 123, a maximum of three digits less than 8 can follow the backslash. The delimiting rules are ananlogous to the rules for \U.

\a, \b, \f, \n, \r, \t, \r, or \v

the ANSI-C interpretation of the backslashed character.

\P{ Unicode Property Expression }

the set of characters for which the Unicode Property Expression holds. Note, that these expressions cannot be used inside quoted strings.

\N{ UNICODE CHARACTER NAME }

the code of the character with the given Unicode character name. This is a shortcut for \P{Name=UNICODE CHARACTER NAME}. For possible settings of this character see \cite{Unicode 5.0}.

\G{ X }

the code of the character with the given General Category \cite{}. This is a shortcut for \P{General_Category=X}. Note, that these expressions cannot be used inside quoted strings. For possible settings of the General_Category property, see section [sec-formal-unicode-properties].

Any character specified as character code, i.e. using \, \x, \X, or \U are considered to be unicode code points. For applications in english spoken cultures this is identical to the ASCII encoding. For details about unicode [sec-formal-ucs-properties] is dedicated to an introduction to Unicode properties.

Two special rules have to appear isolatedly, out of the context of regular expressions. With the following two rules the actions for the event of end of file and the failure event can be specified:

<<EOF>>

the event of an end-of-file (end of data-stream).

<<FAIL>>

the event of failure, i.e. no single pattern matched. Note, this rule is of the lex style, but is only available with the queχ core engine.

This syntax is more in recognition of the traditional lex syntax. In fact the two event handlers on_failure and on_end_of_stream are a one-to-one correspondance to what is mentioned above. Possibly some later versions will totally dismiss the lex related engine core, and then also these constructs will disappear in favor of the mentioned two event handlers.

Operations

Let R and S be regular expressions, i.e. a chain of characters specified in the way mentioned above, or a regular expression as a result from the operations below. Much of the syntax is directly based on POSIX extended regular expressions \cite{}.

R*

zero or more occurencies of the regular expression R.

R+

one or more repetition of the regular expression R.

R?

zero or one R. That means, there maybe an R or not.

R{2,5}

anywhere from two to five repetitions of the regular expressions R.

R{2,}

two or more repetitions of the regular expression R.

R{4}

exactly four repetitions of the regular expression R.

(R)

match an R; parentheses are used to group operations, i.e. to override precedence, in the same way as the brackets in $(a\, +\, b)\,\cdot\,c$ override the precedence of multiplication over addition.

RS

the regular expression R followed by the regular expression S. This is usually called a concatenation or a sequence.

R|S

either an R or an S, i.e. R and S. This is usually called an alternative.

${NAME}

the expansion of the defined pattern "NAME". Pattern names can be defined in define sections (see section [sec-practical-patterns]).

Character Set Expressions

Character set expression are a tool to combine, filter and or select character ranges conviniently. The result of a character set expression is a set of characters. Such a set of characters can then be used to express that any of them can occur at a given position of the input stream. The character set expression [:alpha:], for example matches all characters that are letters, i.e. anything from a to z and A to Z. It belongs to the POSIX bracket expressions which are explained below. Further, this section explains how sets can be generated from other sets via the operations union, intersection, difference, and inverse.

POSIX bracket expressions are basically shortcuts for some more regular expressions that would formally look a bit more clumpsy. Queχ provides those expressions bracketed in [: and :] brackets. They are specified in the table below.

Expression Meaning Related Regular Expression
[:alnum:] Alphanumeric characters [A-Za-z0-9]
[:alpha:] Alphabetic characters [A-Za-z]
[:blank:] Space and tab [ \t]
[:cntrl:] Control characters [\x00-\x1F\x7F]
[:digit:] Digits [0-9]
[:graph:] Visible characters [\x21-\x7E]
[:lower:] Lowercase letters [a-z]
[:print:] Visible characters and spaces [\x20-\x7E]
[:punct:] Punctuation characters [!"#$%&'()*+,-./:;?@[\\\]_`{|}~]
[:space:] Whitespace characters [ \t\r\n\v\f]
[:upper:] Uppercase letters [A-Z]
[:xdigit:] Hexadecimal digits [A-Fa-f0-9]

Caution has to be taken if these expressions are used for non-english languages. They are solely concerned with the ASCII character set. For more sophisticated property processing it is advisable to use Unicode property expressions as explained in section [formal/ucs-properties]. In particular,

The use of Unicode character set potentially implies the handling of many different properties and character sets. For convinience, queχ provides operations on character sets to combine and filter different character sets and create new adapted ones. The basic operations that queχ allows are displayed in the following table:

Syntax Meaning Example
union(A0, A1, …) Union of the list of sets. union([a-z], [A-Z]) = [a-zA-Z]
intersection(A0, A1, …) Intersection the list of sets. intersection([0-9], [4-5]) = [4-5]
difference(A, B0, B1, …) Subtracts B0, B1, from set A. difference([0-9], [4-5]) [0-36-9]
inverse(A0, A1, …) Inverse of the union of the list of sets. inverse([\x40-\5A]) = [\x00-\x40\x5B-\U12FFFF]

Note, that the difference and intersection operation can be used convienently to filter different sets. For example

  [: difference(\P{Script=Greek}, \P{Digit}) :]

results in the set of greek characters except the digits. To allow only the numbers from the arabic code block intersection can be used as follows:

  [: intersection(\P{Block=Arabic}, \P{Numeric}) :]

The subsequent section elaborates on the concept of Unicode properties.

Unicode Character Properties

Unicode is distinguished from other coding standards not only with respect to its level of completeness including all possibly used character sets. Moreover, much effort has been accomplished in categorizing characters, providing terminology for concepts related to letter systems and defining character properties. Unicode defines a fixed set of properties that a character can have. The most important property for a lexical analyzer is the code point, i.e. the integer value that represents a character. However, there are other interesting properties that simplify the description of regular expressions.

Queχ supports Unicode Standard Properties through the \P{..} expressions, where the P stands for property. For a queχ user properties can be devided into two categories:

Note, that the result of a \P{...} expression is always a set of characters. Therefore, it cannot be used inside a quotes string expression. For convinience, the properties Name and General_Category are provided through the shortcuts \N{…} and \G{…}. Thus, \N{MIDDLE DOT} is a shorthand for \PUppercase_Letter.

Unicode 5.0 provides more than 17000 character names, so please consult the standard literature for settings of the Name property
[Alternatively, consider the file UnicodeData.txt that comes with the queχ application]
. Note also, that names as defined in Unicode 1.0 can be accessed through the Unicode_1_Name property. This property also contains names of control functions according to ISO 6429 \cite{}.

As for the General_Category property, the appendix provides the list of possible settings [sec-appendix-property-general-category]. At this place, more detailed information is specified about provided properties, their meaning, and settings of other likely-to-be-used properties.

When starting to write lexical analyzers using a wider range of unicode characters the reliance on properties becomes almost unavoidable. However, the huge volume of characters requires some sophisticated tool to browse through properties and related character sets. For this purpose queχ provides a query mode. This mode allows the user to specify some queries on the command line and get the response immediately on the console. This subject is handled in section [sec-query-intro].

As an option to facilitate the specification of property values, queχ allows you to use wildcards *, ? and simple character sets such as as [AEIOUX-Z] for the characters A, E, I, O, U, X, Y, and Z. If the first character is an ! then the inverse character set is considered. This is conform with unix file name matching. set specifications such as [a-z] such to facilitate the search. It is advisable, though, to use queχ's query functionality first [sec-query-intro] in order to get an impression to what value such a wild-card expression expands.

Pitfalls

The most dangerous pitfall is related to precedence and length. Note, that a pattern that is defined {it before} another pattern has a higher precedence. Also, if a pattern can match a longer chain of characters it wins. Thus, if there are for example two patterns

     [A-Z]+     => TKN_IDENTIFIER(Lexeme);
     "PRINT"    => TKN_KEYWORD_PRINT;

then the keyword PRINT will never be matched. This is so, because [A-Z] matches also the character chain PRINT and has a higher precedence, because it is defined first. To illustrate the danger of greedy matching, i.e. the fact that length matters, let two patterns be defined as:

     "Else"             => TKN_KEYWORD_ELSE;
     "Else\tAugenstein" => TKN_SWABIAN_LADY(Lexeme);

Now, the Else statement may be matched, but only if it is not followed by tabulator and Augenstein. On the first glance, this case does not seem to be very probable. Sometimes it may be necessary, though, to define delimiters to avoid such confusion. In the very large majority of cases greedy matching is a convienient blessing. Imagine the problem with identifiers, i.e. any chain of alphabetic characters, and a keyword for. If there was no greedy matching (longest match), then any variable starting with for could not propperly be detected, since the first three letters would result in the for-keyword token.

Another pitfall is related to character codes that the lexical analyser uses to indicate begin-of-file, end-of-file, or buffer-limit. The values for those codes are chosen to be out of the range for sound regular expressions parsing human written text (0x0 for buffer-limit, 0x1A for end-of-file, and 0x19 for begin-of-file). If it is intended to parse binary files, and these values are supposes to occur in patterns, then these codes need to be changed. Section [sec-formal-command-line-options] mentions how to specify their codes on the command line.

Pre- and Post-Conditions

Additionally to the specification of the pattern to be matched queχ allows to define conditions on the boundary of the pattern. This happens through pre- and post-conditions. First, the trivial pre- and post-conditions for begin of line and end of line are discussed. Then it is shown how to specify whole regular expressions to express conditions on the surroundings of the pattern to be matched. The traditional characters to condition begin and end of line are:

^R

an regular expression R, but only at the beginning of a line. This condition holds whenever the scan starts at the beginning of the character stream or right after a newline character. This shortcut scans only for a single newline character \verb|\| (hexadecimal $0A$) backwards, independent on how the particular operating system codes the newline. In this case, there is no harm coming from different conventions of newline.

R$

a regular expression R, but only at the end of a line and not at the end of the file. Note, that the meaning of this shortcut can be adapted according to the target operating system. Some operating systems, such as DOS and Windows, code a newline as a sequence \r\n (hexadecimal 0D, 0A), i.e. as two characters. If you want to use this feature on those systems, you need to specify the —DOS option on the command line (or in your makefile). Otherwise, $ will scan only for the newline character (hexadecimal 0A). Note, that for the trivial end-of-line post condition the newline coding convention is essential. If newline is coded as 0D, 0A then the first 0D would discard a pattern that was supposed to be followed by 0A only. Note, that if the $ shall trigger at the end of the file, it might be advantageous to add a newline at the end of the file by default.

For more sophisticated case real regular expressions can be defined to handle pre- and post-conditions. Note, that pre- and post-conditions can only appear at the front and rear of the core pattern. Let R be the core regular expression, Q the regular expression of the pre-condition, and S the regular expression for the post-condition.

R/S

matches an R, but only if it is followed by an S. If the pattern matches the input is set to the place where R. The part that is matched by S is available for the next pattern to be matched. R is post-conditioned. Note, that the case where the end of R matches the beginning of S cannot be treated by Version 0.9.0\footnote{The reason for this lies in the nature of state machines. Flex has the exact same problem. To avoid this some type of step back from the end of the post-condition must be implemented.}.

Q/R/

matches R from the current position, but only if it is preceeded by a Q. Practically, this means that queχ goes backwards in order to determine if the pre-condition Q has matched and then forward to see if R has matched. R is pre-conditioned. Note, with pre-conditions there is no trailing context problem as with post-conditions above.

Q/R/S

matches R from the current position, but only if the preceeding stream matches a Q and the following stream matches an S. R is pre- and post-conditioned.

Pitfalls

A somehow subtle pitfall is related to the begin-of-line pre-condition. When using the ^ sign at the beginning of a line it is tempting to sing "don't worry, be happy … this matches at any begin of line"—well, it does not! The important thing to understand is that it matches when the lexical analysis step starts at the beginning of a line. The alarm signal should ring if begin-of-line is triggered and a whitespace pattern is defined that includes newline. Consider the following patterns being defined:

define {
   WHITESPACE    [ \t\n]+
       ....
   GREETING       ^[ \t]*hello
}

mode x_y_z : {
    {WHITESPACE}  => TKN_WHITESPACE;
    {GREETING}    => TKN_GREETING(Lexeme);
}

Where the hello greeting is to be matched after newline and some possible whitespace. Now, given the character stream:

   ...
   something
   ...
   hello

will not send a GREETING token. This is because the whitespace pattern eats the newline before the hello-line and even the whitespace before hello. Now, the next analysis step starts right before hello and this is not the beginning of a line. Splitting the whitespace eater into newline and non-newline helps:

define {
   WHITESPACE_1  [ \t]+
   WHITESPACE_2  [\n]+
   GREETING      ^[ \t]*hello
}

mode x_y_z : {
    {WHITESPACE_1}  => TKN_WHITESPACE;
    {WHITESPACE_2}  => TKN_WHITESPACE;
    {GREETING}      => TKN_GREETING(Lexeme);
}

Now, the first whitespace eating ends right before newline, then the second whitespace eating ends after newline, and the hello-greeting at the beginning of the line can be matched.

Another pitfall is, again, related to precedence. Make sure that if there are two patterns with the same core pattern R, then the pre- or post-conditioned patterns must be defined before the un-conditioned pattern. Otherwise, the pre- or post-conditioned patterns may never match. Recall section \ref{formal/patterns/context-free-pitfalls} for a detailed discussion on precedence pitfalls.

Specification of Starting Mode with start

In most practical applications it is not a good idea to start in a void initial mode, so queχ requires an initial mode explicitly to be specified. This happens using the start keyword, i.e. one has to type somewhere in between the mode definitions (but best at the very beginning):

  start = PROGRAM

Let the mode-stubs be stored in a file called simple.qx. The previous sections defined patterns, token-ids, and modes. Together with these definitions one can now start with the specification of pattern-action pairs as discussed in the following section.

Mode Characteristics

Queχ allows to specify characteristics of modes concerning transitions and inheritance. The keyword mode starts the definition of a lexical analyser mode. Then it can be directly followed by a curly bracket that starts the pattern-action pair definitions as shown in figure style=refa. This is the most simple way to define a mode.

a)

   mode MY_MODE {
     ...
     // pattern-action pairs of MY_MODE
     ...
   }

b)

   mode MY_MODE :
      END_OF_FILE
      WHITESPACE_EATER
      DOCUMENTATION_TRIGGER
      {
     ...
     // pattern-action pairs of MY_MODE
     ...
   }

c)

   mode MY_MODE :
      END_OF_FILE
      <inheritable: yes>
      <exit:        OTHER_MODE>
      <restrict:    exit>
      {
     ...
     // pattern-action pairs of MY_MODE
     ...
   }

In order to define options or base modes the name of the mode has to be followed by a colon as in the examples in figure <<:formal-mode-definition, style=ref>>b and \ref{fig-formal-mode-definition, style=ref>>c. Note, that if a mode specifies base modes and options, the options are specified after the base modes.

Base modes directly follow the : sign. The list of base modes is whitespace separated. When the first $<$ sign arrives, queχ starts reading the options. Options are bracketed by $<$ and $>$ signs and multiple options are also listed whitespace separatedly. The following options can be specified:

<inheritable: [yes, no, only]>

restricts the inheritance relationships. If yes is specified, it is allowed for any other mode to be derived from this mode. If no is specified and another mode tries to derive from this mode, an error will be printed. Specifying only tells that this mode cannot be a lexical analyser mode. Its reason for existance is to support derived modes. This is similar to the concept of abstract classes in C++ or interfaces in Java.

<entry: mode-name>

tells that it is allowed that the lexical analyser enters this mode from mode mode-name or one of its derived modes. If no entry mode is specified, then all modes are allowed.

<exit: mode-name>

tells that it is allowed that the lexical analyser exits from this mode to mode mode-name or one of its derived modes. If no exit mode is specified, then all modes are allowed.

<restrict: [ entry, exit ]>

turns of the admissibility of derived modes for entry or exit. If entry is specified the entry is restricted, so any other mode than the the listed entry modes is forbidden even if it is derived from one of them. Respectively, specifying exit restricts the list of exit modes.

Options are optional, i.e. they do not need to be specified. But, they facilitate the debugging and add confidence that the mode transitions are safe and sound. The list of options is then to be followed by an opening curly bracket { that opens the list of pattern-action pairs. These are discussed in the following section.

Pattern-Action Pairs

Queχ only allows to specify pattern-action pairs inside modes. As the name says, a pattern-action pair consists of a pattern that is to be matched and a code fragment that is executed when the pattern matches. Patterns are best specified inside the pattern definition sections, so that inside the mode sections only names of patterns occur. However, one can also use regular expressions inside the mode definitions. The format for pattern-action pairs is as simple as can be:

  1. The pattern is specified as a regular expression or a pattern name embraced by curly brackets.

  2. Then comes a whitespace that separates the pattern from the action.

  3. Then there must be a curly bracket followed by whitespace that delimits the start of the code fragment of the action. Queχ parses then until the matching closing curly bracket that terminates the action.

Examples for pattern-action pairs have been given enough in section [sec-practical-pattern-action-pairs], page There are a few important variables to know for coding the action for a pattern match:

Pattern Action Shortcuts

In practical applications for programming languages, the large majority of patterns to be identified are simple and do not require large algorithms to handle them. Keywords, such as if, then, and foreach, operators, such as $+$ and * and delimiters such as ;, ( and )`' simply need to send a token telling that such and such appeared. Their pattern action pairs all look similar to the following:

  ...
  {P_KEYWORD_IF} {
       self.send(TKN_KEYWORD_IF);
       RETURN;
  }
  {P_KEYWORD_THEN} {
       self.send(TKN_KEYWORD_THEN);
       RETURN;
  }
  {P_OPERATOR_PLUS} {
       self.send(TKN_OP_PLUS);
       RETURN;
  }
  ...

Queχ allows to write such simple token sendings in a more elegant manner. If a pattern is followed by a operator and a name xyz then this is translated into pattern sends token with name xyz. The code as shown above will be internally generated. The above code segment is equivalent to the following:

  ...
  {P_KEYWORD_IF}    => TKN_KEYWORD_IF;
  {P_KEYWORD_THEN}  => TKN_KEYWORD_THEN;
  {P_OPERATOR_PLUS} => TKN_OP_PLUS;
  ...

For simple patterns that consist of a fixed string and that appear only in one place it does not make sense to define a pattern explicitly. They can also be given directly, i.e. one can write

  ...
  "if"    => TKN_KEYWORD_IF;
  "then"  => TKN_KEYWORD_THEN;
  "+"     => TKN_OP_PLUS;
  ...

thus avoiding long lists of pattern definitions external to the mode definition. Sometimes, though, things are little more complicated. It might be necessary to extract some of the information in the lexeme and store it into the token as in the following pattern action pairs:

  ...
  {P_NUMBER} {
    self.send(TKN_NUMBER, atoi(Lexeme));
    RETURN;
  }
  {P_IDENTIFIER} {
    self.send(TKN_IDENTIFIER, Lexeme);
    RETURN;
  }
  ...

In these cases, one passes arguments to the send function other then the token-id. In order to specify additional arguments to the send functions the token-ids have to be followed by a pair of brackets containing the list of arguments (possibly separated by comma). This way, the above two pattern action pairs can be rewritten elegantly as:

  ...
  {P_NUMBER}     => TKN_NUMBER(atoi(Lexeme));
  {P_IDENTIFIER} => TKN_IDENTIFIER(Lexeme);
  ...

These abbreviations not only speed up the generation of a lexical analyser tremendously, but they make the lexical analyser descriptions also very readable. Long lists of pattern-action pairs can be understood and analysed in a few seconds.

Note, that if an identifier is used that is not declared in a token-section, then queχ implicitly declares this token, but still a warning is reported. In order to get things setup quickly, it is possible to complete renounce on token sections. However, for a good style the output of these warnings might be pasted into one of the token-sections.

Priorities

For a given character stream it might be possible that more than one pattern matches. Therefore, rules are needed that describe the resolution of those conflicts. This section explains the priorities that queχ assigns to a given set of patterns based on length, position, inheritance relationships, and a possible priority-mark. Imagine, two patterns:

      "print"    { ... /* print keyword */ }
      "printer"  { ... /* device name   */ }

both patterns "print" and "printer" match the first five characters of a character stream print…. The first pattern could match, but what if the stream continuous with …er? It would never be possible to match an incoming printer because the first pattern eats the first five characters and the remaining er is lost in space. Thus, the rule, followed by most lexical analysers: The pattern with the longest match proceeds! Still, there might be more than one pattern that matches the same number of characters, e.g. the two patterns

      "print" { ... /* print keyword */ }
      [a-z]+  { ... /* identifier    */ }

match in a character stream print exactly five characters. How can be decided wether to vote for a keyword or an identifier? The rule here is: first come, first serve_ — patterns that are mentioned first in the code win!. If a pattern in a base mode and a pattern in a derived mode match with the same number of characters, than the base mode' pattern proceeds!. This follows the philosophie that the base mode imposes behavior on the derived modes. Figure style=ref shows how a pattern is matched in case of multiple patterns matching the current character input stream.

figures/pattern-priorities.png
Figure: Dispatch of pattern actions in case of concurrent matches.

With respect to pre- and post-conditioned patterns, the rules remain the same. However, some things need to be clarified. For a pre-conditioned pattern Q/R/ the string matching the pre-condition Q does not contribute to the length of the pattern match. For a post-conditioned pattern R/S the matching characters of the post-condition S do contribute to the length of the pattern match. Note, that the last rule is, on the first glance, counter-intuitive to the fact that the Lexeme variable in a post-conditioned pattern match action contains only the length of the core pattern and not the length of the post-condition. There is a rational, though, behind this distinction:

For cases of real urgency, a keyword allows to struck the ruleset of pattern-action dispatching: PRIORITY-MARK. A pattern followed by this keyword is lifted into the current mode, thus having the priority according to the position in the current mode not of the base mode. This requires, of course, for the pattern to be defined before. For example:

mode my_base {
  ...
  [a-z]+ { ... /* identifier */ }
  ...
}
mode my_derived :
   my_base
{
  ...
  {"print"} { ... /* keyword */ }
  ...
}

When the lexical analyser is in the my_derived mode, then print is always recognized as an identifier and never as keyword. However, if the PRIORITY-MARK is set as in the following code fragment,

mode my_base {
  ...
  [a-z]+    { ... /* identifier */ }
  ...
}
mode my_derived :
  my_base
{
  ...
  {"print"} { ... /* keyword */ }
  [a-z]+    PRIORITY-MARK
  ...
}

then the [a-z]+ pattern has a priority of a pattern in the mode my_derived _after_ the pattern "print". The action related to the [a-z]+ pattern, though, remains. An incoming print character stream is now always recognized as keyword. It cannot be overemphasized, that using priority marks allow derived modes to act against the concepts of the base modes. Thus a mode B may be derived from mode A, i.e. is-a mode A, but it behaves different! Priority marks are indecent and a sign of a bad design!

Priority marks can be avoided by splitting the base mode A into two modes A0 and A1, one containing desired patterns and the undesired patterns. Figure style=ref shows this idea. The original mode can be achieved by derivation from A0 and A1. The mode B can derive from the base mode of desired patterns. This is the clean way to avoid that undesired base class patterns have to high priority —use PRIORITY-MARK in case of laziness.

figures/resolve-priority-mark.png
Figure: Avoiding PRIORITY-MARK through re-design.

The Match Event

Queχ provides a special event is executed at the moment any match is executed. This allows the user to do some things—for example for debugging purposes—at any pattern match. The event handler name to do this is on_match. The code fragment that follows this name is considered to be the action which is executed before any match action is executed. Every mode has its own on_match event handler. However, these event handlers can be inherited in the same way as on_exit and on_entry are inherited.

The event handler receives the two arguments Lexeme, the pointer to the first character of the lexeme and LexemeL, i.e. the length of the lexeme that was matched. The following shows an example, where the user wants to do some statistics on the number of matches, the number of backslashes and the total length that was matched in the mode CORE.

mode CORE :
...
{
   ...
   on_match {
       match_count        += 1;
       backslashe_n       += __count_backslashes_in_string(Lexeme);
       total_match_length += LexemeL;
   }
   ...
}

Mode Transition Events

There are exactly two events that are related to mode transitions: entering and exiting. Accordingly, one can define inside a mode the two handlers` on_entry` and on_exit. In addition to the self reference, the following variables are available:

const queχ_mode& FromMode

Only availabe in on_entry where this is the mode _from_ which the current mode is entered.

const queχ_mode& ToMode

Only availabe in on_exit where this is the mode _to_ which the lexical analyser transits.

The on_exit function of the mode that is left is always called before the on_entry function of the mode that is entered. Note, that it does not make sense inside transitions to use the RETURN statement since at this point the token-queue is not to be synchronized with a user's token request. Examples for enter and exit handlers have been given in section [sec-practical-pattern-action-pairs].

Caveat

Note, that initiating explicitly mode transition inside on_exit {t will cause} an infinite recursion! If this is intended the mode transition mechanism should be circumvented using the member function` set_mode_\-brutally(). Note also, that initiating an explicit mode transition inside `on_entry _can cause_ an infinit recursion, if one initiates a transit into the entered mode. Certainly, such a mode transition does not make sense. In general, mode transitions inside on_entry or on_exit event handlers are best avoided. Consider the code fragment

mode MY_MODE_A {
...
   {P_SOMETHING} {
        self << MY_MODE_B;
        self.send(TKN_EVENT_MODE_TRANSITION);
        return;
   }
...
}

meaning that one desires to enter MY_MODE_B if pattern` P_SOMETHING` is detected. Imagine, now, because of some twisted mode transitions in the transition event handlers one ends up in a mode` MY_MODE_K`! Not to mention the chain of mode transition event handler calls - such a design makes it hard to conclude from written code to the functionality that it implements. To repeat it: _explicit mode transitions inside on_entry and on_exit are best avoided!_

One might recall how uncomfortable one would feel if one mounts a train in Munich, leading to Cologne, and because of some redirections the trains ends up in Warsaw or in Moscow. In a same way that train companies should better not do this to theirs customers, a programmer should not do to himself mode transitions inside on_entry and on_exit.

Another issue has to do with optimization. Queχ is aware of transition handlers being implemented or not. If no transition handlers are implemented at all then no event handling code is executed. Note, that each entry and exit transition handler requires a dereferecing and a function call—even if the function itself is empty. For fine tuning of speed it is advisable to use only entry handlers or only exit handlers. The gain in computation time can be computed simply as: probability of mode swith times time for dereferencing and empty function call over two. For reasonable languages (probability of mode change < 1 / 25 characters) consider the speed gain in a range of less than 2%. The derefencing, though, can come costy if the mode change is seldom enough that is causes a cache-miss. Thus, for compilers that compile large fragments of code, this optimization should be considered.

Indentation Events

Additionally to the on_entry and on_exit event there is an event for the first non-whitespace character that appears in a line: on_indentation. In order to trigger this event a dedicated process keeps track of newlines and whitespace that follows it. This process happens virtually in paralell to pattern matching. It knows to states:

At the beginning of lexical analysis, or at the beginning of a line the lexical analyser is in state 1. When the first non-whitespace character appears in a line it enters state 2. The transition from state 1 to state 2, i.e. the appearance of the first non-whitespace character in a line triggers the indentation event and the correspondent event handler is called. The transition from state 2 to state 1 and state 1 to state 2 can also appear inside one single matched pattern
[Imagine, for example a pattern [ \n]
+$$ that matches newlines and whitespace.].

This event handler is called right before the pattern action is executed that belongs to the matched pattern. Inside the event handler, in addition to the self reference, the following variable is available:

const int Indentation

The distance from the beginning of the line to the first non-whitespace.

Note, that the indentation event can be disabled, but only for single event. Using the member function

  void  disable_next_indentation_event();

disables the indentation event for the next time. However, after the next prevented indentation handling it is enabled again. This comes handy if one needs to have a line-prolonger, such as a backslash in python and many shell script languages, or the underscore in VisualBasic. The following mode action pair would prevent the subsequent line to be considered for indentation handling:

mode SOMETHING {
   ...
   {P_BACKSLASH_FOLLOWED_BY_NEWLINE} {
        self.disable_next_indentation_event();
   }
   ...
}

If the subsequent line, though, does not end with a backslash the event handler is automatically active and indentation handling is in place.

Lexical Analyser Class

Queχ generates a C++ class that implements a lexical analyser as specified by the user. This section discusses the class members and member functions with which the user may interacts. Section [sec-formal-generated-class-ui] discusses the usage of the lexical analyser from outside, i.e. how the end user of the engine is going to create a lexical analyser and how a character stream is transformed into a token stream. The subsequent sections discuss how modes can be handled using member functions, token handling, the_ accumulator_, and some virtual functions to be possibly overriden by a derived class. Any member function can theoretically be used in any context
[This is, of course, only true as long as acces rights of` private` or public do not forbid it.]
.

General User Interface

Before the lexical analyser can be used, one has to create an instance of it. Assuming that the_lexer is the name of the lexical analyser engine (see command line option —engine) the following two constructors are provided:

The first constructor accepts an input stream of any kind and an optional output stream for unmatched characters. The second constructor directly accepts a filename instead of a stream. The stream or the filename act as the root-file for the lexical analysis. The lexical analyser is now ready to run. Tokens are received by calling the member function

It initiates a lexical analyser process and fills the object at` result_p` with the next token. Internally, tokens may be stacked and not every call to get_token() initiates an analysis. But the user does not care. He simply receives a sequence of tokens through this function until a token arrives with the the token-id

 quex::TKN_TERMINATION

Note, that TKN_ represents the token prefix. If the user decides to modify the token prefix to TOK_, for example, the identifier for termination is TOK_TERMINATION. Note, that the default action on failure also sends this identifier, so that no second check at the end of the analyzing loop is required. This id is by default defined as zero, because also many parsers require that. This id tells that the input has been totally treated and no further token will arrive. In order to keep track of different versions of generated lexers the member function

returns a string telling about the version (see also command line option` —version`), and the date when queχ produced the engine. The current line and column numbers can be accessed using the functions

where the *_number() functions return the information where the last pattern started, and the *_index() functions return the information where the last pattern ended.

Mode Handling

From the given set of modes, queχ creates a set of mode-information objects as members of the lexical analyser class. The names of the members are exactly the same names as the names of the modes. In the example of the last section, the following three objects are members of the class tiny_lexer:

        END_OF_FILE_tag    END_OF_FILE;
        STRING_READER_tag  STRING_READER;
        PROGRAM_tag        PROGRAM;

The three classes END_OF_FILE_tag, STRING_READER_tag, and PROGRAM_tag are all derived from class queχ_mode. Thus pointers and references to them can do the job, whenever a pointer to a` queχ_mode` can. The lexical analyser class provides the following functions to access information about the current mode.

The first function returns a reference to the mode-information object that represents the current mode. The second returns a string with the name of the current mode. Those member functions shall be accessed using the self member, i.e. for example

     if( self.mode() != PROGRAM )
         cerr << self.mode_name() << endl;

The third function mode_id() returns a mode-identifier, i.e. an integer value that is unique for the current mode. This identifier was historically used by the flex generated engine under the hood. Since version 0.10.0 queχ produces its own engine and does not rely on flex anymore. However, these ids might still be useful for administration of modes, or for tagging of tokens. The mode-ids can be mapped to mode objects using the member functions:

A direct transition to another mode is initiated through the $<$$<$-operator or the function enter_mode(), i.e.

The operators return void deliberately because mode transitions are not thought to be done in concatination without pattern matching. If so, one can still write multiple mode transitions in a row. In any case of the three above the following three steps are guaranteed:

  1. Call on_exit handler of current mode.

  2. Set the target mode.

  3. Call on_entry handler of target mode.

Additionally to direct mode transitions modes can be pushed and popped similar to subroutine calls (without arguments). This is provided by the functions:

The member function push_mode(new_mode) pushed the current mode on a last-in-first-out stack and set the new_mode as the current mode. A call to pop_mode() pops the last mode from the stack and sets it as the current mode. Note, that the mode transitions with push and pop follow the same mode transition procedure as for entering a mode directly. This means, that the on_exit and on_entry handler of the source and target mode are called. The function pop_drop_mode() pops a mode from the mode-stack, but does not set it as current mode. It is rather dropped and forgotten.

If one wants to avoid the call of exit and enter event handlers, then modes can also set brutally using the member functions:

Using these functions only the current mode is adapted, but no event handlers are called. This also means that mode transition control is turned off. Inadmissible transitions triggered with these functions cannot be detected during run-time.

Mode Objects

Modes themselves are implemented as objects of classes which are derived from the base class queχ_mode. Those mode objects have member functions that provide information about the modes and possible transitions:

    bool  has_base(const quex_mode& Mode,       bool PrintErrorMsgF = false) const;
    bool  has_entry_from(const quex_mode& Mode, bool PrintErrorMsgF = false) const;
    bool  has_exit_to(const quex_mode& Mode,    bool PrintErrorMsgF = false) const;
    const int     ID; \\
    const string  Name; \\

The first three member functions allow to get information about the relation to other modes. If the flag PringErrorMsgF is set than the function will print an error message to the standard error output in case that the condition is not matched. This comes very handy when using these functions in assert`s or during debugging. The functions can be applied on a given mode object or inside the `on_entry and on_exit functions with the this pointer. In a pattern action pair, for example, one might write

     if( PROGRAM.has_base(self.mode()) )
         cerr << "mode not a base of PROGRAM: " << self.mode_name() << endl;

For the end-user these functions are not really relevant, since queχ itself introduces `assert`s on mode transitions and provides convienient member functions in the lexical analyser class to access information about the current mode.

Token Handling

Section [sec-basics-token-queue] elaborated on the idea that the lexical analyser communicates tokens to the user. In the queχ generated lexical analyser the tokens are stored in a token-queue before they are delivered to the caller of get_token(). Inside the pattern-action the send function group allows to send tokens:

        void        send(const token& That);
        void        send(const token::id_type TokenID);
        void        send_n(const int N, const token::id_type TokenID);
        void        send(const token::id_type TokenID, const char* Text);
        void        send(const token::id_type TokenID, const int Number1);

As mentioned before, the author of queχ is in favor for the idea that the lexical analyzer should only find out for what a string in the input stream stands for. It should not interpret it, but leave the interpretation to the user or the parser.

Figure style=ref, displayed a sequence diagram depicting the communication of tokens. Using the send function group the writer of pattern-action pairs does not need to care about the get_token() function call. He sends tokens and leaves the responsibility of delivery to the engine.

The Accumulator

The accumulator is a member that allows stock strings to communicate between pattern-actions. In the practical example in section [sec-practical-intro] the string contained in string delimiter marks was accumulated until the on_exit handler was activated, i.e. the` STRING_READER` mode is left. Inside the handler, the string is flushed into a token with a specific id TKN_STRING. The accumulator provides the following functions:

  void   add(const char*);
  void   add(const char);
  void   flush(const token::id\_type TokenID);
  void   clear();

The add-functions add a string or a character to the accumulated string, the flush() function sends a token with the accumulated string and the specified token-id. Finally, the clear() function clears the accumulated string without sending any token.

Class Body Extensions with body

Instead of deriving a class from the lexical analyser, it is sometimes more convenient to simply paste some more content into the class body. This can be done using the keyword body outside the mode definitions. The following code, for example, pastes the definition of an integer vector into the class' body, so that it can later be used to keep track of indentation columns.

body {
  std::vector<int>    indentation_stack;
}

Note, that the namespace prefix std:: is used. This is because the class definition's body appears inside a header file and in header files one better uses absolute namings, rather than abbreviating using namespaces. The newly defined member can then be used inside any pattern action or event handler via access of the self reference, e.g.

mode SOMETHING {
   ...
   on_indentation {
      ...
      // close any block that has higher indentation
      while( Indentation < self.indentation_stack.back() ) {
          // send token that indicates: block-end
          self.send(TOKEN_END_OF_BLOCK);
          // cut indentation block from list
          self.indentation_stack.pop_back();
      }

   }
   ...
}

Constructor Extensions with init

The last section discussed how additional contents can be pasted into the defintion of the class' body that represents the lexical analyser. Naturally, those new contents need to be initialized. This happens inside the constructor of the generated lexical analyser. Queχ allows to specify additional content to be executed inside the constructor using the init keyword outside any mode definition. The following code fragment shows a setup as it might occur in a lexical analyser that supports indentation based statement blocks:

body {
    std::vector<int>  indentation_stack;
    bool              allow_opening_indentation_f;
}

init {
    // first indentation at column = 0
    self.indentation_stack.push_back(0);
    // default: do not allow to open a sub-block.
    // only function definitions, if statements, and for loops
    // shoul allow to open a new indentation block in the next line.
    self.allow_opening_indentation_f = false;
}

The body fragment adds an indentation stack as a member to keep track of line indentations. A second member variable tells if a higher indentation is supposed to appear that opens a statement block. These two variables need to be initialized. Therefore, the init fragment contains an initial block starting at column 0. This indentation block cannot be closed (except by end-of-file) because no column can have a negative indentation. Also, by default open sub-blocks is disallowed, so the correspondent member variable is set to false. This code fragment will then appear in any constructor of the generated lexical analyser class.

Including Files/Switching Buffers

Many programming languages provide an include feature. This means that they allow the user to get some input from another file and then returning to the file itself. If in C one specifies stdio.h to be included, then the content of this file will be treated before one continues parsing the current file. Such features are popular and provide a lot of convenience to the user. In a more general sense, what happens is that the input stream changes and needs to return after the alternative input has finished. This sounds like a case for the memento pattern \cite{} where the state of the lexical analyzer needs to be stored and restored---and it is.

The decision of what information about the lexical analyzer is to be stored during a input channel change, i.e. an include is not left to the user. The generated analyzer class provides the user with handy functions that do the whole job. Those are the following:

        template <class InputHandle>
        void   include_stack_push(InputHandle*, const int MODE_ID = -1,
                              const char* IConvInputCodingName = 0x0);
        template <class InputHandle>
        void   include_stack_push(InputHandle*, const quex_mode&,
                              const char* IConvInputCodingName = 0x0);

        bool   include_stack_pop();

The functions can be grouped according to two actions: pushing the current state of the lexical analyzer on an include stack and popping the last pushed state of the lexical analyzer from the include stack. The overloaded functions include_stack_push(…) require as a first argument a FILE* or an istream* pointer of the stream which is supposed to be the new input stream. The analyzer then automatically creates a new buffer object and sets everything up for the input from the new stream. Also, the current state is stored in a stack. If the program feels that is has finished parsing an _included file_\footnote{This happens usually on the occasion of end of file.} then it may want to return to the including file. The reset of the old state is now easily possible using the function include_stack_pop(). It _deletes_ the buffer of the included file, but it does not delete the related input stream. If there is something to do with it, it has to be done before calling include_stack_pop(). In particular, it is generally a good idea to close the file stream when reading has finished.

Character Encoding (Unicode Support)

Traditional lexical analyzer generators concentrate on the ASCII \cite{} character set for the encoding of characters. With the rise of UTF-8 \cite{} those generators could even be used to parse unicode character sets, since they were unaware of escape characters. The problem with this approach is a lack of flexibility. UTF-8 is for most languages, i.e. unicode code pages, a highly redundant coding. Second the UTF-8 standard (ISO-10646) is a dynamic length coding where characters are coded with different numbers of bytes. Counting of columns or lexeme lengthes is no longer trivial and is prone to consume significant computation time. For example, for lex the definition of character ranges such as [a-z], for non-latin languages, is not possible because it considers only the two bytes around the - where the characters might actually be represented by more than one byte.

figures/character-encoding.png
Figure: Character encodings involved in quex's generation process.}

Under Unix, a utility called iconv \cite{} is the quasi standard for character set conversions. It is freely available for any major operating system. Queχ uses the library of this utility and has it webbed into its buffer handling. This way queχ is able to analyze character streams of all the hundreds of character encodings that iconv supports. It is essential to understand that there are three different places where character encodings are involved in the lexical analyzer generation. As shown in figure [fig-character-encodings] these are the following:

Lexical Analyzer Description

The encoding of the lexical analyzer generator, i.e. the queχ-files which are used as input to queχ and must use UTF-8. This encoding is independent of the coding of the files the generated lexical analyzer takes as input.

Files to Be Analyzed

The coding format of the files that the generated lexical analyzer reads. Note, that this format is tied to a buffer. Queχ, though, provides the flexibility to switch buffers and to allow to read files in different encodings with the same lexical analyzer.

Internal Engine

The internal character encoding. This is always Unicode Standard \cite{}. The user, though, can specify whether one, two, or four bytes are to be used for the character representation. Accordingly, it uses ASCII, UCS-2, or UCS-4 as internal coding. Note, that this encoding determines the lexeme type. It might be a good idea to adapt the character width to the definition of wchar_t\footnote{The type wchar_t which stands for wide character type, is usually 32 bits, i.e. 4 bytes on Unix systems and 16 bit on Windows based platforms.}.

There only three things required to get the encoding support for a particular coding. First one needs to make sure that the queχ files, i.e. the files that describe the lexical analyzer are stored in UTF-8 format by your editor. Most editors do this anyway by default. You need to pass the option --iconv to queχ when calling it. Third, the constructor of queχ is to be called with the encoding name of the files you want to parse, i.e. one might type

    quex::my_lexer*  qlex = new quex::my_lexer("program.mine", "GREEK-CCITT");

if one intends to create a purely greek programming language. If you set the command line option for byte per character -b wchar_t, then queχ will use the number of bytes that corresponds to wchar_t on your platform. For anything other than ASCII input encodings do not specify a byte per character less than two. At the time of this writing queχ does not support internal engines that are not based on Unicode. Clearly, in the above example the greek encoding is treated internally as character codes in the range of code points from \$370 to \$3FF and \$1F00 to \$1FFF.

Pasting Header Information with header

It is sometimes necessary to provide header file information or macro definitions to the pattern-action pairs or event handlers. To do this queχ allows to specify a header. This keyword has to appear outside any mode definition. It is followed by an opening curly bracket, the code to be pasted and a matching closing curly bracket.

Deriving from Lexical Analyser

The class generated by queχ implements an engine for lexical analysis with some basic functionality as was discussed in the previous sections. Sometimes, though, it is necessary to add features to the lexical analyser. Especially, adding event handlers for file inclusion and pattern matching requires overiding virtual functions of the engine. Adding new features while keeping basic functionality is best done using C++ inheritance, i.e. one derives his own class from the generated lexer class.

The name and definition of this class needs to be known for the creation of the lexical analyser. Thus, one has to specify the command line option` —derived-class` followed by the name of the class and` —derived-class-file` followed by the name of the file that contains the definition of the derived class. The derivation itself happens in the standard C++ way, i.e. one derives publicly from the lexical analyser class:

   class small_lexer
      : public quex::tiny_lexer {
        small_lexer(const std::string& filename,
                    std::ostream*      output_stream = 0);

        // ...
   };

Note that the derived class has to call a base class' constructor from inside its constructor initialization list in order to properly initialize the lexical analyser object.

    small_lexer(const std::string& filename,
                std::ostream*      output_stream = 0)
        : tiny_lexer(filename, output_stream),
          // ...
    {
          // ...
    }

If these caveats are taken care of, the user is free to create objects of his derived class and use it the same way as he used the previous plain generated class.

User defined Token Classes

Queχ provides a default token class that allows the storage of a string object, an int value. Note, that any complex structure might be transformed later on by the parser based on the given string. In some cases, though, it may be promising to implement a dedicated token class that is optimal in memory and speed with respect to a specific problem. For these cases, queχ allows the user to specify his own token class. The token queue does not require any adaption, since it is implemented as a template. The context of usages of tokens, however, imposes that it complies to some policies:

As long as these conventions are respected the user created token class will interoperate with the framework smoothly. The inner structure of the token class can be freely implemented according to the programmer's optimization concepts. Note, that the name of the token class is also of free choice. When invoquing queχ, the command line option —token-class needs to be followed by the user defined token class name. The command line option —token-class-file tells queχ the name of the file where this class is defined. As long as this options are not defined queχ will not consider user defined token classes and provide the standard token class.

Summary of Code Sections

Previous chapters and sections introduced different code sections where patterns, token identifier, modes, and engine code could be specified. The following table summarizes these section in order to serve as an overview. It associates the code section's keyword with its content and the section and page reference of its explanation.

For convenience, even though it is not a region, let it be mentioned here, that the assignment

    start = MY_MODE

defines the starting mode of the lexical analyzer. If no starting mode is explicitly defined queχ assumes the first mode defined in the source code as starting mode and produces a warning if the console.

Macro Switches

When queχ creates a lexical analyser it leaves some switches in the code that can be turned on and off in order to control certain charactersistics. First of all, queχ makes heavy use of`assert`s to make sure that the code is executed appropriately. These asserts can be turned off by defining the`NDEBUG` macro. The following macros are queχ specific and can be specified with the -D option of your compiler:

QUEX_OPTION_LINE_NUMBER_COUNTING

If this macro is defined, then the lexical analyzer class contains support for line number counting. This means, that according member functions are present and internal related bookkeeping is activated. Note, that activating this feature may consume some calculation time. However, the solution for number counting provided by queχ is likely to beat any user-provided approach. Note, that queχ is able to analyze regular expressions to the point that it tries to determine constant line offsets for a pattern, before the engine actually runs.

QUEX_OPTION_COLUMN_NUMBER_COUNTING

Same as QUEX_OPTION_LINE_NUMBER_COUNTING for column number counting.

QUEX_OPTION_RUNTIME_MODE_TRANSITION_CHECK

This flag is only interesting if the flag —no-mode-transition-check was not set, i.e. that code for mode-transition check has been produced by queχ. This macro allows to activate these mode transition checks. Note, that pushing and popping of modes requires a run time mode transition check.

QUEX_OPTION_DEBUG_TOKEN_SENDING

If this option is activated a message is printed each time a token is sent to the token queue. Together with a printout of the token after reception with get_token() this helps to see if tokens are flushed 'in time'. This feature is only available if the engine is produced with the --debug flag.

QUEX_OPTION_DEBUG_MODE_TRANSITIONS

If defined, the lexical analyser engine displays any mode transition on the standard error output. This feature is only available if the engine is produced with the --debug flag.

QUEX_OPTION_DEBUG_QUEX_PATTERN_MATCHES

If defined every pattern match is going to be displayed on the standard error output. This feature is only available if the engine is produced with the --debug flag.

QUEX_OPTION_INCLUDE_STACK_SUPPORT

If enabled, then this macro activates support for the include stack, i.e. the member functions to switch the input buffer elegantly to parse from included files and return back to the including file.

QUEX_SETTING_BUFFER_SIZE

The internal input buffer size. (DEFAULT = 65536)

QUEX_SETTING_TOKEN_QUEUE_INITIAL_SIZE

The initial size of the token queue. The token queue extends itself if there is not enough space to store tokens. However, in order to avoid such memory allocations, it might be set to a reasonable number. (DEFAULT = 2048, but something like 32 should actually be reasonable)

Query Mode

The specification of character sets based on properties and character set operations requires in general some closer inspection in order to avoid the inclusion of unwanted characters or to span a character set that is wider than actually intended. Consider for example that one might want to provide a programming language that allows Latin and Greek letters in identifiers. The direct approach would be as follows:

define {
    LATIN_ID_START   [: intersection(\P{ID_Start},    \P{Script=Latin}) :]
    GREEK_ID_START   [: intersection(\P{ID_Start},    \P{Script=Greek}) :]
    LATIN_ID_CONT    [: intersection(\P{ID_Continue}, \P{Script=Latin}) :]
    GREEK_ID_CONT    [: intersection(\P{ID_Continue}, \P{Script=Greek}) :]

    LATIN_ID   {LATIN_ID_START}({LATIN_ID_CONT})*
    GREEK_ID   {GREEK_ID_START}({GREEK_ID_CONT})*
}

This specification is totally rational. However, it might include more characters than one actually intended. If one mentions Greek and Latin characters one usually thinks about lower and upper case letters from a to z and α to ω. However, the set of characters that are included in Script=Latin is much larger than that—and so is the set of characters in Script=Greek. Figure [fig-screenshot-character-set-query] displays the set of characters for the character set specified by GREEK_ID_CONT.

figures/screenshot-character-set-query.png
Figure: Example query for greek identifier characters.

As can be seen in the screenshot, the character set that is actually spanned by the expression is rather huge and exceeds the set of characters that can be displayed by the console of the user. At this point in time, one might have to use further intersection and difference operations in order to cut out the set that is actually desired. Also, consider not using the powerful tool of unicode properties, if things are expressed elegantly with traditions character set expressions. In the above case, a solution that likely fulfills the user's purpose might look like this:

define {
    LATIN_ID   [_a-zA-Z]([_a-zA-Z0-9])*
    GREEK_ID   [-ωΑ-Ω]([-ωΑ-Ω0-9])*
}

Including other scripts, or enabling other features of a used language requires closer investigation of the unicode properties. Queχ provides some powerful services to investigate the effect of a character set expression on the command line. Sophisticated programming language design on a non-latin language, though, requires some review of the literature of UCS based character sets and their characteristics. This chapter shows how queχ can be used to query the unicode database and see the effect of character set expressions immediately, without having to compile a whole lexical analyzer.

Unicode Properties

The first doubt that arises when properties are to be applied is whether the properties actually exists. The second doubt, then, is what values these properties can actually have. To help out with the first question, simply call queχ with the command line option —property and it prints a list of properties that it is able to extract from the unicode database. This leads to an output as follows:

# Abbreviation, Name, Type
AHex,    ASCII_Hex_Digit,                    Binary
Alpha,   Alphabetic,                         Binary
Bidi_C,  Bidi_Control,                       Binary
Bidi_M,  Bidi_Mirrored,                      Binary
CE,      Composition_Exclusion,              Binary
Comp_Ex, Full_Composition_Exclusion,         Binary
DI,      Default_Ignorable_Code_Point,       Binary
Dash,    Dash,                               Binary
Dep,     Deprecated,                         Binary

...

na,      Name,                               Miscellaneous
na1,     Unicode_1_Name,                     Miscellaneous
nt,      Numeric_Type,                       Enumerated
nv,      Numeric_Value,                      Numeric
sc,      Script,                             Catalog
scc,     Special_Case_Condition,             String,        <unsupported>
sfc,     Simple_Case_Folding,                String,        <unsupported>
slc,     Simple_Lowercase_Mapping,           String,        <unsupported>
stc,     Simple_Titlecase_Mapping,           String,        <unsupported>
suc,     Simple_Uppercase_Mapping,           String,        <unsupported>
tc,      Titlecase_Mapping,                  String,        <unsupported>
uc,      Uppercase_Mapping,                  String,        <unsupported>

Each line contains three fields separated by commas. The first field contains the alias for the property name, the second field contains the property name, and the last column contains the type of property. If a field containing <unsuported> is appended, this means that this property is not supported by queχ. In most cases this so because these properties support character operations rather then the definition of character sets.

To help out with the second question call queχ with the command line option —property followed by the name of the property that you want to know more about. The following displays the query about the property Numeric_Type:

    > quex --property Numeric_Type
    (please, wait for database parsing to complete)

    NAME          = 'Numeric_Type'
    ALIAS         = 'nt'
    TYPE          = 'Enumerated'
    VALUE_ALIASES = {
            Decimal(De),
            Digit(Di),
            Numeric(Nu).
    }

This tells, that Numeric_Type is a property of type Enumerated, i.e. its values are taken from a fixed list of values. The alias nt can be used as a placeholder for Numeric_Type, and the possible value settings are Decimal, Digit, and Numeric. The strings mentioned in brackets are the value aliases that can be used as placeholders for the values, in case one does not want to type the whole name of the value. From this output one knows that expressions such as \P{Numeric_Type=Decimal}, \P{nt=Di}, and \P{Numeric_Type=Nu} are valid for this property. The next doubt that arises is about the character set that is actually spanned by such expressions. This is discussed in the subsequent section.

Character Set Expressions

The specification of character sets using unit code properties requires caution and a precise vision of the results of character set operations. In order to allow to user to see results of character set operations, queχ provides a query mode where the character set corresponding to a specific property setting can be investigated. Additionally, whole character set expressions can be specified and queχ displays the set of characters that results from it.

In order to display the character set related to a unicode property setting, call queχ with the option —set-by-property followed by a string that specifies the setting in the same way as this is done in regular expressions with the \P{…} operation. Note, that binary properties have no value in the sense that they are specified in a manner as property-name=value. Instead, it is sufficient to give the name of the binary property and queχ displays all characters that have that property, e.g.

 > quex --set-by-property ASCII_Hex_Digit

displays all characters that have the property ASCII_Hex_Digit, i.e.

00020:                               0 1 2 3 4 5 6 7 8 9
00040: A B C D E F
00060: a b c d e f

For non-binary properties the value of the property has to be specified. For example, the set of characters where the property Block is equal to Ethiopic can be displayed by calling

 > quex --set-by-property Block=Ethiopic
figures/screenshot-block-ethiopic.png
Figure: Characters for unicode property Block=Ethiopic.

and the result is displayed in figure [fig-screenshot-block-ethiopic.png]. Naturally, sets specified with a simple property setting are not precisely what the user desires. To apply complex operations of character set, queχ provides character set operations [sec-formal/patterns/operations]. Again, it is essential to know which character sets these expressions expand. Thus, queχ provides a function to investigate set expressions with the command line option —set-by-expression. The following call

 > quex --set-by-expression 'intersection(\P{Script=Arabic}, \G{Nd})'

Displays all characters that are numeric digits in the arabic script. Note that the display of characters on the screen is not always meaningful, so you might specify the option —numeric in order to display numeric values of the characters. The option —intervals prints out character intervals instead of each isolated character. This way output may become more brief and understandable.

Wilcard Expansions

Property value settings can be specified using wildcards as explained in [sec-formal/patterns/character-set]. Using the command line option —property-match allows the user to see to which values these wildcard expressions actually expand. For example, the call

> quex --property-match Name=MATH*FIVE*

delivers:

MATHEMATICAL_BOLD_DIGIT_FIVE
MATHEMATICAL_DOUBLE-STRUCK_DIGIT_FIVE
MATHEMATICAL_MONOSPACE_DIGIT_FIVE
MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_FIVE
MATHEMATICAL_SANS-SERIF_DIGIT_FIVE

Note, that the final character sets spanned by these settings can be viewed using —set-by-property Name=MATH*FIVE* and —set-by-expression \N{MATH*FIVE*}$$` in the manner previously explained. For example,

> quex --set-by-property 'Name=MATH*FIVE*'

delivers:

1D7D0:             1D7D3
1D7D8:                         1D7DD
1D7E0:                                     1D7E7
...
1D7F0: 1D7F1
1D7F8:             1D7FB

Command Line Options

This section lists the command line options to control the behavior of the generated lexical analyzer. Numbers following these options can be either decimal, without any prefix, hexadecimal with a 0x prefix or octal with a 0o prefix.

-i, —mode-files file-list

file-list = list of files of the file containing mode definitions (see sections [sec-practical-modes], [sec-practical-pattern-action-pairs], and [sec-formal-generated-class-mode-handling]). DEFAULT = <empty>

—token-prefix name

name = Name prefix to prepend to the name given in the token-id files. For example, if a token-id file contains the name COMPLEX and the token-prefix is TOKEN_PRE_ then the token-id inside the code will be TOKEN_PRE_COMPLEX. DEFAULT = "TKN_".

—token-offset number

number = Number where the numeric values for the token-ids start to count. DEFAULT = "10000"

—no-token-queue

Disables the sending of tokens via a queue. Note, this can be a little faster than tranmitting tokens via a token queue.

—token-id-termination number

number = Token identifier for the event of end of token stream. DEFAULT = "0". Note, that the termination token id can always be referred to via __QUEX_TOKEN_ID_TERMINATION. One does not have to specify it just in order to be able to know its value. Consider also the comments on the token for uninitialized.

—token-id-uninitialized number

number = Token identifier of a token that has not yet been initialized. Any token that arrives to the user on purpose should not contain this identifier. It is basically a means for debugging in order to check whether tokens slip through due to any erroneous behavior. DEFAULT = "1".
This option is close to meaningless under normal conditions. One can always refer to the unitiliazed token id via __QUEX_TOKEN_ID_UNINITIALIZED. In certain cases, though, it might make sense to ensure this id to have a certain value. This would be the case, if someone would like to code more than just the token id in the token id variable.

—version-id name

name = arbitrary name of the version that was generated. This string is reported by the version() member function of the lexical analyser. DEFAULT = "0.0.0-pre-release"

—foreign-token-id-file filename

filename = Name of the file that contains an alternative definition of the numerical values for the token-ids (see also section [sec-formal-macro]). DEFAULT = <empty>

-o, —engine name

name = Name of the lexical analyser class that is to be created inside the namespace`queχ`. This name also determines the filestem of the output files generated by queχ. DEFAULT = "lexer"

—debug

If provided, then code fragments are created to activate the output of every pattern match. Then defining the macro QUEX_OPTION_DEBUG_QUEX_PATTERN_MATCHES activates those printouts in the standard error output. Note, that this options produces considerable code overhead. DEFAULT = <disabled>

—no-mode-transition-check

Turns off the mode transition check and makes the engine a little faster. During development this option should not be used. But the final lexical analyzer should be created with this option set. DEFAULT = <disabled>, which means that mode transition check is enabled.

—no-string-accumulator, —nsacc

Turns the string accumulator option off. This disables the use of the string accumulator to accumulate lexemes. See class queχ::Accumulator.

—post-categorizer

Turns the post categorizer option on. This allows a secondary mapping from lexemes to token ids based on their name. See class queχ::PostCategorizer.

For the support of derivation from the generated lexical analyzer class the following command line options can be used.

—derived-class name

name = If specified, the name of the derived class that the user intends to provide (see section [sec-formal-derivation]). Note, specifying this option signalizes that the user wants to derive from the generated class. If this is not desired, this option, and the following, have to be left out. DEFAULT = <empty>

—derived-class-file filename

filename = If specified, the name of the file where the derived class is defined. This option only makes sense in the context of optioin —derived-class`. DEFAULT = <empty>

—friend-class name-list

name-list = Names of classes that shall be friends of the generated lexical analyser. This is to be used, if other classes need to have access to protected or private members of the analyser. It is only to be used by specialists. DEFAULT = <empty>

Additionally, there are options for specialists who want to provide their own token-class:

—token-class name

name = Name of the token class that the user defined. Note that the token class needs to be specified in namespace queχ. DEFAULT = "token"

—token-class-file filename

filename = Name of file that contains the definition of the token class. DEFAULT = "$(QUEX_PATH)/code_base/token"

Even if a non-queχ token class is provided, the token-id generator may still be useful. By default, it remains in place. The user, however, can specify the following option to disable it:

—user-token-id-file filename

filename = Name of file that contains the definition of the token-ids and the mapping function from numerical token-ids to std::string objects, i.e. human readable names. DEFAULT = <disabled>

There may be cases where the characters used to indicate buffer limit, end-of-stream and begin-of-stream need to be redefined, because the default code points appear in a pattern
[As for normal ASCII or Unicode based lexical analyzers, this would most probably not be a good design decision. But, when other, alien, non-unicode codings are to be used, this case is conceivable.]
. Also, note that the . regular expression (meaning nothing but newline or end of file) needs check for begin-of-buffer and end-of-buffer in the general case. Giving both the same value may come with some speedup, and does not hurt. None of these values should be equal to the buffer limit delimiter. The following options allow modification of the values discussed above.

—buffer-limit number DEFAULT = "0x0"
—begin-of-stream number DEFAULT = "0x19"
—end-of-stream number DEFAULT = "0x1A"

If the trivial end-of-line pre-condition (i.e. the \$ at the end of a regular expression) is used, by default queχ produces code that runs on both Unix and DOS-like systems. Practically, this means that it matches against newline 0x0A and carriage return/newline 0x0D 0x0A. For the case that the resulting analyzer only runs on a Unix machine some tiny performance improvements might be achieved by disabling the 0x0D 0x0A sequence and only triggering on 0x0A. In this case, the following flag may be specified:

—no-DOS DEFAULT=<enabled>

For unicode support it is essential to allow iconv support \cite{}. For this the iconv library must be installed on your system. On Unix systems this library is usually present. If a coding other than ASCII is required, specify the following options:

—iconv

Enable the use of the iconv library for character stream decoding. This option is a must for unicode support. DEFAULT = <disabled>

—bytes-per-ucs-code-point, -b ["1", "2", "4", "wchar_t"]

With this option the internal representation of character is specified. It determines the bytes per character, which compose the lexeme strings and on which the lexical analyzer engine internally operates. The byte number should at least suffice to carry the desired input coding space. You can only specify 1 byte, 2 byte or 4 byte per character. If`wchar_t` is specified queχ automatically adapts to the correspondent type of the operating system environment where the target code is compiled. Use this, if option`—iconv` is used and you are in doubt. DEFAULT = "2"

—endian ["little", "big", "<system>"]

There are two types of byte ordering for integer number for different CPUs. For creating a lexical analyzer engine on the same CPU type as queχ runs then this option is not required, since queχ finds this out by its own. If you create an engine for a different plattform, you must know its byte ordering scheme, i.e. little endian or big endian, and specify it after —endian'. DEFAULT="<system>"

For version information pass option —version or -v. The options —help and -h are reserved for requesting a help text. Those are the options for using queχ in the 'normal mode where it creates lexical analyzers. However, queχ provides some services to query and test character sets. If one of those options is called, then queχ does not create a lexical analyzer but responds with some information requested by the user. Those options are the following.

—property [name]

If name is specified, then information about the property with the given name is displayed. Note, that name can also be a property alias. If name is not specified, then brief information about all available unicode properties is displayed.

—set-by-property setting

For binary properties only the property name has to be specified. All other properties require a term of the form property-name = value. Queχ then displays the set of character that has this particular property.

—set-by-expression expression

Character set expressions that are ususally specified in [::] brackets can be specified as expression. Queχ then displays the set of characters that results from it.

—property-match wildcard-expression

Queχ allows the use of wildcards in property values. Using this option allows display of the list of values to which the given wildcard expression expands. Example: The wildcard-expression Name=LATIN gives all settings of property Name that contain the string LATIN.

—numeric

If this option is specified the numeric character codes are displayed rather then the utf8 characters.

—intervals

This option disables the display of single character or single character codes. In this case sets of adjacent characters are displayed as intervals. This provides a somewhat more abbreviated display.

Additionally, queχ provides the ability to display transition diagrams of produced state machines graphically. The following command line options support this feature:

—plot graphic format

Runs queχ in the plotting mode. Rather than producing source code, queχ produces transition diagrams of the defined modes. For querying possible graphic formats, run queχ with the —plot-format-list command line option. Note, for this option to work, the graphviz package needs to be installed (see www.graphviz.org).

—plot-format-list

Lists all possible graphic formats for which queχ can produce transition graphs.

Transition Graph Generation

Since version 0.21.11, queχ has had the ability to produce transition graphs for the state machines that result from the regular expressions of its modes. This happens with help of the tool graphviz from AT\&T Labs (http://www.graphviz.org). Please install this package before any attempt to produce transition graphs.

The plot of transition graphs is initiated with the command line argument —plot followed by a string that indicates the graphic format, e.g.

> quex -i core.qx --plot svg

will produce transition graphs of all state machines that would result from descriptions in core.qx. As a result of this operation a set of files is created that carry the extension of the graphic format. The basenames of the files are related to the mode from which its content results. A list of supported graphic formats can be queried by passing the command line argument —plot-format-list.

Appendix: Unicode Character Properties

This chapter focuses on the Unicode Character Properties as they are provied by queχ. The current version of Queχ is based on Unicode 5.0. It uses the databases as provided by the Unicode Consortium and it is likely that Queχ integrates the property system of any later standard as soon as it is in a major state. In particular, Unicode Character Properties are used to define sets of characters to be matched during lexical analysis. This excludes some properties from consideration, namely the properties tagged as string property types and the quick-check properties (see [UCS#15]). The following properties are explicitly supported by queχ. The expressions in brackets are the aliases that can be used as a shorthand for the full name of the property.

Binary Properties

ASCII_Hex_Digit(AHex), Alphabetic(Alpha), Bidi_Control(Bidi_C), Bidi_Mirrored(Bidi_M), Composition_Exclusion(CE), Dash(Dash), Default_Ignorable_Code_Point(DI), Deprecated(Dep), Diacritic(Dia), Expands_On_NFC(XO_NFC), Expands_On_NFD(XO_NFD), Expands_On_NFKC(XO_NFKC), Expands_On_NFKD(XO_NFKD), Extender(Ext), Full_Composition_Exclusion(Comp_Ex), Grapheme_Base(Gr_Base), Grapheme_Extend(Gr_Ext), Grapheme_Link(Gr_Link), Hex_Digit(Hex), Hyphen(Hyphen), IDS_Binary_Operator(IDSB), IDS_Trinary_Operator(IDST), ID_Continue(IDC), ID_Start(IDS), Ideographic(Ideo), Join_Control(Join_C), Logical_Order_Exception(LOE), Lowercase(Lower), Math(Math), Noncharacter_Code_Point(NChar), Other_Alphabetic(OAlpha), Other_Default_Ignorable_Code_Point(ODI), Other_Grapheme_Extend(OGr_Ext), Other_ID_Continue(OIDC), Other_ID_Start(OIDS), Other_Lowercase(OLower), Other_Math(OMath), Other_Uppercase(OUpper), Pattern_Syntax(Pat_Syn), Pattern_White_Space(Pat_WS), Quotation_Mark(QMark), Radical(Radical), STerm(STerm), Soft_Dotted(SD), Terminal_Punctuation(Term), Unified_Ideograph(UIdeo), Uppercase(Upper), Variation_Selector(VS), White_Space(WSpace), XID_Continue(XIDC), XID_Start(XIDS),

Non-Binary Properties

Age(age), Bidi_Class(bc), Bidi_Mirroring_Glyph(bmg), Block(blk), Canonical_Combining_Class(ccc), Case_Folding(cf), Decomposition_Mapping(dm), Decomposition_Type(dt), East_Asian_Width(ea), FC_NFKC_Closure(FC_NFKC), General_Category(gc), Grapheme_Cluster_Break(GCB), Hangul_Syllable_Type(hst), ISO_Comment(isc), Joining_Group(jg), Joining_Type(jt), Line_Break(lb), Lowercase_Mapping(lc), NFC_Quick_Check(NFC_QC), NFD_Quick_Check(NFD_QC), NFKC_Quick_Check(NFKC_QC), NFKD_Quick_Check(NFKD_QC), Name(na), Numeric_Type(nt), Numeric_Value(nv), Script(sc), Sentence_Break(SB), Simple_Case_Folding(sfc), Simple_Lowercase_Mapping(slc), Simple_Titlecase_Mapping(stc), Simple_Uppercase_Mapping(suc), Special_Case_Condition(scc), Titlecase_Mapping(tc), Unicode_1_Name(na1), Unicode_Radical_Stroke(URS), Uppercase_Mapping(uc), Word_Break(WB),

Binary properties can simply be applied using the \P{…} expression. Thus, in usual programming languages.

Unicode Character Property Settings

This section elaborates on possible property settings for non-binary properties. In queχ's regular expression syntax they can be used using the \P{..} expressions. For example, \P{Age=3.1} selects the set of character that exists since Unicode Version 3.1. For reasons mentioned in the previous section some Unicode Character Code properties are not supported. Note also, that some settings can be provided using value aliases. Those aliases are specified in brackets.

Age

1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0, 4.1, 5.0.

Bidi_Class

Arabic_Letter(AL), Arabic_Number(AN), Boundary_Neutral(BN), Common_Separator(CS), European_Number(EN), European_Separator(ES), European_Terminator(ET), Left_To_Right(L), Left_To_Right_Embedding(LRE), Left_To_Right_Override(LRO), Nonspacing_Mark(NSM), Other_Neutral(ON), Paragraph_Separator(B), Pop_Directional_Format(PDF), Right_To_Left(R), Right_To_Left_Embedding(RLE), Right_To_Left_Override(RLO), Segment_Separator(S), White_Space(WS).

Bidi_Mirroring_Glyph

(not supported)

Block

Aegean_Numbers, Alphabetic_Presentation_Forms, Ancient_Greek_Musical_Notation, Ancient_Greek_Numbers, Arabic, Arabic_Presentation_Forms-A, Arabic_Presentation_Forms-B, Arabic_Supplement, Armenian, Arrows, Balinese, Basic_Latin, Bengali, Block_Elements, Bopomofo, Bopomofo_Extended, Box_Drawing, Braille_Patterns, Buginese, Buhid, Byzantine_Musical_Symbols, CJK_Compatibility, CJK_Compatibility_Forms, CJK_Compatibility_Ideographs, CJK_Compatibility_Ideographs_Supplement, CJK_Radicals_Supplement, CJK_Strokes, CJK_Symbols_and_Punctuation, CJK_Unified_Ideographs, CJK_Unified_Ideographs_Extension_A, CJK_Unified_Ideographs_Extension_B, Cherokee, Combining_Diacritical_Marks, Combining_Diacritical_Marks_Supplement, Combining_Diacritical_Marks_for_Symbols, Combining_Half_Marks, Control_Pictures, Coptic, Counting_Rod_Numerals, Cuneiform, Cuneiform_Numbers_and_Punctuation, Currency_Symbols, Cypriot_Syllabary, Cyrillic, Cyrillic_Supplement, Deseret, Devanagari, Dingbats, Enclosed_Alphanumerics, Enclosed_CJK_Letters_and_Months, Ethiopic, Ethiopic_Extended, Ethiopic_Supplement, General_Punctuation, Geometric_Shapes, Georgian, Georgian_Supplement, Glagolitic, Gothic, Greek_Extended, Greek_and_Coptic, Gujarati, Gurmukhi, Halfwidth_and_Fullwidth_Forms, Hangul_Compatibility_Jamo, Hangul_Jamo, Hangul_Syllables, Hanunoo, Hebrew, High_Private_Use_Surrogates, High_Surrogates, Hiragana, IPA_Extensions, Ideographic_Description_Characters, Kanbun, Kangxi_Radicals, Kannada, Katakana, Katakana_Phonetic_Extensions, Kharoshthi, Khmer, Khmer_Symbols, Lao, Latin-1_Supplement, Latin_Extended-A, Latin_Extended-B, Latin_Extended-C, Latin_Extended-D, Latin_Extended_Additional, Letterlike_Symbols, Limbu, Linear_B_Ideograms, Linear_B_Syllabary, Low_Surrogates, Malayalam, Mathematical_Alphanumeric_Symbols, Mathematical_Operators, Miscellaneous_Mathematical_Symbols-A, Miscellaneous_Mathematical_Symbols-B, Miscellaneous_Symbols, Miscellaneous_Symbols_and_Arrows, Miscellaneous_Technical, Modifier_Tone_Letters, Mongolian, Musical_Symbols, Myanmar, NKo, New_Tai_Lue, Number_Forms, Ogham, Old_Italic, Old_Persian, Optical_Character_Recognition, Oriya, Osmanya, Phags-pa, Phoenician, Phonetic_Extensions, Phonetic_Extensions_Supplement, Private_Use_Area, Runic, Shavian, Sinhala, Small_Form_Variants, Spacing_Modifier_Letters, Specials, Superscripts_and_Subscripts, Supplemental_Arrows-A, Supplemental_Arrows-B, Supplemental_Mathematical_Operators, Supplemental_Punctuation, Supplementary_Private_Use_Area-A, Supplementary_Private_Use_Area-B, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tags, Tai_Le, Tai_Xuan_Jing_Symbols, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Unified_Canadian_Aboriginal_Syllabics, Variation_Selectors, Variation_Selectors_Supplement, Vertical_Forms, Yi_Radicals, Yi_Syllables, Yijing_Hexagram_Symbols(n/a).

Canonical_Combining_Class

0, 1, 10, 103, 107, 11, 118, 12, 122, 129, 13, 130, 132, 14, 15, 16, 17, 18, 19, 20, 202, 21, 216, 218, 22, 220, 222, 224, 226, 228, 23, 230, 232, 233, 234, 24, 240, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 7, 8, 84, 9, 91.

Case_Folding

(not supported)

Decomposition_Mapping

(not supported)

Decomposition_Type

Canonical(can), Circle(enc), Compat(com), Final(fin), Font(font), Fraction(fra), Initial(init), Isolated(iso), Medial(med), Narrow(nar), Nobreak(nb), Small(sml), Square(sqr), Sub(sub), Super(sup), Vertical(vert), Wide(wide).

East_Asian_Width

A, F, H, N, Na, W.

FC_NFKC_Closure

(not supported)

General_Category

Close_Punctuation(Pe), Connector_Punctuation(Pc), Control(Cc), Currency_Symbol(Sc), Dash_Punctuation(Pd), Decimal_Number(Nd), Enclosing_Mark(Me), Final_Punctuation(Pf), Format(Cf), Initial_Punctuation(Pi), Letter_Number(Nl), Line_Separator(Zl), Lowercase_Letter(Ll), Math_Symbol(Sm), Modifier_Letter(Lm), Modifier_Symbol(Sk), Nonspacing_Mark(Mn), Open_Punctuation(Ps), Other_Letter(Lo), Other_Number(No), Other_Punctuation(Po), Other_Symbol(So), Paragraph_Separator(Zp), Private_Use(Co), Space_Separator(Zs), Spacing_Mark(Mc), Surrogate(Cs), Titlecase_Letter(Lt), Uppercase_Letter(Lu).

Grapheme_Cluster_Break

CR(CR), Control(CN), Extend(EX), L(L), LF(LF), LV(LV), LVT(LVT), T(T), V(V).

Hangul_Syllable_Type

L, LV, LVT, T, V.

ISO_Comment

*, Abkhasian, Adrar_yaj, Aristeri_keraia, Assamese, Byelorussian, Dasia, Dexia_keraia, Dialytika, Enn, Enotikon, Erotimatiko, Faliscan, German, Greenlandic, Icelandic, Kaeriten, Kanbun_Tateten, Khutsuri, Maatham, Mandarin_Chinese_first_tone, Mandarin_Chinese_fourth_tone, Mandarin_Chinese_light_tone, Mandarin_Chinese_second_tone, Mandarin_Chinese_third_tone, Merpadi, Naal, Oscan, Oxia,_Tonos, Patru, Psili, Rupai, Sami, Serbocroatian, Tuareg_yab, Tuareg_yaw, Ukrainian, Umbrian, Varavu, Varia, Varudam, Vietnamese, Vrachy, a, aa, ae, ai, ang_kang_ye, ang_kang_yun, anusvara, ardhacandra, ash_*, au, b_*, bb_*, bha, break, bs_*, bub_chey, c_*, candrabindu, cha, chang_tyu, che_go, che_ta, che_tsa_chen, chu_chen, colon, d_*, danda, dd_*, dda, ddha, deka_chig, deka_dena, deka_nyi, deka_sum, dena_chig, dena_nyi, dena_sum, dha, di_ren_*, dong_tsu, dorje, dorje_gya_dram, double_danda, drilbu, drul_shey, du_ta, dzu_ta_me_long_chen, dzu_ta_shi_mig_chen, e, escape, g_*, gg_*, gha, golden_number_17, golden_number_18, golden_number_19, gs_*, gug_ta_ye, gug_ta_yun, gup, gya_tram_shey, h_*, halfwidth_katakana-hiragana_semi-voiced_sound_mark, halfwidth_katakana-hiragana_voiced_sound_mark, harpoon_yaz, hdpe, hlak_ta, honorific_section, hwair, i, ii, independent, j_*, je_su_nga_ro, jha, ji_ta, jj_*, k_*, ka_sho_yik_go, ka_shog_gi_go_gyen, kha, kur_yik_go, kuruka, kuruka_shi_mik_chen, kyu_pa, l_*, lakkhang_yao, lazy_S, lb_*, ldpe, lg_*, lh_*, line-breaking_hyphen, lm_*, lp_*, ls_*, lt_*, m_*, mai_taikhu, mai_yamok, mar_tse, mathematical_use, n_*, nam_chey, nan_de, ng_*, nge_zung_gor_ta, nge_zung_nyi_da, nh_*, nikkhahit, nj_*, nna, norbu, norbu_nyi_khyi, norbu_shi_khyi, norbu_sum_khyi, not_independent, nukta, nyam_yig_gi_go_gyen, nyi_da_na_da, nyi_shey, nyi_tsek_shey, o, oe, or_shuruq, other, p_*, paiyan_noi, pause, pema_den, pete, pha, phurba, pp, ps, pug, punctuation_ring, pvc, r_*, ren_*, ren_di_*, ren_ren_*, ren_tian_*, repha, rinchen_pung_shey, s_*, sara_ai_mai_malai, sara_ai_mai_muan, sara_uue, section, sha, shey, ss_*, ssa, t_*, tamatart, ter_tsek, ter_yik_go_a_thung, ter_yik_go_wum_nam_chey_ma, ter_yik_go_wum_ter_tsek_ma, tha, tian_ren_*, trachen_char_ta, tru_chen_ging, tru_me_ging, tsa_tru, tsek, tsek_shey, tsek_tar, tta, ttha, u, uu, virama, visarga, vocalic_l, vocalic_ll, vocalic_r, vocalic_rr, yang_ta, yar_tse, yik_go_dun_ma, yik_go_kab_ma, yik_go_pur_shey_ma, yik_go_tsek_shey_ma.

Joining_Group

Ain, Alaph, Alef, Beh, Beth, Dal, Dalath_Rish, E, Fe, Feh, Final_Semkath, Gaf, Gamal, Hah, Hamza_On_Heh_Goal, He, Heh, Heh_Goal, Heth, Kaf, Kaph, Khaph, Knotted_Heh, Lam, Lamadh, Meem, Mim, Noon, Nun, Pe, Qaf, Qaph, Reh, Reversed_Pe, Sad, Sadhe, Seen, Semkath, Shin, Swash_Kaf, Syriac_Waw, Tah, Taw, Teh_Marbuta, Teth, Waw, Yeh, Yeh_Barree, Yeh_With_Tail, Yudh, Yudh_He, Zain, Zhain(n/a).

Joining_Type

C, D, R, T.

Line_Break

AI, AL, B2, BA, BB, BK, CB, CL, CM, CR, EX, GL, H2(H2), H3(H3), HY, ID, IN, IS, JL(JL), JT(JT), JV(JV), LF, NL, NS, NU, OP, PO, PR, QU, SA, SG, SP, SY, WJ, XX, ZW.

Lowercase_Mapping

(not supported)

NFC_Quick_Check

(not supported)

NFD_Quick_Check

(not supported)

NFKC_Quick_Check

(not supported)

NFKD_Quick_Check

(not supported)

Name

(see Unicode Standard Literature)

Numeric_Type

Decimal(De), Digit(Di), Numeric(Nu).

Numeric_Value

0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Script

Arabic(Arab), Armenian(Armn), Balinese(Bali), Bengali(Beng), Bopomofo(Bopo), Braille(Brai), Buginese(Bugi), Buhid(Buhd), Canadian_Aboriginal(Cans), Cherokee(Cher), Common(Zyyy), Coptic(Copt), Cuneiform(Xsux), Cypriot(Cprt), Cyrillic(Cyrl), Deseret(Dsrt), Devanagari(Deva), Ethiopic(Ethi), Georgian(Geor), Glagolitic(Glag), Gothic(Goth), Greek(Grek), Gujarati(Gujr), Gurmukhi(Guru), Han(Hani), Hangul(Hang), Hanunoo(Hano), Hebrew(Hebr), Hiragana(Hira), Inherited(Qaai), Kannada(Knda), Katakana(Kana), Kharoshthi(Khar), Khmer(Khmr), Lao(Laoo), Latin(Latn), Limbu(Limb), Linear_B(Linb), Malayalam(Mlym), Mongolian(Mong), Myanmar(Mymr), New_Tai_Lue(Talu), Nko(Nkoo), Ogham(Ogam), Old_Italic(Ital), Old_Persian(Xpeo), Oriya(Orya), Osmanya(Osma), Phags_Pa(Phag), Phoenician(Phnx), Runic(Runr), Shavian(Shaw), Sinhala(Sinh), Syloti_Nagri(Sylo), Syriac(Syrc), Tagalog(Tglg), Tagbanwa(Tagb), Tai_Le(Tale), Tamil(Taml), Telugu(Telu), Thaana(Thaa), Thai(Thai), Tibetan(Tibt), Tifinagh(Tfng), Ugaritic(Ugar), Yi(Yiii).

Sentence_Break

ATerm(AT), Close(CL), Format(FO), Lower(LO), Numeric(NU), OLetter(LE), STerm(ST), Sep(SE), Sp(SP), Upper(UP).

Simple_Case_Folding

(not supported)

Simple_Lowercase_Mapping

(not supported)

Simple_Titlecase_Mapping

(not supported)

Simple_Uppercase_Mapping

(not supported)

Special_Case_Condition

(not supported)

Titlecase_Mapping

(not supported)

Unicode_1_Name

(see Unicode Standard Literature)

Unicode_Radical_Stroke

(not supported)

Uppercase_Mapping

(not supported)

Word_Break

ALetter(LE), ExtendNumLet(EX), Format(FO), Katakana(KA), MidLetter(ML), MidNum(MN), Numeric(NU).

Appendix: Quex Intern: Generation of the Core Engine

The queχ program is provided in source code with an LPGL license in order to promote understanding of the processes of lexical analysis in its practical application. The source code may be used for teaching or to enhance the functionality or performance of the software. In order to allow programmers to get into queχ's source code programming, the following sections describe the way that queχ generates lexical analysers step by step. Clearly, this chapter is a hacker's guide to queχ. Users interested only in the application of queχ may skip this chapter without hesitation.

The generation of the core engine is separate from the handling of modes and their relationships. In the following sections, only the generation of the core engine, i.e. a lexical analyser for a single mode, is explained. Each mode consists of such an engine. The aggregation of engines for each mode to one single analyser engine is trivial and not the subject of further explanation\footnote{The switching between modes is not much more than the bending of a function pointer to the mode's analyzer function.}. For further support to get into programming, the unit tests for each sub-module in the source code may be considered. They can be found in the`./TEST/` sub-directory of each module.

The engine produced by queχ is a directly coded engine, rather than a table driven approach followed by the lex/flex family of analyzer generators. It is more suited to handling unicode and provides advantages in terms of system memory usage and calculation speed.

Introduction

The following sections elaborate on the process of generating code for a directly coded lexical analyser. Each lexical analyzer mode consists of such an independent code fragment embedded into a function. In a first section, the big picture is discussed. This gives an overview about the major issues involved and builds a basis for later sections to refer to. Many books discuss in detail the Thomson-construction and the Hopcroft-Optimization, but few (the author does not know a single one) discuss the construction down to code-generation and its specialities in detail. Queχ was created to handle efficiently Unicode character sets. As a consequence the table-driven approach, such as in lex \cite{} is no longer suitable%% %% \footnote{In table driven analysers the table size is the square of the number of characters. Unicode provides currently \$10.FFFF code points. A table for this character set would blow the memory of todays computers (The century). Compressing the table might slow down the speed of the lexical analyser significantly.}. %% Thus, queχ implements a directly coded lexical analyser where no state variable is needed and no character-trigger table. It is also a much faster approach, since no memory has to be referenced%% %% \footnote{Note, that an expression such as vector$[$i$]$' implies generally a multiplication offset = i * sizeof(type)), an addition adress = vector + offset), and a memory referencing operation object = *adress).}. %%

The Big Picture

In this section the overal lexical analyser is discussed. This is done before the detailed construction is dealt with in order to introduce the basic ideas that influence the creation of a lexical analyser. In particular concepts such as pattern privileges, pre- and post-conditions, as well as state origins are discussed at this point. The lexical analyser to be generated by queχ is a state machine. In this text the following notation for different states of a state machine is chosen:

figures/pattern-bank.png
Figure: Three example patterns implemented as state machines.

Figure style=ref shows an example of three patterns to be matched against: A keyword WHILE, a label a character sequence terminated by a colon), and an identifier, i.e. a character sequence. An incoming character stream triggers step by step the transition in each of these state machines. Some may bail out. Other may reach the acceptance state. If one state machine reached the acceptance state, one can say that the pattern it represents has matched. For example a character sequence`H`,E`, R`,E, : lets state machine 0 fail at the first character, state machine 1 reaches acceptance at character :' and state machine 2 reaches acceptance one character before. Here an essential concept has to be mentioned:

\definition{ Longest Match: Queχ produces a lexical analyser that eats as many characters as possible. When a pattern can match more incoming characters than another, it dominates. }

This approach is very logical, since it was otherwise impossible to have a keyword End} and EndIf for example. If the shortest match wins, then always End is detected, but never EndIf. Practically in the example above this means that we do not stop when state machine 2 enters acceptance, but we give state machine 1 a chance match a longer pattern.

figures/pattern-bank-merged.pdf
Figure: State machine matching against all three patterns from figure style=ref.

Now, consider the character sequence`W`,H,I,L`, E`,(. State machine 1 fails without ever entering acceptance at the last character. State machine 0 enters acceptance at the last`E`, and so does state machine 2. Both match the same amount of characters, but a lexical analyser can only report the detection of one pattern at a particular position in the character stream. There is no general logical rule here that can be applied to determine dominance of one pattern over the other. At this point the decision must be made by the user. Usually, this happens by sequence, i.e. the pattern that was defined first, matches in case of equal length\footnote{Note, that we are talking here on the state machine engine level. The rules for pattern precedence in derived modes as mentioned in section style=ref are actually translated into privilege rules on the engine level. The design of queχ is laid out in a way for using a database that maintains pattern-precedents. However, this has no practical advantage since base mode precendents can be expressed by the order that patterns are defined. This means, that pattern that are inherited from based modes simply have to be converted into state machines before the derived mode's patterns. This way it is guaranteed that the base mode's pattern state machines have lower ids and higher precedence.}.

\definition{{\bf A Pattern Precedence} is a user- (or inheritance-)determined
    relationship between two patterns. It is a statement that tells which
        pattern proceeds if both match the same amount of characters.}
\definition{{\bf A Dominating Pattern} is the pattern in a set of pattern that
    can match in a particular state, which 'wins', i.e. that has the
    highest precedence.}

The construction of a lexical analyser includes the construction of a single state machine that is equivalent to the bank of patterns which were shown in state machine which is the result of such a construction. In particular, this involves a construction of a deterministic finite automaton (DFA) from the pattern bank, which is a non-deterministic finite automaton (NFA). The pattern bank allows that the state machine is in multiple states at the same time, a DFA is only in one particular state at a particular time. The state machine that results from this operation is still to be optimized. A so called Hopcroft optimization \cite{} reduces the number of states to a minimum. The result in the aforementioned case is that the state machine of figure style=ref.

Queχ allows the definition of patterns that only trigger in a certain environment of characters. The ability to define borders of a pattern that are not part of the pattern itself is implemented by means of so called pre- and post-conditions. The following two subsections introduce these two

Post-Conditions

A post-condition is a condition on the character stream after the real pattern has matched. When the real pattern notifies a match at position X, the analyser needs to proceed in the character stream until it determines whether the subsequent characters match the post-condition. If the post-condition is met it reports acceptance---however, the lexical analyser's focus is put back to position X.

figures/post-conditions-patterns.pdf
Figure: State machine matching a post condition pattern.

An example of a post-condition is shown in figure style=ref. The rule for this pattern is that when a number comes in that is followed by a currency, then it has to be reported as {it amount}, rather than a number. In this example there are two currencies: EUR for Euro and USD for US-Dollars. When a number is matched---with or without a floating point---then we reached the position to which we want to step back, if it was an {it amount}. These acceptance states have a dashed inner circle, to indicate that the input position has to be marked, but we're not finished yet. Only after one of the currencies EUR or USD appeared, finally, an acceptance state is reached. But, the following analysis needs to start at the position, backwards, where the number matched.

figures/post-conditions-patterns-sequence.pdf
Figure: Sequence of analysing steps detecting a post-conditioned pattern.

Figure style=ref displays the sequence of actions for an incoming character stream. When the 12 has matched as a number the input position 2 is stored but the analysis continues. When USD has matched it is now determined, that the previous number was indeed an amount. A token AMOUNT(12) can be reported. But, but one needs to go back to the position where the amount ended, i.e. to position 2. From there, it detects a currency USD and reports a token CURRENCY("USD"). The same thing happens now with the number 45.10 and the currency EUR in the remainder of the stream.

Post-conditions are implemented by setting a store-input-position mark at the states where the real pattern matches. The end of the post-condition is an acceptance state. Figure style=ref depicts the states of the amount/currency pattern webbed into a state machine that contains other patterns. The process of combining and optimizing the state machine needs to make sure that the information about saving input positions and acceptance states remains---here indicated as notes on a flag.

figures/post-conditions-patterns-merged
Figure: States of a post-conditioned pattern in the combined state machine.

Pre-Conditions

A Pre-Condition is a condition on the character stream before the pattern is actually being parsed. Starting from the current position X, the analyser needs to go backwards to check whether the pre-condition is met. After determining whether it is met or not, the real pattern is analysed starting from position X.

Intuitively, it does not make sense to try to match against a pattern if it is known from the beginning that the pre-condition is not met. For this reason, it is tempting to consider a single analyzer function for each case of pre-condition met and pre-condition not met. With this approach, though, the total number of analysers to deal with $N$ pre-conditions is $2^N$ since any combination of pre-condition met and pre-condition not met has to be considered. Just a few pre-conditions would then blow the number of analysers beyond what is practical. Queχ could not be a general solution for pre-conditions.

Rather than creating an analyser for each pre-condition combination, one single analyser containing all pre-conditioned and non-pre-conditioned patterns is created. Pre-conditions are simply implemented by a conditional acceptance at the end of the pattern, i.e.

    // acceptance state ...
    if( pre_condition_X_fulfilled_f ) last_acceptance = X;
    ...

When the analyser reaches the acceptance state of a pre-conditioned pattern it notifies a match only under the condition that the pre-condition flag was raised before. From one point of view, this is inefficient since in some cases the analyser tries to match against a pattern, even though the pre-condition disallows an acceptance. On the other hand, the states of the pre-conditioned patterns are webbed altogether into one single state machine and the overhead for doing so is minimal.

\TODO{It would be an improvement to allow the user to control when a pre-conditioned pattern combination is webbed into the combined state machine and when not.}

Generation Steps

In detail, the process of generating a lexical analyser consists of the following steps:

- Code generation for the given combined state machine section [sec-inside-queχ-code-generation-core]. This consists of generation of code for:

Thomson Construction

The lexical analyser that queχ creates is based on regular expressions. In order to create state machines that represents regular expressions, it is necessary to provide operations that combine elementary regular expressions through concatenation, parallelization, optionality and_ repetition_. This section discusses how this is accomplished by means of the so called Thomson Construction \cite{}.

An essential concept is the so called $\epsilon$-transition. By means of this special transition state machines can be connected in sequence and in parallel. It builds a very basic concept in the Thomson construction for developing a state machine that represents a regular expression.

\definition{{bf An $\epsilon$-transition} is a transition that does not require any input. If a state A is connected with a state B via an $\epsilon$-transition, then entering state A implies that the state machine enters state B.}

In a sense, an $\epsilon$-transition behaves like a free-ride where no input character is necessary to trigger a subsequent state. However, those transitions imply that the state machine become non-deterministic, since it can now be in multiple states at the same time.

A very important principle which facilitated writing algorithms on state machines was that every state-id in every state machine is unique\/. That is, a state machine A cannot have a state-id that also appears in B. Further, state machines are mere collections of states. States are mere mappings that describe what subsequent state is reached by what trigger. Operations on state machines can then be applied by setting transitions, i.e. changing the mappings of states, and finally collecting all mappings (i.e. states) in a single state machine. The uniqueness of state-ids ensures consistency.

The following sections discuss the elementary operations on state machines that are necessary to describe regular expressions.

Mounting State Machines in Sequence

Mounting two state machines together in sequence is fairly simple. If state machine B is to be mounted after state machine A, then the following is to be done:

figures/pattern-sequentialize.pdf
Figure: Mounting two state machines in a sequence.

These steps imply that the acceptance state of the last state machine remains. In Queχ, an algorithm is implemented that does the sequentialization for any number of state machines---not just two---in one step. In figure style=ref a state machine is created representing a matcher for floating point numbers out of the sequentialized patterns [0-9]+, a dot ., and [0-9]+}.

Mounting State Machines in Parallel

In order to implement the logical or operation in regular expressions it must be possible to mount in parallel. Therefore, all acceptance states of the original state machines to be mounted in parallel need to remain acceptance states. All initial states need to remain initial states. However, there can be only one initial state. At this point the $\epsilon$-transition comes handy again. The following steps describe the process of mounting state machines in parallel:

Figure [fig-mount-parallel] shows the result of combining three state machines in parallel. By means of this procedure there is only one initial state and one terminal acceptance state for the state machine. Now, regular expression such as "hello"|"bonjour" can be implemented as two parallel state machines for "hello" and "bonjour".

figures/pattern-parallelize.pdf
Figure: Mounting two state machines in parallel.

Implementing Repeated State Machines

Repetition must govern the following scenarios:

R*

repetition of a pattern`R` zero or an arbitrary number of times.

R+

repetition of a pattern`R` one or an arbitrary number of times.

R{x}

repetition of a pattern exactly x times.

`R{x,y}

repetition of a pattern at least x and at maximum y times.

R{x,}

repetition of a pattern an arbitrary number of times, but at least`x` times.

R?

repetition of a pattern`R` zero or one time.

The first case`R$$*$$` is the so called Kleene-closure \cite{}. Again, by means of the $\epsilon$-transition this can be achieved by the following steps:

Figure style=ref shows the example of a Kleene-Closure for the pattern [A-Z]+":"}, which would match a C-style label. The other forms of repetitition are closely related to that. The one or arbitrary repetition R+ can be achieved by mounting the same state machine in front of the Kleene closure, ensuring that the pattern has to match at least once. Similarly minimum repetitions are achieved by mounting the state machines N times in front of the Kleene-Closure, where N is the minimum number of times the pattern has to be matched.

figures/pattern-repeat-kleene-closure.pdf
Figure: Pattern repetion by Kleene Closure.

Where a maximum number M of repetitions is involved, one cannot rely on the Kleene Closure anymore. The state machines have to be mounted M-N times in sequence, where the acceptance states in between remain acceptance states. This way one can be sure that any number of matches from N to M corresponds to an acceptance state.

A special case is the case where zero and a maximum number of repetions are involved, i.e. R? and R{x,y} with x=0. In this case, simply the initial state is raised to an acceptance state, and then the usual algorithm for generating a maximum number of repetitions is applied.

Labeling State Origins

The previous sections discussed how to generate a state machine for one isolated regular expression using the Thomson Construction. The final lexical analyzer, though, needs to have all possible patterns present, i.e. combined into an automaton in order to judge which pattern matches from the current input position. The approach that was chosen in queχ was to produce a single deterministic automaton from the set of state machines for each pattern. This process is described in the following sections.

It was already discussed that information about acceptance section [sec-basics-pattern-matching], storing of input position ssection [sec-inside-queχ-intro-post-conditions] and dependence on pre-conditions section [sec-inside-queχ-intro-pre-conditions] need to be stored as hidden state variables\cite{} inside the states of the isolated pattern state machines. This information needs to be maintained when the states dissolve into the super state machine that contains all patterns. This happens by labeling them with origin information.

As for acceptance, no special information carrier is required since the Thomson Construction and the Hopcroft Optimization treat them appropriately. A state in the super state machine has acceptance if one of its original states had acceptance and it does not have acceptance if none of its original states had acceptance. It remains to store information about which pattern has raised the acceptance in a particular acceptance state, i.e. some type of pattern-id.

Without post-conditions, an acceptance state stands for storing the input position. If the pattern that raised acceptance at this point wins, then the analyzer's focus must be set back to this position and the next analysis starts from this point. With post-conditions acceptance and storing the input position are no longer the same. When the core pattern has ended, the input position must be saved. But, at this point the state machine has not yet reached acceptance. It must first reach the end of the post condition. Then the input position can be put back to the end of the core pattern. Thus a flag is required telling wether or not it is required to store the input position.

The input positions for different post-conditions must be stored in different variables, since only at the end of the match can it be decided which pattern actually succeeded. A flag indicating that the original pattern was post-conditioned tells the code generator to use the pattern-id as the indicator for the variable storing the input positions.

The existence of pre-conditions imply the check at an acceptance state if a particular pre-condition has been met. Thus, a variable is required containing the pre-condition which has to be met in order to trigger acceptance. A special pre-condition is the begin-of-line pre-condition. It is a trivial pre-condition, since it does not involve an inverse state machine. Thus it is treated as a separate flag.

Inside Queχ, there is a data structure dedicated to store this information, called class`StateOriginInfo`. It contains the following fields:

\item [\tt state_machine_id] carries the id of the original state machine. This identifies the pattern to which the state belongs.

\item [\tt state_index] contains the index of the original state.

\item [\tt store_input_position_f()] gives information about the input position to be stored or not. In the table below referenced as {bf S}.

\item [\tt post_conditioned_acceptance_f()] indicates whether the origin of the state relates to a post condition. In the table below referenced as {bf post}.

\item [\tt pre_condition_id] indicates the id of the pre-condition that has to be fulfilled so that the acceptance of the state is valid. In the table below referenced as {bf pre}.

\item [\tt trivial_precondition_begin_of_line_f] indicates whether the pattern is supposed to start at the beginning of a line, i.e. after newline or the start of the buffer. In the table below referenced as {bf bol}.

Note, that there is nothing like a flag to indicate trivial post-condition end of line. This is for the simple reason that this trivial post condition is internally translated into a post-condition end of line or end of file. There is no mechanism that could gain some speed here. For the begin of line pre-condition though, one can rely on the last character that was considered.

A state in the super state machine contains a list of origins, each origin corresponds to an original state of the isolated patterns. The following table shows the possible combinations of these parameters together with the acceptance flag {bf AccF} of the state that may be labeled with these origins:

& {bf AccF} & {bf S} & {bf post} & {bf pre} & {bf bof} \\ \hline \hline %% normal acceptance state & must & yes & --- & --- & --- \\ \hline normal non-acceptance state & ? & --- & --- & --- & --- \\ \hline end of post-conditioned core & ? & yes & yes & --- & --- \pattern & & & & & \\ \hline end of post-condition & must & --- & yes & --- & --- \\ \hline acceptance of pre-conditioned & must & yes & --- & yes & no! \pattern & & & & & \\ \hline acceptance of begin-of-line & must & yes & --- & no! & --- \pre-conditioned pattern & & & & & \\ \hline

Note, that must for the acceptance flag means that the state in the final state machine that contains this origin has to be an acceptance state. The ? means that in the original state machine the state was not an acceptance state, thus the state that contains it together with others may or may not be an acceptance state. Note also, that pre-conditioned patterns and trivially pre-conditioned patterns are mutually exclusive.

In order to help get easy access to the origin's state type the member function type' returns direct information about the type of the original state, i.e. if it was a normal acceptance, a non-acceptance, a post-conditioned state (end of core pattern), an end of a post condition or a pre-conditioned acceptance state. In case the constitution does not lead to any valid interpretation, the function replies with an error. This would indicate that some functions do not work propperly. This allows for a double-check that all origins have passed the state machine construction properly.

Subset Construction—NFA to DFA

The conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA) brings the advantage that at a particular point in time the state machine is in one particular state---not in multiple states as with the NFA. This not only reduces complexity, but also decreases memory requirements (to store state indices and pattern match-management) and computation time of the resulting lexical analyser. The key to doing so is to combine states that the NFA takes simultaneously into a single state. Whenever a state in the NFA triggers on the same character to more than one state, the states that are reached by this trigger are combined into one single state. The process relies on two sub-processes described in this section: 1) construction of a so called epsilon closure, and 2) determining unique state combinations for the range of all triggers.

The core idea of this process is to find all possible super states of the NFA, i.e. the state combinations that can exist at a particular point in time. Then trigger sets have to be found that trigger the transition from one state combination to the other. The state combinations can then be renamed as a single state and it is sure that the resulting state machine is deterministic. By requesting that the acceptance state of a combined state is acceptance as soon as one of its member states is acceptance the entire mechanism of the NFA is preserved.

figures/epsilon-closure.pdf
Figure: The $\epsilon$-closure.

The notion of $\epsilon$-triggers enabled a very efficient method to sequentialize, parallelize and repeat state machines. As soon as a state machine contains an $\epsilon$-trigger it becomes a NFA. If a state A transits on an $\epsilon$ trigger to a state B, then the state machine is in state A {it and} B simultaneously as soon as A is reached, because an $\epsilon$-trigger does not require any input. This is one reason why the result of those operations is an NFA. In order to help with the $\epsilon$-transitions the $\epsilon$-closure is defined:

\definition{An {bf Epsilon Closure} of a state $n$ is the aggregation of all states that can be reached from $n$ via an epsilon transition.}

Consequently, when treating a state, one first collects all states that can immediately be reached via $\epsilon$-transition into a new state. This is shown in figure style=ref. This new state $n0$ contains all transitions of all states of the $\epsilon$-closure. It is supposed to be equivalent to the super state of the NFA which is in all of the states of the epsilon closure simultaneously. Note, however, even though state 23 in the figure is part of the $\epsilon$-closure, as a target state it is kept isolated referred to its maiden name rather then as part of $n0$ %% \footnote{From a super state point of view, state 23 which is reached through the recursion of itself, is not the same as the state 23 being entered from state 0. Also, $n0$ as being a super state does not belong to the same world as the original states.}.

figures/NFA-to-DFA-transitions.pdf
Figure: Transscription of a state set into a single state of an DFA.

The next step consists of finding state combinations and the trigger sets by which they are reached from the new state $n0$. Figure style=ref shows an example of a state $n0$ where some characters trigger to multiple states at the same time. This means that the state machine would be in more than one state at the same time, and can, therefore, not be a DFA. On the other hand, one can definitely say that {it one particular character} triggers {it one state combination}. Combining the states of each character trigger to a state combination, again allows the one-to-one (injective) mapping from a trigger to a state---and the conditions for a DFA are met. The search of a distinct trigger map consist of the following:

\definition{A {bf distinct trigger map} from a state $n$ is based on target state combinations that are reached by exactly the same trigger. A set of target states can be combined, if there is at least one common trigger (character code point) by which they are reached from state $n$. When all target state combinations are found, they can be renamed as single states, and the trigger map describes a one-to-one mapping from incoming characters to target states.}

figures/NFA-to-DFA-trigger-history.pdf
Figure: Combinations of follow-up states displayed against the code points of a characters set—similar to a time history.

Traditionally \cite{}, \cite{} the detection of suitable state combinations considers each single character of the trigger map isolated. This works fine as long as the code set does not exceed the oldtimer 8-bit ASCII character set. Each state in the NFA then requires 255 checks. For a Unicode-based lexical analyser this behaves significantly different. Unicode 5.0 currently has $10FFFF, i.e. decimal 1.114.111, code points — not to speak about code points for private usage. Consequently about 140.000 times checks would have to be performed for each state in the state machine. In order to avoid such a waste in calculation time, the author of queχ developed a different algorithm based on character ranges as shown in figure style=ref. The algorithm identifies state combinations which are related to certain overlapping ranges of characters.

As mentioned in [sec-character-range-representation], the queχ engine relies on a description of triggers in terms of number ranges. Instead of checking each character, the algorithm considers the numeric ranges R_{A,B} in which a state A triggers to state B. If a range R_{A,B} and R_{A,B} intersect, then B and C are states which are entered simultaneously in the NFA. Thus, they have to be combined in the DFA. For each set of target state combinations the intersection of all trigger ranges gives the range on what the combination is triggered. It can now be renamed, i.e. a new state can be created that is triggered based on any character that falls into the intersection.

The computation of intersections between all possible target states can also be simplified. Again, the line-up of the ranges on the time line comes handy (figure style=ref). This time line is implemented by begin-trigger-target-state and and-trigger-target-state commands that appear at certain code points. Finding intersecting ranges then consists only in walking down the timeline, adding and removing target states on any such command, and storing the intersecting domains together with the target state combinations. With this approach it is possible to compute most suitable lexical analyzers in the frame of few seconds on present day computing machines.

Now, that multiple states are combined into one state, the question remains about the acceptance type of the combined state? The answer becomes simple, though, considering that we try to match patterns and that a pattern match relates to an acceptance state:

The transformation from NFA to DFA is a process that is not only applied during the construction of state machines for isolated patterns, but also on the state machine that combines all patterns to be matched. In this case, the information about the origins all have to be kept. This includes the information about the original state. If a non-acceptance state from pattern X is combined with an acceptance state from pattern Y, we want to be sure, that the lexical analyser does not notify that X matched. A correct way to determine the winning pattern is to consider the origins that were acceptance states, determine the most privileged winner.

The following very important assumption has to hold for the process of converting a set of state machines, i.e. a NFA, into a single state machine, i.e. a DFA:

\definition{ Let $P_{a, b}$ be the set of paths from a state $A$ to a state $B$ in a state machine (constituting a parallel branch of the NFA). resulting DFA. For the DFA to uniquely represent the NFA, one must the original one, i.e. %% }

A direct consequence of this is that if a state X in a pattern can only be reached over a state Y, then in the resulting state machine this holds equally. The following algorithm demonstrates in pseudo-code the process of constructing a DFA that corresponds to an NFA. It can be applied for single patterns, as well as for the combined state machine that represents the lexical analyser.

state-set-worklist = { [ init-state ] }

while worklist not empty:
    new worklist = empty
    for each state-set in worklist:
        ec = epsilon-closure of state-set
        new-worklist = distinct-target-state-combinations of ec
        set triggers for ec

    worklist = new-worklist

Hopcroft Optimization

In some cases, such as in the example shown in figure style=refa, the structure of a state machine can obviously be optimized. Note, that the sole task of the state machine is to determine if the sequence of characters appeared that matches a pattern. In this sense, there is no difference between all acceptance states. As a direct consequence, if there is a set of acceptance states, such as state 1, 2, and 3 in figure style=refa, where all characters trigger to another acceptance state, then those acceptance states can obviously be combined into a single one. This section concretizes this idea and develops the process known as the Hopcroft optimization.

{Combining equivalent states of a state machine.} image:figures/hopcroft-before.pdf] figures/hopcroft-after.pdf

An essential concept of the Hopcroft Optimization is the concept of_ equivalent states_. State machines are equivalent if all possible following sequences of triggers lead to the same result. Broken down to states, one requires for any two states A and B to be equivalent that all character streams that cause failure starting A will cause failure starting from B. Further, all character streams that cause acceptance starting from A cause {it equivalent acceptance} starting from B. Concretely, this means that for all possible triggers $t$, one requires that the follow-up state from A on $t$ is in the same state set than from B on $t$.

In the case of a single pattern, the set of states can be split up into two state sets that build the original worklist: the set of states that are acceptance states and those that are not. These two state sets are the initial worklist. For each state set X in the worklist one has to check wether it behaves abnormal, wether it triggers on a character $\alpha$ to a different set of states then the rest. Now, this state set is split into normal states $X_0$ and abornal states $X_1$. The state set X from the worklist is then replaced by the two state sets $X_0$ and $X_1$. This process can then be repeated until the worklist does not change anymore, i.e. no state sets have to be split up, because the represent equivalent states.

Combining equivalent states of a state machine with states of different origin.

figures/hopcroft-2-before.pdf figures/hopcroft-2-after.pdf

As long as one deals with a single isolated pattern things end up here. For isolated patterns, it only matters if a match occured or not. For the combined state machine, that represents the whole bank of patterns more information is to be considered. Such a case is depicted in figure style=ref. Two acceptance state that report different patterns as winner are not equivalent. Thus, the state set of initial equivalent acceptance states needs to be redefined. Note, that the remaining procedure remains the same — one only has to discuss the change of the initial set of equivalent acceptance states for the case that multiple origins are involved. Here, the term {it equivalent acceptance} has to be revisited.

\definition{As long as there are no origins involved, then two acceptance states are equivalent, if and only if — well — they are acceptance states.}

For the combined state machine, though, it is important to know what pattern, i.e. original state machine, actually reached acceptance, i.e. matched. It is possible that multiple patterns match at the same time. Thus, one has to concentrate on the dominating original pattern (see section

\definition{If a state machine contains origins, then two acceptance states are equivalent, if and only if their dominating original patterns are the same.}

Post conditions do not change anything at this setup, because the acceptance state of the post-condition is only reached through the state that represents acceptance of the core pattern. Pre-conditions, though, add some restrictions. At this point, it makes sense to imagine the code produced to deal with pre- conditions:

     // entry of a state with some pre-conditions:
     if      PreCondition_Pattern5 == True: acceptance = Pattern5
     else if PreCondition_Pattern2 == True: acceptance = Pattern2
     else if PreCondition_Pattern6 == True: acceptance = Pattern6
     else:                                  acceptance = 7
     // lesser priveledged patterns cannot be dealt with:
     //     if PreCondition_Pattern36 == True: acceptance = Pattern36   // does this
     //     if PreCondition_Pattern0 == True:  acceptance = Pattern0    // make sense ?
     ...

Let us assume that the pattern priviledges are ordered the same way as they appear in the code fragment, i.e. 5 has the highest priority, if not than comes 2, if not then comes 6, and if the precondition of 6 does not hold then comes the unconditional acceptance of pattern 7. Lesser priviledged patterns (conditioned or unconditioned) do not matter, since the pattern 7 takes it all. Lets call the set of origins from the most priviledged one to the first unconditional state (if it exists) the {it list of possible acceptance origins}. Now, the following can be set about equivalent acceptance states:

\definition{If a state machine contains preconditioned origins, then two acceptance states are equivalent, if and only if their list of possible origins are the same.}

The only modification one has to do to the Hopcroft algorithm is to adapt the construction of the initial set of acceptance states. It has to be split according to the above criterion. The splitting procedure considering the triggering to equivalent states remains the same. Here is the algorithm in pseudo-code:

TODO

Traditionally, the decision wether a set is to be split is made traditionally \cite{} based on single characters. This method works fine for ASCII character streams but becomes uneffective with Unicode (see also page \pageref{}). Again, the author developed a method that determines equivalence based on character ranges that cuts down the computation time to a tiny fraction.

Code Generation

The previous sections discused the construction of an annotated state machine representing the lexical analyser. The final step of code generation remains. Code generation means to write a program that behaves like the state machine. This means, that if a character α appears at the input one has to transit to the same state as the state machine would. Further, if the state machine fails, the program needs to fail. And, if it succeeds, i.e. a pattern matches, then the program need to notify a match. Additionally, some framework needs to be created for the lexical analyser to iterate over an incoming character stream.

One particular problem has not been discussed before. Real lexical analyser search for the longest match, i.e. they try to eat as many characters as possible to achieve a match. It is now conceivable that a pattern A has matched but pattern B is still in the race and has still hope to be matched — it might only need some more characters. Thus the state machine leaves the acceptance state, but it needs to store information about the match of A, just in case that pattern B is finally not matched. Consider the two patterns return and [a-z_]+":". When a character stream "return" came in, the first pattern matched, but the second is still an option. So, if a string "return_label:" appears one can report that the second pattern matched. But, if the trailing colon is missing, we need to be able to go back and report that there was a return statement. For this reason, we need two variables:

Since one incoming character is potentially compared multiple times until it is determined to what follow-up state it triggers, one needs a variable to store the character that just arrived:input.

For the case, that post-conditions appear, variables have to be added that store the place where a particular post-condition starts, i.e. the place one has to jump back if the post-condition is successful:

`last\_acceptance\_N\_input\_position`

where`N` is the numeric identifier of the pre-condition. If pre-conditions are involved one needs to store information wether the pre-conditions are fulfilled or not. Thus variables need to be provided to store the pre-condition results:

`pre\_condition\_M\_fulfilled\_f`

where`M` is the index of the pre-condition. The lexical analyser core, generated by \queχ produces a single function, let it be called`analyse_this()` that is called in order to initiate the lexical analysis:

QUEX_ANALYSER_RETURN_TYPE
analyse_this(QUEX_ANALYSER_FUNC_ARGS) {
    // (1) variable definitions ___________________________________________________________________________
    //
    //  -- basic required variables
    int                        last_acceptance = -1;
    QUEX_STREAM_POSITION_TYPE  last_acceptance_input_position = (QUEX_STREAM_POSITION_TYPE)(0x00);
    QUEX_CHARACTER_TYPE        input = (QUEX_CHARACTER_TYPE)(0x00);\n

    //  -- variables to deal with post-conditioned patterns (optional)
    //    QUEX_STREAM_POSITION_TYPE  last_acceptance_i_input_position = (QUEX_STREAM_POSITION_TYPE)(0x00);
    //    QUEX_STREAM_POSITION_TYPE  last_acceptance_k_input_position = (QUEX_STREAM_POSITION_TYPE)(0x00);
    //    QUEX_STREAM_POSITION_TYPE  last_acceptance_l_input_position = (QUEX_STREAM_POSITION_TYPE)(0x00);
    // ...

    //  -- variables to deal with pre-conditions (optional)
    //    int   pre_condition_i_fulfilled_f = 0;
    //    int   pre_condition_i_fulfilled_f = 0;
    //    int   pre_condition_i_fulfilled_f = 0;
    // ...


    // (2) state transition code  _________________________________________________________________________
    // ...


    // (3) pattern action code / terminal states __________________________________________________________
    //     (possible function return from here)
    // ...
}

The way the function is defined leaves opportunities to adapt it to different types of environments, depending on the definitions of th`e QUEX_…-macros. The following to sections discuss the construction of state transition code and code for the terminal states. Then it has to be discussed how inverse state machines for pre-conditions are translated into code and fit into the picture. A final section explains how to modify the analyser function by the definition of the`QUEX_…-macros.

State Transition Header

Each character that is read from a character stream triggers a state transition inside the lexical analyser. This continues until the currently incoming character does not trigger any further transition. Following cases need to be distingushed:

\item[\bf Mismatch] The lexical analyser never reached an acceptance state. No explicitly defined pattern has matched. If a default action is defined than it is executed at this point.

\item[\bf Match] The lexical analyser reached an acceptance state, some time earlier---or is currently in it. The the action of the dominating pattern has to be executed.

If the current state is an acceptance state, then the id of the winning pattern and the position where it was reached has to be stored. If the current state represents the beginning of a post-condition, then one needs to store the current input position. If later on the post-condition finishes, then one can go back to it.

What pattern actually wins in an acceptance state depends on two issues. First, something that can be compiled into the code: the patterns priviledge. Seconds, something that has to be determined at run-time: dependence on a pre-condition flag. Note, that information about a post-conditioned pattern needs not to be stored, since the path to the acceptance state can only be reached through the state where the core pattern would have raised acceptance. The header of the state transition code of a particular acceptance state looks like the following:

  QUEX_LABEL__ENTRY_98:
     QUEX_STREAM_TELL(last_acceptance_position);

     // (*) patterns that start now with the post-condition
     QUEX_STREAM_TELL(last_acceptance_a_input_position);
     QUEX_STREAM_TELL(last_acceptance_b_input_position);
     QUEX_STREAM_TELL(last_acceptance_c_input_position);
     ...

     // (*) patterns that dominate, but under a pre-condition
     if( pre_condition_i_fulfilled_f ) {
         last_acceptance = i;
     }
     else if( pre_condition_j_fulfilled_f ) {
         last_acceptance = k;
     }
     else if( pre_condition_k_fulfilled_f ) {
         last_acceptance = l;
     }
     ...
     else {
         last_acceptance = X;  // the dominating pattern
     }

If the state is not an acceptance state, then this header is simply empty. However, both types of states, acceptance and non-acceptance states share the following code fragment:

     // (*) get a character from the input stream as a trigger
     QUEX_STREAM_GET(input);

     // ... it follows: the state transitions ...

It simply reads the next character from the input stream. It is canned into a macro, so the user may adapt its underlying procedure to his way of representing the character stream (e.g. as a buffer, an object of`istream`, etc.). The code fragment for the state transition has to determine what trigger triggered, and therefore what the subsequent state will be.

State Transition Triggering

The dynamic behavior of a state machine is uniquely determined by the transitions of its states. For a DFA, it can be assumed that it is in one distinct state when an incoming character arrives. The incoming character determines, according to the triggers of the state, what the subsequent state will be. The next to sections discuss the code generation for a state of the lexical analyser state machine. The operations to be coded for a state can be subdivided into writing a state header (section \ref{}) which does some bookkeeping and writing code for the trigger transitions (section \ref{}).

As mentioned earlier, \queχ's engine is based on character ranges. This enables a fast algorithm to bracket the incoming character. Similar to binary search\cite{}, it determines if the incoming character lies above a border of the middle interval, then higher and lower intervals are checked. Of course, one first checks if the character fits any interval. Imagine an acceptance state for pattern 21 that has the following trigger map:

mismatch & $\longrightarrow$ & terminal 21 \ $[$A-H$]$, $[$J-Z$]$ & $\longrightarrow$ & state 94 \ : & $\longrightarrow$ & state 95 \ I & $\longrightarrow$ & state 99

The current implementation produces the following code to implement these transition rules:

    if( input >= 58 && input < 91 ) {
        if( input < 65 ) {
            if( input < 59 ) {
                goto QUEX_LABEL__ENTRY_95;
            } else {
                goto QUEX_LABEL__TERMINAL_21;
            }
        } else {
            if( input < 73 ) {
                goto QUEX_LABEL__ENTRY_94;
            } else {
                if( input < 74 ) {
                    goto QUEX_LABEL__ENTRY_99;
                } else {
                    goto QUEX_LABEL__ENTRY_94;
                }
            }
        }
    }
    // no trigger triggered
    goto QUEX_LABEL__TERMINAL_21;

The first if-statement checks wether the`input` lies in the borders of the trigger map. If not, one can directly jump to the terminal state 21. Then one checks wether the input is less than 65, i.e. an A, and then brackets the input down for each interval. This way of generating code seems to leave room for improvement, though. To check wether the input is : (character code 58) we test against $>=\,58$, $<\,91$, $<\,65$, and $<\,59$, instead of a simple check $==\,58$. A hand written transition code may look like the following:

    if( input == 58 ) goto QUEX_LABEL_ENTRY_95;
    if( input == 73 ) goto QUEX_LABEL_ENTRZ_99;
    // HERE: input != 73, therefore we can check against A-Z
    if( input >= 65 && input < 91 ) goto QUEX_LABEL_ENTRY_94;
    // no trigger triggered
    goto QUEX_LABEL__TERMINAL_21;

However, if one does too many single checks one would loose the advantage of binary bracketting and the result would be much less efficient. The following table shows an estimated cost in terms of comparisons with respect to the two approaches:

mismatch (before) & 1 & 3 \ mismatch (after) & 2 & 4 \ mismatch (inside) & 4 & 4 \ $[$A-H$]$ & 5 & 4 \ $[$J-Z$]$ & 4 & 4 \ : & 4 & 1 \ I & 5 & 2 \\ \hline \hline

A meaningful comparison of the efficiency of the two approaches, one would have to multiply the results with the probabilities that these characters sets occur. In the above example, the code size of the handwritten transition is much less than the solution created automatically. Determining the optimal algorithm that takes a parameters memory-size-speed-tradeoff and computes the optimal state transition would be an interesting contribution to the \queχ project. The author of this text welcomes any propositions to improve the generation of transition code.

Appendix: The Buffer

\framebox{vbox{NOTE: The information here is not complete, since the buffer handling was re-designed. Add to this chapter inconv support and input strategies.}} \vskip 0.2cm

Instead of relying on direct read and write operations from storage devices and transmission lines, it is generally advantegous to profit from the fact that data is usually transported faster when it is transported in some larger blocks. When doing lexical analysis one treats the data stream character by character. To send a request for each character and wait for the reply each time a new character is read is obviously not a very promissing approach. Independent on what device or line there is in the background the solution that comes handy is: a buffer in the system memory. A buffer loads a larger blocks of data and keeps them ready-to-read until its content is read completely. Then it loads the next larger block of data.

figures/basic-buffer.pdf
Figure: Structure of quex's buffer.

The structure of queχs buffer is shown in figure style=ref. The buffer itself is not more than a continous chunk of memory. It has two places to store border characters. One is located at the front to store begin of buffer or begin of file. The other is located at the end to store end of buffer or end of file. However, the end of file character may come anywhere from the start of the buffer to the end. Details about these limit characters are explained below. Thus the actual buffer size is two elements greater than the actual number of content elements. A current pointer iterates from left to right\footnote{In this figure it is left-to-right. This basically means from low memory addresses to higher memory addresses. It had absolutely nothing to to with the directionality of the character encoding.} to go forward. The element to which it points designates the current input to the lexical analyser state machine. If it reaches a border new content has to be loaded. The fallback area allows to prevent immediate loads in the opposite directions if one requests to go back immediately after loading.

The following sections explain the mechanisms of queχs buffer, some basic API functions for lexical analysis, the loading procedures, and the buffer creation procedure.

Programming Interface

The extreme efficiency mentioned in the last section is payed off, though, by some deveilement of the buffer internals to its API. This is not beautiful, from a design prespective, but---well, very efficient. The following operations need to be implemented for the buffer:

-get_forward(): increments the pointer to the current character and returns the content. If a limit is reached, then a limiting character code is returned. Again, this has not to be checked before the transition check, because if it is a limit character it drops out anyway. -get_backward(): decrements the pointer to the current character and returns the content. If a limit is reached, then a limiting character code is returned the same way as above.

-tell_adr(): returns the position of the current pointer _in memory_. -seek_adr(position): sets the current pointer to the position given as first argument. The first argument, therefore, designates _a memory address_---not a stream position. -seek_offset(const int): adds the given offset to the current pointer.

This interface requires some care to be taken\footnote{Note, that this API is only used by code that is autogenerated. The human end-user is not confronted with such a fragile interface.}. It is not possible to call get_forward() blindly and get the whole stream of data. If a limiting character is returned, it is mandatory to call {tt load_forward()} and {tt get_forward()} again. This spares the comparison against end of buffer and end of file at each get\footnote{Instead, a get is not more than a pointer increment and a dereferencation}. The function {tt get_backward()} works in exactly the same manner.

The positioning functions {tt tell_adr()} and {tt seek_adr()} are implemented with the same spirit of stingyness. The return and receive memory addresses. Thus, any address taken with tell needs to be adapted at after any call to {tt load_forward} or {tt load_backward()}footnote{Again, such a requirement on the usage is nearly insane to be demanded from a human user. Queχs code generator, though, has no problem with that.}. The {tt seek_offset()} function currently only adds a value to the current pointer\footnote{When the {tt NDEBUG} flag is not set, it is checked wether the result is inside the boundaries of the buffer}. This function, actually, only used for unit tests. During analysis the absolut addressing functions are applied.

Mechanisms

The buffer implementation in queχ takes advantage of the mechanism of lexical analysis, and manages to keep the overhead of buffering extremely low. This works as shown in figure style=ref. First of all, some content of the information stream in stored in a fixed chunk of memory, the buffer. During the lexical analysis one basically iterates through this chunk. Whenever a new input character is required to determine a state transition, the pointer to the current character, short the current pointer, is increased and the input is assigned the content to what this pointer points. The buffer limits, begin of file, and end of file is determined through special characters. Thus, if the input that is received equals those, then one touched either a buffer limit, the begin of the file, or the end of the file. The trick here is that the transition checks of any state will drop if those characters occur. In any other case the business can continue as usual.

figures/algorzthm.pdf
Figure: Interaction of buffer and the lexical analyser state machine.

The mechanism described above is highly efficient, since there is a zero overhead as long as no limit is reached. Only when the buffer limit is reached some overhead occurs for identifying that the drop out occured either due to buffer limit or end of file (begin of file, if one iterates backwards during pre-conditions). Already with a buffer size of 64KB, the buffer limit is only reached every 65536 characters. With an average of 4 operations per character the time overhead should not exceed 0.02\% for buffer loading. The speed is basically equivalent to iterating directly through system memory.

Loading from File

Figure style=ref shows the process of loading the buffer with fresh content in forward direction. Loading data from a device into a buffer is very costy. So, it is a good idea to introduce some fallback area where the end of the current buffer content is stored at the beginning. This prevents an immediate load backwards as soon as a character is required from before the current position. Also, since the current lexeme may be post-treated by some pattern action, the pointer to the lexeme start also has to lie inside the buffer. The maximum of both values defines the fallback border.

figures/load-forward.pdf
Figure: Loading new contents forwards from the stream into the buffer.

The new content from the file is then stored behind this border. If an end of file occurs, internally an end of file pointer is set to the position in the buffer where this occurs (most probably before the end of the buffer, but not necessarily). It is stored in a pointer, because it needs to be moved during backward loading. Running through the whole buffer searching for EOF would imply a tremendous slow-down. Also, the end of file character is stored at the position where end of file occured. This way, the`get_forward()` function will return it and the state transition can react on it.

The setting and unsetting of the buffer limit characters and end of file characters at the borders is handled by the load function. Basically, when a border of the buffer does not represent a begin or end of file, it is set to the buffer limit code character. Else, it contains either the begin of file or the end of file character. Note, that the begin of file character, if it occurs, occurs only at the begin of the buffer. The end of file character may occur at any position.

figures/load-backward.pdf
Figure: Loading new contents backwards from the stream into the buffer.

Figure style=ref shows the process of backward loading. Unlike loading forward not such a large amount of memory is newly loaded. Note, that backward loading is only necessary if the fallback buffer was not enough. Thus, it implies some abuse of design that is caught and treated seeminglessly. However, a basic assumption about lexical analysis is that the main direction is`forward\/`. Instead of loading a large bunch of memory before the current position, loading backwards only loads about the first third of the buffer with backwards content from the file. The two thirds of buffer remain intact. Thus, when lexical analysis finally goes forward again, there is still much room before a forward loading is necessary.

Creation of A Buffer Object

Queχs buffer is implemented as a template. By this means a maximum of flexibility is achieved. The following parameters can be passed to the template:

Those are the parameters to the class`basic_buffer`. The header, though, defines a default setting for the class`buffer`. If these default do not interfere with some abnormal goals (such as using EOF inside, not at the border, of a regular expression), then these values can be adapted. The easiest way to to so is to adapt the typedef' statement in the header file:

    typedef basic_buffer<0x1, 0x2, 0x0, MyOverflowPolicy> buffer;

Then, again`queχ::buffer` can be used to name the buffer class at any given place. Constructing a buffer is possible with two basic approaches. The first approach is suitable for systems where dynamic memory allocation is not an issue. It leaves the buffer allocation to the constructor of`buffer`. The second approach can be used for systems where dynamic memory allocation is an issue, such as for embedded systems. In these systems memory is a costy resource and there is no room for slow memory management routines. In this case a pointer to a location in memory can be passed together with the number of bytes that can be taken from it.

Input Methods

Editors

Martin Miller <martin.miller7@gmail.com>

Malathi Voodi <test@tst.com>