The queχ program generates a lexical analyser that scans text and identifies patterns. The result of this lexical analysis is a list of tokens. A token is a piece of atomic information directly relating to a pattern, or an event. It consists of a type-identifier, i.e. the token type, and content wich is extracted from the text fragment that matched the pattern.
Figure (here) shows the principle of lexical analysis. The lexical analyser receives a stream of characters "if( x> 3.1 ) { …" and produces a list of tokens that tells what the stream signifies. A first token tells that there was an if statement, the second token tells that there was an opening bracket, the third one tells that there was an identifier with the content x, and so on.
In compilers for serious programming languages the token stream is received by a parser that interpretes the given input according to a specific grammar. However, for simple scripting languages this token stream might be treated immediately. Using a lexical analyser generator for handcrafted ad-hoc scripting languages has the advantages that it can be developped faster and it is much easier and safer to provide some flexibility and power. This is to be demonstrated in the following chapters.
The following features distinguish queχ from the traditional lexical analysers such as lex or flex:
Ease. A simple as well as a complicated lexical analyzer can be specified in a very elegant and transparent manner. Do not get confused over the set of features and philosophies. If you do not use them, then simple skip the concerning regions in the text. Start from the ready-to-rumble examples in the ./demo subdirectory.
A generator for a directly coded lexical analyzer featuring pre- and post-condtions. The generated lexical analyzer is up to 2.5 times faster than an analyzer created by flex/lex.
Sophisticated lexical modes in which only exclusively specified patterns are active. In contrast to normal lex modes they provide the following functionality:
Inheritance relationships between lexical analyser modes. This allows the systematic inclusion of patterns from other modes, as well as convinient transition control.
transition control, i.e. restriction can be made to what mode a certain mode can exit or from which mode it can be entered. This helps to avoid that the lexical analyser drops by accident into an unwanted lexical analyses mode.
Mode transition events, i.e. event handlers can be defined for the events of exiting or entering from or to a particular mode.
Indentation events, i.e it is possible to provide an event handler for the event of the first appearing non-whitespace in a line. This event handling happens quasi-paralell to the pattern matching.
A default general purpose token class. Additionally, Queχ provides an interface to run the lexical analyser with a user-defined token class.
A token queue so that tokens can be communicated without returning from the analyser function. The token queue is a key for the production of implicit tokens, i.e. tokens that do not relate directly to characters in an analysed character stream. Those tokens are derived from context. This again, is a key for defining redundancy reduced languages.
Automatic line and column numbering. These features can be turned off, in case that one fears performance loss. It can be determined excactly from which line and which column a recognized pattern started and where it ended.
The present text first explains briefly the basic concepts of lexical analysis in queχ. Here, a short review is given on lexical analysis, but then it concentrates on the introduction of the features mentioned above. The subsequent chapter dicusses a simple example of a complete application for lexical analysis. Finally, the last chapter elaborates on the formal usage of all features of queχ.
Before, one can start with the installation of queχ, one has to make sure that Python (http://www.python.org) is installed. Most linux distributions provide handy rpm packages—so this should not be an issue. Then, only the following steps are to be followed:
Extract the file queχ-x.x.x.tar.gz into a directory that fits your little heart's desires.
Setting the environment variable QUEX_PATH in your system environment to the place where you installed queχ. If you are using a Unix system and the bash-shell add the following line to your .bashrc-file: export QUEX_PATH=the/directory/where/queχ/was/installed/ if you installed queχ in the directory given on the right hand side of the assignment.
Make a link from the file $QUEX_PATH/queχ-exe.py to $EXECUTABLE_PATH/queχ where $EXECUTABLE_PATH is a path where executables can be found by your system. If you work on a unix system, you might want to type ln -s $QUEX_PATH/queχ-exe.py /usr/local/bin/queχ You might want to ensure executable rights with chmod a+rx $QUEX_PATH/queχ-exe.py, chmod a+rx /usr/local/bin/queχ.
Supplying your C++ compiler with the include path $QUEX_PATH. For the vary vast majority of compilers this means that one has to add -I$(QUEX_PATH) to the list of compiler flags. An example of how this is done can be observed in the test applications which come with the distribution of queχ.
That is all. Now, you should either copy the directories ./demo/* to a place where you want to work on it, or simply change directory to there. These directories contain sample applications 000, 001, $\ldots$. Change to the directory of the sample applications and type make. If everything is setup propperly you will get your first queχ-made lexical analyser executable in the frame of some seconds.
The example applications depict easy ways to specify traditional lexical analysers, they show some special features of queχ such as mode transitions, and more. Each demo-application deals with a particular feature of queχ:
Each demo-application deals with a particular feature of queχ:
demo\000 shows how to setup a lexical analyzer very quickly.
demo\001 demonstrates basics on modes and mode transitions.
demo\002 contains a sample application for an indentation based language.
demo\003 implements a lexical analyzer handling Unicode.
demo\004 shows a lexical analyzer for C++. This directory shall contain a benchmarking system in order to compare queχ generated engines with others.
demo\005 explains how to do lexical analysis with files that include other files.
The author of this document suggests that the user might have a look at these sample applications first before continuing with the remainder of this text. With the background of easy-to-use examples that may serve as starting point for own efforts, it should be natural to get a feeling for the ease of queχ.
The Software is distributed under LPGL, provided the following restriction:
|
Note
|
There is no license for products targeting military purpose. |
Note, that if any part of the LPGL contradicts with the former restrictions, the former restrictions have precedence. The no warranty clause holds in any case. Here is a brief summary of the LPGL:
Queχ is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with this software; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
The name queχ has obviously a lexo-graphical relation to lex and flex—the most known tools for generating lexical analysers at the time of this writing. The Qu at the beginning of the word stands for quick, so hopefully the new features of queχ and the ability to created directly coded engines will generate much faster lexical analysers. Also, the process of programming shall be much quicker by means of handy shortcuts that queχ provides and the elegance that problems can be solved.
The last letter χ is a the greek lowercase letter chi. It
is intended to remind the author of this text that he actually wanted to create
a better \TeX
[As Donald Knuth explains it\cite{}, the last letter in
\TeX is an uppercase chi so the name is to be pronounced like the German word
ach (it's actually more a sound Germans utter, rather than a word with a
dedicated meaning). Analogously, the χ at the end of queχ should
not be pronounced like the x of the German word nix.]
system with optimized
source code appearance. When he realized that traditional lexical analysers did
not provide the functionality to express solutions easily he started working on
queχ—and dived into the subject much deeper than he ever thought he would.
Lexical analysis is all about patterns. The following section recalls some basics on the mechanisms of pattern matching. For a detailed description see \cite{}, \cite{}, and \cite{}. It also explains how tokens are communicated to the caller of the lexical analyzer. Further, it introduces the queχ's concept of lexical analyzer modes. Eventually, a final section explains the control and handling of mode transitions.
Pattern matching can be described by a finite state automaton where incoming characters play the role of triggers. Each incoming character, i.e. trigger, determines wether the state machine transits into a new state or it remains in the same state. Consider figure [fig:pattern-state-machines] that depicts a state machine to match a floating point number. In its initial state it waits for a digit (the START state). If no digit arrives, the state machine goes into the FAILED state. It can be said for sure, that this was no number. If a digit arrives, though, in transits into DIGITS-A. As long as digits arrive, it remains in this state. If a non-digit comes in it transits into the ACCEPTANCE state, since already enough digits have arrived as to judge that it is a number. However, if a dot arrived, then the state machine enters into the DOT state. If after the dot a new digit arrives, it enters into the state DIGITS-B. Note, that in the states DIGITS-A, DOT, or DIGITS-B a character from the else set (non-expected characters) lets the machine still transit into an ACCEPTANCE state, since enough characters have arrived to judge that it is a number.
In real world lexical analysis a whole bank of state machines are active at the same time. Starting from a position $x_0$ in the character stream, a_ step in lexical analysis_ consists of the following procedures:
All state machines (one for each pattern) are set into their START state.
Starting at position $x_0$, characters are eaten and the state machines transit their internal states. Some of them enter the FAIL state and drop out, others enter the ACCEPTANCE state and drop out.
At some point in time $x_1$, each of state machines either enters the
FAIL state or the ACCEPTANCE state
[Let's assume that the
end-of-stream character be included into the pattern definitions, so that
this statement holds.]
. At this point in time a decision is to be made
about who is the winner.
If all state machines reach the FAIL state, then it can be set that from position $x_0$ the character stream is not sound. The lexical analyser could only send an error-token. One looks at the next character after $x_0$, i.e. $x_0 = x_0 \,+\,1$ and goes back to step 1.
If one or more state machines reache the ACCEPTANCE state, then the pattern that has eaten the most characters, i.e. lived the longest is considered the winner (the matcher). The show continues.
The matcher pattern has an associated pattern match action to be executed when the pattern matches. Usually, this interprets the character stream from position $x_0$ to position $x_1$ and puts the information into a token, i.e. some signal as output of the lexical analyser.
In brief, a step in lexical analysis starts at a position $x_0$ in the character stream. After eating some characters, at some position $x_1$, it can be decided what pattern matches the best, and eventually the action related to the winning pattern is executed. Then lexical analysis starts at position $x_1$ all over again until the end of the character stream.
In principle, the functioning of a lexical analyser is determined by a 1) a set of patterns and 2) by their related actions. For most practical applications though one requires additionally a definition of token identifiers because the signals need to be understood by another unit (e.g. the parser).
Traditionally, lexical analysers send a single token for an identified pattern in the character stream. Practically, this means that, after a pattern match has happened the token-getter-function returns the token that was identified. Queχ, though, implements a more flexible approach which is better suited for implicit tokens and mode transitions. The following section shows how tokens are communicated by a lexical analyser created with the queχ program.
Queχ promotes the idea of implicit tokens, i.e. tokens that are produced without any pattern match. This concept raised from the fact that some tokens are, in fact, redundant and can be derived from context. From the perspective of the source code appearance of a programming language, a redundancy free language is much clearer than one that forces the programmer to specify information over and over again. Consider the example of the \verb|\|begin-\verb|\|end-regions in {LaTeX} as shown in the subsequent listing.
\begin{itemize} \item this is the first item. \item this is the second one. ... \item this is the last item. \end{itemize}
Logically, the begin and end tokens could have been derived from the fact that an \verb|\|item token pops up in normal text mode. Implicit tokens means that the lexical analyser can read between the lines. It sees things that are meant to be there from the context. Implementing the concept of implicit token generation inside lexical analysis rather than in a parsers allows the parser to constitute a unit that reasons logically, i.e. it considers distinct signals as input, it has a clear set of grammatical rules, and produces a clear understanding of the information (e.g. a syntax tree of a program). In Queχ's philosophy words that are implicitly thought to be inserted into the context of a sentence, are treated by lexical analysis. Multiple-Meanings (TODO: check this wording), i.e. the interpretation of a signal/token due to its context is expressed by grammar.
Traditionally, the parser calls a function of the form get_token(); the lexical analyser does one single pattern match, and then returns the token that represents the matched pattern---or an error, if no pattern matched. This mechanism is not sufficient when implicit tokens are involved, because implicit tokens can pop up out of nowhere in the character stream. When a pattern matches a whole set of such implicit tokens may be ready to be sent.
Figure style=ref shows the information flow when a lexical analyser generated by queχ parses a character stream. The user begins and requests a new token using member function get_token(). The engine's object start then the lexical analysis. During the analysis, the pattern matching actions may send a whole set of tokens into the _token-queue_. When the pattern matching actions decide to come back, the get_token() function returns the first token that was entered into the queue.
The subsequent call to get_token(), though, does not trigger a lexical analysis since there are still tokens in the queue. Until the queue is empty, the function only returns tokens from the queue. When the queue is empty, the lexical analysis starts again in the same manner as described above. From the caller's side the process appears to be a reading from a continuous token stream. From inside the pattern match action the process appears to be a sending process, where tokens are sent to some recipient.
In many cases the behavior of the lexical analysers can be efficiently described by means of modes for lexical analysis. A standard example are format strings. Inside format strings, there is no need for the lexical patterns that are necessary when parsing an algorithm. They are actually better turned off. Thus, a mechanism is needed that is able to control the activation and deactivation of groups of patterns to be matched.
A lexical analyser mode is a state of the lexical analyser where only a particular set of patterns is actively waiting to be matched against. A lexical analyser needs to be in exactly one mode at a particular point in time. Thus, a lexical analyser mode corresponds to the way that the input stream is currently transformed into signals, i.e. tokens. To handle the detection and influencing of mode transitions, Queχ provides event handling mechanisms as described in the following sections.
A potential disadvantages of modes is confusion. With traditional lexical analyser generators, such as flex, the end-user does not see more than a token stream. He has no inside on the current lexical analyser modes. He cannot sense or control the mode transitions which are currently made. The mode transitions are somwhere hidden in the pattern match actions. GNU Flex's start conditions are similar to modes. But, the only way two modes A and B modes can be related in flex is via via inclusion, i.e. by letting a pattern be active in in A and B. There is no convienient mechanism to say: let B overtake all patterns of A. This is where mode inheritance relationships of Queχ provide clear convenience advantages as described in the section after the following.
Object oriented programming introduced the concept of_ inheritance_\cite{}, in the sense that two classes can have a is-a relationship. The class of cars may for example be related to the class of vehicles by an is-a relationship. This means that cars inherit all attributes of the broader class vehicles and may add some specific attributes to it. Such inheritance relationships are directional. In our example this means that every object that belongs to the class of cars also belongs to the class of vehicles (every car is a vehicle). However, not every object that belongs to the class of vehicles (e.g. a submarine) does belong to the the class of cars. Mode inheritance in queχ works similar and effects the following three mechanisms:
If a mode B is derived\footnote{The statements
B is derived from A, A is base class of B, and B inherits from A
are equivalent in meaning in the context of object oriented programming.
In the same way the term derivation, base class, and inheritance
shall express mode relations in mode oriented lexical analysis.}
from a mode A, then it overtakes all of its patterns and the correspondent
pattern match actions. This is practical if one wants to ensure that
a certain mode does react on certain patterns. In this case it has to be
derived from the mode that shows the desired behavior.
{bf Rule}: If mode B is derived from A and both have a common pattern
{sf XYZ} where mode B relates it to a pattern action $XYZ_B$ and mode A
relates it to a pattern action $XYZ_A$, then the pattern action $XYZ_B$ is
canceled. A base pattern-action pair overrides any derived pattern
action pair of the same pattern.
If a mode B is derived from a mode A, then B's
event handler will be executed after A's event handlers. For example, A may
have an event handler on_entry for entering mode A. Mode B itself
might have an event handler for on_entry. If the lexical analyser
now enters mode B, the event handler of its base mode A is first executed
and then the event handler of class B itself.
{bf Rule:} _Event handlers are executed in sequence from base to
derived classes._
If a mode B is derived from a mode A, then
any permission of entry or exit to A is also a permission for B to
enter or exit. The philosophy behind this is that that due to pattern
inheritance mode B will behave like mode A, so it can be trusted the same
way that one trusts A.
{bf Rule:} Entrys and exits that are allowed for a base class
are also allowed for all its derived classes.
Figure style=ref shows an example where a mode A acts as a base mode for mode B. Internally, queχ creates a new mode B* that contains the patterns that were explicitly givent by the definition of B and the patterns that are inherited from A. The pattern match action of {sf XYZ} from mode B is omitted, since mode B has to behave like mode A for patterns specified in mode A. The on_entry and on_exit functions are of mode A are pasted into the own event handlers so that they are executed before them. What is not displayed in the figure is access inheritance. When Queχ checks the entry and exits of a mode it consideres automatically also the base classes. If there is no base mode that acts like a door opener to a particular mode, then transits will cause a run-time error.
In conclusion, it can be said that inheritance is a very useful tool to implement some standard behavior in a base mode. Derived modes can then comply to this standard simply by being derived from this mode. Also, for the control of mode transitions the is-a relationships of inheritance provides much flexibility. Mode inheritance allows to think about mode relationships very naturally and helps to avoid code duplication.
In Queχ modes can be changed by two basic mechanisms. First, modes can simple be changed. This makes sense, if the new mode knows exactly to what modes it has to change and this does not depend on the previous mode. Another mechanism is that of pushing and popping. This makes sense, if the new mode shall return to the previous mode. It is therefore similar to a call-subroutine in procedural programming languages.
Figure style=ref shows a typical mode transition. Mode X has some patterns and one of them matched. Inside the pattern match action a transition to mode Y is requested. This triggers the event handler on_exit() of mode X before any transition is made. Here the user can send tokens, change some internal variables or write some debug output. Then the event handler on_entry() of entered mode Y is called. Again the user can provide some actions related to the entrance of that mode.
The ability to define event handlers for entering and exiting modes not only supports implicit token production, but also supports transparency. It can now exactly be traced how modes are transited. In order to restrict the transition from and to lexical modes, queχ provides transition control mechanisms. In the header of a definition of a mode X it can be specified from what modes the mode X can be entered and to what modes the mode X can exit. If a mode transition happens that is not conform to this restrictions a run-time error is triggered. This helps to prevent the lexical analyser to slide from one mode to another due to some thoughtlessly designed patterns.
Figure style=ref shows an example, where mode transition control comes very handy. Here, there is a lexical mode for parsing the content of a {sf FUNCTION}, a mode for {sf COMMENT} and one for {sf DOCU}-mentation parsing, a mode for parsing a character {sf STRING} and one for {sf MATH}. The design here says, that the comment mode can only be activated from the {sf FUNCTION} mode and can only leave to {sf FUNCTION} mode. Similarly for the {sf STRING} and the {sf MATH} mode. Queχ provides the feature, that for {sf COMMENT} mode, for example, one can specify that it can only be entered from {sf FUNCTION} mode, thus preventing the {sf DOCU}, {sf STRING} and the {sf MATH} mode from ever transiting into {sf COMMENT} mode. Also, it can be specified that the {sf COMMENT} mode can only exit to {sf FUNCTION} or {sf DOCU} mode, thus preventing it from dropping into {sf STRING} or {sf MATH} mode.
Extensive mode transitions without a control mechanism as the one implemented is queχ is highly error prone. The advantage here is that the restrictions on the mode transitions are explicitly specified in the description of the lexical analyser. The alternative would be to create an external document and review the behavior of the lexical analyser with respect to mode transitions whenever an error is assumed from being caused by straying mode transitions. Practically, solid handling of extensive mode transitions in lexical analysis are not possible except with a control mechanism as provided by queχ.
With the rise of the Python programming language the use of indentation as scope delimiter has become popular. Indeed, it maybe the most efficient method to delimit scope with the least amount of additional characters provides a convinient mechanism to handle indentations which runs quasi in paralell to the pattern matching: indentation events.
Figure style=ref displays in an example the principle of indentation events. Whenever the lexical analyser reaches the first non-whitespace in a line, an indentation event (indicated as a little star in the figure) is triggered. The lexical analyser engine then calls a user defined indentation handler. The numbers at the indentation events indicate the number of characters that the indentation spans. They are passed to the user's indentation handler as arguments. The indentation handler, then, can then keep track of indentation blocks or whatever his little heart desires.
Note, that it is not trivial to express indentation in terms of pattern action pairs solely based on regular expressions. It is not enough to define a pattern such as
P_INDENTATION_CATCHER "\n"[ ]*
That is a newline followed by whitespace. Imagine, one introduces a comment sign such as the traditional # for comment until newline. The comment eating pattern would be at first glance:
P_COMMENT_EATER "#"[^\n]*\n
That is a # followed by anything but newline and then one newline. The action related to this pattern would have to put pack the last newline. Otherwise the indentation catcher which starts with a newline can never trigger. In this particular case, this problem can be solved by deleting the last newline from the comment eater pattern, knowing that after as many not-newline as possible there must be a newline, i.e.
P_COMMENT_EATER "#"[^\n]*
The last newline is then eaten by the indentation catcher. However, the main problem remains: with such a design patterns need to know about each other to work propperly! In an environment of many different modes which are additionally related by inheritance, it is difficult to guarantee that all patterns respect the quirks of each other. They better work independently. The indentation events of queχ provide a means that allows one to avoid inter-pattern dependencies.
Similarly, catching indentation with pre-condition newline plus whitespace, i.e. \verb|^[ \t]*| is fragile, in a sense that another pattern that contains newline plus whitespace might hinder this pattern from triggering. In a lexical analyzer with dozens or hundreds of patterns this becomes quickly unmanageable. Errors that arise from patterns defined somewhere else are very hard to find and require a lot of insight into the actual process of lexical analysis. Using the on_indentation event handler ends up in a much clearer and safer design. For more information about the pre-condition newline pitfall see section \pageref{formal/patterns/context-dependent-pitfalls}.
{bf Caveat:}
Note, that if a pattern contains more than one newline then only the indentation event concerning the last newline ist triggered! Imagine a pattern
STRANGER " "*"hello"[\n]+" "*"world"[\n]+" "*"how are you?"
then the following pattern would match
hello
world
how are you?
If this matches, then the lines of hello and world do not trigger an indentation event. So, when dealing with indentation based scoping such strange things are best avoided\footnote{Probably, it is anyway hard to think of a case where such pattern definitions are not to be avoided.}.
Any compiler or lexical analyzer imposes rules on the text that is has to treat. Most of thoses texts are written by humans and humans make sometimes errors and disrespect rules. A gentle compiler tells its user about his errors and tells it also about the place where the error occured. Here line numbering becomes crucial. Additionally columns can be counted. This provides even more detailed information.
There is general default algorithm to count lines and columns which is always applied in case that there is no better alternative. Queχ analyzes the patterns and adapts the way of counting lines and columns according to special characteristics\footnote{Actually, even the indentation count algorithm is adapted to profit from knowledge about the patterns internal structure.}. If the pattern contains for example a fixed number of newlines, than only a fixed number is added and no newlines are counted at runtime. The mechanisms for line and column counting are optimized for the most reasonable pattern characteristics. There are strange cases\footnote{Example: A pattern that contains a newline which is followed by a fixed number of characters. The determination of this in the context of post-conditions is complicated. On the other hand, such patterns are considered strange and seldom, so the expected gain with an optimized algorithm was considered neglectable by the author.}, though, for which a slightly better counting mechanism might be found. For reasonable applications as known from popular programming languages the mechanisms of counting should be optimal.
Note, that line and column counting can be turned off individually by pre-processor switches. By these switches the generated analyzer is naked of any counting mechanism and it is likely that it runs a little faster. For serious applications, though, at least line number counting should be in place for error reporting.
This chapter discussed the major features of queχ. To the knowledge of the
author of this text, at the time of this writing, there is no lexical analyser
generator that matches these features. Moreover, once one gets used to think
in terms of sending tokens, mode-transitions, and once one gets used to
define modes that implement standard behavior for other modes using
inheritance, then it becomes hard to get back to the cumbersome thinking of
plain lex or flex. The little overhead required to learn new concepts comes
with the great advantage of being able to design the lexical analysis process
in a very natural way
[Well, at least natural to the author of this
text.]
.
This practical examples guides through the phases of creating a lexical analyser using queχ. The files of this example can be found in the directory demo/001/ of the queχ distribution. The file simple.qx contains the description of the lexical analyzer, lexer.cpp contains a lexical analyzer application, and a Makefile allows the convinient creation of the whole lexical analyzer application. Copy-pasting this example may be a good starting point for the work with queχ. A good approach is to devide the development of a lexical analyzer into the two steps of design and implementation. The first step consists of three minor activities:
Patterns can be defined directly on the pattern action pair line, but it is much more intuitive to define patterns in terms of regular expressions \cite{} in a separate data section. Those patterns act like constants that allow to understand quickly for what a pattern/action pair stands. Using a name {NOT_NEWLINE}+ is simply visually more appealing than writing [^\n]+}. Those patterns can be defined very easily in terms of regular expressions.
We finally want to send tokens and those tokens must have identifiers. For this reason, the user needs to specify names of tokens in a token section, so that queχ can create ids for them.
This is best done with a drawing program or by hand on paper. One should be clear about how lexer modes are named, how they relate to each other and what transitions are possible. This design will be the basis for the later coding of modes.
Once, these simple steps are accomplished the implementation of modes in terms of pattern-action pairs can be started. The example discussed in this section develops the following files that are required for generating a lexical analyser:
which contains all information queχ needs to create a lexical analyzer.
which uses the created lexical analyser in a mini example application.
To follow the explanations of this chapter the directory demo/001 is best copied into a user domain for further examination and modifications.
In the simple example, only very basic patterns have to be defined. There is a pattern for whitespace, consisting of space, tabulator and newline. There are two bracket types for opening and closing brackets, an assignment operator =, the two keywords if and struct, identifiers consisting of letters from A-Z in capital and lower-case, floating point numbers and string delimeters. The listing in figure [fig:pattern-code] shows the content of the define section in simple.qx describing all required patterns in terms of regular expressions.
define { // patterns of interest for all modes P_WHITESPACE [ \t\n] P_IDENTIFIER [_a-zA-Z]+ P_NUMBER [0-9]+ // string mode related patterns P_STRING_DELIMITER "\"" P_BACKSLASHED_STRING_DELIMITER "\\\"" P_BACKSLASHED_BACKSLASH "\\\\" }
The filename including the extension is, of course, totally freely chooseable. Note, that all pattern names are global. They are not local to a specific mode. That means, that all modes can access the pattern definitions. Note also, that inside the define-range only //-comments are allowed. The P_-prefix was chosen to indicate to the reader, that this is a pattern definition. There is no danger of interfering with the program namespace. These shorthand names are only considered for the definition of regular expressions. The later generated engine does not contain them as variables.
With a similar ease as patterns, token-ids can be specified. A token-section containing token-ids only has to contain a newline separated list of token names. Queχ will create a set of constants with unique numerical values. Figure [fig:token-ids-code] shows the definition of token-ids for the example.
token { BRACKET_O // appears as TKN_BRACKET_O BRACKET_C // TKN_BRACKET_C CURLY_BRACKET_O // ... CURLY_BRACKET_C OP_ASSIGNMENT IF STRUCT SEMICOLON IDENTIFIER NUMBER STRING }
Note, that queχ pastes a prefix in front of the numerical constants for the tokens. A token-id of STRUCT will be called TOK_STRUCT, if TOK is specified as token-prefix. This is discussed briefly in the section [sec:formal/command-line-options]. It is also important to remind that the token-ids are relevant to the namespace. They name constant variables inside the namespace queχ. It is not advisable to set the token-id prefix to an empty string and name a token-id i or x because those names are likely to interfer with names of other variables.
The design of modes, their relations and the transitions is something to be done informally with a drawing program or on a sheet of paper. It is basically a help for the user to develop a clear design of the lexical analysis process. In our simple example, there are only three modes:
This mode describes the reaction to end-of-file. All modes in the simple example share this behavior, so they are derived from this mode. On the other hand, this is more an abstract mode\footnote{In the C++ sense of the word, there is no real existing mode of that type. Only derived modes can exists.}. The lexical analyser shall never be in an END_OF_FILE mode, but in a derived mode of it.
In this mode, we detect program related tokens. It is derived from END_OF_FILE. Thus, an if an end-of-file arrives in this mode it behaves propperly. When a string delimiter arrives, this mode shall transit into the STRING_READER mode.
This mode shall also treat end-of-file in the standard way, so it is also derived from END_OF_FILE. However, all patterns for number detection, keywords and brackets are totally meaningless inside the string. A string-delimiter shall trigger the return to the PROGRAM mode.
The inheritance relationships in this example are displayed in figure style=ref. Both real modes PROGRAM and` STRING_READER` are derived from END_OF_FILE so that they implement the standard behavior for the end-of-file event.
In this examples their are only two modes in which the lexical analyser can be. The posible transitions are shown in figure style=ref. PROGRAM mode can transit into STRING_READER mode and vice versa. The mode inheritances and transitions are only pinpointed in non-machine readable form. This information can be used to write the first stubs of the modes in which one will later fill in the pattern-action pairs.
mode END_OF_FILE : <inheritable: only> { ... } mode PROGRAM : END_OF_FILE <exit: STRING_READER> <entry: STRING_READER> { ... } mode STRING_READER : END_OF_FILE <exit: PROGRAM> <entry: PROGRAM> { ... }
The three code-stubs for the three modes are shown in figure style=ref. Until now, there are no pattern action pairs defined inside the modes. The dots … inside the curly braces will later be replaced by the pattern-action pairs. First of all, mode END_OF_FILE is defined. It does not inherit any mode, but can only be used as a base mode. Thus the option
<inheritable: only>
is specified in edgy brackets. Second, the PROGRAM mode is derived from END_OF_FILE which is indicated by its name following the colon. The options
<exit: STRING_READER> <entry: STRING_READER>
tell that this mode can be entered from STRING_READER mode, and that it can exit to this mode. Any transition to another mode (that is not derived from STRING_READER) is detected as an error\footnote{Note, that inside queχ assert() commands are used. According to ISO9899 (ANSI-C) the command only takes effect, if NDEBUG is not defined. If it is, the mode transitions are consequently not checked by the queχ engine. The general rule applies here also: only define NDEBUG if it is 100\% safe to say that no use-case will violate the design.}. Analogously, the STRING_READER mode is defined specifying its base class and the allowed transitions to PROGRAM.
Note, that all options given here are indeed optional. They are not necessary to define pattern-action pairs, and they do not contribute to the functioning of the generated lexical analyser. However, they allow to check wether mode definitions and mode transitions are according to the design or not. The detection of usage against design specifications is of great help to avoid errors or find their causes.
The previous sections explained the definitions of patterns, token-ids, and mode characteristics. In this section the core of the lexical analysis is discussed: pattern-action pairs. Lexical analysis has as a result a stream of tokens that tell about atomic information chunks in the input stream. The most simple pattern-action pair consists of sending a specific token whenever a particular pattern occurs. Consider for example figure style=ref. It shows the definition of a mode that can handle the end of a file. The only pattern of this mode is <<`EOF`>>. The action to be executed if this pattern occurs is embraced by curly brackets. Inside the brackets a special token is sent, a token with the id:
quex::TKN_TERMINATION
telling that lexical analysis is over. Note, that in queχ one uses the keyword self and not a this-pointer to refer to the lexical analyser. Recall that queχ is a code generator, not a compiler. Internally, the code fragments that are specified as pattern-action pairs or event handlers may be pasted by queχ into objects other than the lexical analyser (e.g. the lexical-analyser-mode-information structures). Also the user might derive his own class from the lexical analyser class, so there would be a type-issue. To unify the access to the lexical analyser, all places were code is pasted allow to use self as a variable that represents the lexical analyser. Use self. not this->!
mode END_OF_FILE : <inheritable: only> { <<EOF>> { self.send(quex::TKN_TERMINATION); RETURN; } }
A final RETURN statement in the pattern-action pair says that the flow of control can be turned back to the caller of function` get_token(). Note again, that `return is not used. The RETURN macro ensures that the token queue is in a sound state before returning from the lexical analysis. The rule here is: say RETURN not return.
The pattern-action pairs of the PROGRAM mode displayed in figure style=ref are a little more diverse. Note also, that for simple patterns the operator =$>$ is used to express the sending of a token of a specific id. The whitespace pattern does not cause any token to be sent and one does not return to the user, thus it is followed by an empty pair of curly brackets. The pattern P_IDENTIFIER creates a token that contains the string that was matched. P_NUMBER also stores a number that corresponds the lexeme that was matched. The pattern` P_STRING_DELIMITER` does not send a token, but initiates a mode transition to the mode STRING_READER.
mode PROGRAM : END_OF_FILE <exit: STRING_READER> <entry: STRING_READER> { "{" => TKN_CURLY_BRACKET_O; "}" => TKN_CURLY_BRACKET_C; "=" => TKN_OP_ASSIGNMENT; "struct" => TKN_STRUCT; "if" => TKN_IF; ";" => TKN_SEMICOLON; {P_NUMBER} => TKN_NUMBER(atoi(yytext)); {P_IDENTIFIER} { self.send(TKN_IDENTIFIER, yytext); RETURN; } {P_WHITESPACE} { } {P_STRING_DELIMITER} { self << STRING_READER; } }
The STRING_READER mode's pattern-action pairs are shown in figure style=ref. This mode makes use of a special member of the lexical analyser object: _the accumulator_. This object allows to accumulate text from incoming patterns and flush them when desired. When this mode is entered, the accumulator is emptied (see on_entry). When this mode is left, then the accumulator is flushed (see on_exit). This means, that if the accumulated text is not empty, it is send as a token with a user defined token-id. The STRING_READER mode sends the accumulated text with a token-id TKN_STRING.
The arrival of a string delimiter causes a mode transition back to mode` PROGRAM`. A backslashed string delimiter allows to have a string delimter inside the the string, so it does not cause a return to PROGRAM mode. instead, it adds a quote to the accumulator. The default action . specifies that any other incoming text is simply accumulated. Even in this simple example it becomes clear that the specification of entry and exit event handlers facilitates the specification of mode transitions. It can be said for sure, that wherever in the code a command triggers an exit to another mode, the accumulator is flushed.
mode STRING_READER : END_OF_FILE <exit: PROGRAM> <entry: PROGRAM> { on_entry { self.accumulator.clear(); } on_exit { self.accumulator.flush(TKN_STRING); } {P_BACKSLASHED_STRING_DELIMITER} { self.accumulator.add('\"'); } {P_STRING_DELIMITER} { /*.........................................*/ self << PROGRAM; RETURN; } . { self.accumulator.add(Lexeme); } }
This section now finished the specification of file simple.qx as an input file for the queχ lexical analyser generator. The next section tells how to call queχ to generate a full fledged engine that does the lexical analysis as required.
With the files from directory \$(QUEX_PATH)/DEMO/001/ the creation of the example lexical analyser application is as simple as can be. Type make in the command line and the application is built. The Makefile that comes with the sample application is written in a way that makes it very easy to adapt and extend its contents for other user's needs. The core of this make file explains how to call queχ:
# (*) definitions MODE_FILES = ./in/simple.qx # ENGINE_NAME = tiny_lexer # ENGINE_SOURCES = $(ENGINE_NAME) \ $(ENGINE_NAME).cpp \ $(ENGINE_NAME)-internal.h \ $(ENGINE_NAME)-token_ids \ $(ENGINE_NAME)-core-engine.cpp ... # (*) build rule $(ENGINE_SOURCES): $(PATTERN_FILE) $(MODE_FILES) $(TOKEN_DB) quex --mode-files $(MODE_FILES) \ --engine $(ENGINE_NAME)
The build rule says that whenever one of the files containing mode descriptions changes the source code for the lexical analyser has to be rebuild. In our example the only mode file is simple.qx. Queχ receives the names of the input file names as no-minus followers of the command line option —mode-files.
The output of queχ is determined by the option —engine. The name that follows it specifies the name of the class that implements the engine. But, it also acts as filestem for all output files. In our example, the engine's name is tiny_lexer. Therefore, there will be four files created:
containing the header file that contains the class definition of class queχ::tiny_lexer. This is _the only output file that is of direct interest to the user_. It has to be included in each file that interacts with the produced lexical analyser.
containing the created lexical analyser engine. The user does not need to touch it, but it has to be compiled sometime and linked to the application. The example Makefile does this already. The user is not directly concerned with this file.
containing definitions of token-ids. That means it provides variables with the names of token-ids that carry unique numerical values. It also defines the member function string token::map_id_to_name(const id_type TokenID) which maps a token-id to a token-id name. This comes practical in many stages of the development. However, this file is included inside the header file for the lexer class definition and the user does not worry about this file at all.
is a generated C++ code file containing the generated lexical analyzer engines for all modes.
Once the engine sources are built by queχ, an example application can be built that uses the lexical analyser. The source code for the example is specified in file lexer.cpp. The content
#include<fstream> #include<iostream> // (*) include lexical analyser header #include <./tiny_lexer> using namespace std; int main(int argc, char** argv) { // (*) create token quex::token Token; // (*) create the lexical analyser // if no command line argument is specified user file 'example.txt' quex::tiny_lexer* qlex = new quex::tiny_lexer("example.txt"); // (*) print the version cout << qlex->version() << endl; cout << ",------------------------------------------------------\n"; cout << "| [START]\n"; int number_of_tokens = 0; // (*) loop until the 'termination' token arrives do { // (*) get next token from the token stream qlex->get_token(&Token); // (*) print out token information // -- line number and column number cout << "(" << qlex->line_number() << ", "; cout << qlex->column_number() << ") \t"; cout << Token.type_id_name() << endl; ++number_of_tokens; // (*) check against 'termination' } while( Token.type_id() != quex::TKN_TERMINATION ); cout << "| [END] number of token = " << number_of_tokens << "\n"; cout << ",------------------------------------------------------\n"; return 0; }
Initially, the lexical analyser needs to be created as an object of type` queχ::tiny_lexer`. The filename of the file to be analysed is directly passed to the constructor of the class. Then, one needs to set the initial lexical analysis mode. In our example, PROGRAM shall be the initial mode. Since it shall not trigger any mode transitions or protection algorithms, it is set using the function set_mode_brutally().
After printing out the version information, the loop over the token stream starts. In the sample application it does not more than getting the a token using get_token() and printing it out. In a more realistic application this token is to be passed to a parser that does its syntactic analysis on it. When a token arrives with an identifier equal to` queχ::TKN_TERMINATION`, then the loop exits and the program terminates.
At this point in time everything is ready for compilation. Using the Makefile coming with the example, typing make shall produce a ready-to-run application called lexer in the current directory. If one types now
> ./lexer example.txt
on the command line essentially the following output is produced:
tiny_lexer: Version 0.0.0-pre-release. Date Fri Sep 7 7:51:20 2007 Generated by Quex 0.15.3 ,----------------------------------------------------------------- | [START] (0, 0) STRUCT (0, 7) IDENTIFIER (0, 15) CURLY_BRACKET_O (1, 2) IDENTIFIER (1, 9) IDENTIFIER (1, 10) SEMICOLON (2, 2) IDENTIFIER (2, 9) IDENTIFIER (2, 10) SEMICOLON (3, 0) CURLY_BRACKET_C (3, 1) SEMICOLON (5, 0) IF (5, 3) IDENTIFIER (5, 12) CURLY_BRACKET_O (6, 2) IDENTIFIER (6, 7) IDENTIFIER (6, 14) NUMBER (6, 34) STRING (6, 35) SEMICOLON (7, 0) CURLY_BRACKET_C (7, 1) <TERMINATION> | [END] number of token = 21 `-----------------------------------------------------------------
This section has shown how to built a lexical analyser with queχ and provided a ready-to-run sample application. It was discussed how to specify patterns in a pattern file, how to announce names for token-identifiers and how to code mode-files that contain information about modes and pattern-action pairs. After an explanation of how to invoque queχ, a sample application was discussed that uses the created lexical analyser engine. The Makefile that comes with the example can be easily adapted to particular applications and reduces the required effort to the typing of make on the command line. In general, the example files simple.qx and lexer.cpp build a good basis for a new project. In the following chapters various features and options of queχ are discussed. Additionally some discussion will be made about the generated lexical analyser. The current chapter provided enough information to start coding a lexical analyzer. Following chapters become interesting once you tasted the ease of queχ.
The last chapter provided a functioning example for the usage of queχ. At this point the user should have a clear idea about the basic concepts and how they are applied. This chapter discusses the details of usage and minor features that may not have been mentioned in the previous chapters. The first sections discuss the specification of patterns, mode characteristics, pattern-action pairs, and mode transitions. Section queχ creates with all members and member functions that the user might want to use.
In some cases, the provided functionality of the generated lexical analyser discusses the formalities to implement a derived class and its usage. The generated code contains some macro switches that allow to control the behavior of explains all command line arguments that can be passed to queχ.
pattern file. In this section, it is described how to specify patterns by means of regular expressions. The regular expression syntax follows the scheme of what is usually called traditional UNIX syntax \cite{}, includes elements of extended POSIX regular expressions \cite{IEEE POSIX Standard 1003.2} and POSIX bracket expressions \cite{}. This facilitates the migration from and to other lexical analyzer generators and test environments. Additionally, queχ provides support for Unicode Propperties. A compliance to Unicode Regular Expressions \cite{Unicode 5.0 Technical Report Standard #18} is currently not targeted, though, because this expressive power is usually not required for compiler generation.
Nevertheless, queχ provides features that, for example, flex does not. If it is intended to maintain compatibility of regular expressions with flex, then please refer to the flex manual \cite{}, section Patterns and do not use queχ-specific constructs. This section discusses pure queχ syntax. The explanation is divided into the consideration of context-free expressions and context-dependent expressions.
Context free regular expressions match against an input independent on what come before or after it. For example the regular expression for will match against the letters f, o, and r independent if there was a whitespace or whatsoever before it or after it. This is the usual way to define patterns. More sophisticated techniques are explained in the subsequent section. This sections explains how to define simple chains of characters and operations to combine them into powerful patterns.
Chains of Characters
matches the character x. That means, lonestanding characters match simply the character that they represent. This is true, as long as those characters are not operators by which regular expressions describe some fancy mechanisms---see below.
matches any character (byte) except newline and EOF. Note, that on systems where newline is coded as $0D,\,0A$ this matches also the $0D$ character whenever a newline occurs (subject to possible change).
a "character class" or "character set"; in this case, the pattern matches either an `x', a `y', or a `z'. The brackets $[$ and $]$ are examples for lonestanding characters that are operators. If they are to be matched quotes or backslashes have to be used as shown below. Character sets are a form of alternative expressions— for one single character. For more sophisticated alternative expressions see the paragraphs below.
[: expression :] matches a set of characters that result from a character set expression expression. Section [formal/patterns/character-set-expressions] discusses this feature in detail. In particual [:alnum:], [:alpha:] and the like are the character sets as defined as POSIX bracket expressions.
a "character class" with a range in it; matches an a', a `b', any letter from `j' through `o', or a `Z'. The minus - determines the range specification. Its left hand side is the start of the range. Its right hand sinde the end of the range (here j-o means from j to o). Note, that - standards for range from to even if the left hand side is lesser than the right hand side. For example `a-z and z-a are equivalent expressions.
a "negated character class", i.e., any character but those in the class. The ^ character indicates negation at this point. This expression matches any character except an uppercase letter or newline.
the literal string: `$[$xyz$]$"foo'. That is, inside quotes the characters which are used as operators for regular expressions can be applied in their original sense. A $[$ stands for code point 91 (hex. 5B), matches against a $[$ and does not mean open character set. Note, than inside strings one can still use the ANSI-C backslashed characters \n, \t, etc. as well as the Unicode name property \N. However, general Unicode property expression \P that result in character sets are not dealt with inside strings.
a NULL character (ASCII/Unicode code point 0). This is to be used with extreme caution! The NULL character is also used aa buffer limitting character! Same as with the code for begin-of-file and end-of-file, if these codes are used inside regular expressions, their values have to be re-assigned. See section [sec:formal/command-line-options] for specifying a different value for the buffer limit code.
the character with hexadecimal value 11A0FF. A maximum of six hexadecimal digits can be specified. Hexadecimal numbers with less than six digits must either be followed by a non-hex-digit, a delimiter such as ", [, or (, or specified with leading zeroes (i.e. use \U00071F, for hexadecimal 71F). The latter choice is probably the best candidate for an established habit. Hexadecimal digits can contain be uppercase or lowercase letters (from A to F).
the character with hexadecimal value 7A27. A maximum of four hexadecimal digits can be specified. The delimiting rules are are ananlogous to the rules for \U.
the character with hexadecimal value 27. A maximum of two hexadecimal digits can be specified. The delimiting rules are are ananlogous to the rules for \U.
the character with octal value 123, a maximum of three digits less than 8 can follow the backslash. The delimiting rules are ananlogous to the rules for \U.
the ANSI-C interpretation of the backslashed character.
the set of characters for which the Unicode Property Expression holds. Note, that these expressions cannot be used inside quoted strings.
the code of the character with the given Unicode character name. This is a shortcut for \P{Name=UNICODE CHARACTER NAME}. For possible settings of this character see \cite{Unicode 5.0}.
the code of the character with the given General Category \cite{}. This is a shortcut for \P{General_Category=X}. Note, that these expressions cannot be used inside quoted strings. For possible settings of the General_Category property, see section [sec:formal-unicode-properties].
Any character specified as character code, i.e. using \, \x, \X, or \U are considered to be unicode code points. For applications in english spoken cultures this is identical to the ASCII encoding. For details about unicode [formal/ucs-properties] is dedicated to an introduction to Unicode properties.
Two special rules have to appear isolatedly, out of the context of regular expressions. With the following two rules the actions for the event of end of file and the failure event can be specified:
the event of an end-of-file (end of data-stream).
the event of failure, i.e. no single pattern matched. Note, this rule is of the lex style, but is only available with the queχ core engine.
This syntax is more in recognition of the traditional lex syntax. In fact the two event handlers on_failure and on_end_of_stream are a one-to-one correspondance to what is mentioned above. Possibly some later versions will totally dismiss the lex related engine core, and then also these constructs will disappear in favor of the mentioned two event handlers.
Operations
Let R and S be regular expressions, i.e. a chain of characters specified in the way mentioned above, or a regular expression as a result from the operations below. Much of the syntax is directly based on POSIX extended regular expressions \cite{}.
zero or more occurencies of the regular expression R.
one or more repetition of the regular expression R.
zero or one R. That means, there maybe an R or not.
anywhere from two to five repetitions of the regular expressions R.
two or more repetitions of the regular expression R.
exactly four repetitions of the regular expression R.
match an R; parentheses are used to group operations, i.e. to override precedence, in the same way as the brackets in $(a\, +\, b)\,\cdot\,c$ override the precedence of multiplication over addition.
the regular expression R followed by the regular expression S. This is usually called a concatenation or a sequence.
either an R or an S, i.e. R and S. This is usually called an alternative.
the expansion of the defined pattern "NAME". Pattern names can be defined in define sections (see section [sec:practical/patterns]).
Character set expression are a tool to combine, filter and or select character ranges conviniently. The result of a character set expression is a set of characters. Such a set of characters can then be used to express that any of them can occur at a given position of the input stream. The character set expression [:alpha:], for example matches all characters that are letters, i.e. anything from a to z and A to Z. It belongs to the POSIX bracket expressions which are explained below. Further, this section explains how sets can be generated from other sets via the operations union, intersection, difference, and inverse.
POSIX bracket expressions are basically shortcuts for some more regular expressions that would formally look a bit more clumpsy. Queχ provides those expressions bracketed in [: and :] brackets. They are specified in the table below.
| Expression | Meaning | Related Regular Expression |
|---|---|---|
| [:alnum:] | Alphanumeric characters | [A-Za-z0-9] |
| [:alpha:] | Alphabetic characters | [A-Za-z] |
| [:blank:] | Space and tab | [ \t] |
| [:cntrl:] | Control characters | [\x00-\x1F\x7F] |
| [:digit:] | Digits | [0-9] |
| [:graph:] | Visible characters | [\x21-\x7E] |
| [:lower:] | Lowercase letters | [a-z] |
| [:print:] | Visible characters and spaces | [\x20-\x7E] |
| [:punct:] | Punctuation characters | [!"#$%&'()*+,-./:;?@[\\\]_`{|}~] |
| [:space:] | Whitespace characters | [ \t\r\n\v\f] |
| [:upper:] | Uppercase letters | [A-Z] |
| [:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] |
Caution has to be taken if these expressions are used for non-english languages. They are solely concerned with the ASCII character set. For more sophisticated property processing it is advisable to use Unicode property expressions as explained in section [formal/ucs-properties]. In particular,
The use of Unicode character set potentially implies the handling of many different properties and character sets. For convinience, queχ provides operations on character sets to combine and filter different character sets and create new adapted ones. The basic operations that queχ allows are displayed in the following table:
| Syntax | Meaning | Example |
|---|---|---|
| union(A0, A1, …) | Union of the list of sets. | union([a-z], [A-Z]) = [a-zA-Z] |
| intersection(A0, A1, …) | Intersection the list of sets. | intersection([0-9], [4-5]) = [4-5] |
| difference(A, B0, B1, …) | Subtracts B0, B1, from set A. | difference([0-9], [4-5]) [0-36-9] |
| inverse(A0, A1, …) | Inverse of the union of the list of sets. | inverse([\x40-\5A]) = [\x00-\x40\x5B-\U12FFFF] |
Note, that the difference and intersection operation can be used convienently to filter different sets. For example
[: difference(\P{Script=Greek}, \P{Digit}) :]
results in the set of greek characters except the digits. To allow only the numbers from the arabic code block intersection can be used as follows:
[: intersection(\P{Block=Arabic}, \P{Numeric}) :]
The subsequent section elaborates on the concept of Unicode properties.
Unicode is distinguished from other coding standards not only with respect to its level of completeness including all possibly used character sets. Moreover, much effort has been accomplished in categorizing characters, providing terminology for concepts related to letter systems and defining character properties. Unicode defines a fixed set of properties that a character can have. The most important property for a lexical analyzer is the code point, i.e. the integer value that represents a character. However, there are other interesting properties that simplify the description of regular expressions.
Queχ supports Unicode Standard Properties through the \P{..} expressions, where the P stands for property. For a queχ user properties can be devided into two categories:
Binary Properties, i.e. properties that a character either has or has not. For example, a character is either a whitespace character or it is not. Sets of characters having a binary property can be accessed through \P{binary_property}.
Non-Binary Properties, i.e. properties that require a particular value related to it. For example, each character belongs to a certain script. A character belonging to the greek script has the property Script=Greek. Sets of characters that have a certain property setting can be accessed via \P{property=value}.
Note, that the result of a \P{...} expression is always a set of characters. Therefore, it cannot be used inside a quotes string expression. For convinience, the properties Name and General_Category are provided through the shortcuts \N{…} and \G{…}. Thus, \N{MIDDLE DOT} is a shorthand for \PUppercase_Letter.
Unicode 5.0 provides more than 17000 character names, so please consult the
standard literature for settings of the Name property
[Alternatively, consider the file UnicodeData.txt that comes with
the queχ application]
. Note also, that names as defined in Unicode 1.0
can be accessed through the Unicode_1_Name property. This property also
contains names of control functions according to ISO 6429 \cite{}.
As for the General_Category property, the appendix provides the list of possible settings [sec:appendix-property-general-category]. At this place, more detailed information is specified about provided properties, their meaning, and settings of other likely-to-be-used properties.
When starting to write lexical analyzers using a wider range of unicode characters the reliance on properties becomes almost unavoidable. However, the huge volume of characters requires some sophisticated tool to browse through properties and related character sets. For this purpose queχ provides a query mode. This mode allows the user to specify some queries on the command line and get the response immediately on the console. This subject is handled in section [sec:query/intro].
As an option to facilitate the specification of property values, queχ allows you to use wildcards *, ? and simple character sets such as as [AEIOUX-Z] for the characters A, E, I, O, U, X, Y, and Z. If the first character is an ! then the inverse character set is considered. This is conform with unix file name matching. set specifications such as [a-z] such to facilitate the search. It is advisable, though, to use queχ's query functionality first [sec:query/intro] in order to get an impression to what value such a wild-card expression expands.
The most dangerous pitfall is related to precedence and length. Note, that a pattern that is defined {it before} another pattern has a higher precedence. Also, if a pattern can match a longer chain of characters it wins. Thus, if there are for example two patterns
[A-Z]+ => TKN_IDENTIFIER(Lexeme); "PRINT" => TKN_KEYWORD_PRINT;
then the keyword PRINT will never be matched. This is so, because [A-Z] matches also the character chain PRINT and has a higher precedence, because it is defined first. To illustrate the danger of greedy matching, i.e. the fact that length matters, let two patterns be defined as:
"Else" => TKN_KEYWORD_ELSE; "Else\tAugenstein" => TKN_SWABIAN_LADY(Lexeme);
Now, the Else statement may be matched, but only if it is not followed by tabulator and Augenstein. On the first glance, this case does not seem to be very probable. Sometimes it may be necessary, though, to define delimiters to avoid such confusion. In the very large majority of cases greedy matching is a convienient blessing. Imagine the problem with identifiers, i.e. any chain of alphabetic characters, and a keyword for. If there was no greedy matching (longest match), then any variable starting with for could not propperly be detected, since the first three letters would result in the for-keyword token.
Another pitfall is related to character codes that the lexical analyser uses to indicate begin-of-file, end-of-file, or buffer-limit. The values for those codes are chosen to be out of the range for sound regular expressions parsing human written text (0x0 for buffer-limit, 0x1A for end-of-file, and 0x19 for begin-of-file). If it is intended to parse binary files, and these values are supposes to occur in patterns, then these codes need to be changed. Section [sec:formal/command-line-options] mentions how to specify their codes on the command line.
Additionally to the specification of the pattern to be matched queχ allows to define conditions on the boundary of the pattern. This happens through pre- and post-conditions. First, the trivial pre- and post-conditions for begin of line and end of line are discussed. Then it is shown how to specify whole regular expressions to express conditions on the surroundings of the pattern to be matched. The traditional characters to condition begin and end of line are:
an regular expression R, but only at the beginning of a line. This condition holds whenever the scan starts at the beginning of the character stream or right after a newline character. This shortcut scans only for a single newline character \verb|\| (hexadecimal $0A$) backwards, independent on how the particular operating system codes the newline. In this case, there is no harm coming from different conventions of newline.
a regular expression R, but only at the end of a line or right before the end of the
file. Note, that the meaning of this shortcut can be adapted according to the
target operating system. Some operating systems, such as DOS and Windows, code
a newline as a sequence \r\n (hexadecimal 0D, 0A), i.e.
as two characters. If you want to use this feature on those systems, you need
to specify the —DOS option on the command line (or in your makefile).
Otherwise, $ will scan only for the newline character (hexadecimal 0A).
Note, that for the trivial end-of-line post condition the newline coding
convention is essential. If newline is coded as 0D, 0A then the
first 0D would discard a pattern that was supposed to be followed by
0A only.
For more sophisticated case real regular expressions can be defined to handle pre- and post-conditions. Note, that pre- and post-conditions can only appear at the front and rear of the core pattern. Let R be the core regular expression, Q the regular expression of the pre-condition, and S the regular expression for the post-condition.
matches an R, but only if it is followed by an S. If the pattern matches the input is set to the place where R. The part that is matched by S is available for the next pattern to be matched. R is post-conditioned. Note, that the case where the end of R matches the beginning of S cannot be treated by Version 0.9.0\footnote{The reason for this lies in the nature of state machines. Flex has the exact same problem. To avoid this some type of step back from the end of the post-condition must be implemented.}.
matches R from the current position, but only if it is preceeded by a Q. Practically, this means that queχ goes backwards in order to determine if the pre-condition Q has matched and then forward to see if R has matched. R is pre-conditioned. Note, with pre-conditions there is no trailing context problem as with post-conditions above.
matches R from the current position, but only if the preceeding stream matches a Q and the following stream matches an S. R is pre- and post-conditioned.
A somehow subtle pitfall is related to the begin-of-line pre-condition. When using the ^ sign at the beginning of a line it is tempting to sing "don't worry, be happy … this matches at any begin of line"—well, it does not! The important thing to understand is that it matches when the lexical analysis step starts at the beginning of a line. The alarm signal should ring if begin-of-line is triggered and a whitespace pattern is defined that includes newline. Consider the following patterns being defined:
define { WHITESPACE [ \t\n]+ .... GREETING ^[ \t]*hello } mode x_y_z : { {WHITESPACE} => TKN_WHITESPACE; {GREETING} => TKN_GREETING(Lexeme); }
Where the hello greeting is to be matched after newline and some possible whitespace. Now, given the character stream:
... something ... hello
will not send a GREETING token. This is because the whitespace pattern eats the newline before the hello-line and even the whitespace before hello. Now, the next analysis step starts right before hello and this is not the beginning of a line. Splitting the whitespace eater into newline and non-newline helps:
define { WHITESPACE_1 [ \t]+ WHITESPACE_2 [\n]+ GREETING ^[ \t]*hello } mode x_y_z : { {WHITESPACE_1} => TKN_WHITESPACE; {WHITESPACE_2} => TKN_WHITESPACE; {GREETING} => TKN_GREETING(Lexeme); }
Now, the first whitespace eating ends right before newline, then the second whitespace eating ends after newline, and the hello-greeting at the beginning of the line can be matched.
Another pitfall is, again, related to precedence. Make sure that if there are two patterns with the same core pattern R, then the pre- or post-conditioned patterns must be defined before the un-conditioned pattern. Otherwise, the pre- or post-conditioned patterns may never match. Recall section \ref{formal/patterns/context-free-pitfalls} for a detailed discussion on precedence pitfalls.
In most practical application it is not a good idea to start in a void initial mode, so queχ requires an initial mode explicitly to be specified. This happens using the start keyword, i.e. one has to type somewhere in between the mode definitions (but best at the very beginning):
start = PROGRAM
Let the mode-stubs be stored in a file called simple.qx. The previous sections defined patterns, token-ids, and modes. Together with these definitions one can now start with the specification of pattern-action pairs as discussed in the following section.
Queχ allows to specify characteristics of modes concerning transitions and inheritance. The keyword mode starts the definition of a lexical analyser mode. Then it can be directly followed by a curly bracket that starts the pattern-action pair definitions as shown in figure style=refa. This is the most simple way to define a mode.
a)
mode MY_MODE { ... // pattern-action pairs of MY_MODE ... }
b)
mode MY_MODE : END_OF_FILE WHITESPACE_EATER DOCUMENTATION_TRIGGER { ... // pattern-action pairs of MY_MODE ... }
c)
mode MY_MODE : END_OF_FILE <inheritable: yes> <exit: OTHER_MODE> <restrict: exit> { ... // pattern-action pairs of MY_MODE ... }
In order to define options or base modes the name of the mode has to be followed by a colon as in the examples in figure <<:formal-mode-definition, style=ref>>b and \ref{fig:formal-mode-definition, style=ref>>c. Note, that if a mode specifies base modes and options, the options are specified after the base modes.
Base modes directly follow the : sign. The list of base modes is whitespace separated. When the first $<$ sign arrives, queχ starts reading the options. Options are bracketed by $<$ and $>$ signs and multiple options are also listed whitespace separatedly. The following options can be specified:
restricts the inheritance relationships. If yes is specified, it is allowed for any other mode to be derived from this mode. If no is specified and another mode tries to derive from this mode, an error will be printed. Specifying only tells that this mode cannot be a lexical analyser mode. Its reason for existance is to support derived modes. This is similar to the concept of abstract classes in C++ or interfaces in Java.
tells that it is allowed that the lexical analyser enters this mode from mode mode-name or one of its derived modes. If no entry mode is specified, then all modes are allowed.
tells that it is allowed that the lexical analyser exits from this mode to mode mode-name or one of its derived modes. If no exit mode is specified, then all modes are allowed.
turns of the admissibility of derived modes for entry or exit. If entry is specified the entry is restricted, so any other mode than the the listed entry modes is forbidden even if it is derived from one of them. Respectively, specifying exit restricts the list of exit modes.
Options are optional, i.e. they do not need to be specified. But, they facilitate the debugging and add confidence that the mode transitions are safe and sound. The list of options is then to be followed by an opening curly bracket { that opens the list of pattern-action pairs. These are discussed in the following section.
Queχ only allows to specify pattern-action pairs inside modes. As the name says, a pattern-action pair consists of a pattern that is to be matched and a code fragment that is executed when the pattern matches. Patterns are best specified inside the pattern definition sections, so that inside the mode sections only names of patterns occur. However, one can also use regular expressions inside the mode definitions. The format for pattern-action pairs is as simple as can be:
The pattern is specified as a regular expression or a pattern name embraced by curly brackets.
Then comes a whitespace that separates the pattern from the action.
Then there must be a curly bracket followed by whitespace that delimits the start of the code fragment of the action. Queχ parses then until the matching closing curly bracket that terminates the action.
Examples for pattern-action pairs have been given enough in section variables to know for coding the action for a pattern match:
self: Use the reference self. + member_name in order to access a member of the lexical analyser. This works in any place - also in event handlers. Better do not try to use the this pointer in order to keep the code uniform. The self reference automatically referes to an object of the derived class if such exists, and it points to the lexical analyser even if the code is pasted into member functions of different objects.
RETURN: Using the RETURN macro to return from a pattern action pair ensures a synchronization with the token queue. If the token queue is empty, it does not return but continue the current analysis. If it is sure that the token queue is non-empty (because a token was recently sent, then a simple return is a little faster alternative.
Lexeme: The string that matched the pattern of the current action can be accessed through variable Lexeme.
LexemeL: The length of a lexeme that matched a pattern can be accessed through variable LexemeL.
In practical applications for programming languages, the large majority of patterns to be identified are simple and do not require large algorithms to handle them. Keywords, such as if, then, and foreach, operators, such as $+$ and * and delimiters such as ;, ( and )`' simply need to send a token telling that such and such appeared. Their pattern action pairs all look similar to the following:
... {P_KEYWORD_IF} { self.send(TKN_KEYWORD_IF); RETURN; } {P_KEYWORD_THEN} { self