Make It!

Grau, teurer Freund, ist alle Theorie und grün des Lebens goldner Baum. (German: All theory is gray, my dear friend, but green is the golden tree of life)

—Mephisto, the devil, in “Faust”, J. W. von Goethe (1749-1832)

Tolerating an arbitrary level of ignorance, this section demonstrates how to quickly build a functional lexical analyzer. Here is a minimalist analyzer description.

Listing 1 tiny.qx
 token { OP_EQUAL; NUMBER; IDENTIFIER; }

 mode ONE_AND_ONLY
 {
     <<EOF>>     => QUEX_TKN_TERMINATION;

     [ \t\r\n]+  {}
     [0-9]+      => QUEX_TKN_NUMBER(Lexeme);
     [_a-zA-Z]+  => QUEX_TKN_IDENTIFIER(Lexeme);
 }

The token section defines the token identifier names. The mode section defines the behavior of the lexical analysis. The definition for <<EOF>> specifies that upon ‘end-of-file’ the token QUEX_TKN_TERMINATION is sent. Whitespace in the form of space, tabulator, carriage return, or newline is ignored; it is associated with an empty action {}. QUEX_TKN_NUMBER is sent when a sequence of digits occurs, and QUEX_TKN_IDENTIFIER signals a sequence of letters or underscores. The tokens for both carry the matching Lexeme along. Let the aforementioned content be stored in a file tiny.qx. The command line below generates a lexical analyzer:

> quex -i tiny.qx -o tiny --language C

The result is located in the subdirectory tiny. Here is some code that uses it.

Listing 2 lexer.c
#include <stdio.h>
#include "tiny/tiny.h"

int main(int argc, char** argv)
{
    tiny_Token* token_p = 0x0;
    tiny        tlex;

    tiny_from_file_name(&tlex, "example.txt", /* Converter */NULL);

    while( tlex.error_code == E_Error_None ) {
        tlex.receive(&tlex, &token_p);
        printf("%s", tiny_map_token_id_to_name(token_p->id));
        if( token_p->id == QUEX_TKN_TERMINATION ) break;
        printf(": %s\n", token_p->text);
    }

    tiny_destruct(&tlex);
    return 0;
}

This is a C example, so the constructor tiny_from_file_name() and the destructor tiny_destruct() need to be called explicitly. A while loop iterates over the incoming tokens produced from the input file example.txt. It ends when an error occurs or the terminating token arrives. Inside the loop, the function tlex.receive() initiates an analysis step and delivers a token through the pointer token_p. Let the code be stored in lexer.c. Then, the command line:

> gcc lexer.c tiny/tiny.c -I. -o lexer

produces an application lexer. Given a text file example.txt with the content below:

99 red balloons

and typing on the command line:

> ./lexer

delivers the output:

NUMBER: 99
IDENTIFIER: red
IDENTIFIER: balloons
<TERMINATION>
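To make the effect of the three pattern rules in tiny.qx concrete, the following hand-written C sketch mimics them on an in-memory string. It does not use Quex and is not how Quex-generated code works internally; all names in it (TokenId, Token, next_token(), token_name() and the TKN_... constants) are hypothetical and chosen only for this illustration. The sketch additionally reports a TKN_FAILURE token for characters that no rule covers, which is an assumption of this sketch and not a statement about Quex's behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token ids for this sketch (Quex generates its own ids). */
typedef enum { TKN_TERMINATION, TKN_NUMBER, TKN_IDENTIFIER, TKN_FAILURE } TokenId;

typedef struct {
    TokenId id;
    char    text[64];   /* copy of the matched lexeme, truncated if too long */
} Token;

/* Character classes corresponding to the patterns in tiny.qx. */
static int is_space(int c)  { return c == ' ' || c == '\t' || c == '\r' || c == '\n'; }
static int is_digit(int c)  { return c >= '0' && c <= '9'; }
static int is_letter(int c) { return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }

/* Reads one token starting at *cursor and moves *cursor behind it. */
static void next_token(const char** cursor, Token* token)
{
    const char* p;
    const char* begin;
    size_t      length;
    int       (*in_class)(int) = NULL;

    assert(cursor != NULL && *cursor != NULL && token != NULL);
    p = *cursor;

    while (is_space(*p)) ++p;                 /* [ \t\r\n]+  {}   (ignored)   */
    begin = p;

    if (*p == '\0')         { token->id = TKN_TERMINATION; }          /* <<EOF>>    */
    else if (is_digit(*p))  { token->id = TKN_NUMBER;     in_class = is_digit;  }
    else if (is_letter(*p)) { token->id = TKN_IDENTIFIER; in_class = is_letter; }
    else                    { token->id = TKN_FAILURE;    ++p; }      /* no rule    */

    /* Consume as many characters of the same class as possible. */
    if (in_class != NULL) { while (in_class(*p)) ++p; }

    length = (size_t)(p - begin);
    if (length >= sizeof(token->text)) length = sizeof(token->text) - 1;
    memcpy(token->text, begin, length);
    token->text[length] = '\0';

    *cursor = p;
}

/* Maps a token id to a printable name. */
static const char* token_name(TokenId id)
{
    switch (id) {
    case TKN_TERMINATION: return "<TERMINATION>";
    case TKN_NUMBER:      return "NUMBER";
    case TKN_IDENTIFIER:  return "IDENTIFIER";
    default:              return "<FAILURE>";
    }
}
```

Calling next_token() repeatedly on the string "99 red balloons" until TKN_TERMINATION arrives yields a NUMBER with text "99" followed by two IDENTIFIER tokens with texts "red" and "balloons", which is the same token sequence as in the output above. The point of the generator is that such code does not have to be written and maintained by hand as the set of patterns grows.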

Done. To use C++ instead of C, omit the --language C option, i.e.:

> quex -i tiny.qx -o tiny

The code fragment to be stored in a file lexer.cpp would be:

Listing 3 lexer.cpp
 #include <iostream>
 #include "tiny/tiny"

 int main(int argc, char** argv)
 {
     tiny_Token*  token_p = 0x0;
     tiny         tlex("example.txt", /* Converter */NULL);

     while( tlex.error_code == E_Error_None ) {
         tlex.receive(&token_p);
         std::cout << token_p->id_name();
         if( token_p->id == QUEX_TKN_TERMINATION ) break;
         std::cout << ": " << token_p->text << std::endl;
     }
     return 0;
 }

It may be compiled with:

> g++ lexer.cpp tiny/tiny.cpp -I. -o lexer

This command line for C++ produces the same result as before for C. The aforementioned examples for C and C++ were copy-pasted from the demos. Indeed, the demo subdirectories contain a variety of functional applications. Each one has its nitty-gritty problems solved. They are perfect starting points for one’s own project.

Demo Applications

The subdirectories of the distribution’s demo/ directory contain a set of example applications for each programming language. The following list associates each directory name with the subject on which the example elaborates.

00-Minimalist/:

The example explained in this section.

01-Trivial/:

A trivial example that goes slightly beyond the minimal.

02-ModesAndStuff/:

Modes, mode transitions, mode inheritance.

03-Indentation/:

Parsing scopes based on indentation (such as in Python).

04-ConvertersAndBOM/:

Character encoding conversions using ICU and IConv. The byte-order mark (BOM).

05-LexerForC/:

A lexer for the C programming language.

06-Include/:

Including files during lexical analysis.

07-TrailingPostContext/:

Dealing with the dangerous trailing context.

08-DeletionAndPriorityMark/:

Reordering pattern-action pairs in the mode inheritance hierarchy.

09-WithBisonParser/:

Connecting a lexical analyzer to a Bison generated parser.

10-SocketsAndTerminal/:

Feeding lexical analysis from sockets and by the console.

11-ManualBufferFilling/:

Feeding the lexical analysis’ buffer manually, rather than relying on input streams.

12-EngineEncoding/:

Encoding a lexical analyzer engine, rather than using converted input.

13-MultipleLexers/:

Using multiple lexical analyzers in one application.

14-MultipleLexersSameToken/:

Using a common generated token class in multiple lexical analyzers.

15-FuzzyMatch/:

Levenshtein, PseudoDamerau and edit distance functions for fuzzy matches.

16-OsalEmbedded/:

Example of an embedded application relying on Quex’s ‘tiny stdlib’.

Each directory contains a Makefile and a CMakeLists.txt file. For UNIX users, that means that typing:

> make

is sufficient to produce a functional application. In other cases, many IDEs can read CMakeLists.txt directly. Otherwise, the -G option lets cmake generate the desired build environment, for example:

> cmake -G "Visual Studio 14 2015 ARM"

generates a build environment for Visual Studio™ for ARM™ devices.

Summary

The goal of this tiny chapter was to make it possible to implement a lexical analyzer quickly. However, recalling the quote from “Faust”, it is the devil who brings up the temptations of functional ignorance. For the sake of virtue and to avoid the dangers of deficient expertise, the following chapters provide the insights needed to keep lexical analyzer generation safely under control.

العِلمُ قَبلَ القَولِ وَ العَملِ (Arabic: Science must always precede speech and action)

—Famous chapter title in “As Saheeh”, M. Al Bukhary (810-870)