Work Stages

The implementation of a lexical analyzer requires two major steps. First, the lexical analysis needs to be specified. Second, the particular environment for its application must be prepared. Section Make It! showed the simplicity of the process. This section pinpoints the required actions in general.

Specification of lexical analysis

Lexical analysis requires a mapping from patterns to tokens. One may also specify regions of the input stream to skip, e.g. because they are ‘white space’ or comments. In addition, a customized token class and the corresponding token ids can be provided. In the example of section Make It!, a lexer was specified in the file tiny.qx with the content shown below.

Listing 4 Example lexer specification in file tiny.qx.
 token { OP_EQUAL; NUMBER; IDENTIFIER; }

 mode ONE_AND_ONLY
 {
     <<EOF>>     => QUEX_TKN_TERMINATION;

     [ \t\r\n]+  {}
     [0-9]+      => QUEX_TKN_NUMBER(Lexeme);
     [_a-zA-Z]+  => QUEX_TKN_IDENTIFIER(Lexeme);
 }
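
To make the mapping from patterns to tokens concrete, the behavior described by the specification above can be imitated by hand. The following is a minimal sketch, not the code that Quex generates: the names TokenId, Token, next_token and tokenize, as well as the TKN_FAILURE id for unmatched input, are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token ids for this sketch; they only mirror the names
// used in tiny.qx and are not the ids produced by Quex.
enum TokenId { TKN_TERMINATION, TKN_NUMBER, TKN_IDENTIFIER, TKN_FAILURE };

struct Token {
    TokenId     id;
    std::string text;   // the matched lexeme (empty for TERMINATION)
};

static bool is_space(char c)      { return c == ' ' || c == '\t' || c == '\r' || c == '\n'; }
static bool is_digit(char c)      { return c >= '0' && c <= '9'; }
static bool is_identifier(char c) { return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }

// Returns the token that starts at 'pos' and advances 'pos' past it.
Token next_token(const std::string& input, std::size_t& pos) {
    // [ \t\r\n]+ {}  : white space is skipped, no token is produced
    while (pos < input.size() && is_space(input[pos])) ++pos;

    // <<EOF>> => TERMINATION
    if (pos >= input.size()) return {TKN_TERMINATION, ""};

    const std::size_t begin = pos;
    if (is_digit(input[pos])) {            // [0-9]+ => NUMBER(Lexeme)
        while (pos < input.size() && is_digit(input[pos])) ++pos;
        return {TKN_NUMBER, input.substr(begin, pos - begin)};
    }
    if (is_identifier(input[pos])) {       // [_a-zA-Z]+ => IDENTIFIER(Lexeme)
        while (pos < input.size() && is_identifier(input[pos])) ++pos;
        return {TKN_IDENTIFIER, input.substr(begin, pos - begin)};
    }
    // No pattern matches: consume one character and report it (assumption
    // of this sketch; the specification above defines no such rule).
    ++pos;
    return {TKN_FAILURE, input.substr(begin, 1)};
}

// Collects all tokens of 'input', including the final TERMINATION token.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    for (;;) {
        tokens.push_back(next_token(input, pos));
        if (tokens.back().id == TKN_TERMINATION) break;
    }
    return tokens;
}
```

Calling tokenize("abc 123") in this sketch yields an IDENTIFIER token with the text "abc", a NUMBER token with the text "123", and a final TERMINATION token. A generated lexer does the same job with a state machine derived from the patterns, so none of this matching logic has to be written by hand.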

Preparation of the application environment

  1. Build system setup: The procedure to compile the lexer specification and link it with application code must be defined. In a Makefile, this may look like the following.

    engine.o:  tiny.qx
        quex -i tiny.qx -o tiny    # output => subdirectory './tiny'
        g++ -c tiny/tiny.cpp -I. -o engine.o
    
    application: engine.o
        g++ application.o engine.o -o application
    

    The first target engine.o regenerates the lexer code whenever tiny.qx changes and compiles it into the corresponding object file. Assuming that the application’s object file is generated elsewhere, the application can be built by linking it with the engine’s object file.

  2. Byte loaders and lexatom loaders (optional): If the standard I/O library is not available and/or the input is encoded, then customized byte loaders and lexatom loaders need to be constructed. In practice, in C/C++ this means that pointers to those objects need to be passed to the lexer’s constructor.

    quex_ByteLoader* byte_loader     = quex_ByteLoader_POSIX_new(socket_fd);
    size_t           bit_per_lexatom = sizeof(CLexer_lexatom_t)<<3;
    quex_Converter*  lexatom_loader  = quex_Converter_IConv_new(bit_per_lexatom,
                                                                "UTF8", NULL);
    
    CLexer lexer = CLexer::from_ByteLoader(byte_loader, lexatom_loader);
    
  3. Token reception: A framework to receive tokens and abort upon termination must be implemented. For example, in a Bison-generated parser, the function that receives tokens, CMyParser_yylex(), might look like the following.

    int CMyParser_yylex(YYSTYPE* yylval, CLexer* qlex)
    {
        Token* token = NULL;
    
        qlex->receive(qlex, &token);
    
        if( ! token || qlex->error_code != E_ErrorNone ) {
            return QUEX_TKN_TERMINATION;
        }
        else {
            // Take over ownership over the token's content (not copy)
            yylval->str = (const char*)token->text;
            // Token destructor won't free 'LexemeNull'
            token->text = &CLexer_LexemeNull;
            return token->id;
        }
    }
    

    Taking over ownership of a token’s content instead of copying it is crucial for performance. As in the above example, the token’s member needs to be assigned a signal value (here CLexer_LexemeNull) which prevents the token’s destructor from de-allocating the memory the member pointed to, because that memory is now owned by someone else.
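
The ownership hand-over described in step 3 can be shown in isolation. The sketch below uses a hypothetical Token class and a hypothetical sentinel LexemeNull; they stand in for the generated token class and CLexer_LexemeNull and are not the Quex API. The counter destructor_free_count exists only to make the effect observable.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sentinel: a token whose text points here owns no memory.
static char LexemeNull = '\0';

// Counts how often a token destructor actually released memory
// (instrumentation for this sketch only).
static int destructor_free_count = 0;

struct Token {
    int   id;
    char* text;   // heap-allocated lexeme, or &LexemeNull

    Token(int id_, const char* lexeme) : id(id_) {
        const std::size_t n = std::strlen(lexeme) + 1;
        text = new char[n];
        std::memcpy(text, lexeme, n);
    }
    Token(const Token&) = delete;             // avoid accidental double ownership
    Token& operator=(const Token&) = delete;

    ~Token() {
        // The sentinel signals that someone else owns the content.
        if (text != &LexemeNull) {
            delete[] text;
            ++destructor_free_count;
        }
    }
};

// Hands the lexeme to the caller without copying it. Afterwards the token
// holds the sentinel, so its destructor leaves the memory alone. The caller
// becomes responsible for releasing the result with delete[].
char* take_text(Token& token) {
    char* owned = token.text;
    token.text = &LexemeNull;
    return owned;
}
```

After take_text() the returned pointer remains valid even when the token goes out of scope, and no character data has been copied. The price is that the receiver, for example the parser's semantic value, must release the memory itself.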

With these work stages in mind, it is possible to estimate the effort of implementing a lexical analyzer. For many applications, implementing a lexical analyzer means, in practice, adapting one of the examples that come with the Quex distribution. They already provide functional solutions and solve issues related to particular subjects. Detailed knowledge on lexical analyzer construction is provided in the chapters to come.