Purpose and Repurpose

The purpose of a lexical analyzer is to produce tokens as an interpretation of raw data provided in an input stream. The relation between input data and tokens is defined by a language. In practice, a language is specified by a list of pattern-action pairs and event handlers organized in a set of modes. Once the lexer is generated, its language is immutable. The input, however, can still be specified and varied at runtime. Three occasions require input configuration or reconfiguration: construction, reset, and inclusion.

Construction

Constructors prepare an uninitialized chunk of memory to hold a lexer object so that it can start lexical analysis. Once an input source is provided, the lexer’s purpose is fully defined. In C, a generated lexer comes with the following explicit constructor functions.

void Lexer_from_file_name(Lexer*           me,
                          const char*      Filename,
                          quex_Converter*  converter);
void Lexer_from_ByteLoader(Lexer*            me,
                           quex_ByteLoader*  byte_loader,
                           quex_Converter*   converter);
void Lexer_from_memory(Lexer*            me,
                       Lexer_lexatom_t*  Begin,
                       size_t            Size,
                       Lexer_lexatom_t*  EndOfFileP);

In C++, they are:

Lexer(const char*      Filename,
      quex::Converter* Converter = nullptr);
Lexer(quex::ByteLoader*  byte_loader,
      quex::Converter*   Converter = nullptr);
Lexer(Lexer_lexatom_t* BufferMemoryBegin,
      size_t           BufferMemorySize,
      Lexer_lexatom_t* BufferEndOfContentP = nullptr);

The easiest way to construct a lexer is to specify a file name. If this is not possible, a byte loader configuration may be specified. The last alternative is to provide memory directly for analysis. A failure during construction is reported through an error code. Thus, it is advisable to check the .error_code member before actually running any analysis.
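The construct-then-check pattern can be sketched as follows. This is only an illustration under stated assumptions: CtorDemoLexer, ctor_lexatom_t, the two error constants, and ctor_demo_ready_lexatoms are hypothetical stand-ins and not part of a generated lexer. Only the idea of an error_code member that is inspected before analysis is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a generated lexer defines its own types and codes. */
typedef unsigned char ctor_lexatom_t;
enum { CTOR_ERROR_NONE = 0, CTOR_ERROR_BAD_MEMORY = 1 };

typedef struct {
    int                   error_code; /* set by the constructor on failure   */
    const ctor_lexatom_t* read_p;     /* next lexatom to be analyzed         */
    const ctor_lexatom_t* end_p;      /* one past the last lexatom of content */
} CtorDemoLexer;

/* Turns the raw memory at 'me' into a usable object, or flags an error. */
void CtorDemoLexer_from_memory(CtorDemoLexer*        me,
                               const ctor_lexatom_t* Begin,
                               size_t                Size,
                               const ctor_lexatom_t* EndOfFileP)
{
    assert(me != NULL);
    me->error_code = CTOR_ERROR_NONE;
    me->read_p     = NULL;
    me->end_p      = NULL;

    if (Begin == NULL || Size == 0 || EndOfFileP == NULL
        || EndOfFileP < Begin || EndOfFileP > Begin + Size) {
        me->error_code = CTOR_ERROR_BAD_MEMORY;
        return;
    }
    me->read_p = Begin;
    me->end_p  = EndOfFileP;
}

/* Constructs a lexer and checks the error code before touching anything else.
 * Returns the number of lexatoms ready for analysis, or -1 on failure.       */
long ctor_demo_ready_lexatoms(const ctor_lexatom_t* memory,
                              size_t                size,
                              size_t                content_size)
{
    CtorDemoLexer lexer;
    CtorDemoLexer_from_memory(&lexer, memory, size,
                              memory != NULL ? memory + content_size : NULL);

    if (lexer.error_code != CTOR_ERROR_NONE) {
        return -1; /* construction failed: do not start analysis */
    }
    return (long)(lexer.end_p - lexer.read_p);
}
```

The constructor itself returns nothing, so the caller has no other way to learn about a failure than to look at the error code. This is why the check comes first.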

A constructor is not the only means to set up an input source. As long as the language remains the same, a lexer may be repurposed for other input sources. The following two sections explain the concepts of ‘reset’ and ‘inclusion’.

Reset

A reset lets a lexer drop its current input source and prepares it for a new one. Its complete state is reset and any information about previous input falls into oblivion. In C, the reset functions are the following function pointers of a lexer object:

bool reset_file_name(Lexer*          me,
                     const char*     Filename,
                     quex_Converter* converter);
bool reset_ByteLoader(Lexer*           me,
                      quex_ByteLoader* byte_loader,
                      quex_Converter*  converter);
bool reset_memory(Lexer* me,
                  Lexer_lexatom_t*  Begin,
                  size_t            Size,
                  Lexer_lexatom_t*  EndOfFileP);

All three reset functions are reminiscent of the constructor calls, as they relate to the same procedures. A reset may be accomplished by simply providing the name of a file to be analyzed. Alternatively, a byte loader configuration may be given. Finally, a reset may be done by analyzing data in a specific chunk of memory. The return value of these functions reports the operation’s success. In C++, reset is done by means of the following member functions:

bool reset_file_name(const char*      Filename,
                     quex::Converter* converter);
bool reset_ByteLoader(quex::ByteLoader* byte_loader,
                      quex::Converter*  converter);
bool reset_memory(Lexer_lexatom_t*  Begin,
                  size_t            Size,
                  Lexer_lexatom_t*  EndOfFileP);

Since C++ implicitly provides the this pointer, the lexer object does not have to be passed explicitly, as is necessary in C.
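The effect of a reset can be illustrated with a toy model. The names ResetDemoLexer, reset_demo_memory, and reset_demo_skip_line are hypothetical and exist only for this sketch. It also assumes that a failed reset leaves the lexer untouched, which the text above does not specify for a generated lexer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy lexer; a generated lexer holds far more state. */
typedef struct {
    const char* begin;       /* start of the current input          */
    const char* read_p;      /* next character to be analyzed       */
    const char* end_p;       /* one past the last input character   */
    size_t      line_number; /* bookkeeping tied to the current input */
} ResetDemoLexer;

/* Drops the current input and all bookkeeping, then starts on new memory.
 * Returns false if the new input is unusable (assumption: the lexer then
 * stays as it was).                                                      */
bool reset_demo_memory(ResetDemoLexer* me, const char* Begin, size_t Size)
{
    assert(me != NULL);
    if (Begin == NULL || Size == 0) {
        return false;
    }
    me->begin       = Begin;
    me->read_p      = Begin;
    me->end_p       = Begin + Size;
    me->line_number = 1;     /* history of the previous input is gone */
    return true;
}

/* Consumes input up to and including the next newline.
 * Returns false if the input was already exhausted.   */
bool reset_demo_skip_line(ResetDemoLexer* me)
{
    assert(me != NULL);
    if (me->read_p == me->end_p) {
        return false;
    }
    while (me->read_p != me->end_p) {
        if (*me->read_p++ == '\n') {
            ++me->line_number;
            break;
        }
    }
    return true;
}
```

After a successful reset, neither the read position nor the line counter carries anything over from the previous input, which is the history-cancelling behavior described above.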

Inclusion

In cases where repurposing is temporary, the history-cancelling reset is inappropriate. The classic #include statement in C, for example, redirects a compiler to consider another file and then to return to the position right after the include statement. The included file may in turn include other files, and so on. In such an environment, the lexer must preserve the state it had before the input source was repurposed. The corresponding process is called ‘inclusion’.

Inclusion is executed in terms of include-push and include-pop. Include-push is executed by means of a call to one of the following function pointers in C.

bool include_push_file_name(Lexer*          me,
                            const char*     Filename,
                            quex_Converter* converter);
bool include_push_ByteLoader(Lexer*           me,
                             const char*      InputName,
                             quex_ByteLoader* byte_loader,
                             quex_Converter*  converter);
bool include_push_memory(Lexer*            me,
                         const char*       InputName,
                         Lexer_lexatom_t*  Begin,
                         size_t            Size,
                         Lexer_lexatom_t*  EndOfFileP);

The corresponding members in C++ differ only in that they do not receive a lexer as first argument. The input source provision scheme is the same as for constructors and reset functions. The return value reports the operation’s success.

When an include-push is triggered, the lexer’s current state is stored in a ‘memento’. A memento is a brief version of the lexer’s state. The memento is then stored on a stack, to be popped later. Including a new input source requires some buffer handling and pointer adjustments. Whenever possible, the same buffer is used for included content. This avoids excessive memory usage when many levels of inclusions are involved.

When an input source is exhausted, an include-pop must be triggered. This revives a lexer from a memento popped from the include stack. Reviving means that the lexer and its buffer are brought back into the state they had before the current input source was included. Everything is prepared so that analysis may continue from where it left off before the inclusion. There is only one function to execute an include-pop, namely:

bool include_pop(Lexer* me); // C
bool include_pop();          // C++

The return value tells whether a lexer has been revived. A return value of false signals that the outermost level of inclusion has been reached and that it is time to send a termination token.
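The push and pop mechanics described above can be sketched with a small stack of mementos. This is a simplified model, not the generated code: IncludeDemoLexer, IncludeDemoMemento, and the include_demo_ functions are hypothetical names, and the only state saved here is an input name and a read position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical memento: a brief copy of the state to be revived later. */
typedef struct IncludeDemoMemento {
    const char*                input_name;
    size_t                     position;
    struct IncludeDemoMemento* parent;  /* next older entry on the stack */
} IncludeDemoMemento;

/* Hypothetical toy lexer with an include stack. */
typedef struct {
    const char*         input_name;
    size_t              position;
    IncludeDemoMemento* include_stack;  /* NULL at the outermost level */
} IncludeDemoLexer;

void include_demo_init(IncludeDemoLexer* me, const char* InputName)
{
    assert(me != NULL);
    me->input_name    = InputName;
    me->position      = 0;
    me->include_stack = NULL;
}

/* Saves the current state in a memento, pushes it, and switches input.
 * Returns false if no memento could be allocated.                     */
bool include_demo_push(IncludeDemoLexer* me, const char* InputName)
{
    IncludeDemoMemento* memento;

    assert(me != NULL);
    memento = malloc(sizeof *memento);
    if (memento == NULL) {
        return false;
    }
    memento->input_name = me->input_name;
    memento->position   = me->position;
    memento->parent     = me->include_stack;
    me->include_stack   = memento;

    me->input_name = InputName;   /* start over on the included input */
    me->position   = 0;
    return true;
}

/* Revives the state from before the last push.
 * Returns false at the outermost level, where nothing is left to pop. */
bool include_demo_pop(IncludeDemoLexer* me)
{
    IncludeDemoMemento* memento;

    assert(me != NULL);
    memento = me->include_stack;
    if (memento == NULL) {
        return false;             /* time to send a termination token */
    }
    me->input_name    = memento->input_name;
    me->position      = memento->position;
    me->include_stack = memento->parent;
    free(memento);                /* the memento is not used afterwards */
    return true;
}
```

A typical end-of-stream handler in this model would call include_demo_pop and continue analysis if it returns true, or emit the termination token if it returns false.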

The memento class/struct is generated along with the lexical analyzer. It contains all necessary members to hold essential information about a default lexer. If further members need to be saved upon include-push, they must be declared and handled explicitly. The corresponding sections in the Quex source file are:

``memento { ... }`` to declare additional members and possibly functions
in the memento class.
``memento_pack { ... }`` to define additional operations of copying from
lexer to memento. Code from this section is executed after all default
lexer state information has been saved.
``memento_unpack { ... }`` to define additional operations of copying
from memento to lexer.  The code in this section is executed after all
default revival operations are done.

Code from memento_unpack is executed right before the explicit destructor call of the memento. If there are any additional destructor operations to be performed, this is the place for them. The memento is not going to be used afterwards.

Inside the memento_pack and memento_unpack sections, a reference to the lexer is available as self and the memento is available as a pointer named memento. Let, for example, word_count be an additional member of the lexer’s class, defined in a body section and initialized in a construct section.

body      { long word_count; }
construct { self.word_count = 0; }

In order to make sure that the word count is stored away in a memento and reset to zero for the new file, the memento sections are used.

memento { long saved_word_count; }
memento_pack {
     memento->saved_word_count = self.word_count;
     self.word_count = 0;
}
memento_unpack {
     self.word_count = memento->saved_word_count;
}

With this setup, include-pushes and include-pops may be performed while ensuring that the word_count is really only related to the current file being analyzed.
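The intended effect on word_count can be traced with a plain C model of the two sections. Everything here is a hypothetical stand-in: CountDemoLexer, CountDemoMemento, and the count_demo_ functions only mimic what the pack and unpack code does and are not what Quex generates.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lexer and its memento. */
typedef struct { long word_count;       } CountDemoLexer;
typedef struct { long saved_word_count; } CountDemoMemento;

/* Mirrors memento_pack: save the count, then start from zero. */
void count_demo_pack(CountDemoLexer* self, CountDemoMemento* memento)
{
    memento->saved_word_count = self->word_count;
    self->word_count          = 0;
}

/* Mirrors memento_unpack: restore the count of the including file. */
void count_demo_unpack(CountDemoLexer* self, const CountDemoMemento* memento)
{
    self->word_count = memento->saved_word_count;
}

/* Adds the number of whitespace-separated words in 'text' to the count. */
void count_demo_words(CountDemoLexer* self, const char* text)
{
    bool in_word = false;
    for (; *text != '\0'; ++text) {
        if (isspace((unsigned char)*text)) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            ++self->word_count;
        }
    }
}

/* Analyzes 'before', then an included text, then 'after'.
 * Returns the word count of the including file only; the count of the
 * included text is written to '*included_count'.                      */
long count_demo_with_include(const char* before,
                             const char* included,
                             const char* after,
                             long*       included_count)
{
    CountDemoLexer   lexer = { 0 };
    CountDemoMemento memento;

    assert(included_count != NULL);
    count_demo_words(&lexer, before);

    count_demo_pack(&lexer, &memento);     /* include-push */
    count_demo_words(&lexer, included);
    *included_count = lexer.word_count;
    count_demo_unpack(&lexer, &memento);   /* include-pop  */

    count_demo_words(&lexer, after);
    return lexer.word_count;
}
```

With the inputs "a b c", "x y", and "d", the including file ends up with a count of 4 and the included text with a count of 2, so neither count leaks into the other.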