Lexatom Loading

Byte loaders provide plain, uninterpreted, and unformatted data from an arbitrary input source. A lexer, however, runs on a sequence of lexatoms. Physically, they are stored as integers in consecutive cells of memory called the buffer. The integer sequence needs to comply with the input encoding for which the lexer has been designed (ASCII/Unicode by default). The adaptation of raw input data into an appropriate format for the lexer is accomplished by the lexatom loader.

At the beginning of human-machine interaction, somewhere before 2000 CE, machine readable texts were mostly stored in encodings which associate a letter with an integer that fits a single byte [#f0]_. A DFA’s state transitions would then trigger on integers that correspond to the encoding’s values of a character. For example, in an ASCII-based DFA, a state transition on the character ‘a’ manifests as a transition triggered by the integer 97. If the application is restricted to one such homogeneous encoding, it is sufficient to generate the lexer appropriately (see the --encoding command line option). Then, lexatom loading does nothing more than pass through the byte loader’s data.
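To make the byte-level view concrete, the following minimal hand-written sketch (illustrative only, not generated by Quex) hard-codes a DFA that accepts one or more ‘a’ characters by triggering on the integer 97:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal hand-written DFA: accepts "a+" over ASCII input.
 * State transitions trigger on the integer value of each byte;
 * in ASCII, 'a' is the integer 97.                             */
static int matches_a_plus(const unsigned char* input, size_t n)
{
    int state = 0;                       /* 0: initial, 1: accepting */
    for (size_t i = 0; i < n; ++i) {
        if (input[i] == 97) state = 1;   /* transition on 'a'        */
        else                return 0;    /* no transition: reject    */
    }
    return state == 1;
}
```

A lexer generated for a different single-byte encoding would trigger on that encoding’s integer values instead; the DFA structure stays the same.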

Alternatively, a lexatom loader could reformat or convert incoming data. Such a configuration has several potential advantages:

  1. It separates an intermediate interpretation of raw data from lexical analysis.

  2. One and the same lexer may parse a large variety of encodings simply by switching the converters that feed it.

  3. Conversion can translate a longer sequence of code units to fewer lexatoms, thus reducing the number of state transitions.
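The third point can be illustrated with a self-contained sketch (plain C, no Quex types): in UTF-8, a code point may occupy up to four code units, but after conversion it becomes a single lexatom, so the DFA performs one transition instead of several.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count Unicode code points in a UTF-8 byte sequence by skipping
 * continuation bytes (those of the form 10xxxxxx). After conversion,
 * each code point occupies exactly one lexatom.                      */
static size_t utf8_code_point_count(const uint8_t* p, size_t byte_n)
{
    size_t count = 0;
    for (size_t i = 0; i < byte_n; ++i)
        if ((p[i] & 0xC0) != 0x80) ++count;  /* lead byte or ASCII */
    return count;
}
```

For example, the Euro sign ‘€’ is the three code units 0xE2 0x82 0xAC in UTF-8, but only one lexatom after conversion.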

A converting lexatom loader manages buffer filling, consistent stream navigation, and interaction with the byte loader. The core task of conversion is done, however, by an object that provides a Converter interface. A Converter’s only task is to translate a byte stream into a lexatom stream. Quex provides pre-cooked converters for the GNU IConv library and IBM’s ICU library.

A lexer implements plain pass-through when its constructor receives a null pointer for the converter.

#include <LexerUnicode>
...
LexerUnicode     lex("example.txt", nullptr);
...

When the constructor receives anything other than a null pointer, it takes it as a converter and implements converting lexatom loading. Ownership of the converter is transferred to the lexer. In the following example, a lexer is set up with a Converter that adapts GNU’s IConv library to decode UTF8 data streams to Unicode.

#include <LexerUnicode>
#include <lib/quex/converter/iconv/Converter_IConv>
#include <lib/quex/converter/iconv/Converter_IConv.i>
#include <lib/quex/converter/iconv/Converter.i>

...
size_t           lexatomBitN = sizeof(LexerUnicode_lexatom_t) * 8;
quex_Converter*  converter   = quex::Converter_IConv_new(lexatomBitN, "UTF8", nullptr);
LexerUnicode     lex("example.txt", converter);
...

After the inclusion of the lexer’s header LexerUnicode, the required header and implementation files for the IConv converter are included. "Converter.i" implements the base class’s functionality. Passing a null pointer as the third argument to the constructor of Converter_IConv initiates a conversion with Unicode output. Once created, a pointer to a Converter can be passed to the lexer’s constructor. The number of bits occupied by a lexatom must be passed explicitly, since a Converter is independent of a particular lexer implementation and has no knowledge of the lexatom type definition.
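To illustrate what the IConv-based converter does internally, the following self-contained sketch uses plain POSIX iconv (no Quex types) to turn a UTF-8 byte stream into 32-bit lexatoms. The "UCS-4LE" target and the direct uint32_t interpretation assume a little-endian host; a portable implementation would select the target encoding according to the host’s byte order.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-8 bytes into 32-bit Unicode code points ("lexatoms")
 * with POSIX iconv, similar to what the pre-cooked IConv converter
 * does. Returns the number of lexatoms written, (size_t)-1 on error. */
static size_t utf8_to_lexatoms(const char* utf8, size_t byte_n,
                               uint32_t* drain, size_t drain_n)
{
    iconv_t cd = iconv_open("UCS-4LE", "UTF-8");  /* little-endian host! */
    if (cd == (iconv_t)-1) return (size_t)-1;

    char*  source      = (char*)utf8;
    size_t source_left = byte_n;
    char*  out         = (char*)drain;
    size_t out_left    = drain_n * sizeof(uint32_t);

    size_t r = iconv(cd, &source, &source_left, &out, &out_left);
    iconv_close(cd);
    if (r == (size_t)-1) return (size_t)-1;
    return (size_t)(out - (char*)drain) / sizeof(uint32_t);
}
```

The real Converter additionally reports a load result and keeps partially consumed byte sequences for the next call; this sketch only shows the core translation step.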

Assuming that the engine for LexerUtf8 has been generated with --encoding UTF8, LexerUtf8 can consume the same input relying on plain lexatom loading. Upon construction, nothing more is necessary than a nullptr as converter. The lexer can parse UTF8 directly, but it cannot be fed with other encodings, as was possible with a converter.

LexerUtf8   lex("example.txt", nullptr);

In contrast to a byte loader, a lexatom loader implements procedures that go beyond the plain passing through of data from one API to another. Lexatom loaders maintain consistency of the input source, the lexer’s buffer, and possibly an intermediate translation buffer under the procedures of loading and stream navigation.

Converters that have been created but never passed on to a lexer can be deleted with the following function:

void quex::Converter_delete(Converter** me); // C++
void quex_Converter_delete(Converter** me);  // C

If the provided pointer *me is a null pointer, the function neither destructs an object nor frees any memory. Otherwise, the object is destructed, its memory is freed, and *me is set to a null pointer.
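The delete-and-null pattern described above can be sketched in plain C with a hypothetical object type (names are illustrative, not the actual Quex code):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object type, standing in for a Converter. */
typedef struct { int dummy; } Object;

/* Null-safe delete: destructs and frees the object if present,
 * then leaves no dangling pointer behind.                      */
static void Object_delete(Object** me)
{
    if (!me || !*me) return;   /* null pointer: nothing to do */
    /* ... destruct members here ...                          */
    free(*me);                 /* release the object's memory */
    *me = NULL;                /* caller's pointer is nulled  */
}
```

Because the pointer is nulled, calling the function twice on the same variable is harmless.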

Byte Order Marks

A converter has the ability to adapt itself to the encoding of a file. Traditionally, a so-called byte order mark (BOM) at the beginning of a file conveys the file’s encoding. Quex provides a couple of functions to support BOM treatment.

E_ByteOrderMark quex::bom_snap(ByteLoader* byte_loader);

This function returns the byte order mark found at the current position. Depending on the BOM’s length, the input position is set right behind it. The helper function quex::bom_name(BOM) delivers a constant string holding the name of a particular BOM as provided by quex::bom_snap(...). The pre-cooked converters for ICU and IConv provide ‘new’ functions that directly digest a BOM and initialize conversion towards Unicode. The example below shows how this is done relying on the IConv converter.

FILE*              fh          = fopen("example.txt", "rb");
quex::ByteLoader*  byte_loader = quex::ByteLoader_FILE_new(fh, true);
E_ByteOrderMark    bom_id      = quex::bom_snap(byte_loader);
quex::Converter*   converter;

if( bom_id == QUEX_BOM_NONE ) bom_id = QUEX_BOM_UTF_8;

converter = quex::Converter_IConv_new_from_BOM(sizeof(Lexer_lexatom_t)<<3, bom_id);
if( ! converter ) {
    /* error handling */ ... return;
}

Lexer lex(byte_loader, converter);
...
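The detection step performed by bom_snap can be illustrated with a simplified, self-contained sketch. Only three common BOMs are checked here; the real function knows many more, and a complete detector must test UTF-32LE (FF FE 00 00) before UTF-16LE (FF FE), since the latter is a prefix of the former.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified BOM detection: return the BOM length in bytes,
 * or 0 if no (known) BOM is present at the buffer's start.  */
static size_t bom_length(const uint8_t* p, size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return 3; /* UTF-8    */
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)                 return 2; /* UTF-16LE */
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)                 return 2; /* UTF-16BE */
    return 0;
}
```

Skipping the returned number of bytes corresponds to bom_snap setting the input position right behind the BOM.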

Customization

Customization of lexatom loading means to specify a conversion procedure implemented in a derived class/struct of Converter. In contrast to the lexatom loader, a converter is not concerned with stream navigation (no tell, no seek). It provides an interface to an external tool which transforms a given array of input bytes into an array of output bytes. However, it may have to manage some kind of statefulness.

Copying the pre-cooked implementations for IBM’s ICU and GNU IConv provides a good starting point. A generated lexer comes with compilable versions of converters in "lexer/lib/quex/converter/" and its subdirectories. Implementing a customized converter basically means providing a set of function pointers, which are set up by an explicit call to the base class’s constructor, namely:

bool
QUEX_NAME_LIB(Converter_construct)(QUEX_GNAME_LIB(Converter)* me,
                                   size_t       LexatomSize_bit,
                                   size_t       InputCodeUnitSize,
                                   E_LoadResult (*convert)(QUEX_GNAME_LIB(Converter)*,
                                                           uint8_t** source, const uint8_t* SourceEnd,
                                                           void**    drain,  const void*    DrainEnd),
                                   void         (*destruct)(QUEX_GNAME_LIB(Converter)*),
                                   ptrdiff_t    (*stomach_byte_n)(QUEX_GNAME_LIB(Converter)*),
                                   void         (*stomach_clear)(QUEX_GNAME_LIB(Converter)*),
                                   void         (*print_this)(QUEX_GNAME_LIB(Converter)*))

The initialization of the actual converter can only happen after the constructor has returned true. LexatomSize_bit is the number of bits occupied by a lexatom [#fm1]_. InputCodeUnitSize must be either -1 or equal to the granularity of the input stream. For example, UTF8 requires byte granularity. The value, however, is only used to detect inconsistent chunk sizes of the byte loader. The provided function pointers are explained in the following list.

E_LoadResult(*convert)(QUEX_GNAME_LIB(Converter)*, uint8_t** source, const uint8_t* SourceEnd, void** drain, const void* DrainEnd)

Converts data provided in a source memory region and writes the result into a drain memory region. *source points to the first byte of the source. *drain points to the beginning of the drain memory. The pointers SourceEnd and DrainEnd point to the first byte after the last byte of the respective region.

This function neither reads beyond the end of the source, nor writes beyond the drain’s end. After termination, *source points to the first byte after the last byte that was read. *drain is adapted to point behind the last written byte.
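The pointer contract can be sketched with a trivial "conversion" that widens ASCII bytes to 32-bit lexatoms (plain C, simplified: the real convert() additionally returns an E_LoadResult and receives the Converter object itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the convert() contract: consume from [*source, SourceEnd),
 * produce into [*drain, DrainEnd), advance both pointers past what was
 * read/written, and never cross either end.                            */
static void convert_ascii(uint8_t** source, const uint8_t* SourceEnd,
                          void** drain, const void* DrainEnd)
{
    uint32_t*       out     = (uint32_t*)*drain;
    const uint32_t* out_end = (const uint32_t*)DrainEnd;

    while (*source != SourceEnd && out != out_end)
        *out++ = (uint32_t)*(*source)++;   /* one code unit -> one lexatom */

    *drain = out;   /* points behind the last written lexatom */
}
```

On return, the caller can tell from the pointer positions whether the source was exhausted or the drain ran full, and resume accordingly on the next call.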

If the underlying converter exhibits some kind of statefulness, it must be handled by the functions stomach_byte_n() and stomach_clear().

ptrdiff_t    (*stomach_byte_n)(QUEX_GNAME_LIB(Converter)*),

Returns the number of unused bytes that the converter holds in its stomach to be used for later conversions.

void         (*stomach_clear)(QUEX_GNAME_LIB(Converter)*),

This function is called upon a conversion discontinuity [#f1]_. In case a converter has no statefulness, this function may be left empty [#f2]_.

The remaining two function pointers are self-explanatory: destruct() releases all resources of the converter without freeing its own memory; print_this() prints the converter’s state in some human-readable form to the standard output.

When a converter has been implemented, it is advisable to adapt the existing unit tests for one of the pre-cooked converters located in "quex/code_base/quex/converter/*/TEST". As with byte loaders, passing these tests is a solid basis for application in a lexer. Finally, any new converter implementation makes for a good Open Source contribution.