Lexatom Loading
Byte loaders provide plain, uninterpreted, and unformatted data from an arbitrary input source. A lexer, however, runs on a sequence of lexatoms. Physically, these are stored as integers in consecutive cells of memory, called the buffer. The integer sequence must comply with the input encoding for which the lexer has been designed (ASCII/Unicode by default). The adaptation of raw input data into a format appropriate for the lexer is accomplished by the lexatom loader.
At the beginning of human-machine interaction, somewhere before 2000 CE, machine-readable texts were mostly stored in encodings which associate a letter with an integer that fits a single byte [#f0]_. A DFA's state transitions would then trigger on integers that correspond to the encoding's values of a character. For example, in an ASCII-based DFA, a state transition on the character 'a' manifests as a transition triggered by the integer 97. If the application is restricted to one such homogeneous encoding, it is sufficient to generate the lexer accordingly (see the --encoding command line option). Lexatom loading then does nothing more than pass through the byte loader's data.
Alternatively, a lexatom loader may reformat or convert incoming data. Such a configuration has several potential advantages:

- It separates an intermediate interpretation of raw data from lexical analysis.
- One and the same lexer may parse a large variety of encodings simply by switching the converters that feed it.
- Conversion can translate a longer sequence of code units into fewer lexatoms, thus reducing the number of state transitions.
A converting lexatom loader manages the buffer filling, consistent stream navigation, and the interaction with the byte loader. The core task of conversion, however, is done by an object that provides the Converter interface. A Converter's only task is to translate a byte stream into a lexatom stream. Quex provides pre-cooked converters for the GNU IConv library and IBM's ICU library.
A lexer implements plain pass-through when its constructor receives a null pointer instead of a converter.
#include <LexerUnicode>
...
LexerUnicode lex("example.txt", nullptr);
...
When the constructor receives anything else but a null pointer, it takes it for a converter and implements converting lexatom loading. Ownership of the converter is transferred to the lexer. In the following example, a lexer is set up with a Converter that adapts GNU's IConv library to decode UTF8 data streams to Unicode.
#include <LexerUnicode>
#include <lib/quex/converter/iconv/Converter_IConv>
#include <lib/quex/converter/iconv/Converter_IConv.i>
#include <lib/quex/converter/iconv/Converter.i>
...
size_t lexatomBitN = sizeof(LexerUnicode_lexatom_t) * 8;
quex::Converter* converter = quex::Converter_IConv_new(lexatomBitN, "UTF8", nullptr);
LexerUnicode lex("example.txt", converter);
...
After the inclusion of the lexer's header LexerUnicode, the required header and implementation for the IConv converter are included. "Converter.i" implements the base class' functionality. Passing a null pointer as third argument to the constructor of Converter_IConv initiates a conversion with Unicode output. Once created, a pointer to a Converter can be passed to the lexer's constructor. The number of bits occupied by a lexatom must be passed explicitly, since a Converter is independent of a particular lexer implementation and has no knowledge of the lexatom type definition. In C, the same setup relies on explicit function calls instead of constructors.
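A C version of this setup might look as follows. This is a sketch only: the quex_ prefix for C-level names follows the C/C++ naming pattern shown elsewhere in this section, but the exact header names and the lexer constructor function LexerUnicode_from_file_name are assumptions, not verified API.

```c
/* Hedged sketch of the C equivalent. The "quex_" prefix mirrors the
 * C naming pattern shown for quex_Converter_delete() below; the name
 * LexerUnicode_from_file_name is a hypothetical constructor function. */
#include "LexerUnicode.h"
#include "lib/quex/converter/iconv/Converter_IConv.i"
#include "lib/quex/converter/iconv/Converter.i"

size_t          lexatom_bit_n = sizeof(LexerUnicode_lexatom_t) * 8;
quex_Converter* converter     = quex_Converter_IConv_new(lexatom_bit_n, "UTF8", NULL);
LexerUnicode    lex;

LexerUnicode_from_file_name(&lex, "example.txt", converter);
```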
Assuming that the engine for LexerUtf8 has been generated with --encoding UTF8, LexerUtf8 can consume the same input relying on plain lexatom loading. Upon construction, nothing more is necessary than the nullptr as converter. The lexer can parse UTF8, but it cannot be fed with other encodings, as was possible with a converter.
LexerUtf8 lex("example.txt", nullptr);
In contrast to a byte loader, a lexatom loader implements procedures that go beyond the plain passing through of data from one API to another. Lexatom loaders maintain consistency of the input source, the lexer’s buffer, and possibly an intermediate translation buffer under the procedures of loading and stream navigation.
Prematurely created converters can be deleted with the following function:
void quex::Converter_delete(Converter** me); // C++
void quex_Converter_delete(Converter** me); // C
If the provided pointer *me is a null pointer, the function neither destructs the object nor frees the according memory. Otherwise, the object is destructed and its memory is freed. The pointer is then set to a null pointer.
Byte Order Marks
A converter has the ability to adapt itself to the encoding of a file. Traditionally, a so-called byte order mark (BOM) at the beginning of a file conveys the file's encoding. Quex provides a couple of functions to support BOM treatment.
- E_ByteOrderMark quex::bom_snap(ByteLoader* byte_loader);
This function returns the byte order mark found at the current position. Depending on the BOM's length, the input position is set right behind it. The helper function quex::bom_name(BOM) delivers a constant string holding the name of a particular BOM as provided by quex::bom_snap(...). The pre-cooked converters for ICU and IConv provide 'new' functions that directly digest a BOM and initialize for conversion towards Unicode. The example below shows how this is done relying on the IConv converter.
FILE* fh = fopen("example.txt", "rb");
quex::ByteLoader* byte_loader = quex::ByteLoader_FILE_new(fh, true);
E_ByteOrderMark bom_id = quex::bom_snap(byte_loader);
quex::Converter* converter;
if( bom_id == QUEX_BOM_NONE ) bom_id = QUEX_BOM_UTF_8;
converter = quex::Converter_IConv_new_from_BOM(sizeof(Easy_lexatom_t)<<3, bom_id);
if( ! converter ) {
    /* error handling */ ... return;
}
Lexer lex(byte_loader, converter);
...
Customization
Customization of lexatom loading means specifying a conversion procedure implemented in a derived class/struct of Converter. In contrast to the lexatom loader, a converter is not concerned with stream navigation (no tell, no seek). It provides an interface to an external tool which transforms a given array of input bytes into an array of output bytes. However, it may have to manage some kind of statefulness.
Copying the pre-cooked implementations for IBM's ICU and GNU IConv provides a good starting point. A generated lexer comes with compilable versions of converters in "lexer/lib/quex/converter/" and its subdirectories. Implementing a customized converter basically means providing a set of function pointers, which need to be set up by an explicit call to the base class' constructor, namely:
bool
QUEX_NAME_LIB(Converter_construct)(QUEX_GNAME_LIB(Converter)* me,
size_t LexatomSize_bit,
size_t InputCodeUnitSize,
E_LoadResult (*convert)(QUEX_GNAME_LIB(Converter)*,
uint8_t** source, const uint8_t* SourceEnd,
void** drain, const void* DrainEnd),
void (*destruct)(QUEX_GNAME_LIB(Converter)*),
ptrdiff_t (*stomach_byte_n)(QUEX_GNAME_LIB(Converter)*),
void (*stomach_clear)(QUEX_GNAME_LIB(Converter)*),
void (*print_this)(QUEX_GNAME_LIB(Converter)*))
The initialization of the actual converter can only happen after the constructor has returned true. LexatomSize_bit is the number of bits occupied by a lexatom [#fm1]_. InputCodeUnitSize must be either -1 or equal to the required granularity of the input stream. For example, UTF8 requires byte granularity. The value, however, is only used to detect inconsistent chunk sizes of the byte loader. The provided function pointers are explained in the following list.
- E_LoadResult (*convert)(QUEX_GNAME_LIB(Converter)*, uint8_t** source, const uint8_t* SourceEnd, void** drain, const void* DrainEnd)
Converts data provided in a source memory region and writes the result into a drain memory region. *source points to the first byte of the source; *drain points to the beginning of the drain memory. The pointers SourceEnd and DrainEnd point to the first byte after the last byte of the according region. This function neither reads beyond the end of the source, nor writes beyond the drain's end. After termination, *source points to the first byte after the last byte that was read; *drain is adapted to point behind the last written byte.
If the underlying converter exhibits some kind of statefulness, it must be handled by the functions stomach_byte_n() and stomach_clear().
- ptrdiff_t (*stomach_byte_n)(QUEX_GNAME_LIB(Converter)*),
Returns the number of unused bytes that the converter holds in its stomach to be used for later conversions.
- void (*stomach_clear)(QUEX_GNAME_LIB(Converter)*),
This function is called upon a conversion discontinuity [#f1]_. In case a converter has no statefulness, this function may be left empty [#f2]_.
The remaining two function pointers are self-explanatory: destruct() releases all resources of the converter (without freeing the converter's own memory), and print_this() prints the converter's state in some human-readable form to the standard output.
Once a converter has been implemented, it is advisable to adapt the existing unit tests for one of the pre-cooked converters located in "quex/code_base/quex/converter/*/TEST". As with byte loaders, passing these tests is a solid basis for application in a lexer. Finally, any new converter implementation makes for a good Open Source contribution.