Byte Loaders

A byte loader connects to a low-level device or file interface and provides data in a normalized way to a lexatom loader. It loads consecutive, equal-sized byte chunks into a memory section. Naturally, a chunk is one byte in size. However, there are APIs where bytes can only be loaded in units of two or even four bytes [#f0]_. Regardless of the granularity of byte chunks, a byte loader associates each byte it loads with a byte index, or position. The byte index of the first byte in the stream is zero. For any two consecutive bytes, if the position of the front byte is i, the position of the subsequent byte is i+1 [#f1]_.

Implementations of byte loaders are provided as classes/structs derived from ByteLoader. The set of pre-cooked byte loaders delivered along with Quex is shown in the table Implementations of prepared ``ByteLoader``s.

Table 16 Implementations of prepared ``ByteLoader``s.

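==================  ======================================
Struct/Class        Interface
==================  ======================================
ByteLoader_FILE     C Standard I/O.
ByteLoader_stream   C++ Standard I/O for type char.
ByteLoader_OSAL     OSAL file I/O.
ByteLoader_POSIX    POSIX file I/O.
ByteLoader_Memory   Fast loading from plain memory.
ByteLoader_Monitor  Pass through and monitor interactions.
==================  ======================================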

A byte loader is allocated and initialized with ‘new’ functions provided along with each implementation. Those functions allocate a byte loader (relying on the memory manager), initialize it, and internally raise a flag which allows the lexer to destruct and delete the byte loader when it is no longer of use. The ByteLoader_FILE, for example, provides a byte loader that relies on the Standard C I/O routines as declared in the header file stdio.h. It provides the following new functions:

quex_ByteLoader_FILE_new(FILE* fh, bool BinaryModeF);
quex_ByteLoader_FILE_new_from_file_name(const char* FileName);

Or, respectively, in C++:

quex::ByteLoader_FILE_new(FILE* fh, bool BinaryModeF);
quex::ByteLoader_FILE_new_from_file_name(const char* FileName);

Notably, all byte loaders are located in namespace quex, which means that they are independent of a particular lexer configuration.

In general, a byte loader takes ownership of any non-constant argument passed to the constructor or new function. For example, file handles are closed upon a byte loader’s destruction. The user must not close any file handle that has been plugged into a byte loader. Stream handles are closed, destructed, and de-allocated along the same lines. The ByteLoader_Monitor takes another byte loader to be monitored. The monitored byte loader is destructed along with the monitoring byte loader’s destructor call. The ByteLoader_Memory, however, behaves differently. It does not deallocate the memory it has been pointed to from outside. The memory remains in the user’s ownership.

Warning

Local variables and ownership.

Since a byte loader takes ownership of non-constant arguments, it is a particularly bad idea to plug in a pointer to an object set up as a local variable. A local variable’s memory is freed when the thread of control reaches the end of its scope. If the object’s memory were freed again upon the byte loader’s destruction, this would almost certainly cause the program to crash.

Whenever a byte loader takes ownership of an argument, it makes sense to have a look into its destruct() method to see which deallocation method is applied.

The .ownership flag inside a byte loader tells who has the right and the task to delete the byte loader. If the ownership is set to ‘lexer’ (C/C++: E_Ownership_LEXER), it is further assumed that it has been allocated with the Quex memory manager. The new functions provided along with an implementation implicitly set the ownership to ‘lexer’. Any other method of creating a byte loader [#f2]_ sets the flag to ‘external’ (C/C++: E_Ownership_EXTERNAL).

There are cases where it is necessary to manually delete a byte loader before it is actually plugged into a lexer. A typical example is the stepwise construction of lexer components. When a failure prevents the construction of a lexer, but the byte loader has already been created, the deletion must be accomplished outside the lexer. The simplest way to delete a byte loader with an ownership flag set to ‘lexer’ is to call the corresponding delete function of its implementation.

The delete functions set the byte loader pointer to zero after freeing the object. If the byte loader pointer is already zero, delete does nothing.
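The ownership and deletion rules described above can be illustrated with a small stand-alone model. It does not use the Quex headers: ToyOwningLoader, toy_new, and toy_delete are hypothetical stand-ins, and only the enumerator names E_Ownership_LEXER and E_Ownership_EXTERNAL are taken from the text. It is a sketch of the described behavior, not the library's implementation.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical model, not the Quex API: a loader that owns its FILE handle.
enum E_Ownership { E_Ownership_LEXER, E_Ownership_EXTERNAL };

struct ToyOwningLoader {
    std::FILE*  fh;         // owned: closed when the loader is deleted
    E_Ownership ownership;  // who is responsible for deleting the loader
};

// Mirrors the role of a 'new' function: allocate, initialize, flag as 'lexer'.
inline ToyOwningLoader* toy_new(std::FILE* fh) {
    ToyOwningLoader* me = new ToyOwningLoader;
    me->fh        = fh;
    me->ownership = E_Ownership_LEXER;
    return me;
}

// Mirrors the described delete behavior: free the object, then set the
// pointer to zero. Calling it on a pointer that is already zero does nothing.
inline void toy_delete(ToyOwningLoader** me) {
    assert(me != nullptr);                  // address of the pointer must be valid
    if (*me == nullptr) return;             // already deleted: nothing to do
    if ((*me)->fh != nullptr) {
        std::fclose((*me)->fh);             // the loader closes the handle, not the user
    }
    delete *me;
    *me = nullptr;
}
```

In this model, a caller that fails while building further lexer components would call toy_delete(&loader) once and must not close the file handle itself.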

The byte loader interface mainly implements the functions tell(), seek(), and load(). tell() delivers the index of the next byte to be loaded. seek() sets a specific load position. load() loads a consecutive sequence of byte chunks from the stream, starting at the byte with the current byte index. In other words, a byte loader is responsible for loading data from an arbitrary input source and for the corresponding input stream navigation.

By default, an attempt to read data is aborted when the input stream does not deliver data. However, class ByteLoader provides a member function pointer to handle repeated attempts, namely:

bool  (*on_nothing)(ByteLoader*, size_t TryN, size_t RequestedByteN);

If this function pointer is set, a new attempt to read data is made only if the function returns true. This is important, for example, when socket connections are used for input and errors need to be treated, as in the following example.

#include <cerrno>        // errno
#include <cstdio>        // fprintf
#include <cstring>       // strerror
#include <sys/socket.h>  // getsockopt, SOL_SOCKET, SO_ERROR

static bool
self_on_nothing(ByteLoader*  me, size_t TryN, size_t RequestedByteN)
{
    int       error  = 0;
    socklen_t len    = sizeof (error);
    int       retval = getsockopt(((quex::ByteLoader_POSIX*)me)->fd, SOL_SOCKET, SO_ERROR, &error, &len);
    (void)TryN; (void)RequestedByteN;

    if( retval ) {
        // getsockopt() itself failed; the reason is reported in errno
        fprintf(stderr, "error getting socket error code: %s\n", strerror(errno));
        return  false; // No new attempt to read data
    }
    else if( error ) {
        // socket has a non zero error status
        fprintf(stderr, "socket error: %s\n", strerror(error));
        return  false; // No new attempt to read data
    }
    else {
        return true;   // No error pending: try to read again
    }
}

This assumes that the byte loader is a ByteLoader_POSIX. After the byte loader has been created, the function pointer must be assigned, as follows.

if( socket_fd == -1 ) return; // Error accept() failed
quex::ByteLoader* byte_loader = quex::ByteLoader_POSIX_new(socket_fd);
byte_loader->on_nothing = self_on_nothing;
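How such a callback may steer repeated attempts can be sketched with a stand-alone model. The text does not show the loop that Quex runs internally, so the loop below is an assumption; ToyRetryLoader, raw_load, toy_retry_load, retry_three_times, and flaky_source are hypothetical names. Only the shape of on_nothing follows the declaration above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the retry decision; not the Quex implementation.
struct ToyRetryLoader {
    // Same shape as the callback described above.
    bool        (*on_nothing)(ToyRetryLoader*, std::size_t TryN, std::size_t RequestedByteN);
    // Stand-in for the raw read; returns the number of bytes delivered.
    std::size_t (*raw_load)(ToyRetryLoader*, void* buffer, std::size_t ByteN);
    std::size_t empty_reads_left;  // how often the example source delivers nothing
};

// Ask the source until it delivers data or the callback refuses a new attempt.
inline std::size_t toy_retry_load(ToyRetryLoader* me, void* buffer, std::size_t ByteN) {
    assert(me != nullptr && me->raw_load != nullptr);
    for (std::size_t try_n = 0; ; ++try_n) {
        const std::size_t loaded_n = me->raw_load(me, buffer, ByteN);
        if (loaded_n != 0) return loaded_n;
        if (me->on_nothing == nullptr) return 0;          // default: abort
        if (!me->on_nothing(me, try_n, ByteN)) return 0;  // callback says stop
    }
}

// Example policy: permit a new attempt after the first three empty reads only.
inline bool retry_three_times(ToyRetryLoader*, std::size_t TryN, std::size_t) {
    return TryN < 3;
}

// Example source: delivers nothing 'empty_reads_left' times, then one byte.
inline std::size_t flaky_source(ToyRetryLoader* me, void* buffer, std::size_t ByteN) {
    if (me->empty_reads_left > 0) { --me->empty_reads_left; return 0; }
    if (ByteN == 0) return 0;
    static_cast<char*>(buffer)[0] = 'x';
    return 1;
}
```

A bounded policy such as retry_three_times keeps the loop from spinning forever on a source that never delivers data.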

ByteLoader_Monitor

A ByteLoader_Monitor replaces a byte loader and monitors its traffic. Instead of plugging a byte loader directly into a lexer, the lexer takes a pointer to a ByteLoader_Monitor which wraps the byte loader to be monitored. All requests it receives are redirected to the wrapped byte loader. Its activities can be monitored, or even distorted, by customizable callback functions. A monitoring byte loader is a useful tool for tracing and testing the offstage, low-level data loading process. Its new function is

quex::ByteLoader_Monitor_new(ByteLoader* subject, void* reference_object);

The function creates a byte loader that mimics the byte loader subject. The monitoring byte loader takes ownership of the subject. It does not take ownership of the optional reference_object. The pointer is available for callbacks as me->reference_object. If no callback requires a reference object, a null pointer may be passed as reference_object.

For C, the namespace specifier quex:: is replaced by the prefix quex_. Below is a list of customizable callbacks which trigger upon specific events. pos_type is a placeholder for QUEX_TYPE_STREAM_POSITION. me is a pointer to the monitoring byte loader itself. A monitoring byte loader’s constructor initializes all callbacks to null pointers. Thus, any pointer not explicitly set leaves the related activity unobserved.

pos_type on_tell(me, pos_type P);

The function is called upon a call to tell(). P carries the byte index returned by the subject’s tell() function. The monitoring byte loader finally returns what this callback returns. So, by means of this callback, the return value of tell() can be controlled.

pos_type on_seek(me, pos_type P);

This function is called upon a call to seek(), where P is the position that was requested by the caller. The return value of this callback is taken as the actual position to be sought by the monitored byte loader.

size_t   on_before_load(me, const size_t ByteN);

This function is called when load() is called, before the load happens. ByteN is the requested number of bytes. The return value determines the number of bytes that will actually be requested.

size_t  on_after_load(me, void* buffer, const size_t LoadedN, bool* end_of_stream_f);

When load() has finished, this function is called. buffer is a pointer to the memory region that load() was asked to fill. LoadedN is the number of bytes which load() loaded. end_of_stream_f points to a flag that is true if an end of stream has been detected. The return value of this callback is the number that will be reported to have been loaded.

The following example traces the byte loading process into a file. First, the tracing callbacks need to be defined. They all assume that the reference_object passed to the constructor is the file handle of the trace file.
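A minimal sketch of such tracing callbacks is given here. To compile on its own it uses stand-ins: pos_type is typedef-ed to long, and ToyMonitor is a hypothetical replacement for the monitoring byte loader type that only carries reference_object. The callback names MyTracer_on_tell and so on follow the assignments used in this section; their bodies and the trace format are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Stand-ins so that the sketch compiles without the Quex headers.
typedef long pos_type;                       // placeholder for QUEX_TYPE_STREAM_POSITION
struct ToyMonitor { void* reference_object; };

// All callbacks treat 'reference_object' as the FILE handle of the trace file.
static std::FILE* trace_of(ToyMonitor* me) {
    assert(me != nullptr && me->reference_object != nullptr);
    return static_cast<std::FILE*>(me->reference_object);
}

pos_type MyTracer_on_tell(ToyMonitor* me, pos_type P) {
    std::fprintf(trace_of(me), "tell: %ld\n", static_cast<long>(P));
    return P;        // report the subject's position unchanged
}

pos_type MyTracer_on_seek(ToyMonitor* me, pos_type P) {
    std::fprintf(trace_of(me), "seek: %ld\n", static_cast<long>(P));
    return P;        // seek to the position that was requested
}

std::size_t MyTracer_on_before_load(ToyMonitor* me, const std::size_t ByteN) {
    std::fprintf(trace_of(me), "load request: %lu\n", static_cast<unsigned long>(ByteN));
    return ByteN;    // request exactly what the caller asked for
}

std::size_t MyTracer_on_after_load(ToyMonitor* me, void* buffer,
                                   const std::size_t LoadedN, bool* end_of_stream_f) {
    (void)buffer;    // the content itself is not traced in this sketch
    std::fprintf(trace_of(me), "loaded: %lu, end of stream: %d\n",
                 static_cast<unsigned long>(LoadedN),
                 static_cast<int>(*end_of_stream_f));
    return LoadedN;  // report the true number of loaded bytes
}
```

Each callback returns its input unchanged, so the monitored traffic is only recorded and never distorted.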

The monitoring byte loader is then created around a ByteLoader_FILE that reads from a file example.txt.

quex::ByteLoader* subject  = quex::ByteLoader_FILE_new_from_file_name("example.txt");
FILE*             trace_fh = fopen("trace.log", "wb");
quex::ByteLoader* monitor  = quex::ByteLoader_Monitor_new(subject, trace_fh);

In this example, the pointer reference_object is used to pass a handle to a trace file. Once the monitoring byte loader exists, the callback functions can be set.

monitor->on_tell        = MyTracer_on_tell;
monitor->on_seek        = MyTracer_on_seek;
monitor->on_before_load = MyTracer_on_before_load;
monitor->on_after_load  = MyTracer_on_after_load;

It can now be plugged into a lexer, as any other byte loader.

MyLexer  lexer(monitor, nullptr);

With the monitoring byte loader under the hood, the lexer loads data into its buffer and seeks in the input stream as if it were working with a ByteLoader_FILE. All this happens offstage. The monitoring byte loader of this example, though, records all the traffic and writes it to the file "trace.log".

ByteLoader_stream

A ByteLoader_stream is a byte loader which adapts a C++ std::basic_istream-like object. However, it does not require the stream handle to be derived from std::basic_istream. Such a requirement would force new stream classes to rewrite the std::basic_streambuf member, which is a potentially tedious task. Instead, the byte loader relies on the templating mechanism. Template instantiation fails as soon as the provided stream does not comply with the basic streaming interface. Namely, any stream-like class T must provide the following members. Return values are either left open or given as integer, to express that they are either ignored or cast to something compatible with integers.

clear();

Clears any flags inside the stream, such as failure or end-of-file.

seekg(pos_type);

Seeks a particular chunk position.

integer tellg();

Tells the position of a particular chunk. This function is, actually, only called upon initialization to provide the position of the chunk zero.

read(T::char_type* buffer, std::streamsize ChunkN);

This function reads ChunkN byte chunks into memory at the address buffer.

bool eof();

Returns true if and only if the end-of-file has been reached during loading.

integer gcount();

Returns the number of chunks loaded during the last call to read().
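The member list above can be made concrete with a small stream-like class that is not derived from std::basic_istream. The sketch is illustrative only: ToyMemoryStream and toy_stream_load are hypothetical names, and the template function merely imitates the kind of calls a templated byte loader would make on its stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <ios>

// Hypothetical stream-like class over a memory region. It offers the members
// listed above without deriving from any standard stream class.
class ToyMemoryStream {
public:
    typedef char char_type;

    ToyMemoryStream(const char* begin, std::size_t size)
        : begin_(begin), size_(size), pos_(0), last_n_(0), eof_f_(false) {}

    void clear() { eof_f_ = false; }

    void seekg(std::streamoff position) {
        assert(position >= 0);
        const std::size_t p = static_cast<std::size_t>(position);
        pos_ = p < size_ ? p : size_;        // clamp to the end of the data
    }

    long tellg() const { return static_cast<long>(pos_); }

    void read(char_type* buffer, std::streamsize ChunkN) {
        assert(ChunkN >= 0);
        std::size_t       n         = static_cast<std::size_t>(ChunkN);
        const std::size_t remaining = size_ - pos_;
        if (n > remaining) { n = remaining; eof_f_ = true; }
        std::memcpy(buffer, begin_ + pos_, n);
        pos_   += n;
        last_n_ = n;
    }

    bool eof() const    { return eof_f_; }
    long gcount() const { return static_cast<long>(last_n_); }

private:
    const char* begin_;
    std::size_t size_;
    std::size_t pos_;
    std::size_t last_n_;
    bool        eof_f_;
};

// Compiles for any T that provides read(), eof(), and gcount();
// no common base class is required.
template <class T>
std::size_t toy_stream_load(T& stream, typename T::char_type* buffer,
                            std::size_t ChunkN, bool* end_of_stream_f) {
    stream.read(buffer, static_cast<std::streamsize>(ChunkN));
    *end_of_stream_f = stream.eof();
    return static_cast<std::size_t>(stream.gcount());
}
```

The same template function also accepts a standard stream such as std::istringstream, since it provides read(), eof(), and gcount() with compatible signatures.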

In C++, a stream can be imbued, i.e. provided with an object of type std::locale. Imbued streams demonstrate the real potential of class std::basic_istream vs. class std::basic_streambuf. The latter is only concerned with loading content, whereas a C++ stream is also concerned with format. An interesting application is the conversion of character encodings relying on std::codecvt*. When a locale is passed as second argument to a new function, it is imbued into the stream. The following example converts a UTF-8 byte stream to Unicode characters packed in wchar_t-sized byte chunks.

#include <codecvt>  // std::codecvt_utf8
#include <locale>   // std::locale

static std::locale  empty;
std::locale*        locale_p = new std::locale(empty, new std::codecvt_utf8<wchar_t>);
quex::ByteLoader*   me = quex::ByteLoader_stream_new_from_file_name<wchar_t>("test.utf8", locale_p);
delete locale_p;    /* stream has reference-count-copied 'locale' */

The locale takes ownership of the std::codecvt_utf8 object. A locale is copied into the stream object itself when it is imbued. The byte loader does not have to take ownership.

In general, an imbued stream lacks any linear relationship between a byte’s position and the value returned by tellg() or given to seekg(). Consequently, binary_mode_f is turned off in that case and seeking may become slow. This weighs in when files are huge and lexer buffers are small, so that loading backward and forward happens often. A configuration of a byte loader and a lexatom loader circumvents these shortcomings and provides an infrastructure for performant and flexible input handling.

Creating a Customized ByteLoader

In a situation where the lexer’s environment is unable to interact with any of the pre-cooked byte loaders (Standard Lib, POSIX, or OSAL), customization becomes necessary. Then, a new byte loader can be provided as a class derived from ByteLoader. A good start is to copy-paste an existing byte loader header and its implementation. Any generated lexer comes with compilable versions of byte loaders in "lexer/lib/quex/byte_loader/". These implementations follow homogeneous schemes of allocation and ownership. Function renaming and minor modifications of the tell, seek, and load functions should quickly provide something close to a complete solution.

The basic tasks of a byte loader are data loading and stream navigation, whereby the latter is optional [#f3]_. A derived byte loader must call the base constructor function of ByteLoader, namely:

void
quex::ByteLoader_construct(ByteLoader* me,
                           bool        BinaryModeF,
                           size_t      ChunkSizeInBytes,
                           pos_type    ChunkPositionOfReadByteIdxZero,
                           void        (*seek)(ByteLoader*, pos_type),
                           size_t      (*load)(ByteLoader*, void*, const size_t, bool*),
                           void        (*destruct)(ByteLoader*),
                           void        (*print_this)(ByteLoader*),
                           bool        (*compare_handle)(const ByteLoader*, const ByteLoader*));

The constructor function receives basic information about the stream and sets the function pointers. The stream’s characteristics are given by the following attributes:

BinaryModeF

tells whether the stream can be navigated with seek(). If set, it means that adjacent bytes in the stream relate to adjacent integers. In the byte loader, the information becomes available via me->binary_mode_f.

ChunkSizeInBytes

tells about the granularity of navigation, i.e. about the minimum number of bytes in a chunk to be read or sought. The member me->chunk_size_in_bytes carries this information.

ChunkPositionOfReadByteIdxZero

provides the position of the chunk zero. It is the position in the input stream that is associated with byte index ‘0’. Practically, this is the return value of a tell()-alike function before the first byte is read. This value is stored in me->chunk_position_of_read_byte_i_zero.
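How the three attributes might work together can be sketched as follows. The linear mapping is an assumption derived from the description of binary mode (adjacent chunks relate to adjacent integers); it is not taken from the Quex source. ToyStreamInfo, toy_chunk_position, and toy_byte_index are hypothetical names, and pos_type stands in for QUEX_TYPE_STREAM_POSITION.

```cpp
#include <cassert>
#include <cstddef>

typedef long pos_type;  // placeholder for QUEX_TYPE_STREAM_POSITION

// Hypothetical holder of the three stream attributes described above.
struct ToyStreamInfo {
    bool        binary_mode_f;
    std::size_t chunk_size_in_bytes;
    pos_type    chunk_position_of_read_byte_i_zero;
};

// ASSUMPTION: in binary mode, a byte index on a chunk border maps linearly
// to a chunk position of the underlying stream.
inline pos_type toy_chunk_position(const ToyStreamInfo& info, pos_type byte_index) {
    assert(info.binary_mode_f);
    assert(info.chunk_size_in_bytes != 0);
    const pos_type chunk_size = static_cast<pos_type>(info.chunk_size_in_bytes);
    assert(byte_index >= 0 && byte_index % chunk_size == 0);
    return info.chunk_position_of_read_byte_i_zero + byte_index / chunk_size;
}

// Inverse direction: the byte index of the first byte of a given chunk.
inline pos_type toy_byte_index(const ToyStreamInfo& info, pos_type chunk_position) {
    assert(info.binary_mode_f);
    const pos_type chunk_size = static_cast<pos_type>(info.chunk_size_in_bytes);
    return (chunk_position - info.chunk_position_of_read_byte_i_zero) * chunk_size;
}
```

With a chunk size of four bytes and chunk zero located at stream position 10, byte index 8 maps to stream position 12 in this model.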

A byte loader is not a byte loader, if it does not load bytes. Thus, the following function is mandatory.

size_t load(ByteLoader* me, void* adr, const size_t N, bool* eos_f);

Loads N chunks of bytes into a buffer starting at address adr. If the end-of-stream is detected, the flag *eos_f is set to true. The return value is the number of chunks that have been loaded.

The remaining functions are optional, i.e. they can be replaced by null pointers.

void seek(ByteLoader* me, pos_type position);

Sets the chunk input position in the stream from where the next loading will happen. This function must at least be able to set the stream to the initial position, if provided. This helps in cases, where the input stream is not in binary mode and the position must be navigated manually.

If this function is not provided, navigation or loading backwards is not possible.
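The roles of load and seek can be sketched for an input that resides in plain memory. The sketch is stand-alone: ToyMemoryLoader is a hypothetical struct that does not derive from ByteLoader, the chunk size is assumed to be one byte, and pos_type stands in for QUEX_TYPE_STREAM_POSITION. A real implementation would pass functions of this shape to the base constructor function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef long pos_type;  // placeholder for QUEX_TYPE_STREAM_POSITION

// Hypothetical stand-in for a derived byte loader reading from memory.
struct ToyMemoryLoader {
    const unsigned char* begin;     // not owned: memory stays with the user
    std::size_t          size;
    std::size_t          position;  // index of the next byte to be loaded
};

// Loads up to N one-byte chunks to 'adr'; sets *eos_f when the end is reached.
std::size_t toy_load(ToyMemoryLoader* me, void* adr, const std::size_t N, bool* eos_f) {
    assert(me != nullptr && adr != nullptr && eos_f != nullptr);
    const std::size_t remaining = me->size - me->position;
    const std::size_t n         = N < remaining ? N : remaining;
    std::memcpy(adr, me->begin + me->position, n);
    me->position += n;
    if (me->position == me->size) *eos_f = true;
    return n;
}

// Sets the position from where the next load starts (clamped to the end).
void toy_seek(ToyMemoryLoader* me, pos_type position) {
    assert(me != nullptr && position >= 0);
    const std::size_t p = static_cast<std::size_t>(position);
    me->position = p < me->size ? p : me->size;
}
```

Since seeking to position zero is supported, this loader can always be reset to its initial position, which is the minimum the text asks of a seek function.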

Each byte loader implementation must further provide a means to construct and destruct it. For construction, the byte loader for C standard I/O, for example, provides the new functions listed earlier. For destruction, the following function is provided.

void  destruct(ByteLoader*);

This function shall free the resources occupied or owned by the byte loader.

void  print_this(ByteLoader*);

When printing a lexer or its components, the print_this() function shall report the byte loader's state to standard output. This is particularly useful for error reporting or analysis during debugging sessions.

bool  compare_handle(const ByteLoader* me, const ByteLoader* other);

This function returns true if and only if me and other refer to the same input handle. It is useful for checking for recursive inclusion.
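A sketch of how a handle comparison may support a check for recursive inclusion follows. It is a stand-alone model: ToyHandleLoader, toy_compare_handle, and toy_is_recursive_inclusion are hypothetical names, and the idea of scanning a stack of currently included loaders is an assumption about how such a check could be organized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for a byte loader identified by its FILE handle.
struct ToyHandleLoader {
    std::FILE* fh;
};

// True if and only if both loaders refer to the same input handle.
inline bool toy_compare_handle(const ToyHandleLoader* me, const ToyHandleLoader* other) {
    assert(me != nullptr && other != nullptr);
    return me->fh == other->fh;
}

// ASSUMPTION: inclusion is recursive if the candidate's handle is already
// in use by a loader on the stack of currently included inputs.
inline bool toy_is_recursive_inclusion(const ToyHandleLoader* const* stack,
                                       std::size_t                   stack_size,
                                       const ToyHandleLoader*        candidate) {
    for (std::size_t i = 0; i < stack_size; ++i) {
        if (toy_compare_handle(stack[i], candidate)) return true;
    }
    return false;
}
```

A lexer-side check of this kind would run before a new input is pushed onto the inclusion stack.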

As shown in the example of the pre-cooked byte loaders, a byte loader should provide new functions that allocate a byte loader and allow the lexer to take ownership. Indeed, the source code of the pre-cooked byte loaders provides sufficient instruction on how to accomplish this.

Upon completion of a byte loader, it is advisable to copy, adapt, and execute the corresponding unit tests in the directory "quex/code_base/quex/byte_loader/TEST". A byte loader that passes these extensive tests is ready to be plugged into a lexer. Eventually, a functioning byte loader makes for a great contribution to the Quex project [#f4]_.