Byte Loaders
A byte loader connects to a low-level device or file interface and provides data in a normalized way to a lexatom loader. A byte loader loads consecutive equal-sized byte chunks into a memory section. Naturally, a chunk is one byte in size. However, there are APIs where bytes can only be loaded in units of two or even four bytes [#f0]_. Regardless of the granularity of the byte chunks, a byte loader associates each byte it loads with a byte index, or position. The byte index of the first byte in the stream is zero. For any two consecutive bytes, if the position of the first byte is i, then the position of the subsequent byte is i+1 [#f1]_.
Implementations of byte loaders are provided as classes/structs derived from ByteLoader. The set of pre-cooked byte loaders delivered along with Quex is shown in the table "Implementations of prepared ByteLoaders".
Struct/Class | Interface |
---|---|
ByteLoader_FILE | C Standard I/O. |
ByteLoader_stream | C++ Standard I/O for type |
 | OSAL file I/O. |
ByteLoader_POSIX | POSIX file I/O. |
ByteLoader_Memory | Fast loading from plain memory. |
ByteLoader_Monitor | Pass through and monitor interactions. |
A byte loader is allocated and initialized with ‘new’ functions provided along with each implementation. These functions allocate a byte loader (relying on the memory manager), initialize it, and internally raise a flag which allows the lexer to destruct and delete the byte loader when it is no longer needed. The ByteLoader_FILE, for example, provides a byte loader that relies on the Standard C I/O routines as declared in the header file stdio.h. It provides the following new functions:
- quex_ByteLoader_FILE_new(FILE* fh, bool BinaryModeF);
- quex_ByteLoader_FILE_new_from_file_name(const char* FileName);
Or, respectively, in C++:
- quex::ByteLoader_FILE_new(FILE* fh, bool BinaryModeF);
- quex::ByteLoader_FILE_new_from_file_name(const char* FileName);
Notably, all byte loaders are located in the namespace quex, which means that they are independent of a particular lexer configuration.
In general, a byte loader takes over ownership of any non-constant argument passed to its constructor or new function. For example, file handles are closed upon a byte loader’s destruction. The user must not close any file handle that has been plugged into a byte loader. Stream handles are closed, destructed, and de-allocated along the same lines. The ByteLoader_Monitor takes another byte loader to be monitored. The monitored byte loader is destructed along with the monitoring byte loader’s destructor call. However, the ByteLoader_Memory behaves differently. It does not deallocate the memory it has been pointed to from outside. That memory remains in the user’s ownership.
Warning
Local variables and ownership.
Since a byte loader takes ownership of non-constant arguments, it is a particularly bad idea to plug in a pointer to an object set up as a local variable. A local variable’s memory is released when the thread of control reaches the end of its scope. If the byte loader’s destruction then frees the object’s memory again, the program will almost certainly crash.
Whenever a byte loader takes over ownership of an argument, it makes sense to have a look into its destruct() method to see which deallocation method is applied.
The .ownership flag inside a byte loader tells who has the right and the task to delete the byte loader. If the ownership is set to ‘lexer’ (C/C++: E_Ownership_LEXER), it is further assumed that it has been allocated with the Quex memory manager. The new functions provided along with an implementation implicitly set the ownership to ‘lexer’. Any other method of creating a byte loader [#f2]_ sets the flag to ‘external’ (C/C++: E_Ownership_EXTERNAL).
There are cases where it is necessary to manually delete a byte loader before it is actually plugged into a lexer. A typical example is the stepwise construction of lexer components. When a failure prevents the construction of a lexer, but the byte loader has already been created, the deletion must be accomplished outside the lexer. The simplest way to delete a byte loader with an ownership flag set to ‘lexer’ is to call the corresponding delete function of its implementation.
These delete functions set the byte loader pointer to zero after freeing the object. If the byte loader pointer is already zero, delete does nothing.
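The ownership flag and the pointer-resetting delete described above can be sketched in isolation. The names `ToyLoader`, `ToyOwnership`, `ToyLoader_new`, and `ToyLoader_delete` are hypothetical stand-ins and not part of Quex; in particular, leaving an externally owned object unfreed is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-ins, NOT Quex's types.
enum ToyOwnership { TOY_OWNERSHIP_LEXER, TOY_OWNERSHIP_EXTERNAL };

struct ToyLoader {
    ToyOwnership ownership;
    int*         destruct_count;   // demo only: counts destruct events
};

// 'new' function: allocates, initializes, and marks the object as lexer-owned.
inline ToyLoader* ToyLoader_new(int* destruct_count) {
    assert(destruct_count != nullptr);
    ToyLoader* me = static_cast<ToyLoader*>(std::malloc(sizeof(ToyLoader)));
    if (me == nullptr) return nullptr;
    me->ownership      = TOY_OWNERSHIP_LEXER;
    me->destruct_count = destruct_count;
    return me;
}

// 'delete' function: does nothing on a null pointer; otherwise destructs,
// frees lexer-owned memory, and resets the caller's pointer to null.
inline void ToyLoader_delete(ToyLoader** me_p) {
    assert(me_p != nullptr);
    if (*me_p == nullptr) return;
    ++*(*me_p)->destruct_count;                    // stands in for destruct()
    if ((*me_p)->ownership == TOY_OWNERSHIP_LEXER) {
        std::free(*me_p);                          // assumption: external objects are not freed here
    }
    *me_p = nullptr;
}
```

Because the pointer is reset, calling the delete function a second time on the same variable is harmless, which simplifies error paths during stepwise construction.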
The byte loader interface mainly implements the functions tell(), seek(), and load(). tell() delivers the index of the next byte to be loaded. seek() sets a specific load position. load() loads a consecutive sequence of byte chunks from the stream, starting at the byte with the current byte index. In other words, a byte loader is responsible for loading data from an arbitrary input source and for the corresponding input stream navigation.
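The interplay of tell(), seek(), and load() can be illustrated with a minimal sketch. `ToyByteLoader` is a hypothetical stand-in that reads from a byte vector; it is not Quex's `ByteLoader`, and its member signatures are assumptions chosen for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in, NOT Quex's ByteLoader: the "stream" is a byte vector.
class ToyByteLoader {
public:
    explicit ToyByteLoader(std::vector<unsigned char> data)
        : data_(std::move(data)), position_(0) {}

    // tell(): index of the next byte to be loaded.
    std::size_t tell() const { return position_; }

    // seek(): set the load position (clamped to the end of the data).
    void seek(std::size_t position) { position_ = std::min(position, data_.size()); }

    // load(): copy up to ByteN bytes starting at the current byte index,
    // advance the position, and report whether the end of the data was reached.
    std::size_t load(void* buffer, std::size_t ByteN, bool* end_of_stream_f) {
        assert(buffer != nullptr && end_of_stream_f != nullptr);
        const std::size_t n = std::min(ByteN, data_.size() - position_);
        if (n != 0) std::memcpy(buffer, data_.data() + position_, n);
        position_ += n;
        *end_of_stream_f = (position_ == data_.size());
        return n;
    }

private:
    std::vector<unsigned char> data_;
    std::size_t                position_;
};
```

After loading two bytes from a fresh loader, tell() reports 2; a seek(4) followed by a load delivers the byte at index 4.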
By default, an attempt to read data is aborted when the input stream does not deliver data. However, class ByteLoader provides a member function pointer to handle repeated attempts, namely:
bool (*on_nothing)(ByteLoader*, size_t TryN, size_t RequestedByteN);
If this function pointer is set, a new attempt to read data is made only if the function returns true. This is important, for example, when socket connections are used for input and errors need to be handled, as in the following example.
#include <errno.h>       // errno
#include <stdio.h>       // fprintf()
#include <string.h>      // strerror()
#include <sys/socket.h>  // getsockopt(), SOL_SOCKET, SO_ERROR

static bool
self_on_nothing(ByteLoader* me, size_t TryN, size_t RequestedByteN)
{
    int       error  = 0;
    socklen_t len    = sizeof(error);
    int       retval = getsockopt(((quex::ByteLoader_POSIX*)me)->fd, SOL_SOCKET, SO_ERROR, &error, &len);
    (void)TryN; (void)RequestedByteN;

    if( retval ) {
        // getsockopt() itself failed; it reports the reason in 'errno'
        fprintf(stderr, "error getting socket error code: %s\n", strerror(errno));
        return false; // No new attempt to read data
    }
    else if( error ) {
        // socket has a non-zero error status
        fprintf(stderr, "socket error: %s\n", strerror(error));
        return false; // No new attempt to read data
    }
    else {
        // no error pending: returning true requests a new attempt to read data
        return true;
    }
}
This assumes that the byte loader is a ByteLoader_POSIX. After the byte loader has been created, the function pointer must be assigned as follows.
if( socket_fd == -1 ) return; // Error accept() failed
quex::ByteLoader* byte_loader = quex::ByteLoader_POSIX_new(socket_fd);
byte_loader->on_nothing = self_on_nothing;
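How a load routine might consult such a callback can be sketched as follows. This is not Quex's implementation: `ToyLoader`, `load_with_retry`, `flaky_load`, and `allow_three_tries` are hypothetical names, and the way `TryN` is counted (starting at one after the first empty read) is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct ToyLoader;   // hypothetical stand-in, NOT Quex's ByteLoader
typedef bool (*OnNothingFn)(ToyLoader*, std::size_t TryN, std::size_t RequestedByteN);

struct ToyLoader {
    std::size_t (*raw_load)(ToyLoader*, void*, std::size_t);
    OnNothingFn on_nothing;         // may be null: then an empty read aborts
    int         empty_reads_left;   // demo state used by flaky_load()
};

// Repeats the raw load as long as nothing arrives and the callback agrees.
inline std::size_t load_with_retry(ToyLoader* me, void* buffer, std::size_t RequestedByteN) {
    assert(me != nullptr && me->raw_load != nullptr);
    std::size_t try_n = 0;
    for (;;) {
        const std::size_t loaded_n = me->raw_load(me, buffer, RequestedByteN);
        if (loaded_n != 0) return loaded_n;
        ++try_n;
        if (me->on_nothing == nullptr) return 0;                      // default: abort
        if (!me->on_nothing(me, try_n, RequestedByteN)) return 0;     // callback declined
    }
}

// Demo source: delivers nothing for the first 'empty_reads_left' calls.
inline std::size_t flaky_load(ToyLoader* me, void* buffer, std::size_t ByteN) {
    if (me->empty_reads_left > 0) { --me->empty_reads_left; return 0; }
    std::memset(buffer, 'x', ByteN);
    return ByteN;
}

// Demo callback: permits a retry only while fewer than three tries were made.
inline bool allow_three_tries(ToyLoader*, std::size_t TryN, std::size_t) {
    return TryN < 3;
}
```

Bounding the number of retries inside the callback, as `allow_three_tries` does, keeps a permanently silent source from blocking the lexer forever.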
ByteLoader_Monitor
A ByteLoader_Monitor takes the place of a byte loader and monitors its traffic. Instead of plugging a byte loader directly into a lexer, the lexer takes a pointer to a ByteLoader_Monitor which contains the byte loader to be monitored. All requests it receives are redirected to the contained byte loader. Its activities can be monitored, or even distorted, by customizable callback functions. A monitoring byte loader is a useful tool for tracing and testing the offstage, low-level data loading process. Its _new function is
- quex::ByteLoader_Monitor_new(ByteLoader* subject, void* reference_object);
The function creates a byte loader that mimics the byte loader subject. The monitoring byte loader takes over ownership of the subject. It does not take ownership of the optional reference_object. The pointer is available for callbacks as me->reference_object. If no callback requires a reference object, a null pointer may be passed as reference_object.
For C, the namespace specifier quex:: is replaced by the prefix quex_.
Below is a list of customizable callbacks which trigger upon specific events. pos_type is a placeholder for QUEX_TYPE_STREAM_POSITION. me is a pointer to the monitoring byte loader itself. A monitoring byte loader’s constructor initializes all callbacks to null pointers. Thus, any pointer not explicitly set leaves the related activity unobserved.
- pos_type on_tell(me, pos_type P);
This function is called upon a call to tell(). P carries the byte index returned by the subject’s tell() function. The monitoring byte loader returns whatever this callback returns. So, by means of this callback, the return value of tell() can be controlled.
- pos_type on_seek(me, pos_type P);
This function is called upon a call to seek(), where P is the position requested by the caller. The return value of this callback is taken as the actual position to be sought by the monitored byte loader.
- size_t on_before_load(me, const size_t ByteN);
This function is called when load() is called, before the load happens. ByteN is the requested number of bytes. The return value determines the number of bytes that will actually be requested.
- size_t on_after_load(me, void* buffer, const size_t LoadedN, bool* end_of_stream_f);
This function is called when load() has finished. buffer is a pointer to the memory region that was to be filled. LoadedN is the number of bytes which load() loaded. end_of_stream_f tells whether an end of stream has been detected. The return value of this callback is the number of bytes that will be reported as loaded.
The following example traces the byte loading process into a file. First, the tracing callbacks need to be defined. They all assume that the reference_object passed to the constructor is the file handle of the trace file.
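A minimal sketch of such tracing callbacks could look as follows. So that it compiles on its own, `MonitorStandIn` and the `pos_type` typedef are hypothetical stand-ins for the monitoring byte loader and for QUEX_TYPE_STREAM_POSITION; the real callbacks would receive a pointer to the monitoring byte loader instead. Each callback writes one line to the trace file and passes its value through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

typedef long pos_type;                              // stand-in for QUEX_TYPE_STREAM_POSITION
struct MonitorStandIn { void* reference_object; };  // stand-in for the monitoring byte loader

// Helper: the reference object is assumed to be the trace file handle.
static FILE* trace_fh_of(MonitorStandIn* me) {
    assert(me != nullptr && me->reference_object != nullptr);
    return static_cast<FILE*>(me->reference_object);
}

static pos_type MyTracer_on_tell(MonitorStandIn* me, pos_type P) {
    std::fprintf(trace_fh_of(me), "tell: %ld\n", static_cast<long>(P));
    return P;                                       // report the position unchanged
}

static pos_type MyTracer_on_seek(MonitorStandIn* me, pos_type P) {
    std::fprintf(trace_fh_of(me), "seek: %ld\n", static_cast<long>(P));
    return P;                                       // seek exactly where requested
}

static std::size_t MyTracer_on_before_load(MonitorStandIn* me, const std::size_t ByteN) {
    std::fprintf(trace_fh_of(me), "load request: %lu\n", static_cast<unsigned long>(ByteN));
    return ByteN;                                   // request the same number of bytes
}

static std::size_t MyTracer_on_after_load(MonitorStandIn* me, void* buffer,
                                          const std::size_t LoadedN, bool* end_of_stream_f) {
    (void)buffer;                                   // content is not inspected here
    std::fprintf(trace_fh_of(me), "loaded: %lu; end of stream: %s\n",
                 static_cast<unsigned long>(LoadedN), *end_of_stream_f ? "true" : "false");
    return LoadedN;                                 // report the true number of bytes
}
```

Returning the received value from every callback keeps the monitor transparent; returning something else would distort the traffic, which can be useful for fault-injection tests.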
The setup below traces the interactions of a ByteLoader_FILE while reading from the file example.txt.
quex::ByteLoader* subject = quex::ByteLoader_FILE_new_from_file_name("example.txt");
FILE* trace_fh = fopen("trace.log", "wb");
quex::ByteLoader* monitor = quex::ByteLoader_Monitor_new(subject, trace_fh);
In this example, the pointer reference_object is used to pass a handle to a trace file. Once the monitoring byte loader exists, the callback functions can be set.
monitor->on_tell = MyTracer_on_tell;
monitor->on_seek = MyTracer_on_seek;
monitor->on_before_load = MyTracer_on_before_load;
monitor->on_after_load = MyTracer_on_after_load;
It can now be plugged into a lexer, as any other byte loader.
MyLexer lexer(monitor, nullptr);
With the monitoring byte loader under the hood, the lexer loads data into its buffer and seeks in the input stream as if it were a ByteLoader_FILE. All this happens offstage. The monitoring byte loader of this example, though, records all traffic and writes it to the file "trace.log".
ByteLoader_stream
A ByteLoader_stream is a byte loader which adapts a C++ std::basic_istream-like object. However, it does not treat the stream handle as a class derived from std::basic_istream. The particularities of this class would force new stream classes to rewrite the std::basic_streambuf member, which is a potentially tedious task. Instead, the byte loader relies on the template mechanism. Template instantiation fails as soon as the provided stream does not comply with the basic streaming interface. Namely, any std::stream-like class T must provide the following members. Return types are left open, or given as integer, to express that they are either ignored or cast to something compatible with integers.
- clear();
Clears any flags inside the stream, such as failure or end-of-file.
- seekg(pos_type);
Seeks a particular chunk position.
- integer tellg();
Tells the position of a particular chunk. This function is actually only called upon initialization to provide the position of chunk zero.
- read(T::char_type* buffer, std::streamsize ChunkN);
Reads ChunkN byte chunks into memory at the address buffer.
- bool eof();
Returns true if and only if the end-of-file has been reached during loading.
- integer gcount();
Returns the number of chunks loaded during the last call to read().
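The member requirements above amount to duck typing through templates. The following sketch, assuming nothing about Quex's internals, exercises exactly those members on any stream-like class T; `load_chunks` and `seek_chunk` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>   // only needed for trying the helpers with std::istringstream

// Loads up to ChunkN chunks from any stream-like T using read(), gcount(), eof().
template <class T>
std::size_t load_chunks(T& stream, typename T::char_type* buffer,
                        std::streamsize ChunkN, bool* end_of_stream_f) {
    assert(buffer != nullptr && end_of_stream_f != nullptr);
    stream.read(buffer, ChunkN);
    const std::size_t loaded_n = static_cast<std::size_t>(stream.gcount());
    *end_of_stream_f = stream.eof();
    return loaded_n;
}

// Seeks a chunk position using clear() and seekg().
template <class T>
void seek_chunk(T& stream, typename T::pos_type position) {
    stream.clear();          // reset eof/fail flags; a failed stream ignores seekg()
    stream.seekg(position);
}
```

Both helpers instantiate for `std::istringstream` as well as for any user-defined class that offers the listed members with compatible signatures; a class lacking one of them produces a compile-time error.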
In C++, a stream can be imbued, i.e. provided with an object of type std::locale. Imbued streams demonstrate the real potential of class std::basic_istream vs. class std::basic_streambuf. The latter is only concerned with loading content, whereas a C++ stream is concerned with format. An interesting application is the conversion of character encodings relying on std::codecvt*. When a locale is passed as second argument to a new function, it is imbued into the stream. The following example converts a UTF8 byte stream to Unicode characters packed in wchar_t-sized byte chunks.
#include <codecvt>   // std::codecvt_utf8
#include <locale>    // std::locale

static std::locale empty;
std::locale*       locale_p = new std::locale(empty, new std::codecvt_utf8<wchar_t>);
quex::ByteLoader*  me = quex::ByteLoader_stream_new_from_file_name<wchar_t>("test.utf8", locale_p);
delete locale_p; /* stream has reference-count-copied the locale */
The locale takes ownership of the std::codecvt_utf8 object. A locale is copied into the stream object itself upon imbue-ing. The byte loader does not have to take over ownership.
In general, an imbued stream lacks any linear relationship between a byte’s position and the value returned by tellg() or given to seekg(). Consequently, binary_mode_f is turned off in that case and seeking may become slow. This matters when files are huge and lexer buffers are small, so that loading backward and forward happens often. A configuration of a byte loader and a lexatom loader circumvents these shortcomings and provides an infrastructure for performant and flexible input handling.
Creating a Customized ByteLoader
In a situation where the lexer’s environment is unable to interact with any of the pre-cooked byte loaders (Standard Lib, POSIX, or OSAL), customization becomes necessary. Then, a new byte loader can be provided as a class derived from ByteLoader. A good start is to copy-paste an existing byte loader header and its implementation. Any generated lexer comes with compilable versions of byte loaders in "lexer/lib/quex/byte_loader/". These implementations follow homogeneous schemes of allocation and ownership. Renaming functions and making minor modifications to the tell, seek, and load functions should quickly provide something close to a complete solution.
The basic tasks of a byte loader are data loading and stream navigation, whereby the latter is optional [#f3]_. A derived byte loader must call the base constructor function of ByteLoader, namely:
void
quex::ByteLoader_construct(ByteLoader* me,
                           bool        BinaryModeF,
                           size_t      ChunkSizeInBytes,
                           pos_type    ChunkPositionOfReadByteIdxZero,
                           void        (*seek)(ByteLoader*, pos_type),
                           size_t      (*load)(ByteLoader*, void*, const size_t, bool*),
                           void        (*destruct)(ByteLoader*),
                           void        (*print_this)(ByteLoader*),
                           bool        (*compare_handle)(const ByteLoader*, const ByteLoader*));
The constructor function receives basic information about the stream and sets the function pointers. The stream’s characteristic is given by the following attributes:
- BinaryModeF
tells whether the stream can be navigated with seek(). If set, it means that adjacent bytes in the stream relate to adjacent integers. In the byte loader, the information becomes available via me->binary_mode_f.
- ChunkSizeInBytes
tells about the granularity of navigation, i.e. the minimum number of bytes in a chunk that can be read or sought. The member me->chunk_size_in_bytes carries this information.
- ChunkPositionOfReadByteIdxZero
provides the position of chunk zero. It is the position in the input stream that is associated with byte index ‘0’. In practice, this is the return value of a tell()-like function before the first byte is read. This value is stored in me->chunk_position_of_read_byte_i_zero.
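How these three attributes could relate byte indices to stream positions is sketched below. The linear mapping is an assumption that only holds in binary mode; `StreamInfo`, `byte_index_of`, and `chunk_position_of` are hypothetical names, not Quex functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the three stream attributes described above.
struct StreamInfo {
    bool         binary_mode_f;
    std::size_t  chunk_size_in_bytes;
    std::int64_t chunk_position_of_read_byte_i_zero;
};

// Stream (chunk) position -> byte index; assumes a linear, binary-mode stream.
inline std::int64_t byte_index_of(const StreamInfo& s, std::int64_t chunk_position) {
    assert(s.binary_mode_f);
    const std::int64_t chunk_size = static_cast<std::int64_t>(s.chunk_size_in_bytes);
    return (chunk_position - s.chunk_position_of_read_byte_i_zero) * chunk_size;
}

// Byte index -> stream (chunk) position; the index must lie on a chunk border.
inline std::int64_t chunk_position_of(const StreamInfo& s, std::int64_t byte_index) {
    assert(s.binary_mode_f);
    const std::int64_t chunk_size = static_cast<std::int64_t>(s.chunk_size_in_bytes);
    assert(byte_index % chunk_size == 0);
    return s.chunk_position_of_read_byte_i_zero + byte_index / chunk_size;
}
```

For instance, with a chunk size of two bytes and a chunk zero position of 10, the stream position 13 corresponds to byte index 6 under this mapping.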
A byte loader is not a byte loader, if it does not load bytes. Thus, the following function is mandatory.
- size_t load(ByteLoader* me, void* adr, const size_t N, bool* eos_f);
Loads N chunks of bytes into a buffer starting at address adr. If the end-of-stream is detected, the flag *eos_f is set to true. The return value is the number of chunks that have been loaded.
The remaining functions are optional, i.e. they can be replaced by null pointers.
- void seek(ByteLoader* me, pos_type position);
Sets the chunk input position in the stream from where the next loading will happen. If provided, this function must at least be able to set the stream to the initial position. This helps in cases where the input stream is not in binary mode and the position must be navigated manually.
If this function is not provided, navigation or loading backwards is not possible.
Each byte loader implementation must further provide a means to construct and destruct it. For example, the byte loader for C standard I/O provides the following interface for construction.
- void destruct(ByteLoader*);
This function shall free the resources occupied or owned by the byte loader.
- void print_this(ByteLoader*);
When a lexer or its components are printed, the print_this() function shall report the byte loader's state to standard output. This is particularly useful for error reporting or analysis during debugging sessions.
- bool compare_handle(const ByteLoader* me, const ByteLoader* other);
This function returns true if and only if me and other refer to the same input handle. It is useful for checking for recursive inclusion.
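The wiring of a derived byte loader through function pointers can be sketched without Quex as follows. `BaseLoader` is a hypothetical, reduced stand-in for `ByteLoader`, and `ArrayLoader` with its functions is an invented example reading from a character array; a real implementation would call `ByteLoader_construct` instead of assigning the pointers by hand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

typedef long pos_type;   // stand-in for the stream position type

// Hypothetical, reduced stand-in for the base struct (NOT Quex's ByteLoader).
struct BaseLoader {
    std::size_t (*load)(BaseLoader*, void*, const std::size_t, bool*);       // mandatory
    void        (*seek)(BaseLoader*, pos_type);                              // optional
    void        (*destruct)(BaseLoader*);                                    // optional
    bool        (*compare_handle)(const BaseLoader*, const BaseLoader*);     // optional
};

// Derived loader; 'base' is the first member so that pointers can be cast.
struct ArrayLoader {
    BaseLoader  base;
    const char* begin;
    std::size_t size;
    std::size_t position;
};

static std::size_t ArrayLoader_load(BaseLoader* alter_ego, void* adr,
                                    const std::size_t N, bool* eos_f) {
    ArrayLoader* me = reinterpret_cast<ArrayLoader*>(alter_ego);
    const std::size_t n = std::min(N, me->size - me->position);
    if (n != 0) std::memcpy(adr, me->begin + me->position, n);
    me->position += n;
    *eos_f = (me->position == me->size);   // end-of-stream once all is consumed
    return n;
}

static void ArrayLoader_seek(BaseLoader* alter_ego, pos_type position) {
    ArrayLoader* me = reinterpret_cast<ArrayLoader*>(alter_ego);
    me->position = std::min(static_cast<std::size_t>(position), me->size);
}

static bool ArrayLoader_compare_handle(const BaseLoader* a, const BaseLoader* b) {
    return    reinterpret_cast<const ArrayLoader*>(a)->begin
           == reinterpret_cast<const ArrayLoader*>(b)->begin;
}

inline void ArrayLoader_construct(ArrayLoader* me, const char* begin, std::size_t size) {
    assert(me != nullptr && begin != nullptr);
    me->base.load           = ArrayLoader_load;
    me->base.seek           = ArrayLoader_seek;
    me->base.destruct       = nullptr;     // nothing is owned, so nothing to free
    me->base.compare_handle = ArrayLoader_compare_handle;
    me->begin = begin; me->size = size; me->position = 0;
}
```

The null `destruct` pointer mirrors the rule that optional functions may be left out; the array stays in the caller's ownership, similar to what is described for the memory-based loader.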
As shown in the example of pre-cooked byte loaders, a byte loader should provide new functions that allocate a byte loader and allow the lexer to take over ownership. Indeed, the source code of the pre-cooked byte loaders provides sufficient instruction on how to accomplish this.
Upon completion of a byte loader, it is advisable to copy, adapt, and execute the corresponding unit tests in the directory "quex/code_base/quex/byte_loader/TEST". A byte loader that passes these extensive tests is ready to be plugged into a lexer. Eventually, a functioning byte loader makes for a great contribution to the Quex project [#f4]_.