Tuning
This chapter is dedicated to improving a lexical analyzer in terms of computational speed and memory footprint. It is not intended as a ‘silver bullet’ that leads straight to the optimal configuration. However, it introduces some basic considerations which are crucial for optimization. The chapter starts with the introduction of two mechanisms that reduce the code size of the lexical analyzer: template and path compression. Then the influence of the token queue is discussed. Finally, the involvement of converters via engine based codecs is investigated. Where possible, the findings are supported by data gathered with the sample applications available in the quex distribution directory ‘demo/tuning/’.
Before any measurements are made, the following preconditions must be ensured:
1. Asserts are disabled, i.e. the compile option ``QUEX_OPTION_ASSERTS_DISABLED`` is in place.
2. Avoid printing through system calls, i.e. avoid ``fprintf``, ``cout <<``, and their likes. If such a system call is performed for each token, the analyzer’s overhead is negligible and the results only measure the performance of the input/output stream library and its implementation on the operating system.
3. The buffer size must be set to something reasonable. Define ``QUEX_SETTING_BUFFER_SIZE=X`` where ``X`` should be at least 32KB.
4. Avoid any compiler flag that produces additional code, such as ``-ggdb`` for debugging, ``--coverage`` for test coverage, or ``-fprofile-arcs`` for profiling. A compile line reflecting these points is sketched after this list.
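As an illustration, a compile line for such a tuning setup might look as follows; the optimization level ``-O3`` and the buffer size of 65536 are example choices, not prescribed values:

> g++ -O3 -DQUEX_OPTION_ASSERTS_DISABLED \
     -DQUEX_SETTING_BUFFER_SIZE=65536 \
     MyLexer.cpp -c -o MyLexer.o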
A tuning setup is mostly the opposite of a debug scenario. For debugging, all asserts should be active, print-outs prosper, and the buffer size is as small as possible to exercise reload scenarios. Using a debug setup for tuning, though, reliably drives the development in nonsensical directions.
Template Compression
Quex introduces a powerful feature to reduce the code size of a lexical analyzer: template compression. Template compression combines multiple states into a single state that changes slightly depending on the particular state it represents. It can be activated with the command line option:
> quex ... --template-compression
The aggressiveness of the compression can be controlled via the template compression ‘min-gain’, set by the command line argument:
> quex ... --template-compression-min-gain [number]
It represents the minimum number of estimated bytes that must be spared before two states are considered for combination into a template state. If only states with the same entry and drop-out section shall be considered, the following option can be specified:
> quex ... --template-compression-uniform
Especially Unicode lexical analyzers with their large transition maps can profit tremendously from template compression. If your compiler supports computed gotos, try to set ``QUEX_OPTION_COMPUTED_GOTOS`` on the compiler’s command line, e.g. with GNU C++:
> g++ -DQUEX_OPTION_COMPUTED_GOTOS ... MyLexer.cpp -c -o MyLexer.o
Principle of Template Compression
The principle is displayed in the figures Three states to be compressed by template compression and Template state representing states 0, 1, and 2 from figure fig-template-compression-before. The first figure displays three states which have somewhat similar transition maps, that is, they trigger on similar character ranges to similar target states. A so-called ‘template state’ must be able to detect all character ranges involved. An additional matrix helps to find the particular target state for the represented state.
Three states to be compressed by Template compression.
Figure Template state representing states 0, 1, and 2 from figure fig-template-compression-before displays a template state that shall represent states 0, 1, and 2 from figure Three states to be compressed by template compression. Every state is replaced by a small stub that defines a ‘state key’. The role of the state key is to define the behavior of the template state. In particular, it defines the target states that belong to certain trigger sets. States 0, 1, and 2 differ with respect to the character ranges ``[f-m]`` and ``[o-s]``; for these ranges the transitions differ. The target states for these ranges are stored in the arrays ``X0`` and ``X1``. ``X0[i]`` contains the target state for ``[f-m]`` if the template state currently represents state ``i``. Analogously, ``X1[i]`` helps to mimic the transitions for ``[o-s]``.
Template state representing states 0, 1, and 2 from figure Three states to be compressed by Template compression..
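Conceptually, a template state boils down to a shared transition map plus per-state target arrays. The following hand-written sketch illustrates the idea; the array names follow the figure, while the concrete target state indices are placeholders and the code quex actually emits looks different:

    /* Conceptual sketch only: a template state for the three states of
     * the figure. The transitions that differ per represented state are
     * looked up via the state key; all target indices are placeholders. */
    static const int X0[3] = { 4, 5, 6 };  /* targets for [f-m] per state key */
    static const int X1[3] = { 7, 7, 8 };  /* targets for [o-s] per state key */

    int template_state(int state_key, char input)
    {
        if (input >= 'f' && input <= 'm') return X0[state_key];
        if (input >= 'o' && input <= 's') return X1[state_key];
        /* ... transitions common to all represented states ... */
        return -1;                         /* drop out */
    }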
Template compression can combine any set of states, but its efficiency depends on their similarity. For this reason the template compression coefficient (see option ``--template-compression-coefficient``) has been introduced. It defines how aggressively states shall be combined, that is, how much lack of similarity is tolerated while states are still combined into sets.
Path Compression
Quex introduces another kind of compression that profits from the sequential order of similar states. It identifies paths of single-character sequences in the state machine. The command line argument:
--path-compression
activates the analysis and compression. With this compression, all states are considered for combination into a path. As a result, some special handling is implemented to distinguish the particularities of each state. If only uniform states shall be considered, the command line flag:
--path-compression-uniform
may be provided. Then the overhead of handling particularities is avoided, but fewer states may be combined. Path compression requires a path-end character. By default it is set to the character code 0x16 (22 decimal, ‘SYNCHRONOUS IDLE’), assuming that this never occurs in the data stream. If this character does occur in the stream to be analyzed, then the path termination character must be defined explicitly by:
--path-termination [number]
where the specified number must be different from the buffer limit code.
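For example, assuming the number is given as a plain decimal value (the choice of 21 here is arbitrary; it merely has to be absent from the input and distinct from the buffer limit code):

> quex ... --path-compression --path-termination 21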
Path compression applies if a sequence of single characters can be identified that guides along a path of states with matching transition maps. This requirement seems very rigid, and thus the chance of hitting a state machine that contains such states may appear low. However, in practical applications this is exactly the case where keywords, such as ``for``, ``while``, etc., intersect with the pattern of identifiers, e.g. ``[a-z]+``. In other words, languages with many keywords may profit from this approach.
Instead of implementing a full state for each state of the path, only one state is implemented: a so-called ‘pathwalker’. A pathwalker consists of the transition map which is common to all states of the path, together with the path itself. A conceptual sketch follows below.
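As an illustration only, and not quex’s actual generated code: a pathwalker covering the keyword path ``for`` inside the identifier pattern ``[a-z]+`` could conceptually look like this. The function name and the state indices are hypothetical:

    /* Conceptual pathwalker sketch: one implemented state walks along
     * the stored character sequence; 0x16 marks the end of the path.  */
    static const char path[] = "for\x16";

    /* 'pos' is the walker's current position on the path; the caller
     * advances it whenever the walker stays on the path.              */
    int pathwalker_step(const char* pos, char input)
    {
        if (input == *pos)
            return 0;                     /* stay inside the pathwalker */
        /* Transition map common to ALL states on the path:            */
        if (input >= 'a' && input <= 'z')
            return 2;                     /* placeholder: identifier    */
        return -1;                        /* placeholder: drop out      */
    }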
As with template compression, using the computed goto feature of your compiler might improve performance and code size.
Principle of Path Compression
State sequence to be compressed by path compression.
State sequence from figure State sequence to be compressed by path compression implemented by a pathwalker.
Combining Multiple Compression Types
It is possible to combine multiple compression types simply by specifying several of them on the command line. The result of applying path compression before or after template compression may differ significantly. The sequence of analysis corresponds to the sequence in which the activating command line options appear, i.e.:
> quex ... --template-compression ... --path-compression ...
determines that template compression is performed before path compression. Uniform and non-uniform compression can be treated as separate procedures. Thus, it is possible to say, for example:
> quex ... --template-compression-uniform \
--path-compression \
--template-compression
which first performs a template compression of very similar states, then a general path compression of the remainder. Finally, an attempt is made to combine whatever states remain by aggressive template compression.
Token Queues
Memory Management
Additional Hints
What follows is a list of general hints for performance tuning:
Note
The three most important things to consider when improving performance are the CPU’s cache, the CPU’s cache, and the CPU’s cache. At the time of this writing (2010 C.E.) a cache miss is slower than a normal cache read by factors. A fast program can be slowed down to ‘snail speed’ simply by excessive cache miss scenarios.
Practically, this means that data that is accessed frequently is best kept close together, so that cache misses become less probable. The sketch below illustrates one way to do this.
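For instance, splitting rarely used fields away from frequently traversed ones keeps the hot data dense in the cache. The structures and field names below are a hypothetical illustration, unrelated to quex’s internal layout:

    /* Hot/cold data split: the inner loop touches only 'TokenHot',
     * so more entries fit into each cache line.                    */
    struct TokenHot  { int type_id; int position; };
    struct TokenCold { const char* file_name; int comment_index; };

    TokenHot  hot[1024];    /* scanned in the inner loop               */
    TokenCold cold[1024];   /* touched only for, e.g., error reporting */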
Note
The fourth important thing about improving performance is to avoid frequent system calls. For example, allocate memory in one chunk and then cut from it when needed, instead of calling ``new``, ``delete``, ``malloc``, or ``free`` all the time. You might also consider implementing containers yourself instead of relying on STL or similar libraries, if this allows you to control memory placement. A minimal sketch of such a chunk allocator is given below.
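A minimal sketch of the chunk idea, assuming plain ``malloc``/``free`` underneath and ignoring alignment for brevity; the class name and interface are hypothetical:

    #include <cstddef>
    #include <cstdlib>

    /* One up-front allocation; subsequent requests only bump a pointer,
     * so no further system calls occur until the chunk is exhausted.   */
    class Chunk {
    public:
        explicit Chunk(std::size_t size)
            : begin_(static_cast<char*>(std::malloc(size))),
              end_(begin_ + size), next_(begin_) {}
        ~Chunk() { std::free(begin_); }

        /* Cut 'n' bytes from the chunk; returns NULL when exhausted. */
        void* cut(std::size_t n) {
            if (next_ + n > end_) return 0;
            void* result = next_;
            next_ += n;
            return result;
        }
    private:
        char* begin_;
        char* end_;
        char* next_;
    };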
Note
The fifth important thing is to use ``memcpy`` and ``memmove`` for copying content, especially for larger amounts of data. Few people can compete with the expert knowledge that has been put into these functions. Simply compare a ``memcpy`` operation with a ``for`` loop doing the same thing; a factor of 40 between the two is not seldom. Use ``memmove`` when source and destination may overlap. A comparison sketch is given below.
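A small sketch of the comparison, assuming non-overlapping buffers for the ``memcpy`` case; the function names are hypothetical:

    #include <cstddef>
    #include <cstring>

    /* What a naive copy does: one byte per iteration.             */
    void copy_loop(char* dst, const char* src, std::size_t n)
    {
        for (std::size_t i = 0; i != n; ++i)
            dst[i] = src[i];
    }

    /* Library copy: typically wide, aligned, possibly vectorized. */
    void copy_fast(char* dst, const char* src, std::size_t n)
    {
        std::memcpy(dst, src, n);    /* source and destination must not overlap */
    }

    /* Overlap-safe variant, e.g. shifting buffer content forward. */
    void shift_to_front(char* buffer, std::size_t offset, std::size_t n)
    {
        std::memmove(buffer, buffer + offset, n);
    }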
[to be continued]