The Process
The input to lexical analysis is a lexatom sequence; its output is a stream of atomic chunks of meaning, so-called tokens (Fig. 5). In the following, the portmanteau ‘lexer’ stands for ‘lexical analyzer’, that is, the engine that performs the analysis.
- Token
A token is an object that reports which pattern matched a subsequence of lexatoms in the input stream. Optionally, it carries along the actual lexeme.
Practically, a token contains a token identifier: a scalar value which identifies its category of meaning. A token is the result of an analysis step as shown in Fig. 6. It shows how the DNA restriction sequence GGATCC is detected by means of a DFA. The analysis step begins at position ‘start’ with the DFA in the initial state named 0. The nucleotide G at the first position triggers a transition to state 1. The input for the next transition comes from the subsequent position. This process continues until the second C, which causes a transition from state 5 to the acceptance state 6. The lexatom that follows does not trigger any state transition, because state 6 has none; it is refused. The end position is set right before this refused lexatom. The stretch from ‘start’ to ‘end’ constitutes the matching lexeme. The lexer, in this case, prepares a token TKN_RESTRICTION_SEQ with the matching lexeme GGATCC.
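To make the analysis step concrete, the following minimal sketch in C hand-codes the DFA of Fig. 6. The token type and the function name match_restriction_seq are illustrative assumptions, not code generated by Quex:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical token type -- a sketch, not Quex's generated token class. */
    typedef enum { TKN_NONE, TKN_RESTRICTION_SEQ } TokenId;

    typedef struct {
        TokenId id;          /* scalar value identifying the category of meaning */
        char    lexeme[16];  /* optionally carried along: the matched lexatoms   */
    } Token;

    /* DFA of Fig. 6 for "GGATCC": states 0..6, where state 6 accepts.
     * Returns the number of lexatoms consumed on acceptance, 0 on refusal. */
    static size_t match_restriction_seq(const char* input)
    {
        static const char expected[] = "GGATCC";
        size_t state = 0;                  /* initial state named 0          */
        while (state < 6) {
            if (input[state] != expected[state])
                return 0;                  /* no transition: lexatom refused */
            ++state;                       /* transition to the next state   */
        }
        return state;                      /* acceptance state 6 reached     */
    }

    int main(void)
    {
        const char* stream = "GGATCCTA";
        size_t      n      = match_restriction_seq(stream);
        if (n != 0) {
            Token t = { TKN_RESTRICTION_SEQ, "" };
            memcpy(t.lexeme, stream, n);   /* lexeme: stretch start..end     */
            printf("token id %d, lexeme '%s'\n", (int)t.id, t.lexeme);
        }
        return 0;
    }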
There are two approaches to pattern matching: longest match and shortest match. With the shortest match approach, a lexer stops at the first acceptance state that it reaches. With the longest match approach, a lexer only stops upon drop-out, i.e. when there is no further transition for the given lexatom.
Shortest match implies restrictions, as can be seen in the example of two merged DFAs that detect the sequences ‘for’ and ‘forest’ (Fig. 7). The sequence f, o, r, e, s, t triggers acceptance after the third letter. With shortest match, the lexer signals a match of the pattern ‘for’ and stops. It can never recognize a ‘forest’. With longest match, it walks as far as possible along the paths until it finally reaches ‘forest’. It would only accept ‘for’ if it were not followed by e, s, and t. Clearly, shortest match restricts the set of treatable pattern configurations; longest match does not. For the sake of generality, this text treats the approach of longest match.
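The difference between the two approaches can be demonstrated in code. The sketch below hand-codes the merged DFA of Fig. 7; the function name match_for_forest and the flag selecting the approach are made up for illustration. Note how longest match requires remembering the last acceptance state passed, so that the lexer can fall back to it upon drop-out:

    #include <stdio.h>

    /* Merged DFA for "for" and "forest" (Fig. 7), hand-coded.
     * shortest != 0: stop at the first acceptance state reached.
     * otherwise    : longest match -- remember the last acceptance,
     *                continue until drop-out, then fall back to it. */
    static size_t match_for_forest(const char* input, int shortest)
    {
        static const char path[] = "forest"; /* the single path of the DFA */
        size_t i = 0, last_acceptance = 0;
        while (i < 6 && input[i] == path[i]) {
            ++i;                             /* transition to next state    */
            if (i == 3) {                    /* acceptance state of "for"   */
                if (shortest) return 3;      /* shortest match: stop here   */
                last_acceptance = 3;         /* longest match: remember it  */
            }
            if (i == 6) last_acceptance = 6; /* acceptance state of "forest"*/
        }
        return last_acceptance;              /* 0 means: nothing matched    */
    }

    int main(void)
    {
        printf("%zu\n", match_for_forest("forest", 1)); /* 3: only "for"    */
        printf("%zu\n", match_for_forest("forest", 0)); /* 6: full "forest" */
        printf("%zu\n", match_for_forest("forge",  0)); /* 3: fall back     */
        return 0;
    }

On the input ‘forge’, the longest-match variant runs ahead hoping for ‘forest’, drops out at the g, and falls back to the last acceptance, reporting ‘for’.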
Lexical analysis consists of a sequence of repeated analysis steps, as displayed in the flow chart of the process. First, the current position is set to p_begin, i.e. the start position of the current step. From there, the first lexatom is read. Starting from the initial state, it is determined to what state the DFA transits based on that lexatom. If there is no possible successor state, the lexatom is refused and the analysis step ends. Else, the position is set to the next position, and a new lexatom is read and fed into the DFA. The iteration of the analysis step ends with the first lexatom being refused. The according position p_end marks the end of the current lexeme. At this point, it is clear what pattern has matched, and what the lexeme is that matched the pattern. Now, a token can be delivered with that information. As long as the input stream is not exhausted, the next analysis step is initiated by setting p_begin to p_end of the current step, i.e. by starting where the last step ended.
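The flow chart can be mirrored by a small driver loop. The sketch below uses a toy matcher in place of a real DFA (it merely groups runs of nucleotide letters versus other lexatoms); names such as analysis_step are assumptions for illustration, not generated Quex code:

    #include <stdio.h>
    #include <string.h>

    /* Toy matcher standing in for the DFA: consumes a run of lexatoms of
     * one class (nucleotide letters vs. anything else) and returns p_end,
     * i.e. the position of the first refused lexatom. */
    static const char* analysis_step(const char* p_begin, const char* end)
    {
        const char* p = p_begin;
        int nucleotide = (strchr("GATC", *p) != NULL);
        while (p != end && (strchr("GATC", *p) != NULL) == nucleotide)
            ++p;                   /* transition; read the next lexatom */
        return p;                  /* p_end: first refused lexatom      */
    }

    int main(void)
    {
        const char* stream  = "GGATCC-TTAA";
        const char* end     = stream + strlen(stream);
        const char* p_begin = stream;
        while (p_begin != end) {                 /* input not exhausted  */
            const char* p_end = analysis_step(p_begin, end);
            /* deliver a token for the lexeme [p_begin, p_end) */
            printf("lexeme '%.*s'\n", (int)(p_end - p_begin), p_begin);
            p_begin = p_end;       /* next step starts where this ended  */
        }
        return 0;
    }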
The lexical analysis process is deterministic. Given an input stream and a start position, a DFA performs the same transitions on the same lexatoms. It refuses the same lexatom when a state is reached that does not trigger a further transition. The position of the refused lexatom is always the end position of the analysis step. This end position becomes the start position of the next step, until the input is exhausted. There is no element of uncertainty in this process. A given stream of lexatoms always produces the same stream of tokens. This causal relationship is a solid basis for unambiguous interpretation.
For a lexical analyzer to run, it needs access to a source providing an input stream. When the stream is stored in a file, passing the file name to the lexer’s constructor is all that needs to be done. However, practical applications may run in environments, such as embedded systems, that do not provide the C standard library, or that receive data over a serial connection. Other scenarios may require dynamic adaptation to a large variety of character encodings. In order to cope with a maximum of possible configurations, Quex provides the infrastructure of byte loaders and lexatom loaders. Those are discussed in detail in the section on input.
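The loader concept can be pictured as a small interface of function pointers. The following is a sketch of the idea only, not Quex’s actual byte loader API; it shows how a lexer could pull raw bytes from a memory region on a system without file I/O:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical byte loader interface -- a sketch of the concept, not
     * Quex's real API. The lexer calls 'load' to fill its buffer and never
     * needs to know where the bytes come from. */
    typedef struct ByteLoader ByteLoader;
    struct ByteLoader {
        size_t (*load)(ByteLoader* self, void* buffer, size_t n);
        void*  handle;   /* FILE*, socket descriptor, memory pointer, ... */
    };

    /* Example implementation: feed bytes from a memory region. */
    typedef struct { const char* p; const char* end; } MemorySource;

    static size_t memory_load(ByteLoader* self, void* buffer, size_t n)
    {
        MemorySource* src   = (MemorySource*)self->handle;
        size_t        avail = (size_t)(src->end - src->p);
        if (n > avail) n = avail;
        memcpy(buffer, src->p, n);
        src->p += n;
        return n;        /* number of bytes actually delivered */
    }

    int main(void)
    {
        static const char data[] = "GGATCC";
        MemorySource src    = { data, data + 6 };
        ByteLoader   loader = { memory_load, &src };
        char         buf[4];
        size_t       n = loader.load(&loader, buf, sizeof(buf));
        printf("loaded %zu bytes\n", n);  /* 4 bytes: "GGAT"; next: "CC" */
        return 0;
    }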
Given access to an input source, a lexer can identify atomic chunks of meaning and transform the input stream into a token stream. The further interpretation of those tokens is left to a syntax analyzer, or parser. This subject, though, is beyond the scope of this text.