The Process

The input to lexical analysis is a lexatom sequence; its output is a sequence of atomic chunks of meaning, so-called tokens (Fig. 5). In the following, the abbreviation ‘lexer’ stands for ‘lexical analyzer’, that is, the engine that performs the analysis.

../_images/lexer.svg

Fig. 5 A lexer and its input and output.

Token

A token is an object that reports which pattern matched a subsequence of lexatoms in the input stream. Optionally, it carries along the actual lexeme.
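
As a rough illustration, such an object could be modeled in C++ as follows; the type names are hypothetical and do not refer to any concrete lexer’s API:

    #include <string>

    // Illustrative sketch of a token object; the names are hypothetical.
    typedef int TokenId;     // scalar identifier of the category of meaning

    struct Token {
        TokenId     id;      // e.g. TKN_RESTRICTION_SEQ (see below)
        std::string lexeme;  // optional: the matched stretch of lexatoms
    };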

Practically, a token contains a token identifier, a scalar value which identifies its category of meaning. A token is the result of an analysis step as shown in Fig. 6, which demonstrates how the DNA restriction sequence GGATCC is detected by means of a DFA. The analysis step begins at position start with the DFA in the initial state named 0. The nucleotide G at the first position triggers a transition to state 1. The input for the next transition comes from the subsequent position. This process continues until the second C, which causes a transition from state 5 to the acceptance state 6. The following lexatom does not trigger any state transition, because state 6 has none; it is refused. The end position is set right before the refused lexatom. The stretch from start to end constitutes the matching lexeme. The lexer, in this case, prepares a token TKN_RESTRICTION_SEQ with the matching lexeme GGATCC.

../_images/state-machine-dna-ggatcc.svg

Fig. 6 DFA detecting the DNA restriction endonuclease BamH I, namely GGATCC.
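
To make the mechanics concrete, the following hand-written C++ sketch mimics the DFA of Fig. 6; the transition function, the state numbering, and the sample input are assumptions made for this example, not generated code:

    #include <cstdio>

    // Hand-written sketch of the DFA in Fig. 6 (states 0..6, state 6 accepts).
    // Returns the successor state, or -1 if the lexatom is refused.
    static int next_state(int state, char lexatom) {
        static const char expected[6] = {'G', 'G', 'A', 'T', 'C', 'C'};
        if (state < 6 && lexatom == expected[state])
            return state + 1;
        return -1;   // state 6 has no transitions; mismatches are refused, too
    }

    int main() {
        const char* input = "GGATCCGT";  // assumed input stream
        int         state = 0;
        int         end   = 0;           // position right after the lexeme
        int         next;
        while ((next = next_state(state, input[end])) != -1) {
            state = next;                // transition triggered by the lexatom
            ++end;                       // read from the subsequent position
        }
        if (state == 6)                  // acceptance state reached
            std::printf("TKN_RESTRICTION_SEQ, lexeme: %.*s\n", end, input);
        return 0;
    }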

There are two approaches to pattern matching: longest match and shortest match []. With the shortest match approach, a lexer stops at the first acceptance state that it reaches. With the longest match approach, a lexer only stops upon drop-out, i.e. when there is no further transition for the given lexatom.

../_images/state-machine-for-forest.svg

Fig. 7 Two merged DFAs matching ‘for’ and ‘forest’.

Shortest match implies restrictions, as can be seen in the example of two merged DFAs that detect the sequences for and forest (Fig. 7). The sequence f, o, r, e, s, t triggers acceptance after the third letter. With shortest match, the lexer signals a match of pattern for and stops; it can never recognize a forest. With longest match, it walks as far as possible along the paths until it finally reaches forest. It would only accept for if it were not followed by e, s, and t. Clearly, shortest match restricts the set of treatable pattern configurations; longest match does not. For the sake of generality, this text treats the longest match approach.
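
The following C++ sketch shows the essence of longest match on the merged DFA of Fig. 7, which consists of a single linear path; the fallback to the most recent acceptance upon drop-out is the decisive ingredient (names and inputs are made up for this example):

    #include <cstdio>
    #include <cstddef>

    // Longest match on the merged DFA of Fig. 7: a single linear path
    // f-o-r-e-s-t with acceptance states after 'r' and after 't'.
    // Returns the name of the last accepted pattern, or nullptr.
    static const char* longest_match(const char* input, std::size_t* end) {
        static const char path[] = "forest";
        const char*  last_pattern = nullptr;  // most recent acceptance
        std::size_t  last_end     = 0;
        std::size_t  i            = 0;
        while (path[i] != '\0' && input[i] == path[i]) {
            ++i;
            if (i == 3) { last_pattern = "for";    last_end = i; }
            if (i == 6) { last_pattern = "forest"; last_end = i; }
        }
        *end = last_end;   // upon drop-out, fall back to the last acceptance
        return last_pattern;
    }

    int main() {
        std::size_t end;
        // Shortest match would already stop at 'for'; longest match walks on:
        std::printf("%s\n", longest_match("forests", &end));  // prints 'forest'
        // After drop-out behind 'fort', the last acceptance 'for' is used:
        std::printf("%s\n", longest_match("fortune", &end));  // prints 'for'
        return 0;
    }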

Lexical analysis consists of a sequence of repeated analysis steps, as displayed in Fig. 8. First, the current position is set to p_begin, i.e. the start position of the current step. From there, the first lexatom is read. Starting from the initial state, it is now determined to what state the DFA transits based on that lexatom. If there is no possible successor state, the lexatom is refused and the analysis step ends. Else, the position is set to the next position, and a new lexatom is read and fed into the DFA. The iteration of the analysis step ends with the first lexatom being refused. The corresponding position p_end marks the end of the current lexeme. At this point, it is clear what pattern has matched, and what the lexeme is that matched the pattern. Now, a token can be delivered with that information. As long as the input stream is not exhausted, the next analysis step is initiated by setting p_begin to the p_end of the current step, i.e. by starting where the last step ended.

../_images/lexer-flow-diagram.svg

Fig. 8 Flow diagram of lexical analysis with a DFA.
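
The loop structure of Fig. 8 may be sketched in C++ as below; the helper functions initial_state(), transition(), pattern_of(), and emit() are hypothetical stand-ins for generated code:

    #include <cstddef>

    // Hypothetical helpers, standing in for generated code:
    int  initial_state();
    int  transition(int state, char lexatom);  // -1 means: lexatom refused
    int  pattern_of(int state);                // pattern accepted in 'state'
    void emit(int pattern_id, const char* lexeme, std::size_t length);

    // One analysis step per iteration of the outer loop, as in Fig. 8.
    // (The fallback to the last acceptance required for longest match is
    // omitted for brevity; see the previous example.)
    void analyze(const char* stream, std::size_t length) {
        std::size_t p_begin = 0;
        while (p_begin < length) {               // until input is exhausted
            std::size_t p     = p_begin;
            int         state = initial_state();
            int         next;
            while (p < length && (next = transition(state, stream[p])) != -1) {
                state = next;                    // feed the lexatom into the DFA
                ++p;                             // read from the next position
            }
            std::size_t p_end = p;               // position of the refused lexatom
            emit(pattern_of(state), stream + p_begin, p_end - p_begin);
            p_begin = p_end;                     // start where this step ended
        }
    }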

The lexical analysis process is deterministic. Given an input stream and a start position, a DFA triggers on the same lexatoms along its states. It refuses the same lexatom when a state is reached that does not trigger a further transition. The position of the refused lexatom is always the end position of the analysis step. This end position becomes the start position of the next step, until the input is exhausted. There is no element of uncertainty in this process. A given stream of lexatoms always produces the same stream of tokens. This causal relationship is a solid basis for distinct interpretation.

For a lexical analyzer to run, it needs access to a source providing an input stream. When the stream is stored in a file, passing the file name to the lexer’s constructor is all that needs to be done. However, practical applications may run in environments, such as embedded systems, that do not provide the C standard library, or they may receive their data over a serial connection. Other scenarios may require dynamic adaptation to a large variety of character encodings. In order to cope with a maximum of possible configurations, Quex provides the infrastructure of byte loaders and lexatom loaders. Those are discussed in detail in section sec:input.
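
Conceptually, a byte loader can be pictured as a minimal interface like the following sketch; this is an illustration of the idea only, not Quex’s actual class hierarchy:

    #include <cstddef>

    // Illustrative sketch only; not Quex's actual API.
    class ByteLoader {
    public:
        virtual ~ByteLoader() {}
        // Fill 'buffer' with up to 'n' bytes from the underlying source
        // (file, socket, serial line, ...); return the number provided.
        virtual std::size_t load(void* buffer, std::size_t n) = 0;
    };

A lexatom loader would then sit on top of such a byte loader and translate the raw bytes into lexatoms of the configured character encoding.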

Given access to an input source, a lexer can identify atomic chunks of meaning and transform the input stream into a token stream. The further interpretation of those tokens is left to a syntax analyzer, or parser []. This subject, though, is beyond the scope of this text.
