Source Code Reactions
The brief commands following => cover the most essential actions related to
pattern matches, namely token sending and mode transitions. More
sophisticated operations can be specified in source code. Code to be
executed as a reaction to a pattern is specified in curly brackets { … }
directly following the regular expression.
As with brief commands, the implicit variables Lexeme, LexemeBegin,
LexemeEnd, and LexemeL as shown in ref:fig_lexeme_variables are
available. The lexer object is available via the variable self. The code
inside the curly brackets is only limited by the syntax of the language
for which the lexer is generated. Since token sending and mode transitions
are essential reactions to pattern matches, the corresponding member functions
are listed below.
The member functions for token sending in C++ are:
void send(TOKEN_ID Id);
void send_n(TOKEN_ID Id, size_t RepetitionN);
bool send_text(TOKEN_ID Id,
               LEXATOM* BeginP, LEXATOM* EndP);
bool send_string(TOKEN_ID Id,
                 LEXATOM* ZeroTerminatedString);
In C, member functions are function pointers where a pointer to the lexer object is passed as first argument.
void (*send)(LEXER* me, TOKEN_ID Id);
void (*send_n)(LEXER* me, TOKEN_ID Id, size_t RepetitionN);
bool (*send_text)(LEXER* me,
                  TOKEN_ID Id, LEXATOM* BeginP, LEXATOM* EndP);
bool (*send_string)(LEXER* me,
                    TOKEN_ID Id, LEXATOM* ZeroTerminatedString);
Accordingly, a token with the current lexeme as text content and the
token id QUEX_TKN_SOMETHING is sent in C++ by
self.send_string(QUEX_TKN_SOMETHING, Lexeme);
and in C by
self.send_string(&self, QUEX_TKN_SOMETHING, Lexeme);
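The C calling convention, where the object appears both before the dot and as the first argument, follows from storing function pointers inside the struct. The following is a minimal sketch of that idiom, written in the C-compatible subset of C++. ToyCLexer, toy_send, toy_send_n, toy_lexer_make, and the fixed-size queue are hypothetical stand-ins for illustration; they are not the types or functions that Quex generates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins, NOT the generated Quex types.
typedef int TOKEN_ID;
const std::size_t TOY_QUEUE_CAPACITY = 16;

struct ToyCLexer {
    // "Member functions" are function pointers; the object is passed explicitly.
    void (*send)(ToyCLexer* me, TOKEN_ID Id);
    void (*send_n)(ToyCLexer* me, TOKEN_ID Id, std::size_t RepetitionN);

    TOKEN_ID    queue[TOY_QUEUE_CAPACITY];  // ids of the tokens sent so far
    std::size_t queue_size;
};

// Appends one token id to the toy queue.
void toy_send(ToyCLexer* me, TOKEN_ID Id) {
    assert(me->queue_size < TOY_QUEUE_CAPACITY);
    me->queue[me->queue_size++] = Id;
}

// Appends the same token id RepetitionN times.
void toy_send_n(ToyCLexer* me, TOKEN_ID Id, std::size_t RepetitionN) {
    for (std::size_t i = 0; i < RepetitionN; ++i) toy_send(me, Id);
}

// Plays the role of a constructor: zero the object, then wire the pointers.
ToyCLexer toy_lexer_make() {
    ToyCLexer self = {};
    self.send   = toy_send;
    self.send_n = toy_send_n;
    return self;
}
```

With an object created by toy_lexer_make(), a call reads self.send(&self, 7); the leading self. selects the function pointer, and &self tells the function which object to operate on.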
Note
Reflections on Zero-Terminated Strings
According to [], zero termination dates back to the PDP-7
micro, where there was a string type ASCIZ for zero-terminated ASCII
strings. Positive points about zero termination are:
- Only one extra character is used to determine length.
- No extra string type is required, just a pointer to character.
- A single CPU register is enough to iterate over a string. Instead of comparing the iterating pointer against a limit, the register holding the character is compared against zero.
Those arguments are partly anachronistic and partly carry counterarguments with them. The advantages of length-terminated or end-pointer-terminated strings are:
- A length-terminated string can refer to immutable storage. When referring to parts of other strings, this might not only be faster but also use less memory.
- The length can be determined without iterating over the string.
- An accidentally omitted or overwritten terminating zero does not cause an unbounded iteration.
- With zero termination, extra checking might be required if incoming strings contain early terminating zeros.
- Comparisons for equality might be quicker if lengths can be compared beforehand.
- A string type provides some type safety.
Depending on the particular application, the choice between zero termination and length termination can have a significant impact on the overall performance.
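Some of the points in this note can be made concrete with std::string_view from the C++17 standard library, used here as one example of a length-terminated string type. This is an illustrative sketch only: the helper names (sub_view, equal_with_length_check, with_embedded_zero, number_in_source) and the sample text are assumptions made for this example and have nothing to do with the Quex API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// A view on a part of another string: no copy is made and the underlying
// storage is not modified, so it may be immutable.
std::string_view sub_view(std::string_view text, std::size_t begin, std::size_t length) {
    assert(begin <= text.size());
    return text.substr(begin, length);
}

// Equality test that compares the lengths before looking at any character.
bool equal_with_length_check(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;        // no content comparison needed
    return std::equal(a.begin(), a.end(), b.begin());
}

// Five characters with a zero in the middle. The view knows its length;
// a zero-terminated reading of the same buffer stops after "ab".
std::string_view with_embedded_zero() {
    return std::string_view("ab\0cd", 5);
}

static const char Source[] = "int x = 4711;";      // immutable storage

// Refers to the digits inside Source without copying them.
std::string_view number_in_source() {
    return sub_view(Source, 8, 4);
}
```

For the view returned by with_embedded_zero(), size() is available without iteration, whereas std::strlen() on its data() pointer walks the buffer and stops at the embedded zero.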
Tokens can be prepared before actually being sent. Safe access to the next
token to be sent is given via the member function token_p(), e.g. in C++
self.token_p()->number = 4711;
self.send(QUEX_TKN_INTEGER);
sets the member number in the token to be sent to 4711. The same is
accomplished in C by
self.token_p(&self)->number = 4711;
self.send(&self, QUEX_TKN_INTEGER);
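The prepare-then-send pattern can be pictured with a toy model: token_p() exposes the token that the next send call will use, and send() stamps the token id on it and appends it to a queue. The sketch below is an assumption about the general idea, not a description of how the generated lexer is implemented. ToyToken, ToyQueueLexer, TOY_TKN_INTEGER, and receive() are hypothetical names.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-ins, NOT the generated Quex classes.
typedef int TOKEN_ID;
const TOKEN_ID TOY_TKN_INTEGER = 1;

struct ToyToken {
    TOKEN_ID id     = 0;
    int      number = 0;   // user-defined token member
};

class ToyQueueLexer {
public:
    // Access to the token that the next send() will enqueue.
    ToyToken* token_p() { return &pending_; }

    // Stamps the id on the prepared token, enqueues it, and starts a fresh one.
    void send(TOKEN_ID Id) {
        pending_.id = Id;
        queue_.push_back(pending_);
        pending_ = ToyToken();
    }

    // Delivers tokens in the order in which they were sent.
    // Returns false if the queue is empty.
    bool receive(ToyToken* out) {
        assert(out != nullptr);
        if (queue_.empty()) return false;
        *out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    ToyToken             pending_;
    std::deque<ToyToken> queue_;
};
```

In this model, setting self.token_p()->number = 4711 followed by self.send(TOY_TKN_INTEGER) results in a queued token that carries both the id and the number.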
Once the token is prepared, a call to a send function puts the token into
the internal token queue. Mode transitions are controlled via the member
functions
const Mode* mode();
void enter_mode(const Mode* TargetModeP);
void push_mode(Mode* TargetModeP);
void pop_mode();
void pop_drop_mode();
In C, the corresponding function pointers take a pointer to the analyzer object as the first argument.
const Mode* (*mode)(LEXER* me);
void (*enter_mode)(LEXER* me, const Mode* TargetModeP);
void (*push_mode)(LEXER* me, Mode* TargetModeP);
void (*pop_mode)(LEXER* me);
void (*pop_drop_mode)(LEXER* me);
The current mode is delivered by mode(). A transition into a
mode is triggered by enter_mode(TargetModeP), where TargetModeP is a
pointer to the desired target mode. This corresponds to the GOTO command,
as mentioned before. GOSUB and RETURN are implemented via the member
functions push_mode(TargetModeP) and pop_mode(). Inside the source code
sections, all mode pointers are available via their mode name, i.e. STRING
would be a pointer to the STRING mode. In C++ a transition to this mode is
triggered by
self.enter_mode(STRING);
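The difference between the GOTO-like enter_mode() and the GOSUB/RETURN pair push_mode()/pop_mode() can be illustrated with a small mode stack. The following is a sketch under the assumption that push_mode() remembers the current mode and pop_mode() returns to the most recently remembered one. ToyMode, ToyModeLexer, and the PROGRAM mode are hypothetical; pop_drop_mode() is left out because its behavior is not described here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins, NOT the generated Quex classes.
struct ToyMode { const char* name; };

ToyMode ProgramModeObject = { "PROGRAM" };
ToyMode StringModeObject  = { "STRING" };

// Mode pointers named after their modes, as inside source code sections.
ToyMode* const PROGRAM = &ProgramModeObject;
ToyMode* const STRING  = &StringModeObject;

class ToyModeLexer {
public:
    explicit ToyModeLexer(const ToyMode* StartModeP) : current_(StartModeP) {}

    const ToyMode* mode() const { return current_; }

    // GOTO: replace the current mode; nothing is remembered.
    void enter_mode(const ToyMode* TargetModeP) { current_ = TargetModeP; }

    // GOSUB: remember the current mode, then enter the target mode.
    void push_mode(const ToyMode* TargetModeP) {
        stack_.push_back(current_);
        current_ = TargetModeP;
    }

    // RETURN: go back to the most recently remembered mode.
    void pop_mode() {
        assert(!stack_.empty());
        current_ = stack_.back();
        stack_.pop_back();
    }

private:
    const ToyMode*              current_;
    std::vector<const ToyMode*> stack_;
};
```

Starting in PROGRAM, self.push_mode(STRING) followed by self.pop_mode() ends in PROGRAM again, whereas self.enter_mode(STRING) leaves no trace of the mode it came from.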
There are two special commands for source code sections:
- FLUSH terminates the source code section and stops the filling of the token queue until it has been emptied completely.
- CONTINUE terminates the source code section, but does not necessarily return from the analyzer function.
Both commands ensure that the on_after_match handler is executed, if
present. The top-level sections header, body, constructor, etc.
(section sec_top_level) enable the management of customized member
variables of the lexer. Once properly set up, they can equally be referenced as
members of self.