Source Code Reactions

The brief commands following => cover the most essential actions related to pattern matches, namely token sending and mode transitions. For more sophisticated operations, actions can be specified in source code. Code to be executed in reaction to a pattern match is specified in curly brackets {} directly following the regular expression.

As with brief commands, the implicit variables Lexeme, LexemeBegin, LexemeEnd, and LexemeL, as shown in ref:fig_lexeme_variables, are available. The lexer object is available via the variable self. The code inside the curly brackets is limited only by the syntax of the language for which the lexer is generated. Since token sending and mode transitions are essential reactions to pattern matches, the corresponding member functions are listed below.

The member functions for token sending in C++ are:

void send(TOKEN_ID Id);
void send_n(TOKEN_ID Id, size_t RepetitionN);
bool send_text(TOKEN_ID Id,
               LEXATOM* BeginP, LEXATOM* EndP);
bool send_string(TOKEN_ID Id,
                 LEXATOM* ZeroTerminatedString);

In C, member functions are function pointers, and a pointer to the lexer object is passed as the first argument:

void (*send)(LEXER* me, TOKEN_ID Id);
void (*send_n)(LEXER* me, TOKEN_ID Id, size_t RepetitionN);
bool (*send_text)(LEXER* me,
                  TOKEN_ID Id, LEXATOM* BeginP, LEXATOM* EndP);
bool (*send_string)(LEXER* me,
                    TOKEN_ID Id, LEXATOM* ZeroTerminatedString);

Accordingly, a token with the current lexeme as text content and the token id QUEX_TKN_SOMETHING is sent in C++ by

self.send_string(QUEX_TKN_SOMETHING, Lexeme);

and in C by

self.send_string(&self, QUEX_TKN_SOMETHING, Lexeme);
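The C calling convention can be illustrated with a small stand-alone sketch. The names CStyleLexer, TokenId, c_style_send, and c_style_lexer_init are hypothetical stand-ins and not part of any generated lexer. The sketch only shows why the object has to be passed explicitly when a "member function" is a function pointer stored in a struct.

```cpp
#include <cstddef>

// Hypothetical stand-ins for illustration only; a generated lexer
// defines its own lexer and token id types.
typedef int TokenId;

struct CStyleLexer {
    TokenId     last_sent;                      // most recently sent token id
    std::size_t sent_n;                         // number of tokens sent so far
    void (*send)(CStyleLexer* me, TokenId id);  // C-style "member function"
};

// Free function that plays the role of the member; 'me' replaces 'this'.
static void c_style_send(CStyleLexer* me, TokenId id) {
    me->last_sent = id;
    me->sent_n += 1;
}

// Wires up the function pointer, as a constructor would do in C++.
static void c_style_lexer_init(CStyleLexer* me) {
    me->last_sent = 0;
    me->sent_n    = 0;
    me->send      = c_style_send;
}
```

After c_style_lexer_init(&self), a call self.send(&self, 42) stores 42 in self.last_sent. The function pointer itself carries no object, which is why &self appears a second time in every C call.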

Note

Reflections on Zero-Terminated Strings

According to [], zero termination dates back to the PDP-7 minicomputer, which had a string type ASCIZ for zero-terminated ASCII strings. Positive points about zero termination are:

  • Only one extra character is used to determine length.

  • No extra string type is required; a pointer to a character is enough.

  • A single CPU register is enough to iterate over a string. Instead of comparing the iterating pointer against a limit, the character holding register is compared against zero.

Those arguments are partly anachronistic and partly carry counterarguments with them. The advantages of length-terminated or end-pointer-terminated strings are:

  • A length-terminated string can refer to immutable storage. When referring to parts of other strings, this might not only be faster but also use less memory.

  • The length can be determined without iterating over the string.

  • An accidentally left out, or overwritten, terminating zero does not cause an unbounded iteration.

  • With zero termination, extra checking might be required if incoming strings contain early terminating zeros. With an explicit length, an embedded zero is an ordinary character.

  • Equality comparisons can be quicker if the lengths are compared beforehand.

  • A string type provides some type safety.

Depending on the application, the choice between zero termination and length termination can have a significant impact on overall performance.
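The trade-offs listed in this note can be tried out with a few lines of standard C++. The following is a minimal sketch that assumes C++17 (for std::string_view). The helper names zero_terminated_length, part_of, and equal_by_length_first are made up for this illustration and have nothing to do with the lexer's own string handling.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// Length of a zero-terminated string: the characters must be walked
// until the terminating zero is found.
inline std::size_t zero_terminated_length(const char* s) {
    return std::strlen(s);
}

// A length-terminated view onto part of another string: nothing is
// copied and no terminating zero is written into the storage.
inline std::string_view part_of(std::string_view whole,
                                std::size_t begin, std::size_t length) {
    return whole.substr(begin, length);
}

// Equality with the length compared first: strings of different
// length are rejected without looking at a single character.
inline bool equal_by_length_first(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    return std::equal(a.begin(), a.end(), b.begin());
}
```

For the text "hello world", part_of(text, 6, 5) refers to the five characters of "world" inside the original storage. A view constructed as std::string_view("ab\0cd", 5) keeps all five characters, whereas zero_terminated_length on the same data stops at the embedded zero.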

Tokens can be prepared before actually being sent. Safe access to the next token to be sent is given via the member function token_p(). For example, in C++

self.token_p()->number = 4711;
self.send(QUEX_TKN_INTEGER);

sets the member number in the token to be sent to 4711. The same is accomplished in C by

self.token_p(&self)->number = 4711;
self.send(&self, QUEX_TKN_INTEGER);
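The prepare-then-send pattern can be modeled in a few lines. The sketch below is a simplified mock and not the generated lexer. QueueingLexer, MockToken, TKN_INTEGER, and the accessor queue() are hypothetical names. Resetting the pending token after each send is an assumption made to keep the example clean.

```cpp
#include <deque>

// Hypothetical stand-ins; a generated lexer provides its own token
// type, token ids, and queue implementation.
typedef int TokenId;
const TokenId TKN_INTEGER = 1;   // made-up id, not a generated constant

struct MockToken {
    TokenId id     = 0;
    int     number = 0;          // custom member, like 'number' above
};

class QueueingLexer {
public:
    // Access to the token that the next send() call will deliver.
    MockToken* token_p() { return &pending_; }

    // Stamps the id onto the prepared token and appends it to the queue.
    void send(TokenId id) {
        pending_.id = id;
        queue_.push_back(pending_);
        pending_ = MockToken();  // assumption: start over with a clean token
    }

    // Sketch-only accessor to inspect what has been queued.
    std::deque<MockToken>& queue() { return queue_; }

private:
    MockToken             pending_;
    std::deque<MockToken> queue_;
};
```

With a QueueingLexer named self, the two statements self.token_p()->number = 4711; and self.send(TKN_INTEGER); leave one token in the queue whose id is TKN_INTEGER and whose number is 4711.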

Once the token is prepared, a call to a send function puts the token into the internal token queue.

Mode transitions are controlled via the member functions

const Mode*  mode();

void         enter_mode(const Mode* TargetModeP);
void         push_mode(Mode* TargetModeP);
void         pop_mode();
void         pop_drop_mode();

In C, the corresponding function pointers take the analyzer object as the first argument.

const Mode*  mode(LEXER* me);

void         enter_mode(LEXER* me, const Mode* TargetModeP);
void         push_mode(LEXER* me, Mode* TargetModeP);
void         pop_mode(LEXER* me);
void         pop_drop_mode(LEXER* me);

The current mode is delivered by mode(). A transition into a mode is triggered by enter_mode(TargetModeP), where TargetModeP is a pointer to the desired target mode. This corresponds to the GOTO command mentioned before. GOSUB and RETURN are implemented via the member functions push_mode(TargetModeP) and pop_mode(). Inside the source code sections, all mode pointers are available via their mode name; for example, STRING is a pointer to the STRING mode. In C++, a transition to this mode is triggered by

self.enter_mode(STRING);
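The relation between GOTO, GOSUB, and RETURN and the three member functions can be pictured with a small model of a mode stack. This is a sketch under assumptions and not the generated code. ModeStackLexer, the simplified Mode struct, and stack_depth() are hypothetical. Ignoring pop_mode() on an empty stack is a choice made only for this sketch. pop_drop_mode() is left out because its behavior is not described here.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for a mode; real modes carry far more state.
struct Mode { const char* name; };

class ModeStackLexer {
public:
    explicit ModeStackLexer(const Mode* start) : current_(start) {}

    const Mode* mode() const { return current_; }

    // GOTO: replace the current mode; the stack is left untouched.
    void enter_mode(const Mode* target) { current_ = target; }

    // GOSUB: remember the current mode, then enter the target mode.
    void push_mode(const Mode* target) {
        stack_.push_back(current_);
        current_ = target;
    }

    // RETURN: go back to the most recently remembered mode.
    // Sketch-only choice: an empty stack is silently ignored.
    void pop_mode() {
        if (stack_.empty()) return;
        current_ = stack_.back();
        stack_.pop_back();
    }

    // Sketch-only accessor to observe the stack.
    std::size_t stack_depth() const { return stack_.size(); }

private:
    const Mode*              current_;
    std::vector<const Mode*> stack_;
};
```

In this model, enter_mode() never changes the stack depth, so a later pop_mode() returns to whatever mode was active at the last push_mode(), regardless of how many enter_mode() calls happened in between.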

There are two special commands for source code sections:

FLUSH

terminates the source code section and stops the filling of the token queue until it has been emptied completely.

CONTINUE

terminates the source code section, but does not necessarily return from the analyzer function.

Both commands ensure that the on_after_match handler is executed, if present. The top-level sections header, body, constructor, etc. (section sec_top_level) enable the management of customized member variables of the lexer. Once properly set up, they can equally be referenced as members of self.