Brief Commands
A => operator following the regular expression of a pattern-action pair initiates
either a token sending command or a mode transition command. A token sending
command starts with a token id. A mode transition command starts with one of the
keywords GOTO, GOSUB, or RETURN.
Token Sending
A token id following the => operator can be specified as a plain number
(section sec_number_specification), a Unicode character, or a token-id
name. The following lines show token sending commands set up with numerical
token ids.
// pattern: token sending command: token-id (ucs-value):
//
Z => 27; // 0x1B
honey => 0x1000; // 0x1000
butter => 0o456; // 0x12E
bread => 0b1000.0110.1010.0101; // 0x86A5
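The hexadecimal values in the comments above can be checked with a small C sketch. The helper `parse_token_id` is hypothetical and not part of Quex; it only mimics the notations shown in the listing (decimal, `0x`, `0o`, `0b`) and assumes that dots are plain digit-group separators, as the binary example suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a Quex function: evaluate a number written in
 * one of the notations used in the listing above.
 * Assumption: a '.' only separates digit groups and carries no value. */
static unsigned long parse_token_id(const char *s)
{
    unsigned base = 10;
    unsigned long value = 0;

    assert(s != NULL);

    /* Detect the base prefix: 0x (hex), 0o (octal), 0b (binary). */
    if (s[0] == '0' && s[1] != '\0') {
        if      (s[1] == 'x') { base = 16; s += 2; }
        else if (s[1] == 'o') { base = 8;  s += 2; }
        else if (s[1] == 'b') { base = 2;  s += 2; }
    }

    for (; *s != '\0'; ++s) {
        unsigned digit;
        if (*s == '.') continue;                 /* skip group separator */
        if      (*s >= '0' && *s <= '9') digit = (unsigned)(*s - '0');
        else if (*s >= 'a' && *s <= 'f') digit = (unsigned)(*s - 'a') + 10;
        else if (*s >= 'A' && *s <= 'F') digit = (unsigned)(*s - 'A') + 10;
        else break;                              /* stop at a non-digit */
        if (digit >= base) break;                /* digit invalid in this base */
        value = value * base + digit;
    }
    return value;
}
```

Calling `parse_token_id("0o456")`, for instance, evaluates the octal digits 4, 5, 6 as 4*64 + 5*8 + 6, which is the 0x12E noted for the butter pattern.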
Token ids can also be specified in terms of their Unicode code point. This can
be done by specifying a character in single quotes, such as '+' for the
code point 43 (0x2b), or with complete Unicode names following the UC
keyword. Some examples are listed below.
// pattern: token sending command: token-id (ucs-value):
//
"+" => '+'; // 0x2b
"-" => '-'; // 0x2d
ε => 'ε'; // 0x3b5
∞|infinity => '∞'; // 0x221e
𝄞 => UC MUSICAL_SYMBOL_G_CLEF; // 0x1d11e
\x23 => UC NUMBER_SIGN; // 0x23
The way that plus and minus operators are passed comes in handy in the context of parser generators such as Yacc, Bison, Lemon, or ANTLR.
Finally, token sending commands may be specified with the token-id name. For
that, the name must be given in full, i.e. including the token
prefix. By default, the token prefix is QUEX_TKN_. In order to avoid
warning messages, all token-id names should be defined in a token section
(sec_top_level_token) or in an external token-id definition file.
Below are some examples of token sending commands with token-id names.
token{ PLUS; MINUS; EPSILON; INFINITY; }
mode EXAMPLE {
// pattern: token sending command:
//
"+" => QUEX_TKN_PLUS;
"-" => QUEX_TKN_MINUS;
ε => QUEX_TKN_EPSILON;
∞|infinity => QUEX_TKN_INFINITY;
}
When a token-id name is used, further content of the token to be sent can be specified as a comma-separated list in parentheses. For example:
[a-zA-Z_]+ => QUEX_TKN_VARIABLE(Lexeme);
sends a token with the id QUEX_TKN_VARIABLE and the text member
set to a copy of the lexeme that matched [a-zA-Z_]+. Members of the
token class can also be set explicitly. For example:
[0-9]+ => QUEX_TKN_NUMBER(number=(size_t)atoi((char*)Lexeme), text=LexemeNull);
matches an integer number and assigns its numeric interpretation, using
atoi(), to the member number in the token object to be sent. Details of
token class definition and token construction are discussed in chapter
sec_token_classes.
Warning
The token class’ constructor is not called upon token sending! Members of the token class that are not specified are left untouched. This means that tokens may contain ‘trash’ from previous usage if it is not overwritten. For example:
[0-9]+ => QUEX_TKN_NUMBER(number=atoi(Lexeme));
sets the token class member .number, but leaves .text unchanged, so
that it still contains the content from the token’s last usage. This may have
surprising effects. If the same token was used before to transport the
keyword lexeme print, then, when it is sent upon the match
of a list of digits, its .text member still contains print.
Nevertheless, the potentially confusing behavior of .text carrying
unrelated content may be avoided by explicitly passing LexemeNull
as an argument:
[0-9]+ => QUEX_TKN_NUMBER(number=atoi(Lexeme), text=LexemeNull);
Token objects are comparable to cargo ships. They are produced (‘constructed’) once. They are loaded and transport their cargo from their origin to their destination, where they are unloaded. This cycle repeats many times until they reach the end of their lifetime, when they are finally disposed of (‘destructed’). Tokens are constructed at the start of lexical analysis. They are filled upon the match of a pattern, and their content is consumed by the user of the lexical analyzer (e.g. a parser). Then they are made available again to be filled upon another match. Only when lexical analysis is completely over are they destructed. For the sake of computational efficiency, repeated construction and destruction is avoided.
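The reuse of a token object, and the stale .text content it can cause, may be sketched in plain C. Everything in this sketch is hypothetical: the `Token` struct, the id constants, and the `send_*` functions only imitate what a token sending command does to the members it names; they are not the code Quex generates, and lexemes are assumed to be ordinary `char` strings.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token ids and token class, for illustration only. */
enum { TKN_KEYWORD = 1, TKN_NUMBER = 2 };

typedef struct {
    int    id;
    size_t number;
    char   text[32];
} Token;

/* 'Construction' happens once, before lexical analysis starts. */
static void token_construct(Token *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/* Sends a keyword: only .id and .text are set. */
static void send_keyword(Token *t, const char *lexeme)
{
    t->id = TKN_KEYWORD;
    snprintf(t->text, sizeof t->text, "%s", lexeme);
}

/* Imitates QUEX_TKN_NUMBER(number=atoi(Lexeme)):
 * .text is not mentioned, so it keeps whatever it held before. */
static void send_number(Token *t, const char *lexeme)
{
    t->id     = TKN_NUMBER;
    t->number = (size_t)atoi(lexeme);
}

/* Imitates QUEX_TKN_NUMBER(number=atoi(Lexeme), text=LexemeNull):
 * .text is explicitly reset to an empty string. */
static void send_number_clean(Token *t, const char *lexeme)
{
    send_number(t, lexeme);
    t->text[0] = '\0';
}
```

If one token object is constructed once and then passed to `send_keyword(&t, "print")` followed by `send_number(&t, "4711")`, its `text` member still reads `print`. Using `send_number_clean` instead leaves `text` empty.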
Note
The default token class copies the lexeme when assigning it to the text
member. Alternatively, one might reference the lexeme directly with a pointer
into the buffer. In that case, however, it must be guaranteed that
one reacts properly to changes of the buffer content. That is,
inside the handler on_buffer_before_change, new habitats for existing
references to its content must be allocated and assigned.
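The idea of giving a referenced lexeme a new habitat before the buffer changes could look roughly like the following C sketch. The names `RefToken`, `ref_lexeme`, `rehome`, and `release` are hypothetical; in a real analyzer, the equivalent of `rehome` would be called from inside the on_buffer_before_change handler, whose actual signature is not shown here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token that references its lexeme instead of copying it. */
typedef struct {
    const char *text;    /* points into the buffer, or to 'owned' */
    size_t      length;  /* number of characters in the lexeme */
    char       *owned;   /* heap copy, NULL while text still lives in the buffer */
} RefToken;

/* Reference the lexeme in place: no copy is made. */
static void ref_lexeme(RefToken *t, const char *begin, size_t length)
{
    assert(t != NULL && begin != NULL);
    t->text   = begin;
    t->length = length;
    t->owned  = NULL;
}

/* To be called before the buffer content changes: move the referenced
 * text to freshly allocated memory. Returns 1 on success, 0 if the
 * allocation failed. */
static int rehome(RefToken *t)
{
    char *copy;

    if (t->owned != NULL) return 1;      /* already moved out of the buffer */

    copy = malloc(t->length + 1);
    if (copy == NULL) return 0;

    memcpy(copy, t->text, t->length);
    copy[t->length] = '\0';
    t->owned = copy;
    t->text  = copy;
    return 1;
}

/* Free the heap copy, if any, and clear the reference. */
static void release(RefToken *t)
{
    free(t->owned);
    t->owned  = NULL;
    t->text   = NULL;
    t->length = 0;
}
```

The trade-off is the one the note describes: referencing avoids a copy on every match, but each token holding a reference has to be visited and re-homed whenever the buffer is about to be reloaded.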
Argument names may be omitted if only the text member is to be set, as
long as the token class provides a means of setting it [1].
Footnotes