Brief Commands

A => following the regular expression of a pattern-action pair initiates either a token sending command or a mode transition command. A token sending command starts with a token id. A mode transition command starts with one of the keywords GOTO, GOSUB, or RETURN.
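
As a first orientation, the sketch below shows both kinds of brief commands side by side: token sending commands and mode transition commands built from GOTO, GOSUB, and RETURN. The mode names NORMAL, MATH, and COMMENT, as well as the token-id names, are assumptions made for this illustration only.

token{ IDENTIFIER; NUMBER; }

mode NORMAL {
   // token sending command:
   [a-zA-Z_]+   => QUEX_TKN_IDENTIFIER(Lexeme);
   // mode transition commands:
   "<math>"     => GOTO(MATH);      // plain transition to mode MATH
   "/*"         => GOSUB(COMMENT);  // transition that remembers the caller
}

mode MATH {
   "</math>"    => GOTO(NORMAL);    // plain transition back to mode NORMAL
   [0-9]+       => QUEX_TKN_NUMBER(number=(size_t)atoi((char*)Lexeme), text=LexemeNull);
}

mode COMMENT {
   "*/"         => RETURN;          // return to the calling mode
   .            { }                 // ignore comment content
}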

Token Sending

A token id following the => operator can be specified as a plain number (section sec_number_specification), as a Unicode character, or as a token-id name. The following lines show token sending commands set up with numerical token ids.

// pattern:     token sending command:    token-id (ucs-value):
//
Z            => 27;                    // 0x1B
honey        => 0x1000;                // 0x1000
butter       => 0o456;                 // 0x12E
bread        => 0b1000.0110.1010.0101; // 0x86A5

Token ids can also be specified in terms of their Unicode code point. This can be done by specifying a character in single quotes, such as '+' for code point 43 (0x2b), or with a complete Unicode name following the UC keyword. Some examples are listed below.

// pattern:     token sending command:        token-id (ucs-value):
//
"+"          => '+';                          // 0x2b
"-"          => '-';                          // 0x2d
ε            => 'ε';                          // 0x3b5
∞|infinity   => '∞';                          // 0x221e
𝄞            => UC MUSICAL_SYMBOL_G_CLEF;     // 0x1d11e
\x23         => UC NUMBER_SIGN;               // 0x23

The way that the plus and minus operators are passed comes in handy in the context of parser generators such as Yacc, Bison, Lemon, or ANTLR. Yacc and Bison, for instance, identify a single-character token literal such as '+' in a grammar rule by its character code, so the token id 0x2b sent by the lexer matches it directly.

Finally, token sending commands may be specified with the token-id name. In that case, the full name must be used, i.e. including the token prefix. By default, the token prefix is QUEX_TKN_. In order to avoid warning messages, all token-id names should be specified in a token section (section sec_top_level_token) or in an external token-id definition file. Below are some examples of token sending commands with token-id names.

token{ PLUS; MINUS; EPSILON; INFINITY; }

mode EXAMPLE {
   // pattern:     token sending command:
   //
   "+"          => QUEX_TKN_PLUS;
   "-"          => QUEX_TKN_MINUS;
   ε            => QUEX_TKN_EPSILON;
   ∞|infinity   => QUEX_TKN_INFINITY;
}

When a token-id name is used, further content of the token to be sent can be specified as a comma-separated list in parentheses. For example:

[a-zA-Z_]+  => QUEX_TKN_VARIABLE(Lexeme);

sends a token with the id QUEX_TKN_VARIABLE and the text member set to a copy of the lexeme that matched [a-zA-Z_]+. Members of the token class can also be set explicitly. For example:

[0-9]+      => QUEX_TKN_NUMBER(number=(size_t)atoi((char*)Lexeme), text=LexemeNull);

matches an integer number and assigns its numeric interpretation, computed with atoi(), to the member number of the token object to be sent. Details of token class definition and token construction are discussed in chapter sec_token_classes.

Warning

The token class’ constructor is not called upon token sending! Members of the token class which are not specified are left untouched. This means that a token may contain ‘trash’ from previous usage if it is not overwritten. For example:

[0-9]+ => QUEX_TKN_NUMBER(number=atoi(Lexeme))

sets the token class member .number, but leaves .text unchanged, so that it still contains the content from the token’s last usage. This may have surprising effects. If the same token object was previously used to transport the keyword lexeme print, then, when it is sent upon the match of a sequence of digits, its .text member still contains print.

This potentially confusing behavior of .text carrying unrelated content can be avoided by explicitly passing LexemeNull as an argument:

[0-9]+ => QUEX_TKN_NUMBER(number=atoi(Lexeme), text=LexemeNull)

Token objects are comparable to cargo ships. They are produced (‘constructed’) once. They are loaded, carry their cargo from origin to destination, and are unloaded there. This cycle repeats many times until they reach the end of their lifetime, when they are finally disposed of (‘destructed’). Tokens are constructed upon the start of lexical analysis. They are filled upon the match of a pattern, and their content is consumed by the user of the lexical analyzer (e.g. a parser). Then they are made available again to be filled upon another match. Only when lexical analysis is completely over are they destructed. For the sake of computational efficiency, repeated construction and destruction is avoided.

Note

The default token class copies the lexeme when assigning it to the text member. Alternatively, one might store a direct pointer to the lexeme inside the buffer. In that case, however, it must be guaranteed that changes to the buffer content are handled properly. That is, inside the handler on_buffer_before_change, new storage for existing references to the buffer content must be allocated and the references updated.

The argument name may be omitted if only the text member is to be set, as long as the token class supports setting it [1].
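
For instance, assuming the default token class, the two spellings below of the same command should be equivalent; the first relies on the implicit assignment to the text member, the second names the member explicitly.

[a-zA-Z_]+  => QUEX_TKN_VARIABLE(Lexeme);        // argument name omitted
[a-zA-Z_]+  => QUEX_TKN_VARIABLE(text=Lexeme);   // member named explicitly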

Footnotes