Line and Column Number Counting

Any compiler or lexical analyzer imposes rules on the text that it has to apply. Most of those texts are written by humans, and humans occasionally make errors and disrespect rules. A gentle compiler tells its user about his errors and also tells it about the place where the error occured. Using quex’s ability to count lines and columns facilitates the task of pointing into the user’s code.

The first section elaborates on how to access line and column numbers for a given token. A second section explains how to resolve the problem that the line and column numbers are only precise at the time they match in case of the token policy ‘queue’. Finally, it is explained how the line and column number counting can be configured to fit very specific usages.

Usage

Line and column numbers are activated by means of the command line options

--line-count
--column-count

Both options can be set independently. Inside pattern actions, the following macros may be used to access line and column number counts:

self_line_number_at_begin()
self_line_number_at_end()
self_line_number()
self_column_number_at_begin()
self_column_number_at_end()
self_column_number()

They return the line and column number at the begin and the end of the currently matched lexeme. self_line_number() is equivalent to self_line_number_at_begin() and self_column_number() is equivalent to self_column_number_at_begin(). The C++ API provides the following analyzer member functions:

Queχ tries to determine the line and column count of a pattern beforehand, so that it must not be computed at runtime. This works for homogenous patterns which do not contain arbitrary repetition or optional paths. If the analyzing algorithm signalizes a run-time dependency, then code is produced which counts the line and columns at the time when the lexeme matches.

By default, a tab character, i.e. Unicode character 0x09, causes a grid step of four columns. That is, if there is a line starting with:

0123\t0\t01\t012\tx

will report:

(1)  "0123"
(8)  "0"
(12) "01"
(16) "012"
(20) "x"

where the term in brackets is the column number and the string in quotes is the matched lexeme. For some purposes, it might be necessary to set the line and column number actively. Then the following member functions may be used:

void        line_number_set(size_t Y);
void        column_number_set(size_t X);

Line and column counting can be turned off individually by pre-processor switches.

These switches turn the related counting mechanisms off. It is possible that it runs a little faster[#f3]_. For serious applications, though, at least line number counting should be in place for error reporting.

Warning

The member functions for reporting line and column numbers always report the current state. If the token policy queue (see Token Passing Policies) is used, then a these function only report correct values inside pattern actions!

From ouside, i.e. after a call to .receive(...) the line and column numbers represent the values for the last token in the queue. If precise numbers are required they are better stored inside the token at the time of the pattern match.

Stamping Tokens

Tokens can be stamped at the time that they are sent with the current line and/or column number. Indeed, this is what happens by default. If line or column counting is disabled, then also the stamping of the disabled value is disabled (see sec-line-column-count). The line and column numbers of a token can be accessed via the member functions

of each token object. The stamping happens inside the ‘send()’ functions. More precisely, whenever a token id is set automatically, the token will be stamped automatically with line and column numbers of the beginning of the lexeme. There fore, if specific line or column numbers need to be stamped into a token it makes sense to set them in the counter, before sending the token. Consider the following example:

self_column_number_at_begin_set(MyColumnN);
self_line_number_at_begin_set(MyLineN);

self_send(MY_TOKEN_ID);

As a result of preparing the line and column number inside the counter, the token stamping will refer to these values. Therefore, a token will be sent with the id MY_TOKEN_ID and the reported column and line numbers MyColumnN and MyLineN.

If the stamping procedure cannot provide the desired functionality it may be disabled by defining the macro

If line or column numbering is disabled, also the stamping of the corresponding value is disabled. Further, no member in the tokens is reserved to carry that value.

If a user customized token class is used, it may be necessary to stamp tokens with more information. The macro QUEX_ACTION_TOKEN_STAMP may be defined to specify an action to be exectuded each time when a stamping is required. For example, if more than one token is sent in a single pattern action, e.g.:

"Hello Universe" {
        self_send(TKN_GREETING);
        self_send2(TKN_SPEECH);
}

If it is required to stamp tokens with begin and end line and column numbers, then a stamping action may be defined as follows.

header {
#define QUEX_ACTION_TOKEN_STAMP(TOKEN_P)    \
        TOKEN_P->set_begin_line_number(self.line_number_at_begin());       \
        TOKEN_P->set_begin_column_number(self.column_number_at_begin()-1);
        TOKEN_P->set_end_line_number(self.line_number_at_end());       \
        TOKEN_P->set_end_column_number(self.column_number_at_end()-1);
}

The stamping is defined in a header section, so that it precedes the definition of the default token stamping action.

Customization

By default, the relation between characters and count actions is the following:

Add ‘1’ to column number: [-oo, ‘b’], [‘v’, oo] Add ‘1’ to newline number: ‘n’ Make a step on a ‘4’-er grid: ‘t’

When issues of ‘character fonts’ or Unicode character widths become an issue the default counting behavior may not be sufficient. To specify a different counting behavior the mode option counter may be used. Inside this option, pairs of character sets and their related action can be defined using the following syntax:

character-set '=>' action [ argument ] ';'

Available actions names are space, grid, and newline. The key word \else is a placeholder for the set of remaining characters which are not covered by explicitly. Using this syntax the aforementioned default definition can be specified as shown below.

mode X :
<counter:
   \else  => space 1;
   [\t]   => grid 4;
   [\n]   => newline 1;
>
{
    ...
}

The actions ‘space’, ‘grid’ and ‘newline are now described:

space [number|variable]

This defines what characters are accepted as a ‘space’. A space is something that always increments the column counter by a distinct number. The argument following space can either be a number or a variable name is specified, it will become a member of the lexical analyzer with the type ‘size_t’ as defined in ‘stddef.h’. Then the increment value can be changed at runtime by setting the member variable.

Multiple definitions of space are possible in order to define different space counts. Note, that in Unicode the following code points exist to represent different forms of white space:

  • 0x0020: Normal Space.

  • 0x00A0: Normal space, no line break allowed after it.

  • 0x1680: Ogham (Irish) Space Mark.

  • 0x2002: Space of the width of the letter ‘n’: ‘En Space’.

    This is half the size of an Em Space.

  • 0x2003: Space of the widht of the letter ‘m’: ‘Em Space’.

  • 0x2004: 1/3 of the width of an ‘m’: ‘Three-Per-Em Space’.

  • 0x2005: 1/4 of the width of an ‘m’: ‘Four-Per-Em Space’.

  • 0x2006: 1/6 of the width of an ‘m’: ‘Six-Per-Em Space’.

  • 0x2007: Size of a digit (in fonts with fixed digit size).

  • 0x2008: Punctuation Space that follows a Comma.

  • 0x2009: 1/5 of the width of an ‘m’: ‘Thin Space’.

  • 0x200A: Something thinner than 0x2009: ‘Hair Space’.

  • 0x200B: Zero-Width Space.

  • 0x202F: Narrow No-Break Space, no line break allowed after it.

  • 0x205F: Medium Mathematical Space.

  • 0x2060: Word Joiner (similar to 0x200B)

  • 0x2422: Blank Symbol ().

  • 0x2423: Open Box Symbol ().

  • 0x3000: Ideographic Space, size of a Chinese, Japanese,

    or Korean letter.

Provided that the editor supports it the ‘m’ based spaces could for example be parameterized as:

<indentation:
    [\X2003] => space 60; /* Em Space           */
    [\X2002] => space 30; /* En Space           */
    [\X2004] => space 20; /* Three-Per-Em Space */
    [\X2005] => space 15; /* Four-Per-Em Space  */
    [\X2009] => space 12; /* Thin Space         */
    [\X2006] => space 10; /* Six-Per-Em Space   */
>
grid [number|variable]

Characters associated with a ‘grid’ set the column number according to a grid of a certain width. Tabulators are modelled by grids. For example, if the grid width is four and the current indentation count is 5, then a tab character will set the column count to 8, because 8 is the closest grid value ahead.

As with ‘space’, a run-time modification of the grid value is possible by specifying a variable name instead of a number. For example,

[\t]  => grid  tab character_width;

results in a member variable tab character_width inside the analyzer that can be changed at run-time, e.g.

...
MyLexer   qlex(...);
...
if( file_format == MSVC ) qlex.tab character_width = 8;
else                      qlex.tab character_width = 4;
...
newline [number|variable]

By the parameter name newline a character set is defined that increments the line number. As with the parameter before, the value may be even constant or controlled by a member variable of the analyzer.

Note

By default Quex only considers \n as newline character that increments the line number by one. In Unicode, though, there are several code points that may are related to newline, as they are:

  • 0x0A: Line Feed.
  • 0x0B: Vertical Tab.
  • 0x0C: Form Feed.
  • 0x0D: (Carriage Return)
  • 0x85: Next Line.
  • 0x2028: Line Separator.
  • 0x2029: Paragraph Separator.

The 0x0D character does actually not increment the line number, it rather resets the input to the beginning of the line. The according counting command would be:

\n => newline 0;

which means that the line number does not increase, but the column number is reset to 1.

Each of the aforementioned parameters may occur, of course, multiple times. Their character sets, though, may not intersect.

Note

It makes sense to use a mode with a counter definition to multiple derived classes as a means to share with them the same counting behavior. However, Quex only allows for one counter definition in a mode hierarchy. The rationale behind this was to prevent the information about the counting behavior being scattered around different modes. Such configurations are prone to be confusing.

Footnotes

[1]Even the indentation count algorithm is adapted to profit from knowledge about the patterns internal structure.
[2]There are exceptions cases, for which a slightly better counting mechanism might be found. Example: A pattern that contains a newline which is followed by a fixed number of characters. The determination of this in the context of post-conditions is complicated. On the other hand, such patterns are considered strange and occur rarely. Thus, the expected gain with an optimized algorithm was considered negligible by the author. No optimal handling for this case has been developed.
[3]The author of this text has experienced several cases where analyzers with the line and column counting active performed faster then without it. This might be caused by the different caching strategies of modern CPUs. Before deleting the line and column counting a benchmark always helps to get an impression if it’s really worth it.