Indentation Counting / Off-Side Rule¶

With the rise of structured programming [], [], the concept of nested statement blocks emerged as a means to express control flow. The most explicit and direct way to mark such blocks is by means of opening and closing delimiters. Many programming languages use the curly braces { and } to open and close if-blocks, while-loops, and the like.

For the sake of readability, nested blocks are often indented. Comparing the two following code fragments

while(1+1==2){if(check(This,That)){something(Else);break;}make();}

and

while( 1 + 1 == 2 ) {
    if( check(This, That) ) {
        something(Else);
        break;
    }
    make();
}

it becomes obvious that the latter is clearer to the human reader than the former, with less visual noise. The efficiency of human reflection on content relates directly to the clarity of its expression. Independent of its application, the design of a language must strive for, or at least support, clarity of expression if the goal is to provide a useful tool for communication. Since indentation is such a crucial means of clarity in languages with nested syntactic blocks, it comes as no surprise that indenting nested blocks has become widely accepted common practice.

Indentation: Indentation is a measure of the extent of whitespace between the beginning of a line and its first non-whitespace character.

Consequently, lexical markers for blocks in the input stream become redundant once block boundaries can be detected automatically from indentation. This realization has influenced the design of a variety of programming languages such as Python [] and Haskell [] (in some cases when braces are omitted), Occam [], CoffeeScript [], F# [], Inform 6/7 [], reStructuredText [], and YAML []. In a language designed with this approach, the aforementioned code might be rewritten as follows.

while 1 + 1 == 2:
   if check(This, That):
       something(Else)
       break
   make()

The removal of the redundant lexical markers { and }, in favor of invisible white space, results in less visual noise. To achieve this, the lexer must automatically produce block delimiters from indentation. The specification of an <indentation:> tag in a mode activates the so-called off-side rule behavior [].

mode EXAMPLE :
     <indentation:>   // Activate indentation-based scope handling.
{
    ...
}

Indentation-based scope detection relies on comparing a line’s indentation with the indentation of its predecessor. If the indentation has become greater, the INDENT token is sent. If the indentation has become lower, the DEDENT token is sent. If the indentation has remained the same, the NODENT token is sent. For fine-grained control, the event handlers on_indent, on_dedent, and on_nodent may be customized. Table 14 shows the token-ids related to indentation counting, their corresponding event handlers, and their meaning.

Table 14 Token identifiers and event handlers related to indentation-based scoping.¶
Token-Id	Meaning	Related event handler
`INDENT`	nested block opens	`on_indent`
`DEDENT`	nested block closes	`on_dedent`
`NODENT`	indentation same as previous line	`on_nodent`
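The comparison rule summarized in the table can be sketched in a few lines of Python. This is only an illustration of the rule, not Quex's generated code; the function name `indentation_event` is hypothetical, and the sketch ignores nesting depth and misfits.

```python
def indentation_event(previous, current):
    """Name the token for a line with indentation `current` that follows
    a relevant line with indentation `previous` (hypothetical helper)."""
    if current > previous:
        return "INDENT"   # nested block opens
    if current < previous:
        return "DEDENT"   # nested block closes
    return "NODENT"       # indentation same as previous line

# Indentation pairs (previous, current) for three consecutive line changes.
events = [indentation_event(p, c) for p, c in [(0, 4), (4, 4), (4, 0)]]
```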

Inside the <indentation: ...> tag the character sets related to concepts of indentation-based scoping can be specified following the scheme below.

<regex> ‘=>’ keyword ‘;’

where keyword can be one of newline, whitespace, suppressor, suspend, or bad, as shown in Table 15. The <regex> may be any regular expression that results in a DFA without pre- or post-context.

Table 15 Parameters of the `<indentation: ...>` tag.¶
Parameter	Meaning
`newline`	Pattern after which whitespace counts as indentation.
`whitespace`	Definition of ‘whitespace’ at the beginning of a line.
`suspend`	If whitespace is followed by `suspend`, no counting.
`suppressor`	If `newline` is preceded by `suppressor`, no counting.
`bad`	If line begins with `bad`, notify ‘bad indentation’.

newline defines what marks the beginning of a new line, i.e. what triggers indentation counting. whitespace defines the patterns that may appear at the beginning of a line and are considered indentation space. suppressor defines patterns that, if they directly precede newline, prevent it from triggering indentation counting. suspend defines patterns that prevent indentation counting from reporting if the first non-whitespace matches them. The bad keyword defines patterns that are considered ‘bad indentation’ when they occur at the beginning of a line. This is particularly useful to implement one’s preferred philosophy of tabs vs. spaces (see section sec_tabs_vs_spaces).

The next code fragment shows an example mode setup for indentation-based scoping.

mode EXAMPLE :
<indentation:
    \r?\n                 => newline;    // what triggers indentation counting
    "\\\\"*"\\"[ \t]*     => suppressor; // backslash + whitespace
    [ ]*|[\t]*            => whitespace; // only space or only tab
    ([ ]+"\t")|([\t]+" ") => bad;        // space and tab mixed
    "#"                   => suspend;    // bash-like comment until newline
    "//"                  => suspend;    // C++-like comment until newline
>
<skip_range: "#" "\n">
<skip_range: "//" "\n">
{
    ...
}

Here, whitespace is defined so that indentation consists either of spaces only or of tabulators only, but not of a mix. Whenever a line begins with a mix of spaces and tabs, it is considered bad. A newline can consist either of a sole line feed (\n, Unicode 0x0A) or of a carriage return followed by a line feed (\r\n, Unicode 0x0D, 0x0A). suppressor is defined so that an odd number of backslashes, followed by any number of space or tab characters (possibly none), prevents newline from triggering indentation counting.

The suspend keyword defines circumstances where indentation shall not trigger indentation events. In the example above, indentation events are blocked if a line’s first non-whitespace starts with # (bash/Python-like comments) or // (C++ comments). This expresses the idea that a line containing only whitespace and comments is irrelevant for code, and therefore irrelevant for indentation. When suspend triggers, the input stream pointer is set to the beginning of what triggered suspend. For example, if // appeared at the beginning of a line, the next analysis step starts right before //. The mode EXAMPLE is prepared for this event with the <skip_range: ...> tag, which lets it ignore anything until newline.

Warning

No lexeme matched by whitespace may contain a sub-lexeme that matches newline, suppressor, or suspend. Otherwise, whitespace would ‘eat’ their occurrence and produce unexpected results. This requires some care. For example,

<indentation: :[^:]+: => whitespace; > // NOT GOOD!

seems to define indentation whitespace as everything from a : at the beginning of the line until the next occurrence of :. However, the expression [^:] matches anything but :, which includes the newline character. In this setup, the newline character must be explicitly exempted.

<indentation: :[^:\n]+: => whitespace; > // OK: newline exempted
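The difference between the two whitespace definitions above can be checked with Python's `re` module, used here only as a stand-in for the Quex regular expression engine. The sample text is made up for illustration.

```python
import re

# Stand-ins for the two 'whitespace' definitions discussed above.
greedy = re.compile(r":[^:]+:")      # [^:] also matches the newline
safe = re.compile(r":[^:\n]+:")      # newline explicitly exempted

# Hypothetical input: the second ':' only appears on the following line.
text = ": indent\nnext: line"

greedy_match = greedy.match(text)    # spans the newline, 'eating' it
safe_match = safe.match(text)        # stops at the newline: no match

# On a well-formed line, the safe pattern still finds the marker.
marker = safe.match(":--:code").group()
```

With the first pattern the newline ends up inside the whitespace lexeme, so it can no longer trigger indentation counting.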

Aspects of Indentation Handling¶

The following subsections discuss the mechanics of indentation handling by means of code fragments in a pseudo-programming language. Let ⏎ denote newline, → denote tabulator, and ␣ denote space. Let column counting be set up so that spaces count as 1 column and tabulators are aligned on a grid of 4 columns width (see sec_linen_and_column_counting).

Basic Functioning¶

The triggering of basic indentation handling events shall be explained by means of the code fragment below.

if True:⏎
␣␣␣␣print "hello world"⏎
␣␣␣␣␣⏎
→   print "bonjour, le monde"⏎
exit

When the lexer detects ⏎ after the colon in the first line, it starts indentation counting. It walks along the four spaces ␣ of the second line until it reaches p. It then resets the input pointer to the position right before p and notifies the indentation handler that an indentation of 4 has occurred. The indentation handler recognizes an indented block and sends the INDENT token. After greeting the world, the next newline ⏎ lets the lexer re-enter indentation counting. However, since the third line contains nothing but spaces and a newline, no indentation is detected. Only in line 4 does indentation counting walk along the tabulator → until it reaches the second print command. The indentation handler is notified about an indentation of 4. It recognizes no difference from the last relevant line and sends NODENT. The last line begins directly with non-whitespace. The indentation handler is notified, recognizes lesser indentation, and sends DEDENT.
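The column counting in this walk-through can be sketched as follows, assuming spaces count as one column and tabulators jump to the next multiple of the grid width. The helper `indentation_width` is hypothetical and only mimics the counting, not the token sending.

```python
def indentation_width(line, tab_grid=4):
    """Column of the first non-whitespace character, or None if the line
    holds nothing but whitespace (hypothetical helper)."""
    column = 0
    for char in line:
        if char == " ":
            column += 1
        elif char == "\t":
            # Jump to the next grid position.
            column = (column // tab_grid + 1) * tab_grid
        else:
            return column
    return None  # whitespace-only line: no indentation is reported

# The fragment from above, with the newline markers removed.
fragment = [
    'if True:',
    '    print "hello world"',
    '     ',
    '\tprint "bonjour, le monde"',
    'exit',
]
widths = [indentation_width(line) for line in fragment]
```

Lines 2 and 4 both yield an indentation of 4, which is why the second print command results in NODENT.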

Misfitting Indentation¶

Not every indentation is valid. A misfitting indentation triggers the on_indentation_misfit event handler. The following fragment shows such a case.

if True:⏎
␣␣␣while True:⏎
␣␣␣␣␣␣do_something()⏎
␣␣␣␣do_more()⏎

Misfitting indentation occurs, if the indentation is less than the indentation of the previous line, but does not fit any indentation of an opening block. This happens in line 4 of the example above (the one with do_more()). Its indentation of 4 spaces does not fit any open indentation level, not 0 of the first line, not 3 of the second line, and not 6 of the third line. In such a case, the on_indentation_misfit event handler is triggered.
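A minimal sketch of the misfit check, assuming the open indentation levels are kept in an ascending list. The function name `closed_block_count` and the list representation are assumptions made for illustration; they are not Quex's internal data structures.

```python
def closed_block_count(open_levels, indentation):
    """For an indentation that is not greater than the current level,
    return how many open blocks it closes, or None on a misfit.

    open_levels: ascending list of open indentation levels (assumption).
    """
    remaining = [level for level in open_levels if level <= indentation]
    if not remaining or remaining[-1] != indentation:
        return None  # fits no open indentation level: misfit
    return len(open_levels) - len(remaining)

open_levels = [0, 3, 6]  # levels opened by lines 1 to 3 of the example
misfit = closed_block_count(open_levels, 4)       # do_more() at column 4
fits_while = closed_block_count(open_levels, 3)   # would close one block
fits_top = closed_block_count(open_levels, 0)     # would close two blocks
```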

Suppressed Newline¶

For languages that require a logical continuation of a line which is terminated by a newline character, a newline suppressor pattern can be specified. When a pattern matching suppressor directly precedes the newline pattern, indentation counting is skipped for the next line. This is shown in the example below.

print "hello "␣␣\⏎
␣␣␣␣␣␣"world"⏎
if True:␣␣␣␣\\⏎
do_more()⏎
exit

The suppressor in mode EXAMPLE triggers upon an odd number of backslashes plus an arbitrary number (possibly none) of whitespace characters. In this example, line 1 ends with an odd number of backslashes before the newline. Consequently, the initial whitespace in line 2 is not considered indentation. Line 3, however, ends with an even number of backslashes, which does not disable newline: an even number of backslashes before newline does not deactivate indentation counting. In line 5 the INDENT token is sent upon the first non-whitespace character. Since the newline there is preceded by a backslash, line 6 is not subject to indentation counting. Only the last line, which starts with non-whitespace, triggers the sending of a DEDENT token.
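The odd-number rule can be illustrated with a small Python check. It counts trailing backslashes directly instead of running the suppressor pattern, so it is a sketch of the rule and not of how Quex matches it; the function name is hypothetical.

```python
def newline_is_suppressed(line):
    """True if the line (without its newline) ends with an odd number of
    backslashes, optionally followed by spaces or tabs."""
    stripped = line.rstrip(" \t")
    backslashes = len(stripped) - len(stripped.rstrip("\\"))
    return backslashes % 2 == 1

line_1 = 'print "hello "  \\'   # one backslash: newline suppressed
line_3 = 'if True:    \\\\'     # two backslashes: newline stays active
suppressed = [newline_is_suppressed(line_1), newline_is_suppressed(line_3)]
```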

Suppressed newline after indentation does not add up indentation. The following pseudo code does trigger the on_indent event handler before do_something().

repeat:⏎
␣␣␣␣\⏎
do_something()⏎

A suppressed newline also does not trigger on_dedent upon lower indentation.

while:⏎
␣␣␣␣if True:⏎
\⏎
␣␣␣␣else⏎

Here, no indentation event triggers between if True: and else. Nor does a suppressed newline trigger a misfitting indentation event in a case such as the one below.

while:⏎
␣␣␣␣if True:⏎
␣␣\⏎
␣␣␣␣else⏎

Indentation with suppressed newline does not add up in the next line.

while:⏎
␣␣␣␣if True:⏎
␣␣␣\⏎
␣␣␣\⏎
␣␣␣␣if False:⏎

If indentation were to add up in lines ending with \, then, in the example above, if False: would have an indentation of 10 (3+3+4). However, no indentation event occurs between the two if lines. The indentation of if False: is simply 4, as it is for if True:.

Suspended Indentation¶

The effect of a suspend definition is explained alongside the following example.

if is_ok():⏎
␣␣# first comment⏎
␣␣␣␣do_something()⏎
␣␣# second comment⏎
␣␣␣␣do_more()⏎
exit

Line 2 starts, after 2 whitespace characters, with #, which signals a comment to be skipped until newline. That line does not trigger an indentation event. Line 4 is also indented with only 2 whitespace characters. The fact that it has less whitespace at the beginning than its predecessor, however, is irrelevant. Since it starts with what matches suspend, the line is exempted from indentation handling.
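The effect can be sketched by filtering out the lines that take part in indentation handling. The helper `relevant_lines` is hypothetical and, for brevity, measures indentation as the number of leading characters, which assumes space-only indentation.

```python
def relevant_lines(lines, suspend=("#", "//")):
    """Return (line number, indentation) for every line that takes part
    in indentation handling (hypothetical helper)."""
    result = []
    for number, line in enumerate(lines, start=1):
        content = line.lstrip(" \t")
        # Whitespace-only lines and lines starting with a suspend
        # pattern do not cause indentation events.
        if not content or content.startswith(suspend):
            continue
        result.append((number, len(line) - len(content)))
    return result

fragment = [
    "if is_ok():",
    "  # first comment",
    "    do_something()",
    "  # second comment",
    "    do_more()",
    "exit",
]
relevant = relevant_lines(fragment)
```

Only lines 1, 3, 5, and 6 remain, so the two comment lines at indentation 2 never reach the indentation handler.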

Bad Indentation Characters¶

Mode EXAMPLE defined the pattern of bad indentation as a mixture of tabs and spaces. In the code fragment below, lines 2 and 3 start with a mixture of spaces and tabs. As soon as the indentation is detected, the on_indentation_bad handler is called.

if True:⏎
␣␣→ do_something()⏎
→   ␣␣do_more()⏎
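The bad pattern of mode EXAMPLE can be approximated with Python's `re` module. The translation of the Quex pattern into Python syntax is an assumption made for illustration; `re.match` anchors the pattern at the beginning of the line.

```python
import re

# Spaces followed by a tab, or tabs followed by a space, at line start.
BAD = re.compile(r" +\t|\t+ ")

def has_bad_indentation(line):
    """True if the line begins with a mix of spaces and tabs."""
    return BAD.match(line) is not None

line_2 = "  \tdo_something()"   # two spaces, then a tab
line_3 = "\t  do_more()"        # a tab, then two spaces
flags = [has_bad_indentation(line_2), has_bad_indentation(line_3)]
```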

Token Sending¶

Previous paragraphs indicated how the indentation counting process can be customized. What remains is to discuss the customization of automated token sending. In particular, the NODENT token might be redundant in many cases. One way to deal with it is to assimilate its token-id to something carrying the meaning of ‘statement delimiter’.

token {
    ...
    NODENT = STATEMENT_END;
    ...
}

Thus, a NODENT arrives at the parser as a statement delimiter. Another way to adapt the token sending process is to write event handlers related to indentation counting. The list of available event handlers has already been given in Table 14. An earlier section discussed the event handlers related to indentation handling.

on_dedent: (CloseN, Indentation, ColumnN, LineN)

Whenever a lower indentation closes one or more nested indentation blocks, this event handler is called. The default behavior is to send CloseN tokens of type DEDENT. If the DEDENT token-id is set up for repetition, a single repeated token is sent with CloseN as repetition count. The default behavior is therefore either

   on_dedent {
       for(int i=0; i < CloseN ; ++i) {
           self.send(QUEX_TKN_DEDENT);
       }
   }

or, if DEDENT has been set up to carry repetition counts,

   on_dedent {
       self.send_n(QUEX_TKN_DEDENT, CloseN);
   }

It has not yet been discussed that a single lower indentation may close more than one indentation block. This is demonstrated here.

if True:⏎
␣␣␣␣while True:⏎
␣␣␣␣␣␣␣␣if True:⏎
␣␣␣␣␣␣␣␣␣␣␣␣do_something()⏎
exit

The exit in the last line closes three indentation blocks: the one at column 4 (opened in line 2), the one at column 8 (line 3), and the one at column 12 (line 4). Consequently, three DEDENT tokens need to be sent. Instead of stacking 3 tokens into the token queue, it may make sense to allow DEDENT to carry the repetition count:

repeated_token { DEDENT; }

The default on_dedent handler is then automatically set up accordingly.
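The difference between the two ways of sending can be pictured with a toy token queue in Python. Tokens are modeled as hypothetical (name, repetition count) pairs; this says nothing about how Quex actually stores tokens.

```python
def dedent_tokens(close_n, repeated=False):
    """Tokens placed in the queue when `close_n` blocks close at once.
    Each token is a (name, repetition count) pair (an assumption)."""
    if repeated:
        return [("DEDENT", close_n)]        # one token carrying the count
    return [("DEDENT", 1)] * close_n        # one token per closed block

separate = dedent_tokens(3)                 # three queue entries
compact = dedent_tokens(3, repeated=True)   # a single queue entry
```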

on_nodent: (Indentation, ColumnN, LineN): This handler is called in the event that the current line neither closes nor opens an indentation block, and there is neither a newline nor a suppressed newline.

on_indentation_misfit

raises: E_Error_OnIndentationMisfit

Whenever an indentation occurs, which is lower than the currently open indentation block but which does not coincide with any nesting indentation block, then this handler is called.

Indentation: The indentation that has occurred.

IndentationStackSize: The number of currently open indentation blocks.

IndentationStack(i): Delivers the indentation number ‘i’ from the current indentation blocks.

IndentationUpper: Delivers the smallest indentation level that is greater than the current.

IndentationLower: Delivers the greatest indentation level that is smaller than the current.

ClosedN: Number of closed indentation levels.

on_indentation_bad

raises: E_Error_OnIndentationBad

Whenever the DFA that identifies a bad indentation line matches, this handler is called.

In some cases, it might be desirable not to stop lexical analysis upon bad indentation characters. Then, however, it must be ensured that something else catches them. If bad matches a mixture of spaces and tabs, then there must be either a skipper or a pattern inside the mode that digests it. A simple <skip: [ \t\n]>, as shown below, does the job. The detection of a bad indentation character should still have consequences; the treatment must be implemented in the on_indentation_bad event handler. A setup with bad-character skipping and error tolerance can be seen below.

mode EXAMPLE :
<indentation: (" "*"\t")|("\t"*" ") => bad;>
<skip: [ \t\n]>
{
    ...
    on_indentation_bad {
        self.error_code_clear_this(E_Error_OnIndentationBad);
        std::cout << self.line_number() << ":";
        std::cout << "Warning: Bad indentation character!\n";
    }
}

As a brief overview, the code fragment below sets up the event handlers to mimic the default behavior of automatic indentation-based token sending.

mode EXAMPLE : <indentation: ...>
{
    on_indent { self_send(QUEX_TKN_INDENT); }
    on_dedent { for(int i=0; i<ClosedN ; ++i) self.send(QUEX_TKN_DEDENT); }
    on_nodent { self_send(QUEX_TKN_NODENT); }
    on_indentation_misfit {
        self.error_code_set_if_first(E_Error_OnIndentationMisfit);
        self.send(QUEX_TKN_TERMINATION);
    }
    on_indentation_bad {
        self.error_code_set_if_first(E_Error_OnIndentationBad);
        self.send(QUEX_TKN_TERMINATION);
    }
    ...
}

Interference with Pattern Matching¶

Indentation counting is triggered by a marker ‘newline’, however it is defined. For it to trigger, it must not be devoured by something else. Let a mode contain a pattern like this:

"\n"+  => QUEX_TKN_VULTURE();

Here, indentation counting can never trigger, since the VULTURE pattern eats the newline away. A newline at the end of a pattern that is not put back also prevents indentation counting. This happens, for example, with a pattern designed to skip over a ‘comment until newline’:

"#"[^\n]*\n  => QUEX_TKN_COMMENT(Lexeme);   // dysfunctional

Such a situation can be remedied by requesting the newline as a post-context instead of making it part of the core pattern:

"#"[^\n]*/\n  => QUEX_TKN_COMMENT(Lexeme);  // functional

Since newline-devouring patterns are an issue for indentation handling, Quex warns about any interference. The skippers are designed to cooperate with indentation counting automatically, if possible. Otherwise, a warning is issued.
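The difference between consuming the newline and merely requiring it can be observed with Python's `re` module, where a lookahead `(?=\n)` plays a role comparable to the post-context. This is an analogy in a different regex engine, not Quex syntax, and the sample text is made up.

```python
import re

text = "# comment\n    next_line()"

# Newline inside the core pattern: it is consumed with the comment.
devouring = re.match(r"#[^\n]*\n", text)
# Newline only required to follow: it stays in the input.
post_context = re.match(r"#[^\n]*(?=\n)", text)

rest_after_devouring = text[devouring.end():]
rest_after_post_context = text[post_context.end():]
```

Only in the second case does the remaining input still begin with the newline that is needed to trigger indentation counting.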

Tabs vs. Spaces¶

A survey of the discussions on the web on tabs vs. spaces suggests that the debate on this issue might not settle any time soon. The ‘tolerating consistency’ approach, as discussed earlier in this section, might pacify peaceable zealots. Nevertheless, the key points of discussion are listed below as a guideline for language design decisions.

Pro Spaces

A space is always one column in any editor, in a file, or on any hardware. There is no definite width of a tabulator.
Spaces are visible as what they are. Tabs are only visible when being interpreted as some number of spaces.

Pro Tabulators

Tabs provide an encoding-inherent compression mechanism: Spaces require N characters for an indentation, tabulators only 1.
Misfitting indentation is impossible.
A viewer may adapt the indentation width to the user’s preferences simply by specifying a ‘tabulator display width parameter’.
On viewers with non-mono-spaced fonts, tabs are the only means to achieve vertically aligned columns.
Traditionally, tabs jump to predefined columns. Since this behavior is similar to indentation, it may cause the least amount of astonishment.

Once a development team has made a decision, editing tools should be configured to support the translation of tabulators into spaces or spaces into tabulators. Then, the difference only exists underneath the hood and the urge for discussion may dissolve. After all this pacifying talk, the following script ‘subdue.sh’ is intended to cater to the needs of the not so peaceable fanatic, who wants to focus on actions and not on debates.

Listing 5 subdue.sh¶

#! /usr/bin/env bash
# PURPOSE: Convert a text file towards the *only true* indentation type.
#          NO WARRANTY. USE AT OWN RISK.
#     $1 target file
#     $2 needed indentation type: 'tabs' or 'spaces'
#     $3 number of spaces per tab.
# EXAMPLE:
#     for file_name in $(find ./ -name "*.c"); do
#          bash ./subdue.sh $file_name spaces 4
#     done
# (C) Frank-Rene Schaefer; License: MIT.
#______________________________________________________________________________
IFS=; read -r -d '' program <<- \
_______________________________________________________________________________
BEGIN {
    while( spaces_per_tab-- ) { spaces = spaces " "; }
    if     ( need=="spaces" ) { bad="\t";   good=spaces; }
    else if( need=="tabs" )   { bad=spaces; good="\t"; }
    else                      { print "tabs or spaces?"; exit(-1); }
}
match(\$0, /^[ \t]*[^ \t]/) {                   # Only replace at begin of line
    tmp=substr(\$0, RSTART, RLENGTH-1);
    gsub(bad, good, tmp);
    \$0=tmp substr(\$0, RLENGTH);
}
{ print; }
_______________________________________________________________________________
tmp_file=$(mktemp)
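# Quote all expansions; replace the target only if awk succeeded, so that
# an invalid indentation type does not overwrite the file.
awk -v need="$2" -v spaces_per_tab="$3" "$program" "$1" > "$tmp_file" \
    && mv "$tmp_file" "$1"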