Indentation Counting / Off-Side Rule¶
With the rise of structured programming, the concept of nested statement blocks
emerged as a means to express control flow. The most explicit, most
direct, way to design the interpretation is by means of opening
and closing delimiters. Many programming languages use the curly braces
{
and }
for opening and closing of if-blocks, while-loops, and the like.
For the sake of readability, nested blocks are often indented. Comparing the two following code fragments
while(1+1==2){if(check(This,That)){something(Else);break;}make();}
and
while( 1 + 1 == 2 ) {
    if( check(This, That) ) {
        something(Else);
        break;
    }
    make();
}
it becomes obvious that the latter, with less visual noise, is clearer to the human reader than the former. The efficiency of human reflection on content relates directly to the clarity of its expression. Independent of its application, the design of a language must strive for, or at least support, clarity of expression if the goal is to provide a useful tool for communication. Since indentation is such a crucial means for clarity in languages with nested syntactic blocks, it does not come as a surprise that indentation of nested blocks has become widely accepted common practice.
- Indentation:
Indentation is a measure of the extent of whitespace between the beginning of a line and its first non-whitespace character.
Consequently, lexical markers for blocks in the input stream become redundant once indentation block boundaries can be automatically detected. This realization has influenced the design of a variety of programming languages such as Python and Haskell (in some cases when braces are omitted), Occam, CoffeeScript, F#, Inform 6/7, reStructuredText, and YAML. In a language designed with this approach, the aforementioned code might be rewritten as follows.
while 1 + 1 == 2:
    if check(This, That):
        something(Else)
        break
    make()
The removal of the redundant lexical markers {
and }
, in favor of
invisible white space, results in less visual noise. To achieve this,
the lexer must automatically produce block delimiters from indentation. The
specification of an <indentation:>
tag in a mode activates the so-called
off-side rule behavior.
mode EXAMPLE :
<indentation:> // Activate indentation-based scope handling.
{
...
}
Indentation-based scope detection relies on the comparison of a line’s
indentation with the indentation of its predecessor. If the indentation has
become greater, the INDENT
token is sent. If the indentation has become
lower, the DEDENT
token is sent. If the indentation has remained the same, the
token NODENT
is sent. For fine-grained control, the event handlers
on_indent
, on_dedent
, and on_nodent
may be customized. Table
Token identifiers and event handlers related to indentation-based scoping. shows the token-ids related to indentation counting,
their corresponding event handlers and their meaning.
Token-Id | Meaning | Related event handler |
---|---|---|
INDENT | nested block opens | on_indent |
DEDENT | nested block closes | on_dedent |
NODENT | indentation same as previous line | on_nodent |
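The comparison of a line's indentation with that of its predecessor can be sketched in a few lines of Python. This is only an illustration of the principle, not the code that Quex generates: the function name indentation_tokens and the list-based stack are assumptions made for this sketch, and misfitting indentation is not handled.

```python
def indentation_tokens(indentations):
    """Map the indentations of consecutive lines to token names.

    `indentations` holds the indentation of each line that follows a first
    line starting at column 0 (hypothetical input format for this sketch).
    """
    stack = [0]                        # open indentation levels
    tokens = []
    for indentation in indentations:
        if indentation > stack[-1]:
            stack.append(indentation)  # nested block opens
            tokens.append("INDENT")
        elif indentation == stack[-1]:
            tokens.append("NODENT")    # same indentation as before
        else:
            while indentation < stack[-1]:
                stack.pop()            # nested block closes
                tokens.append("DEDENT")
    return tokens


if __name__ == "__main__":
    print(indentation_tokens([4, 8, 8, 4]))
```

For lines indented by 4, 8, 8 and 4 columns the sketch yields INDENT, INDENT, NODENT, DEDENT.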
Inside the <indentation: ...>
tag the character sets related to concepts of
indentation-based scoping can be specified following the scheme below.
<regex> ‘=>’ keyword ‘;’
where keyword
can be one of: newline
,
whitespace
, suppressor
, suspend
, or bad
as shown in table
Parameters of the <indentation: ...> tag.. The <regex>
may be any regular
expression that results in a DFA without pre- or post-context.
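Parameter | Meaning |
---|---|
newline | Pattern after which whitespace counts as indentation. |
whitespace | Definition of ‘whitespace’ at the beginning of a line. |
suspend | If whitespace is followed by suspend, no count. |
suppressor | If suppressor precedes newline, no count. |
bad | If line begins with bad, the indentation is considered bad. |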
newline
defines what marks the beginning of a new line, i.e. what
triggers indentation counting. whitespace
defines the patterns that may
appear at the beginning of a line to be considered as indentation space.
suppressor
defines patterns that, when directly preceding
newline
, prevent it from triggering indentation counting. suspend
defines
patterns that prevent indentation counting from reporting if the first
non-whitespace matches them. The bad
keyword defines patterns that, when
they occur at the beginning of a line, are considered ‘bad indentation’. This is
particularly useful to implement one’s preferred philosophy of tabs vs.
spaces (see section sec_tabs_vs_spaces).
The next code fragment shows an example mode setup for indentation-based scoping.
mode EXAMPLE :
<indentation:
\r?\n => newline; // what triggers indentation counting
"\\\\"*"\\"[ \t]* => suppressor; // backslash + whitespace
[ ]*|[\t]* => whitespace; // only space or only tab
([ ]+"\t")|([\t]+" ") => bad; // space and tab mixed
"#" => suspend; // bash-like comment until newline
"//" => suspend; // C++-like comment until newline
>
<skip_range: "#" "\n">
<skip_range: "//" "\n">
{
...
}
Here, whitespace
is defined so that indentation can be either spaces or
tabulators exclusively, but not mixed. Whenever a line begins with a mix of
spaces and tabs, it is considered bad
. A newline
can either consist of
a sole line feed (\n
, Unicode 0x0A), or carriage return followed by
line feed (\r\n
, Unicode 0x0D, 0x0A). suppressor
is defined so that
an odd number of backslashes, followed by any number of whitespace characters (possibly none), prevents
newline
from triggering indentation counting.
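The effect of the whitespace and bad definitions can be tried out with Python's re module. The sketch below is a rough analogue under stated assumptions: the function classify_line_start is hypothetical, and the alternatives are reordered because Python's alternation takes the first matching branch, whereas a DFA-based lexer takes the longest match.

```python
import re

# Python's "|" picks the first alternative that matches; the non-empty
# branch therefore comes first so that tab-indented lines are measured.
WHITESPACE = re.compile(r"[ ]+|\t*")     # only spaces or only tabs
BAD = re.compile(r"[ ]+\t|\t+[ ]")       # spaces and tabs mixed


def classify_line_start(line):
    """Return 'bad' for mixed indentation, else the count of indentation characters."""
    if BAD.match(line):
        return "bad"
    return WHITESPACE.match(line).end()


if __name__ == "__main__":
    for sample in ["    x", "\t\tx", "x", "  \tx", "\t x"]:
        print(repr(sample), classify_line_start(sample))
```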
The suspend
keyword defines circumstances where indentation shall not
trigger indentation events. In the example above, indentation events are
blocked, if a line’s first non-whitespace starts with #
(bash/Python-like
comments) or //
(C++ comments). This expresses the idea, that a line
containing only whitespace and comments is irrelevant for code, and therefore
irrelevant for indentation. When suspend triggers, the input stream pointer
is set to the beginning of what triggered suspend
.
For example, if //
appeared at the beginning of a line, the next analysis
step starts right before //
. The mode EXAMPLE
is prepared for this event with
the <skip_range: ...>
tag, which lets it ignore anything until newline.
Warning
No lexeme matched by whitespace
may contain a sub-lexeme of the
lexemes matching newline
, suppressor
or suspend
. Otherwise, it
would ‘eat’ their occurrence and produce unexpected results. This requires
some care to be taken. For example,
<indentation: :[^:]+: => whitespace; > // NOT GOOD!
seems to define indentation whitespace as :
at the beginning of the line
until the next occurrence of :
. However, the expression [^:]
matches
anything but :
which includes the newline character. In this setup, the
newline character must be explicitly exempted.
<indentation: :[^:\n]+: => whitespace; > // GOOD!
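The pitfall can be reproduced with ordinary regular expressions. The following Python sketch uses the re module as a stand-in for the lexer's pattern matching; the helper name matches_across_newline and the sample text are assumptions made for illustration.

```python
import re

GREEDY = re.compile(r":[^:]+:")       # [^:] also matches the newline character
EXEMPTED = re.compile(r":[^:\n]+:")   # newline explicitly exempted


def matches_across_newline(pattern, text):
    """True if the pattern matches at the start of text and swallows a newline."""
    match = pattern.match(text)
    return match is not None and "\n" in match.group(0)


if __name__ == "__main__":
    text = ": a\n  b:"
    print(matches_across_newline(GREEDY, text))
    print(matches_across_newline(EXEMPTED, text))
```

With the first pattern the newline disappears inside the whitespace lexeme, so it can no longer trigger indentation counting; the second pattern refuses to match across the line boundary.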
Aspects of Indentation Handling¶
The following subsections discuss mechanics of indentation handling. For that,
code fragments in a pseudo-programming language are discussed. Let ⏎
denote newline, →
denote tabulator, and ␣
denote space. Let column
counting be set up, so that spaces count as 1 column and tabulators are aligned
on a grid of 4 columns width (see sec_linen_and_column_counting).
Basic Functioning¶
The triggering of basic indentation handling events shall be explained by means of the code fragment below.
1if True:⏎
2␣␣␣␣print "hello world"⏎
3␣␣␣␣␣⏎
4→ print "bonjour, le monde"⏎
5exit
When the lexer detects ⏎
after the colon in the first line, it starts
indentation counting. It walks along the four spaces ␣
of the second line
until it reaches p
. It then resets the input pointer to the position right
before it and notifies the indentation handler that an indentation of 4 has
occurred. The indentation handler recognizes an indented block and sends the
INDENT
token. After greeting the world, the next newline ⏎
lets the
lexer re-enter indentation counting. However, since the line contains nothing
but spaces and newline, no indentation is detected. In line 4, indentation
counting walks along the tabulator →
until it reaches the second print
command. The indentation handler is notified about an indentation of 4. It
recognizes no difference to the last relevant line and sends NODENT
. The last line
directly begins with non-whitespace. The indentation handler is notified, it
recognizes lesser indentation and sends DEDENT
.
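The column arithmetic behind this walk-through can be sketched as follows. The Python function indentation_column is a hypothetical helper, not part of Quex; it assumes the setup stated earlier, namely that a space counts as one column and a tabulator jumps to the next multiple of 4.

```python
def indentation_column(line, tab_width=4):
    """Column of the first non-whitespace character, or None for a blank line."""
    column = 0
    for character in line:
        if character == " ":
            column += 1
        elif character == "\t":
            # jump to the next column on the tabulator grid
            column = (column // tab_width + 1) * tab_width
        else:
            return column
    return None   # nothing but whitespace: no indentation is reported


# The five lines of the fragment, without their newline characters.
LINES = [
    "if True:",
    '    print "hello world"',
    "     ",
    '\tprint "bonjour, le monde"',
    "exit",
]

if __name__ == "__main__":
    print([indentation_column(line) for line in LINES])
```

The whitespace-only third line reports nothing, and the tabulator in the fourth line lands on column 4, the same as the four spaces of the second line.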
Misfitting Indentation¶
Not every indentation is valid. Namely, a misfitting indentation triggers
the on_indentation_misfit
event handler. The following fragment shows
such a case.
1if True:⏎
2␣␣␣while True:⏎
3␣␣␣␣␣␣do_something()⏎
4␣␣␣␣do_more()⏎
Misfitting indentation occurs, if the indentation is less than the indentation
of the previous line, but does not fit any indentation of an opening block.
This happens in line 4 of the example above (the one with do_more()
). Its
indentation of 4 spaces does not fit any open indentation level, not 0 of the
first line, not 3 of the second line, and not 6 of the third line. In such a
case, the on_indentation_misfit
event handler is triggered.
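A possible way to picture the check is a lookup in the stack of open indentation levels. The Python sketch below is an assumption about the principle only; the names IndentationMisfit and close_blocks are hypothetical and do not belong to the Quex API.

```python
class IndentationMisfit(Exception):
    """Raised when a lower indentation fits no open indentation level."""


def close_blocks(stack, indentation):
    """Close blocks down to `indentation`; return how many were closed."""
    if indentation >= stack[-1]:
        raise ValueError("indentation is not lower than the current level")
    if indentation not in stack:
        raise IndentationMisfit(indentation)   # stack is left untouched
    closed = 0
    while stack[-1] > indentation:
        stack.pop()
        closed += 1
    return closed


if __name__ == "__main__":
    levels = [0, 3, 6]                 # levels opened by lines 1 to 3
    try:
        close_blocks(levels, 4)        # line 4 of the fragment
    except IndentationMisfit as misfit:
        print("misfit at indentation", misfit)
```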
Suppressed Newline¶
For languages that require a logical continuation of a line which is terminated
by a newline character, a newline suppressor
pattern can be specified. When
a pattern matching suppressor
precedes directly the newline
pattern,
indentation counting is skipped for the next line. This is shown in the example below.
1print "hello "␣␣\⏎
2␣␣␣␣␣␣"world"⏎
3if True:␣␣␣␣\\⏎
4do_more()⏎
5exit
The suppressor in mode EXAMPLE
triggers upon an odd number of backslashes
plus an arbitrary number (possibly none) of whitespace characters. In this
example, line 1 ends with an odd number of backslashes before the newline.
Consequently, the initial whitespace in line 2 is not considered indentation.
Line 3, however, ends with an even number of backslashes, which does not
suppress the newline, so indentation counting remains active for line 4.
Since do_more() in line 4 and exit in line 5 start directly with
non-whitespace, their indentation of 0 equals that of the preceding relevant
lines and no nested block is opened or closed.
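The odd/even rule for backslashes can be checked with a short Python sketch. It is an approximation written for illustration: the function newline_is_suppressed is hypothetical, and lines are passed without their terminating newline.

```python
import re

# A run of backslashes, then optional spaces or tabs, at the end of the line.
TRAILING_BACKSLASHES = re.compile(r"(\\*)[ \t]*$")


def newline_is_suppressed(line):
    """True if the line ends in an odd number of backslashes (plus whitespace)."""
    backslashes = TRAILING_BACKSLASHES.search(line).group(1)
    return len(backslashes) % 2 == 1


if __name__ == "__main__":
    print(newline_is_suppressed('print "hello "  \\'))    # one backslash
    print(newline_is_suppressed("if True:    \\\\"))      # two backslashes
```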
Suppressed newline after indentation does not add up indentation. The following
pseudo code does trigger the on_indent
event handler before do_something()
.
1repeat:⏎
2␣␣␣␣\⏎
3do_something()⏎
Suppressed newline does also not trigger on_dedent
upon lower indentation.
1while:⏎
2␣␣␣␣if True:⏎
3\⏎
4␣␣␣␣else⏎
Here, no indentation event triggers between if True:
and else
. Nor
does a suppressed newline trigger a misfitting indentation event in a case such as
the code below.
1while:⏎
2␣␣␣␣if True:⏎
3␣␣\⏎
4␣␣␣␣else⏎
Indentation with suppressed newline does not add up in the next line.
1while:⏎
2␣␣␣␣if True:⏎
3␣␣␣\⏎
4␣␣␣\⏎
5␣␣␣␣if False:⏎
If indentation were to add up in lines ending with \
, then, in the example
above, the if False:
would have an indentation of 10 (3+3+4). However, no indentation
event occurs between the two if
lines. The indentation of if False:
is
simply 4 as it is for if True:
.
Suspended Indentation¶
The effect of a suspend
definition is explained alongside the following
example.
1if is_ok():⏎
2␣␣# first comment⏎
3␣␣␣␣do_something()⏎
4␣␣# second comment⏎
5␣␣␣␣do_more()⏎
6exit
Line 2 starts, after 2 whitespace characters, with #
which signals a comment until
newline to be skipped. That line does not trigger an indentation event. Line 4
is also indented with only 2 whitespace characters. The fact that it has less whitespace
at the beginning than its predecessor, however, is irrelevant. Since it starts
with what matches suspend
, the line is exempted from indentation handling.
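The following Python sketch mimics this exemption. It is a simplified model under stated assumptions: the helper counted_indentations is hypothetical, indentation is measured in characters, and the suspend patterns are reduced to the two literal prefixes of mode EXAMPLE.

```python
def counted_indentations(lines, suspend=("#", "//")):
    """Indentations of those lines that take part in indentation handling."""
    result = []
    for line in lines:
        content = line.lstrip(" \t")
        if not content or content.startswith(suspend):
            continue                       # blank or suspended: no event
        result.append(len(line) - len(content))
    return result


# The six lines of the fragment, without their newline characters.
LINES = [
    "if is_ok():",
    "  # first comment",
    "    do_something()",
    "  # second comment",
    "    do_more()",
    "exit",
]

if __name__ == "__main__":
    print(counted_indentations(LINES))
```

Only the indentations 0, 4, 4 and 0 remain, so the comment lines indented by 2 never reach the indentation handler.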
Bad Indentation Characters¶
Mode EXAMPLE
defined the pattern of bad
indentation as a mixture of
tabs and spaces. In the code fragment below, lines 2 and 3 start with a mixture
of spaces and tabs. As soon as the indentation is detected, the
on_indentation_bad
handler is called.
1if True:⏎
2␣␣→ do_something()⏎
3→ ␣␣do_more()⏎
Token Sending¶
Previous paragraphs indicated how the indentation counting process can be
customized. What remains is to discuss the customization of automated token
sending. In particular, the NODENT
token might be redundant in many cases.
One way to deal with it is to assimilate its token-id to something carrying
the meaning of ‘statement delimiter’.
- token {
… NODENT = STATEMENT_END; …
}
Thus, a NODENT
arrives at the parser as a statement delimiter. Another way
to adapt the token sending process is to write event handlers related to
indentation counting. The list of available event handlers has already been
given in table Token identifiers and event handlers related to indentation-based scoping.. An earlier section
discussed the event handlers related to indentation handling.
- on_dedent: (CloseN, Indentation, ColumnN, LineN)
Whenever a lower indentation closes one or more nested indentation blocks, this event handler is called. The default behavior is to send CloseN tokens of type DEDENT. If the DEDENT token-id is set up for repetition, a single repeated token is sent with CloseN as repetition count. That is, the default behavior is either
on_dedent { for(int i=0; i < CloseN ; ++i) { self.send(QUEX_TKN_DEDENT); } }
or, if DEDENT has been set up to carry repetition counts,
on_dedent { self.send_n(QUEX_TKN_DEDENT, CloseN); }
It has not been discussed yet that a single lower indentation may close more than one indentation block. This is demonstrated here.
1if True:⏎
2␣␣␣␣while True:⏎
3␣␣␣␣␣␣␣␣if True:⏎
4␣␣␣␣␣␣␣␣␣␣␣␣do_something()⏎
5exit
The exit
in the last line closes three indentation blocks: the one at
column 4 (opened in line 2), the one at column 8 (line 3), and the one at
column 12 (line 4). Consequently, three DEDENT
tokens need to be sent.
Instead of stacking 3 tokens into the token queue, it may make sense to
allow DEDENT
to carry the repetition count:
repeated_token { DEDENT; }
Then, the default on_dedent
handler is automatically set up
accordingly.
- on_nodent: (Indentation, ColumnN, LineN)
This handler is called in the event that the current line neither closes nor opens an indentation block, and there is neither a newline nor a suppressed newline.
- on_indentation_misfit
raises:
E_Error_OnIndentationMisfit
Whenever an indentation occurs, which is lower than the currently open indentation block but which does not coincide with any nesting indentation block, then this handler is called.
- Indentation
The indentation that has occurred.
- IndentationStackSize
The number of currently open indentation blocks.
- IndentationStack(i)
Delivers the indentation number ‘i’ from the current indentation blocks.
- IndentationUpper
Delivers the smallest indentation level that is greater than the current.
- IndentationLower
Delivers the greatest indentation level that is smaller than the current.
- ClosedN
Number of closed indentation levels.
- on_indentation_bad
raises:
E_Error_OnIndentationBad
Whenever the DFA that identifies a bad indentation line matches, this handler is called.
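Returning to on_dedent: the difference between stacking one DEDENT per closed block and sending a single token with a repetition count can be pictured with the Python sketch below. The token queue is modeled as a list of (name, count) pairs; closed_blocks and queue_dedents are hypothetical names chosen for this illustration and are not Quex functions.

```python
def closed_blocks(stack, indentation):
    """Number of open indentation levels above `indentation`."""
    return sum(1 for level in stack if level > indentation)


def queue_dedents(closed_n, repeated=False):
    """Tokens placed in the queue when `closed_n` blocks are closed."""
    if repeated:
        return [("DEDENT", closed_n)]       # one token with repetition count
    return [("DEDENT", 1)] * closed_n       # one token per closed block


if __name__ == "__main__":
    closed_n = closed_blocks([0, 4, 8, 12], 0)   # 'exit' at column 0
    print(queue_dedents(closed_n))
    print(queue_dedents(closed_n, repeated=True))
```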
In some cases, it might be desirable not to stop lexical analysis upon bad
indentation characters. Then, however, it must be ensured that something else
catches them. If bad
matches a mixture of spaces and tabs, then there must be
either a skipper or a pattern inside the mode that digests it. A simple
<skip: [ \t\n]>
as shown below, does the job. The detection of a bad
indentation character should have its consequences. The treatment must be
implemented in the on_indentation_bad
event handler. A setup of
bad-character skipping and error-tolerating can be seen below.
mode EXAMPLE :
<indentation: (" "*"\t")|("\t"*" ") => bad;>
<skip: [ \t\n]>
{
...
on_indentation_bad {
self.error_code_clear_this(E_Error_OnIndentationBad);
std::cout << self.line_number() << ":";
std::cout << "Warning: Bad indentation character!\n";
}
}
As a brief overview, the code fragment below sets up the event handlers to mimic the default behavior of automatic indentation-based token sending.
mode EXAMPLE : <indentation: ...>
{
on_indent { self_send(QUEX_TKN_INDENT); }
on_dedent { for(int i=0; i<ClosedN ; ++i) self.send(QUEX_TKN_DEDENT); }
on_nodent { self_send(QUEX_TKN_NODENT); }
on_indentation_misfit {
self.error_code_set_if_first(E_Error_OnIndentationMisfit);
self.send(QUEX_TKN_TERMINATION);
}
on_indentation_bad {
self.error_code_set_if_first(E_Error_OnIndentationBad);
self.send(QUEX_TKN_TERMINATION);
}
...
}
Interference with Pattern Matching¶
Indentation counting is triggered by a howsoever defined marker ‘newline’. For it to trigger, it must not be devoured by something else. Let a mode contain a pattern like this:
"\n"+ => QUEX_TKN_VULTURE();
Here, indentation counting can never trigger, since the VULTURE
pattern eats it away. A newline at the end of a pattern, which is not put back,
also prevents indentation counting. This happens, for example, with a pattern
designed to skip over a ‘comment until newline’:
"#"[^\n]*\n => QUEX_TKN_COMMENT(Lexeme); // dysfunctional
Such a situation can be remedied by requesting newline as a post-context instead of making it part of the core pattern:
"#"[^\n]*/\n => QUEX_TKN_COMMENT(Lexeme); // functional
Since newline devouring patterns are an issue for indentation handling, Quex warns about any interferences. The skippers are designed to cooperate with the indentation counting automatically, if possible. Else, a warning is issued.
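The difference between consuming the newline and merely requiring it can be illustrated with Python's re module, where a lookahead plays a role comparable to a post-context. This is an analogy under assumptions, not Quex syntax; the helper remaining_input is hypothetical.

```python
import re

CONSUMING = re.compile(r"#[^\n]*\n")          # newline is part of the lexeme
POST_CONTEXT = re.compile(r"#[^\n]*(?=\n)")   # newline required, not consumed


def remaining_input(pattern, text):
    """Input left for the next analysis step after matching at the start."""
    match = pattern.match(text)
    return text[match.end():]


if __name__ == "__main__":
    text = "# comment\n    next()"
    print(repr(remaining_input(CONSUMING, text)))
    print(repr(remaining_input(POST_CONTEXT, text)))
```

Only in the second case does the remaining input still begin with the newline that triggers indentation counting.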
Tabs vs. Spaces¶
Surveying the discussions on the web on tabs vs. spaces suggests that the debate on this issue might not be settled any time soon. The ‘tolerating consistency’ approach, as discussed earlier in this section, might pacify peaceable zealots. Nevertheless, the key points of discussion are listed below as a guideline for language design decisions.
- Pro Spaces
A space is always one column in any editor, in a file, or on any hardware. There is no definite width of a tabulator.
Spaces are visible as what they are. Tabs are only visible when being interpreted as some number of spaces.
- Pro Tabulators
Tabs provide an encoding-inherent compression mechanism: Spaces require N characters for an indentation, tabulators only 1.
Misfitting indentation is impossible.
A viewer may adapt the indentation width to the user’s preferences simply by specifying a ‘tabulator display width parameter’.
On viewers with non-mono-spaced fonts, tabs are the only means to achieve vertically aligned columns.
Traditionally, tabs jump to predefined columns. Since this behavior is similar to indentation, it may cause the least amount of astonishment.
Once a development team has made a decision, then editing tools should be
configured to support the translation of tabulators into spaces or spaces into
tabulators. Then, the difference only exists underneath the hood and the urge
of discussion may dissolve. After all this pacifying talk, the following script
‘subdue.sh
’ is intended to cater to the needs of the not so peaceable fanatic,
who wants to focus on actions and not on debates.
#! /usr/bin/env bash
# PURPOSE: Convert a text file towards the *only true* indentation type.
# NO WARRANTY. USE AT OWN RISK.
# $1 target file
# $2 needed indentation type: 'tabs' or 'spaces'
# $3 number of spaces per tab.
# EXAMPLE:
# for file_name in $(find ./ -name "*.c"); do
# bash ./subdue.sh $file_name spaces 4
# done
# (C) Frank-Rene Schaefer; License: MIT.
#______________________________________________________________________________
IFS=; read -r -d '' program <<- \
_______________________________________________________________________________
BEGIN {
while( spaces_per_tab-- ) { spaces = spaces " "; }
if ( need=="spaces" ) { bad="\t"; good=spaces; }
else if( need=="tabs" ) { bad=spaces; good="\t"; }
else { print "tabs or spaces?"; exit(-1); }
}
match(\$0, /^[ \t]*[^ \t]/) { # Only replace at begin of line
tmp=substr(\$0, RSTART, RLENGTH-1);
gsub(bad, good, tmp);
\$0=tmp substr(\$0, RLENGTH);
}
{ print; }
_______________________________________________________________________________
tmp_file=$(mktemp)
awk -v need="$2" -v spaces_per_tab="$3" "$program" "$1" > "$tmp_file"
mv "$tmp_file" "$1"
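For readers who prefer Python over awk, the per-line conversion might be sketched as follows. This is not a replacement for the script: convert_indentation is a hypothetical helper and, like the awk program, it substitutes each tabulator by a fixed number of spaces (or vice versa) at the beginning of the line only, without regard to tabulator stops.

```python
import re


def convert_indentation(line, need, spaces_per_tab=4):
    """Convert the leading whitespace of one line to 'tabs' or 'spaces'."""
    spaces = " " * spaces_per_tab
    leading = re.match(r"[ \t]*", line).group(0)
    rest = line[len(leading):]             # everything after the indentation
    if need == "spaces":
        leading = leading.replace("\t", spaces)
    elif need == "tabs":
        leading = leading.replace(spaces, "\t")
    else:
        raise ValueError("tabs or spaces?")
    return leading + rest


if __name__ == "__main__":
    print(repr(convert_indentation("\t\tx\ty", "spaces")))
    print(repr(convert_indentation("        x", "tabs")))
```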