Lexical Analysis in 11l / Хабр

This article discusses the lexical analyzer, which is an integral part of any compiler.

The task of the lexical analyzer is to split the source code of the program into tokens.

So for example the code

print(1 + 2)

will be tokenized as
print, (, 1, +, 2 and )

Significance of indentation

Historically, compilers for most programming languages strip away all whitespace characters such as space, tab, and newline.
What does this mean? It means the compiler sees the source code like this:

if (foo)
   if (bar)
      m1();
else
   m2();

↓
if, (, foo, ), if, (, bar, ), m1, (, ), ;, else, m2, (, ), ;

This code is taken from the Nemerle programming language documentation. It contains a so-called dangling else bug: when looking at this code, you might think that the “else” branch refers to the condition “foo”, whereas in programming languages with C-like syntax (including C++, C#, Java and others) the “else” branch refers to the condition “bar”.
As a solution to this problem, the developers of the Nemerle language introduced a rule that an “if” statement must always have an “else” branch, and if the “else” branch is not required, then the “when” statement should be used instead of “if”.
In many new programming languages, such as Swift, Rust, and Go, the use of curly braces is mandatory even if the body of the “if” [or other control statement] is only one line. When curly braces are used, the problem disappears:

if (foo) {
   if (bar) {
      m1();
   }
}
else {
   m2();
}

Some coding standards, for example those from Oracle, from Stanford University or Carnegie Mellon University, require the bodies of “if” and “else” statements to be enclosed in curly braces, to avoid the possibility of this bug:

int login;

if (invalid_login())
    login = 0;
else
    printf("Login is valid\n");
    login = 1;

Here the line login = 1; will be executed in any case, no matter what value is returned by the invalid_login function.

But in both of the above cases, the problem is not at all that “if” [allowing an “else” branch to be either absent or present] expresses the wrong logic, or even that the curly braces were forgotten: it is… the discrepancy between the way this code is perceived by the human programmer and by the compiler. A human being perceives block boundaries visually, using indentation as a guide; but the compiler makes use of symbols [which a human hardly notices]. What is more, the compiler [for C/C++, C#, Java, Nemerle, etc.] treats all whitespace characters in the same way, and, thus, completely ignores the indentation. This is where the root of the problem lies: the compiler and the human see code like this differently.

And to solve this problem, the compiler must somehow take indentation into account [so that the code examples given earlier either work as the programmer expects, or give an error at the compilation stage (or at least a warning)].

For this reason, it was decided to provide the new programming language 11l with a lexical analyzer that takes indentation into account.

Now let's look at the implementation process. A brief description of the algorithm for parsing indentation is given in the Python programming language documentation. In short, the algorithm works like this: at the beginning of each line, the indentation level is compared to the value on the top of a special stack. If it is larger, then it is pushed onto the stack and the lexer generates an INDENT token. If it is smaller, then all values exceeding the current line indentation level are removed from the stack and a DEDENT token is generated for each value removed. Further, at the end of the source file, a DEDENT is generated for each value remaining on the stack.

In addition to the fact that the lexical analyzer of the 11l language takes indentation into account, like the lexical analyzer of the Python language, the new language adds support for curly braces, which makes it possible to write code on one line or without indentation:

With indentation:

fn sum(a, b)
   return a + b

On one line:

fn sum(a, b) {return a + b}

Without indentation:

fn sum(a, b)
{
return a + b
}

In addition, code can be written using both indention and curly braces.

Here is how it works (in brief)

The indentation levels stack (indentation_levels) in 11l, in contrast to the lexical analyzer of the Python language, stores not just the indentation level but also a flag that is set to true whenever a new block of code is formed by the symbol {, rather than by increasing the indentation level. The indentation level is then set to -1.
The { character is immediately followed by a new, arbitrary {i.e. lowering the indentation level is allowed} indentation level {its level is set instead of -1}, which remains in effect up to the matching } character.
Here is a link to the corresponding source code.

This solution is the most versatile, and will suit both those who prefer using curly braces and those who prefer indentation.
All popular indentation styles are supported:

Allman K&R GNU

Allman	K&R	GNU
`if x == y { something() somethingelse() }`	`if x == y { something() somethingelse() }`	`if x == y { something () somethingelse () }`
Whitesmiths	Ratliff
`if x == y { something() somethingelse() }`	`if x == y { something() somethingelse() }`

if x == y
{
    something()
    somethingelse()
}

if x == y {
    something()
    somethingelse()
}

if x == y
  {
    something ()
    somethingelse ()
  }

Whitesmiths Ratliff

if x == y
    {
    something()
    somethingelse()
    }

if x == y {
    something()
    somethingelse()
    }

Automatic line joining

In addition to splitting the source code of a program into tokens, the lexical analyzer is also charged with determining statement boundaries. [Although in some programming languages (like JavaScript or Kotlin) assistance from the parser is required.]

Traditionally, a semicolon is used in C-like programming languages (C++, C#, Java, D) as a statement terminator. However, most new programming languages (e.g. Swift, Go, Kotlin, CoffeeScript, Nim, Julia, Crystal, and others) allow the semicolon to be omitted, by using the newline character as the end of a statement. [I am not the only one to have noticed this trend.]

But the question then arises: when do two consecutive lines of source code represent parts of the single statement, and when are they two different statements? And different programming languages solve this problem in different ways.

Python uses one simple rule for implicit line joining: if a line contains an unclosed parenthesis, square bracket, or curly brace, then it concatenates with following lines until all matching parentheses/brackets/braces are closed.

Go uses the rule: if the newline comes after a token that could end a statement, then the statement is ended. However, this rule imposes significant restrictions on coding style.
For example, in an “if” statement, the curly brace that opens a block must be on the same line — moving it down to the next line is not allowed. The “else” statement must also appear on the same line as the closing curly brace.

Right	Wrong
`if i < f() { g() }`	`if i < f() { g() }`
`if x < 0 { return -1 } else { return 1 }`	`if x < 0 { return -1 } else { return 1 }`

JavaScript uses a rather complex system of rules to carry out automatic semicolon insertion.

a = 1
b = 2

In this example, a semicolon will be automatically inserted at the end of the first line, since a = 1 b = 2 is not a valid statement.
However, in some cases a semicolon is not inserted, leading the code to function incorrectly.

For example, the following two lines of JavaScript code will be mistakenly joined into one.

a = b + c
[d, e] = [e, d]

And in this example, on the contrary, a semicolon will be erroneously inserted immediately after return.

return
  "something";

That is, instead of returning a string "something", the value undefined will be returned.

In most programming languages that allow semicolons to be omitted (for example, Python, Go, Kotlin, Ruby, Julia, and Nim), this code will not work:

r = 1
  + 2

Moreover, in Python, Go and Nim an error message will be displayed, while in Kotlin, Ruby and Julia the value of the variable r will be set to 1.

To solve this problem, 11l uses the following 3 simple rules that are easily understood by both the programmer and the lexical analyzer:

If a line ends with a binary operator, then it is joined with the next line.
If a line begins with a binary operator, then it is joined with the previous line.
(It is necessary to check that it is not a unary operator!)
```
a = b
++i // the plus character at the beginning of this line should not be mistaken for the binary operator `+`
```
And also, just as in Python, a newline within expressions in parentheses or square brackets is ignored.

// 1.
if condition1 & // this line will be joined with the next one,
   condition2   \\ since it ends with the binary operator `&`
    some_func()

// 2.
some_variable = 2
              + 3 // this line will be joined with the previous one,
                  \\ since it starts with the binary operator `+`

// 3.
some_func(    // since this line includes an unclosed parenthesis, all
   argument1, \\ subsequent lines will be joined until
   argument2) \\ the parenthesis is closed

Semicolons

While it is not necessary to use a semicolon to mark the end of statements in 11l, semicolons can be used to place multiple statements on one line:

a = 1; b = 2; c = 3

But, for example, in Python it is not obvious what this line of code corresponds to:

if b: print(1); print(2)

This code in Python:	`if b: print(1) print(2)`
Or this:	`if b: print(1) print(2)`

This code in 11l:	`if b {print(1); print(2)}`
Or this:	`if b {print(1)}; print(2)`

Python's behavior in this case is logical in principle, but not obvious.
The behavior of 11l, by contrast, is both logical and obvious.

Furthermore, it is impossible to write a one-line function like this in Python:

def f(x): if x: print(x)

But in 11l, it is possible:

fn f(x) {if x {print(x)}}

Always room for improvement

In my _{anything but} humble opinion, 11l implements the most advanced lexical analyzer of all currently existing programming languages. However, it can still be improved.

For example, the third automatic line joining rule could be dropped, to support code like this:

set_timeout(
            1.0,
            fn ()
               alert(‘!’)
            ,
            0
           )

In this case, the third rule must be replaced with the following two rules:

If a line ends with an opening parenthesis (() or square bracket ([), or a comma (,), then it is joined with the next line.
If a line begins with a closing parenthesis ()) or square bracket (]), then it is joined with the previous line.

But since at the moment 11l parser does not support constructs of this kind {in particular, anonymous functions are not supported [instead, you can use "arrow" functions (for example, (x, y) -> x + y)]}, their implementation at the level of the lexical analyzer would make no practical sense.

In conclusion, I will provide links to the source code for the 11l lexical analyzer, which is available in three languages: Python, 11l and C++.