
Tree-sitter and Preprocessing: A Syntax Showdown

Original author: pinbraerts

According to its own description:

"Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited."

But how does Tree-sitter handle languages that require a preprocessing stage?


Because the preprocessor rewrites the source text, it is hard to incorporate into the language grammar. A grammar therefore has to support preprocessor directives with minimal loss, without performing the preprocessing itself.


C/C++


tree-sitter-cpp inherits from tree-sitter-c and does not alter the rules for preprocessor directives. tree-sitter-c takes a principled approach: the parser treats the preprocessor as an integral part of the grammar. However, any directive that modifies the text (#if, #include) can appear in the middle of a grammar rule and turn it into something entirely different. To fully support #if in a single grammar, one would therefore have to generate a separate preprocessor directive rule for every combination of rules. Tree-sitter makes this feasible because its grammars are scripted in JavaScript. The authors of this parser limited themselves to four cases:


    ...preprocIf('', $ => $._block_item),
    ...preprocIf('_in_field_declaration_list', $ => $._field_declaration_list_item),
    ...preprocIf('_in_enumerator_list', $ => seq($.enumerator, ',')),
    ...preprocIf('_in_enumerator_list_no_comma', $ => $.enumerator, -1),

The preproc_if rule is used for statements and declarations inside blocks and at the global scope. The preproc_if_in_enumerator_list and preproc_if_in_enumerator_list_no_comma rules appear in enumerator lists, while preproc_if_in_field_declaration_list is used in structures, unions, and classes.
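How a helper like this can stamp out one rule per context can be sketched as follows. This is a simplified illustration, not the actual tree-sitter-c code. The DSL functions (seq, repeat, and so on) are stubbed so that the snippet runs under plain Node.js, and the shape of the generated rules, including the literal '#if', '#else', and '#endif' tokens and the $._preproc_expression name, is an assumption made for the example.

```javascript
// Stand-ins for Tree-sitter's grammar DSL so the sketch runs under plain Node.js.
// In a real grammar.js these functions are provided by the tree-sitter CLI.
const seq = (...members) => ({ type: 'SEQ', members });
const repeat = (content) => ({ type: 'REPEAT', content });
const optional = (content) => ({ type: 'OPTIONAL', content });
const field = (name, content) => ({ type: 'FIELD', name, content });
const prec = (value, content) => ({ type: 'PREC', value, content });

// Simplified, hypothetical preprocIf-style generator.
// It returns one #if rule and one #else rule for a given context;
// the suffix keeps the generated rule names unique.
function preprocIf(suffix, content, precedence = 0) {
  return {
    ['preproc_if' + suffix]: $ => prec(precedence, seq(
      '#if',
      field('condition', $._preproc_expression),
      '\n',
      repeat(content($)),
      field('alternative', optional($['preproc_else' + suffix])),
      '#endif',
    )),
    ['preproc_else' + suffix]: $ => seq('#else', repeat(content($))),
  };
}

// Two contexts: block items and enumerators followed by a comma.
const rules = {
  ...preprocIf('', $ => $._block_item),
  ...preprocIf('_in_enumerator_list', $ => seq($.enumerator, ',')),
};

// A proxy that turns $.name into a symbol reference,
// imitating how the DSL resolves rule names.
const $ = new Proxy({}, { get: (_, name) => ({ type: 'SYMBOL', name }) });

const ruleNames = Object.keys(rules);
const enumIf = rules.preproc_if_in_enumerator_list($);
console.log(ruleNames);
```

Each call contributes its own pair of rules, so every additional context in which #if should be understood costs another call and another set of rule names. That is why covering every combination of rules quickly becomes impractical.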


This set of rules works well with simple examples:


#if 9            // (preproc_if condition: (number_literal)
int a = 3;       //   (declaration)
#else            //   alternative: (preproc_else
int b = 3;       //     (declaration)))
#endif           //

int main(void) { // (function_definition body: (compound_statement
#if 9            //   (preproc_if condition: (number_literal)
    int a = 3;   //     (declaration)
#else            //     alternative: (preproc_else
    int b = 3;   //       (declaration)))
#endif           //
}                // ))

struct {         // (struct_specifier body: (field_declaration_list
#if 9            //   (preproc_if condition: (number_literal)
    int a;       //     (field_declaration)
#else            //     alternative: (preproc_else
    int b;       //       (field_declaration)))
#endif           //
};               // ))

enum {           // (enum_specifier body: (enumerator_list
#if 9            //   (preproc_if condition: (number_literal)
    a = 2,       //     (enumerator)
#else            //     alternative: (preproc_else
    b = 3,       //       (enumerator)))
#endif           //
};               // ))

However, a small change in the last example can choke tree-sitter-c:


enum {           // (enum_specifier body: (enumerator_list
#if 9            //   (preproc_if condition: (number_literal)
    a = 2,       //     (enumerator)
#else            //     alternative: (preproc_else)
    b = 3        //       (ERROR (enumerator)))
#endif           //
};               // ))

This perfectly valid C snippet, which has no trailing comma, needs different grammar rules for the two branches of the directive: an enumerator with a comma in one and an enumerator without a comma in the other.


A more complex example:


int a =          // (ERROR)
#if 1            // (preproc_if condition: (number_literal)
    3            //   (ERROR (number_literal))
#else            //   alternative: (preproc_else
    4            //     (expression_statement (number_literal)
#endif           //       (ERROR))))
;                //

In the following case, tree-sitter-c cannot even handle #else correctly:


int a            // (declaration)
#if 1            // (preproc_if condition: (number_literal)
    = 3          //   (ERROR (number_literal)
#else            //   )
    = 4          //     (expression_statement (number_literal)
#endif           //       (ERROR)
;                // )))

While the result of #if substitution can be predicted based on the source code, the result of #include substitution is entirely unpredictable for the parser. Nevertheless, in C and C++ grammars, the #include directive is allowed only in the global scope and inside blocks.


#include "a"     // (preproc_include path: (string_literal))
int main(void) { // (function_definition body: (compound_statement
    #include "b" //   (preproc_include path: (string_literal))
}                // ))
int a =          // (declaration (init_declarator
    #include "c" //   (ERROR) value: (string_literal)
;                // ))

C#


tree-sitter-c-sharp takes a similar approach, but with a slightly more varied set of contexts:


    ...preprocIf('', $ => $.declaration),
    ...preprocIf('_in_top_level', $ => choice($._top_level_item_no_statement, $.statement)),
    ...preprocIf('_in_expression', $ => $.expression, -2, false),
    ...preprocIf('_in_enum_member_declaration', $ => $.enum_member_declaration, 0, false),

This allows parsing such an example, thanks to a special rule for preprocessor directives inside expressions:


int a =          // (variable_declaration
#if 1            //   (preproc_if condition: (integer_literal)
    3            //     (integer_literal)
#else            //     alternative: (preproc_else
    4            //       (integer_literal))))))
#endif           //
;                //

However, it breaks the enumeration example that works in tree-sitter-c:


enum A {         // (enum_declaration body: (enum_member_declaration_list
#if 9            //   (preproc_if condition: (integer_literal)
    a = 2,       //     (enum_member_declaration) (ERROR)
#else            //     alternative: (preproc_else
    b = 3,       //       (enum_member_declaration) (ERROR)))
#endif           //
};               // ))

enum A {         // (enum_declaration body: (enum_member_declaration_list
#if 9            //   (preproc_if condition: (integer_literal)
    a = 2,       //     (enum_member_declaration) (ERROR)
#else            //     alternative: (preproc_else
    b = 3        //       (enum_member_declaration)))
#endif           //
};               // ))

In this case, the error nodes correspond only to commas, so the parse can still be counted as a success.


Nevertheless, more complex rules, like operators, are still not accounted for:


int a            // (ERROR (variable_declaration)
#if 1            //   (preproc_if condition: (integer_literal)
    = 3          //     (ERROR) (integer_literal)
#else            //     alternative: (preproc_else
    = 4          //       (ERROR) (integer_literal))
#endif           //   ))
;                // (empty_statement)
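The comparisons in this section come down to checking whether ERROR nodes appear in the printed tree. A minimal sketch of that check, assuming the tree is available as an S-expression string; the two strings below are copied from the abbreviated comments in the C# examples above, and the function name is hypothetical.

```javascript
// Counts ERROR nodes in an S-expression dump of a syntax tree.
// This is a plain text search, not a call into Tree-sitter itself.
function countErrorNodes(sexp) {
  const matches = sexp.match(/\(ERROR\b/g);
  return matches ? matches.length : 0;
}

// "int a = #if 1 3 #else 4 #endif ;" as annotated above (abbreviated).
const expressionTree =
  '(variable_declaration (preproc_if condition: (integer_literal) ' +
  '(integer_literal) alternative: (preproc_else (integer_literal))))))';

// "int a #if 1 = 3 #else = 4 #endif ;" as annotated above.
const operatorTree =
  '(ERROR (variable_declaration) (preproc_if condition: (integer_literal) ' +
  '(ERROR) (integer_literal) alternative: (preproc_else ' +
  '(ERROR) (integer_literal)) )) (empty_statement)';

const expressionErrors = countErrorNodes(expressionTree);
const operatorErrors = countErrorNodes(operatorTree);
console.log(expressionErrors, operatorErrors);
```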

Other directives


What distinguishes the C# grammar is how it treats the other preprocessor directives. A Tree-sitter grammar has a field called extras, which marks special rules that may appear anywhere. Typically, this list contains whitespace and comments. The grammar can be simplified significantly by adding directives to this list:


  extras: $ => [
    /[\s\u00A0\uFEFF\u3000]+/,
    $.comment,
    $.preproc_region,
    $.preproc_endregion,
    $.preproc_line,
    $.preproc_pragma,
    $.preproc_nullable,
    $.preproc_error,
    $.preproc_define,
    $.preproc_undef,
  ],

Thus, these directives are still included in the syntax tree and participate in syntax highlighting but do not affect the other rules.


int a                                 // (variable_declaration (variable_declarator
#pragma warning disable warning-list  //   (preproc_pragma)
    = 3                               //   (integer_literal)
#pragma warning restore warning-list  //   (preproc_pragma)
;                                     // ))

Despite a minor bug in the preproc_pragma rule, everything else was interpreted correctly.
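The effect of extras can be imitated with a toy example. The sketch below is not how Tree-sitter works internally, and the function name, node type, and hard-coded rule are all hypothetical. It only shows the idea: directive lines are set aside as extra nodes, so the rule for the declaration never sees them.

```javascript
// Toy illustration of the "extras" idea; not Tree-sitter's implementation.
// Lines starting with '#' are collected as extra nodes, and the remaining
// tokens are matched against one hard-coded rule: <type> <name> = <number> ;
function parseDeclaration(source) {
  const extras = [];
  const tokens = [];
  for (const line of source.split('\n')) {
    const trimmed = line.trim();
    if (trimmed.startsWith('#')) {
      // A directive may appear anywhere, so it never reaches the rule below.
      extras.push({ type: 'preproc_directive', text: trimmed });
    } else {
      tokens.push(...(trimmed.match(/[A-Za-z_]\w*|\d+|[=;]/g) || []));
    }
  }
  const identifier = /^[A-Za-z_]\w*$/;
  const [type, name, eq, value, semi] = tokens;
  const ok = tokens.length === 5 &&
    identifier.test(type) && identifier.test(name) &&
    eq === '=' && /^\d+$/.test(value) && semi === ';';
  return {
    ok,
    declaration: ok ? { type, name, value: Number(value) } : null,
    extras,
  };
}

const result = parseDeclaration([
  'int a',
  '#pragma warning disable warning-list',
  '    = 3',
  '#pragma warning restore warning-list',
  ';',
].join('\n'));
console.log(result.ok, result.declaration, result.extras.length);
```

The declaration is recognized as if the directives were not there, yet both directives remain available as nodes, which is what keeps them visible to syntax highlighting.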


Before this pull request, #if was also in extras, which allowed parsing files with fewer errors.


Conclusion


Overall, the grammars for C/C++ and C# work quite well, and thanks to Tree-sitter's error recovery, invalid constructs do not affect the parsing of the text that follows. Parsing errors do show up as incorrect syntax highlighting or as misbehavior in other editor features built on Tree-sitter. With a language server, however, highlighting can be improved somewhat through semantic tokens. For instance, clangd marks skipped #if branches as comments:


[Figure: semantic tokens]


One might even say that Tree-sitter, in a sense, penalizes excessive use of the preprocessor. Personally, I prefer the approach of placing directive rules in extras. In the next article, I will discuss how I solved the preprocessing issue while writing the grammar for FastBuild using this approach.

