rsashka May 10 at 14:50

About the C++ static analyzer as a Clang plugin

This article is based on experience gained while developing the memsafe library, which uses a Clang plugin to add safe memory management and invalidation control for reference data types to C++ at compile time.

Choosing a Clang plugin architecture (AST Matcher vs. RecursiveASTVisitor)

There are many examples of Clang plugins on the Internet. But when you try to use them in a real project, it turns out that the features that made it so easy to get started later become a serious obstacle to further development. In other words, the simpler ways of writing a Clang plugin make the start of a project easier, but make its later development and maintenance many times more expensive.

After several iterations of the plugin code, I concluded that although searching the AST with AST Matchers takes less code, the final solution as a whole turns out to be much more complex, because it is sometimes impossible to account for the mutual dependencies between plugin code that lives in different branches (conditions) of several different matchers.

With AST Matchers, MatchCallback is called only for the nodes that matched. Creating and clearing the plugin's context information on every call is expensive, while keeping global plugin state during analysis requires visiting all AST nodes in sequence anyway, so that context information can be created dynamically and removed once it is no longer needed.

Because of this, after several unsuccessful experiments with AST Matchers, I decided to write the plugin the old-fashioned way, as a class derived from the RecursiveASTVisitor template. In addition, RecursiveASTVisitor makes it possible to interrupt, repeat, or take an alternative branch of the analysis depending on algorithm parameters, context information, or settings (options) that are themselves located in the source code of the program.

And most importantly: a matcher processes only the AST nodes that were explicitly specified, and how to specify them is not always obvious, so with complex search conditions some nodes can be missed. RecursiveASTVisitor, on the other hand, traverses all AST nodes in sequence, so a missed node is easy to detect while debugging and testing the plugin.
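The point about context that is created and removed during a full traversal can be illustrated with a small, Clang-free model. This is only a sketch: Node, ScopeTrackingVisitor, and makeSampleTree are hypothetical names invented for illustration and are not part of Clang or memsafe. The visitor enters every node, pushes context when a scope opens, and pops it when the scope closes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy AST node; this is NOT a Clang type.
struct Node {
    std::string kind;            // "Function", "Block" or "Var"
    std::string name;
    std::vector<Node> children;
};

// Visits every node and keeps a stack of enclosing scopes as context.
class ScopeTrackingVisitor {
public:
    // Each record is "<enclosing scopes>::<variable name>".
    std::vector<std::string> found;

    void traverse(const Node &node) {
        const bool opens_scope =
            (node.kind == "Function" || node.kind == "Block");
        if (opens_scope) {
            scopes_.push_back(node.name);   // create context on entry
        }
        if (node.kind == "Var") {
            found.push_back(currentScope() + "::" + node.name);
        }
        for (const Node &child : node.children) {
            traverse(child);
        }
        if (opens_scope) {
            scopes_.pop_back();             // remove context on exit
        }
    }

private:
    std::vector<std::string> scopes_;

    // Joins the scope stack into a path such as "f::b".
    std::string currentScope() const {
        std::string result;
        for (const std::string &scope : scopes_) {
            if (!result.empty()) {
                result += "::";
            }
            result += scope;
        }
        return result;
    }
};

// Builds a small tree: function f { var a; block b { var c; } var d; }
inline Node makeSampleTree() {
    Node c{"Var", "c", {}};
    Node block{"Block", "b", {c}};
    Node f{"Function", "f",
           {Node{"Var", "a", {}}, block, Node{"Var", "d", {}}}};
    return f;
}
```

Because the traversal itself owns the entry and exit points of each scope, the context never has to be rebuilt from scratch for an individual match, which is the property the callback-based approach lacks.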

Printing an AST fragment

Whatever plugin architecture we use, the AST remains the same. Only the way its nodes are traversed or searched differs. To find something, you need to know what to look for and how it relates to other AST nodes. For someone who is not well versed in these subtleties, the various temporary or sugar AST nodes can become a real headache. For me, the biggest difficulty was understanding which AST nodes a particular expression consists of.

The AST Matcher approach suggests using a special query language and the clang-query utility to check search expressions. This is a good tool, but it is inconvenient in practice, since you have to keep switching between the plugin source code, the source code being analyzed, and clang-query.

I solved this problem for myself by placing start and end marks for AST dump printing in the analyzed code. The plugin switches to dump printing mode when it finds the first mark and switches it off when it finds the second one. This way of printing individual AST fragments is more convenient, since it does not require third-party tools and you do not have to search for the desired fragment in a full AST dump.

For example, the definition of the SharedArrayInt structure:

MEMSAFE_PRINT_AST("*"); // Starting AST dump output
struct SharedArrayInt : public Shared<std::vector<int> > {
};
MEMSAFE_PRINT_AST("");  // Disable dump output

looks like this as an AST dump:

_cycles.cpp:87:12  struct SharedArrayInt : public Shared<std::vector<int>>   dump:
CXXRecordDecl 0x7ab1e3aaeda8 <_cycles.cpp:87:5, line:88:5> line:87:12 struct SharedArrayInt definition
|-DefinitionData aggregate standard_layout can_const_default_init
| |-DefaultConstructor exists non_trivial needs_implicit
| |-CopyConstructor simple non_trivial needs_overload_resolution
| |-MoveConstructor exists non_trivial needs_overload_resolution
| |-CopyAssignment simple non_trivial has_const_param needs_implicit implicit_has_const_param
| |-MoveAssignment exists simple non_trivial needs_overload_resolution
| `-Destructor simple non_trivial needs_implicit
|-public 'Shared<std::vector<int>>':'memsafe::Shared<std::vector<int>>'
|-CXXRecordDecl 0x7ab1e3a62a78 <col:5, col:12> col:12 implicit struct SharedArrayInt
|-CXXConstructorDecl 0x7ab1e3a62bd0 <col:12> col:12 implicit SharedArrayInt 'void (SharedArrayInt &)' inline default noexcept-unevaluated 0x7ab1e3a62bd0
| `-ParmVarDecl 0x7ab1e3a62d08 <col:12> col:12 'SharedArrayInt &'
|-CXXConstructorDecl 0x7ab1e3a62ee8 <col:12> col:12 implicit constexpr SharedArrayInt 'void (SharedArrayInt &&)' inline default_delete noexcept-unevaluated 0x7ab1e3a62ee8
| `-ParmVarDecl 0x7ab1e3a63028 <col:12> col:12 'SharedArrayInt &&'
`-CXXMethodDecl 0x7ab1e3a630c8 <col:12> col:12 implicit operator= 'SharedArrayInt &(SharedArrayInt &&)' inline default noexcept-unevaluated 0x7ab1e3a630c8
  `-ParmVarDecl 0x7ab1e3a631f8 <col:12> col:12 'SharedArrayInt &&'

And a function with a single return statement:

MEMSAFE_PRINT_AST("*"); // Starting AST dump output
memsafe::Shared<int> memory_test_9() {
    return Shared<int>(999);
}
MEMSAFE_PRINT_AST(""); // Disable dump output

produces an AST dump like this:

_example.cpp:169:26  memsafe::Shared<int> memory_test_9()   dump:
FunctionDecl 0x7435ed31f4e8 <_example.cpp:169:5, line:171:5> line:169:26 memory_test_9 'memsafe::Shared<int> ()'
`-CompoundStmt 0x7435ed31f8a8 <col:42, line:171:5>
  `-ReturnStmt 0x7435ed31f898 <line:170:9, col:31>
    `-ExprWithCleanups 0x7435ed31f880 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>'
      `-CXXFunctionalCastExpr 0x7435ed31f858 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>' functional cast to Shared<int> <ConstructorConversion>
        `-CXXBindTemporaryExpr 0x7435ed31f838 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>' (CXXTemporary 0x7435ed31f838)
          `-CXXConstructExpr 0x7435ed31f800 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>' 'void (const int &)'
            `-MaterializeTemporaryExpr 0x7435ed31f7b8 <col:28> 'const int' lvalue
              `-ImplicitCastExpr 0x7435ed31f7a0 <col:28> 'const int' <NoOp>
                `-IntegerLiteral 0x7435ed31f780 <col:28> 'int' 999
_example.cpp:170:9  return Shared<int>(999)  dump:
ReturnStmt 0x7435ed31f898 <_example.cpp:170:9, col:31>
`-ExprWithCleanups 0x7435ed31f880 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>'
  `-CXXFunctionalCastExpr 0x7435ed31f858 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>' functional cast to Shared<int> <ConstructorConversion>
    `-CXXBindTemporaryExpr 0x7435ed31f838 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>' (CXXTemporary 0x7435ed31f838)
      `-CXXConstructExpr 0x7435ed31f800 <col:16, col:31> 'Shared<int>':'memsafe::Shared<int>' 'void (const int &)'
        `-MaterializeTemporaryExpr 0x7435ed31f7b8 <col:28> 'const int' lvalue
          `-ImplicitCastExpr 0x7435ed31f7a0 <col:28> 'const int' <NoOp>
            `-IntegerLiteral 0x7435ed31f780 <col:28> 'int' 999
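The mark-based switching described above can be sketched without any Clang types. In this simplified model, Item and collectDump are hypothetical names, and it is an assumption of the sketch that any non-empty mark argument turns dumping on while an empty one turns it off (mirroring MEMSAFE_PRINT_AST("*") and MEMSAFE_PRINT_AST("") in the examples).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical item seen by the analyzer in source order; NOT a Clang type.
struct Item {
    bool is_mark;        // true for a dump mark such as MEMSAFE_PRINT_AST(...)
    std::string text;    // mark argument, or a one-line description of a node
};

// Returns the descriptions of the nodes that lie between a start mark
// (non-empty argument) and a stop mark (empty argument).
inline std::vector<std::string> collectDump(const std::vector<Item> &items) {
    std::vector<std::string> dumped;
    bool dumping = false;
    for (const Item &item : items) {
        if (item.is_mark) {
            dumping = !item.text.empty();  // "*" switches on, "" switches off
            continue;                      // the marks themselves are not dumped
        }
        if (dumping) {
            dumped.push_back(item.text);
        }
    }
    return dumped;
}
```

In a real plugin the same flag would presumably be checked in the visitor before dumping a node, so only the fragment between the two marks reaches the output.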

Outputting Clang plugin messages and debug logs

In the first version of the plugin, I was impressed by how easy LibTooling is to use, as its developers keep inventing new tools that make the job easier. In addition to the matcher-based architecture (finding AST nodes that satisfy given conditions), I used Rewriter from RefactoringTool to modify the source code and to print various messages when the required nodes were found in the AST.

This is a nice and simple approach, but it has an unexpected problem. Message output is always tied to a specific place in the source file (which seems quite logical), but messages for the user need to be kept separate from debug messages for the developer of the plugin itself. For the latter, it turned out to be much more convenient to group them at the very end of the output.

This makes it easier to control the level of user messages and the verbosity of debug output independently. Tests are also much easier to automate, because all the necessary messages are grouped in one place instead of being mixed in with Clang's own output. The position of a message in the source code is not lost either: each debug message shows where it was generated, so the IDE can jump to the right place in the source file with a single click on the message.

Example of log output:

#memsafe-log
_cycles.cpp:17:11: #log #102 Detected shared type 'ns::Ext' registered at _cycles.cpp:17:11
_cycles.cpp:18:19: #log #103 Shared class definition 'ns::A' used from another translation unit.
_cycles.cpp:17:11: #log #102 Class 'ns::Ext' checked for cyclic references
_cycles.cpp:30:22: #log #1003 Field with reference to structured data type 'cycles::CircleSelf'
_cycles.cpp:29:12: #log #1002 Detected shared type 'cycles::CircleSelf' registered at _cycles.cpp:29:12
_cycles.cpp:30:22: #err #1003 Class cycles::CircleSelf has a reference to itself through the field type cycles::CircleSelf
_cycles.cpp:30:22: #err #1003 Field type raw pointer
_cycles.cpp:33:12: #log #1006 Detected shared type 'cycles::CircleShared' registered at _cycles.cpp:33:12
_cycles.cpp:34:30: #err #1007 Class cycles::CircleShared has a reference to itself through the field type cycles::CircleShared
_cycles.cpp:38:37: #log #1011 Field with reference to structured data type 'cycles::CircleSelf'
_cycles.cpp:37:12: #log #1010 Detected shared type 'cycles::CircleSelfUnsafe' registered at _cycles.cpp:37:12
_cycles.cpp:30:22: #log #1003 Field with reference to structured data type 'cycles::CircleSelf'
_cycles.cpp:30:22: #err #1003 The class 'cycles::CircleSelfUnsafe' has a circular reference through class 'cycles::CircleSelf'
_cycles.cpp:38:37: #warn #1011 UNSAFE field type raw pointer
...
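Collecting debug messages and printing them as one group at the end could be modeled roughly as follows. This is a sketch under assumptions: Level, Message, and DeferredLog are hypothetical names, the output only imitates the file:line:column and #log/#warn/#err layout of the log above, and the numeric identifiers from the real log are left out because their meaning is not described here.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical severity levels, ordered from least to most severe.
enum class Level { Log, Warn, Err };

// Hypothetical message record; the location is stored as plain values.
struct Message {
    std::string file;
    unsigned line;
    unsigned column;
    Level level;
    std::string text;
};

// Collects messages during analysis and renders them in one group.
class DeferredLog {
public:
    void add(Message message) { messages_.push_back(std::move(message)); }

    // Renders all messages at or above min_level, one per line,
    // in the order in which they were added.
    std::string render(Level min_level) const {
        std::ostringstream out;
        out << "#memsafe-log\n";
        for (const Message &m : messages_) {
            if (m.level < min_level) {
                continue;                  // filtered out by verbosity
            }
            out << m.file << ':' << m.line << ':' << m.column
                << ": #" << tag(m.level) << ' ' << m.text << '\n';
        }
        return out.str();
    }

private:
    static const char *tag(Level level) {
        switch (level) {
            case Level::Log:  return "log";
            case Level::Warn: return "warn";
            case Level::Err:  return "err";
        }
        return "log";                      // unreachable for valid values
    }

    std::vector<Message> messages_;
};
```

Since every line starts with file:line:column, an IDE can still treat it as a clickable location, and a test can compare the rendered block as a whole.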

I also thought it would be nice to tie debug messages to a specific location in the source file, provided that such marks did not change every time lines were added or removed earlier in the file. However, I had to take into account that for code inside a macro, SourceLocation points to where the macro is defined, not where it is used, so I had to do something like this:

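    // Returns the location to report for a declaration. For a declaration
    // that comes from a macro, this is the place where the macro is used.
    SourceLocation getLocation(const Decl *decl) {
        SourceLocation loc = decl->getLocation();
        if (loc.isMacroID()) {
            // Map the location inside the macro to its expansion point.
            return CI.getSourceManager().getExpansionLoc(loc);
        }
        return loc;
    }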

Tracking custom C++ attributes in source code

Objects in the source code are marked, and the plugin's operation is controlled, using custom C++ attributes, which are designed to extend the language and pass additional information to the compiler. They have a standard syntax and are fully compatible with C++ syntax.

But since attributes can be used almost anywhere in C++ code and applied to almost anything (types, variables, functions, names, code blocks, or entire translation units), the plugin must not only accept the new attributes with the correct arguments, but also check that they are applied correctly.

The Clang plugin contains two entities: an attribute parser, which is responsible for checking them, and the plugin itself, which performs AST analysis. These are two completely different classes, each of which is responsible for its own functionality.

But while working on the plugin, it turned out that distributing attribute checks between two classes is very inconvenient, since some of them depend on the context that is formed in another class. As a result, some attribute argument checks became very confusing, and after several unsuccessful attempts, I solved this problem as follows.

The attribute parser only checks the number and types of the attribute arguments and simply attaches them to the AST elements, without interpreting them in any other way. This makes it possible to move all the checking logic into one place (the class that analyzes the AST).

This solution did bring another problem: some attributes can be skipped during the subsequent analysis. It turned out to be easy to protect against this, though. To avoid accidentally forgetting to process an attribute, it is enough to record where each attribute appears in the source text of the C++ program and to mark it as processed when the analyzer handles it. If some attributes are still skipped during analysis and remain unmarked (for example, because some complex usage scenario was not taken into account), the list of skipped (unprocessed) attributes is printed automatically along with the debug dump.
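The bookkeeping described above can be sketched as a small registry. This is a simplified model, not the memsafe implementation: AttributeTracker and its methods are hypothetical names, and locations are stored as plain "file:line:column" strings instead of Clang source locations.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry of attributes seen by the attribute parser.
class AttributeTracker {
public:
    // Called when the attribute parser accepts an attribute:
    // remember where it appeared and that it has not been processed yet.
    void recordSeen(const std::string &location, const std::string &name) {
        seen_.emplace(location, Entry{name, false});
    }

    // Called when the analyzer handles the attribute at this location.
    // Returns false if no attribute was recorded there.
    bool markProcessed(const std::string &location) {
        auto it = seen_.find(location);
        if (it == seen_.end()) {
            return false;
        }
        it->second.processed = true;
        return true;
    }

    // Lists "location: name" for every attribute the analyzer never handled,
    // ordered by location string.
    std::vector<std::string> unprocessed() const {
        std::vector<std::string> result;
        for (const auto &item : seen_) {
            if (!item.second.processed) {
                result.push_back(item.first + ": " + item.second.name);
            }
        }
        return result;
    }

private:
    struct Entry {
        std::string name;
        bool processed;
    };

    std::map<std::string, Entry> seen_;
};
```

Printing the result of unprocessed() at the end of a run gives a direct list of usage scenarios that the analyzer does not cover yet.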

The remaining little things

When developing a Clang plugin, it is a bad idea to use std::cout or std::cerr for tracing. Different buffering settings can make messages appear in strange ways (especially if the compiler crashes while you are debugging the plugin, which sometimes happens), and mixing output streams for debug messages can cause serious confusion. The solution is very simple: when debugging a plugin, use only llvm::outs() or llvm::errs() as output streams.

I also liked color highlighting for important messages in the plugin's output, which saves time when searching for the right line in uniform-looking console output. But this advice is in the realm of "to each his own".

In conclusion

These are probably the most important points I wanted to note about developing a Clang plugin. Maybe I missed something or considered it unimportant, but in any case the plugin sources are available on GitHub, and you can always consult them.
