C++ Static Analysis using Clang

Introduction

Large code bases typically develop rules around how various code constructs should be used. These rules help eliminate bugs resulting from common mistakes. C++ gives programmers a good amount of power over enforcing such rules using the facilities that the language provides. As a simple example, if you have a class that should not be inherited from, you can mark the class as final. Typically one finds themselves in a situation where the language doesn't provide an easy way to enforce something. In that case, the solution is typically enforcing the rules through documentation, code review, etc. Besides the obvious downside of people making mistakes when checking for these rules, detecting violations of the some rules may be completely impractical. Let's look at two examples.

RAII Classes

RAII is an extremely useful idiom. Typically you'll have a class that looks like this:

class AutoResource {
public:
  AutoResource() { /* acquire resource */ }
  ~AutoResource() { /* free resource */ }
};

void foo() {
  AutoResource raii;
}

The goal here is to use the C++ object lifetime rules to ensure that the resource is deallocated when foo() returns. But what prevents someone to write code like this?

void foo() {
  auto raii = new AutoResource(); // ?!?!
}

Here, the programmer has clearly misunderstood the semantics of AutoResource, and in order for the code to function properly, the object needs to be freed manually, which defeats the purpose behind using RAII.

memmoving objects with pointers to their own members

Let's say that you have an array class like this:

class Array {
  int* array;
  int buffer[1024];
  size_t size;
  bool usingOwnBuffer;
  // ...
};

The intention here is to avoid memory allocation for small arrays by having the array member to point to buffer if the size of the array is less or equal to 1024 items, and otherwise allocate the array on the heap. (This is how our nsAutoTArray class works.) It's clear that memmoving such an object in memory is dangerous, since if array is pointing to the internal buffer, then after the memmove, it will be a dangling pointer. Now, suppose that you had a container class that was known to memmove the objects it contained when the size of the container changes. Using such a container class to hold objects including your Array class would be disastrous, especially since now we have a bug that depends on the size of an array, so we may not be able to detect it when testing using smaller values, but the users of our software would hit it in practice if the array got larger than 1024 entries, and a crash report would show evidence of memory corruption without any direct correlation to where the bug is. What makes this terrible is that detecting this manually is extremely difficult, since you need to look for things such as classes which have a member that inherits from a class that has a member which is a struct containing an Array member (as such as class will include an Array object in its in-memory object layout). Nobody wants to debug such issues in practice. It turns out that we can do much better, by rejecting such programs at compile time.

Static Analysis based on the program AST

There are a number of different strategies for detecting problems in code before it runs. One technique that is relatively simple and quite powerful is based on analyzing the program AST looking for problematic patterns and rejecting such programs. This allows you to narrow down the scope of a generic programming language to the specifics of your environment, and it reassures that once the program passes the static analyses successfully, it is unaffected by such problems. Over the past two and a half years, we have developed a custom analyzer implemented as a clang plugin to check Firefox for a vast number of issues. We have hooked this up in Treeherder so that if a new commit violates a rule that the analyzer checks, it will create a custom compiler error. Here are two real examples of me trying to compile code that violates the above rules:

 ServiceWorkerEvents.cpp:508:21: error: variable of type 'mozilla::dom::workers::AutoCancel' only valid on the stack
   auto autoCancel = new AutoCancel(this, mRequestURL);
                     ^
 ServiceWorkerEvents.cpp:508:21: note: value incorrectly allocated on the heap

 nsTArray.h:660:34: error: Cannot instantiate 'nsTArray_CopyChooser' with non-memmovable template argument 'mozilla::PaintedLayerData'
 struct MOZ_NEEDS_MEMMOVABLE_TYPE nsTArray_CopyChooser
                                  ^
 nsTArray.h:791:42: note: instantiation of 'nsTArray_CopyChooser' requested here
   : public nsTArray_base<alloc, typename="" nstarray_copychooser&lt;e="">::Type>
                                          ^
 FrameLayerBuilder.cpp:433:17: note: 'mozilla::PaintedLayerData' is a non-memmove()able type because member 'mLog' is a non-memmove()able type 'nsAutoCString'
   nsAutoCString mLog;
                 ^

These checks require the compiler to know more about your program than it is possible to express in C++. For example, for the above two checks to work, the compiler needs to know which classes are RAII classes, which container classes cannot accept non-memmoveable types, and so on. There are various techniques for representing this information; we typically rely on attributes hidden behind macros to add annotations to various types/etc. to deliver the extra information. For example, as long as you remember to annotate an object that is not safe to be moved in memory as MOZ_NON_MEMMOVABLE, the custom analyzer will prevent it to be abused as above.

Using clang to create custom analyzers

Most people think of clang as a compiler, but it is much more than that. Clang is developed as a set of libraries that are possible to reuse according to your needs. Furthermore, it allows you to write plugins that can be loaded as part of the normal build, or create your own tools that you can replace with the real compiler in your build system if your needs go beyond additional checks at compile time. For the purpose of writing a custom analyzer, clang provides several great features:

Its support for plugins means that you get integration with the build system almost for free by adjusting CXXFLAGS to load the plugin.
It does many things that you need for you. For example, it parses the code and performs semantic analysis on it for you, has a very powerful diagnostics library which lets you generate errors, warnings and notes that appear similar to normal compiler diagnostics that tools know how to consume, etc.
It is fairly well documented.
It has features that will surprise you. For example, if you generate fix-it hints in your analyzer, your analyzer can not only detect issues, but it can also fix them for you!
It is relatively simple to use.

In the rest of this post, I'll demonstrate how to create a custom analyzer from scratch without any prior knowledge of clang. I won't describe how to create a plugin, since that is documented. I'll assume that you're familiar with the basics of compiler internals, such as ASTs.

Creating a custom analyzer from scratch

The normal work flow for creating a custom analyzer looks like this:

Formulate the problem in terms of the structure of the AST
Identify the pattern to be detected in ASTs that clang generates
Filter the AST nodes to find the interesting pattern in the input program
Perform the checks on the AST nodes
Generate diagnostics as needed

Let's focus on a classic example: checking sizeof usage in presence of memset. More specifically, we want to detect problems such as:

memset(pointer, 0, sizeof(pointer));

The issue here is that the programmer probably wants to zero out the object pointed to by pointer, but they are using sizeof(pointer) instead of sizeof(*pointer), therefore only writing to the first 4 or 8 bytes of the object depending on the size of pointers on the target architecture. In the rest of this post, I will demonstrate how to write a custom analysis check to detect this pattern. These days, clang generates a -Wsizeof-pointer-memaccess warning for code like this, but I've chosen this example for its simplicity. Let's begin by formulating the problem in terms of the structure of the AST. We are looking for function calls where the callee is memset, the first argument is a pointer to an object, the second argument is 0, and the third argument is sizeof with the same argument as the first argument to the memset call. That sounds simple enough. Now we need to know how these concepts map to the AST that clang generates internally. The easiest way to answer this question is to write a small test program and then ask clang to dump its AST. Here is an example program:

#include &lt;string.h>
struct S { int i1, i2; double d; };
void foo(S* s) {
  memset(s, 0, sizeof(s));
}

By passing the -Xclang -ast-dump arguments to clang, you can get a dump of the entire AST in a simple to read text format. Note that because of the <string.h> inclusion, the AST is huge, but let's focus only on the foo body: Each node in Clang's AST is a C++ object. The names in the beginning of a line are the names of the C++ classes for each node. Some of the names are self-describing, for example, FunctionDecl is an AST node that represents a function declaration; CallExpr is an AST node that represents a call expression. Searching for these names leads you to the doxygen documentation for them. From each node you can access its properties. The documentation is usually helpful to figure out how to get the information that you need from a node. But before we get to that, we need to find call expressions that are interesting in the first place! Clang supports a library called AST Matchers that makes this easy, by allowing you to create matcher predicates that narrow down the AST into the specific section that we're interested in. In order to get started, we begin with callExpr(), which is a node matcher that matches CallExprs. The first thing that we need to filter based on is the name of the function being called. Because we want to filter on something to narrow down our search, we should look at narrowing matchers. After a quick look, we find hasName() which takes a string and returns a Matcher<NamedDecl>. A NamedDecl is a declaration with a name, so we need to get the node representing the function being called from our callExpr(), so we need to look for a traversal matcher. After looking around a bit, we'll find the callee() traversal matcher which matches the called function. Its documentation has an example of matching a method call. We can copy that idea but instead match a function declaration. Combining all of this, we'll get to:

callExpr(callee(functionDecl(hasName("::memset"))))

Now we need to look at the arguments. The documentation suggests we need to use the hasArgument() matcher. Note in the AST dump that the first argument is an ImplicitCastExpr, which represents an implicit cast to void* even though this has not been written in the text of our test program. This is because clang modifies the AST as it performs semantic analysis and mutates the AST to make it contain richer information than a typical AST. This is very useful! Thankfully, hasArgument() knows to ignore this extra node in the AST. Now we get to:

callExpr(callee(functionDecl(hasName("::memset"))),
         hasArgument(0, declRefExpr()))

On to the second argument. This one is easy, just an integer literal with value 0. A quick look at the docs leads us to the integerLiteral() and equals() matchers, which, after being put together, bring us one step closer:

callExpr(callee(functionDecl(hasName("::memset"))),
         hasArgument(0, declRefExpr()),
         hasArgument(1, integerLiteral(equals(0))))

Now, in order to match the third argument, we can use the sizeOfExpr() matcher. Note that in C++, sizeof can appear with or without parentheses. A quick experiment with both cases shows that when you don't omit the parentheses, clang injects a ParenExpr into the AST. As the parentheses are not interesting to us, we can use has() to match ASTs with and without them. This leads us to this matcher:

callExpr(callee(functionDecl(hasName("::memset"))),
         hasArgument(0, declRefExpr()),
         hasArgument(1, integerLiteral(equals(0))),
         hasArgument(2, sizeOfExpr(has(declRefExpr()))))

Note that there is still something missing from this matcher. The first and third arguments are only interesting if they're pointer types, and are referring to the same variable. To filter non-pointer type arguments, we can use hasType() and pointerType(). There is no facility to check whether two DeclRefExprs are pointing to the same declaration in AST matchers, so we need to take care of that final bit of detail later. Our final matcher looks like this:

callExpr(callee(functionDecl(hasName("::memset"))),
         hasArgument(0, declRefExpr(hasType(pointerType()))),
         hasArgument(1, integerLiteral(equals(0))),
         hasArgument(2, sizeOfExpr(has(declRefExpr(hasType(pointerType()))))))

Now that we have a matcher, we need to register it with clang and get it to perform some work for us when the matcher matches something in the AST. This can be done using the MatchFinder class. First, we'll need to create a callback class:

class MemsetChecker : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override;
};

Then we need to create a MatchFinder, and register our matcher and callback with it:

void analyze() {
  MatchFinder astMatcher;
  MemsetChecker memsetChecker;
  auto matcher =
    callExpr(callee(functionDecl(hasName("::memset"))),
             hasArgument(0, declRefExpr(hasType(pointerType()))),
             hasArgument(1, integerLiteral(equals(0))),
             hasArgument(2, sizeOfExpr(has(declRefExpr(hasType(pointerType()))))));
  astMatcher.addMatcher(matcher, &memsetChecker);
}

With this setup, MatchFinder calls our callback function every time that it performs a successful match. The callback function just needs to check that the first and third arguments point to the same declaration, and in that case generate some diagnostics. Therefore, we need to make it possible for the callback function to extract the DeclRefExprs for the first and third arguments. Node matchers support a bind() method that can be used in order to give them an ID that the match callback can extract. In order to keep the matcher readable, I will factor out some of the matchers into a couple of variables, like this:

  auto firstDeclRefExpr = declRefExpr(hasType(pointerType())).bind("first");
  auto thirdDeclRefExpr = declRefExpr(hasType(pointerType())).bind("third");
  auto matcher =
    callExpr(callee(functionDecl(hasName("::memset"))),
             hasArgument(0, firstDeclRefExpr),
             hasArgument(1, integerLiteral(equals(0))),
             hasArgument(2, sizeOfExpr(has(thirdDeclRefExpr))));

Now these DeclRefExprs have a name. The match callback function looks like this:

void MemsetChecker::run(const MatchFinder::MatchResult &Result) {
  DiagnosticsEngine &Diag = Result.Context->getDiagnostics();
  unsigned ErrorID = Diag.getDiagnosticIDs()->getCustomDiagID(
      DiagnosticIDs::Error, "Invalid size argument passed to memset");
  unsigned NoteID = Diag.getDiagnosticIDs()->getCustomDiagID(
      DiagnosticIDs::Note, "Did you mean to dereference this variable?");

  const auto *first = Result.Nodes.getNodeAs("first");
  const auto *third = Result.Nodes.getNodeAs("third");

  if (first->getFoundDecl() == third->getFoundDecl()) {
    Diag.Report(third->getLocation(), ErrorID);
    Diag.Report(third->getLocation(), NoteID);
  }
}

A lot of this is boilerplate that you need in order to generate diagnostics. We basically create an error and a note ID, get our to DeclRefExprs out of the match result. Then, in order to verify that the two are pointing to the same declaration, we need to compare their corresponding Decls (denoting the respective declarations, which can be retrieved through getFoundDecl()) and if they are the same, we have found a potentially invalid size being passed to memset so we'd need to diagnose that. One thing to note about generating diagnostics is that we end up passing a SourceLocation object (obtained via getLocation()) indicating a location inside the source code. This will allow clang to show exactly what code caused the diagnostics to appear, and it takes care of things such as macro expansions for us, generating the usual clang output indicating where the code in question is coming from. Let's have a look at the output of the compiler plugin on our original test program:

$ clang -Xclang -load -Xclang path/to/plugin.dylib -Xclang -add-plugin -Xclang memset-checker -c TestMemset.cpp
TestMemset.cpp:4:23: error: Invalid size argument passed to memset
  memset(s, 0, sizeof(s));
                      ^
TestMemset.cpp:4:23: note: Did you mean to dereference this variable?
2 errors generated.

The -Xclang arguments are directly passed to cc1 which is the actual compiler run by the normal clang command that one uses on a daily basis. You'll note that in order to generate a series of arguments for cc1 that way, we need to prefix all of them with -Xclang. memset-checker is the name of the plugin action passed to [FrontendPluginRegistry](http://clang.llvm.org/doxygen/FrontendPluginRegistry_8h.html)::Add() as documented here. Now if we run the same command on a slightly different program that uses some macro expansions, clang takes care of generating a good diagnostic without any further attempt from our side:

$ cat TestMemset.cpp
#include 
#define SIZE_OF(x) sizeof x
#define ZERO_OUT(ptr) memset(ptr, 0, SIZE_OF(ptr))
struct S { int i1, i2; double d; };
void foo(S* s) {
  ZERO_OUT(s);
}
$ clang -Xclang -load -Xclang path/to/plugin.dylib -Xclang -add-plugin -Xclang memset-checker -c TestMemset.cpp
TestMemset.cpp:6:12: error: Invalid size argument passed to memset
  ZERO_OUT(s);
           ^
TestMemset.cpp:3:46: note: expanded from macro 'ZERO_OUT'
#define ZERO_OUT(ptr) memset(ptr, 0, SIZE_OF(ptr))
                                             ^
TestMemset.cpp:2:27: note: expanded from macro 'SIZE_OF'
#define SIZE_OF(x) sizeof x
                          ^
TestMemset.cpp:6:12: note: Did you mean to dereference this variable?
TestMemset.cpp:3:46: note: expanded from macro 'ZERO_OUT'
#define ZERO_OUT(ptr) memset(ptr, 0, SIZE_OF(ptr))
                                             ^
TestMemset.cpp:2:27: note: expanded from macro 'SIZE_OF'
#define SIZE_OF(x) sizeof x
                          ^
2 errors generated.

Conclusion

The purpose of this example analysis was to demonstrate the process of implementing such checks, and many more examples of the existing checks for Firefox can be found in clang-plugin.cpp. In order to familiarize yourself with what these checks do, you can look at our unit tests. Each check comes with a unit tests that is a self-contained small C++ program written using special comments that let clang verify that expected diagnostics are generated in the expected places. These tests are run by passing -verify to clang, which causes the compiler to verify the diagnostics as they appear in the source code. Using the techniques that I demonstrated in this post, you can write your own checks, as long as the problem that you're trying to detect can be expressed in terms of the structure of the AST.