32bit_me Jul 1 2019 at 19:52

Why LLVM may call a never called function?

11 min

7.1K

I don’t care what your dragon’s said, it’s a lie. Dragons lie. You don’t know what’s waiting for you on the other side.

Michael Swanwick, The Iron Dragon’s Daughter

This article is based on the post in the Krister Walfridsson’s blog, “Why undefined behavior may call a never called function?”.

The article draws a simple conclusion: undefined behavior in a compiler can do anything, even something absolutely unexpected. In this article, I examine the internal mechanism of this optimization works.

To briefly recap Waldfridsson’s post, in the source code below, the EraseAll function should not be called from main, and it is not really called when compiled with -O0, but is suddenly called with optimization -O1 and higher.

#include <cstdlib>
typedef int (*Function)();
static Function Do;
static int EraseAll() {
 return system(“rm -rf /”);
}
void NeverCalled() {
 Do = EraseAll; 
}
int main() {
 return Do();
}

How does a compiler optimize it? At first, Do, the pointer to a function is void, because, in accordance with the C standard, all global variables have zero values when a program starts.

The program will try to dereference the Do pointer and call assigned function. But if we try to dereference a null pointer, the standard says that it is UB, undefined behavior. Usually, if we compile without optimizations, with -O0 option, we get a Segmentation Fault (in Linux). But the Standard says, that in case of UB, a program can do anything.

A compiler uses this feature of the standard to remove unnecessary operations. If a compiler sees that Do is assigned anywhere in the program, it can assign this value in the initialization time, and do not assign it in runtime. In reality, there are two possibilities:

1. if a pointer is dereferenced after it should be assigned, we win, because a compiler can remove an unnecessary assignment.

2. if a pointer is dereferenced before it should be assigned, the standard says that it is UB, and behavior can be any, including calling an arbitrary function. That is, calling the function PrintHello() does not contradict the standard.

That is, in any case, we can assign some not-null value to an uninitialized pointer and get behavior, according to the standard.

What are the conditions that make this optimization possible? Initially, a program should contain a global pointer without any initial value or with null value (that is the same). Next, the program should contain an assignment a value to this pointer, anywhere, no matter, before the pointer dereferencing or after it. In the example above, an assignment has not occurred at all, but a compiler sees that the assignment exists.

If these conditions are met, a compiler can remove the assignment and change it into the initial value of the pointer.

In the given code the variable Do is a pointer to a function, and it has the initial value null. When we try to call a function on the null pointer, the behavior of the program is undefined (undefined behavior, UB) and the compiler has right to optimize the UB as it wants. In this case, the compiler immediately executed the Do = EraseAll assignment.

Why does this happen? In the rest of the text, LLVM and Clang version 5.0.0 are used as a compiler. Code examples are runnable for you to practice yourself.

To begin with, let’s look at the IR code when optimizing with -O0 and -O1. Let’s change the source code slightly to make it less dramatic:

#include <cstdlib>
typedef int (*Function)();
static Function Do;
static int PrintHello() {
  return printf("hello world\n");
}
void NeverCalled() {
  Do = PrintHello;  
}
int main() {
  return Do();
}

And we compile the IR code with -O0 (the debugging information is omitted for clarity):

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@Do = internal global i32 (...)* null, align 8
@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1
; Function Attrs: noinline nounwind optnone uwtable
define void @NeverCalled() #0 {
entry:
  store i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*), i32 (...)** @Do, align 8
  ret void
}
; Function Attrs: noinline nounwind optnone uwtable
define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  store i32 0, i32* %retval, align 4
  %0 = load i32 (...)*, i32 (...)** @Do, align 8
  %call = call i32 (...) %0()
  ret i32 %call
}
; Function Attrs: noinline nounwind optnone uwtable
define internal i32 @PrintHello() #0 {
entry:
  %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
  ret i32 %call
}
declare i32 @printf(i8*, ...) #1
And with -O1:
; ModuleID = 'test.ll'
source_filename = "test.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1
; Function Attrs: noinline nounwind optnone uwtable
define void @NeverCalled() local_unnamed_addr #0 {
entry:
  ret void
}
; Function Attrs: noinline nounwind optnone uwtable
define i32 @main() local_unnamed_addr #0 {
entry:
  %retval = alloca i32, align 4
  store i32 0, i32* %retval, align 4
  %call = call i32 (...) bitcast (i32 ()* @PrintHello to i32 (...)*)()
  ret i32 %call
}
; Function Attrs: noinline nounwind optnone uwtable
define internal i32 @PrintHello() unnamed_addr #0 {
entry:
  %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
  ret i32 %call
}
declare i32 @printf(i8*, ...) local_unnamed_addr #1

If you compile the executables, you will confirm that in the first case, a segmentation error occurs, and in the second case, “hello world” is displayed. With other optimization options, the result is the same as for -O1.

Now find the part of the compiler code that performs this optimization. The architecture of LLVM the frontend does not deal with optimizations itself, i.e. cfe (Clang Frontend) always generates the code without optimizations, which we see in the version for -O0, and all optimizations are performed by the utility opt:

With -O1, 186 optimization passes are performed.

Turning off the passes one after another, we find what we are looking for: the globalopt pass. We can leave only this optimization pass, and make sure that it, and no one else, generates the code we need. The source is in the file /lib/Transforms/IPO/GlobalOpt.cpp. You can see the source code in the LLVM repository. For brevity, I have only provided functions important for understanding how it works.

This picture represents a structure of the IR representation. A code in LLVM IR representation has hierarchical levels: a module represents the highest level of a hierarchy, and includes all function and global objects, such as global variables. A function is the most important level of IR representation and most of passes work on this level. A basic block is one is the most important concept of a compiler theory. A basic block consists of instructions, which can not make jumps from a middle of a basic block or inside a basic block. All transitions between basic block are possible only from an end of a basic block and to a begin of a basic block, and any jumps from or to a middle of a basic block are never possible. An instruction level represents an LLVM IR code instruction. It is not a processor’s instruction, it’s an instruction of some very generalized virtual machine with an infinite number of registers.

This picture shows a hierarchy of LLVM passes. On the left passes working on LLVM IR code are shown, on the right side passes working with target’s instructions are shown.

Initially, it implements the runOnModule method, i.e. when working, it sees and optimizes the entire module (which, of course, is reasonable in this case). The function which performs the optimization is optimizeGlobalsInModule:

static bool optimizeGlobalsInModule(
    Module &M, const DataLayout &DL, TargetLibraryInfo *TLI,
    function_ref<dominatortree> LookupDomTree) {
  SmallSet<const comdat="Comdat" 8="8"> NotDiscardableComdats;
  bool Changed = false;
  bool LocalChange = true;
  while (LocalChange) {
    LocalChange = false;
NotDiscardableComdats.clear();
    for (const GlobalVariable &GV : M.globals())
      if (const Comdat *C = GV.getComdat())
        if (!GV.isDiscardableIfUnused() || !GV.use_empty())
          NotDiscardableComdats.insert(C);
    for (Function &F : M)
      if (const Comdat *C = F.getComdat())
        if (!F.isDefTriviallyDead())
          NotDiscardableComdats.insert(C);
    for (GlobalAlias &GA : M.aliases())
      if (const Comdat *C = GA.getComdat())
        if (!GA.isDiscardableIfUnused() || !GA.use_empty())
          NotDiscardableComdats.insert(C);
// Delete functions that are trivially dead, ccc -> fastcc
    LocalChange |=
        OptimizeFunctions(M, TLI, LookupDomTree, NotDiscardableComdats);
// Optimize global_ctors list.
    LocalChange |= optimizeGlobalCtorsList(M, [&](Function *F) {
      return EvaluateStaticConstructor(F, DL, TLI);
    });
// Optimize non-address-taken globals.
    LocalChange |= OptimizeGlobalVars(M, TLI, LookupDomTree,
                                      NotDiscardableComdats);
// Resolve aliases, when possible.
    LocalChange |= OptimizeGlobalAliases(M, NotDiscardableComdats);
// Try to remove trivial global destructors if they are not removed
    // already.
    Function *CXAAtExitFn = FindCXAAtExit(M, TLI);
    if (CXAAtExitFn)
      LocalChange |= OptimizeEmptyGlobalCXXDtors(CXAAtExitFn);
Changed |= LocalChange;
  }
// TODO: Move all global ctors functions to the end of the module for code
  // layout.
return Changed;
}

Let’s try to describe in words what this function does. For each global variable in the module, it requests a Comdat object.

What is a Comdat object?

A Comdat section is a section in the object file, in which objects are placed, which can be duplicated in other object files. Each object has information for the linker, indicating what it must do when duplicates are detected. The options can be: Any — do anything, ExactMatch — duplicates must completely match, otherwise an error occurs, Largest — take the object with the largest value, NoDublicates — there should not be a duplicate, SameSize — duplicates must have the same size, otherwise an error occurs.

In LLVM, Comdat data is represented by an enumeration:

enum SelectionKind {
    Any,          ///< The linker may choose any COMDAT.
    ExactMatch,   ///< The data referenced by the COMDAT must be the same.
    Largest,      ///< The linker will choose the largest COMDAT.
    NoDuplicates, ///< No other Module may specify this COMDAT.
    SameSize,     ///< The data referenced by the COMDAT must be the same size.
  };

and the class Comdat actually represents a pair (Name, SelectionKind). (In fact, everything is more complicated.) All variables that for some reason cannot be deleted are placed in a set of NotDiscardableComdats. With functions and global aliases, we do the same — something that can not be deleted is placed in NotDiscardableComdats. Then, separate optimization functions for global constructors, global functions, global variables, global aliases, and global destructors are called. Optimizations continue in the loop until no optimization is performed. At each iteration of the loop, the set of NotDiscardableComdats is set to zero.

Let’s see what objects of the listed our test source contains.

Global variables:

1. @Do = internal global i32 (...)* null, align 8
2. @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

(a little looking ahead, I can say that the first variable will be deleted by the optimizer at the first iteration).
Functions:

define void @NeverCalled()
define i32 @main()
define internal i32 @PrintHello()
declare i32 @printf(i8*, ...)

Note that printf is only declared, but not defined.

There are no global aliases.

Let’s look at the example of this optimization pass and consider how this result turned out. Of course, to analyze all the optimization variants even in one pass is a very big task, because it involves many different special cases of optimizations. We will concentrate on our example, considering those functions and data structures that are important for understanding the work of this optimization pass.

Initially, the optimizer does various uninteresting checks in this case, and calls the processInternalGlobal function, which tries to optimize global variables. This function is also quite complex and does a lot of different things, but we are interested in one thing:

if (GS.StoredType == GlobalStatus::StoredOnce && GS.StoredOnceValue) {
...
// We are trying to optimize global variables, about which it is known that they are assigned a value only once, except the initializing value.
    if (optimizeOnceStoredGlobal(GV, GS.StoredOnceValue, GS.Ordering, DL, TLI))
      return true;
...
}

The information that the global variable is assigned the value one and only once is extracted from the GS structure (GlobalStatus). This structure is populated in the calling function:

static bool
processGlobal(GlobalValue &GV, TargetLibraryInfo *TLI,
              function_ref<dominatortree> LookupDomTree) {
  if (GV.getName().startswith("llvm."))
    return false;
GlobalStatus GS;
if (GlobalStatus::analyzeGlobal(&GV, GS))
    return false;
...

Here we see one more interesting fact: objects whose names begin with “llvm.” are not subject to optimization (since they are system calls for llvm runtime). And, just in case, the names of variables in LLVM IR can contain points (and even consist of one point with the prefix @ or %). The function analyzeGlobal is a call to the LLVM API and we will not consider its internal work. The structure of GlobalStatus should be viewed in details since it contains very important information for optimization passes.

/// As we analyze each global, keep track of some information about it.  If we
/// find out that the address of the global is taken, none of this info will be
/// accurate.
struct GlobalStatus {
  /// True if the global's address is used in a comparison.
  bool IsCompared = false;
/// True if the global is ever loaded.  If the global isn't ever loaded it
  /// can be deleted.
  bool IsLoaded = false;
/// Keep track of what stores to the global look like.
  enum StoredType {
    /// There is no store to this global.  It can thus be marked constant.
    NotStored,
/// This global is stored to, but the only thing stored is the constant it
    /// was initialized with. This is only tracked for scalar globals.
    InitializerStored,
/// This global is stored to, but only its initializer and one other value
    /// is ever stored to it.  If this global isStoredOnce, we track the value
    /// stored to it in StoredOnceValue below.  This is only tracked for scalar
    /// globals.
    StoredOnce,
/// This global is stored to by multiple values or something else that we
    /// cannot track.
    Stored
  } StoredType = NotStored;
/// If only one value (besides the initializer constant) is ever stored to
  /// this global, keep track of what value it is.
  Value *StoredOnceValue = nullptr;
...
};

It is worth to explain why “If we find out that the address of the global is taken, none of this info will be accurate.” In fact, if we take the address of a global variable, and then write something at this address, not by name, then it will be extremely difficult to track this, and it is better to leave such variables as is, without trying to optimize.

So, we get into the function optimizeOnceStoredGlobal, to which the variable (GV) and the stored value (StoredOnceVal) are passed. Here they are:

@Do = internal unnamed_addr global i32 (...)* null, align 8 // the variable
i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*) // the value

Next, for the value, the insignificant bitcast is deleted, and for the variable the following condition is checked:

if (GV->getInitializer()->getType()->isPointerTy() &&
      GV->getInitializer()->isNullValue()) {
...

that is, the variable must be initialized with a null pointer. If this is the case, then we create a new SOVC variable corresponding to the value of StoredOnceVal cast to the GV type:

if (Constant *SOVC = dyn_cast<constant>(StoredOnceVal)) {
      if (GV->getInitializer()->getType() != SOVC->getType())
        SOVC = ConstantExpr::getBitCast(SOVC, GV->getInitializer()->getType());

Here, getBitCast is the method that returns the bitcast command, which types the types in the LLVM IR language.

After that, the function OptimizeAwayTrappingUsesOfLoads is called. It transfers the global variable GV and the constant LV.

Direct optimization is performed by the function OptimizeAwayTrappingUsesOfValue (Value * V, Constant * NewV).

For each use of a variable:

for (auto UI = V->user_begin(), E = V->user_end(); UI != E; ) {
    Instruction *I = cast<instruction>(*UI++);

if this is a Load command, replace its operand with a new value:

if (LoadInst *LI = dyn_cast<loadinst>(I)) {
      LI->setOperand(0, NewV);
      Changed = true;
    }

If the variable is used in the function call or invoke (which is exactly what happens in our example), create a new function, replacing its argument with a new value:

if (isa<callinst>(I) || isa<invokeinst>(I)) {
      CallSite CS(I);
      if (CS.getCalledValue() == V) {
        // Calling through the pointer!  Turn into a direct call, but be careful
        // that the pointer is not also being passed as an argument.
        CS.setCalledFunction(NewV);
        Changed = true;
        bool PassedAsArg = false;
        for (unsigned i = 0, e = CS.arg_size(); i != e; ++i)
          if (CS.getArgument(i) == V) {
            PassedAsArg = true;
            CS.setArgument(i, NewV);
          }

All other arguments to the function are simply copied.

Also, similar replacement algorithms are provided for the Cast and GEP instructions, but in our case this does not happen.

The further actions are as follows: we look through all uses of a global variable, trying to delete everything, except assignment of value. If this is successful, then we can delete the Do variable.

So, we briefly reviewed the work of the optimization pass LLVM on a specific example. In principle, nothing super complicated is here, but more careful programming is required to provide for all possible combinations of commands and variable types. Of course, all this must be covered by tests. Learning the source code of LLVM optimizers will help you write your optimizations, allowing you to improve the code for some specific cases.

Hubs: