Firensis Apr 8 2020 at 09:06

How does strange code hide errors? TensorFlow.NET Analysis

15 min

901

PVS-Studio corporate blog.NET*C#*Open source*Machine learning*

Static analysis is an extremely useful tool for any developer, as it helps to find in time not only errors, but also suspicious and strange code fragments that may cause bewilderment of programmers who will have to work with it in the future. This idea will be demonstrated by the analysis of the TensorFlow.NET open C# project, developed for working with the popular TensorFlow machine learning library.

My name is Nikita Lipilin. Some time ago I joined the C# programmers department of PVS-Studio. Traditionally, all newcomers of the team write articles that cover the results of checking various open projects using the PVS-Studio static analyzer. Such articles help new employees to get to know the product better, and at the same time provide additional benefits in terms of popularizing the static analysis methodology. I suggest that you check out my first article on the topic of open projects analysis.

Introduction

The variety of possible errors in the program code is amazing. Some of them reveal themselves immediately upon a brief glance on a created application. Other bugs are difficult to notice even during the code review by a team of experienced developers. However, it also happens that due to inattention or some other reason, the programmer sometimes writes simply strange and illogical code, which, nevertheless, (seems to) successfully fulfill its function. Only afterwards some unanswered questions appear when returning to what was written or when other people study the code.

Refactoring of old code might result in problems especially when other parts of the program depend on it. Therefore, even after finding some brazenly ugly constructions, the method «If it ain't broke don't fix it» is still being applied. Eventually it becomes difficult to study the source code, therefore expanding available capabilities gets more complicated. The code base gets clogged — it becomes more likely that a small and unwitnessed potentially unpleasant internal problem will not be fixed in the right time.

At some point, the consequences of this error will be felt, but catching it will take a lot of time, because the developer's suspicions will fall on a huge number of strange code fragments that were not refactored at one time. It follows from this that various problems and oddities in a particular fragment should be corrected immediately after its writing. In the case when there are reasonable reasons to leave everything as it is, such a fragment should be followed by an explanatory comment. For example, if the code is some kind of a draft for the future.

It is also worth noting that, regardless of the developer's qualifications, some problematic and simply unsuccessful moments can slip away from their eyes. In some cases, once a temporary solution is applied, it will soon become permanent. Subsequently, the analysis of such code (most likely, another developer will be engaged in this) will take an unacceptably much effort.

Code review can be of help in such cases. However, if the task is a complex beast, then this will require a lot of time. In addition, when there are a lot of small errors or shortcomings, then the checking developer may well not notice high-level errors behind them. Code verification becomes a tedious routine, leading to a gradual decrease of the review effectiveness.

Obviously, routine tasks are best to be delegated to a computer. This approach is used in many areas of modernity. Automation of various processes is the key to prosperity. What is automation in terms of this topic?

A reliable assistant in solving the problem of writing reasonable and stable working code is static analysis. Each time before sending the results of their activities to the review, the programmer will be able to conduct an automated check and not burden other developers with unnecessary work. The code will be sent for review only after all analyzer warnings have been taken into account: errors have been fixed, and strange moments have been rewritten or at least explained by a comment.

Of course, the need for code review does not fall away, but static analysis complements and greatly simplifies its implementation. A sufficiently large part of the errors will be fixed thanks to the analyzer, and strange moments will definitely not be forgotten and will be marked accordingly. It follows, that when reviewing the code one will be able to focus on the implementation of complex logical interactions and detecting underlying problems. Alas, they can't be identified by the analyzer so far.

TensorFlow.NET

This article is inspired by the TensorFlow.NET project. It gives the ability to work with the popular TensorFlow machine learning library via the C# code. Speaking of which, we've also checked it. This idea seemed quite interesting, because at the time of this writing, working with the library is available only in terms of Python, Java, and Go.

The source code available on GitHub is constantly being updated and now its size is a bit more than one hundred thousand lines. After a superficial study, I got the incredible urge to check it using static analysis. PVS-Studio was used as a specific tool, which has proved its effectiveness in a fairly large number of different projects.

For TensorFlow.NET, the analyzer issued the following number of warnings: 39 of the High level, 227 — Medium level and 154 — Low level. You can read about warning levels here in the subsection «Warning levels and diagnostic rule sets». A detailed analysis of each of them would make this article endless, so I'm going to describe only the most interesting ones. It is also worth noting that some problems repeat several times in the project. Review of every such fragment is beyond the purpose of this text.

The project sets itself a rather challenging task. Unfortunately, the appearance of various kinds of strange code fragments is inevitable. In this article I will try to show that the use of static analysis can greatly simplify the work of programmers by pointing to areas that may cause questions. A warning does not always indicate an error, it might be the code that would cause someone's questions. Accordingly, the code is more likely to be either rewritten or commented in the right way.

Fragments that attracted attention when studying the analyzer report

In fact, a fairly large number of analyzer warnings for this project can be called not exactly errors but strange code. When looking through the lines of code triggered warnings, I feel at least puzzled. Some of the given examples might be temporary solutions. Despite this, they are not commented. A person working with this code in future will have some questions about it, leading to a waste of time in the search of answers to them.

At the same time, some warnings point to code that is obviously not just weird, but simply wrong. This is the main danger of strange code — it is extremely difficult to notice a real error among strange solutions at every turn. A reader gradually gets used to the fact that the code seems wrong.

Sophisticated collection traverse

private static void _RemoveDefaultAttrs(....)
{
  var producer_op_dict = new Dictionary<string, OpDef>();
  producer_op_list.Op.Select(op =>
  {
    producer_op_dict[op.Name] = op;
    return op;
  }).ToArray();           
  ....
}

Analyzer warning: V3010 The return value of function 'ToArray' is required to be utilized. importer.cs 218

The analyzer considers the call to ToArray suspicious in this place, as the value, returned by this function is not assigned to a variable. However, such code is not an error. This construction is used to fill the producer_op_dict dictionary by values, corresponding to the elements of the producer_op_list.Op list. Calling ToArray is needed so that the function passed as an argument of the Select method is called for all collection elements.

In my opinion, the code does not look the best. Filling out the dictionary is somewhat unobvious, and some developers may want to remove the ″unnecessary″ call to ToArray. It would be much simpler and more understandable to use the foreach loop here:

var producer_op_dict = new Dictionary<string, OpDef>();

foreach (var op in producer_op_list.Op)
{
  producer_op_dict[op.Name] = op;
}

In this case, the code looks as simple as possible.

Another similar fragment looks like this:

public GraphDef convert_variables_to_constants(....)
{
  ....
  inference_graph.Node.Select(x => map_name_to_node[x.Name] = x).ToArray();
  ....
}

Analyzer warning: V3010 The return value of function 'ToArray' is required to be utilized. graph_util_impl.cs 48

The only difference is that such a piece of code looks more concise. However, it is still tempting to remove the ToArray call, which still looks unobvious.

Temporary solution

public GraphDef convert_variables_to_constants(....)
{
  ....
  var source_op_name = get_input_name(node);
  while(map_name_to_node[source_op_name].Op == "Identity")
  {
    throw new NotImplementedException);
    ....
  }
  ....
}

Analyzer warning: V3020 An unconditional 'throw' within a loop. graph_util_impl.cs 73

In this project, the following approach is often used: if some kind of behavior has to be implemented later, NotImplementedException is thrown where appropriate. It is clear why the analyzer warns of a possible error in this piece: using while instead of if does not really look too reasonable.

This is not the only warning that appears due to the use of temporary solutions. For example, there is such a method:

public static Tensor[] _SoftmaxCrossEntropyWithLogitsGrad(
  Operation op, Tensor[] grads
)
{
  var grad_loss = grads[0];
  var grad_grad = grads[1];
  var softmax_grad = op.outputs[1];
  var grad = _BroadcastMul(grad_loss, softmax_grad);

  var logits = op.inputs[0];
  if(grad_grad != null && !IsZero(grad_grad)) // <=
  {
    throw new NotImplementedException("_SoftmaxCrossEntropyWithLogitsGrad");
  }

  return new Tensor[] 
  {
    grad,
    _BroadcastMul(grad_loss, -nn_ops.log_softmax(logits))
  };
}

Analyzer warning: V3022 Expression 'grad_grad != null && !IsZero(grad_grad)' is always false. nn_grad.cs 93

In fact, the exception NotImplementedException("_SoftmaxCrossEntropyWithLogitsGrad") will never be thrown, since the code is simply unreachable. In order to unravel the reason, we need to refer to the code of the IsZero function:

private static bool IsZero(Tensor g)
{
  if (new string[] { "ZerosLike", "Zeros" }.Contains(g.op.type))
    return true;

  throw new NotImplementedException("IsZero");
}

The method either returns true or throws an exception. This code is not an error — obviously, the implementation here is left for later. What is really important here is for this «later» to come true. Well, we are lucky that PVS-Studio will not let you forget that there is such an imperfection here :)

Is Tensor is Tensor?

private static Tensor[] _ExtractInputShapes(Tensor[] inputs)
{
  var sizes = new Tensor[inputs.Length];
  bool fully_known = true;
  for(int i = 0; i < inputs.Length; i++)
  {
    var x = inputs[i];

    var input_shape = array_ops.shape(x);
    if (!(input_shape is Tensor) || input_shape.op.type != "Const")
    {
      fully_known = false;
      break;
    }

    sizes[i] = input_shape;
  }
  ....
}

Analyzer warning: V3051 An excessive type check. The object is already of the 'Tensor' type. array_grad.cs 154

The type of the return value of the shape method is Tensor. Thus, the input_shape is Tensor check looks at least weird. Perhaps, once the method returned a value of a different type and the check made sense, but it is also possible that instead of Tensor the condition should specify some kind of an heir of this class. One way or another, the developer should pay attention to this fragment.

Decent condition checking

public static Tensor[] _BaseFusedBatchNormGrad(....)
{
  ....
  if (data_format == "NCHW") // <=
    throw new NotImplementedException("");

  var results = grad_fun(new FusedBatchNormParams
  {
    YBackprop = grad_y,
    X = x,
    Scale = scale,
    ReserveSpace1 = pop_mean,
    ReserveSpace2 = pop_var,
    ReserveSpace3 = version == 2 ? op.outputs[5] : null,
    Epsilon = epsilon,
    DataFormat = data_format,
    IsTraining = is_training
  });

  var (dx, dscale, doffset) = (results[0], results[1], results[2]);
  if (data_format == "NCHW") // <=
    throw new NotImplementedException("");

  ....
}

Analyzer warnings:

V3021 There are two 'if' statements with identical conditional expressions. The first 'if' statement contains method return. This means that the second 'if' statement is senseless nn_grad.cs 230
V3022 Expression 'data_format == «NCHW»' is always false. nn_grad.cs 247

Unlike some of the previous examples, there is clearly something wrong with this code. The second check does not make any sense, since if the condition is true, then the execution of the program will not reach it at all. Perhaps some typo is allowed here, or one of the checks is simply superfluous.

The illusion of choice

public Tensor Activate(Tensor x, string name = null)
{
  ....
  Tensor negative_part;
  if (Math.Abs(_threshold) > 0.000001f)
  {
    negative_part = gen_ops.relu(-x + _threshold);
  } else
  {
    negative_part = gen_ops.relu(-x + _threshold);
  }
  ....
}

Analyzer warning: V3004 The 'then' statement is equivalent to the 'else' statement. gen_nn_ops.activations.cs 156

A rather amusing demonstration of the effectiveness of using static analysis in development. It is difficult to come up with a sensible reason why the developer wrote this particular code. Most likely, this is a typical copy-paste error. Although this, of course, may be another ″for later″ example.

There are other fragments like this, for example:

private static Operation _GroupControlDeps(
  string dev, Operation[] deps, string name = null
)
{
  return tf_with(ops.control_dependencies(deps), ctl =>
  {
    if (dev == null)
    {
      return gen_control_flow_ops.no_op(name);
    }
    else
    {
      return gen_control_flow_ops.no_op(name);
    }
  });
}

Analyzer warning: V3004 The 'then' statement is equivalent to the 'else' statement. control_flow_ops.cs 135

Maybe once the check made sense. Nevertheless, over time it either got lost, or in the future it is planned to make some additional changes. However, neither of these options seems to be sufficient justification for leaving something like this in the code, without explaining this oddity in any way. With a high degree of probability, a copy-paste error was made here in exactly the same way.

Late check

public static Tensor[] Input(int[] batch_shape = null,
  TF_DataType dtype = TF_DataType.DtInvalid,
  string name = null,
  bool sparse = false,
  Tensor tensor = null)
{
  var batch_size = batch_shape[0];
  var shape = batch_shape.Skip(1).ToArray(); // <=

  InputLayer input_layer = null;
  if (batch_shape != null)                   // <=
    ....
  else
    ....

  ....
}

Analyzer warning: V3095 The 'batch_shape' object was used before it was verified against null. Check lines: 39, 42. keras.layers.cs 39

A classic and rather dangerous mistake of the potential use of a variable, which is a reference to nowhere. In doing so the code clearly implies the possibility that null will take place in batch_shape. This is clear both from the arguments list and the subsequent check of the same variable. Thus, the analyzer here indicates an evident error.

Another ″for later″ fragment?

public MnistDataSet(
  NDArray images, NDArray labels, Type dataType, bool reshape // <=
) 
{
  EpochsCompleted = 0;
  IndexInEpoch = 0;

  NumOfExamples = images.shape[0];

  images = images.reshape(
    images.shape[0], images.shape[1] * images.shape[2]
  );
  images = images.astype(dataType);
  // for debug np.multiply performance
  var sw = new Stopwatch();
  sw.Start();
  images = np.multiply(images, 1.0f / 255.0f);
  sw.Stop();
  Console.WriteLine($"{sw.ElapsedMilliseconds}ms");
  Data = images;

  labels = labels.astype(dataType);
  Labels = labels;
}

Analyzer warning: V3117 Constructor parameter 'reshape' is not used. MnistDataSet.cs 15

Like some other oddities, this is most likely due to the fact that the functionality is far from fully implemented. It is quite possible that the reshape parameter will be used somehow in this constructor in the future. So far, I have the feeling that it is left here without any reason. If it has been really left here «for later», it should have been followed by a comment. If not, the code constructing the object will have to pass the constructor an extra parameter. There might be the case when this step is best to me omitted.

Uncatchable possible null dereference

public static Tensor[] _GatherV2Grad(Operation op, Tensor[] grads)
{
  ....
  if((int)axis_static == 0)
  {
    var params_tail_shape = params_shape.slice(new NumSharp.Slice(start:1));
    var values_shape = array_ops.concat(
      new[] { indices_size, params_tail_shape }, 0
    );
    var values = array_ops.reshape(grad, values_shape);
    indices = array_ops.reshape(indices, indices_size);
    return new Tensor[]
    {
      new IndexedSlices(values, indices, params_shape), // <=
      null,
      null
    };
  }
  ....
}

Analyzer warning: V3146 Possible null dereference of the 1st argument 'values' inside method. The '_outputs.FirstOrDefault()' can return default null value. array_grad.cs 199

In order to get the point of the problem, we have to refer to the IndexedSlices constructor code:

public IndexedSlices(
  Tensor values, Tensor indices, Tensor dense_shape = null
)
{
  _values = values;
  _indices = indices;
  _dense_shape = dense_shape;

  _values.Tag = this; // <=
}

Obviously, passing null to this constructor will result in an exception. However, why does the analyzer consider that the values variable may contain null?

PVS-Studio uses the Data-Flow Analysis technique, which allows you to find the sets of possible variables' values in different parts of the code. The warning tells us that null can be returned in the specified variable in the following line: _outputs.FirstOrDefault(). At the same time the above code implies that the value of the values variable is received by calling array_ops.reshape(grad, values_shape). Then what's that got to do with _outputs.FirstOrDefault()?

The fact is that when analyzing the data flow, not only is the current function considered, but also all called ones. In doing so, PVS-Studio receives information about the set of possible values of any variable anywhere. Therefore, the warning means that the implementation of array_ops.reshape(grad, values_shape) contains the call of _outputs.FirstOrDefault(), the result of which is ultimately returned.

To verify this, let's go to the reshape implementation:

public static Tensor reshape<T1, T2>(T1 tensor, T2 shape, string name = null)
            => gen_array_ops.reshape(tensor, shape, null);

Then go to the reshape method called inside:

public static Tensor reshape<T1, T2>(T1 tensor, T2 shape, string name = null)
{
  var _op = _op_def_lib._apply_op_helper(
    "Reshape", name, new { tensor, shape }
  );
  return _op.output;
}

The _apply_op_helper function returns the object of the Operation class, containing the output property. It is upon receipt of its value that the code described in the warning is called:

public Tensor output => _outputs.FirstOrDefault();

Tensor is, of course, a reference type, so the default value for it will be null. From all this it can be seen that PVS-Studio meticulously analyzes the logical structure of the code, penetrating deep into the structure of calls.

The analyzer did what it had to and indicated a potentially problematic place. The only thing a programmer has to check is whether a situation may arise when elements in _outputs are absent.

Thus, the static analysis will at least make the developer pay attention to the suspicious fragment in order to evaluate if the error may actually occur there. With this approach, the number of errors that go unnoticed will be rapidly reduced.

Unreliable waiting?

private (LoopVar<TItem>, Tensor[]) _BuildLoop<TItem>(
  ....
) where ....
{
  ....
  // Finds the closest enclosing non-None control pivot.
  var outer_context = _outer_context;
  object control_pivot = null;
  while (outer_context != null && control_pivot == null) // <=
  {

  }

  if (control_pivot != null)
  {

  }
  ....
}

Analyzer warning: V3032 Waiting on this expression is unreliable, as compiler may optimize some of the variables. Use volatile variable(s) or synchronization primitives to avoid this. WhileContext.cs 212

The analyzer indicates that such an implementation of waiting can be optimized by the compiler, but I doubt that authors really tried to implement waiting here — most likely, the code is simply not written up to the end and is planned to be finalized in the future. It might be worth throwing the NotImplementedException here, given that this practice is used elsewhere in the project. Anyway, in my opinion, an explanatory comment would come in handy.

Violation of bounds

public TensorShape(int[][] dims)
{
  if(dims.Length == 1)
  {
    switch (dims[0].Length)
    {
      case 0: shape = new Shape(new int[0]); break;
      case 1: shape = Shape.Vector((int)dims[0][0]); break;
      case 2: shape = Shape.Matrix(dims[0][0], dims[1][2]); break; // <=
      default: shape = new Shape(dims[0]); break;
    }
  }
  else
  {
    throw new NotImplementedException("TensorShape int[][] dims");
  }
}

Analyzer warning: V3106 Possibly index is out of bound. The '1' index is pointing beyond 'dims' bound. TensorShape.cs 107

Among the weird snippets of code that I looked through, I saw a real mistake, which is very difficult to notice. The following fragment is erroneous here: dims[1][2]. Getting an element with index 1 from an array of one element is obviously a mistake. At the same time, if we change the fragment for dims[0][2], another error will appear — getting an element with the index 2 from the array dims[0], the length of which is 2 in this case branch. Thus, this problem turned out to be with a «double bottom».

In any case, this code fragment should be studied and corrected by the developer. In my opinion, this example is an excellent illustration of the Data Flow Analysis performance in PVS-Studio.

A typo?

private void _init_from_args(object initial_value = null, ....) // <=
{
  var init_from_fn = initial_value.GetType().Name == "Func`1"; // <=
  ....
  tf_with(...., scope =>
  {
    ....
    tf_with(...., delegate
    {
      initial_value = ops.convert_to_tensor(  // <=
        init_from_fn ? (initial_value as Func<Tensor>)():initial_value,
        name: "initial_value",
        dtype: dtype
      );
    });
    _shape = shape ?? (initial_value as Tensor).TensorShape;
    _initial_value = initial_value as Tensor; // <=
    ....
    _dtype = _initial_value.dtype.as_base_dtype(); // <=

    if (_in_graph_mode)
    {
      ....

      if (initial_value != null) // <=
      {
        ....
      }

      ....
    }

    ....
  });
}

To understand the code above, it is also worth citing the implementation of the tf_with function:

[DebuggerStepThrough] // with "Just My Code" enabled this lets the 
[DebuggerNonUserCode()]  //debugger break at the origin of the exception
public static void tf_with<T>(
  T py, Action<T> action
) where T : ITensorFlowObject
{
  try
  {
    py.__enter__();
    action(py);
  }
  finally
  {
    py.__exit__();
    py.Dispose();
  }
}

Analyzer warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'initial_value', '_initial_value'. ResourceVariable.cs 137

_init_from_args is a fairly voluminous function, so many fragments have been omitted. Its full version is available by the link. At first this warning did not seem really serious to me. After reviewing it, I realized that something was definitely wrong with the code.

Firstly, it should be noted that the method can be called without passing parameters and by default there will be null in initial_value. In this case, an exception will be thrown right in the first line.

Secondly, the check of initial_value for null looks strange: if initial_value has really become null after calling ops.convert_to_tensor, _initial_value would also be null, which means that the call of _initial_value.dtype.as_base_dtype() would also throw an exception.

The analyzer hints that it is _initial_value that has to be checked for null. But as noted before, this variable is accessed before this check, so this option would also be incorrect.

Would this tiny mistake be noticed in such a giant function without PVS-Studio? I doubt it very much.

Conclusion

In a project with many examples of strange code, a lot of problems can be hidden. The programmer, getting used to seeing the incomprehensible, at the same time ceases to notice errors. The consequences can be very sad. Indeed, among analyzer warnings there are also false ones. However, in most cases, warnings at least indicate fragments of code that can cause questions when viewed by a person. In the case when the strange code is written intentionally, it is worth leaving explanations so that the fragment is clear to the developer who will work with this code in the future (even if it means leaving comments for oneself).

At the same time, static analysis tools, such as PVS-Studio, can be of great help in finding potential errors and oddities, so that they are visible and not forgotten, as well as all temporary solutions are subsequently refined and turned into clean, structured, and stable working code.

Hubs: