vkhanieva Jul 2 2020 at 08:07

PVS-Studio Impressed by the Code Quality of ABBYY NeoML

4 min

604

ABBYY has recently opened the source code of their NeoML framework. Someone suggested that we check this library with PVS-Studio. We liked the idea and got down to work without further delay. This article won't take long to read because the project has proved to be pretty high-quality :).

The source code of NeoML can be downloaded from GitHub. This is a cross-platform framework designed for implementing machine learning models. It is used by ABBYY engineers for computer vision and natural language processing tasks, including image preprocessing, document layout analysis, and so on. It currently supports C++, Java, and Objective-C, with Python support coming soon. The framework itself is mainly written in C++.

Starting analysis

Starting the analysis on this framework was easy. Once I had the Visual Studio project generated in CMake, I ran PVS-Studio from Visual Studio on the projects in that solution, except for third-party libraries. Besides NeoML itself, the solution also included ABBYY libraries such as NeoOnnx and NeoMathEngine, which I also included in the list of projects to be analyzed.

Analysis results

Needless to say, I was hoping to find some bad bugs, but… the code turned out to be pretty clean, and I had to settle for just a few warnings. It's very likely that the project was already checked with some static analysis tool during development. Many of the warnings were produced by the same diagnostics on similar code fragments.

For example, calling a virtual method in a constructor is very common in this project, though it's generally an unsafe practice. Such cases are detected by the V1053 diagnostic: Calling the 'foo' virtual function in the constructor/destructor may lead to unexpected result at runtime. I got a total of 10 warnings of this type. To learn more about why this practice is unsafe and what issues it leads to, see the article "Never Call Virtual Functions during Construction or Destruction" by Scott Meyers. But the NeoML developers seem to understand what they are doing, so those warnings can be ignored.

There were also 11 warnings issued by the medium-level diagnostic V803, which deals with micro-optimizations. This diagnostic recommends replacing postfix increments with prefix ones when the iterator's previous value is not used. With a postfix increment, an unnecessary temporary object is created. It's not a bug, of course – just a minor detail. If this diagnostic is irrelevant, you can simply turn it off. Actually, the "micro-optimizations" set is turned off by default.

You must have already guessed that me talking about trifles like iterator increment means the code is fine and I'm just looking for something to pick on.

Certain diagnostics are very often irrelevant or inapplicable to a given project, so we recommend that you spend some time configuring the analyzer before analysis rather than put up with the pain of working with non-optimal settings. If you want to get to the most interesting warnings right off, follow the steps described in our article "How to quickly check out interesting warnings given by the PVS-Studio analyzer for C and C++ code?"

A few interesting warnings related to "micro-optimizations" were produced by diagnostic V802, which recommends rearranging a structure's fields by type size in decreasing order, thus reducing the structure's overall size.

V802 On 64-bit platform, structure size can be reduced from 24 to 16 bytes by rearranging the fields according to their sizes in decreasing order. HierarchicalClustering.h 31

struct CParam {
  TDistanceFunc DistanceType; 
  double MaxClustersDistance;
  int MinClustersCount; 
};

By simply swapping the MaxClustersDistance field of type double and the enumerator DistanceType field, we can reduce the structure's size from 24 to 16 bytes.

struct CParam {
  double MaxClustersDistance; 
  int MinClustersCount; 
  TDistanceFunc DistanceType; 
};

TDistanceFunc is enum, so its size is the same as that of int or smaller, which means we should move it to the bottom of the structure.

Again, that's not a bug, but if you want to have micro-optimizations just for the sake of it or if they are objectively crucial to your project, warnings like the ones shown above will help you quickly find spots in your code that could use at least some basic refactoring.

Overall, the code of NeoML is neat and clear, but the V807 diagnostic pointed out a couple of lines that could be optimized and made somewhat clearer. Here's one example:

V807 Decreased performance. Consider creating a reference to avoid using the same expression repeatedly. GradientBoostFullTreeBuilder.cpp 469

The chain curLevelStatistics[i]->ThreadStatistics[j] can be replaced with a call to an individual variable. There are no calls to any complex methods in this chain, so this optimization wouldn't give any noticeable boost, but it would still make this fragment clearer and shorter, I believe. Besides, it would indicate to any future maintainers that the original developer meant to address these exact indexes and there's no error here. This is the version with the suggested fix applied:

auto threadStatistics = curLevelStatistics[i]->ThreadStatistics[j];

if(threadStatistics.FeatureIndex != NotFound ) {
  if(   threadStatistics.Criterion > criterion
     || ( .... ))
  {
    criterion = threadStatistics.Criterion;
    curLevelStatistics[i]->FeatureIndex    = threadStatistics.FeatureIndex;
    curLevelStatistics[i]->Threshold       = threadStatistics.Threshold;
    curLevelStatistics[i]->LeftStatistics  = threadStatistics.LeftStatistics;
    curLevelStatistics[i]->RightStatistics = threadStatistics.RightStatistics;
  }
}

Conclusion

As you can see, the code base of the NeoML framework turns out to be very clean.

One thing you should keep in mind is that a single run of a static analyzer on an intensely developing project doesn't say much in favor of adopting static analysis because many of the bugs, especially grave ones, have already been found and fixed using other – more time- and resource-intensive – means. The article "Errors that static code analysis does not find because it is not used" elaborates on this topic.

But even with that fact considered, PVS-Studio issued particularly few warnings on NeoML and I give credit to the developers for the quality of their code, no matter if they used static analysis or not.

Hubs:

PVS-Studio corporate blog

PVS-Studio Impressed by the Code Quality of ABBYY NeoML

Starting analysis

Analysis results

Conclusion

Articles

Information