ipolynkina Jan 22 2021 at 09:17

How PVS-Studio Checked ELKI in January

9 min

760

PVS-Studio corporate blogData Mining*Java*Open source*

If you feel like the New Year just came, and you missed the first half of January, then all this time you've been busy looking for tricky bugs in the code you maintain. It also means that our article is what you need. PVS-Studio has checked the ELKI open source project to show you errors that may occur in the code, how cunningly they can hide there, and how you can deal with them.

ELKI/image1.png

What kind of library is ELKI?

The abbreviation ELKI stands for Environment for Developing KDD-Applications Supported by Index-Structures. This project is written in Java and is designed for data mining. Most users of this library are students, researchers, data scientists, and software engineers. No wonder since this library was developed for research only.

The algorithms included in the library are mainly related to cluster analysis and outlier detection that may indicate experimental errors. Every year, there are more and more references to research conducted using ELKI. On the developers' official website, you can find a whole list of scientific works that were conducted via this library. From year to year, this list is constantly growing and being supplemented, and the topics of scientific works are getting more diverse.

The only obstacle to using this library may be the AGPL license, which requires the full source code of the application to be open to any user. In other words, you won't be able to use this library in commercial projects with closed source code.

Let's start checking

The ELKI library contains 2 630 java files with 186 444 lines of code, except for comments. Therefore, the project seems to be quite small compared to some open source giants.

However, the small size of the library does not guarantee the absence of errors. As you know, errors can occur in any projects, regardless of their scale. That's why we are here now! Let's see my top 10 most interesting warnings found when viewing the report.

Unchecked boundaries

V6001 There are identical sub-expressions 'bounds[j + 1]' to the left and to the right of the '!=' operator. CLIQUEUnit.java(252)

private boolean checkDimensions(CLIQUEUnit other, int e) {
    for(int i = 0, j = 0; i < e; i++, j += 2) {
        if (dims[i] != other.dims[i]
            || bounds[j] != other.bounds[j]
            || bounds[j + 1] != bounds[j + 1]) {
          return false;
        }
    }
    return true;
}

In the if block of the checkDimensions method, the value of bounds[j + 1] is checked for inequality with its own value. Naturally, this condition will always be false, and one of the boundaries will never be checked. That is, the checkDimensions method may return true when checking, even if the boundaries of the arrays do not match.

The correct check in the if block must look this way:

bounds[j + 1] != other.bounds[j + 1]

Lazy constructor

V6022 Parameter 'updates' is not used inside constructor body. DataStoreEvent.java(60)

V6022 Parameter 'removals' is not used inside constructor body. DataStoreEvent.java(60)

public DataStoreEvent(DBIDs inserts, DBIDs removals, DBIDs updates) {
    super();
    this.inserts = inserts;
    this.removals = inserts;
    this.updates = inserts;
}

Two warnings were received for this code snippet. Here, three parameters are passed to the DataStoreEvent constructor, two of which are not used for initialization. Looking closer at the code in the constructor, it becomes clear that the programmer used copy-paste and forgot to correct the names of variables that should have been involved in initialization.

If you go even further and take a closer look at the DataStoreEvent class, you can see three methods in it that are actively using the above constructor.

public static DataStoreEvent insertionEvent(DBIDs inserts) {
  return new DataStoreEvent(inserts, DBIDUtil.EMPTYDBIDS, DBIDUtil.EMPTYDBIDS);
}

public static DataStoreEvent removalEvent(DBIDs removals) {
  return new DataStoreEvent(DBIDUtil.EMPTYDBIDS, removals, DBIDUtil.EMPTYDBIDS);
}

public static DataStoreEvent updateEvent(DBIDs updates) {
  return new DataStoreEvent(DBIDUtil.EMPTYDBIDS, DBIDUtil.EMPTYDBIDS, updates);
}

Due to incorrect initialization in the constructor, the behavior of these methods will also differ from what is expected.

The correct code version should be as follows:

this.inserts = inserts;
this.removals = removals;
this.updates = updates;

Opaque variable

V6012 The '?:' operator, regardless of its conditional expression, always returns one and the same value '0.5'. ClusterHullVisualization.java(173), ClusterHullVisualization.java(173)

public void fullRedraw() {
    ....
    boolean flat = (clusters.size() == topc.size());
    // Heuristic value for transparency:
    double baseopacity = flat ? 0.5 : 0.5;
    ....
}

In this code snippet, the value of the baseopacity variable will always be 0.5, regardless of what the value of the flat variable is. It's clear that there should have been two different values, but due to inattention, the author of the code forgot to correct one of them.

The mistake is simple and very disappointing. It immediately strikes your eye if you exclude lines of code that are not related to this variable from the method. But it's not an easy task to find a mistake in a method consisting of almost 90 lines. It looks like this is exactly what became the obstacle for the programmer.

In pursuit of something beyond the limits

V6025 Index '1' is out of bounds. GeneratorStatic.java(104)

@Override
public double[] computeMean() {
    // Not supported except for singletons.
    return points.size() == 1 ? points.get(1) : null;
}

In the computeMean method, the code author checks that the size of the points collection is equal to one. If so, the developer tries to return… a collection element with index 1 from the method. Since indexing starts from zero, an IndexOutOfBoundsException is inevitable.

The correct version of the code must look like this:

return points.size() == 1 ? points.get(0) : null;

Is it possible to divide by zero?

V6020 Divide by zero. The range of the 'referenceSetSize' denominator values includes zero. PreDeConNeighborPredicate.java(138)

protected PreDeConModel computeLocalModel(DoubleDBIDList neighbors, ....) {
    final int referenceSetSize = neighbors.size();
    ....
    // Shouldn't happen:
    if(referenceSetSize < 0) {
        LOG.warning("Empty reference set – 
            should at least include the query point!");
        return new PreDeConModel(Integer.MAX_VALUE, DBIDUtil.EMPTYDBIDS);
    }
    ....
    for(int d = 0; d < dim; d++) {
        s[d] /= referenceSetSize;
        mvVar.put(s[d]);
    }
    ....
}

Every programmer knows that you can't divide by zero. Every programmer tries to avoid the occurrence of such an error in their code. And it works in most cases. But not always.

In this code snippet, everything depends on the size of the neighbors parameter passed to the computeLocalModel method. The code author checks the neighbors size and cuts off the values that are less than zero by checking in the if statement, but isn't taking any actions if the size is zero. Since the check of referenceSetSize < 0 doesn't make any sense, and the message for logging also says about an empty set, all of that seems like a typo. The check of referenceSetSize == 0 was most likely to be here.

If an empty neighbors container is passed in this method anyway, the division by zero will occur in the for loop. We can only hope that this will never really happen.

Endless initialization

V6062 Possible infinite recursion inside the 'setInitialMeans' method. Predefined.java(65), Predefined.java(66)

public void setInitialMeans(List<double[]> initialMeans) {
    this.setInitialMeans(initialMeans);
}

This one-line method is a true recursion. It's difficult to guess what exactly the code author wanted to write here. The only line of this method probably should have looked as follows:

this.setInitialMeans(initialMeans.toArray(new double[0][0]));

Since there is another method with the same name in this class, the programmer most likely wanted to pass data for initialization to it, but in the end, something went wrong. This is what the body of the second method looks like, by the way:

public void setInitialMeans(double[][] initialMeans) {
    double[][] vecs = initialMeans.clone(); // TODO: deep copy?
    this.initialMeans = vecs;
}

Lost weight

V6094 The expression was implicitly cast from 'int' type to 'double' type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. ProbabilityWeightedMoments.java(130)

public static <A> double[] alphaBetaPWM(...., final int nmom) {
    final int n = adapter.size(data);
    final double[] xmom = new double[nmom << 1];
    double aweight = 1. / n, bweight = aweight;
    for(int i = 0; i < n; i++) {
        ....
        for(int j = 1, k = 2; j < nmom; j++, k += 2) {
            xmom[k + 1] += val * (aweight *= (n - i - j + 1) / (n - j + 1));
            xmom[k + 1] += val * (bweight *= (i - j + 1) / (n - j + 1));
        }
    }
    return xmom;
}

In the for loop of this code snippet, the (n — i — j + 1) / (n — j + 1) expression is converted implicitly from the int type to the double type. In this case, the loss of accuracy may be quite different: from a few digits after the decimal point to a complete nullification if the number modulo is less than one. Probably, it's not exactly the expected behavior given that the xmom array is of the double type. The (n – i – j + 1) / (n – j + 1) expression helps to prove that the programmer meant something completely different here. Let's say n – j + 1 equals x. Then we get the expression: (x – i) / x. This expression will always give 0 for integer division unless we start going into negative ranges. But since the values of n in this code snippet are always greater than zero, we can conclude that the programmer did not intend to use integer division here.

Not to lose precision, an explicit conversion to the double type is required:

xmom[k + 1] += val * (aweight *= (double) (n - i - j + 1) / (n - j + 1));
xmom[k + 1] += val * (bweight *= (double) (i - j + 1) / (n - j + 1));

Out of bounds

V6079 Value of the 'splitpoint' variable is checked after use. Potential logical error is present. KernelDensityFittingTest.java(97), KernelDensityFittingTest.java(97)

public final void testFitDoubleArray() throws IOException {
    ....
    int splitpoint = 0;
    while(fulldata[splitpoint] < splitval && splitpoint < fulldata.length) {
        splitpoint++;
    }
    ....
}

In this code fragment, in the while loop, first, the fulldata array element with the splitpoint index is compared with the value of the splitval variable, and only then we check whether the splitpoint value is smaller than the size of the array itself. These two checks in the while loop need to be changed, otherwise, you can very easily find yourself outside the array's bounds.

Unreachable code

V6019 Unreachable code detected. It is possible that an error is present. Tokenizer.java(172)

V6007 Expression 'c != '\n'' is always true. Tokenizer.java(169)

public String getStrippedSubstring() {
    int sstart = start, ssend = end;
    while(sstart < ssend) {
        char c = input.charAt(sstart);
        if(c != ' ' || c != '\n' || c != '\r' || c != '\t') {
            break;
        }
        ++sstart;
    }
    ....
}

Two diagnostics decided to act together this time. They both issued warnings. The V6019 diagnostic pointed to an unreachable code fragment: ++sstart, and V6007 pointed to a condition in the if statement that will always be true.

Why will there always be true in the if block? The answer is very simple. In this statement several conditions are checked at once: c !=' ' or c != '\n', or c != '\r', or c != '\t'. Whatever the input data, some of those will be true. Even if one of the checks is false, the next check will return true, and because of the || (or) operator, the condition in if will eventually be true. The condition in the if block will always be true. Therefore, the break statement, which prematurely ends the while loop, will trigger. As a result, the increment of the sstart variable will never be executed. This is exactly what the V6019 diagnostics noticed and started sounding the alarm.

Most likely, the programmer wanted to write something like this:

if(c != ' ' && c != '\n' && c != '\r' && c != '\t')

Override but check

V6009 Function 'equals' receives an odd argument. An object 'other.similarityFunction' is used as an argument to its own method. AbstractSimilarityAdapter.java(91)

@Override
public boolean equals(Object obj) {
    if(obj == null) {
        return false;
    }

    if(!this.getClass().equals(obj.getClass())) {
        return false;
    }

    AbstractSimilarityAdapter<?> other = (AbstractSimilarityAdapter<?>) obj;
    return other.similarityFunction.equals(other.similarityFunction);
}

The code author decided to override the equals method in the AbstractSimilarityAdapter class. If it is assumed that objects in the program will be stored in containers, or checked for equality, the equals override is required. However, the whole author's idea was spoiled by the method's last line, in which equals is called for the same object. As a result, even the most common comparison will be incorrect.

The code was probably meant to be like this:

return this.similarityFunction.equals(other.similarityFunction);

I'd like to note that this error pattern is often found in programs in different languages, and not only in Java. We've got an article 'The Evil within the Comparison Functions' on this topic. You must read it if you're interested in finding out why programmers often make mistakes in fairly simple functions designed to compare two objects.

Let's recap

As you can see, mistakes may appear in any project. Some of them are simple and disappointing, others are cunning and tricky. And it doesn't matter whether your project is small or large. It doesn't matter if you're a professional developer or just starting to write code. Mistakes can appear anywhere in your program and hide safely from your eyes.

For every developer and every project, static code analysis is an indispensable tool. As essential as a code review or unit tests. That's why, we recommend using static code analysis in your work right now, so that in the New Year your code will become better and clearer than in the previous one.

In conclusion

It's not the first time we are checking packages related to calculations, statistics, and scientific researches. And every time we worry about mistakes in the code of such projects. Can we trust the data calculated using these libraries?

Of course, it's incredibly difficult to write code without mistakes, and in practice, it's even impossible. However, libraries used as building blocks in scientific, medical, and research programs should be checked as carefully as possible. The code should be as clean as possible. After all, every mistake can lead to very unpleasant consequences in the future. In any case, static analysis will prove to be extremely useful for such projects.

We have already covered this topic in some of our other articles, so if this question is interesting for you, don't hesitate to check out some of them:

Hubs: