# A small overview of SIMD in .NET/C#

Here’s a quick look at algorithm vectorization capabilities in .NET Framework and .NET Core. This article is for readers who know nothing about these techniques. I will also show that .NET doesn’t actually lag behind "real compiled" languages for native development.

I’m just starting to learn vectorization techniques, so I would appreciate it if community members point out clear errors or suggest improvements to the described algorithms.

## Some history

SIMD support appeared in .NET Framework 4.6, released in 2015. That's when the Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3 and Vector4 types were added; they allowed vectorized computations. Next came the Vector&lt;T&gt; type, which gave more opportunities to vectorize algorithms. However, many programmers were still dissatisfied: these types constrained what developers could express and didn’t expose the full capacity of the SIMD instructions in modern processors. Now, in .NET Core 3.0 Preview, we have the System.Runtime.Intrinsics namespace, which gives much more freedom in the choice of instructions. To get the maximum speed, you need to use RyuJIT and either target x64 or switch off Prefer 32-bit and target AnyCPU. I ran all the benchmarks on a computer with an Intel Core i7-6700 3.40 GHz (Skylake) CPU.
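Before choosing a code path, you can check at run time what the JIT actually gives you. A minimal sketch (the printed values depend on the machine and runtime):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

class Capabilities {
    static void Main() {
        // True when Vector<T> operations are compiled down to SIMD instructions
        Console.WriteLine($"Vector hardware acceleration: {Vector.IsHardwareAccelerated}");
        // How many 4-byte ints fit into one Vector<T> on this machine
        Console.WriteLine($"Vector<int>.Count: {Vector<int>.Count}");
        // Whether the AVX2-specific intrinsics may be called at all
        Console.WriteLine($"Avx2.IsSupported: {Avx2.IsSupported}");
    }
}
```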

## Summing array elements

I decided to start with the classic task that usually comes up first when vectorization is discussed: summing the elements of an array. Let’s write four implementations that sum the elements of Array.

The most obvious implementation:

```csharp
public int Naive() {
    int result = 0;
    foreach (int i in Array) {
        result += i;
    }
    return result;
}
```

LINQ-based implementation:

```csharp
public long LINQ() => Array.Aggregate<int, long>(0, (current, i) => current + i);
```

The implementation based on vectors from System.Numerics:

```csharp
public int Vectors() {
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v); // accumulate the partial sums
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}
```

The implementation based on code from System.Runtime.Intrinsics namespace:

```csharp
public unsafe int Intrinsics() {
    int vectorSize = 256 / 8 / 4;
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array) {
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v); // accumulate the partial sums
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++) {
        result += temp[j];
    }
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}
```

I benchmarked those 4 methods on my computer and got the following results:

| Method | ItemsCount | Mean | Error | StdDev | Ratio |
|---|---:|---:|---:|---:|---:|
| Naive | 10 | 3.531 ns | 0.0336 ns | 0.0314 ns | 1.00 |
| LINQ | 10 | 76.925 ns | 0.4166 ns | 0.3897 ns | 21.79 |
| Vectors | 10 | 2.750 ns | 0.0210 ns | 0.0196 ns | 0.78 |
| Intrinsics | 10 | 6.513 ns | 0.0623 ns | 0.0582 ns | 1.84 |
| Naive | 100 | 47.982 ns | 0.3975 ns | 0.3524 ns | 1.00 |
| LINQ | 100 | 590.414 ns | 3.8808 ns | 3.4402 ns | 12.31 |
| Vectors | 100 | 10.122 ns | 0.0747 ns | 0.0699 ns | 0.21 |
| Intrinsics | 100 | 14.277 ns | 0.0566 ns | 0.0529 ns | 0.30 |
| Naive | 1000 | 569.910 ns | 2.8297 ns | 2.6469 ns | 1.00 |
| LINQ | 1000 | 5,658.570 ns | 31.7465 ns | 29.6957 ns | 9.93 |
| Vectors | 1000 | 79.598 ns | 0.3498 ns | 0.3272 ns | 0.14 |
| Intrinsics | 1000 | 66.970 ns | 0.3937 ns | 0.3682 ns | 0.12 |
| Naive | 10000 | 5,637.571 ns | 37.5050 ns | 29.2814 ns | 1.00 |
| LINQ | 10000 | 56,498.987 ns | 294.8776 ns | 275.8287 ns | 10.02 |
| Vectors | 10000 | 772.900 ns | 2.6802 ns | 2.5070 ns | 0.14 |
| Intrinsics | 10000 | 579.152 ns | 2.8371 ns | 2.6538 ns | 0.10 |
| Naive | 100000 | 56,352.865 ns | 230.7916 ns | 215.8826 ns | 1.00 |
| LINQ | 100000 | 562,610.571 ns | 3,775.7631 ns | 3,152.9332 ns | 9.98 |
| Vectors | 100000 | 8,389.647 ns | 165.9590 ns | 227.1666 ns | 0.15 |
| Intrinsics | 100000 | 7,261.334 ns | 89.6468 ns | 69.9903 ns | 0.13 |

It’s clear that solutions with Vectors and Intrinsics are much faster than the obvious and LINQ-based solutions. Now we need to figure out what goes on in these two methods.

Let’s consider Vectors method more closely:

```csharp
public int Vectors() {
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v); // accumulate the partial sums
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}
```

• int vectorSize = Vector&lt;int&gt;.Count; — the number of 4-byte integers we can place in a vector. If hardware acceleration is used, this value shows how many 4-byte integers fit into one SIMD register; in effect, it shows how many elements of this type can be processed concurrently;
• accVector is the vector that accumulates the result of the function;
• var v = new Vector&lt;int&gt;(array, i); — data from array is loaded into a new vector v, starting at index i. Exactly vectorSize elements are loaded;
• accVector = Vector.Add(accVector, v); — the two vectors are summed.
For example, suppose Array holds 8 numbers: {0, 1, 2, 3, 4, 5, 6, 7} and vectorSize == 4.
Then on the first loop iteration accVector = {0, 0, 0, 0} and v = {0, 1, 2, 3}, so after the addition accVector holds {0, 0, 0, 0} + {0, 1, 2, 3} = {0, 1, 2, 3}.
On the second iteration v = {4, 5, 6, 7}, and after the addition accVector = {0, 1, 2, 3} + {4, 5, 6, 7} = {4, 6, 8, 10}.
• Now we just need the sum of all vector elements. For this we can take the dot product with a vector filled with ones: int result = Vector.Dot(accVector, Vector&lt;int&gt;.One);
Then we get {4, 6, 8, 10} · {1, 1, 1, 1} = 4 * 1 + 6 * 1 + 8 * 1 + 10 * 1 = 28.
• If necessary, the numbers that didn’t fit into the last vector are summed at the end.
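The walkthrough above can be packaged as a minimal standalone sketch; with the 8-element array from the example it returns 28 regardless of the actual Vector&lt;int&gt;.Count on the machine, because the tail loop picks up any leftovers:

```csharp
using System;
using System.Numerics;

class VectorSumDemo {
    static int Sum(int[] array) {
        int vectorSize = Vector<int>.Count;
        var accVector = Vector<int>.Zero;
        int i;
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            accVector += new Vector<int>(array, i); // vectorized partial sums
        }
        // Horizontal sum via dot product with a vector of ones
        int result = Vector.Dot(accVector, Vector<int>.One);
        for (; i < array.Length; i++) {
            result += array[i]; // scalar tail
        }
        return result;
    }

    static void Main() {
        Console.WriteLine(Sum(new[] { 0, 1, 2, 3, 4, 5, 6, 7 })); // prints 28
    }
}
```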

Let's look into Intrinsics code:

```csharp
public unsafe int Intrinsics() {
    int vectorSize = 256 / 8 / 4;
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array) {
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v); // accumulate the partial sums
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++) {
        result += temp[j];
    }
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}
```

We can see that it’s essentially the same as Vectors, with a few differences:

• vectorSize is a constant. This is because the method explicitly uses Avx2 instructions, which operate on 256-bit registers. A real application should first check whether the current processor supports Avx2 and, if not, fall back to other code. The dispatch looks like this (DoThingsForAvx2 and the other helpers are placeholders for the per-instruction-set implementations):

```csharp
if (Avx2.IsSupported) {
    DoThingsForAvx2();
}
else if (Avx.IsSupported) {
    DoThingsForAvx();
}
// ...
else if (Sse2.IsSupported) {
    DoThingsForSse2();
}
// ...
```
• var accVector = Vector256&lt;int&gt;.Zero; — accVector is declared as a 256-bit vector filled with zeros.
• fixed (int* ptr = array) — the pointer to the array is stored in ptr.
• Next come the same operations as in Vectors: loading data into a vector and adding two vectors.
• The vector elements are summed as follows:
  • create an array on the stack: var temp = stackalloc int[vectorSize];
  • store the vector into this array: Avx2.Store(temp, accVector);
  • sum the array elements in a loop.
• Finally, the elements that didn’t fit into the last vector are summed.

## Comparing two arrays

We need to compare two arrays of bytes. It is exactly this task that made me study SIMD in .NET. Again, let’s write several methods for benchmarking and compare two arrays: ArrayA and ArrayB.

The most obvious solution:

```csharp
public bool Naive() {
    for (int i = 0; i < ArrayA.Length; i++) {
        if (ArrayA[i] != ArrayB[i]) return false;
    }
    return true;
}
```

LINQ-based solution:

```csharp
public bool LINQ() => ArrayA.SequenceEqual(ArrayB);
```

The solution based on MemCmp function:

```csharp
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
static extern int memcmp(byte[] b1, byte[] b2, long count);

public bool MemCmp() => memcmp(ArrayA, ArrayB, ArrayA.Length) == 0;
```

The solution based on vectors from System.Numerics:

```csharp
public bool Vectors() {
    int vectorSize = Vector<byte>.Count;
    int i = 0;
    for (; i <= ArrayA.Length - vectorSize; i += vectorSize) {
        var va = new Vector<byte>(ArrayA, i);
        var vb = new Vector<byte>(ArrayB, i);
        if (!Vector.EqualsAll(va, vb)) {
            return false;
        }
    }
    for (; i < ArrayA.Length; i++) {
        if (ArrayA[i] != ArrayB[i])
            return false;
    }
    return true;
}
```
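As a quick sanity check, the Vector-based comparison can be wrapped into a self-contained sketch; the ArrayA/ArrayB fields are replaced with parameters and a length check is added:

```csharp
using System;
using System.Numerics;

class VectorEqualsDemo {
    static bool AreEqual(byte[] a, byte[] b) {
        if (a.Length != b.Length) return false;
        int vectorSize = Vector<byte>.Count;
        int i = 0;
        for (; i <= a.Length - vectorSize; i += vectorSize) {
            // Compare one SIMD register's worth of bytes at a time
            if (!Vector.EqualsAll(new Vector<byte>(a, i), new Vector<byte>(b, i)))
                return false;
        }
        for (; i < a.Length; i++)       // scalar tail
            if (a[i] != b[i]) return false;
        return true;
    }

    static void Main() {
        var a = new byte[100];
        var b = new byte[100];
        Console.WriteLine(AreEqual(a, b)); // True
        b[99] = 1;
        Console.WriteLine(AreEqual(a, b)); // False
    }
}
```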

Intrinsics-based solution:

```csharp
public unsafe bool Intrinsics() {
    int vectorSize = 256 / 8;
    int i = 0;
    const int equalsMask = unchecked((int) (0b1111_1111_1111_1111_1111_1111_1111_1111));
    fixed (byte* ptrA = ArrayA)
    fixed (byte* ptrB = ArrayB) {
        for (; i <= ArrayA.Length - vectorSize; i += vectorSize) {
            var va = Avx2.LoadVector256(ptrA + i);
            var vb = Avx2.LoadVector256(ptrB + i);
            var areEqual = Avx2.CompareEqual(va, vb);
            if (Avx2.MoveMask(areEqual) != equalsMask) {
                return false;
            }
        }
    }
    for (; i < ArrayA.Length; i++) {
        if (ArrayA[i] != ArrayB[i])
            return false;
    }
    return true;
}
```

The results of running the benchmark on my computer:

| Method | ItemsCount | Mean | Error | StdDev | Ratio |
|---|---:|---:|---:|---:|---:|
| Naive | 10000 | 7,033.8 ns | 50.636 ns | 47.365 ns | 1.00 |
| LINQ | 10000 | 64,841.4 ns | 289.157 ns | 270.478 ns | 9.22 |
| Vectors | 10000 | 504.0 ns | 2.406 ns | 2.251 ns | 0.07 |
| MemCmp | 10000 | 368.1 ns | 2.637 ns | 2.466 ns | 0.05 |
| Intrinsics | 10000 | 283.6 ns | 1.135 ns | 1.061 ns | 0.04 |
| Naive | 100000 | 85,214.4 ns | 903.868 ns | 845.478 ns | 1.00 |
| LINQ | 100000 | 702,279.4 ns | 2,846.609 ns | 2,662.720 ns | 8.24 |
| Vectors | 100000 | 5,179.2 ns | 45.337 ns | 42.409 ns | 0.06 |
| MemCmp | 100000 | 4,510.5 ns | 24.292 ns | 22.723 ns | 0.05 |
| Intrinsics | 100000 | 2,957.0 ns | 11.452 ns | 10.712 ns | 0.03 |
| Naive | 1000000 | 844,006.1 ns | 3,552.478 ns | 3,322.990 ns | 1.00 |
| LINQ | 1000000 | 6,483,079.3 ns | 42,641.040 ns | 39,886.455 ns | 7.68 |
| Vectors | 1000000 | 54,180.1 ns | 357.258 ns | 334.180 ns | 0.06 |
| MemCmp | 1000000 | 49,480.1 ns | 515.675 ns | 457.133 ns | 0.06 |
| Intrinsics | 1000000 | 36,633.9 ns | 680.525 ns | 636.564 ns | 0.04 |

I guess the code of these methods is clear, except for two lines in Intrinsics:

```csharp
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask) {
    return false;
}
```

In the first line, the two vectors are compared for equality and the result is saved in the areEqual vector: all bits of an element at a given position are set to 1 if the corresponding elements of va and vb are equal. So, if the byte vectors va and vb are equal, every element of areEqual should be 255 (11111111b). Since Avx2.CompareEqual is a wrapper over _mm256_cmpeq_epi8, you can look up the pseudocode of this operation in Intel's intrinsics guide.
The MoveMask method builds a 32-bit number from a vector: the most significant bit of each of the 32 one-byte elements becomes the corresponding bit of the result. Its pseudocode is also available in the Intel guide.

Thus, if some bytes of va and vb don’t match, the corresponding bytes of areEqual will be 0, so the top bits of those bytes will be 0 too. This means the corresponding bits in the Avx2.MoveMask result will also be 0, and the result will not equal equalsMask.

Let’s look at one example assuming that vector length is 8 bytes (to write less):

• Let va = {100, 10, 20, 30, 100, 40, 50, 100} and vb = {100, 20, 10, 30, 100, 40, 80, 90}.
• Then areEqual will be {255, 0, 0, 255, 255, 255, 0, 0}.
• The MoveMask method will return 00111001b (element 0 maps to the least significant bit of the result), which is compared against the 11111111b mask. Since the masks differ, the va and vb vectors are not equal either.
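The MoveMask semantics can be mimicked in plain C# to make the bit ordering visible; a small sketch of the 8-byte example:

```csharp
using System;

class MoveMaskDemo {
    static void Main() {
        // The comparison result from the example: 255 where va and vb matched
        byte[] areEqual = { 255, 0, 0, 255, 255, 255, 0, 0 };
        int mask = 0;
        for (int j = 0; j < areEqual.Length; j++) {
            mask |= (areEqual[j] >> 7) << j; // top bit of element j -> bit j
        }
        Console.WriteLine(Convert.ToString(mask, 2).PadLeft(8, '0')); // 00111001
        Console.WriteLine(mask == 0b1111_1111); // False: the vectors differ
    }
}
```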

## Counting the times an element occurs in a collection

Sometimes you need to count the occurrences of a particular element, e.g. an integer, in a collection. This algorithm can be sped up too. For comparison, let’s write several methods that count the occurrences of Item in Array.

The most obvious one:

```csharp
public int Naive() {
    int result = 0;
    foreach (int i in Array) {
        if (i == Item) {
            result++;
        }
    }
    return result;
}
```

Using LINQ:

```csharp
public int LINQ() => Array.Count(i => i == Item);
```

Using vectors from System.Numerics.Vectors:

```csharp
public int Vectors() {
    int vectorSize = Vector<int>.Count;
    var mask = new Vector<int>(Item);          // each element holds the sought value
    var accResult = Vector<int>.Zero;
    int i;
    var array = Array;
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
        var v = new Vector<int>(array, i);
        var areEqual = Vector.Equals(v, mask); // -1 in equal positions, 0 elsewhere
        accResult = Vector.Subtract(accResult, areEqual);
    }
    int result = 0;
    for (; i < array.Length; i++) {
        if (array[i] == Item) {
            result++;
        }
    }
    result += Vector.Dot(accResult, Vector<int>.One);
    return result;
}
```

Using Intrinsics:

```csharp
public unsafe int Intrinsics() {
    int vectorSize = 256 / 8 / 4;
    var temp = stackalloc int[vectorSize];
    for (int j = 0; j < vectorSize; j++) {
        temp[j] = Item;
    }
    var mask = Avx2.LoadVector256(temp);           // each element holds the sought value
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array) {
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            var v = Avx2.LoadVector256(ptr + i);
            var areEqual = Avx2.CompareEqual(v, mask); // -1 in equal positions, 0 elsewhere
            accVector = Avx2.Subtract(accVector, areEqual);
        }
    }
    int result = 0;
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++) {
        result += temp[j];
    }
    for (; i < array.Length; i++) {
        if (array[i] == Item) {
            result++;
        }
    }
    return result;
}
```

The results of running the benchmark on my computer:

| Method | ItemsCount | Mean | Error | StdDev | Ratio |
|---|---:|---:|---:|---:|---:|
| Naive | 10 | 8.844 ns | 0.0772 ns | 0.0603 ns | 1.00 |
| LINQ | 10 | 87.456 ns | 0.9496 ns | 0.8883 ns | 9.89 |
| Vectors | 10 | 3.140 ns | 0.0406 ns | 0.0380 ns | 0.36 |
| Intrinsics | 10 | 13.813 ns | 0.0825 ns | 0.0772 ns | 1.56 |
| Naive | 100 | 107.310 ns | 0.6975 ns | 0.6183 ns | 1.00 |
| LINQ | 100 | 626.285 ns | 5.7677 ns | 5.3951 ns | 5.83 |
| Vectors | 100 | 11.844 ns | 0.2113 ns | 0.1873 ns | 0.11 |
| Intrinsics | 100 | 19.616 ns | 0.1018 ns | 0.0903 ns | 0.18 |
| Naive | 1000 | 1,032.466 ns | 6.3799 ns | 5.6556 ns | 1.00 |
| LINQ | 1000 | 6,266.605 ns | 42.6585 ns | 39.9028 ns | 6.07 |
| Vectors | 1000 | 83.417 ns | 0.5393 ns | 0.4780 ns | 0.08 |
| Intrinsics | 1000 | 88.358 ns | 0.4921 ns | 0.4603 ns | 0.09 |
| Naive | 10000 | 9,942.503 ns | 47.9732 ns | 40.0598 ns | 1.00 |
| LINQ | 10000 | 62,305.598 ns | 643.8775 ns | 502.6972 ns | 6.27 |
| Vectors | 10000 | 914.967 ns | 7.2959 ns | 6.8246 ns | 0.09 |
| Intrinsics | 10000 | 931.698 ns | 6.3444 ns | 5.9346 ns | 0.09 |
| Naive | 100000 | 94,834.804 ns | 793.8585 ns | 703.7349 ns | 1.00 |
| LINQ | 100000 | 626,620.968 ns | 4,696.9221 ns | 4,393.5038 ns | 6.61 |
| Vectors | 100000 | 9,000.827 ns | 179.5351 ns | 192.1005 ns | 0.09 |
| Intrinsics | 100000 | 8,690.771 ns | 101.7078 ns | 95.1376 ns | 0.09 |
| Naive | 1000000 | 959,302.249 ns | 4,268.2488 ns | 3,783.6914 ns | 1.00 |
| LINQ | 1000000 | 6,218,681.888 ns | 31,321.9277 ns | 29,298.5506 ns | 6.48 |
| Vectors | 1000000 | 99,778.488 ns | 1,975.6001 ns | 4,252.6877 ns | 0.10 |
| Intrinsics | 1000000 | 96,449.350 ns | 1,171.8067 ns | 978.5116 ns | 0.10 |

The Vectors and Intrinsics methods coincide completely in logic and differ only in the implementation of particular operations. The idea is the following:

• create a mask vector in which the sought number is stored in each element;
• load a part of the array into vector v and compare it with the mask. As a result, all bits are set in the equal elements of areEqual. Since areEqual is a vector of integers, an element with all bits set equals -1 ((int)(1111_1111_1111_1111_1111_1111_1111_1111b) == -1);
• subtract areEqual from accVector. Then each position of accVector holds the number of times Item occurred at that position across all v vectors (minus times minus gives plus).
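The subtraction trick is easiest to see on scalars; a sketch of one lane's behavior (Vector.Equals and Avx2.CompareEqual produce -1 per matching element):

```csharp
using System;

class CountTrickDemo {
    static void Main() {
        int[] array = { 5, 3, 5, 5, 1 };
        int item = 5;
        int acc = 0; // plays the role of one lane of accVector
        foreach (int x in array) {
            int areEqual = x == item ? -1 : 0; // what the SIMD compare yields per lane
            acc -= areEqual;                   // subtracting -1 is adding 1
        }
        Console.WriteLine(acc); // prints 3
    }
}
```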

The whole code from the article is on GitHub.

## Conclusion

I described only a small part of .NET's capabilities for computation vectorization. To see the full, up-to-date list of all intrinsics available in .NET Core on x86, look at the source code. Conveniently, the summary of each intrinsic in the C# files contains its name in the C world, which helps both to understand the purpose of an intrinsic and to port existing C/C++ algorithms to .NET. System.Numerics.Vector documentation is available on MSDN.

I think .NET has a great advantage over C++ here: since JIT compilation happens on the client machine, the compiler can optimize code for that particular processor, giving maximum performance, while the programmer stays within one language and one set of technologies to write fast code.
