# Making «foreach» loop as fast as «for» loop

6 min
15K
Original author: WhiteBlackGoose

Hi. I want to share some experience on creating a enumerator for `foreach` loop so that it would be as performant as if it was a `for` loop.

## Problem statement

As we know, `System.Range` has a nice syntax sugar:

``var a = 1..2; // same as new Range(1, 2)``

That's why I decided to abuse it as a replacement for a `for` loop:

``````foreach (var i in 1..5)
Console.Write(i);``````

(prints "12345")

When you iterate over something with `foreach`, Roslyn requires that "something" to have a `GetEnumerator` method, which should return something, which has `bool MoveNext()` and `T Current`. We can add it as an extension method:

``````public static class Extensions
{
public ... GetEnumerator(this System.Range range) => ...
}``````

## Baseline (our goal)

Theoretically, `for` is more functional than just going over a sequence of natural numbers. But because so often we see

``for (int i = 0; i < n; i++)``

that it deserves its own use case of `for`. We are going to compare against a `for` loop with `i += inc` as a step:

``for (int i = 0; i < n; i += inc)``

In terms of performance, it's identical to the one with `i++`:

Method

Iterations

Mean

Error

StdDev

i++

10000

3.380 us

0.0534 us

0.0446 us

i += inc

10000

3.385 us

0.0575 us

0.0685 us

The code for baseline `for`:

``````[Benchmark]
public int ForWithCustomIncrement()
{
var iters = Iterations;
var a = 0;
var inc = Inc;
for (int i = 0; i < iters; i += inc)
a += i;
return a;
}``````

(we loaded all fields/properties onto local variables to avoid repeatitive reading from far RAM)

Codegen for it:

``````    L000f: add edx, r8d
L0015: cmp r8d, eax
L0018: jl short L000f``````

We only care about codegen of iteration of the loop, ignoring the initial overhead and all other operations around.

## Experiment 1. Enumerable.Range

It might make sense to start with trying the already existing API from BCL: Enumerable.Range, whose source code we can find here.

Here's what our benchmark looks like:

``````[Benchmark]
public int ForeachWithEnumerableRange()
{
var a = 0;
foreach (var i in Enumerable.Range(0, Iterations))
a += i;
return a;
}``````

This code is interpreted by roslyn as

``````public int ForeachWithEnumerableRange()
{
int num = 0;
IEnumerator<int> enumerator = Enumerable.Range(0, Iterations).GetEnumerator();
try
{
while (enumerator.MoveNext())
{
int current = enumerator.Current;
num += current;
}
return num;
}
finally
{
if (enumerator != null)
{
enumerator.Dispose();
}
}
}``````

No need to study the asm codegen, as long as the benchmarks are pretty illustrative:

Method

Iterations

Mean

Error

StdDev

ForWithCustomIncrement

10000

3.448 us

0.0689 us

0.0896 us

ForeachWithEnumerableRange

10000

43.946 us

0.8685 us

1.2456 us

## Experiment 2. yield return

The second easiest way is to create a custom generator:

``````private static IEnumerable<int> MyRange(int from, int to, int inc)
{
for (int i = from; i <= to; i += inc)
yield return i;
}

[Benchmark]
public int ForeachWithYieldReturn()
{
var a = 0;
foreach (var i in MyRange(0, Iterations - 1, 1))
a += i;
return a;
}``````

Nothing completely new, our enumerator is still a big reference type with a dozen of fields. The performance is even worse than that with Enumerable.Range:

Method

Iterations

Mean

Error

StdDev

ForWithCustomIncrement

10000

3.548 us

0.0661 us

0.1320 us

ForeachWithEnumerableRange

10000

51.663 us

0.9103 us

1.2460 us

ForeachWithYieldReturn

10000

56.580 us

0.3561 us

0.2780 us

## Experiment 3. Custom enumerator

Now we are going to write our own enumerator. First of all, it will be a `struct`. To make it work with `foreach`, we will need a method `bool MoveNext()` and a property `int Current`.

``````public struct RangeEnumerator
{

private int curr;

public RangeEnumerator(int @from, int to, int step)
{
this.to = to + step;
this.step = step;
this.curr = @from - step;
}

public bool MoveNext()
{
curr += step;
return curr != to;
}

public int Current => curr;
}``````

To make `System.Range` iterable with `foreach`, we are making an extension method:

``````public static class Extensions
{
public static RangeEnumerator GetEnumerator(this Range @this)
=> (@this.Start, @this.End) switch
{
({ IsFromEnd: true, Value: 0 }, { IsFromEnd: true, Value: 0 }) => new RangeEnumerator(0, int.MaxValue, 1),
({ IsFromEnd: true, Value: 0 }, { IsFromEnd: false, Value: var to }) => new RangeEnumerator(0, to + 1, 1),
({ IsFromEnd: false, Value: var from }, { IsFromEnd: true, Value: 0 }) => new RangeEnumerator(from, int.MaxValue, 1),
({ IsFromEnd: false, Value: var from }, { IsFromEnd: false, Value: var to })
=> (from < to) switch
{
true => new RangeEnumerator(from, to, 1),
false => new RangeEnumerator(from, to, -1),
},
_ => throw new InvalidOperationException("Invalid range")
};
}``````

The code of the benching method:

``````[Benchmark]
public int ForeachWithRangeEnumerator()
{
var a = 0;
foreach (var i in 0..(Iterations - 1))
a += i;
return a;
}``````

and is interpreted by Roslyn as:

``````public int ForeachWithRangeEnumerator()
{
int num = 0;
RangeEnumerator enumerator = Extensions.GetEnumerator(new Range(0, Iterations - 1));
while (enumerator.MoveNext())
{
int current = enumerator.Current;
num += current;
}
return num;
}``````

This time it's not a reference type. Also, because our struct does not implement `IDisposable`, Roslyn does not wrap it with try-finally block. Let's have a look at the benchmark results:

Method

Iterations

Mean

Error

StdDev

ForWithCustomIncrement

10000

3.477 us

0.0690 us

0.0989 us

ForeachWithEnumerableRange

10000

50.501 us

0.9984 us

1.2627 us

ForeachWithYieldReturn

10000

53.782 us

1.0583 us

0.9900 us

ForeachWithRangeEnumerator

10000

18.061 us

0.1731 us

0.1619 us

18 microseconds per 10k operations. That's much better than the previous attempts, but still worse than `for`. Let's check the codegen:

One loop iteration looks like:

``````    L003e: add esi, eax
L0040: mov eax, [rsp+0x38]
L0048: mov [rsp+0x38], eax
L004c: mov eax, [rsp+0x38]
L0050: cmp eax, [rsp+0x30]
L0054: jne short L003a``````

Despite the enumerator being a struct, and hence, on stack, it's still read from RAM. Let's see what else we can do!

## Experiment 4. Separating the enumerator into a local function

Why...? Because what we want is in fact to put the enumerator itself onto registers instead of reading/writing from/to RAM all the time. But probably because of too many local variables, the JIT compiler decided not to put our struct onto registers, so let us help it a bit.

From experiment 3 we have

``````[Benchmark]
public int ForeachWithRangeEnumeratorRaw()
{
var a = 0;
var enumerator = (0..(Iterations - 1)).GetEnumerator();
while (enumerator.MoveNext())
a += enumerator.Current;
return a;
}``````

Note, that I replaced `foreach` with the equivalent code, and added `Raw` as postfix. In Experiment 3 it's shown how `foreach` works under the hood - the same as I wrote manually.

To reduce the number of local variables, we move the loop to a local function:

``````[Benchmark]
public int ForeachWithRangeEnumeratorRawWithLocal()
{
var enumerator = (0..(Iterations - 1)).GetEnumerator();
return EnumerateItAll(enumerator);

static int EnumerateItAll(RangeEnumerator enumerator)
{
var a = 0;
while (enumerator.MoveNext())
a += enumerator.Current;
return a;
}
}``````

And run the benchmark:

Method

Iterations

Mean

Error

StdDev

ForWithCustomIncrement

10000

3.498 us

0.0658 us

0.1153 us

ForeachWithEnumerableRange

10000

43.313 us

0.8207 us

1.0079 us

ForeachWithYieldReturn

10000

56.963 us

1.0638 us

1.0448 us

ForeachWithRangeEnumerator

10000

17.931 us

0.2047 us

0.1915 us

ForeachWithRangeEnumeratorRaw

10000

17.932 us

0.1486 us

0.1390 us

ForeachWithRangeEnumeratorRawWithLocal

10000

3.501 us

0.0678 us

0.0807 us

`ForeachWithRangeEnumeratorRawWithLocal` is the last method we wrote - and it's almost 6 times faster, than the one with the loop inlined into the body! What happened?

Let's look at the codegen.

In our method the local function hasn't been inlined:

``````Benchmarks.ForeachWithRangeEnumeratorRawWithLocal()
...
L0044: call Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|8_0(RangeEnumerator)``````

And it's good news. Let's have a look at our local function:

``````    L0000: mov eax, [rcx]
L0002: mov edx, [rcx+4]
L0005: mov ecx, [rcx+8]
L0008: xor r8d, r8d
L000b: jmp short L0010
>>  L0012: cmp ecx, eax
>>  L0014: jne short L000d
L0016: mov eax, r8d
L0019: ret``````

The four marked lines

``````    L000d: add r8d, ecx
L0012: cmp ecx, eax
L0014: jne short L000d``````

are what we were trying to do. It's identical to the codegen for the `for` loop!

The first three lines

``````    L0000: mov eax, [rcx]
L0002: mov edx, [rcx+4]
L0005: mov ecx, [rcx+8]``````

simply load the local variables from stack onto registers. That is why our method is so fast.

## Windows vs Linux

There is in fact a huge difference between what was generated on windows vs linux. What I wrote before is how it works on Windows.

To illustrate the difference I created a gist for comparing codegens for `ForeachWithRangeEnumeratorRaw` and `ForeachWithRangeEnumeratorRawWithLocal`. Let me go over the key differences:

In case of `ForeachWithRangeEnumeratorRaw` method, Linux's JIT generates:

``````M00_L00:
cmp       esi,eax
jne       short M00_L00``````

Whereas Windows' JIT yields this:

``````M00_L00:
mov       eax,[rsp+38]
mov       [rsp+38],eax
mov       eax,[rsp+38]
cmp       eax,[rsp+30]
jne       short M00_L00``````

Unlike that one, in case of `ForeachWithRangeEnumeratorRawWithLocal` Linux gives

``jmp       near ptr Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|14_0(Library.RangeEnumerator)``

in the method, and in local function it generates:

``````M02_L00:
mov       edi,[rbp-8]
mov       [rbp-8],edi
cmp       edi,esi
jne       short M02_L00``````

Meanwhile, Windows's JIT gives call:

``````call      Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|14_0(Library.RangeEnumerator)
``````

and the iteration codegen:

``````M02_L00:
cmp       ecx,eax
jne       short M02_L00``````

(all that codegen see in the gist)

## Conclusion

First, although we got what we wanted, it's definitely easier to write a `for` loop than caring about whether the compiler put our values onto registers.

Second, it's an interesting example how sometimes inlining can literally ruin the performance (in our case, on Windows it took 5-6 times more time on 10k iterations, 3.5 us vs 18 us).

## References

All the code I wrote is on a github repo.

+3