Hi. I want to share some experience on creating a enumerator for foreach
loop so that it would be as performant as if it was a for
loop.
Problem statement
As we know, System.Range
has a nice syntax sugar:
var a = 1..2; // same as new Range(1, 2)
That's why I decided to abuse it as a replacement for a for
loop:
foreach (var i in 1..5)
Console.Write(i);
(prints "12345")
When you iterate over something with foreach
, Roslyn requires that "something" to have a GetEnumerator
method, which should return something, which has bool MoveNext()
and T Current
. We can add it as an extension method:
public static class Extensions
{
public ... GetEnumerator(this System.Range range) => ...
}
Baseline (our goal)
Theoretically, for
is more functional than just going over a sequence of natural numbers. But because so often we see
for (int i = 0; i < n; i++)
that it deserves its own use case of for
. We are going to compare against a for
loop with i += inc
as a step:
for (int i = 0; i < n; i += inc)
In terms of performance, it's identical to the one with i++
:
Method | Iterations | Mean | Error | StdDev |
---|---|---|---|---|
i++ | 10000 | 3.380 us | 0.0534 us | 0.0446 us |
i += inc | 10000 | 3.385 us | 0.0575 us | 0.0685 us |
The code for baseline for
:
[Benchmark]
public int ForWithCustomIncrement()
{
var iters = Iterations;
var a = 0;
var inc = Inc;
for (int i = 0; i < iters; i += inc)
a += i;
return a;
}
(we loaded all fields/properties onto local variables to avoid repeatitive reading from far RAM)
Codegen for it:
L000f: add edx, r8d
L0012: add r8d, ecx
L0015: cmp r8d, eax
L0018: jl short L000f
We only care about codegen of iteration of the loop, ignoring the initial overhead and all other operations around.
Experiment 1. Enumerable.Range
It might make sense to start with trying the already existing API from BCL: Enumerable.Range, whose source code we can find here.
Here's what our benchmark looks like:
[Benchmark]
public int ForeachWithEnumerableRange()
{
var a = 0;
foreach (var i in Enumerable.Range(0, Iterations))
a += i;
return a;
}
This code is interpreted by roslyn as
public int ForeachWithEnumerableRange()
{
int num = 0;
IEnumerator<int> enumerator = Enumerable.Range(0, Iterations).GetEnumerator();
try
{
while (enumerator.MoveNext())
{
int current = enumerator.Current;
num += current;
}
return num;
}
finally
{
if (enumerator != null)
{
enumerator.Dispose();
}
}
}
No need to study the asm codegen, as long as the benchmarks are pretty illustrative:
Method | Iterations | Mean | Error | StdDev |
---|---|---|---|---|
ForWithCustomIncrement | 10000 | 3.448 us | 0.0689 us | 0.0896 us |
ForeachWithEnumerableRange | 10000 | 43.946 us | 0.8685 us | 1.2456 us |
Experiment 2. yield return
The second easiest way is to create a custom generator:
private static IEnumerable<int> MyRange(int from, int to, int inc)
{
for (int i = from; i <= to; i += inc)
yield return i;
}
[Benchmark]
public int ForeachWithYieldReturn()
{
var a = 0;
foreach (var i in MyRange(0, Iterations - 1, 1))
a += i;
return a;
}
Nothing completely new, our enumerator is still a big reference type with a dozen of fields. The performance is even worse than that with Enumerable.Range:
Method | Iterations | Mean | Error | StdDev |
---|---|---|---|---|
ForWithCustomIncrement | 10000 | 3.548 us | 0.0661 us | 0.1320 us |
ForeachWithEnumerableRange | 10000 | 51.663 us | 0.9103 us | 1.2460 us |
ForeachWithYieldReturn | 10000 | 56.580 us | 0.3561 us | 0.2780 us |
Experiment 3. Custom enumerator
Now we are going to write our own enumerator. First of all, it will be a struct
. To make it work with foreach
, we will need a method bool MoveNext()
and a property int Current
.
public struct RangeEnumerator
{
private readonly int to;
private readonly int step;
private int curr;
public RangeEnumerator(int @from, int to, int step)
{
this.to = to + step;
this.step = step;
this.curr = @from - step;
}
public bool MoveNext()
{
curr += step;
return curr != to;
}
public int Current => curr;
}
To make System.Range
iterable with foreach
, we are making an extension method:
public static class Extensions
{
public static RangeEnumerator GetEnumerator(this Range @this)
=> (@this.Start, @this.End) switch
{
({ IsFromEnd: true, Value: 0 }, { IsFromEnd: true, Value: 0 }) => new RangeEnumerator(0, int.MaxValue, 1),
({ IsFromEnd: true, Value: 0 }, { IsFromEnd: false, Value: var to }) => new RangeEnumerator(0, to + 1, 1),
({ IsFromEnd: false, Value: var from }, { IsFromEnd: true, Value: 0 }) => new RangeEnumerator(from, int.MaxValue, 1),
({ IsFromEnd: false, Value: var from }, { IsFromEnd: false, Value: var to })
=> (from < to) switch
{
true => new RangeEnumerator(from, to, 1),
false => new RangeEnumerator(from, to, -1),
},
_ => throw new InvalidOperationException("Invalid range")
};
}
The code of the benching method:
[Benchmark]
public int ForeachWithRangeEnumerator()
{
var a = 0;
foreach (var i in 0..(Iterations - 1))
a += i;
return a;
}
and is interpreted by Roslyn as:
public int ForeachWithRangeEnumerator()
{
int num = 0;
RangeEnumerator enumerator = Extensions.GetEnumerator(new Range(0, Iterations - 1));
while (enumerator.MoveNext())
{
int current = enumerator.Current;
num += current;
}
return num;
}
This time it's not a reference type. Also, because our struct does not implement IDisposable
, Roslyn does not wrap it with try-finally block. Let's have a look at the benchmark results:
Method | Iterations | Mean | Error | StdDev |
---|---|---|---|---|
ForWithCustomIncrement | 10000 | 3.477 us | 0.0690 us | 0.0989 us |
ForeachWithEnumerableRange | 10000 | 50.501 us | 0.9984 us | 1.2627 us |
ForeachWithYieldReturn | 10000 | 53.782 us | 1.0583 us | 0.9900 us |
ForeachWithRangeEnumerator | 10000 | 18.061 us | 0.1731 us | 0.1619 us |
18 microseconds per 10k operations. That's much better than the previous attempts, but still worse than for
. Let's check the codegen:
One loop iteration looks like:
L003e: add esi, eax
L0040: mov eax, [rsp+0x38]
L0044: add eax, [rsp+0x34]
L0048: mov [rsp+0x38], eax
L004c: mov eax, [rsp+0x38]
L0050: cmp eax, [rsp+0x30]
L0054: jne short L003a
Despite the enumerator being a struct, and hence, on stack, it's still read from RAM. Let's see what else we can do!
Experiment 4. Separating the enumerator into a local function
Why...? Because what we want is in fact to put the enumerator itself onto registers instead of reading/writing from/to RAM all the time. But probably because of too many local variables, the JIT compiler decided not to put our struct onto registers, so let us help it a bit.
From experiment 3 we have
[Benchmark]
public int ForeachWithRangeEnumeratorRaw()
{
var a = 0;
var enumerator = (0..(Iterations - 1)).GetEnumerator();
while (enumerator.MoveNext())
a += enumerator.Current;
return a;
}
Note, that I replaced foreach
with the equivalent code, and added Raw
as postfix. In Experiment 3 it's shown how foreach
works under the hood - the same as I wrote manually.
To reduce the number of local variables, we move the loop to a local function:
[Benchmark]
public int ForeachWithRangeEnumeratorRawWithLocal()
{
var enumerator = (0..(Iterations - 1)).GetEnumerator();
return EnumerateItAll(enumerator);
static int EnumerateItAll(RangeEnumerator enumerator)
{
var a = 0;
while (enumerator.MoveNext())
a += enumerator.Current;
return a;
}
}
And run the benchmark:
Method | Iterations | Mean | Error | StdDev |
---|---|---|---|---|
ForWithCustomIncrement | 10000 | 3.498 us | 0.0658 us | 0.1153 us |
ForeachWithEnumerableRange | 10000 | 43.313 us | 0.8207 us | 1.0079 us |
ForeachWithYieldReturn | 10000 | 56.963 us | 1.0638 us | 1.0448 us |
ForeachWithRangeEnumerator | 10000 | 17.931 us | 0.2047 us | 0.1915 us |
ForeachWithRangeEnumeratorRaw | 10000 | 17.932 us | 0.1486 us | 0.1390 us |
ForeachWithRangeEnumeratorRawWithLocal | 10000 | 3.501 us | 0.0678 us | 0.0807 us |
ForeachWithRangeEnumeratorRawWithLocal
is the last method we wrote - and it's almost 6 times faster, than the one with the loop inlined into the body! What happened?
Let's look at the codegen.
In our method the local function hasn't been inlined:
Benchmarks.ForeachWithRangeEnumeratorRawWithLocal()
...
L0044: call Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|8_0(RangeEnumerator)
And it's good news. Let's have a look at our local function:
L0000: mov eax, [rcx]
L0002: mov edx, [rcx+4]
L0005: mov ecx, [rcx+8]
L0008: xor r8d, r8d
L000b: jmp short L0010
>> L000d: add r8d, ecx
>> L0010: add ecx, edx
>> L0012: cmp ecx, eax
>> L0014: jne short L000d
L0016: mov eax, r8d
L0019: ret
The four marked lines
L000d: add r8d, ecx
L0010: add ecx, edx
L0012: cmp ecx, eax
L0014: jne short L000d
are what we were trying to do. It's identical to the codegen for the for
loop!
The first three lines
L0000: mov eax, [rcx]
L0002: mov edx, [rcx+4]
L0005: mov ecx, [rcx+8]
simply load the local variables from stack onto registers. That is why our method is so fast.
Windows vs Linux
There is in fact a huge difference between what was generated on windows vs linux. What I wrote before is how it works on Windows.
To illustrate the difference I created a gist for comparing codegens for ForeachWithRangeEnumeratorRaw
and ForeachWithRangeEnumeratorRawWithLocal
. Let me go over the key differences:
In case of ForeachWithRangeEnumeratorRaw
method, Linux's JIT generates:
M00_L00:
add ebx,esi
add esi,edi
cmp esi,eax
jne short M00_L00
Whereas Windows' JIT yields this:
M00_L00:
add esi,[rsp+38]
mov eax,[rsp+38]
add eax,[rsp+34]
mov [rsp+38],eax
mov eax,[rsp+38]
cmp eax,[rsp+30]
jne short M00_L00
Unlike that one, in case of ForeachWithRangeEnumeratorRawWithLocal
Linux gives
jmp near ptr Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|14_0(Library.RangeEnumerator)
in the method, and in local function it generates:
M02_L00:
mov edi,[rbp-8]
add eax,edi
add edi,[rbp-0C]
mov [rbp-8],edi
cmp edi,esi
jne short M02_L00
Meanwhile, Windows's JIT gives call:
call Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|14_0(Library.RangeEnumerator)
and the iteration codegen:
M02_L00:
add r8d,ecx
add ecx,edx
cmp ecx,eax
jne short M02_L00
(all that codegen see in the gist)
Conclusion
First, although we got what we wanted, it's definitely easier to write a for
loop than caring about whether the compiler put our values onto registers.
Second, it's an interesting example how sometimes inlining can literally ruin the performance (in our case, on Windows it took 5-6 times more time on 10k iterations, 3.5 us vs 18 us).
References
All the code I wrote is on a github repo.
Tests (to verify that the benched methods are correct)
BenchmarkDotNet - tool for accurate performance measurement