WhiteBlackGoose Sep 2 2021 at 10:06

Making «foreach» loop as fast as «for» loop

6 min

15K

Translation

Hi. I want to share some experience on creating a enumerator for foreach loop so that it would be as performant as if it was a for loop.

Problem statement

As we know, System.Range has a nice syntax sugar:

var a = 1..2; // same as new Range(1, 2)

That's why I decided to abuse it as a replacement for a for loop:

foreach (var i in 1..5)
    Console.Write(i);

(prints "12345")

When you iterate over something with foreach, Roslyn requires that "something" to have a GetEnumerator method, which should return something, which has bool MoveNext() and T Current. We can add it as an extension method:

public static class Extensions
{
    public ... GetEnumerator(this System.Range range) => ...
}

Baseline (our goal)

Theoretically, for is more functional than just going over a sequence of natural numbers. But because so often we see

for (int i = 0; i < n; i++)

that it deserves its own use case of for. We are going to compare against a for loop with i += inc as a step:

for (int i = 0; i < n; i += inc)

In terms of performance, it's identical to the one with i++:

Method	Iterations	Mean	Error	StdDev
i++	10000	3.380 us	0.0534 us	0.0446 us
i += inc	10000	3.385 us	0.0575 us	0.0685 us

The code for baseline for:

[Benchmark]
public int ForWithCustomIncrement()
{
    var iters = Iterations;
    var a = 0;
    var inc = Inc;
    for (int i = 0; i < iters; i += inc)
        a += i;
    return a;
}

(we loaded all fields/properties onto local variables to avoid repeatitive reading from far RAM)

Codegen for it:

    L000f: add edx, r8d
    L0012: add r8d, ecx
    L0015: cmp r8d, eax
    L0018: jl short L000f

We only care about codegen of iteration of the loop, ignoring the initial overhead and all other operations around.

Experiment 1. Enumerable.Range

It might make sense to start with trying the already existing API from BCL: Enumerable.Range, whose source code we can find here.

Here's what our benchmark looks like:

[Benchmark]
public int ForeachWithEnumerableRange()
{
    var a = 0;
    foreach (var i in Enumerable.Range(0, Iterations))
        a += i;
    return a;
}

This code is interpreted by roslyn as

public int ForeachWithEnumerableRange()
{
    int num = 0;
    IEnumerator<int> enumerator = Enumerable.Range(0, Iterations).GetEnumerator();
    try
    {
        while (enumerator.MoveNext())
        {
            int current = enumerator.Current;
            num += current;
        }
        return num;
    }
    finally
    {
        if (enumerator != null)
        {
            enumerator.Dispose();
        }
    }
}

No need to study the asm codegen, as long as the benchmarks are pretty illustrative:

Method	Iterations	Mean	Error	StdDev
ForWithCustomIncrement	10000	3.448 us	0.0689 us	0.0896 us
ForeachWithEnumerableRange	10000	43.946 us	0.8685 us	1.2456 us

Experiment 2. yield return

The second easiest way is to create a custom generator:

private static IEnumerable<int> MyRange(int from, int to, int inc)
{
    for (int i = from; i <= to; i += inc)
        yield return i;
}

[Benchmark]
public int ForeachWithYieldReturn()
{
    var a = 0;
    foreach (var i in MyRange(0, Iterations - 1, 1))
        a += i;
    return a;
}

Nothing completely new, our enumerator is still a big reference type with a dozen of fields. The performance is even worse than that with Enumerable.Range:

Method	Iterations	Mean	Error	StdDev
ForWithCustomIncrement	10000	3.548 us	0.0661 us	0.1320 us
ForeachWithEnumerableRange	10000	51.663 us	0.9103 us	1.2460 us
ForeachWithYieldReturn	10000	56.580 us	0.3561 us	0.2780 us

Experiment 3. Custom enumerator

Now we are going to write our own enumerator. First of all, it will be a struct. To make it work with foreach, we will need a method bool MoveNext() and a property int Current.

public struct RangeEnumerator
{
    private readonly int to;
    private readonly int step;

    private int curr;

    public RangeEnumerator(int @from, int to, int step)
    {
        this.to = to + step;
        this.step = step;
        this.curr = @from - step;
    }

    public bool MoveNext()
    {
        curr += step;
        return curr != to;
    }
    
    public int Current => curr;
}

To make System.Range iterable with foreach, we are making an extension method:

public static class Extensions
{
    public static RangeEnumerator GetEnumerator(this Range @this)
        => (@this.Start, @this.End) switch
        {
            ({ IsFromEnd: true, Value: 0 }, { IsFromEnd: true, Value: 0 }) => new RangeEnumerator(0, int.MaxValue, 1),
            ({ IsFromEnd: true, Value: 0 }, { IsFromEnd: false, Value: var to }) => new RangeEnumerator(0, to + 1, 1),
            ({ IsFromEnd: false, Value: var from }, { IsFromEnd: true, Value: 0 }) => new RangeEnumerator(from, int.MaxValue, 1),
            ({ IsFromEnd: false, Value: var from }, { IsFromEnd: false, Value: var to })
                => (from < to) switch
                {
                    true => new RangeEnumerator(from, to, 1),
                    false => new RangeEnumerator(from, to, -1),
                },
            _ => throw new InvalidOperationException("Invalid range")
        };
}

The code of the benching method:

[Benchmark]
public int ForeachWithRangeEnumerator()
{
    var a = 0;
    foreach (var i in 0..(Iterations - 1))
        a += i;
    return a;
}

and is interpreted by Roslyn as:

public int ForeachWithRangeEnumerator()
{
    int num = 0;
    RangeEnumerator enumerator = Extensions.GetEnumerator(new Range(0, Iterations - 1));
    while (enumerator.MoveNext())
    {
        int current = enumerator.Current;
        num += current;
    }
    return num;
}

This time it's not a reference type. Also, because our struct does not implement IDisposable, Roslyn does not wrap it with try-finally block. Let's have a look at the benchmark results:

Method	Iterations	Mean	Error	StdDev
ForWithCustomIncrement	10000	3.477 us	0.0690 us	0.0989 us
ForeachWithEnumerableRange	10000	50.501 us	0.9984 us	1.2627 us
ForeachWithYieldReturn	10000	53.782 us	1.0583 us	0.9900 us
ForeachWithRangeEnumerator	10000	18.061 us	0.1731 us	0.1619 us

18 microseconds per 10k operations. That's much better than the previous attempts, but still worse than for. Let's check the codegen:

One loop iteration looks like:

    L003e: add esi, eax
    L0040: mov eax, [rsp+0x38]
    L0044: add eax, [rsp+0x34]
    L0048: mov [rsp+0x38], eax
    L004c: mov eax, [rsp+0x38]
    L0050: cmp eax, [rsp+0x30]
    L0054: jne short L003a

Despite the enumerator being a struct, and hence, on stack, it's still read from RAM. Let's see what else we can do!

Experiment 4. Separating the enumerator into a local function

Why...? Because what we want is in fact to put the enumerator itself onto registers instead of reading/writing from/to RAM all the time. But probably because of too many local variables, the JIT compiler decided not to put our struct onto registers, so let us help it a bit.

From experiment 3 we have

[Benchmark]
public int ForeachWithRangeEnumeratorRaw()
{
    var a = 0;
    var enumerator = (0..(Iterations - 1)).GetEnumerator();
    while (enumerator.MoveNext())
        a += enumerator.Current;
    return a;
}

Note, that I replaced foreach with the equivalent code, and added Raw as postfix. In Experiment 3 it's shown how foreach works under the hood - the same as I wrote manually.

To reduce the number of local variables, we move the loop to a local function:

[Benchmark]
public int ForeachWithRangeEnumeratorRawWithLocal()
{
    var enumerator = (0..(Iterations - 1)).GetEnumerator();        
    return EnumerateItAll(enumerator);

    static int EnumerateItAll(RangeEnumerator enumerator)
    {
        var a = 0;
        while (enumerator.MoveNext())
            a += enumerator.Current;
        return a;
    }
}

And run the benchmark:

Method	Iterations	Mean	Error	StdDev
ForWithCustomIncrement	10000	3.498 us	0.0658 us	0.1153 us
ForeachWithEnumerableRange	10000	43.313 us	0.8207 us	1.0079 us
ForeachWithYieldReturn	10000	56.963 us	1.0638 us	1.0448 us
ForeachWithRangeEnumerator	10000	17.931 us	0.2047 us	0.1915 us
ForeachWithRangeEnumeratorRaw	10000	17.932 us	0.1486 us	0.1390 us
ForeachWithRangeEnumeratorRawWithLocal	10000	3.501 us	0.0678 us	0.0807 us

ForeachWithRangeEnumeratorRawWithLocal is the last method we wrote - and it's almost 6 times faster, than the one with the loop inlined into the body! What happened?

Let's look at the codegen.

In our method the local function hasn't been inlined:

Benchmarks.ForeachWithRangeEnumeratorRawWithLocal()
    ...
    L0044: call Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|8_0(RangeEnumerator)

And it's good news. Let's have a look at our local function:

    L0000: mov eax, [rcx]
    L0002: mov edx, [rcx+4]
    L0005: mov ecx, [rcx+8]
    L0008: xor r8d, r8d
    L000b: jmp short L0010
>>  L000d: add r8d, ecx         
>>  L0010: add ecx, edx         
>>  L0012: cmp ecx, eax         
>>  L0014: jne short L000d      
    L0016: mov eax, r8d
    L0019: ret

The four marked lines

    L000d: add r8d, ecx
    L0010: add ecx, edx
    L0012: cmp ecx, eax
    L0014: jne short L000d

are what we were trying to do. It's identical to the codegen for the for loop!

The first three lines

    L0000: mov eax, [rcx]
    L0002: mov edx, [rcx+4]
    L0005: mov ecx, [rcx+8]

simply load the local variables from stack onto registers. That is why our method is so fast.

Windows vs Linux

There is in fact a huge difference between what was generated on windows vs linux. What I wrote before is how it works on Windows.

To illustrate the difference I created a gist for comparing codegens for ForeachWithRangeEnumeratorRaw and ForeachWithRangeEnumeratorRawWithLocal. Let me go over the key differences:

In case of ForeachWithRangeEnumeratorRaw method, Linux's JIT generates:

M00_L00:
       add       ebx,esi
       add       esi,edi
       cmp       esi,eax
       jne       short M00_L00

Whereas Windows' JIT yields this:

M00_L00:
       add       esi,[rsp+38]
       mov       eax,[rsp+38]
       add       eax,[rsp+34]
       mov       [rsp+38],eax
       mov       eax,[rsp+38]
       cmp       eax,[rsp+30]
       jne       short M00_L00

Unlike that one, in case of ForeachWithRangeEnumeratorRawWithLocal Linux gives

jmp       near ptr Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|14_0(Library.RangeEnumerator)

in the method, and in local function it generates:

M02_L00:
       mov       edi,[rbp-8]
       add       eax,edi
       add       edi,[rbp-0C]
       mov       [rbp-8],edi
       cmp       edi,esi
       jne       short M02_L00

Meanwhile, Windows's JIT gives call:

call      Benchmarks.<ForeachWithRangeEnumeratorRawWithLocal>g__EnumerateItAll|14_0(Library.RangeEnumerator)

and the iteration codegen:

M02_L00:
       add       r8d,ecx
       add       ecx,edx
       cmp       ecx,eax
       jne       short M02_L00

(all that codegen see in the gist)

Conclusion

First, although we got what we wanted, it's definitely easier to write a for loop than caring about whether the compiler put our values onto registers.

Second, it's an interesting example how sometimes inlining can literally ruin the performance (in our case, on Windows it took 5-6 times more time on 10k iterations, 3.5 us vs 18 us).

References

All the code I wrote is on a github repo.

Benchmarks
The enumerator's implementation
Tests (to verify that the benched methods are correct)
BenchmarkDotNet - tool for accurate performance measurement

Hubs: