c# – Loop through an array: forward, backward, or iterator?

Question:

Suppose there is an array, data type and length are any, for example:

byte[] arr = new byte[16777216];

And you need to go through all the elements of the array (for example, find the sum of the elements). Is there any difference, in terms of performance, if I go through the array using an iterator

foreach (byte b in arr)
    sum += b;

by index by forward stroke

for (int i = 0; i < arr.Length; i++)
    sum += arr[i];

or vice versa

for (int i = arr.Length - 1; i >= 0; i--)
    sum += arr[i];

?

Are there any general recommendations, or will everything depend heavily on the type of the array element?

upd.

Of course, I tried to compare these approaches (the code that I used is below), and got approximately the following results:

C:\...\bin\x64\Release>ConsoleApplication2.exe
foreach: 11135 msec
for forward: 11141 msec
for backward: 15147 msec

C:\...\bin\x86\Release>ConsoleApplication2.exe
foreach: 11304 msec
for forward: 69305 msec
for backward: 68075 msec

I don't quite understand two things. The first is the impressive difference between for and foreach on x86. I tried to study the question superficially, but I did not come across this right away, they basically write that for / foreach should be comparable. Those. maybe I'm just somehow incorrectly comparing, or it is specifically due to the fact that the operation in the loop is the sum. And second, what (at least approximately) can be connected with the fact that in x64 the reverse is 10-15% slower.

Here is the code I used:

public static void Main()
{
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
    Thread.CurrentThread.Priority = ThreadPriority.Highest;

    const int repeats = 2000;
    byte[] arr = new byte[16777216];

    Random rnd = new Random();
    rnd.NextBytes(arr);

    long sum = 0;

    Stopwatch sw = new Stopwatch();

    long ms_foreach = 0;
    long ms_for_fwd = 0;
    long ms_for_back = 0;

    for (int r = 0; r < repeats; r++)
    {
        sum = 0;
        sw.Reset();
        sw.Start();
        foreach (byte b in arr)
            sum += b;
        sw.Stop();
        ms_foreach += sw.ElapsedMilliseconds;

        sum = 0;
        sw.Reset();
        sw.Start();
        for (int i = 0; i < arr.Length; i++)
            sum += arr[i];
        sw.Stop();
        ms_for_fwd += sw.ElapsedMilliseconds;

        sum = 0;
        sw.Reset();
        sw.Start();
        for (int i = arr.Length - 1; i >= 0; i--)
            sum += arr[i];
        sw.Stop();
        ms_for_back += sw.ElapsedMilliseconds;
    }

    Console.WriteLine("foreach: {0} msec", ms_foreach);
    Console.WriteLine("for forward: {0} msec", ms_for_fwd);
    Console.WriteLine("for backward: {0} msec", ms_for_back);
}

I tested the code on different machines and compiled it under .net 2.0, 3.5, 4.0, 4.5 – the results are similar.

Answer:

Ok, why may or may not be a difference.

[MethodImpl(MethodImplOptions.NoInlining)]
public static long ForEach(byte [] arr)
{
    long sum = 0;
    foreach (byte b in arr)
        sum += b;
    return sum;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static long ForForward(byte[] arr)
{
    long sum = 0;
    for (int i = 0; i < arr.Length; i++)
        sum += arr[i];
    return sum;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static long ForBackward(byte[] arr)
{
    long sum = 0;
    for (int i = 0; i < arr.Length; i++)
        sum += arr[i];
    return sum;
}

Your C # code goes through two stages of compilation.

The first is compilation to IL. After that, obviously, the methods will be copied into slightly different IL code. By itself, foreach will be turned into a regular iteration of the array element by element, without calling IEnumerable:

.method public hidebysig static int64  ForEach(uint8[] arr) cil managed noinlining
{
  // Code size       30 (0x1e)
  .maxstack  2
  .locals init ([0] int64 sum,
           [1] uint8[] V_1,
           [2] int32 V_2,
           [3] uint8 b)
  IL_0000:  ldc.i4.0
  IL_0001:  conv.i8
  IL_0002:  stloc.0
  IL_0003:  ldarg.0
  IL_0004:  stloc.1
  IL_0005:  ldc.i4.0
  IL_0006:  stloc.2
  IL_0007:  br.s       IL_0016
  IL_0009:  ldloc.1
  IL_000a:  ldloc.2
  IL_000b:  ldelem.u1
  IL_000c:  stloc.3
  IL_000d:  ldloc.0
  IL_000e:  ldloc.3
  IL_000f:  conv.u8
  IL_0010:  add
  IL_0011:  stloc.0
  IL_0012:  ldloc.2
  IL_0013:  ldc.i4.1
  IL_0014:  add
  IL_0015:  stloc.2
  IL_0016:  ldloc.2
  IL_0017:  ldloc.1
  IL_0018:  ldlen
  IL_0019:  conv.i4
  IL_001a:  blt.s      IL_0009
  IL_001c:  ldloc.0
  IL_001d:  ret
} // end of method Program::ForEach

The variant for For will differ from it by the absence of one local variable – [3] uint8 b , and, accordingly, there will be a couple of operations less.

After that, when directly executed, the code is JIT-compiled into machine code. Here is the result (on my machine) for x64:

ForEach:

            long sum = 0;
00007FFB5E8604F0  xor         eax,eax  
            foreach (byte b in arr)
00007FFB5E8604F2  mov         rdx,rcx  
00007FFB5E8604F5  xor         ecx,ecx  
00007FFB5E8604F7  mov         r8d,dword ptr [rdx+8]  
00007FFB5E8604FB  test        r8d,r8d  
00007FFB5E8604FE  jle         00007FFB5E86051A  
00007FFB5E860500  movsxd      r9,ecx  
00007FFB5E860503  movzx       r9d,byte ptr [rdx+r9+10h]  
                sum += b;
00007FFB5E860509  and         r9d,0FFFFFFFFh  
00007FFB5E860510  add         rax,r9  
00007FFB5E860513  inc         ecx  
            foreach (byte b in arr)
00007FFB5E860515  cmp         r8d,ecx  
00007FFB5E860518  jg          00007FFB5E860500  
00007FFB5E86051A  ret  

For:

            long sum = 0;
00007FFB5E860530  xor         eax,eax  
            for (int i = 0; i < arr.Length; i++)
00007FFB5E860532  xor         edx,edx  
00007FFB5E860534  mov         r8d,dword ptr [rcx+8]  
00007FFB5E860538  test        r8d,r8d  
00007FFB5E86053B  jle         00007FFB5E860557  
                sum += arr[i];
00007FFB5E86053D  movsxd      r9,edx  
00007FFB5E860540  movzx       r9d,byte ptr [rcx+r9+10h]  
00007FFB5E860546  and         r9d,0FFFFFFFFh  
00007FFB5E86054D  add         rax,r9  
            for (int i = 0; i < arr.Length; i++)
00007FFB5E860550  inc         edx  
00007FFB5E860552  cmp         r8d,edx  
00007FFB5E860555  jg          00007FFB5E86053D  
00007FFB5E860557  ret  

Reverse For:

            long sum = 0;
00007FFB5E850570  sub         rsp,28h  
00007FFB5E850574  xor         eax,eax  
            for (int i = arr.Length - 1; i >= 0; i--)
00007FFB5E850576  mov         edx,dword ptr [rcx+8]  
00007FFB5E850579  lea         r8d,[rdx-1]  
00007FFB5E85057D  test        r8d,r8d  
00007FFB5E850580  jl          00007FFB5E8505A2  
                sum += arr[i];
00007FFB5E850582  cmp         r8d,edx  
00007FFB5E850585  jae         00007FFB5E8505A7  
00007FFB5E850587  movsxd      r9,r8d  
00007FFB5E85058A  movzx       r9d,byte ptr [rcx+r9+10h]  
00007FFB5E850590  and         r9d,0FFFFFFFFh  
00007FFB5E850597  add         rax,r9  
            for (int i = arr.Length - 1; i >= 0; i--)
00007FFB5E85059A  dec         r8d  
00007FFB5E85059D  test        r8d,r8d  
00007FFB5E8505A0  jge         00007FFB5E850582  
00007FFB5E8505A2  add         rsp,28h  
00007FFB5E8505A6  ret  
00007FFB5E8505A7  call        00007FFBBE301A08  
00007FFB5E8505AC  int         3  

Two things are visible:

  • The executable code for foreach and for is identical . Only registers differ (edx and ecx are swapped)
  • The executable code for the inverse for differs only in the direction of changing the variable and in checking it – test instead of cmp.

Once upon a time, back in the days of the dinosaurs, test was noticeably faster than cmp – and inverse fors were all the rage. Now – there is no difference, at least on desktop processors.

The reverse for can be a little slower due to the peculiarities of the memory prefetch – the processor assumes that memory will be used in forward order, and the call to the previous page becomes a surprise to it. But then again, modern processors are relatively all the same.


What's going on on x86?

On this platform the long problem is shooting. It simply does not fit into registers, and the JIT has to spin, breaking the addition operations into two. And, accordingly, using two registers to store the result. And at the same time he manages to fix in the case of For. Here's a piece of the summation:

foreach:

        foreach (byte b in arr)
011F04BA  movzx       eax,byte ptr [ecx+esi+8]  
            sum += b;
011F04BF  xor         edx,edx  
011F04C1  add         ebx,eax  
011F04C3  adc         edi,edx  
011F04C5  inc         esi  
        foreach (byte b in arr)
011F04C6  cmp         dword ptr [ecx+4],esi  
011F04C9  jg          011F04BA  
        return sum;

в for:

                sum += arr[i];
011F04FF  movzx       eax,byte ptr [ecx+esi+8]  
011F0504  xor         edx,edx  
011F0506  add         eax,ebx  
011F0508  adc         edx,edi  
011F050A  mov         ebx,eax  
011F050C  mov         edi,edx  
            for (int i = 0; i < arr.Length; i++)
011F050E  inc         esi  
011F050F  cmp         dword ptr [ebp-10h],esi  
011F0512  jg          011F04FF  

The direction of addition is different. For foreach, it is usually done:

(edi ebx) += (edx eax)

For for, the optimizer messes up and does

(edx eax) += (edi ebx)
(edi ebx) =  (edx eax)

This shifting gives your difference on x86.

The main takeaway from this is to use x64. This is good. There is a new good JIT for it. And x86 is evil 🙂

Scroll to Top