## Question:

In general, I read an article on Habré , which is a translation of the T-shirt blog.

The article describes SIMD features and provides examples of optimizations without using this feature.

Trivial loop

```
public int Sum(ReadOnlySpan<int> source)
{
int result = 0;
for (int i = 0; i < source.Length; i++)
{
result += source[i];
}
return result;
}
```

A cycle that is 20% faster than the previous one:

```
public unsafe int SumUnrolled(ReadOnlySpan<int> source)
{
int result = 0;
int i = 0;
int lastBlockIndex = source.Length - (source.Length % 4);
// Pin source so we can elide the bounds checks
fixed (int* pSource = source)
{
while (i < lastBlockIndex)
{
result += pSource[i + 0];
result += pSource[i + 1];
result += pSource[i + 2];
result += pSource[i + 3];
i += 4;
}
while (i < source.Length)
{
result += pSource[i];
i += 1;
}
}
return result;
}
```

And how is this speed achieved? Ok, I agree that disabling array bounds checking improves performance. However, does it have any impact that several summations are performed at 1 iteration of the loop at a time?

Yes, most likely the number of jumps has decreased at a low level, but isn't the processor itself capable of solving this with the help of predictions and speculative execution?

## Answer:

In the article right before the listing it says:

Most modern processors can perform four addition operations in one cycle (under optimal conditions), as a result of which, with the correct "layout" of the code, you can sometimes improve performance, even in a single-threaded implementation.

The cores in modern processors contain multiple ALUs. Starting with Haswell, Intel processor cores contain four arithmetic calculators and four address blocks . Therefore, each core can perform four arithmetic micro-ops at the same time. And four memory access operations.

```
result += pSource[i + 0];
result += pSource[i + 1];
result += pSource[i + 2];
result += pSource[i + 3];
```

Here you have four indirect addresses, which load four address blocks, and four accumulator additions, which load four arithmetic blocks. Therefore, the ALU in the processor is maximally loaded.

In addition, there are four times fewer comparisons in the expanded loop, that is, four times fewer *extra instructions* in the ALU.