C# – Managed versus unmanaged code when working with X86 intrinsics

Question:

I learned that when working with intrinsics from System.Runtime.Intrinsics.X86 it is not necessary to use pointers to address the data: you can simply cast an array of data with System.Runtime.InteropServices.MemoryMarshal, and this works about as fast as going through pointers in unsafe code. I was surprised and tested the performance with BenchmarkDotNet.

I wrote 4 benchmarks: a scalar one to check the result, one using System.Numerics.Vector<T> to compare performance (it's interesting in its own right), and two tests based on Vector256<int> – one with managed and one with unmanaged code.

I took the simplest task – summing the elements of an array of 100 million elements. I realize the implementation has a limitation: the length of the array must be a multiple of 8, the number of 32-bit ints in a 256-bit vector (8 × 32), otherwise the result will be unpredictable (a sketch of handling the remainder follows the benchmark class).

public class SumTest
{
    private static readonly int[] _numbers = Enumerable.Repeat(2, 100000000).ToArray();

    public IEnumerable<object> Params
    {
        get
        {
            yield return _numbers;
        }
    }

    [Benchmark]
    [ArgumentsSource(nameof(Params))]
    public int SumScalar(int[] numbers)
    {
        int result = 0;
        for (int i = 0; i < numbers.Length; i++)
        {
            result += numbers[i];
        }
        return result;
    }

    [Benchmark]
    [ArgumentsSource(nameof(Params))]
    public int SumNumerics(int[] numbers)
    {
        Vector<int> acc = Vector<int>.Zero;
        for (int i = 0; i < numbers.Length; i += Vector<int>.Count)
        {
            Vector<int> v = new Vector<int>(numbers, i);
            acc += v;
        }
        return Vector.Dot(acc, Vector<int>.One);
    }

    [Benchmark]
    [ArgumentsSource(nameof(Params))]
    public int SumIntrinsics(int[] numbers)
    {
        ReadOnlySpan<Vector256<int>> vectors = MemoryMarshal.Cast<int, Vector256<int>>(numbers);
        Vector256<int> acc = Vector256<int>.Zero;
        for (int i = 0; i < vectors.Length; i++)
        {
            acc = Avx2.Add(acc, vectors[i]);
        }
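        // Reduce the 8 lanes to a single value: combine the upper and lower 128-bit halves,
        // then two more horizontal adds leave the total sum in element 0.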
        Vector128<int> r = Ssse3.HorizontalAdd(acc.GetUpper(), acc.GetLower());
        r = Ssse3.HorizontalAdd(r, r);
        r = Ssse3.HorizontalAdd(r, r);
        return r.GetElement(0);
    }

    [Benchmark]
    [ArgumentsSource(nameof(Params))]
    public unsafe int SumIntrinsicsUnsafe(int[] numbers)
    {
        Vector256<int> acc = Vector256<int>.Zero;
        fixed (int* numPtr = numbers)
        {
            int* endPtr = numPtr + numbers.Length;
            for (int* numPos = numPtr; numPos < endPtr; numPos += 8)
            {
                Vector256<int> v = Avx.LoadVector256(numPos);
                acc = Avx2.Add(acc, v);
            }
            Vector128<int> r = Ssse3.HorizontalAdd(acc.GetUpper(), acc.GetLower());
            r = Ssse3.HorizontalAdd(r, r);
            r = Ssse3.HorizontalAdd(r, r);
            return r.GetElement(0);
        }
    }
}
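
For completeness, here is a minimal sketch (a hypothetical helper of my own, not one of the benchmarks) of how the multiple-of-8 limitation could be lifted: MemoryMarshal.Cast simply truncates to whole vectors, so the leftover 0–7 elements can be added with a scalar tail loop.

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SumSketch
{
    // Sums an int[] of any length: full 256-bit blocks via AVX2, the remainder scalar.
    public static int SumWithTail(int[] numbers)
    {
        ReadOnlySpan<Vector256<int>> vectors = MemoryMarshal.Cast<int, Vector256<int>>(numbers);
        Vector256<int> acc = Vector256<int>.Zero;
        for (int i = 0; i < vectors.Length; i++)
        {
            acc = Avx2.Add(acc, vectors[i]);
        }
        Vector128<int> r = Ssse3.HorizontalAdd(acc.GetUpper(), acc.GetLower());
        r = Ssse3.HorizontalAdd(r, r);
        r = Ssse3.HorizontalAdd(r, r);
        int result = r.GetElement(0);

        // Scalar tail: the 0..7 elements that did not fit into a whole Vector256<int>.
        for (int i = vectors.Length * 8; i < numbers.Length; i++)
        {
            result += numbers[i];
        }
        return result;
    }
}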

I checked the output:

int[] numbers = Enumerable.Repeat(2, 100000000).ToArray();
SumTest sum = new SumTest();
Console.WriteLine(sum.SumScalar(numbers));
Console.WriteLine(sum.SumNumerics(numbers));
Console.WriteLine(sum.SumIntrinsics(numbers));
Console.WriteLine(sum.SumIntrinsicsUnsafe(numbers));
200000000
200000000
200000000
200000000

That is, everything is OK.

Then I built and ran the benchmark.

var summary = BenchmarkRunner.Run<SumTest>();

And I was surprised again.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4700HQ CPU 2.40GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
  DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT

|              Method |          numbers |     Mean |    Error |   StdDev |
|-------------------- |----------------- |---------:|---------:|---------:|
|           SumScalar | Int32[100000000] | 83.69 ms | 0.466 ms | 0.436 ms |
|         SumNumerics | Int32[100000000] | 31.30 ms | 0.303 ms | 0.268 ms |
|       SumIntrinsics | Int32[100000000] | 28.98 ms | 0.282 ms | 0.236 ms |
| SumIntrinsicsUnsafe | Int32[100000000] | 28.80 ms | 0.191 ms | 0.169 ms |

That is, the difference between SumIntrinsics and SumIntrinsicsUnsafe is within the statistical error (StdDev).

Question: What kind of beast is this MemoryMarshal, and is there any point in using unsafe anymore when working with intrinsics, or with vectors in general?

As for whether the results of calculations can be written back into an array from safe code – yes, they can: the array is cast the same way, and everything written to the vectors ends up in the array, i.e. it works exactly like a regular array of structs. In other words, the benefit of unsafe code is not immediately obvious – except, perhaps, when the source data originally arrives as a pointer rather than as a managed array, but there may be nuances there; I'm not deep into the topic.
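
To make that concrete, here is a minimal sketch (my own example, not from the benchmark) of writing results back into a managed array from safe code: casting a writable Span<int> gives Vector256<int> views that alias the array, so stores through them land directly in it (again assuming the length is a multiple of 8).

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class InPlaceSketch
{
    // Doubles every element of the array in place, entirely in safe code.
    public static void DoubleInPlace(int[] numbers)
    {
        Span<Vector256<int>> vectors = MemoryMarshal.Cast<int, Vector256<int>>(numbers.AsSpan());
        for (int i = 0; i < vectors.Length; i++)
        {
            // The store goes straight into the underlying int[] – no copying back.
            vectors[i] = Avx2.Add(vectors[i], vectors[i]);
        }
    }
}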

By the way, Vector<T> pleasantly surprised me. I think that in cases where the code is not extremely performance-sensitive, you can use System.Numerics and get CPU portability in return.
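
As a small illustration of that portability (my own snippet, not part of the benchmark): Vector<T> chooses its width at runtime for the current CPU, so the same loop code uses 256-bit registers on an AVX2 machine and narrower ones elsewhere.

using System;
using System.Numerics;

// On an AVX2 machine such as the Haswell CPU above, Vector<int>.Count is 8
// (256-bit registers); on an SSE2-only CPU it would be 4, and the same code still runs.
Console.WriteLine(Vector.IsHardwareAccelerated);
Console.WriteLine(Vector<int>.Count);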


Addition

I also rewrote the SumNumerics method and added another implementation, SumIntrinsicsHybrid.

[Benchmark]
[ArgumentsSource(nameof(Params))]
public int SumNumerics(int[] numbers)
{
    ReadOnlySpan<Vector<int>> vectors = MemoryMarshal.Cast<int, Vector<int>>(numbers);
    Vector<int> acc = Vector<int>.Zero;
    for (int i = 0; i < vectors.Length; i++)
    {
        acc += vectors[i];
    }
    return Vector.Dot(acc, Vector<int>.One);
}

[Benchmark]
[ArgumentsSource(nameof(Params))]
public unsafe int SumIntrinsicsHybrid(int[] numbers)
{
    ReadOnlySpan<Vector256<int>> vectors = MemoryMarshal.Cast<int, Vector256<int>>(numbers);
    Vector256<int> acc = Vector256<int>.Zero;
    fixed (Vector256<int>* numPtr = vectors)
    {
        Vector256<int>* endPtr = numPtr + vectors.Length;
        for (Vector256<int>* numPos = numPtr; numPos < endPtr; numPos++)
        {
            acc = Avx2.Add(acc, *numPos);
        }
        Vector128<int> r = Ssse3.HorizontalAdd(acc.GetUpper(), acc.GetLower());
        r = Ssse3.HorizontalAdd(r, r);
        r = Ssse3.HorizontalAdd(r, r);
        return r.GetElement(0);
    }
}

The benchmark again shows that the cast via MemoryMarshal, even if it isn't free, fully pays for itself.

|              Method |          numbers |     Mean |    Error |   StdDev |
|-------------------- |----------------- |---------:|---------:|---------:|
|           SumScalar | Int32[100000000] | 83.30 ms | 0.214 ms | 0.189 ms |
|         SumNumerics | Int32[100000000] | 28.85 ms | 0.222 ms | 0.207 ms |
|       SumIntrinsics | Int32[100000000] | 28.74 ms | 0.145 ms | 0.136 ms |
| SumIntrinsicsUnsafe | Int32[100000000] | 28.14 ms | 0.234 ms | 0.195 ms |
| SumIntrinsicsHybrid | Int32[100000000] | 28.09 ms | 0.174 ms | 0.163 ms |

A test on a small array:

|              Method |     numbers |      Mean |    Error |   StdDev |
|-------------------- |------------ |----------:|---------:|---------:|
|           SumScalar | Int32[1000] | 712.65 ns | 2.889 ns | 2.702 ns |
|         SumNumerics | Int32[1000] |  81.22 ns | 0.466 ns | 0.436 ns |
|       SumIntrinsics | Int32[1000] |  82.63 ns | 0.311 ns | 0.291 ns |
| SumIntrinsicsUnsafe | Int32[1000] |  60.66 ns | 0.347 ns | 0.308 ns |
| SumIntrinsicsHybrid | Int32[1000] |  61.01 ns | 0.418 ns | 0.370 ns |

Answer:

Perhaps your question really boils down to what Span<T> / ReadOnlySpan<T> is. An overview article, although superficial, gives an idea of it.

In short, the part you're interested in is the type's definition:

public readonly ref struct Span<T>
{
  private readonly ref T _pointer;
  private readonly int _length;

  ...
}

Span is a stack-only structure, and the trick is contained in this field:

readonly ref T _pointer

Out of context it's not clear whether this is closer to a reference or to a pointer in C++ terms, so I'll use the word pointer. This internal "ref T" is a tracked pointer: unlike with the fixed statement, nothing is pinned on the heap, and the GC itself updates the address behind this pointer after the compaction phase.
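
To illustrate, a small sketch of my own (the runtime does not guarantee that this particular array will actually be relocated): a span created over a managed array keeps working even if the GC moves the array during compaction, because the runtime updates the interior pointer behind the span.

using System;

int[] data = { 1, 2, 3, 4 };
Span<int> span = data;          // no pinning, just a tracked interior pointer

// Create some garbage and force a blocking, compacting collection;
// even if the array is moved, the span still sees the right memory.
for (int i = 0; i < 10_000; i++) _ = new byte[256];
GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);

Console.WriteLine(span[2]);     // prints 3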

According to the documentation, tracking such pointers is expensive in terms of performance, so Span was made a ref struct, which cannot end up on the heap even as part of another object.

These references are called interior pointers, and tracking them is a relatively expensive operation for the .NET runtime's garbage collector. As such, the runtime constrains these refs to only live on the stack, as it provides an implicit low limit on the number of interior pointers that might be in existence.

Span also has more advanced relatives, Memory<T> and ReadOnlyMemory<T>, which can wrap more than just arrays. But the article covers them only briefly 🙁


Here's something else I thought about. The GC updates the address behind such a reference dynamically, after the compaction phase. But, as we know, large objects end up in the LOH, where they are effectively never relocated. So if you change your test so that it works with small arrays, but with the garbage collector active in the background, perhaps it would show a more noticeable performance drop?
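
To make the idea concrete, a rough, untested sketch of such a test: a background thread keeps allocating for the whole run, so compacting collections happen while the span-based sum works over a small array that stays out of the LOH. The class name, the churn loop and the array size here are my guesses, not measured results.

using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

public class SumUnderGcPressureTest
{
    // Small array: stays in the normal heap generations, so it can be relocated.
    private static readonly int[] _numbers = Enumerable.Repeat(2, 1000).ToArray();

    private CancellationTokenSource _cts;
    private Task _churn;

    [GlobalSetup]
    public void StartChurn()
    {
        _cts = new CancellationTokenSource();
        // Background allocations so the GC keeps collecting and compacting during measurement.
        _churn = Task.Run(() =>
        {
            while (!_cts.Token.IsCancellationRequested)
                _ = new byte[8192];
        });
    }

    [GlobalCleanup]
    public void StopChurn()
    {
        _cts.Cancel();
        _churn.Wait();
    }

    [Benchmark]
    public int SumIntrinsics()
    {
        ReadOnlySpan<Vector256<int>> vectors = MemoryMarshal.Cast<int, Vector256<int>>(_numbers);
        Vector256<int> acc = Vector256<int>.Zero;
        for (int i = 0; i < vectors.Length; i++)
        {
            acc = Avx2.Add(acc, vectors[i]);
        }
        Vector128<int> r = Ssse3.HorizontalAdd(acc.GetUpper(), acc.GetLower());
        r = Ssse3.HorizontalAdd(r, r);
        r = Ssse3.HorizontalAdd(r, r);
        return r.GetElement(0);
    }
}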
