Optimize Enumerable Min/Max final reduction with shuffles #127995

Open
EgorBo wants to merge 3 commits into dotnet:main from EgorBo:ai/linq-minmax-shuffle-reduction

@EgorBo EgorBo commented May 9, 2026

Note

This PR was filed by AI (GitHub Copilot CLI).

Inspired by a similar helper in Tensors (TensorPrimitives.HorizontalAggregate). I guess we don't want to consume Tensors in Linq due to IL size concerns. Sharing this function via a source file doesn't look beneficial either, because the two call sites use different generics. Maybe Vector128 will get this as a public API at some point.

Benchmark

[Params(16, 64)]
public int Length;

private byte[]   _bytes;
private sbyte[]  _sbytes;
private short[]  _shorts;
private ushort[] _ushorts;
private int[]    _ints;
private uint[]   _uints;
private long[]   _longs;
private ulong[]  _ulongs;

[Benchmark] public byte   MaxByte()   => _bytes.Max();
[Benchmark] public sbyte  MaxSByte()  => _sbytes.Max();
[Benchmark] public short  MaxShort()  => _shorts.Max();
[Benchmark] public ushort MaxUShort() => _ushorts.Max();
[Benchmark] public int    MaxInt()    => _ints.Max();
[Benchmark] public uint   MaxUInt()   => _uints.Max();
[Benchmark] public long   MaxLong()   => _longs.Max();
[Benchmark] public ulong  MaxULong()  => _ulongs.Max();

Full benchmark request and raw output: EgorBot/Benchmarks#198

Results

Speed-up = main time / PR time (higher is better). Numbers below are from EgorBot.

ARM64 — Neoverse-N2 (ubuntu24_azure_cobalt100)

| Method    | Length | main     | PR       | Speed-up |
|-----------|--------|----------|----------|----------|
| MaxByte   | 16     | 51.54 ns | 1.85 ns  | 27.85×   |
| MaxSByte  | 16     | 49.58 ns | 1.90 ns  | 26.16×   |
| MaxShort  | 16     | 23.66 ns | 0.98 ns  | 24.16×   |
| MaxUShort | 16     | 21.82 ns | 1.22 ns  | 17.93×   |
| MaxInt    | 16     | 1.72 ns  | 1.84 ns  | 0.93×    |
| MaxUInt   | 16     | 1.93 ns  | 1.74 ns  | 1.11×    |
| MaxLong   | 16     | 3.77 ns  | 3.91 ns  | 0.96×    |
| MaxULong  | 16     | 3.81 ns  | 3.93 ns  | 0.97×    |
| MaxByte   | 64     | 49.64 ns | 2.03 ns  | 24.51×   |
| MaxSByte  | 64     | 48.76 ns | 2.04 ns  | 23.95×   |
| MaxShort  | 64     | 22.35 ns | 3.53 ns  | 6.33×    |
| MaxUShort | 64     | 22.33 ns | 3.36 ns  | 6.64×    |
| MaxInt    | 64     | 6.33 ns  | 6.21 ns  | 1.02×    |
| MaxUInt   | 64     | 6.17 ns  | 6.51 ns  | 0.95×    |
| MaxLong   | 64     | 23.55 ns | 23.51 ns | 1.00×    |
| MaxULong  | 64     | 23.36 ns | 23.44 ns | 1.00×    |

Intel — Emerald Rapids / AVX-512 (ubuntu24_azure_emeraldrapids)

| Method    | Length | main     | PR      | Speed-up |
|-----------|--------|----------|---------|----------|
| MaxByte   | 16     | 8.88 ns  | 0.93 ns | 9.54×    |
| MaxSByte  | 16     | 9.86 ns  | 0.89 ns | 11.10×   |
| MaxShort  | 16     | 4.42 ns  | 1.55 ns | 2.87×    |
| MaxUShort | 16     | 4.45 ns  | 1.02 ns | 4.38×    |
| MaxInt    | 16     | 1.66 ns  | 0.88 ns | 1.89×    |
| MaxUInt   | 16     | 1.81 ns  | 1.50 ns | 1.20×    |
| MaxLong   | 16     | 0.96 ns  | 0.89 ns | 1.08×    |
| MaxULong  | 16     | 1.00 ns  | 0.94 ns | 1.06×    |
| MaxByte   | 64     | 16.31 ns | 1.44 ns | 11.39×   |
| MaxSByte  | 64     | 10.03 ns | 1.59 ns | 6.33×    |
| MaxShort  | 64     | 6.52 ns  | 1.37 ns | 4.77×    |
| MaxUShort | 64     | 6.54 ns  | 1.54 ns | 4.26×    |
| MaxInt    | 64     | 2.44 ns  | 1.68 ns | 1.47×    |
| MaxUInt   | 64     | 2.84 ns  | 1.94 ns | 1.47×    |
| MaxLong   | 64     | 2.72 ns  | 3.17 ns | 0.86×    |
| MaxULong  | 64     | 3.17 ns  | 2.97 ns | 1.07×    |

AMD — EPYC 9V45 (Turin) / AVX-512 (ubuntu24_azure_turin)

| Method    | Length | main    | PR      | Speed-up |
|-----------|--------|---------|---------|----------|
| MaxByte   | 16     | 4.41 ns | 0.70 ns | 6.27×    |
| MaxSByte  | 16     | 4.96 ns | 0.83 ns | 5.96×    |
| MaxShort  | 16     | 3.01 ns | 0.87 ns | 3.45×    |
| MaxUShort | 16     | 2.09 ns | 0.89 ns | 2.35×    |
| MaxInt    | 16     | 1.48 ns | 0.98 ns | 1.52×    |
| MaxUInt   | 16     | 1.64 ns | 0.98 ns | 1.68×    |
| MaxLong   | 16     | 0.94 ns | 0.91 ns | 1.03×    |
| MaxULong  | 16     | 1.05 ns | 0.91 ns | 1.16×    |
| MaxByte   | 64     | 5.86 ns | 1.27 ns | 4.61×    |
| MaxSByte  | 64     | 7.57 ns | 1.28 ns | 5.93×    |
| MaxShort  | 64     | 2.78 ns | 1.39 ns | 2.01×    |
| MaxUShort | 64     | 2.96 ns | 1.13 ns | 2.61×    |
| MaxInt    | 64     | 2.27 ns | 1.61 ns | 1.41×    |
| MaxUInt   | 64     | 1.97 ns | 1.62 ns | 1.22×    |
| MaxLong   | 64     | 3.02 ns | 3.00 ns | 1.01×    |
| MaxULong  | 64     | 2.87 ns | 2.10 ns | 1.37×    |

Replaces the scalar per-element reduction loop on the final 128-bit accumulator with a shuffle-based tree reduction (log2(Vector128<T>.Count) compares), introducing a HorizontalMinMax helper modeled after TensorPrimitives.HorizontalAggregate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
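
To make the shuffle-based tree reduction from the commit message concrete, here is a minimal sketch for the four-lane `Vector128<int>` case. It is not the PR's `HorizontalMinMax`: the class and method names are hypothetical, it handles only `Max` on `int`, and it assumes .NET 7 or later, where `Vector128.Shuffle` and `Vector128.Max` are available.

```csharp
using System.Runtime.Intrinsics;

public static class HorizontalReductionSketch
{
    // Hypothetical helper (illustration only, not the PR's code): reduces the
    // four int lanes of a Vector128<int> to their maximum in log2(4) = 2 steps.
    public static int HorizontalMax(Vector128<int> v)
    {
        // Step 1: pair lane i with lane i ^ 2 (swap the 64-bit halves).
        v = Vector128.Max(v, Vector128.Shuffle(v, Vector128.Create(2, 3, 0, 1)));

        // Step 2: pair lane i with lane i ^ 1 (swap neighbouring lanes).
        v = Vector128.Max(v, Vector128.Shuffle(v, Vector128.Create(1, 0, 3, 2)));

        // Every lane now holds the maximum, so lane 0 is the result.
        return v.ToScalar();
    }
}
```

For example, `HorizontalReductionSketch.HorizontalMax(Vector128.Create(7, -3, 42, 19))` evaluates to 42: the first step yields `[42, 19, 42, 19]` and the second yields `[42, 42, 42, 42]`.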
Copilot AI review requested due to automatic review settings May 9, 2026 16:46
@dotnet-policy-service

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

EgorBo commented May 9, 2026

@MihuBot

EgorBo commented May 9, 2026

Note

Benchmark generated by AI (GitHub Copilot CLI).

@EgorBot -linux_amd -linux_intel -linux_arm64

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    [Params(16, 64)]
    public int Length;

    private byte[]   _bytes   = default!;
    private sbyte[]  _sbytes  = default!;
    private short[]  _shorts  = default!;
    private ushort[] _ushorts = default!;
    private int[]    _ints    = default!;
    private uint[]   _uints   = default!;
    private long[]   _longs   = default!;
    private ulong[]  _ulongs  = default!;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);

        _bytes = new byte[Length];
        rng.NextBytes(_bytes);

        _sbytes = new sbyte[Length];
        for (int i = 0; i < Length; i++) _sbytes[i] = (sbyte)rng.Next(sbyte.MinValue, sbyte.MaxValue + 1);

        _shorts = new short[Length];
        for (int i = 0; i < Length; i++) _shorts[i] = (short)rng.Next(short.MinValue, short.MaxValue + 1);

        _ushorts = new ushort[Length];
        for (int i = 0; i < Length; i++) _ushorts[i] = (ushort)rng.Next(0, ushort.MaxValue + 1);

        _ints = new int[Length];
        for (int i = 0; i < Length; i++) _ints[i] = rng.Next();

        _uints = new uint[Length];
        for (int i = 0; i < Length; i++) _uints[i] = (uint)rng.Next();

        _longs = new long[Length];
        for (int i = 0; i < Length; i++) _longs[i] = rng.NextInt64();

        _ulongs = new ulong[Length];
        for (int i = 0; i < Length; i++) _ulongs[i] = (ulong)rng.NextInt64();
    }

    [Benchmark] public byte   MaxByte()   => _bytes.Max();
    [Benchmark] public sbyte  MaxSByte()  => _sbytes.Max();
    [Benchmark] public short  MaxShort()  => _shorts.Max();
    [Benchmark] public ushort MaxUShort() => _ushorts.Max();
    [Benchmark] public int    MaxInt()    => _ints.Max();
    [Benchmark] public uint   MaxUInt()   => _uints.Max();
    [Benchmark] public long   MaxLong()   => _longs.Max();
    [Benchmark] public ulong  MaxULong()  => _ulongs.Max();
}

Copilot AI left a comment

Pull request overview

This PR changes the final reduction step in System.Linq’s vectorized integer Min/Max implementation to reduce a Vector128<T> accumulator down to a single scalar using shuffle-based horizontal reduction instead of a scalar per-lane loop.

Changes:

  • Replaced the per-element for loop reduction of best128 with a helper (HorizontalMinMax) that performs log2(lane-count) reductions using Vector128.Shuffle.
  • Added HorizontalMinMax<T, TMinMax> to centralize the reduction logic for different element sizes (byte/short/int/long lane counts).
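
The difference between the two approaches listed above can be sketched for the 16-lane `byte` case. This is an illustration under stated assumptions, not the code in MaxMin.cs: `LaneCountSketch`, `MaxScalarLoop`, and `MaxShuffleTree` are hypothetical names, only `Max` on `byte` is covered, and the shuffle indices are derived by XOR-ing lane numbers with a stride rather than copied from the PR.

```csharp
using System.Runtime.Intrinsics;

public static class LaneCountSketch
{
    // Baseline in the spirit of the replaced loop: one scalar compare per lane.
    public static byte MaxScalarLoop(Vector128<byte> best128)
    {
        byte value = best128[0];
        for (int i = 1; i < Vector128<byte>.Count; i++)
        {
            if (best128[i] > value)
            {
                value = best128[i];
            }
        }
        return value;
    }

    // Shuffle tree: log2(16) = 4 vector Max operations for byte lanes.
    public static byte MaxShuffleTree(Vector128<byte> best128)
    {
        Vector128<byte> lanes = Vector128.Create(
            (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

        // Strides 8, 4, 2, 1: lane i is paired with lane i ^ stride.
        for (int stride = Vector128<byte>.Count / 2; stride > 0; stride >>= 1)
        {
            Vector128<byte> partner = lanes ^ Vector128.Create((byte)stride);
            best128 = Vector128.Max(best128, Vector128.Shuffle(best128, partner));
        }

        // Every lane now holds the maximum; read lane 0.
        return best128.ToScalar();
    }
}
```

The number of iterations in `MaxShuffleTree` depends only on the lane count, so wider elements need fewer steps (4 for `byte`, 3 for `short`, 2 for `int`, 1 for `long`), which is consistent with the largest speed-ups in the benchmark tables appearing for the narrowest element types.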

Comment thread src/libraries/System.Linq/src/System/Linq/MaxMin.cs
Comment thread src/libraries/System.Linq/src/System/Linq/MaxMin.cs Outdated
Copilot AI review requested due to automatic review settings May 9, 2026 20:45
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
