feat: support softmax operator on metax #1347

Open
LindseyMei wants to merge 1 commit into InfiniTensor:main from LindseyMei:feat/metax-softmax

@LindseyMei LindseyMei commented Jun 29, 2026

Summary

Add MetaX backend support for the softmax operator, following the same pattern as causal_softmax/metax and reusing softmax/cuda/kernel.cuh.

Changes

  • src/infiniop/ops/softmax/metax/softmax_metax.h: MetaX descriptor declaration.
  • src/infiniop/ops/softmax/metax/softmax_metax.maca: MetaX kernel launch logic, supporting F16/BF16/F32 with block/warp softmax paths.
  • src/infiniop/ops/softmax/operator.cc: Register MetaX backend in CREATE/GET/CALCULATE/DESTROY switches.

Validation

Tested on MetaX C500 (MACA 3.3.0.15):

python3 test/infiniop/softmax.py --metax

Result: Test passed! The run covers shapes (4,4), (12,16,512,512), and (1,16,512,512) across all axes, F16/F32, and in-place/out-of-place modes.
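For readers who want to sanity-check small cases by hand, here is a minimal pure-Python sketch of a numerically stable softmax along either axis of a 2-D input. It is only a reference for what the operator computes, not the code in test/infiniop/softmax.py or the kernel in this PR; the names `softmax_1d` and `softmax_2d` are hypothetical.

```python
import math


def softmax_1d(xs):
    """Numerically stable softmax of a flat list of floats."""
    m = max(xs)                              # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_2d(rows, axis):
    """Softmax of a list-of-lists along axis 0 (columns) or axis 1 (rows)."""
    if axis == 1:
        return [softmax_1d(r) for r in rows]
    if axis == 0:
        # Transpose, apply the row-wise version, transpose back.
        cols = [softmax_1d(list(c)) for c in zip(*rows)]
        return [list(r) for r in zip(*cols)]
    raise ValueError("axis must be 0 or 1")
```

The max and sum steps are the two reductions that a GPU implementation typically performs per row with a warp-level or block-level reduce.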

Performance

Because the built-in test/infiniop/softmax.py --profile reports implausibly small timings for softmax (~0.007 ms for 50M elements), throughput was measured with standalone scripts that explicitly call torch.cuda.synchronize() around each iteration.
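GPU kernel launches are asynchronous with respect to the host, so a host-side timer that does not synchronize may measure only launch overhead, which is one plausible explanation for the ~0.007 ms figure. The sketch below shows the general shape of a synchronized timing loop. It is not the standalone script used for the numbers in this PR; `time_synchronized`, `fn`, and `sync` are hypothetical names, and the device-specific parts are passed in as callables so the helper itself has no GPU dependency.

```python
import time
from statistics import median


def time_synchronized(fn, sync, warmup=10, iters=100):
    """Return the median wall-clock time of fn() in milliseconds.

    fn   -- callable that launches the work to be timed
    sync -- callable that blocks until the device is idle
            (assumption: torch.cuda.synchronize in a PyTorch setup)
    """
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    sync()

    samples = []
    for _ in range(iters):
        sync()                       # drain any pending work before starting the clock
        t0 = time.perf_counter()
        fn()
        sync()                       # wait for the launched work to actually finish
        samples.append((time.perf_counter() - t0) * 1e3)
    return median(samples)
```

In a PyTorch-based script this might be called as `time_synchronized(lambda: torch.softmax(x, dim=1), torch.cuda.synchronize)`, with the library call substituted for the lambda when timing the lib column.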

Softmax along the last axis (axis=1) on 2-D tensors:

| Shape | rows | dimsize | dtype | lib (ms) | torch (ms) | lib/torch | lib bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|
| (4096, 4096) | 4096 | 4096 | F16 | 0.274 | 0.065 | 4.21x | 245 |
| (4096, 4096) | 4096 | 4096 | F32 | 0.274 | 0.101 | 2.72x | 489 |
| (8192, 8192) | 8192 | 8192 | F16 | 0.809 | 0.192 | 4.21x | 332 |
| (8192, 8192) | 8192 | 8192 | F32 | 0.835 | 0.371 | 2.25x | 643 |
| (16384, 16384) | 16384 | 16384 | F16 | 2.747 | 0.733 | 3.75x | 391 |
| (16384, 16384) | 16384 | 16384 | F32 | 2.907 | 2.919 | 1.00x | 739 |
| (16384, 16384) | 16384 | 16384 | BF16 | 2.946 | 0.773 | 3.81x | 364 |

  • F16/BF16 is ~3.75-4.2x slower than torch in these runs, mainly because the generic elementwise template does not vectorize memory access, plus softmax reduction overhead.
  • F32 at the largest dimsize (16384) is on par with torch (2.907 ms vs 2.919 ms), which suggests the workload there is compute/reduction-bound rather than bandwidth-bound.
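The bandwidth figures in the table are consistent with counting one read and one write per element, i.e. 2 x numel x element size divided by the measured time. The PR does not state the formula, so treat the sketch below as an assumption that happens to reproduce the reported values; `effective_bandwidth_gbs` and `DTYPE_BYTES` are hypothetical names.

```python
# Bytes per element for the dtypes that appear in the table.
DTYPE_BYTES = {"F16": 2, "BF16": 2, "F32": 4}


def effective_bandwidth_gbs(shape, dtype, time_ms):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes each element is read once and written once.
    """
    numel = 1
    for d in shape:
        numel *= d
    bytes_moved = 2 * numel * DTYPE_BYTES[dtype]   # one read + one write
    return bytes_moved / (time_ms * 1e-3) / 1e9


# Example: the (4096, 4096) F16 row at 0.274 ms
print(round(effective_bandwidth_gbs((4096, 4096), "F16", 0.274)))   # 245
```

Under the same assumption, the (16384, 16384) F32 row at 2.907 ms comes out to about 739 GB/s, matching the table.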

Notes

  • No changes to cuda/kernel.cuh or nvidia backend.
  • Uses hccub/block/block_reduce.cuh for MACA and <cub/block/block_reduce.cuh> for MC path.
  • All modified files pass clang-format --dry-run --Werror.

Add MetaX backend for softmax operator, reusing cuda/kernel.cuh with hccub/cub block reduce headers.

Tested on MetaX C500: all shapes/axes/dtypes/inplace modes pass.

Signed-off-by: LindseyMei <648816901@qq.com>
@LindseyMei LindseyMei requested a review from a team June 29, 2026 08:40