feat: support softmax operator on metax by LindseyMei · Pull Request #1347 · InfiniTensor/InfiniCore

LindseyMei · 2026-06-29T08:40:33Z

Summary

Add MetaX backend support for the softmax operator, following the same pattern as causal_softmax/metax and reusing softmax/cuda/kernel.cuh.

Changes

src/infiniop/ops/softmax/metax/softmax_metax.h: MetaX descriptor declaration.
src/infiniop/ops/softmax/metax/softmax_metax.maca: MetaX kernel launch logic, supporting F16/BF16/F32 with block/warp softmax paths.
src/infiniop/ops/softmax/operator.cc: Register MetaX backend in CREATE/GET/CALCULATE/DESTROY switches.
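
For readers unfamiliar with the operator, the following is a minimal pure-Python sketch of what a softmax computes per row: a max reduction, a sum reduction, then a normalization. It is only an illustration of the math and does not reproduce the contents of softmax/cuda/kernel.cuh or the block/warp paths; the function name `softmax_row` is hypothetical.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one row (illustrative, not the kernel)."""
    row_max = max(row)                           # reduction 1: row maximum
    exps = [math.exp(v - row_max) for v in row]  # shift by the max to avoid overflow
    total = sum(exps)                            # reduction 2: sum of exponentials
    return [e / total for e in exps]             # normalize so the row sums to 1
```

On a GPU, the two reductions are the parts that would typically be handled by a block-level or warp-level reduce, depending on the row length.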

Validation

Tested on MetaX C500 (MACA 3.3.0.15):

python3 test/infiniop/softmax.py --metax

Result: Test passed! The run covers shapes (4,4), (12,16,512,512), and (1,16,512,512) across all axes, F16/F32, and inplace/out-of-place modes.

Performance

Because the built-in test/infiniop/softmax.py --profile reports implausibly small timings for softmax (~0.007 ms for 50M elements), throughput was measured with standalone scripts that explicitly call torch.cuda.synchronize() around each iteration.
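
The standalone scripts are not included in this PR; the sketch below shows one way such a synchronized timing loop could look. The helper name `time_op` and its parameters are assumptions for illustration. The device synchronization function is passed in as `sync`, so the block itself has no torch dependency.

```python
import time


def time_op(fn, sync, iters=100, warmup=10):
    """Return mean wall-clock milliseconds per call of fn (hypothetical helper).

    fn   -- zero-argument callable that launches the operator
    sync -- zero-argument callable that blocks until the device is idle
    """
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    sync()
    samples = []
    for _ in range(iters):
        sync()                       # drain pending work before starting the clock
        start = time.perf_counter()
        fn()
        sync()                       # wait for the launched work to finish
        samples.append((time.perf_counter() - start) * 1e3)
    return sum(samples) / len(samples)
```

With torch available, a call might look like `time_op(lambda: torch.softmax(x, dim=1), torch.cuda.synchronize)`. Without the second `sync()`, the timer would measure only the asynchronous launch, which could explain implausibly small timings like the one quoted above.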

Softmax along the last axis (axis=1) on 2-D tensors:

Shape	rows	dimsize	dtype	lib (ms)	torch (ms)	lib/torch	lib bandwidth (GB/s)
(4096, 4096)	4096	4096	F16	0.274	0.065	4.21x	245
(4096, 4096)	4096	4096	F32	0.274	0.101	2.72x	489
(8192, 8192)	8192	8192	F16	0.809	0.192	4.21x	332
(8192, 8192)	8192	8192	F32	0.835	0.371	2.25x	643
(16384, 16384)	16384	16384	F16	2.747	0.733	3.75x	391
(16384, 16384)	16384	16384	F32	2.907	2.919	1.00x	739
(16384, 16384)	16384	16384	BF16	2.946	0.773	3.81x	364

F16/BF16 is ~3.7-4.2x slower than torch in the runs above (mainly due to the generic elementwise template not vectorizing memory access, plus softmax reduction overhead).
F32 at the largest dimsize (16384) is on par with torch (1.00x), indicating the workload is compute/reduction-bound rather than bandwidth-bound.
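
The PR does not state how the bandwidth column was derived. The figures appear consistent with counting one read and one write per element, and the sketch below reproduces them under that assumption. Both function names are hypothetical.

```python
def effective_bandwidth_gbs(rows, dimsize, elem_bytes, time_ms):
    """Effective bandwidth in GB/s, assuming one read and one write per element."""
    bytes_moved = 2 * rows * dimsize * elem_bytes   # input read + output write
    return bytes_moved / (time_ms * 1e-3) / 1e9


def slowdown(lib_ms, torch_ms):
    """Ratio reported in the lib/torch column (values above 1 mean lib is slower)."""
    return lib_ms / torch_ms


# (4096, 4096) F16 row from the table: 2 bytes per element, 0.274 ms
print(round(effective_bandwidth_gbs(4096, 4096, 2, 0.274)))  # 245
```

Under this formula the F32 rows move twice as many bytes as the F16 rows of the same shape, which is why similar lib timings yield roughly double the reported bandwidth.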

Notes

No changes to cuda/kernel.cuh or nvidia backend.
Uses hccub/block/block_reduce.cuh for the MACA path and <cub/block/block_reduce.cuh> for the MC path.
All modified files pass clang-format --dry-run --Werror.

Add MetaX backend for softmax operator, reusing cuda/kernel.cuh with hccub/cub block reduce headers. Tested on MetaX C500: all shapes/axes/dtypes/inplace modes pass. Signed-off-by: LindseyMei <648816901@qq.com>

feat: support softmax operator on metax

dd1824a

LindseyMei requested a review from a team June 29, 2026 08:40

LindseyMei wants to merge 1 commit into InfiniTensor:main from LindseyMei:feat/metax-softmax
