Feat/selective offload on srelu fuser by lhb8125 · Pull Request #3047 · NVIDIA/TransformerEngine

lhb8125 · 2026-05-27T07:22:45Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-27T07:39:40Z

Greptile Summary

This PR adds selective CPU-offload control to the SReLU fuser path of _ForwardGroupedMLP_CuTeGEMMBase. When is_cpu_offload_enabled() returns true, weights and saved FC2 intermediate tensors are unconditionally kept resident on GPU, while the FC1 input activation and MoE activation tensors can optionally be kept resident via a per-op flag.

GroupedTensorStorage.get_data_tensors() is introduced to expose component tensors to the V1 offload path's mark_activation_offload helper.
grouped_fc2_x in the SFD kernel path is promoted from GroupedTensor to GroupedTensorStorage, and fc1_input_quantizer.internal = True is set unconditionally so both input branches produce GroupedTensorStorage.
mark_not_offload calls are wired around save_for_backward for weights, FC2 input, FC1 input, and MoE activation tensors, gated on new per-op no_offload_activation attributes.
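
As a rough illustration of the first bullet, the sketch below shows how a storage object might expose its component buffers so that an offload helper can tag each one. It does not reproduce the PR's code: `FakeTensor`, `GroupedStorageSketch`, its field names, `mark_activation_offload_sketch`, and the `activation_offloading` flag are all assumptions made for this example, and plain Python objects stand in for tensors so it runs without PyTorch.

```python
from dataclasses import dataclass
from typing import List, Optional


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, name: str) -> None:
        self.name = name


@dataclass
class GroupedStorageSketch:
    # Hypothetical field names; the real storage class may hold other buffers.
    rowwise_data: Optional[FakeTensor] = None
    columnwise_data: Optional[FakeTensor] = None
    scale_inv: Optional[FakeTensor] = None

    def get_data_tensors(self) -> List[FakeTensor]:
        """Return only the component buffers that are actually allocated."""
        candidates = (self.rowwise_data, self.columnwise_data, self.scale_inv)
        return [t for t in candidates if t is not None]


def mark_activation_offload_sketch(*storages: GroupedStorageSketch) -> None:
    """Tag every allocated component buffer as eligible for CPU offload."""
    for storage in storages:
        for tensor in storage.get_data_tensors():
            # Assumed flag name, used here purely for illustration.
            tensor.activation_offloading = True


storage = GroupedStorageSketch(
    rowwise_data=FakeTensor("rowwise"),
    scale_inv=FakeTensor("scale_inv"),
)
mark_activation_offload_sketch(storage)
marked = [t.name for t in storage.get_data_tensors()]
```

The point of the accessor is that the helper never needs to know which fields a storage class has; it only iterates over whatever the storage reports as allocated.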

Confidence Score: 3/5

Not ready to merge without clarification on two forward_grouped_mlp.py changes.

The no_offload_activation attribute controls whether critical tensors are kept on GPU, but it is never set anywhere in the codebase and has no public API, making the selective-offload opt-out path unreachable without undocumented monkey-patching. Separately, fc1_input_quantizer.internal = True is set unconditionally on every forward pass, changing grouped_fc1_x from a GroupedTensor to a GroupedTensorStorage regardless of whether offloading is active, which could silently break downstream consumers expecting a torch.Tensor subclass.

The file needing attention is transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py, specifically the no_offload_activation gating logic (lines 680-681) and the unconditional fc1_input_quantizer.internal = True assignment (line 302).
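
The reachability concern can be illustrated with a minimal sketch of the `getattr` pattern quoted in the diff. `OpSketch` and `keeps_activation_on_gpu` are hypothetical names for this example only; the lookup expression itself mirrors the PR's `bool(getattr(op, "no_offload_activation", False))`.

```python
class OpSketch:
    """Hypothetical op that does not define no_offload_activation."""


def keeps_activation_on_gpu(op) -> bool:
    # Same lookup shape as in the PR: a missing attribute silently means False.
    return bool(getattr(op, "no_offload_activation", False))


fc1_op = OpSketch()
default_flag = keeps_activation_on_gpu(fc1_op)

# The only way to flip the flag in this sketch is to attach the attribute by hand.
fc1_op.no_offload_activation = True
patched_flag = keeps_activation_on_gpu(fc1_op)
```

Because nothing in the class sets the attribute, the flag defaults to `False` for every op until a caller attaches it from outside, which is the undocumented monkey-patching the review describes.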

Important Files Changed

Filename	Overview
transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py	Adds selective offload logic: fc1_input_quantizer.internal is set unconditionally (not just during offloading), grouped_fc2_x is changed from GroupedTensor to GroupedTensorStorage in the SFD path, mark_not_offload is called for weights/activations, and a new no_offload_activation attribute controls per-tensor opt-out — but this attribute is never set or documented anywhere in the repo.
transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py	Adds get_data_tensors() that mirrors the field list in prepare_for_saving/restore_from_saved, enabling the V1 offload path to enumerate component tensors for marking.

Sequence Diagram

sequenceDiagram
    participant Caller as Megatron / Caller
    participant FwdMLP as _ForwardGroupedMLP_CuTeGEMMBase.forward
    participant Offload as cpu_offload (V1 or V2)
    participant Ctx as fc1_ctx / activation_ctx / fc2_ctx

    Caller->>FwdMLP: forward(input_, ...)
    FwdMLP->>FwdMLP: Build grouped_fc1_x (GroupedTensorStorage)
    FwdMLP->>FwdMLP: GEMM to activation_in, grouped_fc2_x (GroupedTensorStorage)
    FwdMLP->>Offload: is_cpu_offload_enabled()?
    alt "cpu_offloading = True"
        FwdMLP->>Offload: mark_not_offload fc1_weight_tensors
        opt "no_offload_fc1_activation=True"
            FwdMLP->>Offload: mark_not_offload grouped_fc1_x
        end
        FwdMLP->>Ctx: fc1_ctx.save_for_backward
        opt "no_offload_moe_activation=True"
            FwdMLP->>Offload: mark_not_offload activation_in scales
        end
        FwdMLP->>Ctx: activation_ctx.save_for_backward
        opt saved_grouped_fc2_x not None
            FwdMLP->>Offload: mark_not_offload saved_grouped_fc2_x
        end
        FwdMLP->>Offload: mark_not_offload fc2_weight_tensors
        FwdMLP->>Ctx: fc2_ctx.save_for_backward
    end
    FwdMLP-->>Caller: fc2_out
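
The marking decisions in the diagram can be restated as a small pure function. This is a sketch of the control flow only, assuming the branch structure shown above; `plan_not_offload` and its string labels are illustrative and do not appear in the PR.

```python
from typing import List


def plan_not_offload(
    cpu_offloading: bool,
    no_offload_fc1_activation: bool,
    no_offload_moe_activation: bool,
    has_saved_fc2_x: bool,
) -> List[str]:
    """Return, in order, the tensors that would be marked as not-offloadable."""
    kept: List[str] = []
    if not cpu_offloading:
        # No marking happens when CPU offload is disabled.
        return kept
    kept.append("fc1_weight_tensors")  # weights always stay on GPU
    if no_offload_fc1_activation:
        kept.append("grouped_fc1_x")
    if no_offload_moe_activation:
        kept.append("activation_in scales")
    if has_saved_fc2_x:
        kept.append("saved_grouped_fc2_x")
    kept.append("fc2_weight_tensors")  # weights always stay on GPU
    return kept


default_plan = plan_not_offload(True, False, False, True)
full_plan = plan_not_offload(True, True, True, True)
disabled_plan = plan_not_offload(False, True, True, True)
```

With both per-op flags left at their defaults, only the weights and the saved FC2 input are pinned, which matches the unconditional behaviour described in the summary.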

timmoon10

Overall looks good, but with one design suggestion.

Follow-up tasks after merging this PR:

Enable activation checkpointing in the unfused grouped linear op.
Update activation checkpointing to support v2 infrastructure from #1762, which is opt-out rather than opt-in.

lhb8125 · 2026-06-02T11:01:20Z

/te-ci pytorch L1

timmoon10 · 2026-06-03T17:24:08Z

+            no_offload_fc1_activation = bool(getattr(fc1_op, "no_offload_activation", False))
+            no_offload_moe_activation = bool(getattr(activation_op, "no_offload_activation", False))


Fine, but we should avoid relying on these in Mcore. They should be considered as debugging tools and we don't make guarantees to maintain them.

If we do want to control this in Mcore, we should do it through proper, user-facing APIs. We can add it as an arg to the unfused ops, but we should also make sure the unfused ops actually respect it too.
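
One possible shape for such a user-facing API is sketched below. It is only an illustration of the suggestion, not an existing Transformer Engine interface: `BasicOpSketch`, `tensors_to_pin`, and `fused_tensors_to_pin` are hypothetical names, and strings stand in for tensors.

```python
from typing import List


class BasicOpSketch:
    """Hypothetical stand-in for an unfused op; not the Transformer Engine class."""

    def __init__(self, *, no_offload_activation: bool = False) -> None:
        # Explicit, documented keyword argument instead of an ad hoc attribute.
        self.no_offload_activation = bool(no_offload_activation)

    def tensors_to_pin(self, saved_activation: str) -> List[str]:
        """Unfused path: return the saved tensors that must stay on GPU."""
        return [saved_activation] if self.no_offload_activation else []


def fused_tensors_to_pin(fc1_op: BasicOpSketch, activation_op: BasicOpSketch) -> List[str]:
    """Fused path: read the same documented attribute, with no getattr fallback."""
    pinned: List[str] = []
    if fc1_op.no_offload_activation:
        pinned.append("grouped_fc1_x")
    if activation_op.no_offload_activation:
        pinned.append("activation_in")
    return pinned


fc1 = BasicOpSketch(no_offload_activation=True)
act = BasicOpSketch()
fused_pinned = fused_tensors_to_pin(fc1, act)
unfused_pinned = fc1.tensors_to_pin("fc1_input")
```

Declaring the flag in the constructor means the fused and unfused paths consult the same attribute, so the setting cannot be honoured in one path and ignored in the other.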

timmoon10 · 2026-06-03T17:25:32Z

            and isinstance(input_quantizer, NVFP4Quantizer)
        ):
-            grouped_fc1_x = input_
+            grouped_fc1_x = GroupedTensorStorage(


Now that we set the input quantizer with .internal = True, isn't it redundant to repack grouped_fc1_x into a GroupedTensorStorage?

lhb8125 requested review from ksivaman and timmoon10 as code owners May 27, 2026 07:22

github-actions Bot added the community-contribution label May 27, 2026

lhb8125 force-pushed the feat/selective-offload-on-srelu-fuser branch from dba3531 to 53e6511 May 27, 2026 07:26

greptile-apps Bot reviewed May 27, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

lhb8125 force-pushed the feat/selective-offload-on-srelu-fuser branch from 2c59510 to 6e01d0a May 27, 2026 07:49

timmoon10 reviewed May 28, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

fanshiqing added a commit to fanshiqing/TransformerEngine that referenced this pull request Jun 2, 2026

selective offload on srelu fuser

1d6bf2a

cherry-pick NVIDIA#3047

lhb8125 force-pushed the feat/selective-offload-on-srelu-fuser branch from 6f5ef0a to c14deb7 June 2, 2026 10:58

greptile-apps Bot reviewed Jun 2, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Support selective offload for fused grouped MLP

f25a1f5

Signed-off-by: hongbinl <hongbinl@nvidia.com>

lhb8125 force-pushed the feat/selective-offload-on-srelu-fuser branch from c14deb7 to f25a1f5 June 2, 2026 11:09

timmoon10 reviewed Jun 3, 2026

lhb8125 wants to merge 1 commit into NVIDIA:main from lhb8125:feat/selective-offload-on-srelu-fuser
