Feat/selective offload on srelu fuser#3047
Conversation
dba3531 to
53e6511
Compare
Greptile SummaryThis PR adds selective CPU-offload control to the SReLU fuser path of
Confidence Score: 3/5Not ready to merge without clarification on two forward_grouped_mlp.py changes. The no_offload_activation attribute controls whether critical tensors are kept on GPU, but it is never set anywhere in the codebase and has no public API, making the selective-offload opt-out path unreachable without undocumented monkey-patching. Separately, fc1_input_quantizer.internal = True is set unconditionally on every forward pass, changing grouped_fc1_x from a GroupedTensor to a GroupedTensorStorage regardless of whether offloading is active, which could silently break downstream consumers expecting a torch.Tensor subclass. transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py — specifically the no_offload_activation gating logic (lines 680-681) and the unconditional fc1_input_quantizer.internal = True assignment (line 302). Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller as Megatron / Caller
participant FwdMLP as _ForwardGroupedMLP_CuTeGEMMBase.forward
participant Offload as cpu_offload (V1 or V2)
participant Ctx as fc1_ctx / activation_ctx / fc2_ctx
Caller->>FwdMLP: forward(input_, ...)
FwdMLP->>FwdMLP: Build grouped_fc1_x (GroupedTensorStorage)
FwdMLP->>FwdMLP: GEMM to activation_in, grouped_fc2_x (GroupedTensorStorage)
FwdMLP->>Offload: is_cpu_offload_enabled()?
alt "cpu_offloading = True"
FwdMLP->>Offload: mark_not_offload fc1_weight_tensors
opt "no_offload_fc1_activation=True"
FwdMLP->>Offload: mark_not_offload grouped_fc1_x
end
FwdMLP->>Ctx: fc1_ctx.save_for_backward
opt "no_offload_moe_activation=True"
FwdMLP->>Offload: mark_not_offload activation_in scales
end
FwdMLP->>Ctx: activation_ctx.save_for_backward
opt saved_grouped_fc2_x not None
FwdMLP->>Offload: mark_not_offload saved_grouped_fc2_x
end
FwdMLP->>Offload: mark_not_offload fc2_weight_tensors
FwdMLP->>Ctx: fc2_ctx.save_for_backward
end
FwdMLP-->>Caller: fc2_out
Reviews (9): Last reviewed commit: "Support selective offload for fused grou..." | Re-trigger Greptile |
2c59510 to
6e01d0a
Compare
There was a problem hiding this comment.
Overall looks good, but with one design suggestion.
Followup tasks after merging this PR:
- Enable activation checkpointing in the unfused grouped linear op.
- Update activation checkpointing to support v2 infrastructure from #1762, which is opt-out rather than opt-in.
cherry-pick NVIDIA#3047
6f5ef0a to
c14deb7
Compare
|
/te-ci pytorch L1 |
Signed-off-by: hongbinl <hongbinl@nvidia.com>
c14deb7 to
f25a1f5
Compare
| no_offload_fc1_activation = bool(getattr(fc1_op, "no_offload_activation", False)) | ||
| no_offload_moe_activation = bool(getattr(activation_op, "no_offload_activation", False)) |
There was a problem hiding this comment.
Fine, but we should avoid relying on these in Mcore. They should be considered as debugging tools and we don't make guarantees to maintain them.
If we do want to control this in Mcore, we should do it through proper, user-facing APIs. We can add it as an arg to the unfused ops, but we should also make sure the unfused ops actually respect it too.
| and isinstance(input_quantizer, NVFP4Quantizer) | ||
| ): | ||
| grouped_fc1_x = input_ | ||
| grouped_fc1_x = GroupedTensorStorage( |
There was a problem hiding this comment.
Now that we set the input quantizer with .internal = True, isn't it redundant to repack grouped_fc1_x into a GroupedTensorStorage?
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: