increasing precision tolerance by francesco-bertolotti · Pull Request #3060 · NVIDIA/TransformerEngine

francesco-bertolotti · 2026-05-29T17:22:11Z

Here’s a cleaner Markdown rewrite of your PR description:

Description

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing 1e-2 tolerance.

This change slightly relaxes the fp16 tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.

Fixes: N/A (spurious test_kv_cache failure on sm80 / fp16 / head_dim=256)

Type of change

Checklist

I have read and followed the contributing guidelines (https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (N/A — test-only change)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works (N/A — test-only tolerance adjustment)
New and existing unit tests pass locally with my changes

Notes

Contributing guidelines and full test runs still need to be verified before final submission.
bfloat16 tolerance in this branch is unchanged, as it was not failing.
The Pyright import warnings are unrelated noise from missing stubs in the local environment.
Potential reviewer feedback: this could be further refined to apply the tolerance only conditionally by compute capability (e.g., sm < 90), rather than broadly for the branch.

Diff summary

# tests/pytorch/attention/test_kv_cache.py (get_tols)
- torch.half: (1e-2, 1e-2),
+ torch.half: (1.5e-2, 1.5e-2),

# plus explanatory comment added

greptile-apps · 2026-05-29T17:23:39Z

Greptile Summary

This PR relaxes the torch.half numerical tolerance in get_tols (inside test_kv_cache.py) from 1e-2 to 1.5e-2 for the TransformerLayer + head_dim > 128 + fused/flash-attention path, to fix a spurious failure observed on sm80 (Ampere) with older cuDNN. No runtime or library code is modified.

Tolerance bump is narrow and well-targeted: it only affects the TransformerLayer branch where head_dim > 128 and the backend is not UnfusedAttention; all other tolerances (bfloat16, DotProductAttention, head_dim <= 128) are untouched.
A clear inline comment explains the root cause (different kernels and masking strategies between full-sequence and incremental-KV-cache paths leading to accumulated fp16 rounding on Ampere).

Confidence Score: 5/5

Test-only change with no impact on library or runtime code; safe to merge.

The change touches exactly one tolerance constant in one branch of a test helper, is accompanied by a clear comment explaining the hardware-specific divergence, and does not affect any production path.

No files require special attention.

Important Files Changed

Filename	Overview
tests/pytorch/attention/test_kv_cache.py	Bumps torch.half atol/rtol from 1e-2 to 1.5e-2 in the TransformerLayer / head_dim>128 / non-UnfusedAttention branch of get_tols, with an explanatory comment about sm80 + older cuDNN divergence.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_tols called] --> B{module == TransformerLayer?}
    B -- yes --> C{head_dim_qk <= 128?}
    B -- no --> G{module == DotProductAttention?}
    C -- yes --> D["tols: half=(5e-3,5e-3)\nbf16=(3.5e-2,3.5e-2)"]
    C -- no --> E{backend == UnfusedAttention?}
    E -- yes --> F["tols: half=(1.6e-2,1.6e-2)\nbf16=(1.2e-1,1e-1)"]
    E -- no --> H["tols: half=(1.5e-2,1.5e-2) ← changed\nbf16=(8e-2,7e-2)"]
    G -- yes --> I["tols: half=(1e-3,1e-3)\nbf16=(1e-2,1e-3)\nfp8=(2e-2,3e-2)"]
    D --> J[return tols at dtype]
    F --> J
    H --> J
    I --> J

_{Reviews (3): Last reviewed commit: "Merge branch 'main' into f14-increased-t..." | Re-trigger Greptile}

cyanguwa · 2026-05-29T19:59:15Z

/te-ci pytorch L0

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

cyanguwa · 2026-06-01T14:21:42Z

/te-ci pytorch L0

cyanguwa · 2026-06-01T14:34:03Z

Hi @francesco-bertolotti, thanks for the contribution. The rational behind the PR makes sense, and indeed we have seen sporadic single-element mismatch failures for test_kv_cache.py, but I think as of last night, the nightly CI is passing with the existing tolerances, on all architectures (Ampere, Hopper and Blackwell). Were you observing some failures? What's the architecture and cuDNN version you're running with? Thanks!

francesco-bertolotti · 2026-06-01T15:25:58Z

Were you observing some failures?

At the moment, I'm mainly interested in the ring attention kernel, which I plan to integrate into a fork of TorchTitan. However, since I need to build Transformer Engine to get it working properly on the Leonardo HPC system, I'd like to first reach a state where the full test suite passes.

What's the architecture and cuDNN version you're running with?

I'm currently building against the cuDNN version that ships with the torch+cu126 wheels, which is cuDNN 9.10.2.

One thing I haven't tried yet is building against a newer cuDNN release. My concern is whether that could lead to issues if PyTorch is using cuDNN 9.10 while Transformer Engine is built against a different version.

timmoon10 · 2026-06-01T20:22:10Z

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

Wait, if it's true that the masking is different, why do we expect the full-sequence and KV-cache paths to match? Is it just a representation difference, or are they computing something actually different? If they are computing something different, then that's the true bug. Increasing tols to make two unlike things match is the same as not testing numerics at all.

ptrendx · 2026-06-01T21:21:10Z

@timmoon10 It is expected that the padding masks will be different - the full non-cached execution uses the full triangle (so the causal mask) and the cached execution works on 1 row at a time, so bottom right causal padding is correct (It could maybe even be None, depending on the shapes provided to the kernel).

cyanguwa · 2026-06-03T00:37:08Z

One thing I haven't tried yet is building against a newer cuDNN release. My concern is whether that could lead to issues if PyTorch is using cuDNN 9.10 while Transformer Engine is built against a different version.

I think that's ok - you can build TE against a different version of cuDNN than PyTorch. If it's in the system path, TE should be able to pick it up automatically; if not, you can specify the path with CUDNN_PATH.

With 9.10.2, I believe we did test it with TE 2.5 and the CI was clean too, but now we're on TE 2.16 and cuDNN 9.23, and I don't think we've gone back to those older versions for testing. Could you please try 9.23 and see if there's any tolerance issues? I don't have a problem with the changes in this PR, but just wanted to make sure we're not making any unnecessary changes :). Thanks.

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 29, 2026

cyanguwa self-requested a review May 29, 2026 19:58

increasing precision tolerance

2c76593

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

francesco-bertolotti force-pushed the f14-increased-tolerance branch from 3695fe4 to 2c76593 Compare May 30, 2026 10:19

Merge branch 'main' into f14-increased-tolerance

ba15735

ptrendx approved these changes Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

increasing precision tolerance#3060

increasing precision tolerance#3060
francesco-bertolotti wants to merge 2 commits into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance

francesco-bertolotti commented May 29, 2026

Uh oh!

greptile-apps Bot commented May 29, 2026 •

edited

Loading

Uh oh!

cyanguwa commented May 29, 2026

Uh oh!

cyanguwa commented Jun 1, 2026

Uh oh!

cyanguwa commented Jun 1, 2026 •

edited

Loading

Uh oh!

francesco-bertolotti commented Jun 1, 2026

Uh oh!

timmoon10 commented Jun 1, 2026 •

edited

Loading

Uh oh!

ptrendx commented Jun 1, 2026

Uh oh!

cyanguwa commented Jun 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

francesco-bertolotti commented May 29, 2026

Description

Type of change

Checklist

Notes

Diff summary

Uh oh!

greptile-apps Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

cyanguwa commented May 29, 2026

Uh oh!

cyanguwa commented Jun 1, 2026

Uh oh!

cyanguwa commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

francesco-bertolotti commented Jun 1, 2026

Uh oh!

timmoon10 commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ptrendx commented Jun 1, 2026

Uh oh!

cyanguwa commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps Bot commented May 29, 2026 •

edited

Loading

cyanguwa commented Jun 1, 2026 •

edited

Loading

timmoon10 commented Jun 1, 2026 •

edited

Loading

cyanguwa commented Jun 3, 2026 •

edited

Loading