Skip to content

increasing precision tolerance#3060

Open
francesco-bertolotti wants to merge 2 commits into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance
Open

increasing precision tolerance#3060
francesco-bertolotti wants to merge 2 commits into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance

Conversation

@francesco-bertolotti
Copy link
Copy Markdown
Contributor

Here’s a cleaner Markdown rewrite of your PR description:


Description

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing 1e-2 tolerance.

This change slightly relaxes the fp16 tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.

Fixes: N/A (spurious test_kv_cache failure on sm80 / fp16 / head_dim=256)


Type of change

  • Documentation change
  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Infra/Build change
  • Code refactoring

Checklist

  • I have read and followed the contributing guidelines (https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (N/A — test-only change)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works (N/A — test-only tolerance adjustment)
  • New and existing unit tests pass locally with my changes

Notes

  • Contributing guidelines and full test runs still need to be verified before final submission.
  • bfloat16 tolerance in this branch is unchanged, as it was not failing.
  • The Pyright import warnings are unrelated noise from missing stubs in the local environment.
  • Potential reviewer feedback: this could be further refined to apply the tolerance only conditionally by compute capability (e.g., sm < 90), rather than broadly for the branch.

Diff summary

# tests/pytorch/attention/test_kv_cache.py (get_tols)
- torch.half: (1e-2, 1e-2),
+ torch.half: (1.5e-2, 1.5e-2),

# plus explanatory comment added

@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 29, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR relaxes the torch.half numerical tolerance in get_tols (inside test_kv_cache.py) from 1e-2 to 1.5e-2 for the TransformerLayer + head_dim > 128 + fused/flash-attention path, to fix a spurious failure observed on sm80 (Ampere) with older cuDNN. No runtime or library code is modified.

  • Tolerance bump is narrow and well-targeted: it only affects the TransformerLayer branch where head_dim > 128 and the backend is not UnfusedAttention; all other tolerances (bfloat16, DotProductAttention, head_dim <= 128) are untouched.
  • A clear inline comment explains the root cause (different kernels and masking strategies between full-sequence and incremental-KV-cache paths leading to accumulated fp16 rounding on Ampere).

Confidence Score: 5/5

Test-only change with no impact on library or runtime code; safe to merge.

The change touches exactly one tolerance constant in one branch of a test helper, is accompanied by a clear comment explaining the hardware-specific divergence, and does not affect any production path.

No files require special attention.

Important Files Changed

Filename Overview
tests/pytorch/attention/test_kv_cache.py Bumps torch.half atol/rtol from 1e-2 to 1.5e-2 in the TransformerLayer / head_dim>128 / non-UnfusedAttention branch of get_tols, with an explanatory comment about sm80 + older cuDNN divergence.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_tols called] --> B{module == TransformerLayer?}
    B -- yes --> C{head_dim_qk <= 128?}
    B -- no --> G{module == DotProductAttention?}
    C -- yes --> D["tols: half=(5e-3,5e-3)\nbf16=(3.5e-2,3.5e-2)"]
    C -- no --> E{backend == UnfusedAttention?}
    E -- yes --> F["tols: half=(1.6e-2,1.6e-2)\nbf16=(1.2e-1,1e-1)"]
    E -- no --> H["tols: half=(1.5e-2,1.5e-2) ← changed\nbf16=(8e-2,7e-2)"]
    G -- yes --> I["tols: half=(1e-3,1e-3)\nbf16=(1e-2,1e-3)\nfp8=(2e-2,3e-2)"]
    D --> J[return tols at dtype]
    F --> J
    H --> J
    I --> J
Loading

Reviews (3): Last reviewed commit: "Merge branch 'main' into f14-increased-t..." | Re-trigger Greptile

@cyanguwa cyanguwa self-requested a review May 29, 2026 19:58
@cyanguwa
Copy link
Copy Markdown
Collaborator

/te-ci pytorch L0

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
@cyanguwa
Copy link
Copy Markdown
Collaborator

cyanguwa commented Jun 1, 2026

/te-ci pytorch L0

@cyanguwa
Copy link
Copy Markdown
Collaborator

cyanguwa commented Jun 1, 2026

Hi @francesco-bertolotti, thanks for the contribution. The rational behind the PR makes sense, and indeed we have seen sporadic single-element mismatch failures for test_kv_cache.py, but I think as of last night, the nightly CI is passing with the existing tolerances, on all architectures (Ampere, Hopper and Blackwell). Were you observing some failures? What's the architecture and cuDNN version you're running with? Thanks!

@francesco-bertolotti
Copy link
Copy Markdown
Contributor Author

Were you observing some failures?

At the moment, I'm mainly interested in the ring attention kernel, which I plan to integrate into a fork of TorchTitan. However, since I need to build Transformer Engine to get it working properly on the Leonardo HPC system, I'd like to first reach a state where the full test suite passes.

What's the architecture and cuDNN version you're running with?

I'm currently building against the cuDNN version that ships with the torch+cu126 wheels, which is cuDNN 9.10.2.

One thing I haven't tried yet is building against a newer cuDNN release. My concern is whether that could lead to issues if PyTorch is using cuDNN 9.10 while Transformer Engine is built against a different version.

@timmoon10
Copy link
Copy Markdown
Member

timmoon10 commented Jun 1, 2026

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

Wait, if it's true that the masking is different, why do we expect the full-sequence and KV-cache paths to match? Is it just a representation difference, or are they computing something actually different? If they are computing something different, then that's the true bug. Increasing tols to make two unlike things match is the same as not testing numerics at all.

@ptrendx
Copy link
Copy Markdown
Member

ptrendx commented Jun 1, 2026

@timmoon10 It is expected that the padding masks will be different - the full non-cached execution uses the full triangle (so the causal mask) and the cached execution works on 1 row at a time, so bottom right causal padding is correct (It could maybe even be None, depending on the shapes provided to the kernel).

@cyanguwa
Copy link
Copy Markdown
Collaborator

cyanguwa commented Jun 3, 2026

One thing I haven't tried yet is building against a newer cuDNN release. My concern is whether that could lead to issues if PyTorch is using cuDNN 9.10 while Transformer Engine is built against a different version.

I think that's ok - you can build TE against a different version of cuDNN than PyTorch. If it's in the system path, TE should be able to pick it up automatically; if not, you can specify the path with CUDNN_PATH.

With 9.10.2, I believe we did test it with TE 2.5 and the CI was clean too, but now we're on TE 2.16 and cuDNN 9.23, and I don't think we've gone back to those older versions for testing. Could you please try 9.23 and see if there's any tolerance issues? I don't have a problem with the changes in this PR, but just wanted to make sure we're not making any unnecessary changes :). Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants