increasing precision tolerance#3060
Conversation
Greptile SummaryThis PR relaxes the
Confidence Score: 5/5Test-only change with no impact on library or runtime code; safe to merge. The change touches exactly one tolerance constant in one branch of a test helper, is accompanied by a clear comment explaining the hardware-specific divergence, and does not affect any production path. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[get_tols called] --> B{module == TransformerLayer?}
B -- yes --> C{head_dim_qk <= 128?}
B -- no --> G{module == DotProductAttention?}
C -- yes --> D["tols: half=(5e-3,5e-3)\nbf16=(3.5e-2,3.5e-2)"]
C -- no --> E{backend == UnfusedAttention?}
E -- yes --> F["tols: half=(1.6e-2,1.6e-2)\nbf16=(1.2e-1,1e-1)"]
E -- no --> H["tols: half=(1.5e-2,1.5e-2) ← changed\nbf16=(8e-2,7e-2)"]
G -- yes --> I["tols: half=(1e-3,1e-3)\nbf16=(1e-2,1e-3)\nfp8=(2e-2,3e-2)"]
D --> J[return tols at dtype]
F --> J
H --> J
I --> J
Reviews (3): Last reviewed commit: "Merge branch 'main' into f14-increased-t..." | Re-trigger Greptile |
|
/te-ci pytorch L0 |
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
3695fe4 to
2c76593
Compare
|
/te-ci pytorch L0 |
|
Hi @francesco-bertolotti, thanks for the contribution. The rational behind the PR makes sense, and indeed we have seen sporadic single-element mismatch failures for |
At the moment, I'm mainly interested in the ring attention kernel, which I plan to integrate into a fork of TorchTitan. However, since I need to build Transformer Engine to get it working properly on the Leonardo HPC system, I'd like to first reach a state where the full test suite passes.
I'm currently building against the cuDNN version that ships with the One thing I haven't tried yet is building against a newer cuDNN release. My concern is whether that could lead to issues if PyTorch is using cuDNN 9.10 while Transformer Engine is built against a different version. |
Wait, if it's true that the masking is different, why do we expect the full-sequence and KV-cache paths to match? Is it just a representation difference, or are they computing something actually different? If they are computing something different, then that's the true bug. Increasing tols to make two unlike things match is the same as not testing numerics at all. |
|
@timmoon10 It is expected that the padding masks will be different - the full non-cached execution uses the full triangle (so the causal mask) and the cached execution works on 1 row at a time, so bottom right causal padding is correct (It could maybe even be None, depending on the shapes provided to the kernel). |
I think that's ok - you can build TE against a different version of cuDNN than PyTorch. If it's in the system path, TE should be able to pick it up automatically; if not, you can specify the path with CUDNN_PATH. With 9.10.2, I believe we did test it with TE 2.5 and the CI was clean too, but now we're on TE 2.16 and cuDNN 9.23, and I don't think we've gone back to those older versions for testing. Could you please try 9.23 and see if there's any tolerance issues? I don't have a problem with the changes in this PR, but just wanted to make sure we're not making any unnecessary changes :). Thanks. |
Here’s a cleaner Markdown rewrite of your PR description:
Description
test_kv_cachecompares full-sequence attention against incremental KV-cache decoding. In theTransformerLayerconfiguration wherehead_dim > 128infp16, these two execution paths use different kernels and masking strategies (e.g.,causalvs.padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulatedfp16rounding differences.On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing
1e-2tolerance.This change slightly relaxes the
fp16tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.Fixes: N/A (spurious
test_kv_cachefailure on sm80 / fp16 / head_dim=256)Type of change
Checklist
Notes
bfloat16tolerance in this branch is unchanged, as it was not failing.Diff summary