Benchmark kernel GPU time via profiler #541
Merged
The CUDA-event window around each call measured the GPU-timeline span, including idle bubbles from per-call allocation and (for autograd variants) Python/engine launch overhead between kernels. Those bubbles are benchmarking artifacts (in a real run the CPU stays ahead of the GPU and the caching allocator serves buffers without syncs), and they vary by variant, so they unfairly penalized the Python-autograd Triton path versus the eager C++ engine. They also made backward timing (derived as fwd_bwd - fwd) swing ~190% run-to-run on a fixed shape.

Sum per-kernel device self-time from torch.profiler instead. Only CUDA-device entries are summed (custom-autograd CPU nodes carry their children's device time and would double-count); the L2-flush fill kernel is identified from a flush-only profile and excluded. Reproduces to <0.5% run-to-run and matches isolated-kernel timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
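The double-counting concern in this commit can be illustrated without a GPU. The sketch below uses a hypothetical `FakeEvent` stand-in that mimics only two attributes of a profiler event row (`device_type` and `self_device_time_total`); the numbers are made up, and the helper name `sum_device_self_time_us` is not taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class FakeEvent:
    """Hypothetical stand-in for one profiler event row (not a torch type)."""
    key: str
    device_type: str               # "CPU" or "CUDA" in this sketch
    self_device_time_total: float  # microseconds


def sum_device_self_time_us(events, cuda_device_type):
    """Sum self device time over entries recorded on the CUDA device only."""
    return sum(
        e.self_device_time_total
        for e in events
        if e.device_type == cuda_device_type
    )


# Made-up profile: a CPU-side autograd node that carries its children's
# device time, plus the two kernels it launched.
events = [
    FakeEvent("MyAutogradFn", "CPU", 300.0),
    FakeEvent("k1", "CUDA", 100.0),
    FakeEvent("k2", "CUDA", 200.0),
]

filtered_us = sum_device_self_time_us(events, "CUDA")   # kernels only
naive_us = sum(e.self_device_time_total for e in events)  # counts k1+k2 twice
```

With real profiler output the second argument would presumably be `torch.autograd.DeviceType.CUDA` rather than a string.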
Addresses review of the profiler timing change:

- Move the L2 flush outside each rep's profiler context. The flush kernel is no longer recorded, so the fragile key-based exclusion (which could silently drop a benchmarked kernel's own int32 fill/memset, e.g. sparse/MoE accumulators) is gone entirely.
- Profile each rep in its own CUDA-only session (drops the unused CPU activity); a single fn() call has no autograd CPU node, so no double-counting to filter.
- Cap profiled reps at a small fixed count instead of up to max_reps=10_000.
- Sum each rep's device self-time into a per-rep sample, then compute real median/mean/min/max/std, so verbose min/max/std reflect measured spread instead of identical values, and the harness demonstrates stability rather than asserting it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
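A minimal sketch of the capped, per-rep sampling described in this commit, using only the standard library. The cap value, the function names, and the choice of sample standard deviation are assumptions for illustration; the commit says only "a small fixed count".

```python
import statistics

MAX_PROFILED_REPS = 20  # hypothetical cap; the actual value is not stated


def collect_samples(measure_rep, requested_reps, cap=MAX_PROFILED_REPS):
    """Call measure_rep() once per rep, never more than `cap` times."""
    n = max(1, min(requested_reps, cap))
    return [measure_rep() for _ in range(n)]


def summarize(samples):
    """Compute spread statistics from measured per-rep samples."""
    return {
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
        # Sample standard deviation; defined as 0.0 for a single sample.
        "std": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# Usage with made-up per-rep device times in microseconds.
fake_times = iter([100.0, 102.0, 101.0])
samples = collect_samples(lambda: next(fake_times), requested_reps=10_000, cap=3)
stats = summarize(samples)
```

Because each rep contributes its own sample, min, max, and std come from real measurements rather than from one aggregate repeated.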
Addresses review: the flush memset was enqueued async with no sync before the profiler context opened, so excluding it from the device-time sum relied on CUPTI-startup latency exceeding the ~76us 256MB memset rather than a guarantee. Sync after the flush so it completes before the capture window opens; the cold-L2 intent is preserved (the sync waits for the eviction, doesn't repopulate with fn's data).

Also correct the rationale comment: what prevents the fwd_bwd autograd CPU node from double-counting is that CUDA-only profiling records no CPU op nodes (those carry their children's device time in self_device_time_total), with the device_type == CUDA filter as backstop for zero-device-time runtime entries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
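The ordering this commit establishes (flush, then sync, then open the capture window) might look roughly like the sketch below. `run_rep` and `make_torch_rep` are hypothetical names, the int32 flush buffer and the in-context `torch.cuda.synchronize()` are assumptions, and the PyTorch wiring is imported lazily so the ordering logic can be exercised without a GPU.

```python
def run_rep(flush, synchronize, measure):
    """One rep: evict L2, wait for the eviction, then open the capture window."""
    flush()        # enqueue the cache-evicting fill (asynchronous on a GPU)
    synchronize()  # guarantee the fill has finished before capture starts
    return measure()


def make_torch_rep(fn, flush_bytes=256 * 1024 * 1024):
    """Wire run_rep to PyTorch. Requires a CUDA-enabled PyTorch build."""
    import torch
    import torch.autograd
    import torch.profiler

    # Assumed flush buffer: an int32 tensor of flush_bytes total size.
    cache = torch.empty(flush_bytes // 4, dtype=torch.int32, device="cuda")

    def measure():
        # CUDA-only session: no CPU op nodes are recorded.
        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CUDA]
        ) as prof:
            fn()
            torch.cuda.synchronize()  # assumption: drain kernels before exit
        # Backstop filter: keep only entries recorded on the CUDA device.
        return sum(
            e.self_device_time_total
            for e in prof.key_averages()
            if e.device_type == torch.autograd.DeviceType.CUDA
        )

    return lambda: run_rep(cache.zero_, torch.cuda.synchronize, measure)
```

On a CUDA machine, `make_torch_rep(fn)()` would return one per-rep sample in microseconds. The flush runs outside the profiler context, so its kernel never enters the sum.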
Match the project's third-party import convention (import package.module, keep fully qualified) and this file's existing torch.cuda.* style, rather than `from torch.profiler import ...` / `from torch.autograd import ...`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary

The Triton kernel benchmark suite (`tools/benchmark/triton_kernels/`) timed each variant with a CUDA-event window around the whole call. That window measures the GPU-timeline span, which includes GPU-idle bubbles from per-call allocation and, for autograd variants, Python/engine launch overhead between kernels.

Those bubbles are benchmarking artifacts: in a real training step the CPU runs ahead of the GPU and the caching allocator serves buffers without syncs, so the GPU never sees them. They also vary by variant (the eager C++ autograd engine starves the GPU far less than a Python autograd `Function`), so they unfairly penalized the Triton path relative to eager/compiled. And because backward time is derived as `fwd_bwd − fwd`, the variance landed in the `bwd` column: on a fixed shape the same kernel swung ~190% run-to-run (normalization backward reported anywhere from ~200µs to ~580µs across runs).

This measures GPU kernel time instead: each rep sums per-kernel device self-time from `torch.profiler`, the work the GPU actually does, which is both the artifact-free number and the realistic one.

How it works

- Each rep runs in its own CUDA-only `torch.profiler` session; sum `self_device_time_total` over `device_type == CUDA` entries → one per-rep sample.
- Statistics are computed from the per-rep samples, so `--verbose` min/max reflect measured spread.

Validation (H100)

- […] `do_bench` ground truth (e.g. normalization backward: profiler `k1+k2` ≈ the per-kernel sum measured separately).
- […] `torch.compile`).

Notes / out of scope

- `reverse_kl` shows a `fwd > total` inconsistency at one size (→ negative derived `bwd`). This is a pre-existing property of the `bwd = total − fwd` derivation for that kernel's forward measurement (related to #497, "Triton reverse_kl loss kernel is ~3x slower than torch.compile (single GPU)"), not introduced here.

🤖 Generated with Claude Code
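The `bwd = total − fwd` derivation is plain subtraction, so any noise in either measurement lands directly in `bwd`. The sketch below works through two cases. Reading "~190%" as (max − min) / min is an assumption, though it matches the quoted ~200µs and ~580µs; the `fwd > total` numbers are made up for illustration, and both function names are hypothetical.

```python
def derived_bwd_us(total_us, fwd_us):
    """Backward time derived from two separate measurements."""
    return total_us - fwd_us


def relative_swing(values):
    """Spread of repeated measurements relative to the smallest one."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


# Quoted range for normalization backward under the old CUDA-event timing.
swing = relative_swing([200.0, 580.0])  # (580 - 200) / 200

# Hypothetical case where the forward measurement exceeds the total:
# the derived backward time comes out negative.
negative_bwd = derived_bwd_us(total_us=95.0, fwd_us=100.0)
```

A negative result here indicates inconsistent inputs to the subtraction, not a property of the profiler-based timing itself.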