Benchmark kernel GPU time via profiler#541

Merged: jlamypoirier merged 5 commits into main from jlp_bench-profiler-kernel-time on Jun 16, 2026
Conversation

@jlamypoirier (Collaborator) commented Jun 15, 2026

Summary

The Triton kernel benchmark suite (tools/benchmark/triton_kernels/) timed each variant with a CUDA-event window around the whole call. That window measures the GPU-timeline span, which includes GPU-idle bubbles from per-call allocation and — for autograd variants — Python/engine launch overhead between kernels.

Those bubbles are benchmarking artifacts: in a real training step the CPU runs ahead of the GPU and the caching allocator serves buffers without syncs, so the GPU never sees them. They also vary by variant — the eager C++ autograd engine starves the GPU far less than a Python autograd Function — so they unfairly penalized the Triton path relative to eager/compiled. And because backward time is derived as fwd_bwd − fwd, the variance landed in the bwd column: on a fixed shape the same kernel swung ~190% run-to-run (normalization backward reported anywhere from ~200µs to ~580µs across runs).
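The arithmetic behind that swing can be sketched as follows. This is an illustration only: the helper names and the forward and forward+backward figures are hypothetical, chosen so the derived backward values land on the ~200µs and ~580µs endpoints quoted above. It shows that any idle time captured in the fwd_bwd window passes straight through the subtraction into the bwd column.

```python
def derived_backward_us(fwd_bwd_us: float, fwd_us: float) -> float:
    """Backward time as the old harness derived it: fwd_bwd minus fwd."""
    return fwd_bwd_us - fwd_us


def relative_swing_percent(low_us: float, high_us: float) -> float:
    """Swing between two reported values, relative to the lower one."""
    return (high_us - low_us) / low_us * 100.0


# Hypothetical event-window timings for one fixed shape, in microseconds.
fwd_us = 100.0
quiet_run = derived_backward_us(fwd_bwd_us=300.0, fwd_us=fwd_us)
# Same kernels, but 380 us of GPU-idle time fell inside the fwd_bwd window.
bubbly_run = derived_backward_us(fwd_bwd_us=300.0 + 380.0, fwd_us=fwd_us)

swing = relative_swing_percent(quiet_run, bubbly_run)
print(f"bwd: {quiet_run:.0f} us vs {bubbly_run:.0f} us, swing {swing:.0f}%")
```

The kernels did the same work in both runs; only the idle time inside the window changed.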

This PR measures GPU kernel time instead: each rep sums per-kernel device self-time from torch.profiler. That is the work the GPU actually does, so it is both the artifact-free number and the realistic one.

How it works

  • Profile each rep in its own CUDA-only torch.profiler session; sum self_device_time_total over device_type == CUDA entries → one per-rep sample.
  • The L2 flush runs outside each rep's profiler context, so its fill kernel isn't recorded (no kernel needs to be excluded from the sum) while each benchmarked kernel still starts from a cold L2 cache.
  • Compute real median / mean / min / max / std over the per-rep samples — so the harness demonstrates stability rather than asserting it, and --verbose min/max reflect measured spread.
  • Profiled rep count is capped at a small fixed number (device time is stable; no need to profile thousands).
  • Warmup / rep-estimate machinery is unchanged (still needed for compile/autotune/alloc).
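The steps above can be sketched roughly as below. This is not the harness's actual code: the function names, the default of 10 reps, and the int8 flush buffer are assumptions for illustration (the 256 MB size is taken from the commit message). It assumes a PyTorch version whose profiler events expose `self_device_time_total` and `device_type`, and it needs a CUDA device to run `profile_kernel_time_us`.

```python
import statistics
from typing import Callable, Iterable


def sum_cuda_self_time_us(events: Iterable, cuda_device_type) -> float:
    """Sum self device time over entries whose device_type is CUDA."""
    return float(sum(
        event.self_device_time_total
        for event in events
        if event.device_type == cuda_device_type
    ))


def summarize(samples_us: list[float]) -> dict[str, float]:
    """Compute real statistics over the per-rep samples."""
    return {
        "median": statistics.median(samples_us),
        "mean": statistics.fmean(samples_us),
        "min": min(samples_us),
        "max": max(samples_us),
        "std": statistics.stdev(samples_us) if len(samples_us) > 1 else 0.0,
    }


def profile_kernel_time_us(
    fn: Callable[[], object],
    reps: int = 10,  # hypothetical cap; the PR only says "a small fixed number"
    flush_bytes: int = 256 * 1024 * 1024,
) -> dict[str, float]:
    # Imported lazily so the pure-Python helpers above work without a GPU.
    import torch
    import torch.autograd
    import torch.profiler

    # Assumed flush mechanism: overwrite a buffer larger than L2.
    flush_buffer = torch.empty(flush_bytes, dtype=torch.int8, device="cuda")
    samples_us = []
    for _ in range(reps):
        # Flush outside the capture window, then sync so the fill kernel
        # has finished before the profiler starts recording.
        flush_buffer.zero_()
        torch.cuda.synchronize()
        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CUDA]
        ) as prof:
            fn()
            torch.cuda.synchronize()
        samples_us.append(
            sum_cuda_self_time_us(
                prof.key_averages(), torch.autograd.DeviceType.CUDA
            )
        )
    return summarize(samples_us)
```

Keeping the summation and the statistics as plain-Python helpers means they can be checked without a GPU; only `profile_kernel_time_us` touches CUDA.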

Validation (H100)

  • Matches isolated single-kernel do_bench ground truth (e.g. normalization backward: profiler k1+k2 ≈ the per-kernel sum measured separately).
  • Headline (median) reproduces to <0.5% run-to-run, vs ~190% with the event method; measured per-rep spread is ~2–4%.
  • Generalizes across benchmark families (normalization, cross_entropy, z_loss, rotary) and variant types (Triton, eager, torch.compile).
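One way to put numbers on "run-to-run" and "per-rep spread" is sketched below. Both metric definitions are assumptions (the PR does not state how its percentages were computed), and every sample value is made up for illustration.

```python
import statistics


def run_to_run_deviation_percent(medians_us: list[float]) -> float:
    """Largest deviation of any run's median from the mean of the medians."""
    center = statistics.fmean(medians_us)
    return max(abs(m - center) for m in medians_us) / center * 100.0


def per_rep_spread_percent(samples_us: list[float]) -> float:
    """Range of the per-rep samples (max - min) relative to their median."""
    spread = max(samples_us) - min(samples_us)
    return spread / statistics.median(samples_us) * 100.0


# Hypothetical headline medians from three separate benchmark runs (us).
medians = [100.0, 100.2, 99.8]
# Hypothetical per-rep samples from one run (us).
samples = [98.0, 100.0, 101.0]

print(f"run-to-run: {run_to_run_deviation_percent(medians):.2f}%")
print(f"per-rep spread: {per_rep_spread_percent(samples):.2f}%")
```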

Notes / out of scope

🤖 Generated with Claude Code

jlamypoirier and others added 5 commits June 15, 2026 17:03
The CUDA-event window around each call measured the GPU-timeline span,
including idle bubbles from per-call allocation and (for autograd variants)
Python/engine launch overhead between kernels. Those bubbles are benchmarking
artifacts — in a real run the CPU stays ahead of the GPU and the caching
allocator serves buffers without syncs — and they vary by variant, so they
unfairly penalized the Python-autograd Triton path versus the eager C++ engine.
They also made backward timing (derived as fwd_bwd - fwd) swing ~190%
run-to-run on a fixed shape.

Sum per-kernel device self-time from torch.profiler instead. Only CUDA-device
entries are summed (custom-autograd CPU nodes carry their children's device
time and would double-count); the L2-flush fill kernel is identified from a
flush-only profile and excluded. Reproduces to <0.5% run-to-run and matches
isolated-kernel timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Addresses review of the profiler timing change:

- Move the L2 flush outside each rep's profiler context. The flush kernel is no
  longer recorded, so the fragile key-based exclusion (which could silently drop
  a benchmarked kernel's own int32 fill/memset, e.g. sparse/MoE accumulators) is
  gone entirely.
- Profile each rep in its own CUDA-only session (drops the unused CPU activity);
  a single fn() call has no autograd CPU node, so no double-counting to filter.
- Cap profiled reps at a small fixed count instead of up to max_reps=10_000.
- Sum each rep's device self-time into a per-rep sample, then compute real
  median/mean/min/max/std — so verbose min/max/std reflect measured spread
  instead of identical values, and the harness demonstrates stability rather
  than asserting it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Addresses review: the flush memset was enqueued async with no sync before the
profiler context opened, so excluding it from the device-time sum relied on
CUPTI-startup latency exceeding the ~76us 256MB memset rather than a guarantee.
Sync after the flush so it completes before the capture window opens; the
cold-L2 intent is preserved (the sync waits for the eviction, doesn't repopulate
with fn's data).

Also correct the rationale comment: what prevents the fwd_bwd autograd CPU node
from double-counting is that CUDA-only profiling records no CPU op nodes (those
carry their children's device time in self_device_time_total), with the
device_type == CUDA filter as backstop for zero-device-time runtime entries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Match the project's third-party import convention (import package.module, keep
fully qualified) and this file's existing torch.cuda.* style, rather than
`from torch.profiler import ...` / `from torch.autograd import ...`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jlamypoirier jlamypoirier merged commit 9b1fa19 into main Jun 16, 2026
2 checks passed
@jlamypoirier jlamypoirier deleted the jlp_bench-profiler-kernel-time branch June 16, 2026 18:32