Benchmark kernel GPU time via profiler by jlamypoirier · Pull Request #541 · ServiceNow/Fast-LLM

jlamypoirier · 2026-06-15T21:14:30Z

Summary

The Triton kernel benchmark suite (tools/benchmark/triton_kernels/) timed each variant with a CUDA-event window around the whole call. That window measures the GPU-timeline span, which includes GPU-idle bubbles from per-call allocation and — for autograd variants — Python/engine launch overhead between kernels.

Those bubbles are benchmarking artifacts: in a real training step the CPU runs ahead of the GPU and the caching allocator serves buffers without syncs, so the GPU never sees them. They also vary by variant — the eager C++ autograd engine starves the GPU far less than a Python autograd Function — so they unfairly penalized the Triton path relative to eager/compiled. And because backward time is derived as fwd_bwd − fwd, the variance landed in the bwd column: on a fixed shape the same kernel swung ~190% run-to-run (normalization backward reported anywhere from ~200µs to ~580µs across runs).
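To see why the variance concentrates in the bwd column, here is a small arithmetic sketch of the fwd_bwd − fwd derivation. The ~200µs and ~580µs endpoints are the ones quoted above; the 100µs forward time, the fwd_bwd totals, and the helper names are hypothetical, chosen only for illustration.

```python
def derived_bwd_us(fwd_bwd_us: float, fwd_us: float) -> float:
    """Backward time as the benchmark derives it: total minus forward."""
    return fwd_bwd_us - fwd_us


def relative_swing_pct(low: float, high: float) -> float:
    """Increase of `high` over `low`, in percent."""
    return (high - low) / low * 100.0


# Hypothetical stable forward time; not a number from the PR.
FWD_US = 100.0

# Two runs of the same shape. Any idle bubble inside the fwd_bwd window
# is attributed entirely to the derived backward time.
quiet_run = derived_bwd_us(fwd_bwd_us=300.0, fwd_us=FWD_US)   # 200 µs
bubbly_run = derived_bwd_us(fwd_bwd_us=680.0, fwd_us=FWD_US)  # 580 µs

swing = relative_swing_pct(quiet_run, bubbly_run)
print(f"derived bwd: {quiet_run:.0f} µs vs {bubbly_run:.0f} µs, swing {swing:.0f}%")
```

Because the forward term is subtracted unchanged, every microsecond of bubble in the combined window shows up as backward time, even though no backward kernel ran any slower.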

This measures GPU kernel time instead: each rep sums per-kernel device self-time from torch.profiler — the work the GPU actually does, which is both the artifact-free number and the realistic one.

How it works

Profile each rep in its own CUDA-only torch.profiler session; sum self_device_time_total over device_type == CUDA entries → one per-rep sample.
The L2 flush runs outside each rep's profiler context, so its fill kernel is never recorded and nothing has to be excluded from the sum, while each benchmarked kernel still starts from a cold L2.
Compute real median / mean / min / max / std over the per-rep samples — so the harness demonstrates stability rather than asserting it, and --verbose min/max reflect measured spread.
Profiled rep count is capped at a small fixed number (device time is stable; no need to profile thousands).
Warmup / rep-estimate machinery is unchanged (still needed for compile/autotune/alloc).
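The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `profile_one_rep`, `sum_cuda_self_time_us`, `fn`, and `flush_l2` are hypothetical names, and the attribute names assume a PyTorch version whose profiler events expose `self_device_time_total` and `device_type`. The stand-in records at the bottom are made up so the summation can be tried without a GPU.

```python
import types


def sum_cuda_self_time_us(events, cuda_device_type) -> float:
    """Sum device self-time over entries recorded on the CUDA device."""
    return sum(
        event.self_device_time_total
        for event in events
        if event.device_type == cuda_device_type
    )


def profile_one_rep(fn, flush_l2) -> float:
    """Return the summed GPU kernel self-time of one call to `fn`.

    `fn` and `flush_l2` are hypothetical callables standing in for the
    benchmarked variant and the harness's L2 flush.
    """
    # Imported lazily so the pure-Python helper above works without a GPU stack.
    import torch
    import torch.autograd
    import torch.profiler

    flush_l2()                # runs outside the capture window
    torch.cuda.synchronize()  # flush finishes before the profiler opens
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA]
    ) as prof:
        fn()
        torch.cuda.synchronize()  # wait for all kernels before closing the trace
    return sum_cuda_self_time_us(
        prof.key_averages(), torch.autograd.DeviceType.CUDA
    )


# Stand-in records (not real profiler output) to show the filter's effect:
# the CPU entry carries its children's device time and must not be summed.
fake_events = [
    types.SimpleNamespace(device_type="CUDA", self_device_time_total=120.0),
    types.SimpleNamespace(device_type="CUDA", self_device_time_total=80.0),
    types.SimpleNamespace(device_type="CPU", self_device_time_total=200.0),
]
print(sum_cuda_self_time_us(fake_events, "CUDA"))  # 200.0
```

In a CUDA-only session no CPU op nodes are expected in the first place, so the device-type check acts as a backstop rather than the primary guard.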

Validation (H100)

Matches isolated single-kernel do_bench ground truth (e.g. normalization backward: profiler k1+k2 ≈ the per-kernel sum measured separately).
Headline (median) reproduces to <0.5% run-to-run, vs ~190% with the event method; measured per-rep spread is ~2–4%.
Generalizes across benchmark families (normalization, cross_entropy, z_loss, rotary) and variant types (Triton, eager, torch.compile).
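As an illustration of how per-rep samples turn into the reported statistics and a spread figure, here is a short sketch. The sample values are hypothetical, not measurements from this PR, and both the use of the sample standard deviation and the definition of spread as (max − min) / median are assumptions; the harness may define them differently.

```python
import statistics


def summarize_us(samples: list[float]) -> dict[str, float]:
    """Median / mean / min / max / std over per-rep device-time samples."""
    return {
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
        "std": statistics.stdev(samples),  # sample std; an assumption
    }


def spread_pct(stats: dict[str, float]) -> float:
    """(max - min) relative to the median, in percent."""
    return (stats["max"] - stats["min"]) / stats["median"] * 100.0


# Hypothetical per-rep samples in µs, not measurements from the PR.
samples = [200.0, 202.0, 198.0, 201.0, 204.0]
stats = summarize_us(samples)
print(stats["median"], round(spread_pct(stats), 1))  # 201.0 3.0
```

Reporting the median as the headline keeps a single outlier rep from moving the number, while min/max/std expose the spread that was actually measured.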

Notes / out of scope

reverse_kl shows a fwd > total inconsistency at one size, which yields a negative derived bwd. This is a pre-existing property of the bwd = total − fwd derivation for that kernel's forward measurement (related to #497, "Triton reverse_kl loss kernel is ~3x slower than torch.compile (single GPU)"), not introduced here.

🤖 Generated with Claude Code

The CUDA-event window around each call measured the GPU-timeline span, including idle bubbles from per-call allocation and (for autograd variants) Python/engine launch overhead between kernels. Those bubbles are benchmarking artifacts (in a real run the CPU stays ahead of the GPU and the caching allocator serves buffers without syncs), and they vary by variant, so they unfairly penalized the Python-autograd Triton path versus the eager C++ engine. They also made backward timing (derived as fwd_bwd - fwd) swing ~190% run-to-run on a fixed shape.

Sum per-kernel device self-time from torch.profiler instead. Only CUDA-device entries are summed (custom-autograd CPU nodes carry their children's device time and would double-count); the L2-flush fill kernel is identified from a flush-only profile and excluded. Reproduces to <0.5% run-to-run and matches isolated-kernel timing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Addresses review of the profiler timing change:

- Move the L2 flush outside each rep's profiler context. The flush kernel is no longer recorded, so the fragile key-based exclusion (which could silently drop a benchmarked kernel's own int32 fill/memset, e.g. sparse/MoE accumulators) is gone entirely.
- Profile each rep in its own CUDA-only session (drops the unused CPU activity); a single fn() call has no autograd CPU node, so no double-counting to filter.
- Cap profiled reps at a small fixed count instead of up to max_reps=10_000.
- Sum each rep's device self-time into a per-rep sample, then compute real median/mean/min/max/std, so verbose min/max/std reflect measured spread instead of identical values, and the harness demonstrates stability rather than asserting it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Addresses review: the flush memset was enqueued async with no sync before the profiler context opened, so excluding it from the device-time sum relied on CUPTI-startup latency exceeding the ~76us 256MB memset rather than a guarantee. Sync after the flush so it completes before the capture window opens; the cold-L2 intent is preserved (the sync waits for the eviction, doesn't repopulate with fn's data).

Also correct the rationale comment: what prevents the fwd_bwd autograd CPU node from double-counting is that CUDA-only profiling records no CPU op nodes (those carry their children's device time in self_device_time_total), with the device_type == CUDA filter as backstop for zero-device-time runtime entries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Match the project's third-party import convention (import package.module, keep fully qualified) and this file's existing torch.cuda.* style, rather than `from torch.profiler import ...` / `from torch.autograd import ...`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jlamypoirier and others added 5 commits June 15, 2026 17:03

Merge branch 'main' into jlp_bench-profiler-kernel-time

4d2d61f

jlamypoirier merged commit 9b1fa19 into main Jun 16, 2026
2 checks passed

jlamypoirier deleted the jlp_bench-profiler-kernel-time branch June 16, 2026 18:32
