Latency/Energy Calculation #240

krishnateja95 · 2023-05-17T01:42:14Z

Is there a method to calculate performance metrics (latency, energy) of a Transformer model (BERT, ViT) on the H100 Transformer Engine using PyTorch?

Advait251206 · 2026-07-02T08:28:09Z

Transformer Engine itself does not provide a built-in API to directly calculate end-to-end latency or energy consumption for Transformer models such as BERT or ViT. Its primary purpose is to accelerate training and inference using FP8/BF16 mixed precision on supported NVIDIA GPUs (such as the H100), rather than serving as a performance analysis framework.

Measuring Latency

The recommended approach is to benchmark your model using PyTorch with proper GPU synchronization to obtain accurate timings.

import time
import torch

model.eval()  # `model` is assumed to already be on the GPU
x = ...       # input batch on the same device as the model

# Warm-up
for _ in range(20):
    with torch.no_grad():
        _ = model(x)

torch.cuda.synchronize()
start = time.perf_counter()

with torch.no_grad():
    _ = model(x)

torch.cuda.synchronize()
end = time.perf_counter()

print(f"Latency: {(end - start) * 1000:.3f} ms")

For reliable measurements:

Perform multiple warm-up iterations before timing.
Benchmark over 100–1000 iterations.
Report the average, median, and optionally the standard deviation.
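The points above can be sketched as a small timing helper. This is a minimal sketch, not an official utility: `benchmark` and `summarize` are hypothetical names, and the `sync` argument is an assumption that lets you pass `torch.cuda.synchronize` so the clock is only read after the GPU has finished.

```python
import statistics
import time


def benchmark(fn, warmup=20, iters=100, sync=None):
    """Time fn() over `iters` runs after `warmup` untimed runs.

    `sync` is an optional zero-argument callable that blocks until queued
    GPU work has finished (for example torch.cuda.synchronize).
    Returns the per-iteration latencies in milliseconds.
    """
    sync = sync or (lambda: None)

    # Warm-up: lets kernels compile/autotune and caches fill before timing.
    for _ in range(warmup):
        fn()
    sync()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for the GPU before reading the clock
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Return mean, median, and sample standard deviation in milliseconds."""
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0,
    }


# Hypothetical usage with a PyTorch model (not executed here):
# with torch.no_grad():
#     stats = summarize(benchmark(lambda: model(x), sync=torch.cuda.synchronize))
```

Synchronizing inside every iteration gives per-call latency; if you care about throughput instead, time the whole loop once and synchronize only at the end.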

Measuring Energy Consumption

Transformer Engine does not expose energy measurements directly. Instead, use GPU telemetry tools such as:

nvidia-smi (power sampling)
NVML (NVIDIA Management Library) via the pynvml Python package
NVIDIA DCGM
NVIDIA Nsight Systems / Nsight Compute for detailed profiling
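Of the tools listed above, NVML is probably the easiest to poll from Python. The following is a rough sketch of a background power sampler. `PowerSampler` and `nvml_power_reader` are hypothetical helper names; the NVML calls used (`nvmlInit`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetPowerUsage`) come from the pynvml package, and the reader is only usable on a machine with an NVIDIA driver installed.

```python
import threading
import time


class PowerSampler:
    """Poll a power-reading callable on a background thread.

    Collects (timestamp_s, power_w) tuples in `samples` while the
    `with` block is active.
    """

    def __init__(self, read_power_w, interval_s=0.01):
        self.read_power_w = read_power_w
        self.interval_s = interval_s
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.perf_counter(), self.read_power_w()))
            self._stop.wait(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False


def nvml_power_reader(index=0):
    """Return a callable that reads board power in watts through NVML."""
    import pynvml  # imported lazily so the sampler works without NVML

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    # nvmlDeviceGetPowerUsage reports milliwatts.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0


# Hypothetical usage on a GPU machine (not executed here):
# with PowerSampler(nvml_power_reader(0)) as sampler:
#     run_many_inferences()
# print(len(sampler.samples), "power samples collected")
```

The power value reported by the driver is refreshed at a limited rate, so a single inference lasting a few milliseconds is usually too short to sample; it is safer to run the workload in a loop for several seconds and divide the total by the number of iterations.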

Energy is computed by integrating power over time:

\[
\text{Energy (J)} = \int P(t)\,dt
\]
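With discrete power samples, the integral can be approximated numerically. A minimal sketch using the trapezoidal rule, assuming the samples are `(timestamp_s, power_w)` tuples sorted by time, however they were collected (`integrate_energy_j` is a hypothetical helper name):

```python
def integrate_energy_j(samples):
    """Trapezoidal integral of (timestamp_s, power_w) samples, in joules."""
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: average power times elapsed time.
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_j


# Constant 100 W held for 2 s:
print(integrate_energy_j([(0.0, 100.0), (1.0, 100.0), (2.0, 100.0)]))  # 200.0
```

Dividing the result by the number of inferences executed during the sampling window gives energy per inference.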

If the power draw is approximately constant, you can estimate it as:

\[
\text{Energy} \approx \text{Average Power (W)} \times \text{Latency (s)}
\]
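As a purely illustrative calculation (the numbers are made up for the arithmetic, not measured on an H100), an average draw of 350 W over an 8 ms inference would give:

```latex
\[
\text{Energy} \approx 350\ \text{W} \times 0.008\ \text{s} = 2.8\ \text{J}
\]
```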

Typical Workflow on H100

A common evaluation pipeline is:

Enable FP8/BF16 execution using Transformer Engine.
Measure inference or training latency with PyTorch (torch.cuda.synchronize() before and after timing).
Collect GPU power measurements using NVML or nvidia-smi.
Compute energy using the recorded power and execution time.
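For the first step, a rough sketch of wrapping a forward pass in FP8 execution is shown below. It follows the pattern in the Transformer Engine quickstart (`fp8_autocast` with a `DelayedScaling` recipe), but exact names may differ between Transformer Engine versions, so treat it as an assumption to check against your installed release. `make_fp8_forward` is a hypothetical helper name, and `model` and `x` are assumed to already live on the GPU.

```python
def make_fp8_forward(model, x):
    """Return a zero-argument callable that runs one FP8 forward pass.

    Assumes `model` is built from Transformer Engine modules
    (for example te.Linear or te.TransformerLayer). Plain torch.nn
    layers are not converted to FP8 by the autocast context.
    """
    # Imported lazily: these require a CUDA build of PyTorch and an
    # installed Transformer Engine.
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # HYBRID uses E4M3 for the forward pass and E5M2 for gradients.
    fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

    def forward():
        with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            return model(x)

    return forward


# Hypothetical usage on an H100 (not executed here):
# forward = make_fp8_forward(model, x)
# forward()  # hand this callable to your timing and power-sampling code
```

Note that a stock BERT or ViT implementation will only benefit from FP8 if its linear and attention layers are replaced with their Transformer Engine equivalents.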

This is a common methodology for evaluating Transformer Engine and H100 performance.

