Add AMD/ROCm CI by kashif · Pull Request #5918 · huggingface/trl

kashif · 2026-06-02T15:54:57Z

Start of AMD GPU CI for TRL. Pairs with #5917.

scripts/run_amd_ci.sh — a small local runner. Prints an environment report, runs the suite split per test file with pytest --report-log, and prints a pass/fail/error/skip summary. Defaults to the full suite to surface anything broken on AMD; serial by default so a single GPU isn't OOM'd by parallel workers. Modeled on slow-tests.yml.
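
The pass/fail/error/skip summary described above can be sketched in Python as follows. This is a minimal illustration and not the logic in scripts/run_amd_ci.sh: the function name `summarize_report_lines` is hypothetical, and it assumes the JSON-lines layout written by the pytest-reportlog plugin, where each test phase is a `TestReport` record with `when` and `outcome` fields. Failures outside the `call` phase are counted as errors, which mirrors how pytest itself reports setup and teardown failures.

```python
import json
from collections import Counter


def summarize_report_lines(lines):
    """Count outcomes from pytest --report-log JSON lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip session start/finish and collection records.
        if record.get("$report_type") != "TestReport":
            continue
        when = record.get("when")
        outcome = record.get("outcome")
        if when == "call":
            # The call phase carries the test's own passed/failed/skipped result.
            counts[outcome] += 1
        elif outcome == "failed":
            # A failure during setup or teardown is an error, not a test failure.
            counts["error"] += 1
        elif outcome == "skipped" and when == "setup":
            # Skips raised by markers or fixtures surface in the setup phase.
            counts["skipped"] += 1
    return {key: counts[key] for key in ("passed", "failed", "error", "skipped")}
```

To summarize one report file, the function can be fed an open file handle, for example `summarize_report_lines(open(path))`, since it only needs an iterable of lines.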

.github/workflows/tests-amd.yml — scheduled + manual workflow on a self-hosted AMD runner, using the official rocm/pytorch image. It installs into the image's ROCm-torch Python (a fresh uv venv would reinstall a CUDA torch and quietly break ROCm) and asserts torch.version.hip as a guard. The runner group is a placeholder that needs wiring to an actual AMD runner.
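
A guard of the kind described above might look like the sketch below. It is not the workflow's actual step; the helper names `is_rocm_build` and `assert_rocm_torch` are hypothetical. It relies on `torch.version.hip` being a version string on ROCm builds of PyTorch and `None` on CUDA or CPU builds. The check is split from the import so it can be exercised without torch installed.

```python
from types import SimpleNamespace


def is_rocm_build(version_module):
    """Return True if the given torch.version-like object reports a HIP version."""
    return getattr(version_module, "hip", None) is not None


def assert_rocm_torch():
    """Fail fast if the installed torch is not a ROCm build (hypothetical guard)."""
    import torch  # imported lazily so the check above stays testable without torch

    if not is_rocm_build(torch.version):
        raise RuntimeError(
            "Expected a ROCm build of torch, but torch.version.hip is None; "
            "a CUDA or CPU wheel may have replaced the image's torch."
        )
    return torch.version.hip


# Stand-ins for torch.version on a ROCm build and on a CUDA build (illustrative values).
rocm_like = SimpleNamespace(hip="7.2.0", cuda=None)
cuda_like = SimpleNamespace(hip=None, cuda="12.4")
```

Calling `assert_rocm_torch()` right after dependency installation would catch the case where a later install silently swapped in a non-ROCm torch.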

scripts/check_amd_flash_attn.py — Flash Attention smoke check. kernels-community/flash-attn2 has no ROCm build, but kernels-community/aiter-flash-attn does, so the workflow runs a fwd+bwd through it (works on gfx942/gfx950; the kernel doesn't yet ship a gfx1150 config).
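
Because the kernel is reported to work on gfx942/gfx950 but not yet on gfx1150, a smoke check could decide up front whether to run or skip. The sketch below is an assumption-laden illustration, not code from scripts/check_amd_flash_attn.py: the names `SUPPORTED_ARCHS`, `base_arch`, and `kernel_supports_arch` are hypothetical, and the supported set is taken only from the PR description. It assumes the architecture string may carry feature suffixes after a colon (for example `gfx942:sramecc+:xnack-`), as the `gcnArchName` device property on ROCm builds of PyTorch is understood to report.

```python
# Architectures the PR description says the aiter flash-attention kernel works on.
SUPPORTED_ARCHS = frozenset({"gfx942", "gfx950"})


def base_arch(gcn_arch_name):
    """Strip feature suffixes such as ':sramecc+:xnack-' from an arch string."""
    return gcn_arch_name.split(":", 1)[0].strip().lower()


def kernel_supports_arch(gcn_arch_name, supported=SUPPORTED_ARCHS):
    """Return True if the base architecture is in the supported set."""
    return base_arch(gcn_arch_name) in supported
```

With a check like this, a gfx1150 machine such as the Radeon 890M used for local validation would skip the Flash Attention step rather than fail it.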

Validated locally on a Radeon 890M (gfx1150, ROCm 7.2): TRL trains end-to-end and the runner/report work.

Note

Low Risk
CI-only workflow and dependency install steps; no changes to library runtime, auth, or data handling.

Overview
Adds a scheduled and manual GitHub Actions workflow (.github/workflows/tests-amd.yml) that runs the same slow test suite as CUDA CI, but on self-hosted AMD ROCm hardware instead of on-push AWS GPUs.

The job uses the official rocm/pytorch container with /dev/kfd and /dev/dri device access and ROCR_VISIBLE_DEVICES, sets TEST_TYPE to single_gpu_amd, and installs dependencies with system pip (not a fresh uv venv) so the image’s ROCm PyTorch is not replaced by a CUDA build from PyPI. It also installs a pinned ROCm flash-attention wheel, uninstalls the kernels hub package (no ROCm builds), asserts torch.version.hip before tests, then runs make slow_tests and scripts/log_reports.py into the step summary.

The runner group amd-mi325 is wired in with a TODO to point at the final provisioned AMD runner.

Reviewed by Cursor Bugbot for commit e54bcd8. Bugbot is set up for automated code reviews on this repo.

HuggingFaceDocBuilderDev · 2026-06-02T15:58:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dfddd6afb


Copilot

Pull request overview

This PR introduces an initial AMD/ROCm CI path for TRL, centered around a scheduled/manual GitHub Actions workflow on self-hosted ROCm hardware plus local helper scripts for running and summarizing the test suite with pytest --report-log.

Changes:

Add a scheduled/manual AMD ROCm workflow (tests-amd.yml) using the rocm/pytorch container, validating HIP PyTorch, running a Flash Attention smoke test, and executing the full test suite via a new runner script.
Add scripts/run_amd_ci.sh to run tests split per file and generate a summarized report from report-log JSON lines.
Add a ROCm Flash Attention smoke-check script and update .gitignore to ignore generated AMD CI reports.
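
The per-file split mentioned in the changes above could be sketched like this. It is an illustration only, not the contents of scripts/run_amd_ci.sh: `build_pytest_commands` and the report file naming are hypothetical, while `--report-log` is the flag provided by the pytest-reportlog plugin and `amd_ci_reports/` is the directory the PR adds to .gitignore.

```python
import sys
from pathlib import Path


def build_pytest_commands(test_files, report_dir):
    """Build one pytest invocation per test file, each with its own report log."""
    report_dir = Path(report_dir)
    commands = []
    for test_file in sorted(test_files):
        # Flatten the path so each file gets a unique report name.
        log_name = Path(test_file).with_suffix("").as_posix().replace("/", "__") + ".jsonl"
        commands.append(
            [
                sys.executable,
                "-m",
                "pytest",
                str(test_file),
                f"--report-log={report_dir / log_name}",
            ]
        )
    return commands
```

Running the returned commands one at a time, for instance with `subprocess.run(cmd, check=False)` in a loop, keeps execution serial, which matches the stated goal of not exhausting a single GPU's memory with parallel workers.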

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `scripts/run_amd_ci.sh` | New local/host AMD CI runner that executes pytest per-file and summarizes `--report-log` results. |
| `scripts/check_amd_flash_attn.py` | New ROCm Flash Attention smoke test using the `kernels-community/aiter-flash-attn` Hub kernel. |
| `.github/workflows/tests-amd.yml` | New scheduled/manual workflow to run AMD/ROCm CI on a self-hosted runner group inside `rocm/pytorch`. |
| `.gitignore` | Ignore `amd_ci_reports/` and fix the `/.claude/skills` ignore entry formatting. |


cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f99cb8a.

Abdennacer-Badaoui

Thanks @kashif for this.
This fits into our ongoing partnership with AMD, as TRL is one of the libraries the AMD team asked us to make sure we support on AMD GPUs.
The one open item before this can actually run in CI: the runs-on group (amd-mi325) is a placeholder and needs wiring to a real provisioned AMD runner. I'll reach out once we're able to allocate some runners for TRL to enable the CI.

LGTM!

kashif requested a review from Copilot June 2, 2026 15:57

Copilot started reviewing on behalf of kashif June 2, 2026 15:57

kashif requested a review from albertvillanova June 2, 2026 15:57

kashif force-pushed the amd-ci branch from 1dfddd6 to 62746e7 Compare June 2, 2026 15:57

chatgpt-codex-connector Bot reviewed Jun 2, 2026

Comment thread scripts/check_amd_flash_attn.py Outdated

kashif force-pushed the amd-ci branch from 62746e7 to bff6dc1 Compare June 2, 2026 15:59

Copilot AI reviewed Jun 2, 2026

Comment thread scripts/run_amd_ci.sh Outdated

Comment thread scripts/run_amd_ci.sh Outdated

kashif force-pushed the amd-ci branch 4 times, most recently from cc8563c to f99cb8a Compare June 2, 2026 16:11

cursor Bot reviewed Jun 2, 2026

Comment thread scripts/run_amd_ci.sh Outdated

kashif force-pushed the amd-ci branch from f99cb8a to c651450 Compare June 2, 2026 16:36

Add AMD/ROCm CI

e54bcd8

kashif force-pushed the amd-ci branch from c651450 to e54bcd8 Compare June 2, 2026 16:38

Abdennacer-Badaoui approved these changes Jun 3, 2026

kashif wants to merge 1 commit into huggingface:main from kashif:amd-ci

4 participants
