Add AMD/ROCm CI #5918

Open
kashif wants to merge 1 commit into huggingface:main from kashif:amd-ci

@kashif kashif commented Jun 2, 2026

Collaborator

Start of AMD GPU CI for TRL. Pairs with #5917.

scripts/run_amd_ci.sh — a small local runner. Prints an environment report, runs the suite split per test file with pytest --report-log, and prints a pass/fail/error/skip summary. Defaults to the full suite to surface anything broken on AMD; serial by default so a single GPU isn't OOM'd by parallel workers. Modeled on slow-tests.yml.
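To make the summary step concrete, here is a minimal sketch of tallying outcomes from a pytest-reportlog JSON-lines file. It is not the code in scripts/run_amd_ci.sh: the helper name `summarize_report_log` and the classification rule (a failure outside the call phase counts as an error) are assumptions made for illustration.

```python
import json
from collections import Counter


def summarize_report_log(lines):
    """Tally test outcomes from pytest-reportlog JSON lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Only per-test phase reports carry the outcomes we want to count.
        if entry.get("$report_type") != "TestReport":
            continue
        when, outcome = entry.get("when"), entry.get("outcome")
        if outcome == "skipped":
            counts["skipped"] += 1
        elif outcome == "failed":
            # Assumption: a failure in setup or teardown is reported as an error.
            counts["failed" if when == "call" else "error"] += 1
        elif outcome == "passed" and when == "call":
            # Passing setup/teardown phases are not counted as passed tests.
            counts["passed"] += 1
    return dict(counts)
```

Counting is done per phase report, so a test that fails in its call phase and again in teardown would contribute to both `failed` and `error` in this sketch.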

.github/workflows/tests-amd.yml — scheduled + manual workflow on a self-hosted AMD runner, using the official rocm/pytorch image. It installs into the image's ROCm-torch Python (a fresh uv venv would reinstall a CUDA torch and quietly break ROCm) and asserts torch.version.hip as a guard. The runner group is a placeholder that needs wiring to an actual AMD runner.
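The guard described above could look roughly like the following. This is a sketch rather than the workflow's actual step: the helper name `require_rocm_torch` is hypothetical, and it takes the version value as an argument so it can be exercised without PyTorch installed.

```python
def require_rocm_torch(hip_version):
    """Return the HIP version string, or raise if torch is not a ROCm build."""
    # torch.version.hip is None on CUDA and CPU builds of PyTorch and a
    # version string on ROCm builds, so a falsy value means the wrong torch.
    if not hip_version:
        raise RuntimeError(
            "torch.version.hip is unset; a non-ROCm torch may have replaced "
            "the build shipped in the image"
        )
    return hip_version


# Intended use inside the container (requires torch, so left as a comment):
#   import torch
#   print("HIP:", require_rocm_torch(torch.version.hip))
```

Failing fast here keeps a silently swapped CUDA wheel from showing up later as a long list of unrelated test failures.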

scripts/check_amd_flash_attn.py — Flash Attention smoke check. kernels-community/flash-attn2 has no ROCm build, but kernels-community/aiter-flash-attn does, so the workflow runs a fwd+bwd through it (works on gfx942/gfx950; the kernel doesn't yet ship a gfx1150 config).
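Since the kernel is reported to work on only some architectures, a smoke check may want to decide up front whether to run. The sketch below shows one way to do that; it is not taken from scripts/check_amd_flash_attn.py. The helper names are hypothetical, and it assumes the architecture string comes from something like `torch.cuda.get_device_properties(0).gcnArchName` on a ROCm build, which can carry feature suffixes such as `:sramecc+:xnack-`.

```python
# Architectures the PR description reports as working with the kernel.
SUPPORTED_ARCHS = {"gfx942", "gfx950"}


def base_arch(gcn_arch_name):
    """Strip feature suffixes, e.g. 'gfx942:sramecc+:xnack-' becomes 'gfx942'."""
    return gcn_arch_name.split(":", 1)[0].strip()


def should_run_flash_attn_check(gcn_arch_name):
    """Return True when the device architecture is in the supported set."""
    return base_arch(gcn_arch_name) in SUPPORTED_ARCHS
```

With this in place, a gfx1150 device such as the Radeon 890M would skip the check instead of failing it.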

Validated locally on a Radeon 890M (gfx1150, ROCm 7.2): TRL trains end-to-end and the runner/report work.


Note

Low Risk
CI-only workflow and dependency install steps; no changes to library runtime, auth, or data handling.

Overview
Adds a scheduled and manual GitHub Actions workflow (.github/workflows/tests-amd.yml) that runs the same slow test suite as CUDA CI, but on self-hosted AMD ROCm hardware instead of on-push AWS GPUs.

The job uses the official rocm/pytorch container with /dev/kfd and /dev/dri device access and ROCR_VISIBLE_DEVICES, sets TEST_TYPE to single_gpu_amd, and installs dependencies with system pip (not a fresh uv venv) so the image’s ROCm PyTorch is not replaced by a CUDA build from PyPI. It also installs a pinned ROCm flash-attention wheel, uninstalls the kernels hub package (no ROCm builds), asserts torch.version.hip before tests, then runs make slow_tests and scripts/log_reports.py into the step summary.

The runner group amd-mi325 is wired in with a TODO to point at the final provisioned AMD runner.

Reviewed by Cursor Bugbot for commit e54bcd8.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dfddd6afb

Comment thread scripts/check_amd_flash_attn.py Outdated

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces an initial AMD/ROCm CI path for TRL, centered around a scheduled/manual GitHub Actions workflow on self-hosted ROCm hardware plus local helper scripts for running and summarizing the test suite with pytest --report-log.

Changes:

  • Add a scheduled/manual AMD ROCm workflow (tests-amd.yml) using the rocm/pytorch container, validating HIP PyTorch, running a Flash Attention smoke test, and executing the full test suite via a new runner script.
  • Add scripts/run_amd_ci.sh to run tests split per file and generate a summarized report from report-log JSON lines.
  • Add a ROCm Flash Attention smoke-check script and update .gitignore to ignore generated AMD CI reports.
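The per-file split mentioned in the list above can be sketched as follows. The real runner is a shell script, so this Python version only illustrates the idea: the helper name `build_per_file_commands` and the report naming scheme are assumptions, while `--report-log` is the option provided by the pytest-reportlog plugin.

```python
from pathlib import Path


def build_per_file_commands(test_root, report_dir):
    """Build one pytest command per test file, each with its own report log."""
    commands = []
    for test_file in sorted(Path(test_root).rglob("test_*.py")):
        # Name each report after the file's path relative to the test root,
        # so files with the same name in different folders do not collide.
        rel = test_file.relative_to(test_root)
        stem = "__".join(rel.with_suffix("").parts)
        report = Path(report_dir) / f"{stem}.jsonl"
        commands.append(
            ["python", "-m", "pytest", str(test_file), f"--report-log={report}"]
        )
    return commands
```

Running the returned commands one after another matches the serial default described in the PR, and a crash in one file leaves the reports of the others intact.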

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/run_amd_ci.sh | New local/host AMD CI runner that executes pytest per-file and summarizes --report-log results. |
| scripts/check_amd_flash_attn.py | New ROCm Flash Attention smoke test using the kernels-community/aiter-flash-attn Hub kernel. |
| .github/workflows/tests-amd.yml | New scheduled/manual workflow to run AMD/ROCm CI on a self-hosted runner group inside rocm/pytorch. |
| .gitignore | Ignore amd_ci_reports/ and fix the /.claude/skills ignore entry formatting. |

Comment thread scripts/run_amd_ci.sh Outdated
Comment thread scripts/run_amd_ci.sh Outdated
@kashif kashif force-pushed the amd-ci branch 4 times, most recently from cc8563c to f99cb8a on June 2, 2026 16:11

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f99cb8a.

Comment thread scripts/run_amd_ci.sh Outdated

@Abdennacer-Badaoui Abdennacer-Badaoui left a comment

Member

Thanks @kashif for this.
This fits into our ongoing partnership with AMD, as TRL is one of the libraries the AMD team asked us to make sure we support on AMD GPUs.
The one open item before this can actually run in CI: the runs-on group (amd-mi325) is a placeholder and needs wiring to a real provisioned AMD runner. I'll reach out once we're able to allocate some runners for TRL to enable the CI.

LGTM!


4 participants