Add AMD/ROCm CI #5918
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dfddd6afb
Pull request overview
This PR introduces an initial AMD/ROCm CI path for TRL, centered around a scheduled/manual GitHub Actions workflow on self-hosted ROCm hardware plus local helper scripts for running and summarizing the test suite with pytest --report-log.
Changes:
- Add a scheduled/manual AMD ROCm workflow (tests-amd.yml) using the rocm/pytorch container, validating HIP PyTorch, running a Flash Attention smoke test, and executing the full test suite via a new runner script.
- Add scripts/run_amd_ci.sh to run tests split per file and generate a summarized report from report-log JSON lines.
- Add a ROCm Flash Attention smoke-check script and update .gitignore to ignore generated AMD CI reports.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/run_amd_ci.sh | New local/host AMD CI runner that executes pytest per-file and summarizes --report-log results. |
| scripts/check_amd_flash_attn.py | New ROCm Flash Attention smoke test using the kernels-community/aiter-flash-attn Hub kernel. |
| .github/workflows/tests-amd.yml | New scheduled/manual workflow to run AMD/ROCm CI on a self-hosted runner group inside rocm/pytorch. |
| .gitignore | Ignore amd_ci_reports/ and fix the /.claude/skills ignore entry formatting. |
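For readers unfamiliar with report-log summaries, here is a minimal Python sketch of how a pass/fail/error/skip tally could be computed from report-log JSON lines. It is not the code in scripts/run_amd_ci.sh. The function name `summarize_report_log` is hypothetical, and the field names (`$report_type`, `when`, `outcome`) are assumed to follow the pytest-reportlog line format.

```python
import json
from collections import Counter


def summarize_report_log(lines):
    """Tally test outcomes from an iterable of report-log JSON lines.

    Assumption: each line is a JSON object, and test results carry
    "$report_type": "TestReport" plus "when" and "outcome" fields.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("$report_type") != "TestReport":
            continue  # ignore session and collection records
        when = entry.get("when")
        outcome = entry.get("outcome")
        if outcome == "skipped":
            counts["skipped"] += 1
        elif outcome == "failed":
            # A failure outside the call phase (setup/teardown) is an error.
            counts["failed" if when == "call" else "error"] += 1
        elif outcome == "passed" and when == "call":
            # Count a pass once per test, on the call phase only.
            counts["passed"] += 1
    return counts
```

In practice the lines would come from an open report file, for example `summarize_report_log(open(path))`, with one report file per test file.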
Force-pushed from cc8563c to f99cb8a.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f99cb8a.
Abdennacer-Badaoui left a comment
Thanks @kashif for this.
This fits into our ongoing partnership with AMD, as TRL is one of the libraries the AMD team asked us to make sure we support on AMD GPUs.
The one open item before this can actually run in CI: the runs-on group (amd-mi325) is a placeholder and needs wiring to a real provisioned AMD runner. I'll reach out once we're able to allocate some runners for TRL to enable the CI.
LGTM!

Start of AMD GPU CI for TRL. Pairs with #5917.
- scripts/run_amd_ci.sh: a small local runner. Prints an environment report, runs the suite split per test file with pytest --report-log, and prints a pass/fail/error/skip summary. Defaults to the full suite to surface anything broken on AMD; serial by default so a single GPU isn't OOM'd by parallel workers. Modeled on slow-tests.yml.
- .github/workflows/tests-amd.yml: scheduled + manual workflow on a self-hosted AMD runner, using the official rocm/pytorch image. It installs into the image's ROCm-torch Python (a fresh uv venv would reinstall a CUDA torch and quietly break ROCm) and asserts torch.version.hip as a guard. The runner group is a placeholder that needs wiring to an actual AMD runner.
- scripts/check_amd_flash_attn.py: Flash Attention smoke check. kernels-community/flash-attn2 has no ROCm build, but kernels-community/aiter-flash-attn does, so the workflow runs a fwd+bwd through it (works on gfx942/gfx950; the kernel doesn't yet ship a gfx1150 config).

Validated locally on a Radeon 890M (gfx1150, ROCm 7.2): TRL trains end-to-end and the runner/report work.
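To illustrate the per-file split, the sketch below builds one pytest command per test file, each writing its own report log. It is a Python illustration under stated assumptions, not the shell logic in scripts/run_amd_ci.sh: the helper name `per_file_commands` and the report file naming are hypothetical, and `amd_ci_reports` is used as the output directory only because the PR ignores that path in .gitignore.

```python
from pathlib import Path


def per_file_commands(test_files, report_dir="amd_ci_reports"):
    """Return one pytest command (as an argv list) per test file.

    Assumption: the pytest-reportlog plugin is installed, which provides
    the --report-log option. The report file naming is illustrative.
    """
    commands = []
    for test_file in sorted(str(f) for f in test_files):
        path = Path(test_file)
        # Flatten the path so files in different directories do not collide.
        report_name = "__".join(path.with_suffix("").parts) + ".jsonl"
        report_path = Path(report_dir) / report_name
        commands.append(
            ["python", "-m", "pytest", str(path), f"--report-log={report_path}"]
        )
    return commands
```

Running the returned commands one after another (for example with `subprocess.run(cmd, check=False)` in a loop) keeps execution serial, which matches the stated goal of not letting parallel workers exhaust a single GPU's memory.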
Note
Low Risk
CI-only workflow and dependency install steps; no changes to library runtime, auth, or data handling.
Overview
Adds a scheduled and manual GitHub Actions workflow (.github/workflows/tests-amd.yml) that runs the same slow test suite as CUDA CI, but on self-hosted AMD ROCm hardware instead of on-push AWS GPUs.

The job uses the official rocm/pytorch container with /dev/kfd and /dev/dri device access and ROCR_VISIBLE_DEVICES, sets TEST_TYPE to single_gpu_amd, and installs dependencies with system pip (not a fresh uv venv) so the image's ROCm PyTorch is not replaced by a CUDA build from PyPI. It also installs a pinned ROCm flash-attention wheel, uninstalls the kernels hub package (no ROCm builds), asserts torch.version.hip before tests, then runs make slow_tests and scripts/log_reports.py into the step summary.

The runner group amd-mi325 is wired in with a TODO to point at the final provisioned AMD runner.

Reviewed by Cursor Bugbot for commit e54bcd8.
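The torch.version.hip guard mentioned above can be sketched roughly as follows. This is an illustration, not the workflow's actual step: the helper name `require_rocm_torch` is hypothetical, and it relies on `torch.version.hip` being a version string on ROCm builds and `None` on CUDA or CPU builds. The helper takes the module as an argument so the example runs without a GPU or PyTorch installed.

```python
from types import SimpleNamespace


def require_rocm_torch(torch_module):
    """Return the HIP version string, or raise if torch is not a ROCm build."""
    hip_version = getattr(torch_module.version, "hip", None)
    if not hip_version:
        raise RuntimeError(
            "PyTorch was not built for ROCm (torch.version.hip is empty); "
            "a dependency install may have replaced the image's ROCm torch."
        )
    return hip_version


# Stand-ins for the real module, used only to demonstrate both branches.
fake_rocm_torch = SimpleNamespace(version=SimpleNamespace(hip="7.2", cuda=None))
fake_cuda_torch = SimpleNamespace(version=SimpleNamespace(hip=None, cuda="12.4"))

print(require_rocm_torch(fake_rocm_torch))
try:
    require_rocm_torch(fake_cuda_torch)
except RuntimeError as err:
    print(f"guard tripped: {err}")
```

In a real job the call would be `import torch; require_rocm_torch(torch)`, run before the test suite so a silently replaced CUDA wheel fails fast instead of producing misleading test failures.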