
feat(vllm-omni): bump to 0.20.0rc1 and align with upstream vllm v0.20.0 #6028

Yadan-Wei wants to merge 20 commits into main from omni-0.20.0

Conversation

@Yadan-Wei
Contributor

Purpose

Upgrades the vLLM-Omni DLC image from 0.18.0 to 0.20.0rc1 and aligns build-time pins with upstream vllm/docker/Dockerfile at v0.20.0.

vllm-omni v0.20.0 is still a release candidate; will bump to the final once it is published on PyPI.

Changes

Version bumps

| Component            | Before                        | After                                       |
| -------------------- | ----------------------------- | ------------------------------------------- |
| CUDA                 | 12.9.1                        | 13.0.2 (matches upstream vllm v0.20.0)      |
| vLLM                 | 0.18.0                        | 0.20.0                                      |
| vLLM-Omni            | 0.18.0                        | 0.20.0rc1 (install with --prerelease=allow) |
| FlashInfer           | 0.6.6                         | 0.6.8.post1                                 |
| nvcc_threads         | 2                             | 8                                           |
| runai-model-streamer | >=0.15.3                      | >=0.15.7                                    |
| torch_cuda_arch_list | 7.0 7.5 8.0 8.9 9.0 10.0 12.0 | 7.5 8.0 8.6 8.9 9.0 10.0 11.0 12.0+PTX      |

Other Dockerfile changes

  • Add numactl / numactl-libs / numactl-devel in the runtime stage (needed by fastsafetensors in CUDA 13 upstream).
  • Add VLLM_ENABLE_CUDA_COMPATIBILITY=0 env (settable to 1 at runtime for hosts with NVIDIA drivers older than what CUDA 13 requires); see the Dockerfile sketch after this list.
  • Drop sox system dep — vllm-omni v0.20.0rc1 removed sox from its deps (vllm-omni#2745).
  • Remove in-tree DeepGEMM build — upstream also removed it from the main Dockerfile.
  • Image-config files: cuda_version cu129 → cu130, prod_image 0.18 → 0.20.
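For reference, a minimal sketch of how the numactl packages and the new env default could appear in the runtime stage (the exact dnf flags and stage layout are illustrative, not copied from the Dockerfile):

  # runtime stage: NUMA libraries required by fastsafetensors on CUDA 13
  RUN dnf install -y numactl numactl-libs numactl-devel && dnf clean all
  # default off; override with -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 at docker run time
  ENV VLLM_ENABLE_CUDA_COMPATIBILITY=0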

Test workflow

  • --stage-init-timeout bumped 600 → 900 in reusable-vllm-omni-model-tests.yml to match the vllm-omni benchmark default after the Stage CLI refactor (vllm-omni#2020).

Entrypoint unchanged

scripts/vllm/omni_dockerd_entrypoint.sh and scripts/vllm/omni_sagemaker_entrypoint.sh still run vllm serve --omni. Upstream vllm v0.20.0 + vllm-omni v0.20.0rc1 support this invocation natively via vllm's --omni delegation (see vllm-omni#3082 and vllm#40744) — no script changes needed.
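In essence, the invocation is (a simplified sketch; any additional setup the real scripts do, such as DLC telemetry or env handling, is assumed and not shown here):

  #!/bin/bash
  # omni_dockerd_entrypoint.sh, simplified: vllm v0.20.0 recognizes --omni
  # and delegates to vllm-omni, so no script changes are needed for 0.20.0
  exec vllm serve --omni "$@"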

Test Plan

  • pre-commit run passes on all changed files (YAML validation, workflow lint, Docker format, security scans, typos).
  • Local image build on a CUDA 13.0.2 devbox — pending.
  • vllm serve --omni wiring verified inside the built image (vllm serve --help | grep -q -- --omni, python -c 'import torch; assert torch.version.cuda.startswith("13")').
  • EC2 and SageMaker smoke tests against the 5 model matrix (qwen3-tts-1.7b-customvoice, qwen3-tts-12hz-1.7b-base, flux2-klein-4b, wan2.1-t2v-1.3b, qwen2.5-omni-3b).

Notes / Follow-ups

  • Host driver ≥ 550 required at runtime for CUDA 13. For hosts with older drivers, start the container with -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 (example after this list).
  • sccache cache miss on first build — the CUDA 13 / vLLM 0.20 combination has no prior sccache entries, so the first build will do a full source compile.
  • Benchmark thresholds untouched in this PR. Thresholds live in .github/config/model-tests/vllm-omni-model-tests.yml and are on a separate branch; they will be re-baselined after the first green CI run against the 0.20.0 image.
  • Docs intentionally untouched (docs/releasenotes/vllm-omni/*, docs/src/data/vllm-omni/*, docs/vllm-omni/index.md) — will be added in a follow-up PR that keeps the 0.18.0 entries alongside the new 0.20.0 ones.
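A hedged example of the runtime override for older drivers (the image URI is a placeholder; the container's entrypoint already invokes vllm serve --omni):

  docker run --rm --gpus all \
    -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
    <account>.dkr.ecr.<region>.amazonaws.com/vllm-omni:0.20.0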

- CUDA 12.9.1 → 13.0.2 (matches upstream vllm v0.20.0 default)
- vllm 0.18.0 → 0.20.0
- vllm-omni 0.18.0 → 0.20.0rc1 (release candidate; install with
  --prerelease=allow)
- flashinfer 0.6.6 → 0.6.8.post1
- torch_cuda_arch_list aligned with upstream:
  '7.5 8.0 8.6 8.9 9.0 10.0 11.0 12.0+PTX'
- nvcc_threads 2 → 8
- runai-model-streamer >=0.15.3 → >=0.15.7
- Add numactl / numactl-libs / numactl-devel in runtime stage
- Add VLLM_ENABLE_CUDA_COMPATIBILITY=0 env (settable to 1 at runtime
  for hosts with older NVIDIA drivers)
- Drop sox system dep (vllm-omni v0.20.0rc1 removed sox from its deps)
- Bump --stage-init-timeout in model-tests workflow 600 → 900 to
  match vllm-omni benchmark default after the stage CLI refactor
- Image config files: cuda_version cu129 → cu130, prod_image 0.18 → 0.20

Entrypoint scripts (scripts/vllm/omni_*) unchanged — 'vllm serve --omni'
still works in 0.20.0 via upstream vllm's --omni delegation.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei and others added 19 commits May 5, 2026 10:30
Upstream vllm v0.20.0 moved requirements/build.txt → requirements/build/cuda.txt
(the root build.txt no longer exists, causing 'uv pip install -r
requirements/build.txt' to fail with exit code 2).

Mirror upstream by installing cuda.txt and build/cuda.txt in two separate
RUN steps. The base vllm Dockerfile does the same split across its 'base'
and 'csrc-build' stages.

Fixes the build failure:

  ERROR: failed to solve: process "/bin/sh -c uv pip install -r
  requirements/cuda.txt -r requirements/build.txt ..." did not complete
  successfully: exit code: 2
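A minimal sketch of the split install described above (the cache mount and uv
flags are assumptions for illustration, not the exact Dockerfile lines):

  # runtime CUDA requirements
  RUN --mount=type=cache,target=/root/.cache/uv \
      uv pip install --system -r requirements/cuda.txt
  # build-time pins, which moved to requirements/build/cuda.txt in vllm v0.20.0
  RUN --mount=type=cache,target=/root/.cache/uv \
      uv pip install --system -r requirements/build/cuda.txt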

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…sibling DLC

Port hardening changes from the sibling docker/vllm/Dockerfile.amzn2023
and match upstream vllm v0.20.0 patterns for FlashInfer install:

- Pre-built wheel stage: add 'rm -rf dist' before recreating the dist/
  directory so stale artifacts from a previous build layer can't leak in.
- SETUPTOOLS_SCM_PRETEND_VERSION is now overrideable via --build-arg
  (e.g. 0.20.0rc1+amzn2023.abcdef12 with git SHA), falling back to
  VLLM_VERSION+amzn2023 when not provided.
- vLLM wheel install now picks the most recent vllm-*.whl by mtime and
  echoes which wheel it installs — defends against accidental glob
  matches when multiple wheels end up in the deps/ staging dir (see the
  sketch after this list).
- FlashInfer install matches upstream vllm v0.20.0 exactly: only
  flashinfer-jit-cache is explicitly installed (flashinfer-python and
  flashinfer-cubin are already pinned in requirements/cuda.txt),
  followed by 'flashinfer show-config && flashinfer download-cubin' to
  pre-download precompiled kernels so the first inference request does
  not pay JIT compile latency.
- Serving extras: drop hf_transfer (superseded by HF_XET_HIGH_PERFORMANCE
  below) and match upstream's package set.
- Switch HF Hub acceleration from HF_HUB_ENABLE_HF_TRANSFER=1 to
  HF_XET_HIGH_PERFORMANCE=1, matching the sibling vllm Dockerfile and
  HF's direction of travel.
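One way the mtime-based wheel selection could look (a sketch only; the deps/
path and uv flags are assumptions, not the exact Dockerfile lines):

  # newest vllm wheel wins; echo it so the build log shows what was installed
  RUN WHEEL="$(ls -t deps/vllm-*.whl | head -n1)" \
      && echo "Installing vLLM wheel: ${WHEEL}" \
      && uv pip install --system "${WHEEL}"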

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Pair with the Dockerfile change that made SETUPTOOLS_SCM_PRETEND_VERSION
overrideable via --build-arg. Compute the wheel version tag here so it
gets auto-forwarded to docker buildx and encodes the pinned VLLM_REF
(tag slug or commit SHA prefix) alongside VLLM_VERSION for traceability.

Mirrors the same pattern in docker/vllm/versions.env.

Example for the current pin:
  VLLM_REF=v0.20.0
  VLLM_VERSION=0.20.0
  -> SETUPTOOLS_SCM_PRETEND_VERSION=0.20.0+amzn2023.v0_20_0
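A sketch of how versions.env could derive that value (assumes dots and dashes
in VLLM_REF are slugged to underscores, as the example above implies):

  VLLM_REF=v0.20.0
  VLLM_VERSION=0.20.0
  REF_SLUG="$(printf '%s' "${VLLM_REF}" | tr '.-' '__')"   # -> v0_20_0
  export SETUPTOOLS_SCM_PRETEND_VERSION="${VLLM_VERSION}+amzn2023.${REF_SLUG}"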

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The blanket 'dnf upgrade --security' in the final stages only picks up
fixes from AL2023's security-advisory channel. The cuda-compat-13-0 fix
ships in NVIDIA's CUDA repo, which doesn't emit AL2023 advisory metadata,
so --security misses it, leaving the HIGH CVE unpatched in the scanned image.

Mirror the pattern already used in docker/base/v2/Dockerfile and
docker/pytorch/Dockerfile.cuda: add an explicit
'dnf upgrade -y --releasever latest cuda-compat-13-0' step in both
final stages (EC2 + SageMaker).

Patches: CVE-2025-33219 (cuda-compat-13-0 580.95.05 -> 1:580.126.09-1.amzn2023)
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Exercises the new /v1/audio/generate route introduced in upstream
vllm-omni v0.20.0 via PR vllm-project/vllm-omni#1794.

Validated on the 0.20.0 DLC image (CUDA 13, vllm 0.20.0):

  - Model loads in ~29s on 1x L4 (24GB VRAM, 3GB peak post-load)
  - /v1/audio/generate returns HTTP 200 in ~7s with a 5s audio_length
  - Response is valid WAV (PCM 16-bit stereo 44.1kHz, ~860KB for 5s)
  - Payload matches the example in the upstream PR description.

Artifact pre-staged at:

  s3://dlc-cicd-models/omni-models/stable-audio-open-1.0.tar.gz (14.3 GB)

Fleet x86-g6xl-runner (L4, compute 8.9) is used instead of g6exl because
the model only needs ~3GB VRAM at runtime.
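For reference, the smoke test boils down to a request of this shape (the port,
the model id, and field names other than audio_length are placeholders; the
actual payload follows the upstream PR's example):

  curl -sf -X POST http://localhost:8080/v1/audio/generate \
       -H 'Content-Type: application/json' \
       -d '{"model": "stable-audio-open-1.0", "prompt": "ambient pad", "audio_length": 5}' \
       -o output.wav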

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…TTS models

Adds qwen3-tts-12hz-1.7b-base and cosyvoice3-0.5b to the smoke-test
matrix. Both are voice-clone TTS models that require a base64-encoded
reference audio in the request, which previously could not be expressed
in the YAML test_request field (~450KB of base64 would exceed shell
argument limits).

Changes:

- Smoke-test scripts accept '@/path/to/file' in the request argument so
  the workflow can hand in large, preprocessed payloads without hitting
  the ~128KB bash argv limit.
- Workflow grows a 'Prepare test request' step that detects ref_audio_s3
  in the YAML test_request, fetches the wav from S3 on the runner
  (same IAM path the existing download-model action uses), base64-
  encodes it, substitutes ref_audio, and docker-cps the result into
  the container for the smoke-test script to read (sketch below).
- CosyVoice3 is zero-shot voice-clone only; its reference fixture is
  mirrored from upstream tests/assets/cosyvoice3/zero_shot_prompt.wav
  to s3://dlc-cicd-models/test-fixtures/audio/cosyvoice3_ref.wav.
- qwen3-tts-12hz-1.7b-base reuses the already-staged tts_ref_vivian.wav
  that the benchmark suite uses; ref_text MUST match the audio transcript
  exactly (upstream issue vllm-project/vllm-omni#3124).

Existing entries with literal test_request strings keep working
unchanged (no @ prefix => script treats argument as literal body).
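A sketch of what the 'Prepare test request' step amounts to (paths, variable
names, and the jq-based substitution are illustrative, not the exact workflow
code):

  # fetch the reference wav and base64-encode it
  aws s3 cp "${REF_AUDIO_S3}" /tmp/ref.wav
  REF_B64="$(base64 -w0 /tmp/ref.wav)"
  # substitute the encoded audio into the request body from the YAML matrix
  jq --arg ref "${REF_B64}" '.ref_audio = $ref' request.json > /tmp/request.json
  # hand the large payload to the container; the smoke-test script reads it via @/tmp/request.json
  docker cp /tmp/request.json "${CONTAINER_ID}:/tmp/request.json"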

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Previous change (00bce6f) solved the argv-limit problem between the
workflow and the smoke-test script but re-introduced it one step later:
the script still ran 'curl -d "${REQUEST}"' with REQUEST holding the
~450KB JSON, which fails with /usr/bin/curl: Argument list too long.

Keep the body in a file throughout:

- If invoked with @file, use that path directly.
- Otherwise write the literal body to a temp file.
- curl now uses '-d @${REQUEST_FILE}' for JSON and reads the urlencoded
  pairs from the file line-by-line for multipart form-data (sketch below).

Fixes voice-clone TTS smoke tests (qwen3-tts-12hz-1.7b-base and
cosyvoice3-0.5b).
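A minimal sketch of the pattern for the JSON path (variable names and the port
are illustrative, not the exact script):

  # accept either a literal JSON body or an '@/path/to/file' reference
  case "${REQUEST}" in
    @*) REQUEST_FILE="${REQUEST#@}" ;;
    *)  REQUEST_FILE="$(mktemp)"
        printf '%s' "${REQUEST}" > "${REQUEST_FILE}" ;;
  esac
  # the body never appears on the command line, so the argv limit no longer applies
  curl -sf -X POST "http://localhost:8080${ROUTE}" \
       -H 'Content-Type: application/json' \
       -d @"${REQUEST_FILE}"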

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- versions.env / Dockerfile: vllm-omni 0.20.0rc1 → 0.20.0, drop --prerelease=allow
- model-tests.yml: add wan2.1-t2v-1.3b-sync, wan2.1-vace-1.3b, ernie-image-turbo
  on x86-g6exl-runner; pre-stage commented wan2.2-i2v-a14b for follow-up

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Unblocks pre-commit on PRs that merge into main: ruff-format reflow plus
E731 (lambda → def) in find_match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The merge of main into omni-0.20.0 left 24 unresolved conflict markers
in scripts/autocurrency/agent-fix.py, breaking pre-commit. Restore our
ruff-formatted copy from da00c60 — same content as our prior fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…te_code bypass)

diffusers>=0.38.0 fixes the CVE but transitively requires safetensors>=0.8.0-rc.0
which uv/pip skip by default (only safetensors 0.8.0rc0 published; 0.7.0 is
latest stable). Upstream vllm-omni#3349 tracks the bump and is waiting on
safetensors 0.8.0 final. DLC loads only pre-staged S3 models we control and
does not pass trust_remote_code=True from user code paths, so exploit surface
is narrow. Re-evaluate once safetensors 0.8.0 final ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Document that we depend on vllm v0.20.0 still wiring args.middleware through
to FastAPI app.add_middleware in build_app. vllm-omni v0.20.0's "delegate to
upstream entrypoint" rebase (#3082, #3232) preserves this; if a future
upstream change drops the loop, this loader silently no-ops and SageMaker
/invocations returns 404 for non-default routes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Adds test_vllm_omni_video_async_endpoint covering the recommended production
pattern for video on SageMaker: AsyncInferenceConfig + /v1/videos/sync via the
custom-attributes routing middleware (which auto-converts JSON to
multipart/form-data for FORM_DATA_ROUTES). Uses Wan-AI/Wan2.1-VACE-1.3B-diffusers
on ml.g6e.xlarge (32 GB RAM avoids the host-RAM OOM that 16 GB instances hit
during HF model load).

Skips on SageMaker capacity errors (ResourceLimitExceeded /
InsufficientInstanceCapacity / CapacityError) so the rest of the matrix still
gets signal when newer GPU families are unavailable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…xes)

origin/main #6065 refactored agent-fix.py to use the GitHub API for
structured failure extraction and tweaked the system prompt format. Pull
the upstream version verbatim so future merges from main don't conflict
with our prior local ruff fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Drops the duplicated _build_async_endpoint helper (added in 2393207) and
puts the SageMaker capacity-skip logic inside the async_endpoint fixture
itself. Both async tests (TTS and video) now share the same fixture body
and both gain the capacity-skip behavior. Net: 144 LOC deleted, 83 added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
# Conflicts:
#	scripts/autocurrency/agent-fix.py

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…per.so

CVE-2026-33811 (cgo DNS resolver double-free on long CNAME) and CVE-2026-39820
(net/mail and time.ParseDate CPU/memory exhaustion). Both are go/stdlib
issues vendored into mooncake/libetcd_wrapper.so; same unpatchable-without-
mooncake-upgrade situation as the 5 existing mooncake entries above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>