Apply lengthy-operation timeout to state-dict checkpoint save/export by jlamypoirier · Pull Request #543 · ServiceNow/Fast-LLM

jlamypoirier · 2026-06-17T01:00:26Z

Authored by Claude Opus 4.8 (via Claude Code), reviewed by @jlamypoirier.

Problem

StateDictCheckpointHandler.save() (used for fast_llm- and HuggingFace-format checkpoint export via Trainer and the offline convert) runs its per-parameter weight-gather collectives under the default DistributedConfig.timeout (60 s) instead of the lengthy-operation training.timeout (default 3600 s).

The load path does not have this problem: load() wraps its collectives in SafeLoad(timeout=config.timeout), and _save_checkpoint already passes training.timeout to the surrounding barriers and to get_save_config(...) — but nothing applied that timeout to the gather collectives inside save().

Saving a state-dict checkpoint is rank-0-serialized (rank 0 writes the model_*.safetensors files while the other ranks wait, with gathers interleaved between files). On a large model and/or slow/networked storage, a gather on the waiting ranks can exceed 60 s, at which point the NCCL collective watchdog fires and training aborts with:

RuntimeError: Desync detected for barrier export <step> exit (...)

The distributed (distributed-format) checkpoint is unaffected because every rank writes its own shard in parallel, so no single collective stalls past 60 s.

Observed on a multi-node run (16×H100, 2 nodes) exporting a ~7B model to networked storage: the step-200 distributed checkpoint completed fine, while the fast_llm export at step 400 aborted ~60 s into the save.
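The failure condition can be checked with rough arithmetic. The sketch below assumes the time a waiting rank spends blocked in a gather is dominated by the time rank 0 needs to write one file. The helper names, file size, and bandwidth are hypothetical illustrations, not Fast-LLM code or measurements from this run.

```python
def gather_wait_seconds(file_bytes: int, write_bytes_per_second: int) -> float:
    """Approximate time a waiting rank is blocked while rank 0 writes one file.

    Assumption: the gather cannot complete until rank 0 finishes the write,
    so the wait is roughly file size divided by write bandwidth.
    """
    return file_bytes / write_bytes_per_second


def exceeds(wait_seconds: float, timeout_seconds: float) -> bool:
    """True if the blocked collective would outlive the given timeout."""
    return wait_seconds > timeout_seconds


# Illustrative numbers only: a 10 GB file written at 100 MB/s.
wait = gather_wait_seconds(10_000_000_000, 100_000_000)

# Compare against the two timeouts named in the description.
hits_default_timeout = exceeds(wait, 60)     # default DistributedConfig.timeout
hits_training_timeout = exceeds(wait, 3600)  # default training.timeout
```

With these made-up numbers the wait is 100 s: past the 60 s default, but well inside the 3600 s lengthy-operation timeout.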

Fix

Wrap the gather loop in save() with set_timeout(world_group, config.timeout), mirroring the load path. No config change is needed — training.timeout already defaults to 3600 s and flows into config.timeout.
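The shape of the change can be sketched as follows. This is a minimal, self-contained illustration and not the actual diff: FakeGroup, the body of set_timeout, and this save() are hypothetical stand-ins that only mimic the call shape set_timeout(world_group, config.timeout) named above. The real Fast-LLM helper operates on a distributed process group.

```python
from contextlib import contextmanager


class FakeGroup:
    """Hypothetical stand-in for a process group; it only tracks a timeout."""

    def __init__(self, timeout: float):
        self.timeout = timeout


@contextmanager
def set_timeout(group, timeout):
    """Temporarily override the group's timeout and restore it on exit.

    Assumption: a None group or None timeout is treated as a no-op.
    """
    if group is None or timeout is None:
        yield
        return
    previous = group.timeout
    group.timeout = timeout
    try:
        yield
    finally:
        # Restore the previous timeout even if a gather raises.
        group.timeout = previous


def save(world_group, parameter_names, timeout):
    """Run a per-parameter 'gather' loop under the longer timeout."""
    observed = []
    with set_timeout(world_group, timeout):
        for name in parameter_names:
            # A real implementation would issue the gather collective here;
            # this sketch only records the timeout in effect.
            observed.append((name, world_group.timeout))
    return observed


world_group = FakeGroup(timeout=60.0)
observed = save(world_group, ["layer_0.weight", "layer_1.weight"], timeout=3600.0)
```

In this sketch every iteration of the loop sees the 3600 s timeout, and the group returns to 60 s once the loop exits, so collectives outside the save keep the short default.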

Notes / scope

Verified by re-running the same multi-node export with the fix: the export at the previously-failing step completes and training continues.
iter_tensors() / iter_checkpoint() (the streaming weight-broadcast consumer) shares the same gather pattern but has different control flow (a generator driven by an external consumer) and its own timeout handling; left out of scope here. Worth a follow-up look.

StateDictCheckpointHandler.save() ran its weight-gather collectives under the default DistributedConfig timeout (60s) rather than the lengthy-operation training timeout, unlike load() which wraps its collectives in SafeLoad(timeout=config.timeout). Saving is rank-0-serialized, so on a large model or slow storage a gather on the waiting ranks can exceed 60s and the NCCL watchdog aborts with a barrier desync. Wrap the save gather loop in set_timeout(world_group, config.timeout) to match the load path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jlamypoirier merged commit bdfbb5a into main Jun 17, 2026
5 checks passed

jlamypoirier deleted the jlp_checkpoint_export_timeout branch June 17, 2026 17:24
