fix: filter single-file safetensors by assigned layers before push #83

Closed
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/single-file-layer-filter

cjchanh commented Apr 14, 2026

Problem

When a Cake master distributes a single-file safetensors model to a worker, it pushes the entire file regardless of how many layers the worker is assigned. For Qwen2.5-7B-Instruct-4bit (4 GiB single file), an iPad worker with a 3 GiB jetsam budget receives the full 4 GiB, exceeds memory, and crashes with early eof.

The indexed model path (model.safetensors.index.json present) already filters correctly via weight_map. The single-file fallback in sharding/mod.rs, by contrast, unconditionally adds model.safetensors to the push list.

Fix

For single-file models with assigned layers, the push path now:

  1. Reads only the safetensors header to enumerate tensor names
  2. Filters tensors by assigned layer prefixes (same starts_with logic as the indexed path)
  3. Calls extract_layer_tensors to build a minimal safetensors blob containing only the needed tensors
  4. Pushes the reduced blob instead of the full file
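
Steps 1 and 2 can be sketched roughly as follows. This is not the code from this PR: `read_safetensors_header` and `filter_by_layers` are hypothetical names, the header size cap is an assumption of the sketch, and decoding the tensor names out of the returned JSON (normally done with a JSON parser) is left out. It relies only on the safetensors layout of an 8-byte little-endian header length followed by the JSON header.

```rust
use std::io::{self, Read};

/// Reads only the JSON header of a safetensors stream: an 8-byte
/// little-endian length, then that many bytes of UTF-8 JSON.
/// The tensor data that follows is never read.
fn read_safetensors_header<R: Read>(mut reader: R) -> io::Result<String> {
    let mut len_bytes = [0u8; 8];
    reader.read_exact(&mut len_bytes)?;
    let header_len = u64::from_le_bytes(len_bytes);

    // Assumption for this sketch: refuse implausibly large headers
    // before allocating a buffer for them.
    const MAX_HEADER: u64 = 100 * 1024 * 1024;
    if header_len > MAX_HEADER {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "header too large"));
    }

    let mut header = vec![0u8; header_len as usize];
    reader.read_exact(&mut header)?;
    String::from_utf8(header).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Keeps the tensor names that belong to one of the assigned layers.
/// A name matches when it starts with the layer prefix followed by '.'.
fn filter_by_layers<'a>(names: &[&'a str], layers: &[&str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| {
            layers.iter().any(|layer| {
                name.strip_prefix(*layer)
                    .map_or(false, |rest| rest.starts_with('.'))
            })
        })
        .collect()
}
```

The sketch requires a dot after the prefix so that a prefix such as `model.layers.1` does not also select `model.layers.10`. Whether the real starts_with check needs that guard depends on how the layer prefixes are stored, which the description does not state.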

Backward compatible: if layers is empty (no specific assignment), the full file is still pushed. If no tensors match the assigned layers, the push falls back to the full file with a warning.
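
The fallback rules above amount to a small decision function. A minimal sketch, assuming hypothetical names (`PushPlan`, `plan_push`) that do not come from this PR, and using `eprintln!` as a stand-in for whatever logging the project uses:

```rust
/// What the master sends to a worker for a single-file model.
#[derive(Debug, PartialEq)]
enum PushPlan {
    /// Send the whole file unchanged.
    FullFile,
    /// Send a reduced blob holding only these tensors.
    Reduced(Vec<String>),
}

/// Decides between a full push and a reduced push.
fn plan_push(tensor_names: &[String], layers: &[String]) -> PushPlan {
    // No specific assignment: keep the old behaviour.
    if layers.is_empty() {
        return PushPlan::FullFile;
    }

    // Plain prefix match, as the description states for the indexed path.
    let selected: Vec<String> = tensor_names
        .iter()
        .filter(|name| layers.iter().any(|layer| name.starts_with(layer.as_str())))
        .cloned()
        .collect();

    // Nothing matched: warn and fall back rather than push an empty blob.
    if selected.is_empty() {
        eprintln!("warning: no tensors match assigned layers; pushing full file");
        return PushPlan::FullFile;
    }

    PushPlan::Reduced(selected)
}
```

Falling back to the full file in the no-match case keeps a misconfigured assignment working, at the cost of the original memory problem, so the warning is what tells the operator something is off.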

Results

Tested with M5 Max master + iPad Air M3 worker, Qwen2.5-7B-Instruct-4bit:

| Metric | Before | After |
|---|---|---|
| Push size | 4 GiB (full model) | 250.1 MiB (52 tensors, 2 layers) |
| iPad RSS | jetsam kill | 1.4 GiB (under 3 GiB limit) |
| Result | crash (early eof) | coherent output at 17.21 tok/s |

Test plan

  • cargo test -p cake-core --lib — 641 tests pass (638 existing + 3 new)
  • cargo test -p cake-core --test unit — 235 tests pass
  • cargo clippy — zero new warnings
  • Integration: M5 master + iPad Air M3, 2 layers of 7B-4bit, verified 250.1 MiB push, 1.4 GiB RSS, correct inference
  • Extended inference: longer generation to verify sustained correctness across distributed layers

New unit tests

  • extract_layer_tensors_single_file_filters_correctly — 4 tensors in, request 2, verify only 2 in output with correct data bytes
  • extract_layer_tensors_single_file_all_layers — request all tensors, verify all present with correct total size
  • extract_layer_tensors_single_file_missing_tensor_errors — request nonexistent tensor, verify error
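
For readers unfamiliar with the format, the behaviour these tests check (a subset in, a smaller valid blob out, an error for an unknown name) can be illustrated with a standalone sketch. It is not the `extract_layer_tensors` implementation from this PR: `TensorEntry` and `build_reduced_blob` are hypothetical, the header entries are assumed to be parsed already, and tensor names are assumed to need no JSON escaping.

```rust
/// One already-parsed header entry (hypothetical struct for this sketch).
struct TensorEntry {
    name: String,
    dtype: String,
    shape: Vec<usize>,
    /// Byte range inside the data section, as in `data_offsets`.
    begin: usize,
    end: usize,
}

/// Builds a new safetensors blob containing only the `wanted` tensors,
/// with `data_offsets` recomputed for the smaller data section.
fn build_reduced_blob(
    entries: &[TensorEntry],
    data: &[u8],
    wanted: &[&str],
) -> Result<Vec<u8>, String> {
    let mut header = String::from("{");
    let mut payload: Vec<u8> = Vec::new();

    for (i, name) in wanted.iter().enumerate() {
        // A requested tensor that is not in the source is an error.
        let entry = entries
            .iter()
            .find(|e| e.name == *name)
            .ok_or_else(|| format!("tensor not found: {}", name))?;
        let bytes = data
            .get(entry.begin..entry.end)
            .ok_or_else(|| format!("bad offsets for {}", name))?;

        let start = payload.len();
        payload.extend_from_slice(bytes);

        if i > 0 {
            header.push(',');
        }
        // `{:?}` on a Vec<usize> prints a valid JSON array, e.g. [2, 3].
        header.push_str(&format!(
            "\"{}\":{{\"dtype\":\"{}\",\"shape\":{:?},\"data_offsets\":[{},{}]}}",
            entry.name,
            entry.dtype,
            entry.shape,
            start,
            payload.len()
        ));
    }
    header.push('}');

    // Layout: 8-byte little-endian header length, JSON header, tensor data.
    let mut blob = (header.len() as u64).to_le_bytes().to_vec();
    blob.extend_from_slice(header.as_bytes());
    blob.extend_from_slice(&payload);
    Ok(blob)
}
```

Because the offsets are rewritten relative to the new data section, the output should be readable by any safetensors loader without reference to the original file.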

When a worker is assigned a subset of layers from a single-file
safetensors model, extract only the needed tensors instead of pushing
the entire file. For Qwen2.5-7B-4bit (4 GiB), a 2-layer iPad worker
now receives 250 MiB instead of 4 GiB — staying well under the 3 GiB
iOS jetsam limit.

The indexed model path already filtered correctly via weight_map.
This extends the same extraction to the single-file fallback by:
- Reading the safetensors header to enumerate tensor names
- Filtering by assigned layer prefixes
- Calling extract_layer_tensors to build a minimal blob
- Falling back to full push when layers is empty (backward compat)

Verified: M5 master + iPad Air M3 worker, 2 layers, 250.1 MiB push,
1.4 GiB RSS, coherent output at 17.21 tok/s.

cjchanh commented Apr 30, 2026

This fix is still relevant from my side. I attempted a conflict-only rebase against current main but found that recent upstream changes (PR #84's iOS TCP retry refactor and adjacent commits) introduce API drift beyond a simple merge — Strategy::assign_layers trait signature changed (7→8 params), Message::DeviceInfoRequest variant was removed, and the BUILD_HASH constant location shifted, producing 16 compile errors when ee01115 is rebased onto current main. Rather than ship a broken-build force-push, I'm leaving this PR in CONFLICTING state. Happy to either redo this as a fresh PR against current main (cherry-picking only the minimal safetensors-filter logic) or close this in favor of that — let me know which you'd prefer.

cjchanh closed this by deleting the head repository on May 24, 2026