[GOLD] VLM support for GOLDTrainer#5969
Conversation
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 5fd183a.
5fd183a to a3a6a3f
|
New training runs from examples/scripts/gold_vlm.py:
2. Qwen 8B -> Qwen 2B, ULD loss, vLLM:
3. Qwen 8B -> LFM 1.6B, ULD loss, no-vllm:
Results are consistent with #5461 (comment) and kashif#6 (comment) |
|
nice! the scripts for experimental belong in the experimental trainer's folder for now |
|
To avoid opening a separate discussion or issue, I think I can address this idea here (but if you'd prefer otherwise, I'll create one):
Since GKD is fundamentally a special case of GOLD, I think we could later combine everything under a single GOLDTrainer (dropping the separate GKDTrainer). That would keep the whole distillation logic in one place and avoid having to change the JSD path in two separate places. This way, I wouldn't need to add separate VLM support to GKDTrainer and then update liger_kernel -- I'd only need to implement the second part. |
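To illustrate the "single JSD path" idea, here is a minimal sketch of a generalized Jensen-Shannon divergence that one shared trainer could call. It uses the textbook form with mixture M = beta * teacher + (1 - beta) * student; the function name `generalized_jsd` and the plain-list interface are assumptions made for illustration, not the TRL or liger_kernel API, and the actual implementations may handle the beta = 0 and beta = 1 limits differently.

```python
import math


def _kl(p, q):
    """KL(p || q) for two discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(student_probs, teacher_probs, beta=0.5):
    """Generalized JSD over one shared vocabulary (hypothetical helper).

    Computes beta * KL(teacher || M) + (1 - beta) * KL(student || M)
    with M = beta * teacher + (1 - beta) * student.
    Only the open interval 0 < beta < 1 is handled in this sketch.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("this sketch only handles 0 < beta < 1")
    if len(student_probs) != len(teacher_probs):
        raise ValueError("both distributions must share one vocabulary")
    mixture = [
        beta * t + (1.0 - beta) * s
        for s, t in zip(student_probs, teacher_probs)
    ]
    return beta * _kl(teacher_probs, mixture) + (1.0 - beta) * _kl(student_probs, mixture)


if __name__ == "__main__":
    student = [0.7, 0.2, 0.1]
    teacher = [0.5, 0.3, 0.2]
    print(generalized_jsd(student, teacher, beta=0.5))
```

Because the function requires both distributions to share one vocabulary, it only covers the same-family case; cross-family pairs would still go through the ULD path.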




Clear PR, based on #5461 with changes introduced in 2ac060e
Adds VLM support to GOLDTrainer:
examples/scripts/gold_vlm.py with documented same-family JSD and cross-family ULD examples.
Motivation
The GOLD algorithm has no theoretical constraints against VLM-to-VLM distillation -- the barriers were purely engineering (incompatible image token formats, different tokenizers, raw image handling through the dataloader).
Key changes
_teacher_processor is stored and used in compute_loss to build teacher-compatible vision tensors from raw images
teacher_tokenizer_name_or_path
examples/scripts/gold_vlm.py with two documented usage examples (same-family JSD + vLLM, cross-family ULD)
Note
docs/source/gold_trainer.md -- will add if that's desirable, just let me know.
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@qgallouedec @kashif
Note
Medium Risk
Touches core distillation training, multimodal batching, and teacher/student forward paths; mitigated by extensive tests but behavior is complex and memory-sensitive.
Overview
GOLDTrainer now supports vision-language model (VLM) distillation, not just text LLMs.
For vision datasets it keeps raw PIL images in the dataloader via an identity collator, then collates per gradient-accumulation slice with a new DataCollatorForVisionLanguageChatML (prompt/completion split, pixel_values, untemplated text for ULD, byte offsets). Same-family VLM pairs can use JSD with shared multimodal forwards; cross-architecture pairs require use_uld_loss and a separate _teacher_processor that builds teacher inputs (including images) via _build_teacher_vlm_inputs. On-policy training adds VLM paths for vLLM (multimodal prompts) and local generate, with lazy slice materialization and eval fixes in prediction_step. Init validates VLM↔VLM pairing, rejects vision data on text-only students, and blocks Liger on VLMs.
Adds examples/scripts/gold_vlm.py (GEOQA, JSD vs ULD examples) and a large test_gold_trainer.py VLM regression suite (collation, ULD alignment, vLLM duplication, smoke train steps).
Reviewed by Cursor Bugbot for commit 251cddc. Bugbot is set up for automated code reviews on this repo.
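The "identity collator, then collate per gradient-accumulation slice" pattern described in the Overview can be sketched in plain Python. This is only an illustration under stated assumptions: `identity_collator`, `split_into_slices`, `lazy_collated_slices`, and `toy_collate` are hypothetical names, and `toy_collate` is a stand-in for the real processor-backed collator, not the PR's DataCollatorForVisionLanguageChatML.

```python
from typing import Any, Callable, Dict, Iterator, List

Example = Dict[str, Any]


def identity_collator(examples: List[Example]) -> List[Example]:
    """Pass raw examples (e.g. with PIL images) through the dataloader untouched."""
    return examples


def split_into_slices(examples: List[Example], num_slices: int) -> List[List[Example]]:
    """Split one dataloader batch into equal gradient-accumulation slices."""
    if num_slices <= 0 or len(examples) % num_slices != 0:
        raise ValueError("batch size must be a positive multiple of num_slices")
    size = len(examples) // num_slices
    return [examples[i * size:(i + 1) * size] for i in range(num_slices)]


def lazy_collated_slices(
    examples: List[Example],
    num_slices: int,
    collate_fn: Callable[[List[Example]], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Collate one slice at a time, so only one slice's tensors exist at once."""
    for slice_examples in split_into_slices(examples, num_slices):
        yield collate_fn(slice_examples)


def toy_collate(slice_examples: List[Example]) -> Dict[str, Any]:
    """Hypothetical stand-in for a vision-language collator (no real tensors)."""
    return {
        "prompts": [ex["prompt"] for ex in slice_examples],
        "num_images": sum(len(ex["images"]) for ex in slice_examples),
    }


if __name__ == "__main__":
    raw_batch = identity_collator(
        [{"prompt": p, "images": ["<image placeholder>"]} for p in "abcd"]
    )
    for collated in lazy_collated_slices(raw_batch, num_slices=2, collate_fn=toy_collate):
        print(collated)
```

Deferring collation to the slice level is one way to keep image tensors for the full batch from being materialized at once, which matches the memory sensitivity noted in the risk summary.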