Feat: add checkpoint loading mechanism #146
Merged
10 commits
- c6cf473 feat: checkpoint save & load (ArcaLunar)
- 869ac57 format: format files in examples and infini_train (ArcaLunar)
- 51cdbc1 feat: extract resuming to utils (ArcaLunar)
- fceca79 feat: extract similar logic in ckpt_save (ArcaLunar)
- 02f8367 feat(checkpoint): reorganize checkpoint code and improve robustness (JYMiracle305)
- 775916c feat(checkpoint): improve state dict serialization and resume safety (JYMiracle305)
- 8c9a8df feat: add checkpoint mechanism with robust resume validation (JYMiracle305)
- 7c42a33 feat: add model config validation on resume (JYMiracle305)
- a392533 feat: ckpt and bin checkpoint formats are kept as an interim solution… (JYMiracle305)
- 2f4c25e refactor(checkpoint): centralize config, simplify prune, fix optimize… (JYMiracle305)
| Diff line change |
|---|
| `@@ -0,0 +1,52 @@` |
| `#pragma once` |
| |
| `#include <cstdint>` |
| `#include <filesystem>` |
| `#include <functional>` |
| `#include <memory>` |
| `#include <string>` |
| `#include <unordered_map>` |
| |
| `namespace infini_train {` |
| `class Optimizer;` |
| `class Tensor;` |
| `namespace nn {` |
| `class Module;` |
| `}` |
| |
| `struct TrainerState {` |
| `    int64_t global_step = 0;` |
| `    int64_t consumed_batches = 0;` |
| `    // FIXME(jym): learning_rate should be restored from scheduler state, move last_lr from TrainerState to` |
| `    // SchedulerState later` |
| `    double last_lr = 0.0;` |
| |
| `    int64_t n_layer = 0;` |
| `    int64_t n_head = 0;` |
| `    int64_t n_kv_head = 0;` |
| `    int64_t n_embd = 0;` |
| `    int64_t vocab_size = 0;` |
| `    int ddp_size = 1;` |
| `    int tp_size = 1;` |
| `    int sp_size = 1;` |
| `    int pp_size = 1;` |
| `};` |
| |
| `class Checkpoint {` |
| `public:` |
| `    static void Save(const std::filesystem::path &checkpoint_dir, const nn::Module &model, const Optimizer *optimizer,` |
| `                     const TrainerState &state, bool save_optimizer_state);` |
| |
| `    static void Load(const std::filesystem::path &checkpoint_dir, nn::Module &model, Optimizer *optimizer,` |
| `                     TrainerState &state, bool load_optimizer_state);` |
| |
| `private:` |
| `    static void SaveStateDict(const std::filesystem::path &path,` |
| `                              const std::unordered_map<std::string, std::shared_ptr<Tensor>> &state_dict);` |
| |
| `    static std::unordered_map<std::string, std::shared_ptr<Tensor>> LoadStateDict(const std::filesystem::path &path);` |
| |
| `    static void SaveTrainerState(const std::filesystem::path &path, const TrainerState &state);` |
| `    static TrainerState LoadTrainerState(const std::filesystem::path &path);` |
| `};` |
| |
| `} // namespace infini_train` |
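
The header above only declares `SaveTrainerState`, `LoadTrainerState`, and the `TrainerState` fields; the diff shown here does not include their definitions or the on-disk format. The following is a minimal sketch of how such a round trip and a resume-time config check could look. Everything in it is an assumption for illustration: the `sketch` namespace, the function names `SaveTrainerStateText`, `LoadTrainerStateText`, and `FirstMismatch`, the one-pair-per-line text format, and the choice of which fields to compare are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

namespace sketch {

// Local copy of the fields declared in the header, so this example compiles on its own.
struct TrainerState {
    int64_t global_step = 0;
    int64_t consumed_batches = 0;
    double last_lr = 0.0;
    int64_t n_layer = 0, n_head = 0, n_kv_head = 0, n_embd = 0, vocab_size = 0;
    int ddp_size = 1, tp_size = 1, sp_size = 1, pp_size = 1;
};

// Hypothetical format: one whitespace-separated "key value" pair per line.
void SaveTrainerStateText(const std::filesystem::path &path, const TrainerState &s) {
    std::ofstream out(path);
    if (!out) {
        throw std::runtime_error("cannot open " + path.string() + " for writing");
    }
    out.precision(17); // enough significant digits to round-trip a double exactly
    out << "global_step " << s.global_step << "\n"
        << "consumed_batches " << s.consumed_batches << "\n"
        << "last_lr " << s.last_lr << "\n"
        << "n_layer " << s.n_layer << "\n"
        << "n_head " << s.n_head << "\n"
        << "n_kv_head " << s.n_kv_head << "\n"
        << "n_embd " << s.n_embd << "\n"
        << "vocab_size " << s.vocab_size << "\n"
        << "ddp_size " << s.ddp_size << "\n"
        << "tp_size " << s.tp_size << "\n"
        << "sp_size " << s.sp_size << "\n"
        << "pp_size " << s.pp_size << "\n";
}

TrainerState LoadTrainerStateText(const std::filesystem::path &path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path.string() + " for reading");
    }
    std::unordered_map<std::string, std::string> kv;
    std::string key, value;
    while (in >> key >> value) { kv[key] = value; }

    // kv.at() throws std::out_of_range if a key is missing, so a truncated file fails loudly.
    auto i64 = [&kv](const char *k) -> int64_t { return std::stoll(kv.at(k)); };
    TrainerState s;
    s.global_step = i64("global_step");
    s.consumed_batches = i64("consumed_batches");
    s.last_lr = std::stod(kv.at("last_lr"));
    s.n_layer = i64("n_layer");
    s.n_head = i64("n_head");
    s.n_kv_head = i64("n_kv_head");
    s.n_embd = i64("n_embd");
    s.vocab_size = i64("vocab_size");
    s.ddp_size = static_cast<int>(i64("ddp_size"));
    s.tp_size = static_cast<int>(i64("tp_size"));
    s.sp_size = static_cast<int>(i64("sp_size"));
    s.pp_size = static_cast<int>(i64("pp_size"));
    return s;
}

// Returns an empty string if the model and parallelism settings match,
// otherwise a description of the first field that differs.
std::string FirstMismatch(const TrainerState &ckpt, const TrainerState &run) {
    struct Field { const char *name; int64_t in_ckpt; int64_t in_run; };
    const Field fields[] = {
        {"n_layer", ckpt.n_layer, run.n_layer},       {"n_head", ckpt.n_head, run.n_head},
        {"n_kv_head", ckpt.n_kv_head, run.n_kv_head}, {"n_embd", ckpt.n_embd, run.n_embd},
        {"vocab_size", ckpt.vocab_size, run.vocab_size},
        {"ddp_size", ckpt.ddp_size, run.ddp_size},    {"tp_size", ckpt.tp_size, run.tp_size},
        {"sp_size", ckpt.sp_size, run.sp_size},       {"pp_size", ckpt.pp_size, run.pp_size},
    };
    for (const Field &f : fields) {
        if (f.in_ckpt != f.in_run) {
            return std::string(f.name) + ": checkpoint has " + std::to_string(f.in_ckpt)
                 + ", run has " + std::to_string(f.in_run);
        }
    }
    return "";
}

} // namespace sketch
```

A caller resuming training could load the saved state, build a `TrainerState` from the current run's configuration, and refuse to continue when `FirstMismatch` returns a non-empty string. Whether the actual implementation treats every one of these fields as a hard error is not visible in this diff.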