Skip to content

Client ckpt save#196

Open
vx120 wants to merge 4 commits into
modelscope:mainfrom
vx120:client_ckpt_save
Open

Client ckpt save#196
vx120 wants to merge 4 commits into
modelscope:mainfrom
vx120:client_ckpt_save

Conversation

@vx120
Copy link
Copy Markdown
Collaborator

@vx120 vx120 commented May 19, 2026

What Changed

  • Added support for client-specified checkpoint save directories
  • Propagated save_dir through training and adapter creation flows
  • Resolve relative save_dir values on the client side before sending requests, using the client process current working directory
  • Validate save_dir on the server side and return a clear error when the directory does not exist or is not a directory
  • Updated checkpoint metadata and path resolution so runs can still be discovered reliably
  • Refreshed the cookbook example and related server config for this workflow

Impact

  • Clients can now provide a custom save_dir when creating adapters or training runs
  • Existing metadata lookup remains compatible through the pointer file mechanism
  • Checkpoint save/load resolution now follows the actual persisted model location
  • Relative paths are interpreted from the client process, while the resolved absolute path must also exist on the server
  • Invalid server-side save directories now fail early with a clear HTTP 400 error

Example

save_dir can be an absolute path:

save_dir = "/model"

Or a relative path:

save_dir = "./outputs/checkpoints"

Relative paths are resolved from the client process current working directory before being sent to the server.

The resolved directory must already exist on the server. If it does not exist, the request fails with a clear error, for example:

save_dir does not exist on the server: /path/to/outputs/checkpoints

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the ability to specify a custom save_dir when adding adapters to models, allowing users to control where model weights and metadata are stored. The changes span the client generator, server-side handlers, and the checkpoint management system, which now includes a pointer mechanism to track metadata across different filesystem locations. My feedback focuses on addressing a potential directory traversal vulnerability by validating client-provided adapter names and removing redundant path resolution logic in the checkpoint manager to improve code clarity.

Comment on lines +488 to +491
try:
resolved_save_dir = _resolve_client_save_dir(body.save_dir).as_posix() if body.save_dir else None
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The adapter_name provided by the client should be validated to prevent directory traversal attacks. Since this name is used to construct filesystem paths on the server (e.g., in training_run_manager.save), it's critical to ensure it doesn't contain malicious sequences like .. or absolute paths. Using the existing validate_user_path helper is recommended.

Suggested change
try:
resolved_save_dir = _resolve_client_save_dir(body.save_dir).as_posix() if body.save_dir else None
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))
if not validate_user_path(token, body.adapter_name):
raise HTTPException(status_code=400, detail=f'Invalid adapter_name: {body.adapter_name}')
try:
resolved_save_dir = _resolve_client_save_dir(body.save_dir).as_posix() if body.save_dir else None
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vx120 please fix this

import twinkle_client.types as types
from twinkle.data_format import InputFeature, Trajectory
from twinkle.server.common.checkpoint_factory import create_checkpoint_manager, create_training_run_manager
from twinkle.server.utils.checkpoint_base import _resolve_client_save_dir
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Import validate_user_path to enable validation of client-provided adapter names, ensuring they are safe for filesystem operations.

Suggested change
from twinkle.server.utils.checkpoint_base import _resolve_client_save_dir
from twinkle.server.utils.checkpoint_base import _resolve_client_save_dir, validate_user_path

Comment thread src/twinkle/server/utils/checkpoint_base.py
Comment on lines +488 to +491
try:
resolved_save_dir = _resolve_client_save_dir(body.save_dir).as_posix() if body.save_dir else None
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vx120 please fix this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants