Multi turn rollout#193
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive framework for multi-turn agentic rollouts with trajectory compression, including a new MultiTurnCondenseRollout class, chunking utilities, and a keyword-based condenser. It also adds support for per-token oracle bonuses, entropy bonuses, and improved logging diagnostics for GRPO training. Additionally, it patches the Qwen3 chat template to resolve parsing issues with orphan </think> tags and includes a dataset builder for condensed SFT training.
…ti-turn-rollout # Conflicts: # src/twinkle/template/__init__.py # src/twinkle/template/base.py # src/twinkle/utils/transformers_utils.py
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request aligns the message and tool-call data formats with the OpenAI chat-completion schema and introduces the twinkle_agentic package to support multi-turn agentic rollouts with tool-use, chunking, and passage condensation. Infrastructure enhancements include a new exception notification system, support for entropy bonuses in GRPO loss, and robustness patches for Qwen3 chat templates. The review identifies a critical bug where zero temperature results in zeroed log-probability metrics and notes potential side effects from module-level multiprocessing configuration. Further feedback suggests refining token limit heuristics and improving template monotonicity assertions with better logging.
| scale = self.temperature | ||
| logps_f = logps.float() | ||
| if scale != 1.0: | ||
| logps_f = logps_f * scale |
There was a problem hiding this comment.
If self.temperature is set to 0.0 (which is common for greedy sampling), scale becomes 0.0. Multiplying logps_f by 0.0 will zero out all log-probabilities, causing metrics like policy_confidence to always be 1.0 and approx_kl to be 0.0, regardless of the actual model output. You should add a guard to ensure scale is only applied if it is greater than zero.
| scale = self.temperature | |
| logps_f = logps.float() | |
| if scale != 1.0: | |
| logps_f = logps_f * scale | |
| scale = self.temperature | |
| logps_f = logps.float() | |
| if scale > 0.0 and scale != 1.0: | |
| logps_f = logps_f * scale |
| from twinkle.data_format.sampling import SamplingParams | ||
|
|
||
| # CJK worst case ~2 tokens/char; budget is a soft char ceiling, not output truth. | ||
| max_new = max(256, budget * 2 + 128) |
There was a problem hiding this comment.
The heuristic for max_tokens calculation (budget * 2 + 128) might be too tight for certain languages or complex summary formats, especially when using Chain-of-Thought or detailed markdown. Consider making this multiplier or the buffer constant configurable, or use a more conservative estimate to avoid premature truncation of the summary.
| if not s_after.startswith(s_before): | ||
| raise RuntimeError('Canonical chat_template output for messages_after is not a ' | ||
| 'prefix-extension of messages_before; cannot compute bridge ' | ||
| 'delta. This indicates the template is non-monotonic in the ' | ||
| 'message list (e.g. reorders / rewrites earlier turns).\n' | ||
| f's_before tail: {s_before[-80:]!r}\n' | ||
| f's_after at same offset: ' | ||
| f'{s_after[max(0, len(s_before) - 80):len(s_before) + 80]!r}') |
There was a problem hiding this comment.
The bridge text computation relies on s_after.startswith(s_before). This assumes that the chat template is monotonic (i.e., adding a message only appends text and doesn't rewrite previous turns). While most modern templates (ChatML, Llama-3) are monotonic, some older or custom templates might inject system instructions or format headers differently based on the total number of messages. It would be safer to log the actual strings if this assertion fails to help users debug template issues.
…ti-turn-rollout # Conflicts: # src/twinkle/template/base.py
PR type
PR information
Features
Bug Fix
Experiment results
Paste your experiment result here(if needed).