Situational Rewards and New Domain: Mosaic Leaks Privacy Agent #146
Open
Alex-Gurung wants to merge 46 commits into
Handle concrete failure modes from long all-call Privacy HopQA runs: split oversized preprocessed batches before SharedMemoryQueue writes, skip empty actor group publishes after zero-advantage filtering, and fail fast on rollout LLM infrastructure errors instead of converting vLLM shard crashes into bad rollouts.
# Conflicts: # pipelinerl/finetune/lora.py
This reverts commit 3a5858d.
# Conflicts: # pipelinerl/finetune_loop.py
8d0564e to 9f1ed7d
This PR adds two things: situational rewards and a new domain (`privacy_hopqa`).
The domain (`privacy_hopqa`)
The agent from MosaicLeaks: a multi-hop QA task in which each hop's answer is needed to understand the next question, and answers are spread across a mixed corpus of private/local enterprise documents (from DRBench) and public web (BrowseComp) documents.
For each hop the agent runs a four-stage loop: plan search queries, choose which retrieved documents to open, read the documents to try to answer the question, and decide whether the question has been answered or more searching/reading is needed. Once that final stage decides the question is answered, the agent moves to the next hop. Retrieval runs in a helper service that hosts the private and web document searches.
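The loop above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_hop`, `run_rollout`, the stage callables, the `{prev}` template placeholder, and `max_rounds` are all hypothetical names, and the retrieval helper service is stood in for by a plain `search` callable.

```python
from typing import Callable, List, Optional, Tuple


def run_hop(
    question: str,
    plan: Callable[[str, List[str]], List[str]],
    search: Callable[[List[str]], List[str]],
    choose: Callable[[str, List[str]], List[str]],
    read: Callable[[str, List[str]], Tuple[Optional[str], List[str]]],
    is_answered: Callable[[str, Optional[str]], bool],
    max_rounds: int = 3,  # hypothetical cap on search/read rounds per hop
) -> Optional[str]:
    """Run one hop: plan -> choose -> read -> decide, until answered."""
    notes: List[str] = []
    answer: Optional[str] = None
    for _ in range(max_rounds):
        queries = plan(question, notes)             # stage 1: plan search queries
        candidates = search(queries)                # retrieval (stand-in for the helper service)
        opened = choose(question, candidates)       # stage 2: choose documents to open
        answer, new_notes = read(question, opened)  # stage 3: read and try to answer
        notes.extend(new_notes)
        if is_answered(question, answer):           # stage 4: answered, or search again?
            break
    return answer


def run_rollout(hop_templates: List[str], **stages) -> List[Optional[str]]:
    """Chain hops: each question is filled in with the previous hop's answer."""
    answers: List[Optional[str]] = []
    previous = ""
    for template in hop_templates:
        question = template.format(prev=previous)
        answer = run_hop(question, **stages)
        answers.append(answer)
        previous = answer or ""
    return answers
```

In this sketch, each of `plan`, `choose`, `read`, and `is_answered` would be one LLM call, which is why a single rollout yields many training texts.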
Because each of those stage decisions is an LLM call, each rollout produces many training texts. We found outcome-based training (e.g. accuracy) to be unstable, leading us to propose situational rewards.
Situational rewards
By default we give one reward and one advantage per rollout. Situational rewards let a domain label each training text with a `situation` string, then compute the advantage baseline within each situation group. This way each LLM call (`training_text`) is scored against the same kind of decision in sibling rollouts, not against the whole rollout or by raw step position.
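A minimal sketch of the idea, assuming the baseline is the mean reward of each situation group (the PR may also normalize by variance or differ in other ways). The function name `situational_advantages` and the dict keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def situational_advantages(texts: List[dict]) -> List[float]:
    """Mean-centre rewards within each situation.

    `texts` holds the training texts of one rollout group; each item is a
    dict with a 'situation' string and a float 'reward' (hypothetical keys).
    """
    rewards_by_situation: Dict[str, List[float]] = defaultdict(list)
    for text in texts:
        rewards_by_situation[text["situation"]].append(text["reward"])
    # Assumption: the baseline is the plain mean reward of the situation.
    baselines = {s: mean(rs) for s, rs in rewards_by_situation.items()}
    return [text["reward"] - baselines[text["situation"]] for text in texts]
```

A training text whose situation appears only once in the group gets an advantage of zero here, since it is its own baseline.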
Example:
Each situation captures the step type and the input to that step, not just its position. In `privacy_hopqa`, the situation `privacy_hopqa:hop:2:stage:doc_choose:gold_visible` means that in the second hop, the document-choose LLM call received the gold document in its candidate list. This training text is compared only against other rollouts in the same situation. Steps where the gold doc wasn't in the list at all (`gold_absent`) are a separate situation (and in practice are dropped from training, since there is no useful training signal).
Situational RL implementation details:
- `finetune/rl/__init__.py`: a new `step_reward_advantages` flag swaps the default group-advantages-by-rollout baseline for `_compute_step_reward_advantages`, which buckets training texts by `step_advantage_group` and computes advantages within each bucket (still limited to training texts from the same rollout group). `pad_step_rewards_for_advantage` optionally pads shorter rollouts up to the group's max length with zero-gradient rows (held at the rollout's representative reward) that only shape the baseline, so every aligned position is compared across the full group instead of just the rollouts that ran that long.
- `finetune/types.py`, `finetune/utils.py`, `rl_step`: because a rollout now contributes many training texts, we add an off-by-default `loss_weight_scale` to the batch (set when `batch_size_normalization=rollouts`) so the loss is rescaled by rollout count and each rollout still counts once. We found this helps stabilize training when rollouts differ significantly in length.
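The padding behaviour described above can be illustrated with a small sketch. It assumes buckets are plain step positions and the baseline is a per-position mean. The function name `padded_position_advantages` and its arguments are hypothetical and do not mirror the signatures in `finetune/rl/__init__.py`.

```python
from statistics import mean
from typing import List


def padded_position_advantages(
    step_rewards: List[List[float]],
    representative: List[float],
) -> List[List[float]]:
    """Per-position advantages with baseline-only padding.

    step_rewards[i][k] is the reward of step k in rollout i.
    representative[i] is the value used to pad rollout i (assumption: the
    caller decides what the rollout's "representative reward" is).
    """
    max_len = max(len(rewards) for rewards in step_rewards)
    # Pad shorter rollouts so every position sees the full group.
    padded = [
        rewards + [rep] * (max_len - len(rewards))
        for rewards, rep in zip(step_rewards, representative)
    ]
    baselines = [mean(row[k] for row in padded) for k in range(max_len)]
    # Only real steps get an advantage; padded rows shape the baseline
    # but are never returned, so they contribute no training signal.
    return [
        [rewards[k] - baselines[k] for k in range(len(rewards))]
        for rewards in step_rewards
    ]
```

With rollouts of rewards `[1.0, 1.0]` and `[0.0]`, the second position would otherwise be compared only against itself. Padding the short rollout with its representative reward of `0.0` moves that baseline to `0.5`.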
With `step_reward_advantages` off, none of this runs and training behaves exactly as before.
Files affected outside the domain
The only non-domain changes are the situational-rewards machinery above: `finetune/rl/__init__.py`, `finetune/types.py`, and `finetune/utils.py`. All changes are opt-in, so existing domains and default behaviour are unaffected.