Situational Rewards and New Domain: Mosaic Leaks Privacy Agent #146

Open
Alex-Gurung wants to merge 46 commits into main from privacy_agent

@Alex-Gurung

Collaborator

This PR adds two things:

  1. MosaicLeaks privacy agent, a new PipelineRL domain (privacy_hopqa).
  2. Situational rewards, which the domain needs: each training text is assigned to a 'situation', and group-relative advantages are computed within each situation instead of one advantage per rollout.

The domain (privacy_hopqa)

The agent from MosaicLeaks: a multi-hop QA task where each hop's answer is necessary to understand the next question, and answers are found across a mixed corpus of private/local enterprise documents (from DRBench) and public web (BrowseComp) documents.

For each hop the agent runs a four-stage loop: plan search queries, choose which retrieved documents to open, read those documents to try to answer the question, and decide whether the question has been answered or more searching and reading is needed. Once that final stage decides the question is answered, the agent moves to the next hop. Retrieval runs in a helper service that hosts the private and web document searches.
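The loop above can be sketched as follows. This is a minimal illustration, not the code in this PR: the stage callables (`plan`, `choose`, `read`, `judge`), the `search` helper, the `{prev}` question templates, and the `max_rounds` cap are all hypothetical names chosen for the example.

```python
def run_hop(question, plan, choose, read, judge, search, max_rounds=3):
    """Run one hop: plan -> choose -> read -> judge, repeating until answered.

    Every argument except `question` is a hypothetical callable standing in
    for one LLM stage (or for the retrieval helper service in `search`).
    `max_rounds` is an assumed safety cap, not something the PR specifies.
    """
    notes = []
    for _ in range(max_rounds):
        queries = plan(question, notes)                 # stage 1: plan queries
        candidates = [doc for q in queries for doc in search(q)]
        opened = choose(question, candidates)           # stage 2: pick documents
        notes.extend(read(question, doc) for doc in opened)  # stage 3: read
        answer = judge(question, notes)                 # stage 4: answered yet?
        if answer is not None:
            return answer
    return None


def run_chain(question_templates, **stages):
    """Run hops in order; each template may use {prev} for the previous answer."""
    answer = None
    for template in question_templates:
        question = template.format(prev=answer)
        answer = run_hop(question, **stages)
        if answer is None:  # the hop never produced an answer, stop the chain
            return None
    return answer
```

Each call to a stage callable corresponds to one LLM call, which is why a single rollout yields many training texts.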

Because each of those stage decisions is an LLM call, each rollout produces many training texts. We found outcome-based training (e.g. accuracy) to be unstable, leading us to propose situational rewards.

Situational rewards

By default we give one reward and one advantage per rollout. Situational rewards let a domain label each training text with a `situation` string, then compute the advantage baseline within each situation group. This way each LLM call (training_text) is scored against the same kind of decision in sibling rollouts, not against the whole rollout or by raw step position.
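A minimal sketch of a per-situation baseline, assuming each training text is a dict with hypothetical `situation` and `reward` fields and that the baseline is a plain mean (the PR text does not say whether any further normalization, such as dividing by a standard deviation, is applied). The input is assumed to hold only training texts from one rollout group.

```python
from collections import defaultdict
from statistics import mean


def situational_advantages(texts):
    """Return one advantage per training text, in input order.

    `texts` is a list of dicts with hypothetical keys 'situation' and
    'reward'. The baseline for a text is the mean reward of all texts that
    share its situation string.
    """
    rewards_by_situation = defaultdict(list)
    for text in texts:
        rewards_by_situation[text["situation"]].append(text["reward"])
    baselines = {s: mean(r) for s, r in rewards_by_situation.items()}
    return [text["reward"] - baselines[text["situation"]] for text in texts]
```

Under this sketch, a situation that appears only once in the group gets an advantage of zero, because the text is its own baseline.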

Example:
Each situation captures the step type and the input to that step, not just its position. In `privacy_hopqa`, the situation `privacy_hopqa:hop:2:stage:doc_choose:gold_visible` means that in the second hop, the document-choosing LLM call received the gold document in its candidate list. This training text is compared only against training texts from sibling rollouts in the same situation. Steps where the gold document was not in the list at all (`gold_absent`) form a separate situation (and in practice are dropped from training, since they carry no useful training signal).
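The string layout in this example can be reproduced with a small helper. This is only a sketch: `situation_key` and `keep_for_training` are hypothetical names, and the assumption that every stage ends its key with a status suffix is taken from the single example above.

```python
def situation_key(domain, hop, stage, status):
    """Build a situation string in the layout shown in the example above."""
    return f"{domain}:hop:{hop}:stage:{stage}:{status}"


def keep_for_training(situation):
    """Drop situations where the gold document was absent from the candidates."""
    return not situation.endswith(":gold_absent")
```

For instance, `situation_key("privacy_hopqa", 2, "doc_choose", "gold_visible")` yields the string from the example, and `keep_for_training` returns `False` for the matching `gold_absent` key.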

Situational RL implementation details:

  • Trainer (finetune/rl/__init__.py): a new step_reward_advantages flag swaps the default group-advantages-by-rollout baseline for _compute_step_reward_advantages, which buckets training texts by step_advantage_group and computes advantages within each bucket (still limited to training texts from the rollout group). pad_step_rewards_for_advantage optionally pads shorter rollouts up to the group's max length with zero-gradient rows (held at the rollout's representative reward) that only shape the baseline, so every aligned position is compared across the full group instead of just the rollouts that ran that long.
  • Loss bookkeeping (finetune/types.py, finetune/utils.py, rl_step): because a rollout now contributes many training texts, we add an off-by-default loss_weight_scale to the batch (set when batch_size_normalization=rollouts) so the loss is rescaled by rollout count and each rollout still counts once. We found this helps stabilize training when rollouts differ significantly in length.
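The padding idea in the Trainer item can be illustrated with a simplified sketch. It is not the code behind pad_step_rewards_for_advantage: it groups steps purely by aligned position rather than by step_advantage_group, uses a plain mean baseline, and takes the per-rollout representative reward as an input because the PR text does not say how that value is chosen. All names are hypothetical.

```python
def padded_position_advantages(step_rewards, representative_rewards):
    """Compute per-step advantages with baselines padded to the longest rollout.

    step_rewards: one list of step rewards per rollout (lengths may differ).
    representative_rewards: one value per rollout, used to fill the missing
        positions of shorter rollouts (assumed to be supplied by the caller).
    Padded positions only influence the baseline; no advantage is returned
    for them, which mirrors the "zero-gradient rows" described above.
    """
    max_len = max(len(rewards) for rewards in step_rewards)
    padded = [
        list(rewards) + [rep] * (max_len - len(rewards))
        for rewards, rep in zip(step_rewards, representative_rewards)
    ]
    # Baseline at each position is the mean over every rollout in the group.
    baselines = [
        sum(row[i] for row in padded) / len(padded) for i in range(max_len)
    ]
    # Advantages are emitted only for steps that actually ran.
    return [
        [reward - baselines[i] for i, reward in enumerate(rewards)]
        for rewards in step_rewards
    ]
```

With `step_rewards = [[1.0, 1.0], [0.0]]` and `representative_rewards = [1.0, 0.0]`, the second position is compared against both rollouts. Without padding, the longer rollout would be the only entry at that position and its advantage there would be zero.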

With step_reward_advantages off, none of this runs and training behaves exactly as before.

Files affected outside the domain

The only non-domain changes are the situational-rewards machinery above: finetune/rl/__init__.py, finetune/types.py, and finetune/utils.py. All changes are opt-in, so existing domains and default behaviour are unaffected.

Alex-Gurung and others added 30 commits April 14, 2026 01:02
Handle concrete failure modes from long all-call Privacy HopQA runs: split oversized preprocessed batches before SharedMemoryQueue writes, skip empty actor group publishes after zero-advantage filtering, and fail fast on rollout LLM infrastructure errors instead of converting vLLM shard crashes into bad rollouts.
# Conflicts:
#	pipelinerl/finetune/lora.py
@Alex-Gurung Alex-Gurung force-pushed the privacy_agent branch 2 times, most recently from 8d0564e to 9f1ed7d Compare June 9, 2026 15:26

2 participants