Situational Rewards and New Domain: Mosaic Leaks Privacy Agent by Alex-Gurung · Pull Request #146 · ServiceNow/PipelineRL

Alex-Gurung · 2026-06-08T14:27:01Z

This PR adds two things:

MosaicLeaks privacy agent, a new PipelineRL domain (`privacy_hopqa`).
Situational rewards, which the domain requires: each training text is assigned to a "situation", and group-relative advantages are computed within each situation instead of one advantage per rollout.

The domain (`privacy_hopqa`)

The agent from MosaicLeaks: a multi-hop QA task where each hop's answer is necessary to understand the next question, and answers are found across a mixed corpus of private/local enterprise documents (from DRBench) and public web (BrowseComp) documents.

For each hop the agent runs a four-stage loop: (1) plan search queries, (2) choose which retrieved documents to open, (3) read those documents to try to answer the question, and (4) decide whether the question has been answered or whether more searching and reading is needed. Once that final stage decides the question is answered, the agent moves to the next hop. Retrieval runs in a helper service that hosts the private and web document searches.
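The per-hop loop described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `run_hop`, the five callables, and `max_iterations` are hypothetical names, and in the real domain each stage is an LLM call and `search` is a request to the retrieval helper service.

```python
from typing import Callable, List, Optional


def run_hop(
    question: str,
    plan: Callable[[str, List[str]], List[str]],
    search: Callable[[str], List[str]],
    choose: Callable[[str, List[str]], List[str]],
    read: Callable[[str, List[str]], List[str]],
    resolve: Callable[[str, List[str]], Optional[str]],
    max_iterations: int = 3,  # hypothetical budget, not taken from the PR
) -> Optional[str]:
    """Run one hop: plan -> choose -> read -> resolve, repeating until answered."""
    notes: List[str] = []
    for _ in range(max_iterations):
        # Stage 1: plan search queries from the question and what we know so far.
        queries = plan(question, notes)
        candidates = [doc for query in queries for doc in search(query)]
        # Stage 2: choose which retrieved documents to open.
        chosen = choose(question, candidates)
        # Stage 3: read the chosen documents and collect evidence.
        notes.extend(read(question, chosen))
        # Stage 4: decide whether the question is answered (None means "keep going").
        answer = resolve(question, notes)
        if answer is not None:
            return answer
    return None
```

Each call to `plan`, `choose`, `read`, and `resolve` would yield one training text, which is why a single rollout produces many of them.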

Because each of those stage decisions is an LLM call, each rollout produces many training texts. We found outcome-based training (e.g. accuracy) to be unstable, leading us to propose situational rewards.

Situational rewards

By default we give one reward and one advantage per rollout. Situational rewards let a domain label each training text with a `situation` string, then compute the advantage baseline within each situation group. This way each LLM call (training text) is scored against the same kind of decision in sibling rollouts, not against the whole rollout or by raw step position.
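As a rough illustration of computing the baseline within each situation group, here is a minimal sketch. It assumes the advantage is simply the reward minus the mean reward of the same situation; whether the trainer also divides by a standard deviation is not stated here, and `situational_advantages` is a hypothetical name rather than a function from this PR.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def situational_advantages(rewards: List[float], situations: List[str]) -> List[float]:
    """Return each reward minus the mean reward of its situation group."""
    by_situation: Dict[str, List[float]] = defaultdict(list)
    for reward, situation in zip(rewards, situations):
        by_situation[situation].append(reward)
    # One baseline per situation, instead of one baseline per rollout group.
    baselines = {s: mean(values) for s, values in by_situation.items()}
    return [r - baselines[s] for r, s in zip(rewards, situations)]
```

With rewards `[1.0, 0.0, 1.0, 1.0]` and situations `["a", "a", "b", "b"]`, the two texts in situation `a` get advantages of `0.5` and `-0.5`, while both texts in situation `b` get `0.0` because they earned the same reward.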

Example:
Each situation captures the step type and the input to that step, not just its position. In `privacy_hopqa`, the situation `privacy_hopqa:hop:2:stage:doc_choose:gold_visible` means that in the second hop, the document-choose LLM call received the gold document in the candidate list. This training text is only compared against other rollouts in the same situation. Steps where the gold doc wasn't in the list at all (`gold_absent`) are a separate situation (and in practice are dropped from training, since there is no useful training signal).
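A situation label of this shape could be built along the following lines. The string format follows the example above; the helper names (`make_situation`, `doc_choose_situation`, `keep_for_training`) and their arguments are assumptions for illustration and are not taken from the PR.

```python
from typing import Iterable


def make_situation(domain: str, hop: int, stage: str, condition: str) -> str:
    """Join the parts into a label such as 'privacy_hopqa:hop:2:stage:doc_choose:gold_visible'."""
    return f"{domain}:hop:{hop}:stage:{stage}:{condition}"


def doc_choose_situation(
    hop: int,
    candidate_ids: Iterable[str],
    gold_id: str,
    domain: str = "privacy_hopqa",
) -> str:
    """Label a document-choose call by whether the gold document was offered."""
    condition = "gold_visible" if gold_id in set(candidate_ids) else "gold_absent"
    return make_situation(domain, hop, "doc_choose", condition)


def keep_for_training(situation: str) -> bool:
    """Drop gold_absent steps, which carry no useful training signal."""
    return not situation.endswith(":gold_absent")
```

For example, `doc_choose_situation(2, ["d1", "d7"], "d7")` yields `privacy_hopqa:hop:2:stage:doc_choose:gold_visible`, and a call whose candidate list lacks `d7` yields a `gold_absent` label that `keep_for_training` filters out.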

Situational RL implementation details:

Trainer (finetune/rl/__init__.py): a new step_reward_advantages flag swaps the default group-advantages-by-rollout baseline for _compute_step_reward_advantages, which buckets training texts by step_advantage_group and computes advantages within each bucket (still limited to training texts from the rollout group). pad_step_rewards_for_advantage optionally pads shorter rollouts up to the group's max length with zero-gradient rows (held at the rollout's representative reward) that only shape the baseline, so every aligned position is compared across the full group instead of just the rollouts that ran that long.
Loss bookkeeping (finetune/types.py, finetune/utils.py, rl_step): because a rollout now contributes many training texts, we add an off-by-default loss_weight_scale to the batch (set when batch_size_normalization=rollouts) so the loss is rescaled by rollout count and each rollout still counts once. We found this helps stabilize training when rollouts differ significantly in length.
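The padding behaviour described for the trainer can be illustrated with a simplified sketch. It makes several assumptions: it uses the aligned step position as the bucket key (the trainer buckets by step_advantage_group), it uses a plain mean baseline, and `padded_position_advantages` and `pad_rewards` are hypothetical names. Padded rows feed the baseline only and never produce an advantage of their own.

```python
from typing import List


def padded_position_advantages(
    step_rewards: List[List[float]],
    pad_rewards: List[float],
) -> List[List[float]]:
    """Per-position advantages where shorter rollouts are padded for the baseline only.

    step_rewards[i] holds the per-step rewards of rollout i.
    pad_rewards[i] is the representative reward used for rollout i's missing steps.
    """
    max_len = max(len(rewards) for rewards in step_rewards)
    advantages: List[List[float]] = [[] for _ in step_rewards]
    for pos in range(max_len):
        # Every rollout contributes to the baseline, padded or not.
        column = [
            rewards[pos] if pos < len(rewards) else pad
            for rewards, pad in zip(step_rewards, pad_rewards)
        ]
        baseline = sum(column) / len(column)
        for i, rewards in enumerate(step_rewards):
            if pos < len(rewards):  # only real steps receive an advantage
                advantages[i].append(rewards[pos] - baseline)
    return advantages
```

With `step_rewards = [[1.0, 1.0], [0.0]]` and `pad_rewards = [1.0, 0.0]`, the second step of the longer rollout is compared against the short rollout's padded reward of `0.0`, so it gets an advantage of `0.5`. Without padding it would be the only entry at that position and its advantage would be zero.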

With step_reward_advantages off, none of this runs and training behaves exactly as before.

Files affected outside the domain

The only non-domain changes are the situational-rewards machinery above: finetune/rl/__init__.py, finetune/types.py, and finetune/utils.py. All changes are opt-in, so existing domains and default behaviour are unaffected.

Handle concrete failure modes from long all-call Privacy HopQA runs: split oversized preprocessed batches before SharedMemoryQueue writes, skip empty actor group publishes after zero-advantage filtering, and fail fast on rollout LLM infrastructure errors instead of converting vLLM shard crashes into bad rollouts.


Alex-Gurung and others added 30 commits April 14, 2026 01:02

initial drbench agent workflow

6a4ec5b

new privacy agent domain

63f2e66

Merge remote-tracking branch 'origin/main' into privacy_agent

dc2c2fc

# Conflicts: # pipelinerl/finetune/lora.py

Use Privacy HopQA final split and exact answer variants

c7510aa

Merge remote-tracking branch 'origin/main' into privacy_agent

833ddab

Clarify privacy hopqa local search planning prompt

3a5858d

Revert "Clarify privacy hopqa local search planning prompt"

38b6759

This reverts commit 3a5858d.

Clarify privacy hopqa retrieval planning contract

1f4e288

Move privacy hopqa planner reasoning cue

9e9d0c5

Update Privacy HopQA training support

3c6aeda

Tighten Privacy HopQA answer format prompts

26799b9

Update Privacy HopQA context and retrieval handling

ff39eaa

Include hop justifications in Privacy HopQA context

d46e0ce

improve retrieval planning and answer matching

fd7ec9a

Improve Privacy HopQA retrieval and reward metrics

5a4ceae

Update privacy hopqa q30 training setup

3f0daf8

Increase Privacy HopQA stage token limits

486aab0

Clarify Privacy HopQA planning and evidence prompts

aabcccb

Point Privacy HopQA defaults at shared q30 artifacts

4a8bd63

Add Privacy HopQA prefix step rewards

94fcf6e

Add privacy-aware HopQA planner prompt flag

8e12390

Set Privacy HopQA doc read concurrency to five

a802e38

Refine Privacy HopQA step reward training

b35bd7e

Expose Privacy HopQA error reward config

7d8a121

Implement Privacy HopQA hop-step rewards

da2b938

Fix Privacy HopQA doc-read window budget

817bd36

Refine Privacy HopQA hop-step reward handling

e176f10

Log zero-advantage filter fractions

9b0e4ba

working on detailed step rewards

1ff30c0

Alex-Gurung added 16 commits May 19, 2026 15:16

Add call-normalized step advantages

0351e94

Stabilize privacy HopQA training rewards

b654f88

Add hop-step resolver ablation controls

eb68ca9

Add Privacy HopQA outcome reward training support

c7cb991

Merge remote-tracking branch 'origin/main' into privacy_agent

68c6388

# Conflicts: # pipelinerl/finetune_loop.py

privacy hop qa domain refactor

3dc2a0a

code cleanup

1c9d6b1

simplify hop typing and dataset loader signature

f629c8c

simplify embedding cache and advantage-column guard

8a38acd

drop no-op advantage column select

74aff21

prune dead code in privacy_hopqa domain

6ee9005

clarify prefix-reward consistency and rename surface-form key

2568ff6

drop redundant raw_hop_accuracy metric and simplify metric coercion

9859753

unify hop_number dict keys on int in rollouts

5ab3b40

move parameters_override + padding_reward into domain; metrics fix + …

86fbbc7

…lint cleanup

clarify pad_step_rewards_for_advantage docstring

9f1ed7d

Alex-Gurung force-pushed the privacy_agent branch 2 times, most recently from 8d0564e to 9f1ed7d Compare June 9, 2026 15:26

Alex-Gurung wants to merge 46 commits into main from privacy_agent


2 participants
