feat(campaign): GEPA-distillation harness — distill a cheap analyst toward gold verdicts by drewstone · Pull Request #141 · tangle-network/agent-eval

drewstone · 2026-05-30T21:19:33Z

Teacher→student distillation, composed from existing primitives

The expensive workflow (the 70-agent skill-audit) is the teacher: its frozen verdicts are gold. A cheap single-shot LLM analyst is the student. gepaDriver optimizes the student's prompt toward agreement with the gold labels. This is the distillation loop — collapse an expensive multi-agent verdict into one cheap prompt that reproduces it.

It composes the shipped improvement-loop primitives and reimplements none of them:

Loop: runImprovementLoop (outer: optimize → holdout re-score → gate → optional PR)
Driver: gepaDriver (reflective prompt optimizer, Pareto-combine of complementary lessons)
Measurement: runCampaign inside the loop, scoring the student over the train split
Gate: caller-supplied — heldOutGate by default, or defaultProductionGate for the full red-team / reward-hacking / canary stack

autoOnPromote: 'none' is forced inside runDistillation — the loop never opens a PR; the caller (the distill CLI) decides what to do with the winning prompt.

New module: `src/campaign/distillation/` (generic + parameterized)

Distills any analyst against any gold JSONL of {input, label} — no skill data or domain logic is hardcoded in the module.

gold-scenarios.ts: GoldScenario (opaque input + label), loadGoldScenarios/parseGoldJsonl, and a deterministic splitGold that honors an explicit split field and otherwise falls back to a modulo stride. It sanitizes : in ids into __ (the real gold uses skill:foo ids). heldOutGate recovers the scenario from the campaign cellId (${id}:${rep}) via cellId.split(':')[0], so a colon inside the id would collapse every scenario into one bucket and zero the holdout delta. The original id is preserved on a gold-id: tag.
agreement-judge.ts: buildAgreementJudge returns a real substrate JudgeConfig whose score({artifact, scenario, signal}) scores the student's produced label against the scenario's gold label via an injected, domain-agnostic comparator. The default comparator, fieldAgreement, averages per-field scores: exact match on categorical fields and Jaccard set overlap on array fields. Pure and unit-tested.
run-distillation.ts: runDistillation wires the loop together: baselineSurface = the student prompt; dispatchWithSurface = render the surface plus the scenario input, call the LLM via createChatClient, parse the produced JSON label, and return it as the artifact (cost and tokens are reported via ctx.cost, so the cell is never a stub); driver = gepaDriver; judge = the agreement judge; scenarios = train split; holdoutScenarios = test split; gate = heldOutGate or defaultProductionGate. It returns the winning prompt plus the before/after agreement on the holdout.
cli.ts: distill --gold <path> --baseline <prompt-file> --model <m> [--optimizer-model m] [--generations N] [--population K] [--reps R] [--categorical ...] [--array ...]. This is the live, token-spending entry point, invoked by hand and never in CI. Auth is via LLM_API_KEY/TANGLE_API_KEY.
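
The id sanitization and split behavior described for gold-scenarios.ts can be illustrated with a minimal sketch. This is not the module's code: the record shape, the `toScenario` and `scenarioIdOf` helpers, the tag format, and the stride of 4 are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; the module's real GoldScenario may differ.
interface GoldRecord {
  id: string;
  input: unknown;
  label: unknown;
  split?: "train" | "test";
}

interface GoldScenario extends GoldRecord {
  tags: string[];
}

// Replace ':' with '__' so `${id}:${rep}` still splits back to the full id.
function sanitizeId(id: string): string {
  return id.replace(/:/g, "__");
}

// Keep the original id on a tag (tag format assumed here).
function toScenario(rec: GoldRecord): GoldScenario {
  return { ...rec, id: sanitizeId(rec.id), tags: [`gold-id:${rec.id}`] };
}

// An explicit split field wins; otherwise every `stride`-th record is held out.
// The stride value and which residue is held out are assumptions.
function splitGold(scenarios: GoldScenario[], stride = 4) {
  const train: GoldScenario[] = [];
  const test: GoldScenario[] = [];
  scenarios.forEach((s, i) => {
    const side = s.split ?? (i % stride === stride - 1 ? "test" : "train");
    (side === "test" ? test : train).push(s);
  });
  return { train, test };
}

// The parse the PR attributes to heldOutGate.
const scenarioIdOf = (cellId: string) => cellId.split(":")[0];

console.log(scenarioIdOf("skill:foo:0")); // "skill" (every skill:* id collapses)
console.log(scenarioIdOf(`${sanitizeId("skill:foo")}:0`)); // "skill__foo"
```

Because the split depends only on record order and the explicit split field, the same gold file always yields the same train and holdout sets.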

Public surface exported from src/index.ts: loadGoldScenarios, parseGoldJsonl, splitGold, buildAgreementJudge, fieldAgreement, runDistillation + their types.
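
Of the exported functions, fieldAgreement is the one whose arithmetic drives the optimizer, so a rough sketch may help. It assumes a `FieldSpec` argument listing categorical and array fields, treats two empty arrays as full agreement, and returns 0 when no fields are configured; the real signature and edge-case handling may differ.

```typescript
type Label = Record<string, unknown>;

// Hypothetical configuration shape, mirroring the --categorical / --array CLI flags.
interface FieldSpec {
  categorical: string[];
  array: string[];
}

// Jaccard overlap |A ∩ B| / |A ∪ B|. Two empty sets count as 1 (an assumption).
function jaccard(a: unknown[], b: unknown[]): number {
  const sa = new Set(a.map(String));
  const sb = new Set(b.map(String));
  if (sa.size === 0 && sb.size === 0) return 1;
  let shared = 0;
  sa.forEach((x) => {
    if (sb.has(x)) shared += 1;
  });
  return shared / (sa.size + sb.size - shared);
}

// Average of per-field scores: exact match for categorical, Jaccard for arrays.
function fieldAgreement(produced: Label, gold: Label, spec: FieldSpec): number {
  const scores: number[] = [];
  for (const f of spec.categorical) scores.push(produced[f] === gold[f] ? 1 : 0);
  for (const f of spec.array) {
    const p = Array.isArray(produced[f]) ? (produced[f] as unknown[]) : [];
    const g = Array.isArray(gold[f]) ? (gold[f] as unknown[]) : [];
    scores.push(jaccard(p, g));
  }
  if (scores.length === 0) return 0;
  return scores.reduce((sum, x) => sum + x, 0) / scores.length;
}

const spec: FieldSpec = {
  categorical: ["value_verdict", "public_leak_risk"],
  array: ["top_actions"],
};
const gold: Label = { value_verdict: "keep", public_leak_risk: true, top_actions: ["a", "b"] };
const missesLeak: Label = { value_verdict: "keep", top_actions: ["a", "b"] };

console.log(fieldAgreement(gold, gold, spec)); // 1
console.log(fieldAgreement(missesLeak, gold, spec)); // 2/3: the leak field scores 0
console.log(jaccard(["a", "b", "c"], ["a", "b", "d"])); // 0.5
```

Scoring each field separately means a verdict that gets everything right except the leak flag lands strictly below a full match, which is what gives the optimizer something to improve on.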

Tests — exact agreement scores + shapes (no `toBeTruthy`)

src/campaign/distillation/distillation.test.ts:

fieldAgreement: exact match = 1.0; total mismatch = 0; partial array overlap = exact Jaccard (0.5).
Leak-detection regression: a produced verdict that matches gold on value_verdict but is missing public_leak_risk=true scores 2/3, strictly less than the full-match 1.0. Without this, the GEPA loop would have no gradient to teach leak detection.
loadGoldScenarios/splitGold on inline fixtures: correct parse, deterministic train/test split, honors the explicit split field, and sanitizes colon ids (the real skill:foo regression).
buildAgreementJudge: .score(...) returns the exact comparator agreement for a produced-vs-gold pair, applies only to gold-kind scenarios, throws on an out-of-range comparator score.
DISTILL_LIVE-gated wiring test: runs runDistillation (population 2, 1 generation, 1 rep) on a 3-scenario inline gold set, using a mock chat transport for the student and a stubbed fetch for the GEPA reflection. It exercises the full loop end-to-end without real tokens, pins the holdout agreement (2/3) through the campaign, and asserts that a winner surface and a gate decision come back.
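
The judge behavior these tests pin down (delegating to an injected comparator, applying only to gold-kind scenarios, rejecting out-of-range scores) could look roughly like the sketch below. The `JudgeSketch` interface, the `appliesTo` method, the `kind` field, and the `buildAgreementJudgeSketch` name are all hypothetical simplifications; the substrate's real JudgeConfig is not reproduced here.

```typescript
// Hypothetical, simplified shapes; not the substrate's real JudgeConfig.
interface Scenario {
  id: string;
  kind: string;
  label?: unknown;
}

interface ScoreArgs {
  artifact: unknown;
  scenario: Scenario;
  signal?: unknown;
}

interface JudgeSketch {
  appliesTo(scenario: Scenario): boolean;
  score(args: ScoreArgs): number;
}

type Comparator = (produced: unknown, gold: unknown) => number;

function buildAgreementJudgeSketch(compare: Comparator): JudgeSketch {
  return {
    // Only scenarios that carry a gold label are scored by this judge.
    appliesTo: (scenario) => scenario.kind === "gold",
    score({ artifact, scenario }) {
      const value = compare(artifact, scenario.label);
      // Fail loudly rather than let a broken comparator skew the campaign.
      if (!Number.isFinite(value) || value < 0 || value > 1) {
        throw new RangeError(`comparator returned ${value}; expected a score in [0, 1]`);
      }
      return value;
    },
  };
}

const exact = buildAgreementJudgeSketch((p, g) => (p === g ? 1 : 0));
console.log(exact.score({ artifact: "keep", scenario: { id: "s1", kind: "gold", label: "keep" } })); // 1
```

Keeping the comparator injected is what lets the same judge serve any label schema.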

Intended live input

skills-internal/audits/gold/skill-verdicts.gold.jsonl (53 records) — referenced only; not copied into this repo. Verified locally that the harness loads all 53, splits them 41 train / 12 holdout (honoring the gold's own split tags), scores a self-match 1.0, and drops to 6/7 when the leak flag is flipped.
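
The 6/7 figure is consistent with a per-field average over seven fields. Assuming the local check used the six categorical fields and the one array field named in the run command's --categorical and --array flags, flipping public_leak_risk zeroes one categorical field while the other five and the top_actions overlap still score 1:

$$
\text{agreement} = \frac{5 \cdot 1 + 0 + 1}{7} = \frac{6}{7} \approx 0.857
$$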

Run a real distillation

TANGLE_API_KEY=$(cat /tmp/.tk) pnpm tsx src/campaign/distillation/cli.ts \
  --gold ~/code/skills-internal/audits/gold/skill-verdicts.gold.jsonl \
  --baseline ./baseline-skill-analyst.txt \
  --model gpt-4o-mini --optimizer-model gpt-4o \
  --generations 3 --population 4 \
  --categorical value_verdict,quality_score,generalization_rating,public_leak_risk,write_target_rating,subagent_recommended \
  --array top_actions

Verification

pnpm typecheck — clean
pnpm vitest run src/campaign/distillation/distillation.test.ts — 15 passed / 1 skipped (DISTILL_LIVE off); 16 passed (DISTILL_LIVE on)
biome check on the module + src/index.ts — clean

…oward gold verdicts

Compose the existing improvement-loop primitives into a teacher→student distillation loop: an expensive workflow's frozen gold verdicts are the teacher, a cheap single-shot LLM analyst is the student, and gepaDriver optimizes the student's prompt toward agreement with the gold labels.

New module src/campaign/distillation/ (generic + parameterized; distills any analyst against any gold JSONL of {input, label}):

- gold-scenarios.ts: GoldScenario (opaque input+label), loadGoldScenarios / parseGoldJsonl, deterministic splitGold (honors explicit split tags, else modulo). Sanitizes ':' in ids (real skill:foo gold) so the campaign cellId split that heldOutGate relies on stays correct.
- agreement-judge.ts: buildAgreementJudge, a real JudgeConfig scoring the student's produced label against the scenario's gold label via an injected comparator. Default fieldAgreement: exact-match on categorical fields + Jaccard on array fields, averaged. Pure.
- run-distillation.ts: runDistillation wires baselineSurface=student prompt, dispatchWithSurface=render+createChatClient call+parse JSON label, driver=gepaDriver, judge=agreement judge, train split as scenarios, test split as holdoutScenarios, gate=heldOutGate (default) or defaultProductionGate, autoOnPromote='none' (never opens a PR from inside the loop). Composes runImprovementLoop and reimplements none of it.
- cli.ts: distill --gold --baseline --model [--generations --population ...]. The live token-spending entry; invoked by hand, not in CI.

Tests assert exact agreement scores + shapes. The leak-detection regression proves a verdict matching value_verdict but missing public_leak_risk=true scores strictly lower than a full match (2/3 vs 1.0). A DISTILL_LIVE-gated wiring test runs the full loop on a mock chat transport + stubbed reflection fetch (no real tokens) and pins the holdout agreement math through the campaign.

Intended live input: skills-internal/audits/gold/skill-verdicts.gold.jsonl (53 records). Verified the harness loads + splits it 41/12 and the agreement judge scores a self-match 1.0 and a flipped leak flag 6/7.

tangletools

Approved (verified locally). Composes runImprovementLoop+gepaDriver+heldOutGate (zero reimplementation, autoOnPromote:none forced). Generic/parameterized over any gold JSONL. typecheck clean; 15 pass/1 skip (16 with DISTILL_LIVE, full loop on mock transport — no real tokens). Leak-detection regression asserts real agreement gradient (fullMatch=1, missesLeak=2/3, leak dim=0). Caught + regression-tested a real scenarioId-colon bug that would have zeroed the holdout delta.

tangletools approved these changes May 30, 2026

drewstone merged commit 1dab44c into main May 30, 2026
1 check passed

drewstone deleted the feat/skill-distillation-loop branch May 30, 2026 21:20

drewstone merged 1 commit into main from feat/skill-distillation-loop
