feat(campaign): GEPA-distillation harness (distill a cheap analyst toward gold verdicts) #141
Merged
…oward gold verdicts
Compose the existing improvement-loop primitives into a teacher→student
distillation loop: an expensive workflow's frozen gold verdicts are the
teacher, a cheap single-shot LLM analyst is the student, and gepaDriver
optimizes the student's prompt toward agreement with the gold labels.
New module src/campaign/distillation/ (generic + parameterized — distills any
analyst against any gold JSONL of {input, label}):
- gold-scenarios.ts: GoldScenario (opaque input+label), loadGoldScenarios /
parseGoldJsonl, deterministic splitGold (honors explicit split tags, else
modulo). Sanitizes ':' in ids (real skill:foo gold) so the campaign cellId
split that heldOutGate relies on stays correct.
- agreement-judge.ts: buildAgreementJudge — a real JudgeConfig scoring the
student's produced label against the scenario's gold label via an injected
comparator. Default fieldAgreement: exact-match on categorical fields +
Jaccard on array fields, averaged. Pure.
- run-distillation.ts: runDistillation wires baselineSurface=student prompt,
dispatchWithSurface=render+createChatClient call+parse JSON label,
driver=gepaDriver, judge=agreement judge, train split as scenarios, test
split as holdoutScenarios, gate=heldOutGate (default) or defaultProductionGate,
autoOnPromote='none' (never opens a PR from inside the loop). Composes
runImprovementLoop — reimplements none of it.
- cli.ts: distill --gold --baseline --model [--generations --population ...].
The live token-spending entry; invoked by hand, not in CI.
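The id sanitization and deterministic split described for gold-scenarios.ts can be sketched as follows. This is a minimal illustration, not the module's code: the GoldRecord and GoldScenario shapes, the sanitizeId and scenarioOf helpers, and the stride of 4 are all assumptions made for the example; only the ':' to '__' replacement, the gold-id: tag, and the cellId.split(':')[0] recovery come from this PR.

```typescript
// Hypothetical shapes; the real GoldScenario type may differ.
type Split = "train" | "test";

interface GoldRecord {
  id: string;
  input: unknown;
  label: unknown;
  split?: Split; // optional explicit tag carried by the gold JSONL
}

interface GoldScenario {
  id: string; // sanitized, colon-free
  tags: string[]; // carries the original id
  input: unknown;
  label: unknown;
}

// Replace every ':' so that splitting a cellId on ':' still yields the scenario id.
function sanitizeId(id: string): string {
  return id.replace(/:/g, "__");
}

function toScenario(rec: GoldRecord): GoldScenario {
  return {
    id: sanitizeId(rec.id),
    tags: ["gold-id:" + rec.id],
    input: rec.input,
    label: rec.label,
  };
}

// Deterministic split: an explicit tag wins; otherwise every stride-th record
// goes to the test split. The stride value is an assumption.
function splitGold(records: GoldRecord[], stride = 4) {
  const train: GoldScenario[] = [];
  const test: GoldScenario[] = [];
  records.forEach((rec, i) => {
    const split: Split =
      rec.split ?? (i % stride === stride - 1 ? "test" : "train");
    (split === "test" ? test : train).push(toScenario(rec));
  });
  return { train, test };
}

// How a gate would recover the scenario id from a cellId of the form `${id}:${rep}`.
function scenarioOf(cellId: string): string {
  return cellId.split(":")[0] ?? cellId;
}

const demo = splitGold([
  { id: "skill:foo", input: {}, label: {} },
  { id: "skill:bar", input: {}, label: {}, split: "test" },
]);
```

With the unsanitized id, scenarioOf("skill:foo:0") returns "skill" for every skill:* scenario, which is the bucket collapse the sanitization avoids; with the sanitized id, scenarioOf("skill__foo:0") returns "skill__foo".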
Tests assert exact agreement scores + shapes. The leak-detection regression
proves a verdict matching value_verdict but missing public_leak_risk=true
scores strictly lower than a full match (2/3 vs 1.0). A DISTILL_LIVE-gated
wiring test runs the full loop on a mock chat transport + stubbed reflection
fetch — no real tokens — and pins the holdout agreement math through the
campaign.
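The agreement gradient behind the leak-detection regression can be reproduced with a small comparator. This is a sketch of the technique named above (exact match on categorical fields, Jaccard on array fields, unweighted average), not the shipped fieldAgreement: the signature, the treatment of two empty arrays as full agreement, the risk_tags field, and the "keep" verdict value are assumptions for illustration.

```typescript
type Label = Record<string, unknown>;

interface AgreementFields {
  categorical: string[];
  array: string[];
}

// Jaccard overlap |A ∩ B| / |A ∪ B| over de-duplicated items.
// Two empty arrays count as full agreement (an assumption).
function jaccard(a: unknown[], b: unknown[]): number {
  const uniqA = a.filter((x, i) => a.indexOf(x) === i);
  const uniqB = b.filter((x, i) => b.indexOf(x) === i);
  const shared = uniqA.filter((x) => uniqB.indexOf(x) !== -1).length;
  const union = uniqA.length + uniqB.length - shared;
  return union === 0 ? 1 : shared / union;
}

// One score per field, averaged with equal weight.
function fieldAgreement(
  produced: Label,
  gold: Label,
  fields: AgreementFields,
): number {
  const scores: number[] = [];
  for (const key of fields.categorical) {
    scores.push(produced[key] === gold[key] ? 1 : 0);
  }
  for (const key of fields.array) {
    const pv = produced[key];
    const gv = gold[key];
    scores.push(
      jaccard(Array.isArray(pv) ? pv : [], Array.isArray(gv) ? gv : []),
    );
  }
  if (scores.length === 0) throw new Error("no fields to compare");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Two categorical fields from this PR plus one hypothetical array field.
const fields: AgreementFields = {
  categorical: ["value_verdict", "public_leak_risk"],
  array: ["risk_tags"],
};
const gold: Label = {
  value_verdict: "keep",
  public_leak_risk: true,
  risk_tags: ["pii"],
};
const missesLeak: Label = { ...gold, public_leak_risk: false };

const fullMatch = fieldAgreement(gold, gold, fields); // 1
const leakMissed = fieldAgreement(missesLeak, gold, fields); // 2/3
```

Under these assumptions the missed leak flag zeroes one of three equally weighted dimensions, so the score drops from 1 to 2/3 and the optimizer has a signal to learn leak detection from.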
Intended live input: skills-internal/audits/gold/skill-verdicts.gold.jsonl
(53 records). Verified the harness loads + splits it 41/12 and the agreement
judge scores a self-match 1.0 and a flipped leak flag 6/7.
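Both reported numbers are consistent with an unweighted per-field average. As a sketch, write $C$ for the categorical fields, $S$ for the array fields, $p$ for the produced label and $g$ for the gold label. The equal weighting, and the count of seven scored fields, are inferred from the reported 6/7 rather than stated in this PR:

```latex
A(p, g) = \frac{1}{|C| + |S|}
  \left( \sum_{c \in C} \mathbf{1}[\,p_c = g_c\,]
       + \sum_{s \in S} J(p_s, g_s) \right),
\qquad
J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}

% Self-match: every term equals 1, so A = 1.
% One flipped boolean with |C| + |S| = 7: A = (7 - 1)/7 = 6/7 \approx 0.857.
```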
tangletools approved these changes on May 30, 2026
Approved (verified locally). Composes runImprovementLoop+gepaDriver+heldOutGate (zero reimplementation, autoOnPromote:none forced). Generic/parameterized over any gold JSONL. typecheck clean; 15 pass/1 skip (16 with DISTILL_LIVE, full loop on mock transport — no real tokens). Leak-detection regression asserts real agreement gradient (fullMatch=1, missesLeak=2/3, leak dim=0). Caught + regression-tested a real scenarioId-colon bug that would have zeroed the holdout delta.
Teacher→student distillation, composed from existing primitives

The expensive workflow (the 70-agent skill-audit) is the teacher: its frozen verdicts are gold. A cheap single-shot LLM analyst is the student. gepaDriver optimizes the student's prompt toward agreement with the gold labels. This is the distillation loop: collapse an expensive multi-agent verdict into one cheap prompt that reproduces it.

It composes the shipped improvement-loop primitives and reimplements none of them:
- runImprovementLoop (outer: optimize → holdout re-score → gate → optional PR)
- gepaDriver (reflective prompt optimizer, Pareto-combine of complementary lessons)
- runCampaign inside the loop, scoring the student over the train split
- heldOutGate by default, or defaultProductionGate for the full red-team / reward-hacking / canary stack

autoOnPromote: 'none' is forced inside runDistillation: the loop never opens a PR; the caller (the distill CLI) decides what to do with the winning prompt.

New module: src/campaign/distillation/ (generic + parameterized)

Distills any analyst against any gold JSONL of {input, label}; no skill data or domain logic is hardcoded in the module.

- gold-scenarios.ts: GoldScenario (opaque input + label), loadGoldScenarios / parseGoldJsonl, deterministic splitGold (honors an explicit split field, else a modulo stride). Sanitizes : in ids (the real gold uses skill:foo ids) into __ so the campaign cellId (${id}:${rep}) that heldOutGate parses via cellId.split(':')[0] keeps recovering the right scenario; a colon in the id would otherwise collapse every scenario into one bucket and zero the holdout delta. The original id is preserved on a gold-id: tag.
- agreement-judge.ts: buildAgreementJudge returns a real substrate JudgeConfig whose score({artifact, scenario, signal}) scores the student's produced label against the scenario's gold label via an injected comparator (domain-agnostic). Default fieldAgreement: exact-match on categorical fields + Jaccard set-overlap on array fields, averaged. Pure + unit-tested.
- run-distillation.ts: runDistillation wires baselineSurface = student prompt, dispatchWithSurface = render the surface + scenario input → call the LLM via createChatClient → parse the produced JSON label → return as the artifact (reports cost + tokens via ctx.cost, so the cell is never a stub), driver = gepaDriver, judge = the agreement judge, scenarios = train split, holdoutScenarios = test split, gate = heldOutGate / defaultProductionGate. Returns the winning prompt + before/after agreement on the holdout.
- cli.ts: distill --gold <path> --baseline <prompt-file> --model <m> [--optimizer-model m] [--generations N] [--population K] [--reps R] [--categorical ...] [--array ...]. The live, token-spending entry, invoked by hand and never in CI. Auth via LLM_API_KEY / TANGLE_API_KEY.

Public surface exported from src/index.ts: loadGoldScenarios, parseGoldJsonl, splitGold, buildAgreementJudge, fieldAgreement, runDistillation + their types.

Tests: exact agreement scores + shapes (no toBeTruthy)

src/campaign/distillation/distillation.test.ts:

- fieldAgreement: exact match = 1.0; total mismatch = 0; partial array overlap = exact Jaccard (0.5).
- Leak-detection regression: a verdict matching value_verdict but missing a public_leak_risk=true scores 2/3, strictly less than the full match 1.0. Without this the GEPA loop would have no gradient to teach leak detection.
- loadGoldScenarios / splitGold on inline fixtures: correct parse, deterministic train/test split, honors the explicit split field, and sanitizes colon ids (the real skill:foo regression).
- buildAgreementJudge: .score(...) returns the exact comparator agreement for a produced-vs-gold pair, applies only to gold-kind scenarios, throws on an out-of-range comparator score.
- DISTILL_LIVE-gated wiring test: runs runDistillation (population 2, 1 generation, 1 rep) on a 3-scenario inline gold set using a mock chat transport for the student and a stubbed fetch for the GEPA reflection. It exercises the full loop end-to-end without real tokens, pins the holdout agreement (2/3) through the campaign, and asserts that a winner surface + gate decision come back.

Intended live input

skills-internal/audits/gold/skill-verdicts.gold.jsonl (53 records), referenced only; not copied into this repo. Verified locally that the harness loads all 53, splits them 41 train / 12 holdout (honoring the gold's own split tags), scores a self-match 1.0, and drops to 6/7 when the leak flag is flipped.

Run a real distillation
Verification

- pnpm typecheck: clean
- pnpm vitest run src/campaign/distillation/distillation.test.ts: 15 passed / 1 skipped (DISTILL_LIVE off); 16 passed (DISTILL_LIVE on)
- biome check on the module + src/index.ts: clean