Ship better agent prompts with statistical confidence. One function call returns a decision packet: lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list. Same shape whether you've got a closed improvement loop or just production logs.
TypeScript first-class; Python (`agent-eval-rpc`) speaks the same wire protocol. Hosted-tier-friendly, MIT, self-hostable, no SaaS dependency.
- What you get back
- Quick start
- How it compares
- Customer journeys
- Subpath entry points
- Concepts + design
- Hosted tier
- Install + run
- Stability + versioning
- License
Whether you call selfImprove() (closed loop) or analyzeRuns() (observed runs), the report has the same shape. Here's a real one, abridged:
The recommendations array is the human-readable layer; every other section of the report is the evidence. Read the recs and act on them; the numbers are the proof.
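The report describes its lift interval as a paired bootstrap. The sketch below illustrates that general technique, assuming you have one baseline score and one candidate score per scenario. The function name `pairedBootstrapLift`, the resample count, and the seeded PRNG are illustrative choices, not the library's implementation.

```typescript
// Small seeded PRNG (mulberry32) so the sketch gives reproducible intervals.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface LiftEstimate {
  delta: number;          // mean of per-scenario differences
  ci95: [number, number]; // percentile bootstrap interval on the delta
  n: number;              // number of paired scenarios
}

// Hypothetical helper: percentile bootstrap over paired differences.
function pairedBootstrapLift(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftEstimate {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error('baseline and candidate must be non-empty and the same length');
  }
  // Pairing: each scenario contributes exactly one difference.
  const diffs = baseline.map((b, i) => (candidate[i] as number) - b);
  const delta = diffs.reduce((s, d) => s + d, 0) / diffs.length;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample the differences with replacement and record the mean.
    let sum = 0;
    for (let k = 0; k < diffs.length; k++) {
      sum += diffs[Math.floor(rand() * diffs.length)] as number;
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] as number;

  return { delta, ci95: [at(0.025), at(0.975)], n: diffs.length };
}
```

Resampling the differences rather than the two arms separately is what makes the interval "paired": scenario difficulty cancels out, so the interval reflects only the variation in the improvement itself.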
You have scenarios, a dispatch, judges, and want the loop to propose better prompts + tell you which to ship.
```ts
import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios,                            // your scenario corpus
  dispatch: async ({ scenario }) =>     // your agent: anything that returns an artifact
    await myAgent.run(scenario),
  judges: [myJudge],                    // any JudgeConfig: LLM, rule, ensemble
  baselineSurface: { systemPrompt: currentPrompt },
})

result.gateDecision   // 'ship' | 'hold' | 'need_more_work' | ...
result.lift           // raw delta on holdout
result.insight        // the full decision packet above
```

You don't have a closed loop yet; you have observed runs (production traces, an approve/reject corpus, a CSV gold set). Same report shape, no agent invocation.
```ts
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs,                                 // RunRecord[]
  outcomeSignal: {                      // optional: closes the loop on real outcomes
    metric: 'engagement_rate',
    valueByRunId: enrichedFromProd,
  },
  canaryScenarios,                      // optional: contamination probe
  analyst: myAnalystRegistry,           // optional: AI-powered failure clustering
})

report.recommendations      // ranked actions
report.failureClusters      // grouped failure modes
report.outcomeCorrelation   // judge↔outcome correlation + linear reward model
```

You have data already. Don't reshape it; pipe it through an adapter.
```ts
import {
  fromFeedbackTable,
  fromOtelSpans,
  analyzeRuns,
} from '@tangle-network/agent-eval/contract'

// Multi-rater approve/reject (Obsidian tags, Sheets, CSV, Postgres).
const { runs, raterScores } = fromFeedbackTable({
  ratings: parseYourFeedbackTable(),    // Array<{ runId, rater, rating }>
})
await analyzeRuns({ runs, raterScores })

// Production OTel traces: group by tangle.runId or traceId.
const runs2 = fromOtelSpans({ spans: yourOtelStream })
await analyzeRuns({ runs: runs2 })
```

Both intake adapters preserve every signal in the source. Multi-rater scores stay rater-keyed so the report can compute inter-rater agreement and surface the disagreement triage list.
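To make the disagreement triage idea concrete, here is a minimal sketch that ranks runs by rating spread from rater-keyed rows. It assumes numeric ratings (for example approve = 1, reject = 0); `Rating`, `Disagreement`, and `rankDisagreements` are hypothetical names for illustration, and the library may rank disagreement differently.

```typescript
interface Rating { runId: string; rater: string; rating: number }
interface Disagreement { runId: string; raters: number; spread: number }

// Hypothetical helper: rank runs by how far apart their raters are.
function rankDisagreements(ratings: Rating[], topK = 20): Disagreement[] {
  // Keep scores grouped per run so no rater signal is averaged away early.
  const byRun = new Map<string, number[]>();
  for (const r of ratings) {
    const scores = byRun.get(r.runId) ?? [];
    scores.push(r.rating);
    byRun.set(r.runId, scores);
  }

  const rows: Disagreement[] = [];
  byRun.forEach((scores, runId) => {
    if (scores.length < 2) return; // a single rater cannot disagree with anyone
    rows.push({
      runId,
      raters: scores.length,
      spread: Math.max(...scores) - Math.min(...scores),
    });
  });

  // Largest spread first: these are the runs most worth a second look.
  return rows.sort((a, b) => b.spread - a.spread).slice(0, topK);
}
```

Spread (max minus min) is the simplest possible disagreement score; a chance-corrected statistic such as kappa answers a different question (overall rater reliability) and would be computed across all jointly rated runs.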
| | LangSmith | Braintrust | Phoenix | agent-eval |
|---|---|---|---|---|
| Closed-loop self-improvement | ✱ human-in-loop | ✱ experiment-driven | — | ✓ autonomous + gated |
| Statistical lift CI (paired bootstrap) | — | partial | — | ✓ |
| Judge calibration + bias detection | — | — | — | ✓ |
| Inter-rater agreement + disagreement triage | — | — | — | ✓ |
| Contamination / canary check | — | — | — | ✓ |
| AI-driven failure clustering | partial | — | partial | ✓ |
| Cost-quality Pareto | — | — | — | ✓ |
| Multi-language clients (TS + Python) | TS only | TS only | TS + Py | ✓ TS + Py |
| Self-hostable / no-SaaS option | — | — | OSS | ✓ MIT, OSS |
| Substrate vs SaaS shape | SaaS | SaaS | OSS server | library |
| Hosted tier (optional) | required | required | optional | optional |
Position: agent-eval is the substrate (one library, decision-grade output); the others are SaaS products built around a substrate. If you want a closed loop that ships your prompt under statistical confidence, you call agent-eval. If you want a dashboard rendered from your data, you pipe agent-eval into the hosted tier or your own renderer.
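For readers unfamiliar with the cost-quality Pareto row in the table, the sketch below shows what a Pareto frontier over (cost, quality) points means. `CostQualityPoint` and `paretoFrontier` are hypothetical names, and this is the textbook sweep rather than the library's code.

```typescript
interface CostQualityPoint { id: string; cost: number; quality: number }

// Hypothetical helper: keep only points that no other point beats on
// both axes (lower-or-equal cost and higher-or-equal quality, one strict).
function paretoFrontier(points: CostQualityPoint[]): CostQualityPoint[] {
  // Cheapest first; among equal costs, best quality first.
  const sorted = points
    .slice()
    .sort((a, b) => a.cost - b.cost || b.quality - a.quality);

  const frontier: CostQualityPoint[] = [];
  let bestQuality = -Infinity;
  for (const p of sorted) {
    // A more expensive point survives only if it improves on quality.
    if (p.quality > bestQuality) {
      frontier.push(p);
      bestQuality = p.quality;
    }
  }
  return frontier;
}
```

Every variant off the frontier is dominated: some other variant is at least as cheap and at least as good, so it is never the right choice on these two axes alone.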
Three runnable examples — each is self-contained, each shows the actual output.
| Journey | Example | Who it's for |
|---|---|---|
| Closed loop: improve a prompt under statistical confidence | `examples/selfimprove-quickstart/` | Teams with scenarios + judges + agent in hand |
| Multi-rater feedback corpus: turn Obsidian/Sheets/CSV ratings into actionable insights | `examples/customer-feedback-loop/` | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
| Production OTel traces: analyze logs you already have, no closed loop required | `examples/customer-otel-traces/` | Teams running agents in prod with observability, no eval discipline yet |
Each example: README.md + a single index.ts runnable via pnpm tsx. Prints the resulting InsightReport to stdout.
| Subpath | What it gives you |
|---|---|
| `@tangle-network/agent-eval/contract` | The headline surface. selfImprove, analyzeRuns, runImprovementLoop, runCampaign, runEval, diffRuns, intake adapters (fromFeedbackTable, fromOtelSpans), drivers (gepaDriver, evolutionaryDriver), gates (defaultProductionGate, heldOutGate, composeGate), storage. New code starts here. |
| `@tangle-network/agent-eval/hosted` | Hosted-tier wire-format types + createHostedClient to ship eval-run events + trace spans to any orchestrator speaking the spec |
| `@tangle-network/agent-eval/adapters/otel` | createOtelBridge: forwards OpenTelemetry-shape spans into the hosted-tier ingest |
| `@tangle-network/agent-eval/adapters/langchain` | LangChain runnable → Dispatch adapter |
| `@tangle-network/agent-eval/adapters/http` | httpDispatch + runDispatchServer for distributed campaigns across machines |
| `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives (storage, drivers, types) |
| `@tangle-network/agent-eval/multishot` | N-shot persona × shot matrix runner |
| `@tangle-network/agent-eval/control` | Agent control loop primitives (runAgentControlLoop, action policy, propose/review) |
| `@tangle-network/agent-eval/traces` | Trace stores, emitters, OTLP-JSONL replay |
| `@tangle-network/agent-eval/reporting` | Release confidence, paired stats, sequential e-values, launch reports |
| `@tangle-network/agent-eval/rl` | RL bridge: verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, auto-research |
| `@tangle-network/agent-eval/matrix` | N-axis cartesian over substrate types |
| `@tangle-network/agent-eval/wire` | HTTP/RPC server + Zod schemas (same protocol the Python client speaks) |
| `@tangle-network/agent-eval/benchmarks` | Benchmark adapter contracts and reference wrappers |
The root export remains available for backward compatibility; new code should prefer focused subpaths. Anything under /rl, /pipelines, /meta-eval, /prm, or /builder-eval is only reachable via its subpath.
- `docs/concepts.md`: five types, three top-level functions, the layering rule, the wire protocol contract
- `docs/insight-report.md`: annotated walkthrough of every section of the decision packet
- `docs/customer-journeys.md`: three end-to-end journeys with code + expected output
- `docs/adapters-observability.md`: composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAI
- `docs/wire-protocol.md`: the HTTP/RPC contract Python (and any future language) speaks
- `docs/hosted-ingest-spec.md`: the hosted-tier wire format, frozen at `2026-05-26.v1`
- `docs/design/`: RFCs + architectural notes
The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.
Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:
```ts
await selfImprove({
  scenarios, dispatch, judges, baselineSurface,
  hostedTenant: {
    endpoint: 'https://intelligence.tangle.tools',
    apiKey: process.env.TANGLE_API_KEY!,
    tenantId: 'your-tenant',
  },
})
```

The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data: never sent. Spec at `docs/hosted-ingest-spec.md`; reference receiver at `examples/hosted-ingest-server/`.
```bash
pnpm add @tangle-network/agent-eval
# or, from Python:
pip install agent-eval-rpc
```

Run an example:

```bash
pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.ts
```

Run the test suite:

```bash
pnpm install
pnpm build
pnpm test
```

Public exports carry JSDoc stability markers visible in IDE hover + .d.ts:
| Tag | Meaning |
|---|---|
| `@stable` | API frozen at this major. Breaking changes require a major bump. |
| `@experimental` | Interface may evolve before becoming `@stable`. Pin the patch version if you depend on it. |
| `@internal` | Not part of the public contract. Use the documented subpath instead. |
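As an illustration of how such markers look in source, here is a minimal sketch. `clampScore` and `rescaleScores` are hypothetical functions invented for this example and are not part of the package; only the tag names come from the table above.

```typescript
// In a real package these would be exported; the declarations are kept
// local here so the sketch stands alone.

/**
 * Clamp a judge score into the closed interval [0, 1].
 *
 * @stable
 */
function clampScore(score: number): number {
  return Math.min(1, Math.max(0, score));
}

/**
 * Rescale scores linearly so the minimum maps to 0 and the maximum to 1.
 * Returns all zeros when every score is identical.
 *
 * @experimental
 */
function rescaleScores(scores: number[]): number[] {
  if (scores.length === 0) return [];
  const lo = Math.min(...scores);
  const hi = Math.max(...scores);
  if (hi === lo) return scores.map(() => 0);
  return scores.map((s) => (s - lo) / (hi - lo));
}
```

Because the tags live in the doc comment, they survive into the emitted `.d.ts` and show up in editor hover text without any extra tooling.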
CHANGELOG.md tracks every release with what's new / additive / breaking.
MIT. See LICENSE.
```jsonc
{
  "n": 80,                                   // runs analyzed
  "composite": {                             // distributional summary
    "mean": 0.62, "p50": 0.65, "p95": 0.88, "stddev": 0.17,
    "histogram": [/* 12 bins */]
  },
  "lift": {                                  // paired bootstrap
    "baselineMean": 0.58, "candidateMean": 0.65,
    "delta": 0.07,
    "ci95": [0.04, 0.10],                    // 95% CI on the delta
    "pValue": 0.0008,                        // paired-t
    "cohensD": 0.41,
    "n": 40,
    "mde": 0.06,                             // min detectable effect at 80% power
    "requiredN": 38                          // n needed to detect observed delta
  },
  "judges": {                                // per-judge calibration
    "domain-expert":   { "n": 80, "meanScore": 0.64 },
    "helpfulness-llm": { "n": 80, "meanScore": 0.61 }
  },
  "interRater": {                            // multi-rater agreement
    "raters": 3, "jointlyRated": 80, "kappa": 0.71,
    "disagreementCases": [/* top 20 ranked by spread */]
  },
  "costQuality": {                           // cost-vs-quality
    "cost": { "mean": 0.024, "p95": 0.041, /* ... */ },
    "pareto": { /* ParetoFigureSpec the dashboard renders */ }
  },
  "failureClusters": {                       // when an AnalystRegistry is wired
    "totalFailures": 11,
    "clusters": [
      { "name": "off-topic-drift", "share": 0.45, "exemplars": ["run-12", "run-19"] },
      { "name": "over-confidence", "share": 0.27, "exemplars": ["run-3"] },
      { "name": "format-mismatch", "share": 0.18, "exemplars": ["run-41"] }
    ]
  },
  "contamination": { "leaks": 0, "holdoutAuditPassed": true },
  "outcomeCorrelation": {                    // when downstream metric supplied
    "metric": "engagement_rate", "n": 80,
    "pearson": 0.72, "spearman": 0.69,
    "rewardModel": { "intercept": 0.04, "slope": 1.93, "r2": 0.52 }
  },
  "release": {
    "status": "pass",
    "axes": [
      { "name": "quality-lift",           "status": "pass" },
      { "name": "contamination",          "status": "pass" },
      { "name": "composite-distribution", "status": "pass" }
    ]
  },
  "recommendations": [
    {
      "priority": "critical",
      "kind": "ship",
      "title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
      "detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41)."
    },
    {
      "priority": "high",
      "kind": "investigate",
      "title": "Top failure cluster: off-topic-drift (45% of failures)",
      "detail": "11 runs failed. Drill into exemplars run-12 / run-19 to identify the pattern."
    }
  ]
}
```
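The `outcomeCorrelation` section above reports a Pearson correlation and a linear reward model with `intercept`, `slope`, and `r2`. A minimal sketch of how such numbers can be derived with ordinary least squares follows, assuming one judge score and one downstream outcome per run. `fitRewardModel` is a hypothetical name, and the library's actual fitting procedure may differ.

```typescript
interface RewardModel {
  intercept: number;
  slope: number;
  pearson: number;
  r2: number;
}

// Hypothetical helper: single-predictor least squares, outcome ≈ intercept + slope * judgeScore.
function fitRewardModel(judgeScores: number[], outcomes: number[]): RewardModel {
  const n = judgeScores.length;
  if (n < 2 || n !== outcomes.length) {
    throw new Error('need at least two paired observations');
  }
  const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;
  const mx = mean(judgeScores);
  const my = mean(outcomes);

  // Centered sums of squares and cross-products.
  let sxx = 0;
  let syy = 0;
  let sxy = 0;
  for (let i = 0; i < n; i++) {
    const dx = (judgeScores[i] as number) - mx;
    const dy = (outcomes[i] as number) - my;
    sxx += dx * dx;
    syy += dy * dy;
    sxy += dx * dy;
  }
  if (sxx === 0 || syy === 0) {
    throw new Error('judge scores and outcomes must both vary');
  }

  const slope = sxy / sxx;
  const intercept = my - slope * mx;
  const pearson = sxy / Math.sqrt(sxx * syy);
  // With a single predictor, R² is the square of the Pearson correlation.
  return { intercept, slope, pearson, r2: pearson * pearson };
}
```

The sample report is consistent with this relationship: a Pearson correlation of 0.72 squares to roughly 0.52, the reported `r2`.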