`@tangle-network/agent-eval`

Ship better agent prompts with statistical confidence. One function call returns a decision packet: lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list. Same shape whether you've got a closed improvement loop or just production logs.

TypeScript is first-class, and the Python client (`agent-eval-rpc`) speaks the same wire protocol. The library is MIT-licensed and self-hostable, with an optional hosted tier and no SaaS dependency.

What you get back: the decision packet

Whether you call selfImprove() (closed loop) or analyzeRuns() (observed runs), the report has the same shape. Here's a real one, abridged:

{
  "n": 80,                                            // runs analyzed
  "composite": {                                       // distributional summary
    "mean": 0.62, "p50": 0.65, "p95": 0.88, "stddev": 0.17,
    "histogram": [/* 12 bins */]
  },
  "lift": {                                            // paired bootstrap
    "baselineMean": 0.58, "candidateMean": 0.65,
    "delta": 0.07,
    "ci95": [0.04, 0.10],                              // 95% CI on the delta
    "pValue": 0.0008,                                  // paired-t
    "cohensD": 0.41,
    "n": 40,
    "mde": 0.06,                                       // min detectable effect at 80% power
    "requiredN": 38                                    // n needed to detect observed delta
  },
  "judges": {                                          // per-judge calibration
    "domain-expert": { "n": 80, "meanScore": 0.64 },
    "helpfulness-llm": { "n": 80, "meanScore": 0.61 }
  },
  "interRater": {                                      // multi-rater agreement
    "raters": 3, "jointlyRated": 80, "kappa": 0.71,
    "disagreementCases": [/* top 20 ranked by spread */]
  },
  "costQuality": {                                     // cost-vs-quality
    "cost": { "mean": 0.024, "p95": 0.041, /* ... */ },
    "pareto": { /* ParetoFigureSpec the dashboard renders */ }
  },
  "failureClusters": {                                 // when an AnalystRegistry is wired
    "totalFailures": 11,
    "clusters": [
      { "name": "off-topic-drift",  "share": 0.45, "exemplars": ["run-12", "run-19"] },
      { "name": "over-confidence",  "share": 0.27, "exemplars": ["run-3"] },
      { "name": "format-mismatch",  "share": 0.18, "exemplars": ["run-41"] }
    ]
  },
  "contamination": { "leaks": 0, "holdoutAuditPassed": true },
  "outcomeCorrelation": {                              // when downstream metric supplied
    "metric": "engagement_rate", "n": 80,
    "pearson": 0.72, "spearman": 0.69,
    "rewardModel": { "intercept": 0.04, "slope": 1.93, "r2": 0.52 }
  },
  "release": {
    "status": "pass",
    "axes": [
      { "name": "quality-lift",          "status": "pass" },
      { "name": "contamination",         "status": "pass" },
      { "name": "composite-distribution","status": "pass" }
    ]
  },
  "recommendations": [
    { "priority": "critical", "kind": "ship",
      "title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
      "detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41)." },
    { "priority": "high", "kind": "investigate",
      "title": "Top failure cluster: off-topic-drift (45% of failures)",
      "detail": "11 runs failed. Drill into exemplars run-12 / run-19 to identify the pattern." }
  ]
}

The recommendations array is the human-readable layer; everything above it is the evidence. Read the recommendations and act on them; the numbers are the proof.
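
To make the `lift` block above concrete, here is a minimal sketch of a paired percentile bootstrap on the per-scenario score differences. It is an illustration under stated assumptions, not the library's implementation: the function name `pairedBootstrapLift`, the resample count, and the seeded PRNG are all hypothetical choices made so the example is reproducible.

```typescript
// Small seeded PRNG (mulberry32) so repeated runs give the same interval.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface LiftEstimate {
  delta: number;          // mean of the paired differences
  ci95: [number, number]; // percentile bootstrap interval on that mean
}

// Hypothetical helper: scores are paired by index (same scenario, two prompts).
function pairedBootstrapLift(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftEstimate {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("baseline and candidate must be non-empty and paired");
  }
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const delta = diffs.reduce((s, x) => s + x, 0) / diffs.length;

  // Resample the differences with replacement and record each resample mean.
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);

  // Take the 2.5th and 97.5th percentiles of the resample means.
  const lo = means[Math.floor(0.025 * (resamples - 1))];
  const hi = means[Math.ceil(0.975 * (resamples - 1))];
  return { delta, ci95: [lo, hi] };
}
```

Resampling the differences rather than the two score lists independently is what makes the estimate paired: scenario-level difficulty cancels out, which usually tightens the interval.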

Quick start

Closed loop — `selfImprove()`

You have scenarios, a dispatch function, and judges, and you want the loop to propose better prompts and tell you which one to ship.

import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios,                                // your scenario corpus
  dispatch: async ({ scenario }) =>          // your agent — anything that returns an artifact
    await myAgent.run(scenario),
  judges: [myJudge],                         // any JudgeConfig — LLM, rule, ensemble
  baselineSurface: { systemPrompt: currentPrompt },
})

result.gateDecision         // 'ship' | 'hold' | 'need_more_work' | ...
result.lift                 // raw delta on holdout
result.insight              // the full decision packet above
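
The sample recommendation in the decision packet reads "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence". One plausible reading of that rule is that the lower bound of the 95% interval must clear the threshold. The sketch below encodes that reading only; `decideOnLift`, its default threshold, and the two-value decision type are assumptions for illustration and do not describe how the library's gates actually decide.

```typescript
// Assumed subset of the decision values; the real union has more members.
type SketchDecision = "ship" | "hold";

interface LiftSummary {
  delta: number;
  ci95: [number, number];
}

// Hypothetical rule: ship only when even the pessimistic end of the
// interval beats the minimum lift we care about.
function decideOnLift(lift: LiftSummary, threshold = 0.02): SketchDecision {
  const [lower] = lift.ci95;
  return lower > threshold ? "ship" : "hold";
}

// With the numbers from the abridged report: lower bound 0.04 > 0.02.
const decision = decideOnLift({ delta: 0.07, ci95: [0.04, 0.10] });
console.log(decision);
```

Gating on the lower bound rather than on `delta` itself means a large but noisy improvement is held back until more runs narrow the interval.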

Observed runs — `analyzeRuns()`

You don't have a closed loop yet — you have observed runs (production traces, an approve/reject corpus, a CSV gold set). Same report shape, no agent invocation.

import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs,                                     // RunRecord[]
  outcomeSignal: {                          // optional — closes the loop on real outcomes
    metric: 'engagement_rate',
    valueByRunId: enrichedFromProd,
  },
  canaryScenarios,                          // optional — contamination probe
  analyst: myAnalystRegistry,               // optional — AI-powered failure clustering
})

report.recommendations    // ranked actions
report.failureClusters    // grouped failure modes
report.outcomeCorrelation // judge↔outcome correlation + linear reward model
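
For intuition about what `outcomeCorrelation` summarizes, the following is a minimal sketch of a Pearson correlation and a one-variable least-squares fit from judge scores to a downstream metric. The function name `fitOutcome` and its input shape are hypothetical, and the library may compute these figures differently.

```typescript
interface OutcomeFit {
  pearson: number;
  intercept: number;
  slope: number;
  r2: number;
}

// Hypothetical helper: scores[i] and outcomes[i] belong to the same run.
function fitOutcome(scores: number[], outcomes: number[]): OutcomeFit {
  const n = scores.length;
  if (n < 2 || n !== outcomes.length) {
    throw new Error("need at least two paired observations");
  }
  const meanX = scores.reduce((s, x) => s + x, 0) / n;
  const meanY = outcomes.reduce((s, y) => s + y, 0) / n;

  // Centered sums of squares and cross-products.
  let sxx = 0;
  let syy = 0;
  let sxy = 0;
  for (let i = 0; i < n; i++) {
    const dx = scores[i] - meanX;
    const dy = outcomes[i] - meanY;
    sxx += dx * dx;
    syy += dy * dy;
    sxy += dx * dy;
  }
  if (sxx === 0 || syy === 0) {
    throw new Error("correlation is undefined when either series is constant");
  }

  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  const pearson = sxy / Math.sqrt(sxx * syy);
  // For a single-predictor least-squares line, R^2 is the squared Pearson r.
  return { pearson, intercept, slope, r2: pearson * pearson };
}
```

That last identity matches the abridged report earlier in this README, where a Pearson value of 0.72 sits next to an `r2` of 0.52 (0.72² ≈ 0.52).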

Existing data — intake adapters

You have data already. Don't reshape it — pipe it through an adapter.

import {
  fromFeedbackTable,
  fromOtelSpans,
  analyzeRuns,
} from '@tangle-network/agent-eval/contract'

// Multi-rater approve/reject (Obsidian tags, Sheets, CSV, Postgres).
const { runs, raterScores } = fromFeedbackTable({
  ratings: parseYourFeedbackTable(),         // Array<{ runId, rater, rating }>
})
await analyzeRuns({ runs, raterScores })

// Production OTel traces — group by tangle.runId or traceId.
const runs2 = fromOtelSpans({ spans: yourOtelStream })
await analyzeRuns({ runs: runs2 })

Both intake adapters preserve every signal in the source — multi-rater scores stay rater-keyed so the report can compute inter-rater agreement and surface the disagreement triage list.
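
As a rough illustration of why rater-keyed scores matter, here is a minimal sketch of Fleiss' kappa computed over the runs that every rater scored. This is a stand-in technique: the README does not say which agreement statistic the report uses, and the row shape below assumes `rating` is a categorical label such as "approve" or "reject". The names `RatingRow` and `fleissKappa` are hypothetical.

```typescript
interface RatingRow {
  runId: string;
  rater: string;
  rating: string; // assumed categorical label
}

interface Agreement {
  raters: number;
  jointlyRated: number;
  kappa: number; // NaN when every rating falls into a single category
}

function fleissKappa(rows: RatingRow[]): Agreement {
  const raterCount = new Set(rows.map((r) => r.rater)).size;

  // runId -> rater -> rating (a later row from the same rater overwrites).
  const byRun = new Map<string, Map<string, string>>();
  rows.forEach((row) => {
    const perRater = byRun.get(row.runId) ?? new Map<string, string>();
    perRater.set(row.rater, row.rating);
    byRun.set(row.runId, perRater);
  });

  // Fleiss' kappa needs the same number of ratings per item, so keep only
  // runs that all raters scored.
  const joint = Array.from(byRun.values()).filter((m) => m.size === raterCount);
  if (raterCount < 2 || joint.length === 0) {
    throw new Error("need at least two raters and one jointly rated run");
  }

  const categoryTotals = new Map<string, number>();
  let observedSum = 0;
  joint.forEach((perRater) => {
    const counts = new Map<string, number>();
    perRater.forEach((rating) => {
      counts.set(rating, (counts.get(rating) ?? 0) + 1);
    });
    let agreeingPairs = 0;
    counts.forEach((k, category) => {
      agreeingPairs += k * (k - 1);
      categoryTotals.set(category, (categoryTotals.get(category) ?? 0) + k);
    });
    observedSum += agreeingPairs / (raterCount * (raterCount - 1));
  });

  const observed = observedSum / joint.length;
  let expected = 0;
  categoryTotals.forEach((total) => {
    const share = total / (joint.length * raterCount);
    expected += share * share;
  });

  const kappa = expected === 1 ? NaN : (observed - expected) / (1 - expected);
  return { raters: raterCount, jointlyRated: joint.length, kappa };
}
```

If the scores were collapsed to a per-run average at intake, none of the per-run category counts above could be recovered, which is why the rater key has to survive the adapter.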

How it compares

| | LangSmith | Braintrust | Phoenix | agent-eval |
|---|---|---|---|---|
| Closed-loop self-improvement | ✱ human-in-loop | ✱ experiment-driven | — | ✓ autonomous + gated |
| Statistical lift CI (paired bootstrap) | — | partial | — | ✓ |
| Judge calibration + bias detection | — | — | — | ✓ |
| Inter-rater agreement + disagreement triage | — | — | — | ✓ |
| Contamination / canary check | — | — | — | ✓ |
| AI-driven failure clustering | partial | — | partial | ✓ |
| Cost-quality Pareto | — | — | — | ✓ |
| Multi-language clients (TS + Python) | TS only | TS only | TS + Py | ✓ TS + Py |
| Self-hostable / no-SaaS option | — | — | OSS | ✓ MIT, OSS |
| Substrate vs SaaS shape | SaaS | SaaS | OSS server | library |
| Hosted tier (optional) | required | required | optional | optional |

Position: agent-eval is the substrate (one library, decision-grade output); the others are SaaS layered around the substrate. If you want a closed loop that ships your prompt under statistical confidence, call agent-eval. If you want a dashboard rendered from your data, pipe agent-eval's output into the hosted tier or your own renderer.

Customer journeys

Three runnable examples — each is self-contained, each shows the actual output.

| Journey | Example | Who it's for |
|---|---|---|
| Closed loop — improve a prompt under statistical confidence | `examples/selfimprove-quickstart/` | Teams with scenarios + judges + agent in hand |
| Multi-rater feedback corpus — turn Obsidian/Sheets/CSV ratings into actionable insights | `examples/customer-feedback-loop/` | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
| Production OTel traces — analyze logs you already have, no closed loop required | `examples/customer-otel-traces/` | Teams running agents in prod with observability, no eval discipline yet |

Each example consists of a README.md and a single index.ts that you run via pnpm tsx; it prints the resulting InsightReport to stdout.

Subpath entry points

| Subpath | What it gives you |
|---|---|
| `@tangle-network/agent-eval/contract` | The headline surface. `selfImprove`, `analyzeRuns`, `runImprovementLoop`, `runCampaign`, `runEval`, `diffRuns`, intake adapters (`fromFeedbackTable`, `fromOtelSpans`), drivers (`gepaDriver`, `evolutionaryDriver`), gates (`defaultProductionGate`, `heldOutGate`, `composeGate`), storage. New code starts here. |
| `@tangle-network/agent-eval/hosted` | Hosted-tier wire-format types + `createHostedClient` to ship eval-run events + trace spans to any orchestrator speaking the spec |
| `@tangle-network/agent-eval/adapters/otel` | `createOtelBridge` — forwards OpenTelemetry-shape spans into the hosted-tier ingest |
| `@tangle-network/agent-eval/adapters/langchain` | LangChain runnable → `Dispatch` adapter |
| `@tangle-network/agent-eval/adapters/http` | `httpDispatch` + `runDispatchServer` for distributed campaigns across machines |
| `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives (storage, drivers, types) |
| `@tangle-network/agent-eval/multishot` | N-shot persona × shot matrix runner |
| `@tangle-network/agent-eval/control` | Agent control loop primitives (`runAgentControlLoop`, action policy, propose/review) |
| `@tangle-network/agent-eval/traces` | Trace stores, emitters, OTLP-JSONL replay |
| `@tangle-network/agent-eval/reporting` | Release confidence, paired stats, sequential e-values, launch reports |
| `@tangle-network/agent-eval/rl` | RL bridge — verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, auto-research |
| `@tangle-network/agent-eval/matrix` | N-axis cartesian over substrate types |
| `@tangle-network/agent-eval/wire` | HTTP/RPC server + Zod schemas (same protocol the Python client speaks) |
| `@tangle-network/agent-eval/benchmarks` | Benchmark adapter contracts and reference wrappers |

The root export remains available for backward compatibility; new code should prefer focused subpaths. Anything under /rl, /pipelines, /meta-eval, /prm, or /builder-eval is only reachable via its subpath.

Concepts + design

docs/concepts.md — five types, three top-level functions, the layering rule, the wire protocol contract
docs/insight-report.md — annotated walkthrough of every section of the decision packet
docs/customer-journeys.md — three end-to-end journeys with code + expected output
docs/adapters-observability.md — composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAI
docs/wire-protocol.md — the HTTP/RPC contract Python (and any future language) speaks
docs/hosted-ingest-spec.md — the hosted-tier wire format, frozen at 2026-05-26.v1
docs/design/ — RFCs + architectural notes

The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.

Hosted tier

Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:

await selfImprove({
  scenarios, dispatch, judges, baselineSurface,
  hostedTenant: {
    endpoint: 'https://intelligence.tangle.tools',
    apiKey: process.env.TANGLE_API_KEY!,
    tenantId: 'your-tenant',
  },
})

The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at docs/hosted-ingest-spec.md; reference receiver at examples/hosted-ingest-server/.

Install + run

pnpm add @tangle-network/agent-eval
# or, from Python:
pip install agent-eval-rpc

Run an example:

pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.ts

Run the test suite:

pnpm install
pnpm build
pnpm test

Stability + versioning

Public exports carry JSDoc stability markers visible in IDE hover + .d.ts:

| Tag | Meaning |
|---|---|
| `@stable` | API frozen at this major. Breaking changes require a major bump. |
| `@experimental` | Interface may evolve before becoming `@stable`. Pin the patch version if you depend on it. |
| `@internal` | Not part of the public contract. Use the documented subpath instead. |
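
As a small illustration of how such markers look in source, the sketch below tags three functions with the JSDoc block tags from the table. The function names are hypothetical and not part of this package; the point is only that the tag lives in the doc comment, so it travels into the emitted `.d.ts` and shows up on IDE hover.

```typescript
/**
 * Clamps a raw judge score into the range 0..1.
 * @stable
 */
export function clampScore(raw: number): number {
  return Math.min(1, Math.max(0, raw));
}

/**
 * Averages clamped scores; returns 0 for an empty list.
 * @experimental
 */
export function meanScore(raws: number[]): number {
  if (raws.length === 0) return 0;
  return sumClamped(raws) / raws.length;
}

/**
 * Helper used by meanScore; not exported.
 * @internal
 */
function sumClamped(raws: number[]): number {
  return raws.reduce((total, raw) => total + clampScore(raw), 0);
}
```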

CHANGELOG.md tracks every release with what's new / additive / breaking.

License

MIT. See LICENSE.
