tangle-network/agent-eval

@tangle-network/agent-eval

Ship better agent prompts with statistical confidence. One function call returns a decision packet: lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list. Same shape whether you've got a closed improvement loop or just production logs.


TypeScript is first-class, and the Python client (agent-eval-rpc) speaks the same wire protocol. MIT-licensed, self-hostable, and hosted-tier-friendly, with no SaaS dependency.



What you get back: the decision packet

Whether you call selfImprove() (closed loop) or analyzeRuns() (observed runs), the report has the same shape. Here's a real one, abridged:

{
  "n": 80,                                            // runs analyzed
  "composite": {                                       // distributional summary
    "mean": 0.62, "p50": 0.65, "p95": 0.88, "stddev": 0.17,
    "histogram": [/* 12 bins */]
  },
  "lift": {                                            // paired bootstrap
    "baselineMean": 0.58, "candidateMean": 0.65,
    "delta": 0.07,
    "ci95": [0.04, 0.10],                              // 95% CI on the delta
    "pValue": 0.0008,                                  // paired-t
    "cohensD": 0.41,
    "n": 40,
    "mde": 0.06,                                       // min detectable effect at 80% power
    "requiredN": 38                                    // n needed to detect observed delta
  },
  "judges": {                                          // per-judge calibration
    "domain-expert": { "n": 80, "meanScore": 0.64 },
    "helpfulness-llm": { "n": 80, "meanScore": 0.61 }
  },
  "interRater": {                                      // multi-rater agreement
    "raters": 3, "jointlyRated": 80, "kappa": 0.71,
    "disagreementCases": [/* top 20 ranked by spread */]
  },
  "costQuality": {                                     // cost-vs-quality
    "cost": { "mean": 0.024, "p95": 0.041, /* ... */ },
    "pareto": { /* ParetoFigureSpec the dashboard renders */ }
  },
  "failureClusters": {                                 // when an AnalystRegistry is wired
    "totalFailures": 11,
    "clusters": [
      { "name": "off-topic-drift",  "share": 0.45, "exemplars": ["run-12", "run-19"] },
      { "name": "over-confidence",  "share": 0.27, "exemplars": ["run-3"] },
      { "name": "format-mismatch",  "share": 0.18, "exemplars": ["run-41"] }
    ]
  },
  "contamination": { "leaks": 0, "holdoutAuditPassed": true },
  "outcomeCorrelation": {                              // when downstream metric supplied
    "metric": "engagement_rate", "n": 80,
    "pearson": 0.72, "spearman": 0.69,
    "rewardModel": { "intercept": 0.04, "slope": 1.93, "r2": 0.52 }
  },
  "release": {
    "status": "pass",
    "axes": [
      { "name": "quality-lift",          "status": "pass" },
      { "name": "contamination",         "status": "pass" },
      { "name": "composite-distribution","status": "pass" }
    ]
  },
  "recommendations": [
    { "priority": "critical", "kind": "ship",
      "title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
      "detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41)." },
    { "priority": "high", "kind": "investigate",
      "title": "Top failure cluster: off-topic-drift (45% of failures)",
      "detail": "11 runs failed. Drill into exemplars run-12 / run-19 to identify the pattern." }
  ]
}

The recommendations array is the human-readable layer; everything above it is the evidence. Read the recommendations and act on them; the numbers are the proof.
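As an illustration of consuming that layer, the sketch below filters an already-ranked recommendations array down to the urgent items and prints one line per item. The types are hypothetical local definitions based only on the fields visible in the sample packet; the 'medium' and 'low' priority levels and the helper names `atOrAbove` and `formatRecommendation` are assumptions, not part of the library.

```typescript
// Hypothetical local types, not the library's exports. Only `priority`, `kind`,
// `title`, and `detail` appear in the sample packet; 'medium' and 'low' are assumed.
type Priority = 'critical' | 'high' | 'medium' | 'low'

interface Recommendation {
  priority: Priority
  kind: string
  title: string
  detail: string
}

const PRIORITY_RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 }

// Keep recommendations at or above `floor`, preserving the report's own ranking.
function atOrAbove(recs: readonly Recommendation[], floor: Priority): Recommendation[] {
  return recs.filter((rec) => PRIORITY_RANK[rec.priority] <= PRIORITY_RANK[floor])
}

// Render one recommendation as a single line for a CI log or a pull-request comment.
function formatRecommendation(rec: Recommendation): string {
  return `[${rec.priority}] ${rec.kind}: ${rec.title}`
}

const sampleRecs: Recommendation[] = [
  { priority: 'critical', kind: 'ship', title: 'Ship', detail: 'Holdout lift exceeds threshold.' },
  { priority: 'high', kind: 'investigate', title: 'Top failure cluster: off-topic-drift', detail: '11 runs failed.' },
  { priority: 'low', kind: 'investigate', title: 'Hypothetical low-priority item', detail: 'Illustration only.' },
]

for (const rec of atOrAbove(sampleRecs, 'high')) {
  console.log(formatRecommendation(rec))
}
```

Filtering rather than re-sorting keeps the order the report already chose, so the first surviving item is still the report's top action.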


Quick start

Closed loop — selfImprove()

You have scenarios, a dispatch function, and judges, and you want the loop to propose better prompts and tell you which one to ship.

import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios,                                // your scenario corpus
  dispatch: async ({ scenario }) =>          // your agent — anything that returns an artifact
    await myAgent.run(scenario),
  judges: [myJudge],                         // any JudgeConfig — LLM, rule, ensemble
  baselineSurface: { systemPrompt: currentPrompt },
})

result.gateDecision         // 'ship' | 'hold' | 'need_more_work' | ...
result.lift                 // raw delta on holdout
result.insight              // the full decision packet above
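`result.lift` is the raw holdout delta; the decision packet pairs it with a bootstrap confidence interval. The sketch below shows one common way to compute such an interval: a percentile bootstrap over per-scenario score differences. It is an illustration under stated assumptions, not the library's implementation. `pairedBootstrapLift`, `mulberry32`, the resample count, and the seed are hypothetical names and defaults chosen for the example.

```typescript
// Small seeded PRNG (mulberry32) so the sketch gives reproducible output.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface LiftEstimate {
  delta: number            // mean of the paired differences
  ci95: [number, number]   // percentile bootstrap interval on the delta
  n: number                // number of pairs
}

// Assumes baseline[i] and candidate[i] are scores for the same scenario.
function pairedBootstrapLift(
  baseline: readonly number[],
  candidate: readonly number[],
  resamples = 2000,
  seed = 1,
): LiftEstimate {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error('baseline and candidate must be non-empty and the same length')
  }
  const n = baseline.length
  // Pairing: each scenario contributes exactly one difference.
  const diffs = baseline.map((b, i) => candidate[i]! - b)
  const delta = diffs.reduce((sum, d) => sum + d, 0) / n

  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    // Resample pairs with replacement, keeping each pair intact.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)]!
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))]!
  const hi = means[Math.ceil(0.975 * (resamples - 1))]!
  return { delta, ci95: [lo, hi], n }
}

const estimate = pairedBootstrapLift([0.5, 0.6, 0.7, 0.55], [0.6, 0.7, 0.8, 0.65])
console.log(estimate)
```

Resampling whole pairs, rather than resampling baseline and candidate scores independently, removes per-scenario difficulty from the noise; that is the point of a paired design.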

Observed runs — analyzeRuns()

You don't have a closed loop yet — you have observed runs (production traces, an approve/reject corpus, a CSV gold set). Same report shape, no agent invocation.

import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs,                                     // RunRecord[]
  outcomeSignal: {                          // optional — closes the loop on real outcomes
    metric: 'engagement_rate',
    valueByRunId: enrichedFromProd,
  },
  canaryScenarios,                          // optional — contamination probe
  analyst: myAnalystRegistry,               // optional — AI-powered failure clustering
})

report.recommendations    // ranked actions
report.failureClusters    // grouped failure modes
report.outcomeCorrelation // judge↔outcome correlation + linear reward model
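To make the `outcomeCorrelation` block concrete, here is a minimal sketch of a Pearson correlation plus a one-variable least-squares fit that predicts the downstream metric from the judge score. The direction of the fit, the function name `fitOutcome`, and the local types are assumptions for illustration; the library may compute these differently, and Spearman correlation is omitted. For a simple linear fit, r² equals the squared Pearson coefficient, which matches the sample packet (0.72² ≈ 0.52).

```typescript
// Hypothetical local types mirroring the fields shown in the sample packet.
interface RewardModel { intercept: number; slope: number; r2: number }
interface OutcomeFit { n: number; pearson: number; rewardModel: RewardModel }

// scores[i] is the judge score and outcomes[i] the downstream metric for the same run.
function fitOutcome(scores: readonly number[], outcomes: readonly number[]): OutcomeFit {
  const n = scores.length
  if (n < 2 || n !== outcomes.length) {
    throw new Error('need at least two runs and equal-length inputs')
  }
  const meanX = scores.reduce((s, x) => s + x, 0) / n
  const meanY = outcomes.reduce((s, y) => s + y, 0) / n

  let sxx = 0
  let syy = 0
  let sxy = 0
  for (let i = 0; i < n; i++) {
    const dx = scores[i]! - meanX
    const dy = outcomes[i]! - meanY
    sxx += dx * dx
    syy += dy * dy
    sxy += dx * dy
  }
  if (sxx === 0 || syy === 0) {
    throw new Error('correlation is undefined when either series has zero variance')
  }

  const pearson = sxy / Math.sqrt(sxx * syy)
  const slope = sxy / sxx                 // ordinary least squares, outcome ~ score
  const intercept = meanY - slope * meanX
  return { n, pearson, rewardModel: { intercept, slope, r2: pearson * pearson } }
}

// Hypothetical data: the outcome is exactly 2 * score + 1.
console.log(fitOutcome([0.2, 0.4, 0.6, 0.8], [1.4, 1.8, 2.2, 2.6]))
```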

Existing data — intake adapters

You have data already. Don't reshape it — pipe it through an adapter.

import {
  fromFeedbackTable,
  fromOtelSpans,
  analyzeRuns,
} from '@tangle-network/agent-eval/contract'

// Multi-rater approve/reject (Obsidian tags, Sheets, CSV, Postgres).
const { runs, raterScores } = fromFeedbackTable({
  ratings: parseYourFeedbackTable(),         // Array<{ runId, rater, rating }>
})
await analyzeRuns({ runs, raterScores })

// Production OTel traces — group by tangle.runId or traceId.
const runs2 = fromOtelSpans({ spans: yourOtelStream })
await analyzeRuns({ runs: runs2 })

Both intake adapters preserve every signal in the source — multi-rater scores stay rater-keyed so the report can compute inter-rater agreement and surface the disagreement triage list.
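Keeping scores rater-keyed is what makes an agreement statistic computable. The sketch below computes Fleiss' kappa over runs that every rater scored. It is one standard multi-rater statistic and is used here as an assumption; the source does not say which kappa variant the report uses. The row shape follows the `{ runId, rater, rating }` comment in the snippet above, while the string-valued `rating` and the function name `fleissKappa` are hypothetical.

```typescript
// Row shape from the adapter comment above; a string-valued rating is assumed.
interface Rating { runId: string; rater: string; rating: string }

function fleissKappa(ratings: readonly Rating[]): number {
  const raters = new Set<string>()
  // runId -> rater -> rating (last write wins if a rater scored a run twice)
  const byRun = new Map<string, Map<string, string>>()
  for (const row of ratings) {
    raters.add(row.rater)
    const perRater = byRun.get(row.runId) ?? new Map<string, string>()
    perRater.set(row.rater, row.rating)
    byRun.set(row.runId, perRater)
  }

  const n = raters.size
  // Only jointly rated runs count: every rater must have scored the run.
  const joint = Array.from(byRun.values()).filter((perRater) => perRater.size === n)
  const N = joint.length
  if (n < 2 || N === 0) {
    throw new Error('need at least two raters and one jointly rated run')
  }

  const categoryTotals = new Map<string, number>()
  let agreementSum = 0
  joint.forEach((perRater) => {
    const counts = new Map<string, number>()
    perRater.forEach((label) => counts.set(label, (counts.get(label) ?? 0) + 1))
    let sumSquares = 0
    counts.forEach((count, label) => {
      sumSquares += count * count
      categoryTotals.set(label, (categoryTotals.get(label) ?? 0) + count)
    })
    // Share of rater pairs that agree on this run.
    agreementSum += (sumSquares - n) / (n * (n - 1))
  })

  const observed = agreementSum / N
  let expected = 0
  categoryTotals.forEach((total) => {
    const p = total / (N * n)
    expected += p * p
  })
  // If every rating used the same label, chance agreement is 1; report full agreement.
  if (expected === 1) return 1
  return (observed - expected) / (1 - expected)
}

// Hypothetical ratings: three raters, two jointly rated runs, full agreement.
const sampleRatings: Rating[] = [
  { runId: 'run-1', rater: 'a', rating: 'approve' },
  { runId: 'run-1', rater: 'b', rating: 'approve' },
  { runId: 'run-1', rater: 'c', rating: 'approve' },
  { runId: 'run-2', rater: 'a', rating: 'reject' },
  { runId: 'run-2', rater: 'b', rating: 'reject' },
  { runId: 'run-2', rater: 'c', rating: 'reject' },
]
console.log(fleissKappa(sampleRatings))
```

Runs with low per-run agreement are natural candidates for a disagreement triage list, since they are where the raters' judgments diverge most.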


How it compares

LangSmith Braintrust Phoenix agent-eval
Closed-loop self-improvement ✱ human-in-loop ✱ experiment-driven ✓ autonomous + gated
Statistical lift CI (paired bootstrap) partial
Judge calibration + bias detection
Inter-rater agreement + disagreement triage
Contamination / canary check
AI-driven failure clustering partial partial
Cost-quality Pareto
Multi-language clients (TS + Python) TS only TS only TS + Py ✓ TS + Py
Self-hostable / no-SaaS option OSS ✓ MIT, OSS
Substrate vs SaaS shape SaaS SaaS OSS server library
Hosted tier (optional) required required optional optional

Position: agent-eval is the substrate (one library, decision-grade output); the others are SaaS built around the substrate. If you want a closed loop that ships your prompt under statistical confidence, you call agent-eval. If you want a dashboard rendered from your data, you pipe agent-eval into the hosted tier or your own renderer.


Customer journeys

Three runnable examples — each is self-contained, each shows the actual output.

Journey | Example | Who it's for
Closed loop: improve a prompt under statistical confidence | examples/selfimprove-quickstart/ | Teams with scenarios + judges + agent in hand
Multi-rater feedback corpus: turn Obsidian/Sheets/CSV ratings into actionable insights | examples/customer-feedback-loop/ | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop
Production OTel traces: analyze logs you already have, no closed loop required | examples/customer-otel-traces/ | Teams running agents in prod with observability, no eval discipline yet

Each example consists of a README.md and a single index.ts, runnable via pnpm tsx, that prints the resulting InsightReport to stdout.


Subpath entry points

Subpath | What it gives you
@tangle-network/agent-eval/contract | The headline surface. selfImprove, analyzeRuns, runImprovementLoop, runCampaign, runEval, diffRuns, intake adapters (fromFeedbackTable, fromOtelSpans), drivers (gepaDriver, evolutionaryDriver), gates (defaultProductionGate, heldOutGate, composeGate), storage. New code starts here.
@tangle-network/agent-eval/hosted | Hosted-tier wire-format types + createHostedClient to ship eval-run events + trace spans to any orchestrator speaking the spec
@tangle-network/agent-eval/adapters/otel | createOtelBridge: forwards OpenTelemetry-shape spans into the hosted-tier ingest
@tangle-network/agent-eval/adapters/langchain | LangChain runnable → Dispatch adapter
@tangle-network/agent-eval/adapters/http | httpDispatch + runDispatchServer for distributed campaigns across machines
@tangle-network/agent-eval/campaign | Lower-level campaign primitives (storage, drivers, types)
@tangle-network/agent-eval/multishot | N-shot persona × shot matrix runner
@tangle-network/agent-eval/control | Agent control loop primitives (runAgentControlLoop, action policy, propose/review)
@tangle-network/agent-eval/traces | Trace stores, emitters, OTLP-JSONL replay
@tangle-network/agent-eval/reporting | Release confidence, paired stats, sequential e-values, launch reports
@tangle-network/agent-eval/rl | RL bridge: verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, auto-research
@tangle-network/agent-eval/matrix | N-axis cartesian over substrate types
@tangle-network/agent-eval/wire | HTTP/RPC server + Zod schemas (same protocol the Python client speaks)
@tangle-network/agent-eval/benchmarks | Benchmark adapter contracts and reference wrappers

The root export remains available for backward compatibility; new code should prefer focused subpaths. Anything under /rl, /pipelines, /meta-eval, /prm, or /builder-eval is only reachable via its subpath.


Concepts + design

The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.


Hosted tier

Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:

await selfImprove({
  scenarios, dispatch, judges, baselineSurface,
  hostedTenant: {
    endpoint: 'https://intelligence.tangle.tools',
    apiKey: process.env.TANGLE_API_KEY!,
    tenantId: 'your-tenant',
  },
})

The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at docs/hosted-ingest-spec.md; reference receiver at examples/hosted-ingest-server/.


Install + run

pnpm add @tangle-network/agent-eval
# or, from Python:
pip install agent-eval-rpc

Run an example:

pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.ts

Run the test suite:

pnpm install
pnpm build
pnpm test

Stability + versioning

Public exports carry JSDoc stability markers visible in IDE hover + .d.ts:

Tag | Meaning
@stable | API frozen at this major. Breaking changes require a major bump.
@experimental | Interface may evolve before becoming @stable. Pin the patch version if you depend on it.
@internal | Not part of the public contract. Use the documented subpath instead.

CHANGELOG.md tracks every release with what's new / additive / breaking.


License

MIT. See LICENSE.

About

Domain-agnostic evaluation framework for Tangle agent apps
