docs: finding 0001 — OCH pack cuts agent token usage 2–4× (live Move 2 data) #273
Merged
…2 data) First live measurement from the Move 2 variance probe on Bedrock. Two tasks × N=5 × Claude Sonnet 4.5 against an isolated @opencodehub/policy snapshot: the pack cut total token usage 2.18×–4.08× and cost 1.9×–3.3×, driven almost entirely by cache tokens (exploration the agent skipped because the pack handed it the structure). Output tokens barely moved, so the saving is exploration-avoided, not shorter answers. Reframes the headline from variance to token efficiency: output_hash dispersion came back null because it compares answer TEXT and a frontier model rephrases prose every run (saturated) — the variance question waits on the judge oracle. Token efficiency is a directly-measured resource number with no saturation, and it replicated across both task regimes. Scoped honestly as preliminary (N=5, 1 repo, 1 agent, 2 tasks — a signal, not a benchmark) with a reproduce recipe and next steps.
theagenticguy pushed a commit that referenced this pull request on Jun 30, 2026
🤖 Automated release via release-please

---

<details><summary>root: 0.10.6</summary>

## [0.10.6](root-v0.10.5...root-v0.10.6) (2026-06-30)

### Bug Fixes

* **eval:** count cache tokens + per-harness model in variance probe ([#271](#271)) ([df12cf9](df12cf9))

### Documentation

* finding 0001 — OCH pack cuts agent token usage 2–4× (live Move 2 data) ([#273](#273)) ([c712ebc](c712ebc))

</details>

<details><summary>cli: 0.10.6</summary>

## [0.10.6](cli-v0.10.5...cli-v0.10.6) (2026-06-30)

### Bug Fixes

* **eval:** count cache tokens + per-harness model in variance probe ([#271](#271)) ([df12cf9](df12cf9))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
What
The first live measurement from the Move 2 variance probe, written up as a findings doc. On real tasks, an OCH pack cut a coding agent's total token usage 2.18×–4.08× and cost 1.9×–3.3× — by replacing the agent's repo exploration with the pack's prebuilt structure map.
The saving is almost entirely cache tokens — the file-reads and tool-runs the agent does to reconstruct structure. Output tokens barely moved, so it's exploration-avoided, not shorter answers. This is the opposite of the arXiv "+10% tokens" framing: OCH's pack replaces exploration rather than adding to it.
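The reduction ratios and the "almost entirely cache tokens" attribution come down to simple arithmetic over per-category token counts. A minimal sketch, assuming usage is recorded per arm as input, output, and cache tokens; the `TokenUsage` type, the function names, and every number below are hypothetical placeholders chosen for round arithmetic, not the measured Move 2 data or the probe's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Mean token counts for one arm (hypothetical shape, not the probe's schema)."""
    input_tokens: int
    output_tokens: int
    cache_tokens: int  # assumption: cache reads and writes lumped together

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens + self.cache_tokens


def reduction_ratio(baseline: float, with_pack: float) -> float:
    """How many times smaller the with-pack figure is than the baseline."""
    if with_pack <= 0:
        raise ValueError("with_pack must be positive")
    return baseline / with_pack


def saving_share(baseline: TokenUsage, with_pack: TokenUsage) -> dict[str, float]:
    """Fraction of the total token saving that each category accounts for."""
    total_saved = baseline.total - with_pack.total
    if total_saved == 0:
        raise ValueError("no saving to attribute")
    return {
        "input": (baseline.input_tokens - with_pack.input_tokens) / total_saved,
        "output": (baseline.output_tokens - with_pack.output_tokens) / total_saved,
        "cache": (baseline.cache_tokens - with_pack.cache_tokens) / total_saved,
    }


# Placeholder numbers only: NOT measurements from the probe.
baseline = TokenUsage(input_tokens=1000, output_tokens=500, cache_tokens=8500)
with_pack = TokenUsage(input_tokens=800, output_tokens=500, cache_tokens=2700)

ratio = reduction_ratio(baseline.total, with_pack.total)
shares = saving_share(baseline, with_pack)
print(f"total reduction: {ratio:.2f}x")
print({name: round(share, 3) for name, share in shares.items()})
```

In this made-up case the output share is zero and the cache share dominates, which is the pattern the finding reports: unchanged output tokens mean the answers did not get shorter, so the saving has to come from exploration the agent skipped.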
Why this reframes the headline
Move 2 was specced to measure answer variance. On these tasks, `output_hash` dispersion came back null because it compares answer text, and a frontier model rephrases prose every run (saturated). Measuring decision convergence on prose needs the `judge` oracle, which the CLI doesn't yet wire. Token efficiency, by contrast, is a directly-measured resource number with no saturation, and it replicated across both task regimes. It's the more defensible claim.
Honest scope
Marked Preliminary: N=5, one small repo, one agent (Claude Sonnet 4.5), two tasks. A signal, not a benchmark. The doc says so plainly, with a reproduce recipe and next steps (wire the judge oracle; scale to more tasks/repos/Codex/larger N).
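The `output_hash` saturation behind the null variance result can be illustrated at the same N=5. A minimal sketch, assuming dispersion is the fraction of distinct hashes over raw answer text; the function names and sample answers are hypothetical, and the probe's real definition of `output_hash` dispersion may differ:

```python
import hashlib


def output_hash(answer: str) -> str:
    """Hash of the raw answer text (assumed definition, for illustration only)."""
    return hashlib.sha256(answer.encode("utf-8")).hexdigest()


def hash_dispersion(answers: list[str]) -> float:
    """Distinct hashes divided by number of runs.

    1/N means every run produced identical text; 1.0 means every run differed.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return len({output_hash(a) for a in answers}) / len(answers)


# Five invented answers that reach the same decision in different words.
paraphrased = [
    "Move the check into the loader.",
    "The check should move into the loader.",
    "I'd move the check into the loader.",
    "Put the check in the loader.",
    "Relocate the check to the loader.",
]

identical = ["Move the check into the loader."] * 5

print(hash_dispersion(paraphrased))
print(hash_dispersion(identical))
```

Under this assumed definition, any arm whose answers are rephrased every run sits at the maximum, so the metric cannot separate the pack arm from the baseline arm even when both converge on the same decision. That is why the decision-level comparison waits on the `judge` oracle.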
Docs-only — no code change.
🤖 Generated with Claude Code