fix(eval): count cache tokens + per-harness model in variance probe#271
Merged
Conversation
Two bugs surfaced by the first live Bedrock run of the Move 2 probe. Bug 1 — token overhead undercounted. parseClaudeOutput read only usage.input_tokens, ignoring the ~27K cached system tokens Claude Code injects per call (cache_creation_input_tokens + cache_read_input_tokens); Codex's cached_input_tokens was likewise dropped. The token-overhead headline — the exact "~10% more tokens" cost the variance claim trades against — was therefore wrong. RunTokens/ArmTokens gain a `cacheTokens` field, both parsers capture it, and totalTokens() counts it toward the overhead ratio. formatReport now prints the in/out/cache split per arm so the overhead is auditable. Bug 2 — global --model can't serve two agents. Claude and Codex take different Bedrock ids (a us.-prefixed Anthropic profile vs an openai.* model), so one --model handed Claude's id to Codex. Replaced with --model-claude / --model-codex, threaded as a per-harness models: Partial<Record<Harness,string>> the runner factory reads by harness. +4 tests (cache in parse + overhead, per-harness model in argv + routing); all 18 packages green. No behavior change to the decision-equivalence or pack paths.
Merged
theagenticguy
pushed a commit
that referenced
this pull request
Jun 30, 2026
🤖 Automated release via release-please --- <details><summary>root: 0.10.6</summary> ## [0.10.6](root-v0.10.5...root-v0.10.6) (2026-06-30) ### Bug Fixes * **eval:** count cache tokens + per-harness model in variance probe ([#271](#271)) ([df12cf9](df12cf9)) ### Documentation * finding 0001 — OCH pack cuts agent token usage 2–4× (live Move 2 data) ([#273](#273)) ([c712ebc](c712ebc)) </details> <details><summary>cli: 0.10.6</summary> ## [0.10.6](cli-v0.10.5...cli-v0.10.6) (2026-06-30) ### Bug Fixes * **eval:** count cache tokens + per-harness model in variance probe ([#271](#271)) ([df12cf9](df12cf9)) </details> --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Two bugs the first live Bedrock run of the Move 2 probe surfaced (unit tests missed them because the stubbed spawn never produced real Claude usage blocks).
Bug 1 — token overhead undercounted
parseClaudeOutputread onlyusage.input_tokensand ignored the ~27K cached system tokens Claude Code injects per call (cache_creation_input_tokens+cache_read_input_tokens); Codex'scached_input_tokenswas dropped too. So the token-overhead headline — the exact "~10% more tokens" cost the variance claim trades against — was wrong.RunTokens/ArmTokensgain acacheTokensfield.totalTokens()counts it toward the overhead ratio.formatReportnow prints thein + out + cachesplit per arm so the overhead is auditable.Bug 2 — global
--modelcan't serve two agentsClaude and Codex take different Bedrock ids (a
us.-prefixed Anthropic profile vs anopenai.*model), so one--modelhanded Claude's id to Codex (forcing per-harness runs in the live test).--modelwith--model-claude/--model-codex.models: Partial<Record<Harness, string>>the runner factory reads by harness.Why it matters
These are the two findings from the live dogfood run. The probe machinery worked end-to-end on Bedrock; these fixes make its cost number correct and let a single invocation drive both agents with their right models — prerequisites for a trustworthy with/without delta on the re-run.
Validation
biome ci .✓ (713 files) ·tsc -b✓ · full build ✓ (eval inlined into CLI) · all 18 packagesfail 0parseClaude/parseCodex+ the overhead total; per-harness model inbuildArgvargv + the command's per-harness routing🤖 Generated with Claude Code